## Introduction to Natural Language Processing

In this workbook, we take a high-level look at text tokenization; text normalization such as lowercasing and stemming; part-of-speech tagging; named-entity recognition; sentiment analysis; topic modeling; and word embeddings.





# Text-PreProcessing
The Basics of NLP for Text
In this article, we’ll cover the following text-preprocessing topics:

1. Sentence Tokenization
2. Word Tokenization
3. Regular Expressions
4. Text Lemmatization and Stemming
5. N-grams
6. Stop Words


-----


**punkt** This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used; the `punkt` package downloaded below provides pre-trained models.

**averaged_perceptron_tagger** contains the pre-trained English Part-of-Speech (POS) tagger.

**The nltk (Natural Language Toolkit) module** is a Python module for natural language processing that is popular among Python developers. It is released under the Apache License, Version 2.0. Older releases supported both Python 2 and Python 3; recent releases require Python 3.

In [1]:
import nltk
nltk.download('punkt') 
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
#Tokenization -- Paragraphs into sentences;
from nltk.tokenize import sent_tokenize 
  
text = "Hello All. Welcome to medium. This article is about NLP using NLTK."
print("SENTENCE AS TOKENS:")
print(sent_tokenize(text))
print("No of Sentence Tokens:",len(sent_tokenize(text)))

SENTENCE AS TOKENS:
['Hello All.', 'Welcome to medium.', 'This article is about NLP using NLTK.']
No of Sentence Tokens: 3


In [4]:
import nltk.data 
etext = 'Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.  And sometimes sentences can start with non-capitalized words.  i is a good variable name.'
english_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
english_tokenizer.tokenize(etext)



['Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.',
 'And sometimes sentences can start with non-capitalized words.',
 'i is a good variable name.']
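
For comparison with the Punkt output above, a naive splitter that breaks after every period, question mark, or exclamation mark mishandles abbreviations. The sketch below uses only the standard `re` module; the pattern and the `naive_sent_split` name are illustrative assumptions, not part of NLTK.

```python
import re

def naive_sent_split(text):
    # Split wherever whitespace follows '.', '!' or '?' (no abbreviation handling)
    return re.split(r'(?<=[.!?])\s+', text.strip())

etext = ('Punkt knows that the periods in Mr. Smith and Johann S. Bach '
         'do not mark sentence boundaries.  And sometimes sentences can '
         'start with non-capitalized words.  i is a good variable name.')

pieces = naive_sent_split(etext)
for piece in pieces:
    print(repr(piece))
print("Pieces:", len(pieces))
```

The naive version yields five pieces instead of three because it also splits after `Mr.` and `S.`.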

In [5]:
#German
german_tokenizer = nltk.data.load('tokenizers/punkt/PY3/german.pickle')  
gtext = 'Wie geht es Ihnen? Mir geht es gut.'
german_tokenizer.tokenize(gtext) 

['Wie geht es Ihnen?', 'Mir geht es gut.']

# Languages available in punkt

czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle                turkish.pickle
estonian.pickle  italian.pickle

# Word tokenization (word_tokenize)

In [7]:
#Tokenization -- Text into word tokens;
from nltk.tokenize import word_tokenize 
  
text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00. "
print("WORD TOKENS:")
print(word_tokenize(text))
print("No of Word Tokens:",len(word_tokenize(text)))


WORD TOKENS:
['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This', 'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.', 'Subscribe', 'with', '$', '4.00', '.']
No of Word Tokens: 20


In [9]:
#A treebank is a text corpus in which each sentence is annotated with its syntactic structure. Compare the output of word_tokenize with that of the Treebank tokenizer.
from nltk.tokenize import TreebankWordTokenizer 
text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00. "
tokenizer = TreebankWordTokenizer() 
tokenizer.tokenize(text) 

['Hello',
 'All.',
 'Welcome',
 'to',
 'medium.',
 'This',
 'article',
 'is',
 'about',
 'NLP',
 'using',
 'NLTK.',
 'Subscribe',
 'with',
 '$',
 '4.00',
 '.']
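
The Treebank output above keeps `'All.'` and `'medium.'` fused with their periods, because `TreebankWordTokenizer` assumes its input has already been split into sentences. A rough regex tokenizer that separates punctuation everywhere can be sketched with `re.findall`. The pattern and the `simple_word_tokenize` name are simplifying assumptions and are far less thorough than NLTK's tokenizers.

```python
import re

# Alternatives are tried left to right: currency sign, number (optionally
# with decimals), run of word characters, any other single non-space symbol
TOKEN_PATTERN = re.compile(r"\$|\d+(?:\.\d+)?|\w+|[^\w\s]")

def simple_word_tokenize(text):
    return TOKEN_PATTERN.findall(text)

text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00. "
tokens = simple_word_tokenize(text)
print(tokens)
print("No of tokens:", len(tokens))
```

On this sentence the sketch produces 20 tokens, with `$`, `4.00`, and the final period kept apart.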

### n-grams vs tokens

##### n-grams are contiguous sequences of n items in a sentence. n can be 1, 2, or any other positive integer, although we usually do not consider very large n because such n-grams rarely appear in many different places.

##### Tokens do not have any conditions on contiguity.
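
To make the point about large n concrete, the sketch below counts how often each n-gram repeats in a short made-up sentence. The sentence and the `ngram_counts` helper are illustrative assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Slide a window of width n over the tokens and count each n-gram
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(windows)

tokens = "the cat sat on the mat and the cat slept".split()

for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    repeated = {gram: c for gram, c in counts.items() if c > 1}
    print(n, "total:", sum(counts.values()), "repeated:", repeated)
```

In this sentence two unigrams and one bigram repeat, while every trigram occurs only once.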

# Regular expression (re) 
In Python, a regex (regular expression) is a widely used notation for describing text patterns. With a regex, you can search a string or a body of text for groups of characters that match the pattern you want.

See also https://www.bualabs.com/archives/3070/what-is-regular-expression-regex-regexp-teach-how-to-regex-python-nlp-ep-7/
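
As a small illustration of pattern search and replacement with the standard `re` module, the sketch below finds dollar amounts in a sentence and then masks them. The sample sentence and the `<PRICE>` placeholder are made up for this example.

```python
import re

text = "Subscribe with $4.00 today, or pay $12 for three months."

# \$ matches a literal dollar sign; \d+ one or more digits;
# (?:\.\d{2})? an optional decimal part with exactly two digits
prices = re.findall(r"\$\d+(?:\.\d{2})?", text)
print(prices)

# re.sub replaces every match; here each price is masked
masked = re.sub(r"\$\d+(?:\.\d{2})?", "<PRICE>", text)
print(masked)
```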

In [10]:
#Using pure Python
import re 
def generate_ngrams(text, n):
    # Convert to lowercase
    text = text.lower()
    
    # Replace all non-alphanumeric characters with spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    
    # Break the sentence into tokens and remove empty tokens
    tokens = [token for token in text.split(" ") if token != ""]
    
    # Use zip over n shifted copies of the token list to generate n-grams,
    # then join each n-gram's tokens into a single string and return the list
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)
generate_ngrams(text, n=3)

Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP.


['hello everyone welcome',
 'everyone welcome to',
 'welcome to intro',
 'to intro to',
 'intro to machine',
 'to machine learning',
 'machine learning applications',
 'learning applications we',
 'applications we are',
 'we are now',
 'are now learning',
 'now learning important',
 'learning important basics',
 'important basics of',
 'basics of nlp']
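
The `zip(*[tokens[i:] for i in range(n)])` idiom used in `generate_ngrams` can be unpacked step by step. The short token list below is an assumption chosen to keep the output readable.

```python
tokens = ["we", "are", "now", "learning"]
n = 2

# n shifted copies of the token list: the i-th copy starts at position i
shifted = [tokens[i:] for i in range(n)]
print(shifted)

# zip(*shifted) pairs up items position by position and stops at the
# shortest copy, so every tuple is a window of n consecutive tokens
bigrams = list(zip(*shifted))
print(bigrams)
```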

In [11]:
#Using NLTK import ngrams

import re
from nltk.util import ngrams

text = text.lower()
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [token for token in text.split(" ") if token != ""]
output = list(ngrams(tokens, 3))
print(output)

[('hello', 'everyone', 'welcome'), ('everyone', 'welcome', 'to'), ('welcome', 'to', 'intro'), ('to', 'intro', 'to'), ('intro', 'to', 'machine'), ('to', 'machine', 'learning'), ('machine', 'learning', 'applications'), ('learning', 'applications', 'we'), ('applications', 'we', 'are'), ('we', 'are', 'now'), ('are', 'now', 'learning'), ('now', 'learning', 'important'), ('learning', 'important', 'basics'), ('important', 'basics', 'of'), ('basics', 'of', 'nlp')]


In [12]:
#Text Normalization

#Case Conversion
text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00."
lowert = text.lower()
uppert = text.upper()

print("To Lower Case:",lowert)
print("To Upper Case:",uppert)


To Lower Case: hello all. welcome to medium. this article is about nlp using nltk. subscribe with $4.00.
To Upper Case: HELLO ALL. WELCOME TO MEDIUM. THIS ARTICLE IS ABOUT NLP USING NLTK. SUBSCRIBE WITH $4.00.


# Stemming
For details, see https://www.bualabs.com/archives/2952/what-is-stemming-what-is-lemmatization-different-stemming-lemmatization-nlp-ep-3/


In [13]:

#Porter stemmer is a famous stemming approach
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer()
sentence = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

sent = word_tokenize(sentence)
print("After Word Tokenization:\n",sent)
print("Total No of Word Tokens: ",len(sent))

ps_sent = [ps.stem(words_sent) for words_sent in sent]
print(ps_sent)
print(len(ps_sent))

After Word Tokenization:
 ['It', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files', 'when', 'those', 'same', 'people', 'are', "n't", 'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'schemes', '.', 'Ordinary', 'people', 'are', 'relentlessly', 'spied', 'on', ',', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', '.', 'While', 'I', "'d", 'like', 'to', 'see', 'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', ',', 'I', "'d", 'not', 'ask', 'for', 'it', 'until', 'there', "'s", 'reciprocity', '.']
Total No of Word Tokens:  69
['It', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'peopl', 'ceas', 'pirat', 'file', 'when', 'those', 'same', 'peopl', 'are', "n't", 'paid', 'for', 'their', 'particip', 'in', 'veri', 'lucr', 'network', 'scheme', '.', 'ordinari', 'peopl', 'are', 'relentlessli', 'spi', 'on', ',', 'and', 'not', 'compens', 'for', 'inform', 'taken', 'from', 'them', '.', 'whi

In [14]:
#Porter stemmer is a famous stemming approach

from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
ps = PorterStemmer() 
 
words = ["hike", "hikes", "hiked", "hiking", "hikers", "hiker", "universal", "universe", "university","alumnus", "alumni", "alumnae"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 

hike  :  hike
hikes  :  hike
hiked  :  hike
hiking  :  hike
hikers  :  hiker
hiker  :  hiker
universal  :  univers
universe  :  univers
university  :  univers
alumnus  :  alumnu
alumni  :  alumni
alumnae  :  alumna
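
To see why a stemmer needs more than plain suffix removal, the toy sketch below strips a few suffixes in a fixed order. The rule list and the `toy_stem` name are hypothetical and are not the Porter algorithm.

```python
# Hypothetical toy rules, tried in order; NOT the Porter algorithm
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word, min_stem_len=3):
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        # Strip the suffix only if a reasonably long stem remains
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

for w in ["hike", "hikes", "hiked", "hiking", "hikers", "hiker"]:
    print(w, " : ", toy_stem(w))
```

The toy rules give `hik` for `hiked` and `hiking` but `hike` for `hikes`, whereas the Porter output above maps all of them to `hike`. Porter's algorithm includes further rules that, for instance, restore a trailing `e` in such cases.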


In [15]:
#another stemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
   
sb = SnowballStemmer("english")
sentence = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

sent = word_tokenize(sentence)
print("After Word Tokenization:\n",sent)
print("Total No of Word Tokens: ",len(sent))

sb_sent = [sb.stem(words_sent) for words_sent in sent]
print(sb_sent)
print(len(sb_sent))

After Word Tokenization:
 ['It', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files', 'when', 'those', 'same', 'people', 'are', "n't", 'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'schemes', '.', 'Ordinary', 'people', 'are', 'relentlessly', 'spied', 'on', ',', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', '.', 'While', 'I', "'d", 'like', 'to', 'see', 'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', ',', 'I', "'d", 'not', 'ask', 'for', 'it', 'until', 'there', "'s", 'reciprocity', '.']
Total No of Word Tokens:  69
['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'peopl', 'ceas', 'pirat', 'file', 'when', 'those', 'same', 'peopl', 'are', "n't", 'paid', 'for', 'their', 'particip', 'in', 'veri', 'lucrat', 'network', 'scheme', '.', 'ordinari', 'peopl', 'are', 'relentless', 'spi', 'on', ',', 'and', 'not', 'compens', 'for', 'inform', 'taken', 'from', 'them', '.', 'whi

# WordNet Lemmatization

In [16]:
#WordNet Lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#without POS tagging
text = "She jumped into the river and breathed heavily"
wordnet = WordNetLemmatizer()
tokenizer = word_tokenize(text)

for token in tokenizer:
    print(token,"--->",wordnet.lemmatize(token))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
She ---> She
jumped ---> jumped
into ---> into
the ---> the
river ---> river
and ---> and
breathed ---> breathed
heavily ---> heavily


In [None]:
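#Lemmatizer with POS tag
from nltk import word_tokenize,pos_tag

for token,tag in pos_tag(word_tokenize(text)):
    # Map the first letter of the Penn Treebank tag to a WordNet POS code
    pos=tag[0].lower()
    
    # Penn Treebank adjective tags start with 'J'; WordNet uses 'a' for adjectives
    if pos == 'j':
        pos='a'
    # Fall back to noun for any tag WordNet does not cover
    elif pos not in ['r', 'n', 'v']:
        pos='n'
    
    print(token,"--->",wordnet.lemmatize(token,pos))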

She ---> She
jumped ---> jump
into ---> into
the ---> the
river ---> river
and ---> and
breathed ---> breathe
heavily ---> heavily
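
The lemmatizer takes WordNet POS codes (`n`, `v`, `a`, `r`), while `pos_tag` returns Penn Treebank tags such as `VBD` or `JJ`. The mapping between the two can be isolated in a small helper. The `penn_to_wordnet` name is hypothetical, and falling back to noun is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS code by its first letter
    mapping = {"J": "a", "V": "v", "N": "n", "R": "r"}
    # Default to noun, which is also WordNetLemmatizer's default POS
    return mapping.get(tag[:1], "n")

for tag in ["JJ", "VBD", "NN", "RB", "PRP", "DT"]:
    print(tag, "--->", penn_to_wordnet(tag))
```

Note that Penn Treebank adjective tags start with `J`, so a plain lowercase of the first letter would not produce WordNet's `a`.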


# Merge all the tokens to form a long text sequence

In [None]:
from nltk.stem import PorterStemmer 
import re
   
ps = PorterStemmer() 
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)


#Tokenize and stem the words
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [ps.stem(token) for token in text.split(" ") if token != ""]

#Merge all the tokens to form a long text sequence 
text2 = ' '.join(tokens) 

print(text2)

Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP.
hello everyon welcom to intro to machin learn applic We are now learn import basic of nlp


# Stop words

Stop words are common words that we encounter frequently in sentences or documents but that contribute little to the meaning, so we can remove them from the vocabulary and filter them out of the document. Examples include a, an, the, also, just, quite, and unless.  
See also https://colab.research.google.com/github/gnoparus/bualabs/blob/master/nbs/26a_stop_words.ipynb

In [None]:
#Stopwords removal 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

text = "Hello All. Welcome to medium. This article is about NLP using NLTK."

stop_words = set(stopwords.words('english')) 
word_tokens = word_tokenize(text) 
  
filtered_sentence = [] 
  
for w in word_tokens: 
    if w not in stop_words: 
        filtered_sentence.append(w) 
  
print(word_tokens) 
print(filtered_sentence) 

['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This', 'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.']
['Hello', 'All', '.', 'Welcome', 'medium', '.', 'This', 'article', 'NLP', 'using', 'NLTK', '.']
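
In the filtered output above, `'This'` and `'All'` survive because the membership test is case-sensitive and NLTK's English stop-word list is lowercase. A case-insensitive variant is sketched below. The tiny `stop_words` set is a hypothetical stand-in for `stopwords.words('english')`, so the block runs without any downloads.

```python
# Hypothetical miniature stop-word set, standing in for stopwords.words('english')
stop_words = {"a", "an", "the", "to", "is", "about", "this", "all"}

word_tokens = ['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This',
               'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.']

# Lowercase each token only for the membership test, keeping original casing
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_sentence)
```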


In [None]:
#Part-of-Speech tagging

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = "Medium welcomes you and this article is about NLP using NLTK."

sent = nltk.word_tokenize(text)
print(sent)
postag = nltk.pos_tag(sent)
print(postag)

['Medium', 'welcomes', 'you', 'and', 'this', 'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.']
[('Medium', 'NNP'), ('welcomes', 'VBZ'), ('you', 'PRP'), ('and', 'CC'), ('this', 'DT'), ('article', 'NN'), ('is', 'VBZ'), ('about', 'IN'), ('NLP', 'NNP'), ('using', 'VBG'), ('NLTK', 'NNP'), ('.', '.')]


In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

example = "Hello India. Narendra Modi visited China. Daniel Owns a Ford car."

token = word_tokenize(example)
postag = nltk.pos_tag(token)
ner = nltk.ne_chunk(postag, binary= False)
print(ner)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
(S
  (PERSON Hello/NNP)
  (GPE India/NNP)
  ./.
  (PERSON Narendra/NNP Modi/NNP)
  visited/VBD
  (GPE China/NNP)
  ./.
  (PERSON Daniel/NNP Owns/NNP)
  a/DT
  Ford/NNP
  car/NN
  ./.)


# Named-Entity Recognition
Once POS tagging has given us the nouns, we move on to Named-Entity Recognition (Named-Entity Tagging): deciding whether a word names a real-world entity and, if so, of what type, such as a person, place, or organization.

See more examples at https://www.bualabs.com/archives/4112/what-is-part-of-speech-tagging-what-is-named-entity-recognition-tagging-tutorial-pos-tagging-ner-thai-language-pythainlp-ep-4/

In [None]:
#Named entity recognition

#spaCy is an NLP framework that is easy to use and supports neural-network models

import en_core_web_sm
nlp = en_core_web_sm.load()

text = 'GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside 40 million developers.'

doc = nlp(text)
print(doc.ents)
print([(X.text, X.label_) for X in doc.ents])

(GitHub, 40 million)
[('GitHub', 'ORG'), ('40 million', 'CARDINAL')]


# Sentiment analysis
Sentiment analysis examines the emotions and feelings expressed in text to indicate how people feel about something. Sentiment is usually classified as:

1. Positive = favorable
2. Negative = unfavorable
3. Neutral = neither

Example: get data from Twitter
https://pypi.org/project/twython/

In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

s2 = 'This was the best, most awesome movie EVER MADE!!!'
print("polarity score for s2:")
sia.polarity_scores(s2)

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
polarity score for s2:




{'compound': 0.8877, 'neg': 0.0, 'neu': 0.425, 'pos': 0.575}
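
To turn the score dictionary into one of the three classes, a common convention is to threshold the `compound` value. The sketch below uses ±0.05, a cutoff often used with VADER. Treat both the threshold and the `sentiment_label` name as assumptions to tune for your data.

```python
def sentiment_label(scores, threshold=0.05):
    # 'compound' is VADER's normalized overall score in [-1, 1]
    compound = scores["compound"]
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Scores copied from the output above
s2_scores = {'compound': 0.8877, 'neg': 0.0, 'neu': 0.425, 'pos': 0.575}
print(sentiment_label(s2_scores))
```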

In [17]:

#Porter stemmer is a famous stemming approach
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer()
sentence = "When you get married, you're already fond of what you've seen before, right? So, if you notice a new side and like it after that, it would be double."

sent = word_tokenize(sentence)
print("After Word Tokenization:\n",sent)
print("Total No of Word Tokens: ",len(sent))

ps_sent = [ps.stem(words_sent) for words_sent in sent]
print(ps_sent)
print(len(ps_sent))

After Word Tokenization:
 ['When', 'you', 'get', 'married', ',', 'you', "'re", 'already', 'fond', 'of', 'what', 'you', "'ve", 'seen', 'before', ',', 'right', '?', 'So', ',', 'if', 'you', 'notice', 'a', 'new', 'side', 'and', 'like', 'it', 'after', 'that', ',', 'it', 'would', 'be', 'double', '.']
Total No of Word Tokens:  37
['when', 'you', 'get', 'marri', ',', 'you', "'re", 'alreadi', 'fond', 'of', 'what', 'you', "'ve", 'seen', 'befor', ',', 'right', '?', 'So', ',', 'if', 'you', 'notic', 'a', 'new', 'side', 'and', 'like', 'it', 'after', 'that', ',', 'it', 'would', 'be', 'doubl', '.']
37


In [18]:
#another stemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
   
sb = SnowballStemmer("english")
sentence = "When you get married, you're already fond of what you've seen before, right? So, if you notice a new side and like it after that, it would be double."

sent = word_tokenize(sentence)
print("After Word Tokenization:\n",sent)
print("Total No of Word Tokens: ",len(sent))

sb_sent = [sb.stem(words_sent) for words_sent in sent]
print(sb_sent)
print(len(sb_sent))

After Word Tokenization:
 ['When', 'you', 'get', 'married', ',', 'you', "'re", 'already', 'fond', 'of', 'what', 'you', "'ve", 'seen', 'before', ',', 'right', '?', 'So', ',', 'if', 'you', 'notice', 'a', 'new', 'side', 'and', 'like', 'it', 'after', 'that', ',', 'it', 'would', 'be', 'double', '.']
Total No of Word Tokens:  37
['when', 'you', 'get', 'marri', ',', 'you', 're', 'alreadi', 'fond', 'of', 'what', 'you', 've', 'seen', 'befor', ',', 'right', '?', 'so', ',', 'if', 'you', 'notic', 'a', 'new', 'side', 'and', 'like', 'it', 'after', 'that', ',', 'it', 'would', 'be', 'doubl', '.']
37
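
The two stemmers agree on most tokens in this sentence. The sketch below compares the two recorded outputs position by position to show where they differ. The lists are copied from the cell outputs above, so no NLTK call is needed.

```python
# Outputs recorded above for the Porter and Snowball stemmers
porter = ['when', 'you', 'get', 'marri', ',', 'you', "'re", 'alreadi', 'fond',
          'of', 'what', 'you', "'ve", 'seen', 'befor', ',', 'right', '?', 'So',
          ',', 'if', 'you', 'notic', 'a', 'new', 'side', 'and', 'like', 'it',
          'after', 'that', ',', 'it', 'would', 'be', 'doubl', '.']
snowball = ['when', 'you', 'get', 'marri', ',', 'you', 're', 'alreadi', 'fond',
            'of', 'what', 'you', 've', 'seen', 'befor', ',', 'right', '?', 'so',
            ',', 'if', 'you', 'notic', 'a', 'new', 'side', 'and', 'like', 'it',
            'after', 'that', ',', 'it', 'would', 'be', 'doubl', '.']

# Keep only the positions where the two stems disagree
differences = [(i, p, s) for i, (p, s) in enumerate(zip(porter, snowball)) if p != s]
for index, p, s in differences:
    print(index, repr(p), "vs", repr(s))
```

In these recorded outputs, the stemmers differ only in how they treat the clitics `'re` and `'ve` and the capitalized `So`.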
