# Natural Language Toolkit (NLTK)

In [1]:
#!pip install nltk

### Tokenization
Tokenization is the process of breaking a piece of text into a list of sentences or words. Splitting a paragraph into a list of sentences is called sentence tokenization.
Similarly, splitting sentences further into a list of words is called word tokenization.

Let's understand this with an example. Consider the paragraph below and how tokenization works on it:

"India (Hindi: Bhārat), officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia."

* Sentence tokenize:

    ['India (Hindi: Bhārat), officially the Republic of India, is a country in South Asia.',
     'It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world.',
     'Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.',
     'In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia.']


* Word tokenize:

['India', '(', 'Hindi', ':', 'Bhārat', ')', ',', 'officially', 'the', 'Republic', 'of', 'India', ',', 'is', 'a', 'country', 'in', 'South',
 'Asia', '.', 'It', 'is', 'the', 'seventh-largest', 'country', 'by', 'area', ',', 'the', 'second-most', 'populous', 'country', ',', 'and',
 'the', 'most', 'populous', 'democracy', 'in', 'the', 'world', '.', 'Bounded', 'by', 'the', 'Indian', 'Ocean', 'on', 'the', 'south', ',',
 'the', 'Arabian', 'Sea', 'on', 'the', 'southwest', ',', 'and', 'the', 'Bay', 'of', 'Bengal', 'on', 'the', 'southeast', ',', 'it', 'shares',
 'land', 'borders', 'with', 'Pakistan', 'to', 'the', 'west', ';', 'China', ',', 'Nepal', ',', 'and', 'Bhutan', 'to', 'the', 'north', ';',
 'and', 'Bangladesh', 'and', 'Myanmar', 'to', 'the', 'east', '.', 'In', 'the', 'Indian', 'Ocean', ',', 'India', 'is', 'in', 'the',
 'vicinity', 'of', 'Sri', 'Lanka', 'and', 'the', 'Maldives', ';', 'its', 'Andaman', 'and', 'Nicobar', 'Islands', 'share', 'a', 'maritime',
 'border', 'with', 'Thailand', 'and', 'Indonesia', '.']


This example should make the concept of tokenization clear. Why it is done will become apparent when we dive into text analysis.




#### Word Tokenization

- Example
- 'I am learning Natural Language processing' is converted into
['I', 'am', 'learning', 'Natural', 'Language', 'processing']

#### Sentence Tokenization

- Example
- "God is Great! I won a lottery." is converted into ["God is Great!", "I won a lottery."]
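As a rough illustration of the idea, both kinds of tokenization can be approximated with regular expressions from the standard library. This is only a sketch: the helper names `naive_sent_tokenize` and `naive_word_tokenize` are made up for this example, and NLTK's tokenizers handle many cases (abbreviations, decimals, contractions) that these patterns do not.

```python
import re

def naive_sent_tokenize(text):
    # Split wherever ., ! or ? is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(sentence):
    # A token is a run of word characters (optionally joined by ' or -),
    # or a single punctuation mark.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

sentences = naive_sent_tokenize("God is Great! I won a lottery.")
words = naive_word_tokenize("I am learning Natural Language processing")

print(sentences)  # ['God is Great!', 'I won a lottery.']
print(words)      # ['I', 'am', 'learning', 'Natural', 'Language', 'processing']
```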

##### Word Tokenization

In [2]:
from nltk.tokenize import word_tokenize

In [3]:
# Define your text or import from other source
text = 'I am learning Natural Language processing'

In [4]:
# Tokenize the text into words
print(word_tokenize(text))

['I', 'am', 'learning', 'Natural', 'Language', 'processing']


##### Sentence Tokenization

In [5]:
from nltk.tokenize import sent_tokenize
#text = "Good. Morning! How are you?."
#text = "Good Morning! How are you"
text = " Our Company annual growth rate is 25.50%. Good job Mr.Bajaj"

In [6]:
print(sent_tokenize(text))

[' Our Company annual growth rate is 25.50%.', 'Good job Mr.Bajaj']
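The sample text is a useful stress test because full stops also appear inside numbers (25.50%) and after titles (Mr.). The sketch below shows why a plain "end the sentence at every full stop" rule is not enough, and how even a small abbreviation list helps. The `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names for this illustration; `sent_tokenize` relies on a trained model rather than a hand-written list.

```python
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}  # hypothetical, deliberately tiny list

def split_sentences(text, abbreviations=frozenset()):
    """Close a sentence at a token ending in . ! or ? unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Our growth rate is 25.50%. Good job Mr. Bajaj. See you soon!"
naive = split_sentences(text)
aware = split_sentences(text, ABBREVIATIONS)

print(naive)  # 'Good job Mr.' and 'Bajaj.' are split apart
print(aware)  # 'Good job Mr. Bajaj.' stays together
```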


# Regular Expressions

In [7]:
from nltk.tokenize import regexp_tokenize

In [8]:
# Sample text
text = "NLP is fun and Can deal with texts and sounds, but can't deal with images. We have session at 11AM!.We can earn lot of $"

In [9]:
# Tokens made of consecutive lowercase letters a to z; capitals are skipped, so "Can" yields 'an' and "We" yields 'e'
regexp_tokenize(text,"[a-z]+")

['is',
 'fun',
 'and',
 'an',
 'deal',
 'with',
 'texts',
 'and',
 'sounds',
 'but',
 'can',
 't',
 'deal',
 'with',
 'images',
 'e',
 'have',
 'session',
 'at',
 'e',
 'can',
 'earn',
 'lot',
 'of']

In [10]:
# Tokens made of consecutive uppercase letters A to Z
regexp_tokenize(text,"[A-Z]+")

['NLP', 'C', 'W', 'AM', 'W']

In [11]:
# Adding an apostrophe to the class keeps contractions such as can't and don't in one piece
regexp_tokenize(text,"[a-z']+")

['is',
 'fun',
 'and',
 'an',
 'deal',
 'with',
 'texts',
 'and',
 'sounds',
 'but',
 "can't",
 'deal',
 'with',
 'images',
 'e',
 'have',
 'session',
 'at',
 'e',
 'can',
 'earn',
 'lot',
 'of']

In [12]:
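# Not a selective pattern: in a Python string "\a" is the bell character (code point 7),
# so [\a-z'] covers code points 7 to 122 and the whole text comes back as a single token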
regexp_tokenize(text,"[\a-z']+")

["NLP is fun and Can deal with texts and sounds, but can't deal with images. We have session at 11AM!.We can earn lot of $"]
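The result above is easier to understand once the escape is spelled out. The sketch below uses the standard `re` module as a stand-in (for patterns without groups, `regexp_tokenize` with its default settings should return the same matches as `re.findall`). It shows that `"\a"` is the bell character, so the class spans almost all of ASCII. The second pattern is one possible way to keep both upper- and lowercase words; it is an illustration, not part of the original notebook.

```python
import re

text = ("NLP is fun and Can deal with texts and sounds, but can't deal with images. "
        "We have session at 11AM!.We can earn lot of $")

# "\a" is the ASCII bell character, code point 7.
print(ord("\a"))  # 7

# The class therefore covers code points 7..122, which includes spaces,
# digits, punctuation and both letter cases.
whole = re.findall("[\a-z']+", text)
print(len(whole))  # 1

# Listing both letter ranges explicitly keeps mixed-case words intact.
words = re.findall(r"[A-Za-z']+", text)
print(words[:5])  # ['NLP', 'is', 'fun', 'and', 'Can']
```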

In [13]:
# A caret at the start of a character class negates it: match runs of anything except a-z and '
regexp_tokenize(text,"[^a-z']+")

['NLP ',
 ' ',
 ' ',
 ' C',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ', ',
 ' ',
 ' ',
 ' ',
 ' ',
 '. W',
 ' ',
 ' ',
 ' ',
 ' 11AM!.W',
 ' ',
 ' ',
 ' ',
 ' ',
 ' $']

In [14]:
# Only numbers
regexp_tokenize(text,"[0-9]+")

['11']

In [15]:
# Without numbers
regexp_tokenize(text,"[^0-9]+")

["NLP is fun and Can deal with texts and sounds, but can't deal with images. We have session at ",
 'AM!.We can earn lot of $']

In [16]:
# Only the dollar sign
regexp_tokenize(text,"[$]")

['$']
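Two further options are worth knowing. `regexp_tokenize` accepts a `gaps=True` argument, in which case the pattern describes the separators rather than the tokens, and alternation (`|`) lets one pattern cover words, numbers, and punctuation together. The second pattern below is one possible choice for illustration, not a recommended tokenizer, and the sample sentence is a shortened variant of the text above. Neither call needs any downloaded NLTK data.

```python
from nltk.tokenize import regexp_tokenize

text = "We have session at 11AM!.We can't earn lot of $"

# gaps=True: the pattern matches the separators, here runs of whitespace.
by_space = regexp_tokenize(text, r"\s+", gaps=True)
print(by_space)

# Alternation: a word with an optional contraction, or a number, or any
# single character that is neither a word character nor whitespace.
mixed = regexp_tokenize(text, r"[A-Za-z]+(?:'[a-z]+)?|\d+|[^\w\s]")
print(mixed)
```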

#### POS (Parts of Speech)
POS tagging is the process of labelling each word in a corpus with its part-of-speech tag, based on its definition and its context.
- `Example`
- Abbr => Meaning
- CC => coordinating conjunction
- CD => cardinal number
- DT => determiner
- EX => existential there
- FW => foreign word
- IN => preposition/subordinating conjunction
- JJ => adjective (large)
- JJR => adjective, comparative (larger)
- JJS => adjective, superlative (largest)
- LS => list marker
- MD => modal (could, will)
- NN => noun, singular (cat, tree)
- NNS => noun, plural (desks)
- NNP => proper noun, singular (Sarah)
- NNPS => proper noun, plural (Indians or Americans)
- PDT => predeterminer (all, both, half)
- POS => possessive ending (parent's)
- PRP => personal pronoun (hers, herself, him, himself)
- PRP$ => possessive pronoun (her, his, my, our)
- RB => adverb (occasionally, swiftly)
- RBR => adverb, comparative (better)
- RBS => adverb, superlative (best)
- RP => particle (about)
- TO => infinitive marker (to)
- UH => interjection (goodbye)
- VB => verb, base form (ask)
- VBG => verb, gerund or present participle (judging)
- VBD => verb, past tense (pleaded)
- VBN => verb, past participle (reunified)
- VBP => verb, present tense, not 3rd person singular (wrap)
- VBZ => verb, present tense, 3rd person singular (bases)
- WDT => wh-determiner (that, what)
- WP => wh-pronoun (who)
- WRB => wh-adverb (how)

In [17]:
import nltk
#nltk.download('averaged_perceptron_tagger')


data =' We will see an example of POS tagging.'

pos = nltk.pos_tag(nltk.word_tokenize(data))

pos

[('We', 'PRP'),
 ('will', 'MD'),
 ('see', 'VB'),
 ('an', 'DT'),
 ('example', 'NN'),
 ('of', 'IN'),
 ('POS', 'NNP'),
 ('tagging', 'NN'),
 ('.', '.')]
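Since `pos_tag` returns a plain list of `(word, tag)` tuples, ordinary Python is enough to work with the result. The sketch below hard-codes the output shown above (so it runs without the tagger model), then pulls out the nouns and counts how often each tag occurs.

```python
from collections import Counter

# Tagged tokens copied from the output above.
pos = [('We', 'PRP'), ('will', 'MD'), ('see', 'VB'), ('an', 'DT'),
       ('example', 'NN'), ('of', 'IN'), ('POS', 'NNP'), ('tagging', 'NN'),
       ('.', '.')]

# All noun tags (NN, NNS, NNP, NNPS) start with "NN".
nouns = [word for word, tag in pos if tag.startswith('NN')]
print(nouns)  # ['example', 'POS', 'tagging']

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in pos)
print(tag_counts['NN'])  # 2
```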

* Chunking


After POS tagging, chunking can be used to give the data more structure by grouping tagged words according to a set of rules that we define. The example below shows how this works.


##### Use Case of Chunking
Chunking is used for entity detection. An entity is the part of a sentence from which a machine extracts the value for a given intent.

In [18]:
data =' We will see an example of POS tagging.'

pos = nltk.pos_tag(nltk.word_tokenize(data))

# Once POS tagging is done, suppose we want to structure the data further so that
# nouns are grouped under a node that we define ourselves.
# The rule reads: zero or more proper nouns (NNP) followed by one noun (NN).

my_node = "MN: {<NNP>*<NN>}"

chunk = nltk.RegexpParser(my_node)

result = chunk.parse(pos)

print(result)
#result.draw()  # this generates a graphical picture

(S
  We/PRP
  will/MD
  see/VB
  an/DT
  (MN example/NN)
  of/IN
  (MN POS/NNP tagging/NN)
  ./.)


* Graphical representation

<img src="chunk.PNG">


We can see that the NN token "example" and the NNP + NN sequence "POS tagging" are now grouped under "MN", the node name we gave in the rule.

So, whenever we need to group different tags under one label, we can use chunking.
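The same mechanism extends to more useful grammars. The sketch below uses a common noun-phrase rule (an optional determiner, any number of adjectives, then one or more nouns) and then walks the tree to collect the chunks as strings. The label `NP` and the grammar are choices made for this illustration, and the tagged tokens are hard-coded from the earlier output so that no tagger model is needed.

```python
import nltk

pos = [('We', 'PRP'), ('will', 'MD'), ('see', 'VB'), ('an', 'DT'),
       ('example', 'NN'), ('of', 'IN'), ('POS', 'NNP'), ('tagging', 'NN'),
       ('.', '.')]

# <NN.*> matches every tag that starts with NN (NN, NNS, NNP, NNPS).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(pos)

# Each chunk is a subtree labelled NP; leaves() gives back (word, tag) pairs.
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(phrases)  # ['an example', 'POS tagging']
```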

## Stop Words
Stop words are words that occur very frequently, such as ‘a’, ‘an’, ‘the’, and ‘at’. We usually drop them during preprocessing because they carry little information and only take up extra space. We can also build our own custom list of stop words. Different libraries ship different stop word lists. Let’s see the stop word list that comes with NLTK:

In [19]:
# Import the stop word corpus
from nltk.corpus import stopwords

# If this raises a LookupError, download the stop word list first:
#nltk.download('stopwords')

In [20]:
stop_words = stopwords.words('english')

In [21]:
print (stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [22]:
len(stop_words)

179

In [23]:
# Another way: build a set, which is convenient for fast lookups and updates
stopset = set(nltk.corpus.stopwords.words('english'))

In [24]:
# Adding custom stop words
stopset.update(('new','wonder'))
len(stopset)

181

#### As with stop words, we can also ignore punctuation in our sentences.

In [25]:
# import string
import string

In [26]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
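One common way to drop these characters from raw text, before or instead of filtering tokens, is `str.translate` with a deletion table. A minimal sketch using only the standard library; note that it also removes apostrophes and hyphens inside words, which may or may not be what you want.

```python
import string

# Translation table that deletes every character in string.punctuation.
remove_punct = str.maketrans('', '', string.punctuation)

sample = "India (Hindi: Bhārat), officially the Republic of India, is a country in South Asia."
cleaned = sample.translate(remove_punct)
print(cleaned)
# India Hindi Bhārat officially the Republic of India is a country in South Asia

# Side effect: hyphenated words are fused together.
fused = "seventh-largest".translate(remove_punct)
print(fused)  # seventhlargest
```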

In [27]:
# Remove stop words and punctuation from the India paragraph shown earlier

import nltk
import string
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
punct = string.punctuation

In [29]:
# Let's check the punctuation characters
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [30]:
#our text
text = "India (Hindi: Bhārat), officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia."

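# Empty list to collect the cleaned tokens
cleaned_text = []

# Note: `word not in punct` is a substring test against the punctuation string, and the
# stop word comparison is case-sensitive, so the capitalised 'It' and 'In' are kept.
for word in nltk.word_tokenize(text):
    if word not in punct:
        if word not in stop_words:
            cleaned_text.append(word)

# len(text) counts characters, while len(cleaned_text) counts tokens
print('Original length (characters) ==>', len(text))
print('Cleaned length (tokens) ==>', len(cleaned_text))
print('\n', cleaned_text)


Original length (characters) ==> 614
Cleaned length (tokens) ==> 57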

 ['India', 'Hindi', 'Bhārat', 'officially', 'Republic', 'India', 'country', 'South', 'Asia', 'It', 'seventh-largest', 'country', 'area', 'second-most', 'populous', 'country', 'populous', 'democracy', 'world', 'Bounded', 'Indian', 'Ocean', 'south', 'Arabian', 'Sea', 'southwest', 'Bay', 'Bengal', 'southeast', 'shares', 'land', 'borders', 'Pakistan', 'west', 'China', 'Nepal', 'Bhutan', 'north', 'Bangladesh', 'Myanmar', 'east', 'In', 'Indian', 'Ocean', 'India', 'vicinity', 'Sri', 'Lanka', 'Maldives', 'Andaman', 'Nicobar', 'Islands', 'share', 'maritime', 'border', 'Thailand', 'Indonesia']
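In the output above, 'It' and 'In' survive because NLTK's English stop words are all lowercase and the comparison is case-sensitive. One way to handle this is to compare the lowercased token while keeping the original spelling in the result. The sketch below is self-contained so that it runs without downloads: `SMALL_STOP_WORDS` is a hand-picked stand-in for `stopwords.words('english')`, and the regular expression is only a rough substitute for `word_tokenize`.

```python
import re
import string

# Hypothetical stand-in for stopwords.words('english'); a set gives fast lookups.
SMALL_STOP_WORDS = {'it', 'is', 'the', 'by', 'in', 'and', 'a', 'of'}
PUNCT = set(string.punctuation)

def clean(text, stop_words=SMALL_STOP_WORDS):
    # Rough tokenizer: (hyphenated) words, or single non-space symbols.
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
    return [t for t in tokens
            if t not in PUNCT and t.lower() not in stop_words]

sample = "It is the seventh-largest country by area. In the Indian Ocean, India is near Sri Lanka."
result = clean(sample)
print(result)
# ['seventh-largest', 'country', 'area', 'Indian', 'Ocean', 'India', 'near', 'Sri', 'Lanka']
```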


## Cases

In [31]:
# Convert into Lower case
print (text.lower())

# Convert into Upper case
print (text.upper())

india (hindi: bhārat), officially the republic of india, is a country in south asia. it is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. bounded by the indian ocean on the south, the arabian sea on the southwest, and the bay of bengal on the southeast, it shares land borders with pakistan to the west; china, nepal, and bhutan to the north; and bangladesh and myanmar to the east. in the indian ocean, india is in the vicinity of sri lanka and the maldives; its andaman and nicobar islands share a maritime border with thailand and indonesia.
INDIA (HINDI: BHĀRAT), OFFICIALLY THE REPUBLIC OF INDIA, IS A COUNTRY IN SOUTH ASIA. IT IS THE SEVENTH-LARGEST COUNTRY BY AREA, THE SECOND-MOST POPULOUS COUNTRY, AND THE MOST POPULOUS DEMOCRACY IN THE WORLD. BOUNDED BY THE INDIAN OCEAN ON THE SOUTH, THE ARABIAN SEA ON THE SOUTHWEST, AND THE BAY OF BENGAL ON THE SOUTHEAST, IT SHARES LAND BORDERS WITH PAKISTAN TO THE WEST; CHINA, NEPA
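Case normalisation matters mostly when tokens are compared or counted: without it, 'In' and 'in' are treated as different words. A small standard-library sketch of the effect, using a made-up token list:

```python
from collections import Counter

tokens = ['In', 'the', 'Indian', 'Ocean', 'India', 'is', 'in', 'the', 'vicinity']

raw_counts = Counter(tokens)
lower_counts = Counter(t.lower() for t in tokens)

print(raw_counts['in'], raw_counts['In'])  # 1 1
print(lower_counts['in'])                  # 2
```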

## Stemming

- Stemming means mapping a group of words to the same stem by removing prefixes or suffixes, without regard to whether the resulting stem is a grammatically meaningful word.

e.g.

computation --> comput

computer --> comput 

hobbies --> hobbi

We can see that stemming tries to reduce words to a common base form, but that base form may or may not be a real word.
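To make the idea concrete, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the rule list and the `toy_stem` name are invented for this illustration.

```python
# (suffix, replacement) pairs, tried in order; the first match wins.
RULES = [('ies', 'i'), ('ation', ''), ('ing', ''), ('er', ''), ('s', '')]

def toy_stem(word, min_stem=3):
    word = word.lower()
    for suffix, replacement in RULES:
        # Only strip if a reasonably long stem is left over.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)] + replacement
    return word

for w in ['computation', 'computer', 'hobbies', 'thing']:
    print(w, '-->', toy_stem(w))
# computation --> comput
# computer --> comput
# hobbies --> hobbi
# thing --> thing
```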

NLTK provides several stemmers. We will look at the following popular ones:
- 1)	Porter Stemmer
- 2)	Lancaster Stemmer
- 3)	Snowball Stemmer

Let’s see how to use them:


In [32]:
import nltk

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

In [33]:
lancaster = LancasterStemmer()

porter = PorterStemmer()

Snowball = SnowballStemmer('english')


print('Porter stemmer')
print(porter.stem("hobby"))
print(porter.stem("hobbies"))
print(porter.stem("computer"))
print(porter.stem("computation"))
print("**************************")  

print('lancaster stemmer')
print(lancaster.stem("hobby"))
print(lancaster.stem("hobbies"))
print(lancaster.stem("computer"))
print(lancaster.stem("computation"))
print("**************************")  

Porter stemmer
hobbi
hobbi
comput
comput
**************************
lancaster stemmer
hobby
hobby
comput
comput
**************************


In [34]:
# Let's try a full sentence

sentence = "I was going to the office on my bike when i saw a car passing by hit the tree."

token = nltk.word_tokenize(sentence)

# Output order: Snowball, Lancaster, Porter
for stemmer in (Snowball, lancaster, porter):
    stemm = [stemmer.stem(t) for t in token]
    print(" ".join(stemm))

i was go to the offic on my bike when i saw a car pass by hit the tree .
i was going to the off on my bik when i saw a car pass by hit the tre .
I wa go to the offic on my bike when i saw a car pass by hit the tree .


The Lancaster stemmer is often described as faster than Porter, but it is more aggressive: in the output above it reduces 'office' to 'off' and 'bike' to 'bik'.
The Porter stemmer is the oldest of the three and was long the most popular choice.

The Snowball stemmer, also known as Porter2, is an updated version of the Porter stemmer and is now generally the most popular of the three.

The Snowball stemmer is also available for multiple languages.
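A side-by-side comparison makes the differences easier to see. The sketch below runs the three stemmers over a few words and lists some of the languages Snowball supports through the `SnowballStemmer.languages` attribute. The word list is arbitrary, and the stemmers need no downloaded data.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmers = {
    'porter': PorterStemmer(),
    'lancaster': LancasterStemmer(),
    'snowball': SnowballStemmer('english'),
}

words = ['running', 'computation', 'office', 'bike']

# One row per word, one column per stemmer.
table = {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}
for w, row in table.items():
    print(f"{w:12} {row['porter']:10} {row['lancaster']:10} {row['snowball']:10}")

# Snowball is implemented for several languages besides English.
print(sorted(SnowballStemmer.languages)[:5])
```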

In [35]:
# One more simple example with Porter: regular inflections are reduced, the irregular form 'ran' is not
print(porter.stem("running"))
print(porter.stem("runs"))
print(porter.stem("ran"))

run
run
ran


### Lemmatization


Lemmatization, like stemming, tries to reduce a word to its base form. Unlike stemming, it takes the meaning of the word into account, so the base form is always a valid word of the language. This base form is known as the ‘lemma’.

We use the WordNet Lemmatizer for lemmatization in NLTK.

In [36]:
from nltk.stem import WordNetLemmatizer

In [37]:
lemma = WordNetLemmatizer()

print(lemma.lemmatize('running'))
print(lemma.lemmatize('runs'))
print(lemma.lemmatize('ran'))

running
run
ran


Here, the three forms of the same verb do not share a lemma.

This is because we haven’t given the lemmatizer any context: by default, `lemmatize` treats every word as a noun (`pos='n'`).

Context is generally supplied by passing the POS tag of each word,
e.g.:


In [38]:
print(lemma.lemmatize('running',pos='v'))
print(lemma.lemmatize('runs',pos='v'))
print(lemma.lemmatize('ran',pos='v'))

run
run
run
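In practice the `pos` argument usually comes from a POS tagger, whose Penn Treebank tags have to be translated into the single-letter codes that WordNet uses ('n', 'v', 'a', 'r'). The helper below is a common idiom rather than an NLTK function; its name is made up for this sketch, and it only looks at the first letter of the tag.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's own default

# Hand-written tagged tokens, so no tagger model is needed here.
tagged = [('He', 'PRP'), ('was', 'VBD'), ('running', 'VBG'), ('quickly', 'RB')]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
# [('He', 'n'), ('was', 'v'), ('running', 'v'), ('quickly', 'r')]
```

Each pair can then be passed on as `lemma.lemmatize(word, pos=code)`, which requires the WordNet data to be installed.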


Lemmatization is more complex than stemming and takes more time to compute.

So, it should be used only when the real meaning of the words or their context matters for processing; otherwise stemming should be preferred.

The right choice depends entirely on the type of problem you are trying to solve.

In [39]:
# One more example using both stemming and lemma
#text = "studies studying cries cry"
text = "Bring King Going Anything Sing Ring Nothing Thing"

# Stemming
import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer  = PorterStemmer()

tokenization = nltk.word_tokenize(text)

for w in tokenization:
    print ("Stemming for {} is {}".format(w,porter_stemmer.stem(w))) 

Stemming for Bring is bring
Stemming for King is king
Stemming for Going is go
Stemming for Anything is anyth
Stemming for Sing is sing
Stemming for Ring is ring
Stemming for Nothing is noth
Stemming for Thing is thing


In [40]:
# Lemma 

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

tokenization = nltk.word_tokenize(text)

for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w))) 

Lemma for Bring is Bring
Lemma for King is King
Lemma for Going is Going
Lemma for Anything is Anything
Lemma for Sing is Sing
Lemma for Ring is Ring
Lemma for Nothing is Nothing
Lemma for Thing is Thing


In [41]:
# Lemmatization takes more time to give the response

# WordNet

- WordNet is a lexical database for English that NLTK exposes through a corpus reader. It can be used to look up the meanings, synonyms, and antonyms of words. One can think of it as a semantically oriented dictionary of English.

In [42]:
from nltk.corpus import wordnet

In [43]:
# Synset: a synonym set, i.e. a collection of words that share a meaning.
# Let us check an example

syns = wordnet.synsets("cat")

print(syns)

[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]


In [45]:
# Let's find synonyms and antonyms with Python
from nltk.corpus import wordnet
synonyms = []
antonyms = []

for syn in wordnet.synsets("active"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print('Synonyms =>',set(synonyms))
print('Antonyms =>',set(antonyms))

Synonyms => {'dynamic', 'fighting', 'active_agent', 'participating', 'active_voice', 'combat-ready', 'alive', 'active'}
Antonyms => {'quiet', 'stative', 'dormant', 'passive_voice', 'passive', 'inactive', 'extinct'}
