## Natural Language Processing: Tokenization, Stemming, Lemmatization, Vectorization and Bag of Words

In [1]:
import nltk

In [4]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [5]:
paragraph="""In 3000 years of our history, people from all over the world have come and invaded us, captured our lands and conquered our minds. Yet, we have not conquered anyone. Because, we respect the freedom of others, and that is the reason for his first vision of Freedom. India got its first vision of this in the Indian Rebellion in the year 1857, when we started the war of Independence. It is this freedom that we must protect and nurture and build on.

His Second Vision: Development

We have been a developing nation for fifty years, and so it is time we see ourselves as a developed nation. In terms of GDP, we are among the top five nations of the world. Our poverty levels are falling. Our achievements are being globally recognised today. Yet we lack the self-confidence to see ourselves as a developed nation.

His Third Vision: India must stand up to the World

India must stand up to the world. Unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic power. Both must go hand-in-hand.
"""

In [6]:
# Tokenize the paragraph into sentences
sentences=nltk.sent_tokenize(paragraph)

In [7]:
sentences

['In 3000 years of our history, people from all over the world have come and invaded us, captured our lands and conquered our minds.',
 'Yet, we have not conquered anyone.',
 'Because, we respect the freedom of others, and that is the reason for his first vision of Freedom.',
 'India got its first vision of this in the Indian Rebellion in the year 1857, when we started the war of Independence.',
 'It is this freedom that we must protect and nurture and build on.',
 'His Second Vision: Development\n\nWe have been a developing nation for fifty years, and so it is time we see ourselves as a developed nation.',
 'In terms of GDP, we are among the top five nations of the world.',
 'Our poverty levels are falling.',
 'Our achievements are being globally recognised today.',
 'Yet we lack the self-confidence to see ourselves as a developed nation.',
 'His Third Vision: India must stand up to the World\n\nIndia must stand up to the world.',
 'Unless India stands up to the world, no one will res

In [9]:
sentences[-1]

'Both must go hand-in-hand.'

In [10]:
# Tokenize the paragraph into words
words=nltk.word_tokenize(paragraph)

In [12]:
len(words)

225

In [15]:
# Stemming
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
stemmer=PorterStemmer()

In [20]:
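stop_words=stopwords.words("english")
for i in range(len(sentences)):
    # Tokenize each sentence, drop stop words, and stem what is left
    words=nltk.word_tokenize(sentences[i])
    # Note: the test is case-sensitive and the stop word list is lowercase,
    # so capitalised tokens such as "In" or "We" are not filtered out
    words=[stemmer.stem(word) for word in words if word not in stop_words]
    # Overwrite the sentence in place with its stemmed form
    sentences[i]=' '.join(words)
    print(sentences[i])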

In 3000 year histori , peopl world come invad us , captur land conquer mind .
yet , conquer anyon .
becaus , respect freedom other , reason first vision freedom .
india got first vision indian rebellion year 1857 , start war independ .
It freedom must protect nurtur build .
hi second vision : develop We develop nation fifti year , time see develop nation .
In term gdp , among top five nation world .
our poverti level fall .
our achiev global recognis today .
yet lack self-confid see develop nation .
hi third vision : india must stand world india must stand world .
unless india stand world , one respect us .
onli strength respect strength .
We must strong militari power also econom power .
both must go hand-in-hand .
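
The output above still contains capitalised tokens such as `In`, `It` and `We`, and `Our` survives as `our`, because NLTK's English stop word list is lowercase and `word not in stop_words` is a case-sensitive test. Below is a minimal sketch of a case-insensitive filter. It assumes a tiny hand-written stop word set in place of `stopwords.words("english")`; `STOP_WORDS` and `remove_stop_words` are illustrative names, not part of NLTK.

```python
# Illustrative subset only; the notebook uses stopwords.words("english").
STOP_WORDS = {"in", "it", "we", "our", "of", "the", "is", "that", "and", "on", "this"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["It", "is", "this", "freedom", "that", "we", "must", "protect"]

# Case-sensitive test, as in the cell above: "It" slips through.
case_sensitive = [tok for tok in tokens if tok not in STOP_WORDS]
# Case-insensitive test: "It" is removed as well.
case_insensitive = remove_stop_words(tokens)

print(case_sensitive)    # ['It', 'freedom', 'must', 'protect']
print(case_insensitive)  # ['freedom', 'must', 'protect']
```

Comparing on `tok.lower()` leaves the token's original casing untouched, so any later step that depends on capitalisation still sees it.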


In [21]:
# Stemming: strips suffixes to reach a root form, which may not be a meaningful word (e.g. "histori")
# Lemmatization: maps each word to its dictionary form (lemma), which is normally a meaningful word
from nltk.stem import WordNetLemmatizer

In [50]:
# Re-tokenize the paragraph, because the stemming loop overwrote `sentences` in place
sentences=nltk.sent_tokenize(paragraph)
lemmatizer=WordNetLemmatizer()

In [23]:
for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    words=[lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i]=' '.join(words)
    print(sentences[i])

In 3000 year history , people world come invaded u , captured land conquered mind .
Yet , conquered anyone .
Because , respect freedom others , reason first vision Freedom .
India got first vision Indian Rebellion year 1857 , started war Independence .
It freedom must protect nurture build .
His Second Vision : Development We developing nation fifty year , time see developed nation .
In term GDP , among top five nation world .
Our poverty level falling .
Our achievement globally recognised today .
Yet lack self-confidence see developed nation .
His Third Vision : India must stand World India must stand world .
Unless India stand world , one respect u .
Only strength respect strength .
We must strong military power also economic power .
Both must go hand-in-hand .


In [40]:
# Bag of words - document term matrix
import string

In [51]:
corpus=[]
for i in range(len(sentences)):
    review=nltk.word_tokenize(sentences[i])
    review=[lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review=" ".join(review).lower()
    # Remove punctuation from each sentence
    for s in string.punctuation:
        review=review.replace(s,"")
    corpus.append(review)  # add the cleaned sentence to the corpus

In [52]:
corpus

['in 3000 year history  people world come invaded u  captured land conquered mind ',
 'yet  conquered anyone ',
 'because  respect freedom others  reason first vision freedom ',
 'india got first vision indian rebellion year 1857  started war independence ',
 'it freedom must protect nurture build ',
 'his second vision  development we developing nation fifty year  time see developed nation ',
 'in term gdp  among top five nation world ',
 'our poverty level falling ',
 'our achievement globally recognised today ',
 'yet lack selfconfidence see developed nation ',
 'his third vision  india must stand world india must stand world ',
 'unless india stand world  one respect u ',
 'only strength respect strength ',
 'we must strong military power also economic power ',
 'both must go handinhand ']
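
The cleaned strings above keep double spaces and trailing blanks where punctuation was deleted. As a hedged alternative to the character-by-character `replace` loop, the sketch below removes punctuation in one pass with `str.translate` and then collapses the leftover whitespace. `strip_punctuation` is a hypothetical helper name. Like the loop, it fuses hyphenated words (`self-confidence` becomes `selfconfidence`).

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Lowercase, delete punctuation, and collapse runs of whitespace."""
    no_punct = text.lower().translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())

print(strip_punctuation("Yet , conquered anyone ."))
# yet conquered anyone
print(strip_punctuation("Yet lack self-confidence see developed nation ."))
# yet lack selfconfidence see developed nation
```

Since `CountVectorizer` extracts tokens with a regular expression, the extra spaces in the original corpus should not change the resulting matrix; the tidier strings mainly help when inspecting the corpus by eye.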

In [57]:
# Create the bag of words (document-term matrix)
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
X=cv.fit_transform(corpus).toarray()

In [58]:
X

array([[0, 1, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [59]:
X.shape

(15, 73)
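
The shape is (number of sentences, vocabulary size). To make the counting explicit, here is a minimal pure-Python sketch of a bag of words built from three of the cleaned sentences. It is not scikit-learn's implementation. It only assumes the documented `CountVectorizer` defaults of lowercasing and keeping tokens of two or more word characters, which is why the single-letter token `u` never becomes a column. `bag_of_words` is a hypothetical helper name.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix as a list of rows)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = [
    "yet  conquered anyone ",
    "only strength respect strength ",
    "unless india stand world  one respect u ",
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary term, so the second sentence gets a 2 in the `strength` column and 0 in every column for a word it does not contain.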

In [60]:
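# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cv.get_feature_names_out().tolist()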

['1857',
 '3000',
 'achievement',
 'also',
 'among',
 'anyone',
 'because',
 'both',
 'build',
 'captured',
 'come',
 'conquered',
 'developed',
 'developing',
 'development',
 'economic',
 'falling',
 'fifty',
 'first',
 'five',
 'freedom',
 'gdp',
 'globally',
 'go',
 'got',
 'handinhand',
 'his',
 'history',
 'in',
 'independence',
 'india',
 'indian',
 'invaded',
 'it',
 'lack',
 'land',
 'level',
 'military',
 'mind',
 'must',
 'nation',
 'nurture',
 'one',
 'only',
 'others',
 'our',
 'people',
 'poverty',
 'power',
 'protect',
 'reason',
 'rebellion',
 'recognised',
 'respect',
 'second',
 'see',
 'selfconfidence',
 'stand',
 'started',
 'strength',
 'strong',
 'term',
 'third',
 'time',
 'today',
 'top',
 'unless',
 'vision',
 'war',
 'we',
 'world',
 'year',
 'yet']

In [61]:
import pandas as pd

In [63]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X,columns=cv.get_feature_names_out())

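| | 1857 | 3000 | achievement | also | among | anyone | because | both | build | captured | ... | time | today | top | unless | vision | war | we | world | year | yet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |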
