This notebook covers some basic concepts of natural language processing (NLP). Typical libraries include:

1) For classical machine learning (ML): nltk, spaCy, TextBlob

2) For deep learning (DL): TensorFlow

The following concepts are covered here:

1) Tokenization: breaking a corpus down into sentences and words.

2) Stopwords: removing common words that carry little meaning (such as "the" or "is").

3) Stemming: reducing words to their stems by stripping suffixes; the stem may not be a real word.

4) Lemmatization: reducing words to their meaningful dictionary form (lemma).

In [None]:
!pip install nltk

In [2]:
paragraph="""
Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] (listen); born 17 September 1950)[b] is an Indian politician serving as the 14th and current Prime Minister of India since 2014. Modi was the Chief Minister of Gujarat from 2001 to 2014 and is the Member of Parliament from Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest serving prime minister from outside the Indian National Congress.

Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at age eight. He has reminisced about helping out after school at his father's tea stall at the Vadnagar railway station. At age 18, Modi was married to Jashodaben Chimanlal Modi, whom he abandoned soon after. He first publicly acknowledged her as his wife more than four decades later when required to do so by Indian law, but has made no contact with her since. Modi has asserted he had travelled in northern India for two years after leaving his parental home, visiting a number of religious centres, but few details of his travels have emerged. Upon his return to Gujarat in 1971, he became a full-time worker for the RSS. After the state of emergency was declared by prime minister Indira Gandhi in 1975, Modi went into hiding. The RSS assigned him to the BJP in 1985 and he held several positions within the party hierarchy until 2001, rising to the rank of general secretary.[c]
"""


In [3]:
paragraph

"\nNarendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] (listen); born 17 September 1950)[b] is an Indian politician serving as the 14th and current Prime Minister of India since 2014. Modi was the Chief Minister of Gujarat from 2001 to 2014 and is the Member of Parliament from Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest serving prime minister from outside the Indian National Congress.\n\nModi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at age eight. He has reminisced about helping out after school at his father's tea stall at the Vadnagar railway station. At age 18, Modi was married to Jashodaben Chimanlal Modi, whom he abandoned soon after. He first publicly acknowledged her as his wife more than four decades later when required to do so b

In [6]:
import nltk
from nltk.stem import PorterStemmer # to reduce words to their stems
from nltk.corpus import stopwords # to get rid of unwanted/irrelevant words

### 1. Tokenization:
Tokenization breaks the corpus down into sentences and words for further analysis.

In [5]:
nltk.download('punkt') # Punkt is a pretrained sentence tokenizer model; this downloads it to the local machine. On recent NLTK releases, nltk.download('punkt_tab') may be needed as well.
documents=nltk.sent_tokenize(paragraph) # split the corpus into sentences (documents)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
documents
# Here it breaks down the Corpus into documents/sentences

In [9]:
documents[0]

'\nNarendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] (listen); born 17 September 1950)[b] is an Indian politician serving as the 14th and current Prime Minister of India since 2014.'
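
As a rough illustration of what sentence and word tokenization do, the sketch below uses plain regular expressions instead of NLTK. The helper names `naive_sent_tokenize` and `naive_word_tokenize` are made up for this example, and the splitting rule (break after `.`, `!` or `?` followed by whitespace) is a simplifying assumption; the Punkt model used above is considerably smarter, for instance about abbreviations.

```python
import re

def naive_sent_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace (a simplification)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def naive_word_tokenize(sentence):
    """Return runs of letters, digits and apostrophes as word tokens."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

sample = "He was introduced to the RSS at age eight. Modi went into hiding."
sentences = naive_sent_tokenize(sample)
print(sentences)
print(naive_word_tokenize(sentences[0]))
```

The first call yields two sentences, and the second yields the individual words of the first sentence with the trailing full stop dropped.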

### 2. Stemming:
Stemming reduces words to their base or root form by stripping suffixes; the result is not always a real word.

In [11]:
# Initializing an instance of the PorterStemmer class
stemmer=PorterStemmer()

In [14]:
stemmer.stem('finalised')

'finalis'
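
To see why a stem such as 'finalis' is not a dictionary word, here is a deliberately tiny suffix stripper. It is a toy written for this notebook, not the Porter algorithm: the suffix list, the minimum stem length of 3, and the name `toy_stem` are all assumptions chosen for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order; an illustrative list only

def toy_stem(word):
    """Strip the first matching suffix, as long as at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["finalised", "serving", "married", "tea"]:
    print(w, "->", toy_stem(w))
```

The stripper only looks at the characters at the end of the word, so it has no way to know that 'finalis' or 'marri' are not real words. A real stemmer applies many more rules, but the principle is the same.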

### 2. Lemmatization:
Lemmatization also reduces words to their base form, but unlike stemming, which just strips suffixes, it returns meaningful dictionary words (lemmas). However, lemmatization requires more computational power.

In [51]:
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [52]:
Lemmatizer=WordNetLemmatizer()

In [53]:
Lemmatizer.lemmatize('goes')

'go'
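
Lemmatization depends on the part of speech: NLTK's `WordNetLemmatizer.lemmatize` accepts a `pos` argument and treats the word as a noun when none is given. The sketch below mimics that idea with a small hand-written lookup table. The table contents and the name `toy_lemmatize` are hypothetical and stand in for the WordNet dictionary; this is not how NLTK is implemented.

```python
# Hypothetical (word, part-of-speech) -> lemma table standing in for a real dictionary.
LEMMA_TABLE = {
    ("goes", "v"): "go",
    ("went", "v"): "go",
    ("serving", "v"): "serve",
    ("serving", "n"): "serving",
    ("decades", "n"): "decade",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; return it unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("went", pos="v"))
print(toy_lemmatize("serving"))
print(toy_lemmatize("serving", pos="v"))
```

With the default noun tag 'serving' stays as it is, while tagging it as a verb gives 'serve'. A lookup like this can also map irregular forms such as 'went' to 'go', which suffix stripping cannot do.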

### 3: Text Cleaning:
Text cleaning removes special characters from the paragraph. For that we are going to use a regular expression.

In [32]:
# This regular expression replaces every character other than a-z or A-Z (digits, punctuation, non-ASCII letters) with a space.
# The text is also converted to lower case, because otherwise capitalized and lowercase forms of a word would be treated as separate words.
import re
corpus=[]
for i in documents:
  corpus.append((re.sub(r'[^a-zA-Z]',' ',i)).lower())
corpus

[' narendra damodardas modi  gujarati    n  end   d mod   d s  modi    listen   born    september       b  is an indian politician serving as the   th and current prime minister of india since      ',
 'modi was the chief minister of gujarat from      to      and is the member of parliament from varanasi ',
 'he is a member of the bharatiya janata party  bjp  and of the rashtriya swayamsevak sangh  rss   a right wing hindu nationalist paramilitary volunteer organisation ',
 'he is the longest serving prime minister from outside the indian national congress ',
 'modi was born and raised in vadnagar in northeastern gujarat  where he completed his secondary education ',
 'he was introduced to the rss at age eight ',
 'he has reminisced about helping out after school at his father s tea stall at the vadnagar railway station ',
 'at age     modi was married to jashodaben chimanlal modi  whom he abandoned soon after ',
 'he first publicly acknowledged her as his wife more than four decades l
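
The effect of the pattern `[^a-zA-Z]` is easier to see on short inputs. The sketch below wraps the same substitution in a helper (the name `clean_text` is made up here) and additionally collapses the runs of spaces that the substitution leaves behind, which the cell above does not do.

```python
import re

def clean_text(sentence):
    """Replace every non-ASCII-letter character with a space, lowercase, collapse spaces."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', sentence)
    return ' '.join(letters_only.lower().split())

print(clean_text("Born 17 September 1950 (BJP)."))
print(clean_text("his father's tea stall"))
print(clean_text("ˈmodiː"))
```

Digits, brackets and phonetic symbols all disappear, and the apostrophe in "father's" splits it into 'father' and a stray 's'. This is why single-letter fragments show up in the cleaned corpus.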

In [39]:
stemmer.stem('married')

'marri'

In [45]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### 4: Word Tokenization + Stopwords + Stemming/Lemmatization:
Here we break the sentences down into words, remove irrelevant words, and then apply stemming or lemmatization to reduce the words to their base/root forms, which shrinks the vocabulary and therefore the number of dimensions.

In [None]:
# Word tokenization: breaking down sentences into words
# Stopwords: removing irrelevant words
# Stemming: reducing words to their stems
stop_words=set(stopwords.words('english')) # build the stopword set once instead of once per word
for i in corpus:
  words=nltk.word_tokenize(i)
  for word in words:
    if word not in stop_words:
      print(stemmer.stem(word))


In [None]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
Lemmatizer=WordNetLemmatizer()

stop_words=set(stopwords.words('english')) # build the stopword set once instead of once per word
for i in corpus:
  words=nltk.word_tokenize(i)
  for word in words:
    if word not in stop_words:
      print(Lemmatizer.lemmatize(word))
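
The loops above print each processed word. If the result is needed later, it may be more convenient to collect the words in a list. A minimal sketch follows, assuming a tiny hand-written stopword set in place of NLTK's English list; `STOP_WORDS`, `preprocess` and the `normalize` parameter are names invented for this example.

```python
# A tiny illustrative subset; NLTK's English stopword list is much longer.
STOP_WORDS = {"he", "was", "to", "the", "at", "his", "of", "and", "is", "a"}

def preprocess(sentence, stop_words=STOP_WORDS, normalize=lambda w: w):
    """Split on whitespace, drop stopwords, then apply a normalizer (identity by default)."""
    return [normalize(w) for w in sentence.split() if w not in stop_words]

tokens = preprocess("he was introduced to the rss at age eight")
print(tokens)
```

In the notebook, `normalize` could be `stemmer.stem` or `Lemmatizer.lemmatize`, and `stop_words` could be the set built from `stopwords.words('english')`.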

### 5: Converting words into vectors:
a) Bag of Words (BoW): each sentence is converted into a vector with one feature per vocabulary word.

In [85]:
# Applying stopword removal and lemmatization
import re
stop_words=set(stopwords.words('english')) # build the stopword set once
corpus=[]
for i in documents:
  review=re.sub(r'[^a-zA-Z]',' ',i)
  review=review.lower()
  review=review.split()
  review=[Lemmatizer.lemmatize(word) for word in review if word not in stop_words]
  review=' '.join(review)
  corpus.append(review)



In [86]:
corpus

['narendra damodardas modi gujarati n end mod modi listen born september b indian politician serving th current prime minister india since',
 'modi chief minister gujarat member parliament varanasi',
 'member bharatiya janata party bjp rashtriya swayamsevak sangh r right wing hindu nationalist paramilitary volunteer organisation',
 'longest serving prime minister outside indian national congress',
 'modi born raised vadnagar northeastern gujarat completed secondary education',
 'introduced r age eight',
 'reminisced helping school father tea stall vadnagar railway station',
 'age modi married jashodaben chimanlal modi abandoned soon',
 'first publicly acknowledged wife four decade later required indian law made contact since',
 'modi asserted travelled northern india two year leaving parental home visiting number religious centre detail travel emerged',
 'upon return gujarat became full time worker r',
 'state emergency declared prime minister indira gandhi modi went hiding',
 'r assig

In [57]:
from sklearn.feature_extraction.text import CountVectorizer


In [70]:
cv=CountVectorizer(binary=True) # binary=True gives a binary bag of words: 1 if the word occurs in the sentence, 0 otherwise
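
The difference between a count-based and a binary bag of words can be shown with the standard library alone. In this sketch, `sentence` is a shortened, made-up version of the first cleaned sentence, in which 'modi' occurs twice.

```python
from collections import Counter

sentence = "narendra damodardas modi gujarati modi listen born"

counts = Counter(sentence.split())      # how often each word occurs
binary = {word: 1 for word in counts}   # only whether each word occurs

print(counts["modi"], binary["modi"])
```

A count-based representation records 2 for 'modi', while the binary one records 1. The binary form discards frequency information and keeps only presence.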

### We have already done tokenization, text cleaning, stopword removal and lemmatization; now we convert the words into vectors.

In [87]:

X=cv.fit_transform(corpus)

In [88]:
cv.vocabulary_

{'narendra': 56,
 'damodardas': 16,
 'modi': 55,
 'gujarati': 32,
 'end': 24,
 'mod': 54,
 'listen': 48,
 'born': 8,
 'september': 86,
 'indian': 40,
 'politician': 68,
 'serving': 87,
 'th': 96,
 'current': 15,
 'prime': 70,
 'minister': 53,
 'india': 39,
 'since': 89,
 'chief': 10,
 'gujarat': 31,
 'member': 52,
 'parliament': 66,
 'varanasi': 103,
 'bharatiya': 6,
 'janata': 43,
 'party': 67,
 'bjp': 7,
 'rashtriya': 75,
 'swayamsevak': 94,
 'sangh': 82,
 'right': 80,
 'wing': 108,
 'hindu': 37,
 'nationalist': 58,
 'paramilitary': 64,
 'volunteer': 105,
 'organisation': 62,
 'longest': 49,
 'outside': 63,
 'national': 57,
 'congress': 13,
 'raised': 73,
 'vadnagar': 102,
 'northeastern': 59,
 'completed': 12,
 'secondary': 84,
 'education': 20,
 'introduced': 42,
 'age': 2,
 'eight': 21,
 'reminisced': 77,
 'helping': 34,
 'school': 83,
 'father': 25,
 'tea': 95,
 'stall': 91,
 'railway': 72,
 'station': 93,
 'married': 51,
 'jashodaben': 44,
 'chimanlal': 11,
 'abandoned': 0,
 'so

In [90]:
corpus[0]

'narendra damodardas modi gujarati n end mod modi listen born september b indian politician serving th current prime minister india since'

In [91]:
X[0].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0]])
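
To connect `cv.vocabulary_` with the array above, the following sketch builds a binary bag of words by hand in plain Python. It imitates two behaviours of `CountVectorizer`: column indices follow the alphabetical order of the vocabulary, and the default token pattern ignores single-character tokens (which is why fragments like 'n', 'b' and 'r' have no column). The function names and the two sample sentences are assumptions made for this illustration.

```python
def build_vocabulary(docs):
    """Map each token of 2+ characters to a column index, in alphabetical order."""
    tokens = {tok for doc in docs for tok in doc.split() if len(tok) >= 2}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

def binary_vector(doc, vocabulary):
    """Return a 0/1 row with a 1 in the column of every vocabulary word in doc."""
    row = [0] * len(vocabulary)
    for tok in doc.split():
        if tok in vocabulary:
            row[vocabulary[tok]] = 1
    return row

docs = ["modi chief minister gujarat", "r longest serving prime minister"]
vocab = build_vocabulary(docs)
vectors = [binary_vector(d, vocab) for d in docs]

print(vocab)
print(vectors)
# Read a row back into words by looking up which columns are set.
print([tok for tok, idx in vocab.items() if vectors[0][idx]])
```

Each row has one column per vocabulary word, so the position of every 1 in `X[0].toarray()` can be traced back to a word through `cv.vocabulary_` in the same way.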