--------------------------------------------
--------------------------------------------
# NLP - Text Preprocessing using NLTK
--------------------------------------------
--------------------------------------------

In [1]:
# !conda update -n base -c conda-forge conda
# !conda install -c conda-forge spacy
# !conda install -c conda-forge cupy
# !python -m spacy download en_core_web_sm
# !pip install spacy
# !pip install nltk

In [2]:
# Basic Imports
import numpy as np
import pandas as pd

# NLTK Imports
import nltk
from nltk.stem import PorterStemmer       # For Stemming
from nltk.stem import WordNetLemmatizer   # For Lemmatization
from nltk.corpus import stopwords

In [3]:
# Warning Imports
import warnings
warnings.filterwarnings("ignore")

# Sklearn Imports
from sklearn.feature_extraction.text import CountVectorizer

# spaCy Import
import spacy

# Regular Expressions
import re

# NLTK Downloads
# ==============
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nitan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\nitan\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nitan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nitan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# OBJECT CREATION FOR STEMMER AND LEMMATIZER
# ==========================================
# Stemming - Porter Stemmer
# --------   --------------
stemmer = PorterStemmer() # Defining object

# Lemmatization - WordNetLemmatizer
# -------------   -----------------
lemmatizer = WordNetLemmatizer()


In [5]:
paragraph = """
Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] (listen); born 17 September 1950)[a] is an Indian politician serving as the 14th and current prime minister of India since 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament from Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the first prime minister to have been born after India's independence in 1947 and the second prime minister not belonging to the Indian National Congress to have won two consecutive majorities in the Lok Sabha, or the lower house of India's parliament. He is also the longest serving prime minister from a non-Congress party.

Born and raised in Vadnagar, a small town in northeastern Gujarat, Modi completed his secondary education there. He was introduced to the RSS at age eight. He has drawn attention to having to work as a child in his father's tea stall on the Vadnagar railway station platform, a description that has not been reliably corroborated. At age 18, Modi was married to Jashodaben Chimanlal Modi, whom he abandoned soon after. He left his parental home where she had come to live. He first publicly acknowledged her as his wife more than four decades later when required to do so by Indian law, but has made no contact with her since. Modi has asserted he had travelled in northern India for two years after leaving his parental home, visiting a number of religious centres, but few details of his travels have emerged. Upon his return to Gujarat in 1971, he became a full-time worker for the RSS. After the state of emergency was declared by prime minister Indira Gandhi in 1975, Modi went into hiding. The RSS assigned him to the BJP in 1985 and he held several positions within the party hierarchy until 2001, rising to the rank of general secretary.[b]

Modi was appointed Chief Minister of Gujarat in 2001 due to Keshubhai Patel's failing health and poor public image following the earthquake in Bhuj. Modi was elected to the legislative assembly soon after. His administration has been considered complicit in the 2002 Gujarat riots in which 1044 people were killed, three-quarters of whom were Muslim,[c] or otherwise criticised for its management of the crisis. A Supreme Court of India–appointed Special Investigation Team found no evidence to initiate prosecution proceedings against Modi personally.[d] While his policies as chief minister—credited with encouraging economic growth—have received praise, his administration has been criticised for failing to significantly improve health, poverty and education indices in the state.[e]

Modi led the BJP in the 2014 general election which gave the party a majority in the lower house of Indian parliament, the Lok Sabha, the first time for any single party since 1984. Modi's administration has tried to raise foreign direct investment in the Indian economy and reduced spending on healthcare and social welfare programmes. Modi has attempted to improve efficiency in the bureaucracy; he has centralised power by abolishing the Planning Commission. He began a high-profile sanitation campaign, controversially initiated a demonetisation of high-denomination banknotes and transformation of taxation regime, and weakened or abolished environmental and labour laws.

Under Modi's tenure, India has experienced democratic backsliding.[12][13][f] Following his party's victory in the 2019 general election, his administration revoked the special status of Jammu and Kashmir, introduced the Citizenship Amendment Act and three controversial farm laws, which prompted widespread protests and sit-ins across the country, resulting in a formal repeal of the latter. Described as engineering a political realignment towards right-wing politics, Modi remains a figure of controversy domestically and internationally over his Hindu nationalist beliefs and his handling of the 2002 Gujarat riots, cited as evidence of an exclusionary social agenda.[g]

"""

In [6]:
# Tokenization - convert paragraph into sentences and then focus on words
# ============   ========================================================

sentences = nltk.sent_tokenize(paragraph)
print(sentences[0:5])

['\nNarendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] (listen); born 17 September 1950)[a] is an Indian politician serving as the 14th and current prime minister of India since 2014.', 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament from Varanasi.', 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation.', "He is the first prime minister to have been born after India's independence in 1947 and the second prime minister not belonging to the Indian National Congress to have won two consecutive majorities in the Lok Sabha, or the lower house of India's parliament.", 'He is also the longest serving prime minister from a non-Congress party.']


In [7]:
corpus = []
for sentence in sentences:
    # Replace every character other than a-z and A-Z with a space " ",
    # then lowercase the result.
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower()
    # review = review.split()
    corpus.append(review)
print(corpus[0:5])

[' narendra damodardas modi  gujarati    n  end   d mod   d s  modi    listen   born    september       a  is an indian politician serving as the   th and current prime minister of india since      ', 'modi was the chief minister of gujarat from      to      and is the member of parliament from varanasi ', 'he is a member of the bharatiya janata party  bjp  and of the rashtriya swayamsevak sangh  rss   a right wing hindu nationalist paramilitary volunteer organisation ', 'he is the first prime minister to have been born after india s independence in      and the second prime minister not belonging to the indian national congress to have won two consecutive majorities in the lok sabha  or the lower house of india s parliament ', 'he is also the longest serving prime minister from a non congress party ']
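
The cleaned strings above keep a run of spaces wherever digits, punctuation or non-ASCII letters (such as the IPA transcription) were removed. A minimal sketch of a variant that collapses each run into a single space; `clean_sentence` is a hypothetical helper name, not part of the notebook.

```python
import re

def clean_sentence(text):
    """Keep only ASCII letters, turn every other run of characters into one space, lowercase."""
    # The '+' makes a whole run of non-letters collapse into a single space
    letters_only = re.sub(r'[^a-zA-Z]+', ' ', text)
    return letters_only.strip().lower()

print(clean_sentence("Modi was the chief minister of Gujarat from 2001 to 2014."))
# modi was the chief minister of gujarat from to
```

Downstream steps that call `split()` or `CountVectorizer` are unaffected by the extra spaces, so this mainly makes the printed corpus easier to read.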


-------------------
## <font color = 'red'> Stemming </font>
-------------------

In [8]:
# Printing the Stop Words
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [9]:
# Printing the stemmed words
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for sen in corpus:
    words = nltk.word_tokenize(sen)
    for word in words:
        if word not in stop_words:
            print(stemmer.stem(word))

narendra
damodarda
modi
gujarati
n
end
mod
modi
listen
born
septemb
indian
politician
serv
th
current
prime
minist
india
sinc
modi
chief
minist
gujarat
member
parliament
varanasi
member
bharatiya
janata
parti
bjp
rashtriya
swayamsevak
sangh
rss
right
wing
hindu
nationalist
paramilitari
volunt
organis
first
prime
minist
born
india
independ
second
prime
minist
belong
indian
nation
congress
two
consecut
major
lok
sabha
lower
hous
india
parliament
also
longest
serv
prime
minist
non
congress
parti
born
rais
vadnagar
small
town
northeastern
gujarat
modi
complet
secondari
educ
introduc
rss
age
eight
drawn
attent
work
child
father
tea
stall
vadnagar
railway
station
platform
descript
reliabl
corrobor
age
modi
marri
jashodaben
chimanl
modi
abandon
soon
left
parent
home
come
live
first
publicli
acknowledg
wife
four
decad
later
requir
indian
law
made
contact
sinc
modi
assert
travel
northern
india
two
year
leav
parent
home
visit
number
religi
centr
detail
travel
emerg
upon
return
gujarat
becam
full

-------------------
## <font color = 'red'> Lemmatization </font>
-------------------

In [10]:
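# Printing the lemmatized words
# lemmatize() defaults to pos='n' (noun), so verb forms such as 'serving' are left unchanged
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for sen in corpus:
    words = nltk.word_tokenize(sen)
    for word in words:
        if word not in stop_words:
            print(lemmatizer.lemmatize(word))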

narendra
damodardas
modi
gujarati
n
end
mod
modi
listen
born
september
indian
politician
serving
th
current
prime
minister
india
since
modi
chief
minister
gujarat
member
parliament
varanasi
member
bharatiya
janata
party
bjp
rashtriya
swayamsevak
sangh
r
right
wing
hindu
nationalist
paramilitary
volunteer
organisation
first
prime
minister
born
india
independence
second
prime
minister
belonging
indian
national
congress
two
consecutive
majority
lok
sabha
lower
house
india
parliament
also
longest
serving
prime
minister
non
congress
party
born
raised
vadnagar
small
town
northeastern
gujarat
modi
completed
secondary
education
introduced
r
age
eight
drawn
attention
work
child
father
tea
stall
vadnagar
railway
station
platform
description
reliably
corroborated
age
modi
married
jashodaben
chimanlal
modi
abandoned
soon
left
parental
home
come
live
first
publicly
acknowledged
wife
four
decade
later
required
indian
law
made
contact
since
modi
asserted
travelled
northern
india
two
year
leaving
pare

In [11]:
# Object for Count Vectorizer
# ---------------------------
# cv = CountVectorizer()          # For normal (count) bag-of-words
cv = CountVectorizer(binary=True) # For Binary bag-of-words
X = cv.fit_transform(corpus)
print(X.toarray()[0][0:20])
print(cv.vocabulary_)

[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0]
{'narendra': 182, 'damodardas': 64, 'modi': 179, 'gujarati': 114, 'end': 89, 'mod': 178, 'listen': 166, 'born': 34, 'september': 252, 'is': 149, 'an': 13, 'indian': 138, 'politician': 208, 'serving': 253, 'as': 17, 'the': 279, 'th': 276, 'and': 14, 'current': 63, 'prime': 215, 'minister': 177, 'of': 191, 'india': 137, 'since': 257, 'was': 299, 'chief': 41, 'gujarat': 113, 'from': 107, 'to': 283, 'member': 176, 'parliament': 199, 'varanasi': 295, 'he': 120, 'bharatiya': 31, 'janata': 152, 'party': 200, 'bjp': 33, 'rashtriya': 229, 'swayamsevak': 271, 'sangh': 247, 'rss': 245, 'right': 242, 'wing': 311, 'hindu': 129, 'nationalist': 184, 'paramilitary': 197, 'volunteer': 298, 'organisation': 194, 'first': 100, 'have': 118, 'been': 27, 'after': 7, 'independence': 136, 'in': 135, 'second': 249, 'not': 189, 'belonging': 30, 'national': 183, 'congress': 50, 'won': 314, 'two': 290, 'consecutive': 51, 'majorities': 172, 'lok': 168, 'sabha': 246, 'or': 1
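
The vocabulary printed above has no single-letter entries such as 'n' or 'a' because `CountVectorizer`'s default token pattern only keeps tokens of two or more word characters, and the indices follow the alphabetical order of the terms. A rough pure-Python sketch of what `binary=True` produces, assuming that default pattern and lowercasing; `binary_bow` is a hypothetical helper and does not reproduce every `CountVectorizer` option.

```python
import re

# Same regex as CountVectorizer's default token_pattern: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def binary_bow(docs):
    """Return (vocabulary, rows) where rows[i][j] is 1 if term j occurs in docs[i]."""
    tokenised = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # Column indices are assigned in sorted order of the unique terms
    terms = sorted({tok for toks in tokenised for tok in toks})
    vocabulary = {term: idx for idx, term in enumerate(terms)}
    rows = []
    for toks in tokenised:
        row = [0] * len(vocabulary)
        for tok in toks:
            row[vocabulary[tok]] = 1  # presence only; a repeated term still gives 1
        rows.append(row)
    return vocabulary, rows

vocab, rows = binary_bow(["modi was a chief minister", "prime minister modi modi"])
print(vocab)  # {'chief': 0, 'minister': 1, 'modi': 2, 'prime': 3, 'was': 4}
print(rows)   # [[1, 1, 1, 0, 1], [0, 1, 1, 1, 0]]
```

With a plain count bag-of-words the second row would hold 2 in the 'modi' column instead of 1.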

In [14]:
print(sentences)

['\nNarendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] (listen); born 17 September 1950)[a] is an Indian politician serving as the 14th and current prime minister of India since 2014.', 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament from Varanasi.', 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation.', "He is the first prime minister to have been born after India's independence in 1947 and the second prime minister not belonging to the Indian National Congress to have won two consecutive majorities in the Lok Sabha, or the lower house of India's parliament.", 'He is also the longest serving prime minister from a non-Congress party.', 'Born and raised in Vadnagar, a small town in northeastern Gujarat, Modi completed his secondary education there.', 'He was introduced to the RSS at age eight.', "He has drawn attention

In [12]:
# =============================
# Apply Stopwords and Lemmatize
# =============================
stop_words = set(stopwords.words('english'))  # build the set once
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)
display(corpus[0])

'narendra damodardas modi gujarati n end mod modi listen born september indian politician serving th current prime minister india since'

In [13]:
print(corpus)

['narendra damodardas modi gujarati n end mod modi listen born september indian politician serving th current prime minister india since', 'modi chief minister gujarat member parliament varanasi', 'member bharatiya janata party bjp rashtriya swayamsevak sangh r right wing hindu nationalist paramilitary volunteer organisation', 'first prime minister born india independence second prime minister belonging indian national congress two consecutive majority lok sabha lower house india parliament', 'also longest serving prime minister non congress party', 'born raised vadnagar small town northeastern gujarat modi completed secondary education', 'introduced r age eight', 'drawn attention work child father tea stall vadnagar railway station platform description reliably corroborated', 'age modi married jashodaben chimanlal modi abandoned soon', 'left parental home come live', 'first publicly acknowledged wife four decade later required indian law made contact since', 'modi asserted travelled n

In [16]:
# ================================
# Using the CountVectorizer
# ================================

cv = CountVectorizer(binary=True)   # For binary bag-of-words
X = cv.fit_transform(corpus)        # Fitting
X_array = X.toarray()               # Converting X to array
print(X_array)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [17]:
# For only Trigrams
# =================
cv_tri = CountVectorizer(binary=True, ngram_range=(3,3)) # Binary bag-of-words over trigrams only
X_tri = cv_tri.fit_transform(corpus)
cv_tri.vocabulary_

{'narendra damodardas modi': 195,
 'damodardas modi gujarati': 54,
 'modi gujarati end': 187,
 'gujarati end mod': 108,
 'end mod modi': 75,
 'mod modi listen': 178,
 'modi listen born': 189,
 'listen born september': 158,
 'born september indian': 29,
 'september indian politician': 262,
 'indian politician serving': 136,
 'politician serving th': 217,
 'serving th current': 264,
 'th current prime': 284,
 'current prime minister': 53,
 'prime minister india': 226,
 'minister india since': 175,
 'modi chief minister': 184,
 'chief minister gujarat': 35,
 'minister gujarat member': 174,
 'gujarat member parliament': 104,
 'member parliament varanasi': 169,
 'member bharatiya janata': 168,
 'bharatiya janata party': 23,
 'janata party bjp': 146,
 'party bjp rashtriya': 208,
 'bjp rashtriya swayamsevak': 26,
 'rashtriya swayamsevak sangh': 241,
 'swayamsevak sangh right': 279,
 'sangh right wing': 259,
 'right wing hindu': 252,
 'wing hindu nationalist': 303,
 'hindu nationalist paramili
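
Each key above is three consecutive tokens taken from a single entry of `corpus`, so n-grams never span two sentences. Single-letter tokens are dropped before the n-grams are formed, which is why the first sentence yields 'modi gujarati end' rather than 'modi gujarati n'. A small sketch under those assumptions; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
def word_ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams of length n_min..n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

sentence = "narendra damodardas modi gujarati n end mod modi"
# Mimic the default token filter: keep tokens with at least two characters
tokens = [tok for tok in sentence.split() if len(tok) >= 2]
print(word_ngrams(tokens, 3, 3))
# ['narendra damodardas modi', 'damodardas modi gujarati', 'modi gujarati end',
#  'gujarati end mod', 'end mod modi']
```

Calling `word_ngrams(tokens, 2, 3)` returns the bigrams followed by the trigrams, which corresponds to `ngram_range=(2,3)`.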

In [20]:
# For Bigrams and Trigrams
# ========================
cv_bi_tri = CountVectorizer(binary=True, ngram_range=(2,3)) # Binary bag-of-words over bigrams and trigrams
X_bi_tri = cv_bi_tri.fit_transform(corpus)
cv_bi_tri.vocabulary_

{'narendra damodardas': 398,
 'damodardas modi': 113,
 'modi gujarati': 381,
 'gujarati end': 222,
 'end mod': 157,
 'mod modi': 363,
 'modi listen': 385,
 'listen born': 325,
 'born september': 61,
 'september indian': 532,
 'indian politician': 280,
 'politician serving': 444,
 'serving th': 536,
 'th current': 577,
 'current prime': 111,
 'prime minister': 458,
 'minister india': 357,
 'india since': 269,
 'narendra damodardas modi': 399,
 'damodardas modi gujarati': 114,
 'modi gujarati end': 382,
 'gujarati end mod': 223,
 'end mod modi': 158,
 'mod modi listen': 364,
 'modi listen born': 386,
 'listen born september': 326,
 'born september indian': 62,
 'september indian politician': 533,
 'indian politician serving': 281,
 'politician serving th': 445,
 'serving th current': 537,
 'th current prime': 578,
 'current prime minister': 112,
 'prime minister india': 461,
 'minister india since': 358,
 'modi chief': 375,
 'chief minister': 71,
 'minister gujarat': 354,
 'gujarat membe

-----------------

## <font color = 'red'> TF-IDF (Term Frequency - Inverse Document Frequency) </font>

-----------------

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
# Defining TF-IDF vectorizer object
# ---------------------------------
cv = TfidfVectorizer(ngram_range=(3,3)) # Trigrams only; max_features=N can be added to cap the vocabulary
X_tfidf = cv.fit_transform(corpus)
display(X_tfidf[0].toarray())

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.24253563,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.24253563, 0.24253563,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  
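
The repeated value 0.24253563 is what L2 normalisation produces when a row has 17 non-zero entries of equal weight, since 1/sqrt(17) ≈ 0.2425. That is consistent with the 17 trigrams contributed by the first sentence, assuming none of them recurs elsewhere in the corpus. A sketch of the computation, assuming `TfidfVectorizer`'s default settings (raw counts for tf, smoothed idf = ln((1 + n) / (1 + df)) + 1, L2 row normalisation); `tfidf_row` is a hypothetical helper and the documents are toy data.

```python
import math
from collections import Counter

def tfidf_row(doc_terms, all_docs_terms):
    """TF-IDF weights for one document: raw tf * smoothed idf, then L2-normalised."""
    n_docs = len(all_docs_terms)
    # Document frequency: the number of documents each term occurs in
    df = Counter(term for terms in all_docs_terms for term in set(terms))
    tf = Counter(doc_terms)
    raw = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# Toy documents: 17 terms unique to the first one, mirroring the 17 trigrams of sentence 0
docs = [[f"gram{i}" for i in range(17)], ["other", "terms"]]
weights = tfidf_row(docs[0], docs)
print(round(weights["gram0"], 8))  # 0.24253563
```

Because every term in the row shares the same tf and idf, the idf factor cancels during normalisation and only the number of non-zero entries determines the value.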

In [25]:
# Defining TF-IDF vectorizer object
# ---------------------------------
cv = TfidfVectorizer(ngram_range=(3,3), max_features=10) # Trigrams only, capped at the 10 most frequent across the corpus
X_tfidf = cv.fit_transform(corpus)
display(X_tfidf[0].toarray())  # all zeros when none of the 10 kept trigrams occurs in sentence 0

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])