## Bag of Words

In [2]:
paragraph = "Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] ⓘ; born 17 September 1950)[a] is an Indian politician serving as the current prime minister of India since 26 May 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.[4] Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998.[b] In 2001, Modi was appointed Chief Minister of Gujarat and elected to the legislative assembly soon after. His administration is considered complicit in the 2002 Gujarat riots,[c] and has been criticised for its management of the crisis. According to official records, a little over 1,000 people were killed, three-quarters of whom were Muslim; independent sources estimated 2,000 deaths, mostly Muslim.[13] A Special Investigation Team appointed by the Supreme Court of India in 2012 found no evidence to initiate prosecution proceedings against him.[d] While his policies as chief minister were credited for encouraging economic growth, his administration was criticised for failing to significantly improve health, poverty and education indices in the state.[e]"

In [3]:
# Step 1: sentence tokenization
# sent_tokenize needs the Punkt models: nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab')
import nltk
sentences = nltk.sent_tokenize(paragraph)

In [4]:
sentences

['Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] ⓘ; born 17 September 1950)[a] is an Indian politician serving as the current prime minister of India since 26 May 2014.',
 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi.',
 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation.',
 'He is the longest-serving prime minister outside the Indian National Congress.',
 '[4] Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education.',
 'He was introduced to the RSS at the age of eight.',
 'At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so.',
 'Modi became a full-time worker for the RSS in Gujarat in 1971.',
 'The RSS assigned him to the BJP in 198

In [5]:
# Step 2: clean the corpus. Replace every character except ASCII letters with a space
# and lowercase the text, so the vocabulary holds one entry per unique word.
import re
corpus = []
for sentence in sentences:
    temp = re.sub('[^a-zA-Z]', ' ', sentence)
    temp = temp.lower()
    corpus.append(temp)
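
Because every digit, punctuation mark, and non-ASCII letter becomes a space, the cleaned sentences contain long runs of spaces. The sketch below applies the same pattern and then collapses the whitespace. The helper name `clean_text` and the sample sentence are illustrative and not part of the notebook.

```python
import re

def clean_text(text):
    """Keep ASCII letters only, collapse repeated spaces, and lowercase."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)  # same pattern as the cell above
    return re.sub(r'\s+', ' ', letters_only).strip().lower()

sample = "He was introduced to the RSS at the age of 18.[4]"
cleaned = clean_text(sample)
print(cleaned)
```

The later word tokenization splits on whitespace anyway, so collapsing the spaces mainly makes the intermediate `corpus` easier to read.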

In [6]:
corpus

['narendra damodardas modi  gujarati    n  end   d mod   d s  modi      born    september       a  is an indian politician serving as the current prime minister of india since    may      ',
 'modi was the chief minister of gujarat from      to      and is the member of parliament  mp  for varanasi ',
 'he is a member of the bharatiya janata party  bjp  and of the rashtriya swayamsevak sangh  rss   a right wing hindu nationalist paramilitary volunteer organisation ',
 'he is the longest serving prime minister outside the indian national congress ',
 '    modi was born and raised in vadnagar in northeastern gujarat  where he completed his secondary education ',
 'he was introduced to the rss at the age of eight ',
 'at the age of     he was married to jashodaben modi  whom he abandoned soon after  only publicly acknowledging her four decades later when legally required to do so ',
 'modi became a full time worker for the rss in gujarat in      ',
 'the rss assigned him to the bjp in    

In [14]:
# Step 3: remove stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
new_corpus = []
for doc in corpus:
    # word tokenization
    words = nltk.word_tokenize(doc)
    temp = []
    for word in words:
        if word not in stop_words:
            temp.append(word)
    new_corpus.append(' '.join(temp))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/krishangopal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
print(corpus[0])
print(new_corpus[0])

narendra damodardas modi  gujarati    n  end   d mod   d s  modi      born    september       a  is an indian politician serving as the current prime minister of india since    may      
narendra damodardas modi gujarati n end mod modi born september indian politician serving current prime minister india since may


In [16]:
# BOW represents each document as a vector of word counts over the corpus vocabulary.
# scikit-learn implements it as CountVectorizer in sklearn.feature_extraction.text.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
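
To make the idea concrete, here is a minimal pure-Python sketch of the counting that a bag-of-words vectorizer performs. It is not scikit-learn's implementation: the function name `bag_of_words`, the whitespace tokenization, and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    # Sorted vocabulary: each unique word gets a fixed column index.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word, count in Counter(tokens).items():
            row[index[word]] = count  # how often the word occurs in this document
        matrix.append(row)
    return vocabulary, matrix

toy_docs = ["modi born gujarat", "modi chief minister gujarat modi"]
vocabulary, matrix = bag_of_words(toy_docs)
print(vocabulary)
print(matrix)
```

One difference worth knowing: with its default token pattern, `CountVectorizer` only keeps tokens of two or more characters, which is why the stray single letter `n` in the first cleaned sentence does not show up in the vocabulary.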

In [18]:
# Apply BOW to the entire corpus: learn the vocabulary and build the document-term matrix
bow = cv.fit_transform(new_corpus)

In [20]:
dir(bow)

['A',
 'H',
 'T',
 '__abs__',
 '__add__',
 '__array_priority__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__idiv__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__pow__',
 '__radd__',
 '__rdiv__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmatmul__',
 '__rmul__',
 '__round__',
 '__rsub__',
 '__rtruediv__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__weakref__',
 '_add_dense',
 '_add_sparse',
 '_arg_min_or_max',
 '_arg_min_or_max_axis',
 '_ascontainer',
 '_asfptype',
 '_asindices',
 '_binopt',
 '_bsr_container',
 '_container',
 '_coo_container',
 '_cs_matrix__ge

In [22]:
# To see the features (vocabulary terms, in column order), query the fitted CountVectorizer
cv.get_feature_names_out()

array(['abandoned', 'according', 'acknowledging', 'administration', 'age',
       'appointed', 'assembly', 'assigned', 'became', 'becoming',
       'bharatiya', 'bjp', 'born', 'chief', 'completed', 'complicit',
       'congress', 'considered', 'court', 'credited', 'crisis',
       'criticised', 'current', 'damodardas', 'deaths', 'decades',
       'economic', 'education', 'eight', 'elected', 'encouraging', 'end',
       'estimated', 'evidence', 'failing', 'found', 'four', 'full',
       'general', 'growth', 'gujarat', 'gujarati', 'health', 'hierarchy',
       'hindu', 'improve', 'independent', 'india', 'indian', 'indices',
       'initiate', 'introduced', 'investigation', 'janata', 'jashodaben',
       'killed', 'later', 'legally', 'legislative', 'little', 'longest',
       'management', 'married', 'may', 'member', 'minister', 'mod',
       'modi', 'mostly', 'mp', 'muslim', 'narendra', 'national',
       'nationalist', 'northeastern', 'official', 'organisation',
       'outside', 'param

In [23]:
bow


<15x119 sparse matrix of type '<class 'numpy.int64'>'
	with 151 stored elements in Compressed Sparse Row format>
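
The repr says that only 151 of the matrix cells hold a non-zero count, which is why the result is stored in sparse form. A quick arithmetic check, using only the numbers printed above:

```python
n_docs, n_terms = 15, 119   # shape reported in the repr above
stored = 151                # non-zero counts actually stored

total_cells = n_docs * n_terms
density = stored / total_cells
print(total_cells)
print(f"{density:.1%}")
```

Fewer than one cell in ten is non-zero, so a dense array would spend most of its memory on zeros.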

In [29]:
# The BOW matrix has one row per document and one column per vocabulary term; print the row for the first document
bow[0].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
        1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0]])
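
A dense row of counts is hard to read on its own. The sketch below maps a row back to its terms using a term-to-column-index dictionary of the same form as `cv.vocabulary_`. The helper name `nonzero_terms` and the toy vocabulary and row are illustrative assumptions, not values from the notebook.

```python
def nonzero_terms(row, vocabulary):
    """Map a dense count row back to {term: count} for the non-zero columns."""
    index_to_term = {idx: term for term, idx in vocabulary.items()}
    return {index_to_term[i]: count for i, count in enumerate(row) if count > 0}

# Toy vocabulary in the same term -> column-index form as cv.vocabulary_
toy_vocabulary = {'born': 0, 'gujarat': 1, 'minister': 2, 'modi': 3}
toy_row = [1, 0, 1, 2]
decoded = nonzero_terms(toy_row, toy_vocabulary)
print(decoded)
```

With the notebook's objects, a call along the lines of `nonzero_terms(bow[0].toarray()[0], cv.vocabulary_)` should work the same way. In the row printed above, column 67 holds the only `2`, which matches `'modi': 67` in the vocabulary, because "modi" occurs twice in the first cleaned sentence.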

In [31]:
# To see the vocabulary (maps each term to its column index in the BOW matrix)
cv.vocabulary_
# e.g. 'narendra': 71 -> 71 is the column index of this word

{'narendra': 71,
 'damodardas': 23,
 'modi': 67,
 'gujarati': 41,
 'end': 31,
 'mod': 66,
 'born': 12,
 'september': 101,
 'indian': 48,
 'politician': 83,
 'serving': 102,
 'current': 22,
 'prime': 85,
 'minister': 65,
 'india': 47,
 'since': 104,
 'may': 63,
 'chief': 13,
 'gujarat': 40,
 'member': 64,
 'parliament': 79,
 'mp': 69,
 'varanasi': 115,
 'bharatiya': 10,
 'janata': 53,
 'party': 80,
 'bjp': 11,
 'rashtriya': 91,
 'swayamsevak': 110,
 'sangh': 98,
 'rss': 97,
 'right': 94,
 'wing': 117,
 'hindu': 44,
 'nationalist': 73,
 'paramilitary': 78,
 'volunteer': 116,
 'organisation': 76,
 'longest': 60,
 'outside': 77,
 'national': 72,
 'congress': 16,
 'raised': 90,
 'vadnagar': 114,
 'northeastern': 74,
 'completed': 14,
 'secondary': 99,
 'education': 27,
 'introduced': 51,
 'age': 4,
 'eight': 28,
 'married': 62,
 'jashodaben': 54,
 'abandoned': 0,
 'soon': 105,
 'publicly': 88,
 'acknowledging': 2,
 'four': 36,
 'decades': 25,
 'later': 56,
 'legally': 57,
 'required': 93,
 

## N-grams


In [40]:
# Consider all 1-word, 2-word, and 3-word n-grams
cv_ngrams = CountVectorizer(analyzer='word', ngram_range=(1, 3))
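
To see what `ngram_range=(1, 3)` asks for, here is a minimal pure-Python sketch that lists every contiguous run of 1 to 3 tokens. The helper name `word_ngrams` is an assumption for illustration and is not how scikit-learn implements it internally.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = "narendra damodardas modi".split()
grams = word_ngrams(tokens, 1, 3)
print(grams)
```

Three tokens yield three unigrams, two bigrams, and one trigram, so the vocabulary grows quickly as the upper bound of the range increases.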

In [41]:
bow_ng = cv_ngrams.fit_transform(new_corpus)

In [42]:
cv_ngrams.get_feature_names_out()

array(['abandoned', 'abandoned soon', 'abandoned soon publicly',
       'according', 'according official', 'according official records',
       'acknowledging', 'acknowledging four',
       'acknowledging four decades', 'administration',
       'administration considered', 'administration considered complicit',
       'administration criticised', 'administration criticised failing',
       'age', 'age eight', 'age married', 'age married jashodaben',
       'appointed', 'appointed chief', 'appointed chief minister',
       'appointed supreme', 'appointed supreme court', 'assembly',
       'assembly soon', 'assigned', 'assigned bjp', 'assigned bjp rose',
       'became', 'became full', 'became full time', 'becoming',
       'becoming general', 'becoming general secretary', 'bharatiya',
       'bharatiya janata', 'bharatiya janata party', 'bjp',
       'bjp rashtriya', 'bjp rashtriya swayamsevak', 'bjp rose',
       'bjp rose party', 'born', 'born raised', 'born raised vadnagar',
       '

In [43]:
cv_ngrams.vocabulary_

{'narendra': 240,
 'damodardas': 76,
 'modi': 219,
 'gujarati': 133,
 'end': 98,
 'mod': 216,
 'born': 42,
 'september': 327,
 'indian': 156,
 'politician': 275,
 'serving': 330,
 'current': 73,
 'prime': 281,
 'minister': 206,
 'india': 151,
 'since': 338,
 'may': 200,
 'narendra damodardas': 241,
 'damodardas modi': 77,
 'modi gujarati': 231,
 'gujarati end': 134,
 'end mod': 99,
 'mod modi': 217,
 'modi born': 226,
 'born september': 45,
 'september indian': 328,
 'indian politician': 159,
 'politician serving': 276,
 'serving current': 331,
 'current prime': 74,
 'prime minister': 282,
 'minister india': 212,
 'india since': 154,
 'since may': 339,
 'narendra damodardas modi': 242,
 'damodardas modi gujarati': 78,
 'modi gujarati end': 232,
 'gujarati end mod': 135,
 'end mod modi': 100,
 'mod modi born': 218,
 'modi born september': 228,
 'born september indian': 46,
 'september indian politician': 329,
 'indian politician serving': 160,
 'politician serving current': 277,
 'servi