<p style="font-family:Roboto; font-size: 28px; color: magenta"> Python for NLP: Creating Bag of Words Model from Scratch</p>

In [None]:
'''The bag of words model is one of the three most commonly used approaches for turning text into numeric vectors;
TF-IDF and Word2Vec are the other two.'''

In [1]:
'''
 Suppose we have a corpus with three sentences:

"I like to play football"
"Did you go outside to play tennis"
"John and I play tennis"
'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part 1: Tokenize the Sentences</p>

In [None]:
'''
The first step is to split each sentence in our corpus into tokens, i.e. individual words:

Sentence 1    Sentence 2    Sentence 3
I             Did           John
like          you           and
to            go            I
play          outside       play
football      to            tennis
              play
              tennis
'''
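The tokenization step above can be sketched in a few lines. This is a minimal illustration that assumes plain whitespace splitting is good enough, which holds here only because the three example sentences contain no punctuation; the name `toy_corpus` is introduced for this example.

```python
# The three example sentences from the text above
toy_corpus = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Assumption: splitting on whitespace is sufficient for this tiny corpus
tokenized = [sentence.split() for sentence in toy_corpus]

for tokens in tokenized:
    print(tokens)
```

For real text with punctuation, a proper tokenizer such as `nltk.word_tokenize` (used later in this notebook) is the safer choice.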

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part 2: Create a Dictionary of Word Frequency</p>

In [None]:
'''
The next step is to build a dictionary whose keys are all the words in our corpus
and whose values are the number of times each word occurs:

Word        Frequency
I           2
like        1
to          2
play        3
football    1
Did         1
you         1
go          1
outside     1
tennis      2
John        1
and         1

Sorting the word frequency dictionary in descending order of frequency gives:

Word        Frequency
play        3
tennis      2
to          2
I           2
football    1
Did         1
you         1
go          1
outside     1
like        1
John        1
and         1
'''
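As a rough sketch, the frequency table for the three example sentences can be reproduced with `collections.Counter` from the standard library. The names `toy_corpus` and `word_counts` are introduced here for illustration, and whitespace splitting is assumed to be adequate for this small corpus.

```python
from collections import Counter

toy_corpus = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Count every token across all sentences
word_counts = Counter(
    token for sentence in toy_corpus for token in sentence.split()
)

# most_common() returns (word, count) pairs sorted by descending count
sorted_counts = word_counts.most_common()

for word, count in sorted_counts:
    print(f"{word:<10}{count}")
```

Words with the same frequency may come out in a different order than in the table above; only the counts themselves are significant.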

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part 3: Creating the Bag of Words Model</p>

In [None]:
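'''To create the bag of words model, we need to create a matrix with one row per sentence and one column
per word in the vocabulary. An entry is 1 if the word occurs in the sentence and 0 otherwise.'''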
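To make the matrix concrete, here is a small sketch for the three example sentences. It assumes, purely for illustration, that the vocabulary is limited to the four most frequent words from the sorted table (`play`, `tennis`, `to`, `I`); the names `toy_corpus`, `vocabulary` and `bow_matrix` are not part of the original notebook.

```python
toy_corpus = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Assumption: keep only the four most frequent words as matrix columns
vocabulary = ["play", "tennis", "to", "I"]

bow_matrix = []
for sentence in toy_corpus:
    tokens = set(sentence.split())
    # 1 if the vocabulary word occurs in the sentence, 0 otherwise
    bow_matrix.append([1 if word in tokens else 0 for word in vocabulary])

for sentence, row in zip(toy_corpus, bow_matrix):
    print(row, sentence)
```

Each row is the vector for one sentence and each column corresponds to one vocabulary word, which is the same layout the full model below produces.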

<p style="font-family:Roboto; font-size: 28px; color: magenta"> Bag of Words Model in Python</p>

In [2]:
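import nltk
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re

# nltk.sent_tokenize and nltk.word_tokenize need the Punkt tokenizer data.
# If it is missing, download it once, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).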

In [None]:
'''In the first step, we will scrape the Wikipedia article on Natural Language Processing'''
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')  
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')
# Keep only the text inside the <p> (paragraph) tags
article_paragraphs = article_html.find_all('p')

# Concatenate the text of all paragraphs into a single string
article_text = ''.join(para.text for para in article_paragraphs)

In [4]:
'''The next step is to split the corpus into individual sentences'''
corpus = nltk.sent_tokenize(article_text)

In [5]:
for i in range(len(corpus)):
    # Convert the sentence to lower case
    corpus[i] = corpus[i].lower()
    # Replace every non-word character (punctuation, symbols) with a space
    corpus[i] = re.sub(r'\W', ' ', corpus[i])
    # Collapse each run of whitespace into a single space
    corpus[i] = re.sub(r'\s+', ' ', corpus[i])
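To see what these three steps do to a single string, the sketch below wraps them in a helper and applies it to a made-up sentence. The function name `clean_sentence` and the sample text are assumptions for illustration only.

```python
import re

def clean_sentence(sentence):
    """Apply the same three steps as the loop above to one sentence."""
    sentence = sentence.lower()               # lower-case everything
    sentence = re.sub(r'\W', ' ', sentence)   # non-word characters -> space
    sentence = re.sub(r'\s+', ' ', sentence)  # collapse whitespace runs
    return sentence

sample = "Hello, World! NLP  is fun."
cleaned = clean_sentence(sample)
print(repr(cleaned))
```

Note that the final period becomes a space, so the cleaned string keeps one trailing space; calling `.strip()` on the result would remove it. Underscores and digits count as word characters for `\W` and are therefore left in place.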

In [6]:
print(len(corpus))

34


In [None]:
# Let's print one sentence from our corpus: 
print(corpus[30])

as an example george lakoff offers a methodology to build natural language processing nlp algorithms through the perspective of cognitive science along with the findings of cognitive linguistics 50 with two defining aspects ties with cognitive linguistics are part of the historical heritage of nlp but they have been less frequently addressed since the statistical turn during the 1990s 


In [8]:
'''The next step is to tokenize the sentences in the corpus and create a dictionary 
that contains words and their corresponding frequencies in the corpus'''
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        # Start from 0 for a word seen for the first time, then add 1
        wordfreq[token] = wordfreq.get(token, 0) + 1

In [9]:
'''Let us keep only the 200 most frequently occurring words. To do so, we can use the nlargest function from Python's heapq module.'''
import heapq
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)
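The call above passes the dictionary itself as the iterable, so `heapq.nlargest` iterates over its keys and ranks them by the value that `wordfreq.get` returns. A minimal sketch on a hand-written dictionary (the name `toy_wordfreq` and its contents are made up for this example):

```python
import heapq

# Hypothetical frequency dictionary, standing in for wordfreq
toy_wordfreq = {
    "play": 3,
    "tennis": 2,
    "to": 2,
    "i": 2,
    "football": 1,
    "like": 1,
}

# Iterating a dict yields its keys; key=toy_wordfreq.get ranks them by count
top_three = heapq.nlargest(3, toy_wordfreq, key=toy_wordfreq.get)
print(top_three)
```

The result is a list of words only, without their counts. When several words share the same frequency, which of them make the cut depends on how ties are broken, so it is safest not to rely on a particular order among equally frequent words.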

In [10]:
'''The final step is to convert each sentence in our corpus into a binary vector with one entry per word in most_freq: 1 if the word occurs in the sentence, 0 otherwise'''
sentence_vectors = []
for sentence in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)
    sent_vec = []
    for token in most_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)

In [12]:
'''Our model is currently a list of lists. We can convert it into matrix form (a NumPy array) as follows:'''
sentence_vectors = np.asarray(sentence_vectors)
sentence_vectors

array([[0, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])
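To read a row of this matrix, pair it with the vocabulary list that defined the columns. The sketch below does this for a hand-written vocabulary and row; `vocabulary` and `row` are hypothetical stand-ins for `most_freq` and one row of `sentence_vectors`.

```python
# Hypothetical stand-ins for most_freq and one row of sentence_vectors
vocabulary = ["play", "tennis", "to", "i"]
row = [1, 1, 0, 1]

# Keep the vocabulary words whose column is set to 1 in this row
present_words = [word for word, flag in zip(vocabulary, row) if flag == 1]
print(present_words)
```

Because the vectors only record presence or absence, the original word order and the number of repetitions within a sentence cannot be recovered from a row.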