# Bag of Words Model in Python

The first thing we need to create our Bag of Words model is a dataset.real-world datasets are huge with millions of words. The best way to find a random corpus is Wikipedia.

In the first step, we will scrape the Wikipedia article on Natural Language Processing. But first, let's import the required libraries:

In [23]:
import nltk
nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\patel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We are going to use BeautifulSoup library to scrape data from wikipedia.

Next, we need to scrape the Wikipedia article on natural language processing.


In [3]:
html_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')

html_data = html_data.read()

article = bs.BeautifulSoup(html_data, 'lxml')

article_paragraphs = article.find_all('p')

article_text = ''

for paragraph in article_paragraphs:
    article_text += paragraph.text

print(article_text)

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence[clarification needed].
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved p

In the script above, we import the HTML data for the Wikipedia article. From the HTML data, we filter the text within the paragraph text. Finally, we create a complete corpus by concatenating all the paragraphs.

The next step is to split the corpus into individual sentences. To do so, we will use the `sent_tokenize` function from the NLTK library.

In [12]:
corpus = nltk.sent_tokenize(article_text)

Our text contains punctuations. We don't want punctuations to be the part of our word frequency dictionary. In the following script, we first convert our text into lower case and then will remove the punctuation from our text. Removing punctuation can result in multiple empty spaces. We will remove the empty spaces from the text using regex.

In [13]:
for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'\W',' ',corpus[i])
    corpus[i] = re.sub(r'\s+',' ',corpus[i])

In [14]:
print(len(corpus))

47


Let's print one sentence from our corpus:

In [15]:
print(corpus[30])

in the 2010s representation learning and deep neural network style machine learning methods became widespread in natural language processing due in part to a flurry of results showing that such techniques 4 5 can achieve state of the art results in many natural language tasks for example in language modeling 6 parsing 7 8 and many others 


You can see that the text doesn't contain any special character or multiple empty spaces.

Now we have our own corpus. The next step is to tokenize the sentences in the corpus and create a dictionary that contains words and their corresponding frequencies in the corpus. Look at the following script:

In [17]:
wordfreq = {}

for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    
    for token in tokens:
        if token not in wordfreq.keys():
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1
            
for k,v in wordfreq.items():
    print(f'{k} -> {v}')

natural -> 19
language -> 24
processing -> 16
nlp -> 3
is -> 9
a -> 30
subfield -> 1
of -> 56
linguistics -> 3
computer -> 3
science -> 1
information -> 3
engineering -> 1
and -> 24
artificial -> 2
intelligence -> 4
concerned -> 1
with -> 6
the -> 61
interactions -> 1
between -> 2
computers -> 2
human -> 4
languages -> 2
in -> 28
particular -> 1
how -> 2
to -> 25
program -> 1
process -> 1
analyze -> 1
large -> 3
amounts -> 2
data -> 10
challenges -> 1
frequently -> 2
involve -> 1
speech -> 4
recognition -> 2
understanding -> 1
generation -> 1
history -> 1
generally -> 3
started -> 1
1950s -> 1
although -> 1
work -> 3
can -> 5
be -> 5
found -> 2
from -> 5
earlier -> 1
periods -> 1
1950 -> 1
alan -> 1
turing -> 2
published -> 3
an -> 4
article -> 1
titled -> 1
computing -> 1
machinery -> 1
which -> 8
proposed -> 1
what -> 1
now -> 2
called -> 2
test -> 1
as -> 13
criterion -> 1
clarification -> 1
needed -> 1
georgetown -> 1
experiment -> 1
1954 -> 1
involved -> 1
fully -> 1
automatic -> 

In the script above we created a dictionary called wordfreq. Next, we iterate through each sentence in the corpus. The sentence is tokenized into words. Next, we iterate through each word in the sentence. If the word doesn't exist in the wordfreq dictionary, we will add the word as the key and will set the value of the word as 1. Otherwise, if the word already exists in the dictionary, we will simply increment the key count by 1.

In [18]:
print(len(wordfreq))

536


depending upon the task at hand, not all of the words are useful. In huge corpora, you can have millions of words. We can filter the most frequently occurring words. Our corpus has 535 words in total. Let us filter down to the 200 most frequently occurring words. To do so, we can make use of Python's heap library.

Look at the following script:


In [19]:
import heapq

most_freq = heapq.nlargest(200, wordfreq, key = wordfreq.get)

print(len(most_freq))

200


Now our most_freq list contains 200 most frequently occurring words along with their frequency of occurrence.

The final step is to convert the sentences in our corpus into their corresponding vector representation. The idea is straightforward, for each word in the most_freq dictionary if the word exists in the sentence, a 1 will be added for the word, else 0 will be added.

In [20]:
sentence_vectors = []

for sentence_vector in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)
    
    sent_vec = []
    
    for token in most_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)

In [21]:
for sentence_vector in sentence_vectors:
    print(sentence_vector)

[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In the script above we create an empty list `sentence_vectors` which will store vectors for all the sentences in the corpus. Next, we iterate through each sentence in the corpus and create an empty list `sent_vec` for the individual sentences. Similarly, we also tokenize the sentence. Next, we iterate through each word in the `most_freq` list and check if the word exists in the tokens for the sentence. If the word is a part of the sentence, 1 is appended to the individual sentence vector `sent_vec`, else 0 is appended. Finally, the sentence vector is added to the list sentence_vectors which contains vectors for all the sentences. Basically, this `sentence_vectors` is our bag of words model.

However, the bag of words model that we saw in the theory section was in the form of a matrix. Our model is in the form of a list of lists. We can convert our model into matrix form using this script:

In [24]:
sentence_vectors = np.asarray(sentence_vectors)

In [25]:
sentence_vectors

array([[1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       ...,
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0]])

You can see the Bag of Words model containing 0 and 1.