### The purpose of this notebook is to demonstrate the Bag of Words algorithm. We are going to first import our libraries.

In [8]:
import nltk #https://www.nltk.org/
nltk.download('punkt')
import numpy as np #https://numpy.org/
import re
import itertools

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jenniferhajduk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
corpus = """Once upon a time in a quiet little town, there lived a small dog named Bella. Bella had always been curious about her mother, whom she had never met. 
Every day, she would wander around, sniffing for any scent that might lead her to her long-lost mother. 
She had heard tales of a magical forest where dogs could find their loved ones, and so, with hope in her heart, Bella embarked on a journey to find her mother.
As Bella ventured deeper into the forest, the tall trees whispered ancient secrets. The air was thick with the smell of moss and damp earth. She trotted along a narrow path, her paws sinking into the soft ground. 
Suddenly, Bella's ears perked up at the sound of a familiar bark. Could it be? She followed the sound, her tail wagging in excitement.
To her surprise, Bella found herself standing in front of a humble cottage nestled among the trees. 
A kind old woman with gentle eyes stood on the porch, watching Bella approach. The woman smiled warmly and beckoned the little dog closer. 
Bella's heart skipped a beat; something about the woman felt strangely familiar.
As Bella drew nearer, she caught a whiff of her mother's scent, mingled with the sweet aroma of freshly baked cookies. 
The old woman bent down and stroked Bella's head, murmuring words of comfort. It was then that Bella knew, deep in her soul, that she had finally found her mother.
Inside the cozy cottage, Bella's mother greeted her with open paws. 
They spent the day snuggled up together, sharing stories and catching up on all the missed years. Bella's heart swelled with joy, grateful for this unexpected reunion. 
She had searched far and wide, but it was in this unlikely place that she had found her mother's love.
As the sun began to set, painting the sky in hues of pink and orange, Bella and her mother sat side by side, watching the world go by. 
They knew they had a lot of catching up to do, and from that day forward, Bella and her mother were inseparable. 
The magical forest had brought them together, reminding them that love has a way of finding its own path, even in the most unlikely of places."""

In [10]:
print(corpus)

Once upon a time in a quiet little town, there lived a small dog named Bella. Bella had always been curious about her mother, whom she had never met. 
Every day, she would wander around, sniffing for any scent that might lead her to her long-lost mother. 
She had heard tales of a magical forest where dogs could find their loved ones, and so, with hope in her heart, Bella embarked on a journey to find her mother.
As Bella ventured deeper into the forest, the tall trees whispered ancient secrets. The air was thick with the smell of moss and damp earth. She trotted along a narrow path, her paws sinking into the soft ground. 
Suddenly, Bella's ears perked up at the sound of a familiar bark. Could it be? She followed the sound, her tail wagging in excitement.
To her surprise, Bella found herself standing in front of a humble cottage nestled among the trees. 
A kind old woman with gentle eyes stood on the porch, watching Bella approach. The woman smiled warmly and beckoned the little dog clo

### Let's tokenize our corpus into sentences and see what that looks like

In [11]:
sentences = nltk.sent_tokenize(corpus)
print(sentences)

['Once upon a time in a quiet little town, there lived a small dog named Bella.', 'Bella had always been curious about her mother, whom she had never met.', 'Every day, she would wander around, sniffing for any scent that might lead her to her long-lost mother.', 'She had heard tales of a magical forest where dogs could find their loved ones, and so, with hope in her heart, Bella embarked on a journey to find her mother.', 'As Bella ventured deeper into the forest, the tall trees whispered ancient secrets.', 'The air was thick with the smell of moss and damp earth.', 'She trotted along a narrow path, her paws sinking into the soft ground.', "Suddenly, Bella's ears perked up at the sound of a familiar bark.", 'Could it be?', 'She followed the sound, her tail wagging in excitement.', 'To her surprise, Bella found herself standing in front of a humble cottage nestled among the trees.', 'A kind old woman with gentle eyes stood on the porch, watching Bella approach.', 'The woman smiled warm

### We have a list of sentences. Each item in the list is a string containing one sentence from our corpus. Next we strip the punctuation from each sentence.

In [12]:
# Strip punctuation: remove every character that is not a word character or whitespace
sentence_tokens = [re.sub(r'[^\w\s]', '', sentence) for sentence in sentences]
print(sentence_tokens)

['Once upon a time in a quiet little town there lived a small dog named Bella', 'Bella had always been curious about her mother whom she had never met', 'Every day she would wander around sniffing for any scent that might lead her to her longlost mother', 'She had heard tales of a magical forest where dogs could find their loved ones and so with hope in her heart Bella embarked on a journey to find her mother', 'As Bella ventured deeper into the forest the tall trees whispered ancient secrets', 'The air was thick with the smell of moss and damp earth', 'She trotted along a narrow path her paws sinking into the soft ground', 'Suddenly Bellas ears perked up at the sound of a familiar bark', 'Could it be', 'She followed the sound her tail wagging in excitement', 'To her surprise Bella found herself standing in front of a humble cottage nestled among the trees', 'A kind old woman with gentle eyes stood on the porch watching Bella approach', 'The woman smiled warmly and beckoned the little 
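
### To see what the substitution does, here is a minimal sketch on a made-up sentence (the sample string and the helper name `strip_punctuation` are illustrations, not part of the notebook).
### The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace, so apostrophes and hyphens are removed too. This is why "Bella's" becomes "Bellas" and "long-lost" becomes "longlost" in the output above.

```python
import re

def strip_punctuation(text):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r'[^\w\s]', '', text)

# Hypothetical sample sentence, chosen to include an apostrophe and a hyphen
sample = "Bella's long-lost mother smiled; could it be?"
cleaned = strip_punctuation(sample)
print(cleaned)
```

### Underscores count as word characters, so they survive this cleaning step.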

#### We have broken our corpus into sentences, and now we need to break those sentences into words. We will then count the occurrences of each word and store the counts in a dictionary, e.g. {word: word_count, ...}

#### Tokenizing each sentence gives us a list of lists, so let's flatten it into a single list of all of the words

In [13]:
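# Split each cleaned sentence into words, then flatten the per-sentence lists into one list
lists_of_words = [nltk.word_tokenize(sentence_token) for sentence_token in sentence_tokens]
words = list(itertools.chain.from_iterable(lists_of_words))
print(words)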


['Once', 'upon', 'a', 'time', 'in', 'a', 'quiet', 'little', 'town', 'there', 'lived', 'a', 'small', 'dog', 'named', 'Bella', 'Bella', 'had', 'always', 'been', 'curious', 'about', 'her', 'mother', 'whom', 'she', 'had', 'never', 'met', 'Every', 'day', 'she', 'would', 'wander', 'around', 'sniffing', 'for', 'any', 'scent', 'that', 'might', 'lead', 'her', 'to', 'her', 'longlost', 'mother', 'She', 'had', 'heard', 'tales', 'of', 'a', 'magical', 'forest', 'where', 'dogs', 'could', 'find', 'their', 'loved', 'ones', 'and', 'so', 'with', 'hope', 'in', 'her', 'heart', 'Bella', 'embarked', 'on', 'a', 'journey', 'to', 'find', 'her', 'mother', 'As', 'Bella', 'ventured', 'deeper', 'into', 'the', 'forest', 'the', 'tall', 'trees', 'whispered', 'ancient', 'secrets', 'The', 'air', 'was', 'thick', 'with', 'the', 'smell', 'of', 'moss', 'and', 'damp', 'earth', 'She', 'trotted', 'along', 'a', 'narrow', 'path', 'her', 'paws', 'sinking', 'into', 'the', 'soft', 'ground', 'Suddenly', 'Bellas', 'ears', 'perked', '

### Now let's create our dictionary with a word count for each word

In [14]:
# Count how many times each word occurs in the corpus
vocabulary = {}
for word in words:
    vocabulary[word] = vocabulary.get(word, 0) + 1
print(vocabulary)

{'Once': 1, 'upon': 1, 'a': 12, 'time': 1, 'in': 8, 'quiet': 1, 'little': 2, 'town': 1, 'there': 1, 'lived': 1, 'small': 1, 'dog': 2, 'named': 1, 'Bella': 10, 'had': 8, 'always': 1, 'been': 1, 'curious': 1, 'about': 2, 'her': 15, 'mother': 7, 'whom': 1, 'she': 5, 'never': 1, 'met': 1, 'Every': 1, 'day': 3, 'would': 1, 'wander': 1, 'around': 1, 'sniffing': 1, 'for': 2, 'any': 1, 'scent': 2, 'that': 6, 'might': 1, 'lead': 1, 'to': 4, 'longlost': 1, 'She': 4, 'heard': 1, 'tales': 1, 'of': 11, 'magical': 2, 'forest': 3, 'where': 1, 'dogs': 1, 'could': 1, 'find': 2, 'their': 1, 'loved': 1, 'ones': 1, 'and': 10, 'so': 1, 'with': 6, 'hope': 1, 'heart': 3, 'embarked': 1, 'on': 3, 'journey': 1, 'As': 3, 'ventured': 1, 'deeper': 1, 'into': 2, 'the': 18, 'tall': 1, 'trees': 2, 'whispered': 1, 'ancient': 1, 'secrets': 1, 'The': 4, 'air': 1, 'was': 3, 'thick': 1, 'smell': 1, 'moss': 1, 'damp': 1, 'earth': 1, 'trotted': 1, 'along': 1, 'narrow': 1, 'path': 2, 'paws': 2, 'sinking': 1, 'soft': 1, 'grou
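
### The same kind of tally can be produced with `collections.Counter` from the Python standard library. This is a minimal sketch on a short hypothetical word list (`sample_words` is made up for illustration and stands in for the flattened `words` list).

```python
from collections import Counter

# Hypothetical word list standing in for the flattened `words` list
sample_words = ['She', 'had', 'found', 'her', 'mother', 'she', 'had', 'her']

# Counter tallies each distinct item; dict() turns the result into a plain dictionary
sample_vocabulary = dict(Counter(sample_words))
print(sample_vocabulary)
```

### Like the loop above, this count is case-sensitive, so 'She' and 'she' are separate entries. Lowercasing the words first would merge them.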

### This dictionary is called our vocabulary. The vocabulary for our Bag of Words model is composed of each unique (case-sensitive) word in our document. 
### The count of each unique word is the "score" of that word. 
### Now that we have the vocabulary for our Bag of Words model, we need to create a vector for each of the sentences. First we get the length of the vocabulary dictionary above: 

In [15]:
vec_length = len(vocabulary)
print(vec_length)

213



### Here is the first sentence in our corpus: "Once upon a time in a quiet little town there lived a small dog named Bella".
### Here is the beginning of our vocabulary dictionary: {'Once': 1, 'upon': 1, 'a': 12, 'time': 1, 'in': 8, 'quiet': 1, 'little': 2, 'town': 1, 'there': 1, 'lived': 1, 'small': 1, 'dog': 2, 'named': 1, 'Bella': 10...}
### There will be one vector of length 213 for each sentence in our corpus. Each of those 213 positions corresponds to one word in our dictionary, in the order the words were added.
### The first position in each vector corresponds to the word "Once", the second position corresponds to the word "upon", and so on.
### If a word is present in a sentence, we denote this by setting that word's position in the sentence's vector to 1.
### Because the sentences are fairly short, the vectors are relatively long, and not every word appears in every sentence, each vector will contain a lot of zeros. This is referred to as a sparse vector.
### Imagine creating these vectors for millions of sentences and billions of words...
### Let's create the vectors


In [16]:
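# Build the word list once; a word's index in it is its position in every vector
keys_list = list(vocabulary.keys())
vector_list = []
for sentence_list in lists_of_words:
    vector = np.zeros(vec_length)
    for word in sentence_list:
        position = keys_list.index(word)
        vector[position] = 1  # mark the word as present (binary flag, not a count)
    print(vector)
    vector_list.append(vector)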


[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
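
### To make the mapping from words to vector positions concrete, here is a small sketch on a hypothetical three-sentence corpus (`toy_sentences`, `word_to_position` and `to_binary_vector` are names invented for this illustration).
### As an assumption for the sketch, it looks up each word's position in a dictionary instead of calling `list.index`, and it uses plain Python lists instead of NumPy arrays.

```python
# Hypothetical toy corpus, already tokenized into words
toy_sentences = [
    ['the', 'dog', 'barked'],
    ['the', 'woman', 'smiled'],
    ['the', 'dog', 'smiled'],
]

# Assign each unique word a position in order of first appearance
word_to_position = {}
for sentence in toy_sentences:
    for word in sentence:
        if word not in word_to_position:
            word_to_position[word] = len(word_to_position)

def to_binary_vector(sentence, word_to_position):
    """Return a list with 1 at the position of every word present in the sentence."""
    vector = [0] * len(word_to_position)
    for word in sentence:
        vector[word_to_position[word]] = 1
    return vector

toy_vectors = [to_binary_vector(s, word_to_position) for s in toy_sentences]
for vector in toy_vectors:
    print(vector)

# Fraction of entries that are zero, a rough measure of how sparse the vectors are
total_entries = sum(len(v) for v in toy_vectors)
zero_entries = sum(v.count(0) for v in toy_vectors)
sparsity = zero_entries / total_entries
print(sparsity)
```

### Even in this tiny example a good share of the entries are zero. With a 213-word vocabulary and short sentences, the share is far higher.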

### Now we have the basics of the Bag of Words model. We have extracted our vocabulary, scored each word using word frequency, and turned each sentence of our document into a vector that a computer can use to compute how similar documents are. 
### Let's take a look at existing toolsets that make this easier, so that we do not have to code this every time...

### Now we see the vectors that represent each sentence in our corpus. From here we can compute the similarity between our corpus and others. 
### We don't have another corpus here, but let's kill two birds with one stone: see how existing libraries can do the work above for us, and how we can compare different documents. Go to vectorizer.ipynb
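
### This notebook does not fix a similarity measure. Cosine similarity is one common choice and is used here as an assumption.
### The sketch below compares hypothetical binary vectors (`cosine_similarity` is a helper defined here, not a library call, and the vectors are made up for illustration).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical binary bag-of-words vectors over a five-word vocabulary
vector_a = [1, 1, 1, 0, 0]
vector_b = [1, 1, 0, 0, 1]
vector_c = [0, 0, 0, 1, 0]

print(cosine_similarity(vector_a, vector_b))  # two of the three words are shared
print(cosine_similarity(vector_a, vector_c))  # no words are shared
```

### Vectors that share more words score closer to 1, and vectors with no words in common score 0.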