### Bag of Words Lab

Bag of words (BoW) is an important technique in text mining and information retrieval. BoW represents the content of text documents as term-frequency vectors, which makes it possible to analyze and compare documents mathematically and programmatically.

BoW contains the following information:

- A dictionary of all the terms (words) in the text documents. The terms are normalized for letter case (e.g. Ironhack => ironhack), tense (e.g. had => have), number (e.g. students => student), etc.
- The number of occurrences of each normalized term in each document.

For example, assume we have three text documents:

DOC 1: Ironhack is cool.

DOC 2: I love Ironhack.

DOC 3: I am a student at Ironhack.

The BoW of the above documents looks like below:
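One way to reproduce that representation is the minimal sketch below. It assumes a simplified normalization (lower-casing and stripping the trailing period only, no tense or plural handling), and the names documents, vocabulary and vectors are illustrative rather than part of the lab.

```python
from collections import Counter

documents = [
    "Ironhack is cool.",
    "I love Ironhack.",
    "I am a student at Ironhack.",
]

# Simplified normalization (assumption): lower-case, drop the trailing period,
# then split on whitespace.
tokenized = [doc.lower().rstrip(".").split() for doc in documents]

# Dictionary of unique terms, in order of first appearance.
vocabulary = list(dict.fromkeys(term for doc in tokenized for term in doc))

# One term-frequency vector per document.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each row is one document and each column is one term of the dictionary, so the first row reads "ironhack: 1, is: 1, cool: 1, everything else: 0".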

In [1]:
docs = ['doc1.txt', 'doc2.txt', 'doc3.txt']

Define an empty array corpus that will contain the content strings of the docs. Loop through docs and read the content of each doc into the corpus array.

In [2]:
import pandas as pd
import os

corpus = []    

In [3]:
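# Read the first line of every .txt file in the current directory.
# Note: os.listdir returns entries in arbitrary order.
for filename in os.listdir('./'):
    if filename.endswith('.txt'):
        with open(filename, 'r') as f:
            content = f.readline()
        corpus.append(content)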

In [4]:
corpus

['Ironhack is cool.', 'I am a student at Ironhack.', 'I love Ironhack.']
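The documents above come back as doc1, doc3, doc2 because os.listdir makes no promise about ordering. If the order of docs matters, one option is to open the listed file names directly. The sketch below is only an illustration: it writes the three sample files into a temporary directory (an assumption made so the snippet runs anywhere) and uses the hypothetical name ordered_corpus to avoid clobbering corpus.

```python
import tempfile
from pathlib import Path

# Hypothetical setup: recreate the three lab files in a temporary directory.
sample_texts = {
    "doc1.txt": "Ironhack is cool.",
    "doc2.txt": "I love Ironhack.",
    "doc3.txt": "I am a student at Ironhack.",
}

docs = ["doc1.txt", "doc2.txt", "doc3.txt"]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name, text in sample_texts.items():
        (folder / name).write_text(text + "\n", encoding="utf-8")

    # Read the files in the order given by docs, stripping the trailing newline.
    ordered_corpus = [
        (folder / name).read_text(encoding="utf-8").strip() for name in docs
    ]

print(ordered_corpus)
```

With this approach the row order of any later term-frequency matrix follows docs rather than the file system.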

Write your code below to process corpus (convert to lower case and remove special characters).

In [5]:
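import re

# Lower-case each document and drop everything except letters, digits and whitespace.
clean_list = [re.sub(r'[^a-z0-9\s]', '', doc.lower()).strip() for doc in corpus]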
clean_list

['ironhack is cool', 'i am a student at ironhack', 'i love ironhack']
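Note that str.strip only removes characters from the two ends of a string, so punctuation in the middle of a sentence needs another approach. The following sketch contrasts the two on a made-up sentence; clean_text is a hypothetical helper name, and treating anything other than letters, digits and whitespace as a "special character" is an assumption.

```python
import re

def clean_text(text):
    """Lower-case text and drop everything except letters, digits and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

sample = "Ironhack, is cool!"

stripped_only = sample.lower().strip(".")  # the comma and '!' survive
fully_cleaned = clean_text(sample)

print(stripped_only)   # ironhack, is cool!
print(fully_cleaned)   # ironhack is cool
```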

Now define bag_of_words as an empty array. It will be used to store the unique terms in corpus.

In [6]:
bag_of_words = []

Loop through corpus. In each iteration, do the following:

1. Break the string into an array of terms.
2. Create a sub-loop to iterate over the terms array.
3. In each sub-loop iteration, check whether the current term is already in bag_of_words. If it is not, append it to the array.

In [7]:
for sentence in clean_list:
    terms = sentence.split(' ')

In [8]:
for sentence in clean_list:
    terms = sentence.split(' ')
    for word in terms:
        if word not in bag_of_words:
            bag_of_words.append(word)

bag_of_words

['ironhack', 'is', 'cool', 'i', 'am', 'a', 'student', 'at', 'love']
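The nested loop can also be written more compactly. As a sketch, dict.fromkeys keeps the first occurrence of each key and preserves insertion order (guaranteed since Python 3.7), so it deduplicates the terms without the explicit membership check. The names clean_docs and unique_terms are illustrative.

```python
# Cleaned documents, copied from the output above.
clean_docs = ['ironhack is cool', 'i am a student at ironhack', 'i love ironhack']

# Flatten all terms, then deduplicate while keeping first-seen order.
unique_terms = list(
    dict.fromkeys(term for doc in clean_docs for term in doc.split())
)

print(unique_terms)
```

This produces the same list as the loop version, and each membership check is a dictionary lookup instead of a scan through a list.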

Now we define an empty array called term_freq. Loop through corpus a second time. In each iteration, create a sub-loop over the terms in bag_of_words and count how many times each term appears in the current doc. Append the resulting term-frequency array to term_freq.

In [9]:
term_freq = []

for sentence in clean_list:
    terms = sentence.split(' ')
    # Count how often each vocabulary term occurs in this document.
    sent_vec = [terms.count(word) for word in bag_of_words]
    term_freq.append(sent_vec)

In [10]:
term_freq

[[1, 1, 1, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 1, 1, 1, 1, 1, 0],
 [1, 0, 0, 1, 0, 0, 0, 0, 1]]
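In these three documents every term appears at most once, so a presence flag (0 or 1) and a true count give identical vectors. The difference shows up as soon as a word repeats. The sketch below uses a made-up document and a shortened vocabulary (both assumptions) to compare the two.

```python
from collections import Counter

vocab = ['ironhack', 'is', 'cool']          # shortened vocabulary for illustration
doc = 'ironhack is cool ironhack'           # hypothetical document with a repeat

terms = doc.split()
counts = Counter(terms)

# Presence: 1 if the term occurs at all, 0 otherwise.
presence = [1 if term in terms else 0 for term in vocab]

# Frequency: how many times the term occurs.
frequency = [counts[term] for term in vocab]

print(presence)    # [1, 1, 1]
print(frequency)   # [2, 1, 1]
```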

In [11]:
## Result printed in Ironhack Lab

[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]

['ironhack is cool'], 

['i am a student at ironhack'], 

['i love ironhack'] 

['ironhack', 'is', 'cool', 'i', 'am', 'a', 'student', 'at', 'love']

['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']

### Bonus Question

Optimize your solution for the above question by removing stop words from the BoW. For your convenience, a list of stop words is defined for you in the next cell.

Requirements:

- Combine all your previous code in the cell below.
- Improve your solution by ignoring stop words in bag_of_words.

In [12]:
stop_words = ['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 
'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']

In [13]:
bow_with_no_stopwords = [word for word in bag_of_words if word not in stop_words]
bow_with_no_stopwords

['ironhack', 'cool', 'student', 'love']

In [14]:
term_freq_2 = []

for sentence in clean_list:
    terms = sentence.split(' ')
    # Count how often each non-stop-word term occurs in this document.
    sent_vec_2 = [terms.count(word) for word in bow_with_no_stopwords]
    term_freq_2.append(sent_vec_2)

In [15]:
term_freq_2

[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
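Since the bonus asks for the previous steps to be combined, here is one possible way to wrap them in a single function. This is a sketch, not the lab's reference solution: bag_of_words_model is a hypothetical name, the cleaning rule (keep only letters, digits and whitespace) is an assumption, and the stop-word list is shortened for readability.

```python
import re

def bag_of_words_model(documents, stop_words=()):
    """Return (vocabulary, term_freq) for a list of raw document strings."""
    stop_set = set(stop_words)  # set membership checks are fast

    # Clean and tokenize every document.
    tokenized = [
        re.sub(r"[^a-z0-9\s]", "", doc.lower()).split() for doc in documents
    ]

    # Unique non-stop-word terms, in order of first appearance.
    vocabulary = []
    for terms in tokenized:
        for term in terms:
            if term not in stop_set and term not in vocabulary:
                vocabulary.append(term)

    # One count vector per document.
    term_freq = [[terms.count(term) for term in vocabulary] for terms in tokenized]
    return vocabulary, term_freq

sample_docs = ['Ironhack is cool.', 'I am a student at Ironhack.', 'I love Ironhack.']
sample_stop_words = ['is', 'i', 'am', 'a', 'at']  # shortened list (assumption)

vocabulary, term_freq = bag_of_words_model(sample_docs, sample_stop_words)
print(vocabulary)
print(term_freq)
```

Calling the function without stop_words gives the full vocabulary, so the same code covers both the main exercise and the bonus.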

### Additional Challenge for the Nerds

Scikit-Learn, which has a built-in BoW feature, is covered in Module 3. Try using Scikit-Learn to generate the BoW for this challenge and check whether the output matches yours. You will need to do some googling to find out how to use Scikit-Learn to generate a BoW.

In [16]:
!pip install scikit-learn



In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# Fit the bag-of-words model and build the document-term matrix.
# Note: the default tokenizer ignores single-character tokens such as 'i' and 'a'.
bag = vectorizer.fit_transform(clean_list)

# Get the unique words / tokens found in all the documents; these are the features.

print(vectorizer.get_feature_names_out())

# Associate the indices with each unique word

print(vectorizer.vocabulary_)

# Print the numerical feature vector

print(bag.toarray())

['am' 'at' 'cool' 'ironhack' 'is' 'love' 'student']
{'ironhack': 3, 'is': 4, 'cool': 2, 'am': 0, 'student': 6, 'at': 1, 'love': 5}
[[0 0 1 1 1 0 0]
 [1 1 0 1 0 0 1]
 [0 0 0 1 0 1 0]]
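The Scikit-Learn output has seven columns instead of nine, and they are sorted alphabetically. The reason is that CountVectorizer's default token_pattern, `(?u)\b\w\w+\b`, only matches tokens of two or more word characters, so 'i' and 'a' are dropped. The sketch below imitates that behavior with the standard library only; it is a simplified stand-in for illustration, not Scikit-Learn's actual implementation.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

clean_docs = ['ironhack is cool', 'i am a student at ironhack', 'i love ironhack']

# Tokenize: single-character words such as 'i' and 'a' do not match.
tokens_per_doc = [TOKEN_PATTERN.findall(doc) for doc in clean_docs]

# Alphabetically sorted vocabulary, as in the output above.
sorted_vocabulary = sorted({tok for tokens in tokens_per_doc for tok in tokens})

# Count matrix with one row per document.
matrix = [[tokens.count(term) for term in sorted_vocabulary] for tokens in tokens_per_doc]

print(sorted_vocabulary)
for row in matrix:
    print(row)
```

To keep one-letter words in Scikit-Learn itself, a commonly used option is to pass a custom pattern such as `token_pattern=r"(?u)\b\w+\b"` to CountVectorizer.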
