# Bag of Words Lab

## Introduction

**Bag of words (BoW)** is an important technique in text mining and [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval). BoW represents the content of text documents as term-frequency vectors, which makes it possible to analyze and compare documents with mathematics and computer programs.

BoW contains the following information:

1. A dictionary of all the terms (words) in the text documents. The terms are normalized in terms of the letter case (e.g. `Ironhack` => `ironhack`), tense (e.g. `had` => `have`), singular form (e.g. `students` => `student`), etc.
1. The number of occurrences of each normalized term in each document.
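
The normalization described in the first item can be sketched as follows. This is a toy illustration only: the `LEMMAS` lookup table and the `normalize` helper are hypothetical names introduced here, and a real project would rely on a proper lemmatizer rather than a hand-written table.

```python
# Hypothetical lookup table for tense and plural forms (illustration only).
LEMMAS = {"had": "have", "students": "student"}


def normalize(term):
    """Lowercase a term, then map it to a base form if the table lists one."""
    lowered = term.lower()
    return LEMMAS.get(lowered, lowered)


print([normalize(t) for t in ["Ironhack", "had", "students"]])
# ['ironhack', 'have', 'student']
```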

For example, assume we have three text documents:

DOC 1: **Ironhack is cool.**

DOC 2: **I love Ironhack.**

DOC 3: **I am a student at Ironhack.**

The BoW of the above documents looks like this:

| TERM | DOC 1 | DOC 2 | DOC 3 |
|---|---|---|---|
| a | 0 | 0 | 1 |
| am | 0 | 0 | 1 |
| at | 0 | 0 | 1 |
| cool | 1 | 0 | 0 |
| i | 0 | 1 | 1 |
| ironhack | 1 | 1 | 1 |
| is | 1 | 0 | 0 |
| love | 0 | 1 | 0 |
| student | 0 | 0 | 1 |


The term-frequency array of each document in BoW can be considered a high-dimensional vector. Data scientists use these vectors to represent the content of the documents. For instance, DOC 1 is represented with `[0, 0, 0, 1, 0, 1, 1, 0, 0]`, DOC 2 is represented with `[0, 0, 0, 0, 1, 1, 0, 1, 0]`, and DOC 3 is represented with `[1, 1, 1, 0, 1, 1, 0, 0, 1]`. **Two documents are considered similar if the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of their vector representations is close to 1.**
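
To make the comparison concrete, here is a minimal sketch that computes the cosine similarity of the three example vectors using only the standard library. The `cosine_similarity` helper is a name chosen for this illustration, and it assumes neither vector is all zeros (otherwise the division would fail).

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


doc1 = [0, 0, 0, 1, 0, 1, 1, 0, 0]
doc2 = [0, 0, 0, 0, 1, 1, 0, 1, 0]
doc3 = [1, 1, 1, 0, 1, 1, 0, 0, 1]

print(round(cosine_similarity(doc1, doc2), 3))  # 0.333
print(round(cosine_similarity(doc1, doc3), 3))  # 0.236
print(round(cosine_similarity(doc2, doc3), 3))  # 0.471
```

DOC 2 and DOC 3 score highest because they share two terms (`i` and `ironhack`), while the other pairs share only `ironhack`.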

In practice, many additional techniques improve text mining accuracy, such as removing [stop words](https://en.wikipedia.org/wiki/Stop_words) (i.e. ignoring common words such as `a`, `I`, and `to` that don't contribute much meaning), applying a synonym list (e.g. treating `New York City` the same as `NYC` and `Big Apple`), and stripping HTML tags if the data sources are webpages. In Module 3 you will learn how to use those advanced techniques for [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), a component of text mining.

In real text mining projects data analysts use packages such as Scikit-Learn and NLTK, which you will learn in Module 3, to extract BoW from texts. In this exercise, however, we would like you to create BoW manually with Python. This is because by manually creating BoW you can better understand the concept and also practice the Python skills you have learned so far.

## The Challenge

We need to create a BoW from a list of documents. The documents (`doc1.txt`, `doc2.txt`, and `doc3.txt`) can be found in the `your-code` directory of this exercise. You will read the content of each document into an array of strings named `corpus`.

*What is a corpus (plural: corpora)? Read the reference in the README file.*

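Your challenge is to use Python to generate the BoW of these documents. Your BoW should look like the example below. The example lists the terms alphabetically; the steps in this lab collect terms in order of first appearance, so the order of your terms and columns will differ.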

```python
bag_of_words = ['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']

term_freq = [
    [0, 0, 0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0, 1],
]
```

Now let's define the `docs` array that contains the paths of `doc1.txt`, `doc2.txt`, and `doc3.txt`.

In [137]:
docs = ['doc1.txt', 'doc2.txt', 'doc3.txt']

Define an empty array `corpus` that will contain the content strings of the docs. Loop over `docs` and read the content of each doc into the `corpus` array.

In [138]:
corpus = []

# Write your code here
for path in docs:
    # Read the full content of each file as a single string
    with open(path) as f:
        corpus.append(f.read())

Print `corpus`.

In [139]:
print(corpus)

['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']
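
The reading step can also be sketched with `pathlib`. This is one possible variant, not the lab's required approach: the `read_corpus` helper is a hypothetical name, the UTF-8 encoding is an assumption about the files, and the temporary file exists only so the example runs on its own.

```python
import tempfile
from pathlib import Path


def read_corpus(paths, encoding="utf-8"):
    """Read each file into a string, dropping surrounding whitespace."""
    return [Path(p).read_text(encoding=encoding).strip() for p in paths]


# Demo with a throwaway file standing in for doc1.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "doc1.txt"
    sample.write_text("Ironhack is cool.\n", encoding="utf-8")
    demo_corpus = read_corpus([sample])

print(demo_corpus)  # ['Ironhack is cool.']
```

Calling `strip()` guards against a trailing newline at the end of a file, which would otherwise end up inside the corpus string.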


You expected to see:

```['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']```

But you actually saw:

```['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']```

This is because you haven't done two important steps:

1. Remove punctuation from the strings

1. Convert strings to lowercase

Write your code below to process `corpus` (convert to lower case and remove special characters).

In [140]:
# Write your code here
import re

# Drop every character that is not a letter, digit, or whitespace, then lowercase
corpus = [re.sub(r'[^a-zA-Z0-9\s]+', '', doc).lower() for doc in corpus]
corpus

['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']
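
As an alternative to the regular expression, the same cleanup can be sketched with `str.translate` and `string.punctuation`. The `clean` helper and `raw_docs` list are names introduced for this illustration. The two approaches differ slightly: the regular expression above also strips non-ASCII letters (for example, the `é` in `café`), while this version removes only ASCII punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean(text):
    """Remove ASCII punctuation and convert the text to lowercase."""
    return text.translate(REMOVE_PUNCTUATION).lower()


raw_docs = ["Ironhack is cool.", "I love Ironhack.", "I am a student at Ironhack."]
print([clean(doc) for doc in raw_docs])
# ['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']
```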

Now define `bag_of_words` as an empty array. It will be used to store the unique terms in `corpus`.

In [141]:
bag_of_words = []


Loop through `corpus`. In each iteration, do the following:

1. Break the string into an array of terms.
1. Create a sub-loop to iterate over the terms array.
  * In each pass of the sub-loop, check whether the current term is already contained in `bag_of_words`. If it is not, append it to the array.

In [142]:
# Write your code here
for doc in corpus:
    for term in doc.split():
        # Keep only the first occurrence of each term
        if term not in bag_of_words:
            bag_of_words.append(term)

Print `bag_of_words`. You should see: 

```['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']```

If not, fix your code in the previous cell.

In [143]:
print(bag_of_words)

['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']
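
The nested loop above can also be written more compactly. The sketch below is an optional alternative, assuming Python 3.7 or later (where dictionaries preserve insertion order); the variable names are chosen for this illustration. It also shows how sorting the unique terms yields the alphabetical ordering used in the example at the start of the challenge.

```python
docs = ["ironhack is cool", "i love ironhack", "i am a student at ironhack"]

# Flatten all documents into one list of terms.
all_terms = [term for doc in docs for term in doc.split()]

# dict.fromkeys keeps the first occurrence of each term, in order.
in_order = list(dict.fromkeys(all_terms))

# sorted(set(...)) gives the unique terms alphabetically instead.
alphabetical = sorted(set(all_terms))

print(in_order)
# ['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']
print(alphabetical)
# ['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']
```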


Now define an empty array called `term_freq`. Loop over `corpus` a second time. In each iteration, create a sub-loop over the terms in `bag_of_words` and count how many times each term appears in the current doc. Append the resulting term-frequency array to `term_freq`.

In [146]:
# Write your code here
term_freq = []

for doc in corpus:
    terms = doc.split()
    # Count occurrences of each vocabulary term (not just its presence)
    term_freq.append([terms.count(term) for term in bag_of_words])

term_freq

[[1, 1, 1, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 1, 1, 0, 0, 0, 0],
 [1, 0, 0, 1, 0, 1, 1, 1, 1]]

Print `term_freq`. You should see:

```[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]```

In [None]:
print(term_freq)

**If your output is correct, congratulations! You've solved the challenge!**

If not, go back and check for errors in your code.
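
In the three lab documents every term appears at most once, so counts and presence flags look identical. The sketch below uses `collections.Counter` on a made-up document with a repeated term to show that term frequency is a count. The `term_frequencies` helper, the sample sentence, and the short vocabulary are all hypothetical.

```python
from collections import Counter


def term_frequencies(doc, vocabulary):
    """Return how often each vocabulary term occurs in a cleaned document."""
    counts = Counter(doc.split())
    # Counter returns 0 for terms that never occur in the document.
    return [counts[term] for term in vocabulary]


vocabulary = ["ironhack", "is", "cool", "love"]
print(term_frequencies("ironhack is cool ironhack", vocabulary))
# [2, 1, 1, 0]
```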

## Bonus Question

Optimize your solution for the above question by removing stop words from the BoW. For your convenience, a list of stop words is defined for you in the next cell. With the stop words removed, your output should look like:

```
bag_of_words = ['am', 'at', 'cool', 'ironhack', 'is', 'love', 'student']

term_freq = [
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1]
]
```

**Requirements:**

1. Combine all your previous code in the cell below.
1. Improve your solution by ignoring stop words in `bag_of_words`.

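After you're done, if you keep the terms in order of first appearance, your `bag_of_words` should be:

```['ironhack', 'is', 'cool', 'love', 'am', 'student', 'at']```

And your `term_freq` should be:

```[[1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 1, 1, 1]]```

Note that these expected outputs drop only `a` and `i`. The `stop_words` list in the next cell also contains `is`, `am`, and `at`, so filtering with the full list leaves only `ironhack`, `cool`, `love`, and `student`.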
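
One possible way to combine the steps is sketched below. The `build_bow` helper is a hypothetical name, and the demo passes a reduced stop list containing only `a` and `i`, which is an assumption made here because that is the list that reproduces the expected output above. Passing the full `stop_words` list from the next cell would also drop `is`, `am`, and `at`.

```python
def build_bow(docs, stop_words):
    """Return (vocabulary, term_freq) for cleaned docs, skipping stop words."""
    stop = set(stop_words)  # set membership checks are fast
    tokenized = [[t for t in doc.split() if t not in stop] for doc in docs]
    # Unique terms in order of first appearance.
    vocabulary = list(dict.fromkeys(t for terms in tokenized for t in terms))
    term_freq = [[terms.count(t) for t in vocabulary] for terms in tokenized]
    return vocabulary, term_freq


docs = ["ironhack is cool", "i love ironhack", "i am a student at ironhack"]
vocabulary, term_freq = build_bow(docs, ["a", "i"])

print(vocabulary)
# ['ironhack', 'is', 'cool', 'love', 'am', 'student', 'at']
print(term_freq)
# [[1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 1, 1, 1]]
```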

In [None]:
stop_words = ['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 
'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']

# Write your code below


## Additional Challenge for the Nerds

Scikit-Learn, which we will cover in Module 3, has a built-in BoW feature. Try using Scikit-Learn to generate the BoW for this challenge and check whether the output matches yours. You will need to do some searching to find out how to use Scikit-Learn to generate a BoW.

**Notes:**

* To install Scikit-Learn, use `pip install scikit-learn`.

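* Scikit-Learn's `CountVectorizer` does not remove stop words by default (`stop_words=None`). Its default tokenizer only keeps tokens of two or more characters, which is why `a` and `i` are missing from the output below. You don't need to manually remove stop words to match that output.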

* Scikit-Learn's output has a slightly different format from the example output shown above. That's OK; you don't need to convert the Scikit-Learn output.

The Scikit-Learn output will look like this:

```python
# BoW:
{u'love': 5, u'ironhack': 3, u'student': 6, u'is': 4, u'cool': 2, u'am': 0, u'at': 1}

# term_freq:
[[0 0 1 1 1 0 0]
 [0 0 0 1 0 1 0]
 [1 1 0 1 0 0 1]]
 ```
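
The vocabulary above lacks `a` and `i` even though no stop-word list was supplied. The sketch below reproduces that effect with the standard `re` module, using the regular expression documented as the default `token_pattern` of Scikit-Learn's `CountVectorizer`. The `tokenize` helper is a stand-in written for this illustration, not Scikit-Learn's actual implementation.

```python
import re

# Default token_pattern documented for CountVectorizer:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("I am a student at Ironhack."))
# ['am', 'student', 'at', 'ironhack']
```

Single-character words never match the pattern, so they are dropped during tokenization, before any stop-word filtering would take place.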