In [1]:
# Imports and settings

from string import punctuation

from common.setup_notebook import set_css_style
set_css_style()

# <center> Text as numerical features

Text is unstructured, and in order for it to be used in a model, we typically need to find a numerical representation of it.

## The Bag of Words framework

The Bag of Words framework (BoW for short) is possibly the simplest numerical representation of text one could envisage.

According to Wikipedia ([[1]](#1)), an early reference to this name appears in Z. S. Harris's 1954 paper *Distributional Structure* ([[2]](#2)).

In BoW, a text is simply transformed into a "bag" (a multiset, that is, a set allowing for multiple occurrences) of the words composing it. This method is very simplistic in that it disregards grammar and word order.
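
To make the multiset idea concrete, here is a minimal sketch using `collections.Counter` from the standard library, which behaves like a bag. The two sentences are made up for illustration and are not part of the example used later in this notebook.

```python
from collections import Counter

# Two made-up texts containing the same words in a different order
bag_a = Counter("John likes movies and Mary likes movies".split())
bag_b = Counter("movies likes Mary and movies likes John".split())

print(bag_a["likes"])   # number of occurrences of "likes" in the first text
print(bag_a == bag_b)   # the bags are equal because word order is ignored
```

Two texts that differ only in word order end up with the same bag, which is exactly the information BoW throws away.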

Given a corpus of sentences, you take all the unique words in the corpus and, for each sentence, count the occurrences of each of those words.

**Example**:

Given a corpus composed of the two texts

1. "John likes watching movies. Mary likes movies too."
2. "John also likes watching football games."

The list of unique words in it is ["John", "likes", "watching", "movies", "also", "football", "games", "Mary", "too"], which contains 9 words.

Each text is encoded as a 9-item list holding the occurrence counts of those words. Respecting the order we chose for the list of unique words, we have:

1. [1, 2, 1, 2, 0, 0, 0, 1, 1]
2. [1, 1, 1, 0, 1, 1, 1, 0, 0]

This is because "John" (first item) appears once in the first text, "likes" appears twice in the first text, and so on.
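
The encoding above can be reproduced with a short sketch. It assumes a simple tokenizer that replaces punctuation with spaces and splits on whitespace. The helper name `tokenize` is hypothetical, and the vocabulary is written out by hand in the same order as in the example.

```python
from collections import Counter
from string import punctuation

texts = [
    "John likes watching movies. Mary likes movies too.",
    "John also likes watching football games.",
]

# Vocabulary in the order chosen in the example above
vocabulary = ["John", "likes", "watching", "movies", "also",
              "football", "games", "Mary", "too"]

def tokenize(text):
    """Replace punctuation with spaces and split on whitespace (hypothetical helper)."""
    table = str.maketrans(punctuation, " " * len(punctuation))
    return text.translate(table).split()

vectors = []
for text in texts:
    counts = Counter(tokenize(text))  # a missing word counts as 0
    vectors.append([counts[word] for word in vocabulary])

for vector in vectors:
    print(vector)
```

Changing the order of `vocabulary` permutes the items of every vector in the same way, so the representation is only meaningful together with the vocabulary it was built from.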

We can play around with this a bit!

### Playing with bags of words

We now ask you to enter three sentences in the cell below.

In [8]:
print('Give me three sentences')

s1 = input("First sentence: ")
s2 = input("Second sentence: ")
s3 = input("Third sentence: ")

Give me three sentences
First sentence: John likes watching movies. Mary likes movies too.
Second sentence: John also likes watching football games.
Third sentence: John likes movies.


Then we build the list of all the unique words found across the sentences you provided:

In [38]:
# Join the sentences with a space (so that words at sentence boundaries
# cannot merge), replace punctuation with space and split on whitespace
# Do the same for each single sentence (for later use)
s = ' '.join([s1, s2, s3])
for sign in punctuation:
    s = s.replace(sign, ' ')
    s1 = s1.replace(sign, ' ')
    s2 = s2.replace(sign, ' ')
    s3 = s3.replace(sign, ' ')
    
# Create the unique words list
unique_words = list(set(s.split()))

print('unique words are: ', unique_words)

unique words are:  ['John', 'too', 'watching', 'movies', 'games', 'Mary', 'football', 'likes', 'also']


Finally, for each sentence provided, we now compute its BoW representation:

In [40]:
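s1_bow, s2_bow, s3_bow = [], [], []

# Split each sentence into words first: str.count on the raw string
# would count substrings (e.g. "like" inside "likes"), not whole words
s1_words, s2_words, s3_words = s1.split(), s2.split(), s3.split()

for word in unique_words:
    s1_bow.append(s1_words.count(word))
    s2_bow.append(s2_words.count(word))
    s3_bow.append(s3_words.count(word))

print('First sentence in BoW: ', s1_bow)
print('Second sentence in BoW: ', s2_bow)
print('Third sentence in BoW: ', s3_bow)

First sentence in BoW:  [1, 1, 1, 2, 0, 1, 0, 2, 0]
Second sentence in BoW:  [1, 0, 1, 0, 1, 0, 1, 1, 1]
Third sentence in BoW:  [1, 0, 0, 1, 0, 0, 0, 1, 0]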
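
The steps in this section can also be wrapped into a single reusable function. The following is a minimal sketch, not a reference implementation: the names `tokenize` and `bag_of_words` are hypothetical, and the vocabulary is sorted (an assumption made here) so that the order of the vector items is reproducible, since the iteration order of a Python `set` is arbitrary.

```python
from collections import Counter
from string import punctuation

def tokenize(text):
    """Replace punctuation with spaces and split on whitespace (hypothetical helper)."""
    table = str.maketrans(punctuation, " " * len(punctuation))
    return text.translate(table).split()

def bag_of_words(sentences):
    """Return (vocabulary, vectors) for a list of sentences (hypothetical helper)."""
    tokenized = [tokenize(sentence) for sentence in sentences]
    # Sorted so that the item order is the same on every run
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)  # counts whole words; a missing word gives 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words([
    "John likes watching movies. Mary likes movies too.",
    "John also likes watching football games.",
    "John likes movies.",
])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Note that this sketch is case-sensitive, like the cells above, so "John" and "john" would be treated as two different words.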


## The TF-IDF framework

> TODO

## References

1. <a name="1"></a> [Wikipedia on the Bag of Words model](https://en.wikipedia.org/wiki/Bag-of-words_model)
2. <a name="2"></a> Z S Harris, [Distributional Structure](http://www.tandfonline.com/doi/pdf/10.1080/00437956.1954.11659520), *Word* 10.2-3, 1954