In [1]:
%run 02_data_preparation.ipynb

# Text representation methods

Machine learning methods operate on data encoded as mathematical vectors, so we need to convert our texts to this representation. There is more than one way of doing that, and we are now going to look at several possibilities.

In [1]:
SENTENCES = ("The man who does not read has no advantage over the man who cannot read", 
             "He cannot read it because the letters are too small",
             "Big letters have an advantage")

Let's define the following helper function to extract all the unique words from the text:

In [2]:
from operator import or_
from functools import reduce


def split_sentence(sentence):
    # Lowercase first, then split on whitespace (str.split already returns a list)
    return sentence.lower().split()


def extract_dictionary(sentences):
    """
    Creates a list with all the words occurring in the given iterable
    of sentences, without duplicates. All the sentences are converted
    to lowercase before splitting. The words are collected through sets,
    so their order in the result is arbitrary.
    :return: list of string instances
    """
    return list(reduce(or_, (set(split_sentence(sentence)) for sentence in sentences)))


DICTIONARY = extract_dictionary(SENTENCES)
DICTIONARY

['does',
 'not',
 'it',
 'big',
 'have',
 'small',
 'over',
 'read',
 'letters',
 'too',
 'an',
 'are',
 'man',
 'has',
 'because',
 'who',
 'the',
 'advantage',
 'he',
 'no',
 'cannot']

## Bag of words

The simplest way of encoding textual information in a mathematical vector is to keep the information about which words have been used. Typically, we first extract the list of all the unique words occurring in the dataset and assign each of them a number: its position in the dictionary, which is then used as the position in the vector describing the text. In the simplest case we put "1" at the corresponding vector position no matter how many times the word occurred in the sentence, but we can also put the absolute number of occurrences there in order to emphasize the most frequent words. In both cases we lose a lot of important information, such as word collocations and negations, which certainly carry a lot of meaning. Nevertheless, this way we are able to encode some of the most general features of the text.

Let's have a look at this simple example below. We will assume we have only three sentences in our dataset:
- I have no time to see you.
- See you, please be on time!
- It was a pleasure to meet you!

It gives the following dictionary:

|No|Word|Occurrences|
|---|---|---|
|0|I|1|
|1|have|1|
|2|no|1|
|3|time|2|
|4|to|2|
|5|see|2|
|6|you|3|
|7|please|1|
|8|be|1|
|9|on|1|
|10|it|1|
|11|was|1|
|12|a|1|
|13|pleasure|1|
|14|meet|1|

There are 15 different words overall, so our output vectors are going to have as many dimensions. Now we can see how all the sentences from our dataset are going to be represented as vectors:

**I have no time to see you.**

|I|have|no|time|to|see|you|please|be|on|it|was|a|pleasure|meet|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|**1**|**1**|**1**|**1**|**1**|**1**|**1**|0|0|0|0|0|0|0|0|

**See you, please be on time!**

|I|have|no|time|to|see|you|please|be|on|it|was|a|pleasure|meet|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|0|**1**|0|**1**|**1**|**1**|**1**|**1**|0|0|0|0|0|

**It was a pleasure to meet you!**

|I|have|no|time|to|see|you|please|be|on|it|was|a|pleasure|meet|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|0|0|**1**|0|**1**|0|0|0|**1**|**1**|**1**|**1**|**1**|
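The encoding above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the names `tokenize`, `build_vocabulary` and `bag_of_words` are made up for this example, words are lowercased (so "I" becomes "i"), and punctuation is dropped with a simple regular expression.

```python
import re

EXAMPLE_SENTENCES = [
    "I have no time to see you.",
    "See you, please be on time!",
    "It was a pleasure to meet you!",
]


def tokenize(text):
    # Lowercase the text and keep only runs of letters, dropping punctuation
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(sentences):
    # Map each unique word to its position, in order of first appearance
    vocabulary = {}
    for sentence in sentences:
        for word in tokenize(sentence):
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def bag_of_words(sentence, vocabulary, binary=True):
    # binary=True marks presence only; binary=False counts occurrences
    vector = [0] * len(vocabulary)
    for word in tokenize(sentence):
        if word in vocabulary:
            position = vocabulary[word]
            vector[position] = 1 if binary else vector[position] + 1
    return vector


VOCABULARY = build_vocabulary(EXAMPLE_SENTENCES)
for sentence in EXAMPLE_SENTENCES:
    print(sentence, bag_of_words(sentence, VOCABULARY))
```

Passing `binary=False` switches from presence flags to absolute counts, which is the second variant described above.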

## TFIDF

In the "bag of words" approach we treat every single word in exactly the same way; however, some words carry more information than others. For that reason, it might be a good idea to let our ML models know that such words should be treated with higher importance. This is quite a common issue, so there is already an established formula for encoding the data with that in mind. We need to introduce the following terms first:

$$n_{i,j} - \text{number of occurrences of the term }t_{i}\text{ in the document }d_{j}$$

$$tf_{i,j}=\frac{n_{i,j}}{\sum_{k}n_{k,j}}$$

$$idf_{i}=\log\frac{|D|}{|\{d:t_{i}\in d\}|}$$

$$tfidf_{i,j}=tf_{i,j} \times idf_{i}$$

Intuitively, $tf_{i,j}$ describes the frequency of a particular word within one of the documents. The term $idf_{i}$, in turn, depends inversely on the number of documents this word or phrase is part of: it grows as the number of such documents decreases. If the word is present in every single document, then it is probably non-informative; the fraction under the logarithm equals one, so $idf_{i}$ is zero, the whole value of $tfidf_{i,j}$ is also zero, and this word won't be taken into consideration further.
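As a quick worked check of these formulas, take the word "read" and the first of the three example sentences defined at the top of this notebook. That sentence has 15 words, "read" occurs in it twice, and the word appears in two of the three sentences. Assuming the natural logarithm:

$$tf_{\text{read},1}=\frac{2}{15}\approx 0.133$$

$$idf_{\text{read}}=\log\frac{3}{2}\approx 0.405$$

$$tfidf_{\text{read},1}\approx 0.133 \times 0.405\approx 0.054$$

By contrast, a hypothetical word occurring in all three sentences would get $idf=\log\frac{3}{3}=0$, and therefore a $tfidf$ of zero in every sentence.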

To get a proper intuition, let's look at the values for each word from the dictionary in our example. Rather than working them all out by hand, let's write some code to do it automatically.

In [4]:
import math
import numpy as np
import pandas as pd  # used below to build the TF and IDF tables


def tf(i, j):
    """
    Calculates a value of tf_{i, j} term.
    :param i: index of the term in the dictionary
    :param j: index of the document in the collection of sentences
    :return: value of the tf for term and document
    """
    sentence_words = split_sentence(SENTENCES[j])
    return sentence_words.count(DICTIONARY[i]) / len(sentence_words)


def idf(i):
    """
    Calculates a value of idf_{i} term.
    :param i: index of the term in the dictionary
    :return: value of the idf for term
    """
    return math.log(len(SENTENCES) / sum(1 for sentence in SENTENCES 
                                         if DICTIONARY[i] in split_sentence(sentence)))


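# Rows are dictionary words, columns are sentences; start from all zeros
tf_values = pd.DataFrame(data=np.zeros((len(DICTIONARY), len(SENTENCES))),
                         index=DICTIONARY, columns=SENTENCES)
# A single row with one column per dictionary word
idf_values = pd.DataFrame(np.zeros((1, len(DICTIONARY))),
                          columns=DICTIONARY)
for i, word in enumerate(DICTIONARY):
    for j, sentence in enumerate(SENTENCES):
        # .loc avoids chained assignment, which may silently write to a copy
        tf_values.loc[word, sentence] = tf(i, j)
    idf_values[word] = idf(i)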

In [5]:
# Display TF term values
tf_values.style.background_gradient(cmap="Wistia")

Unnamed: 0,The man who does not read has no advantage over the man who cannot read,He cannot read it because the letters are too small,Big letters have an advantage
he,0.0,0.1,0.0
not,0.0666667,0.0,0.0
man,0.133333,0.0,0.0
does,0.0666667,0.0,0.0
has,0.0666667,0.0,0.0
an,0.0,0.0,0.2
small,0.0,0.1,0.0
too,0.0,0.1,0.0
have,0.0,0.0,0.2
read,0.133333,0.1,0.0


In [6]:
# Display IDF term values
idf_values.T.style.background_gradient(cmap="Wistia")

Unnamed: 0,0
he,1.09861
not,1.09861
man,1.09861
does,1.09861
has,1.09861
an,1.09861
small,1.09861
too,1.09861
have,1.09861
read,0.405465


In [7]:
# Multiply TF with IDF to get the TFIDF term values
tfidf_values = tf_values.multiply(idf_values.T[0], axis="index")
tfidf_values.style.background_gradient(cmap="Wistia")

Unnamed: 0,The man who does not read has no advantage over the man who cannot read,He cannot read it because the letters are too small,Big letters have an advantage
he,0.0,0.109861,0.0
not,0.0732408,0.0,0.0
man,0.146482,0.0,0.0
does,0.0732408,0.0,0.0
has,0.0732408,0.0,0.0
an,0.0,0.0,0.219722
small,0.0,0.109861,0.0
too,0.0,0.109861,0.0
have,0.0,0.0,0.219722
read,0.054062,0.0405465,0.0


## Word2vec

This approach is slightly different from the previous ones and unfortunately not applicable to our case, as we don't want to model single words, but sentences or even whole documents. Nevertheless, it is interesting enough to discuss in a little more detail.

The word2vec method tries to capture the meaning of a word from the contexts it usually appears in. If two different words are synonyms, they should often occur within the same surrounding phrases, so their output vectors will be quite close to each other. Internally, a shallow neural network is used to find a *distributed representation* of the words. We are not going to analyze the details; let's rather focus on some properties of such a vectorization method.

Intuitively, word2vec converts any word to a fixed-length vector. Two words that are similar in meaning should have vectors close to each other. An interesting property of this algorithm is that it preserves the relationships between words.

![Word2vec - expectations](images/word2vec-verb.png)

Here are several examples of word-pair relationships that a trained model was able to recognize:

![Word2vec - examples](images/word2vec-examples.png)
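To make the idea of "closeness" and preserved relationships more concrete, here is a minimal sketch. The two-dimensional vectors below are hand-picked, hypothetical values for illustration only; they do not come from any trained word2vec model, and the helper names `cosine_similarity` and `analogy` are made up for this example.

```python
import math

# Hypothetical toy "embeddings", chosen by hand; NOT taken from a trained model
TOY_VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.1),
}


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 means the same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    # Find the word closest to vec(b) - vec(a) + vec(c), skipping the inputs
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates,
               key=lambda word: cosine_similarity(vectors[word], target))


# "man" is to "king" as "woman" is to ...?
print(analogy("man", "king", "woman", TOY_VECTORS))
```

With these toy values the relationship "man to king" is the same vector offset as "woman to queen", which is the kind of regularity the images above illustrate for real models.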

Somehow, word2vec is able to find the semantic relationships between words, and that's a really interesting feature. There is a web application, made by researchers from Warsaw University, that lets you visualize these relationships:

https://lamyiowce.github.io/word2viz/


### Doc2vec

The word2vec method vectorizes single words, but it can be generalized to do the same for whole sentences or even documents. This family of models requires a lot of training data, definitely much more than we have collected, so we can't really use it.

## Exercise

As an exercise, let's implement bag-of-words vectorization for the given texts. In the file *exercise/exercise_02.py* you will find a basic structure for the vectorizer to be implemented. There are two methods: *fit*, which takes an iterable of text documents, and *vectorize*, which takes a single document as a string and returns its bag-of-words representation as a numpy array.

In [3]:
%run exercise/exercise_02.py

The following piece of code should fit the vectorizer on our dataset and then display each sentence as a bag-of-words vector.

In [4]:
vectorizer = BagOfWordsVectorizer()
vectorizer.fit(SENTENCES)
for sentence in SENTENCES:
    bow = vectorizer.vectorize(sentence)
    print("Sentence: {}\n{}".format(sentence, bow))

Sentence: The man who does not read has no advantage over the man who cannot read
[ 1.  0.  1.  0.  0.  0.  1.  0.  1.  1.  0.  0.  0.  1.  0.  1.  1.  1.
  0.  1.  1.]
Sentence: He cannot read it because the letters are too small
[ 0.  1.  1.  1.  0.  1.  0.  1.  1.  0.  1.  0.  0.  0.  1.  0.  0.  0.
  1.  0.  1.]
Sentence: Big letters have an advantage
[ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  1.  1.  0.  1.  0.  0.  1.
  0.  0.  0.]
