# Feature Extraction

In [1]:
import math

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Let's define a simple corpus of 4 documents, each containing just one sentence. All sentences have been preprocessed -- that is, all words are lower case, punctuation has been removed, and stop words have been removed.

In [2]:
documents = ['cat climb tree',
             'saw cat climb shed',
             'have tree shed shed',
             'dog saw cat tree shed']

# Remember the number of documents for later
N = len(documents)

We first analyse all documents and build up a series of statistics we need throughout the tutorial:

- `word_list`: a list of all words, i.e., the vocabulary of the corpus

- `word_to_idx`: a dictionary that maps a word to an index; for efficiency reasons, most algorithms work with indexes/numbers rather than with the words themselves. Don't forget, algorithms don't care about words specifically. Note below that the index matches the position in `word_list`. As such, we can map not only from a word to its index but also from an index back to the respective word.

- `doc_counts`: a dictionary that keeps track of how many documents contain a certain word. This later simplifies the calculation of the inverse document frequency (idf).

In [3]:
word_list = []
word_to_idx = {}
doc_counts = {}

for doc in documents:
    # For each word in the document...
    for word in doc.split():
        # If we haven't seen this word yet...
        if word not in word_to_idx:
            # Add word to word list
            word_list.append(word)
            # Add word to index mapping; the index is just an increasing number
            word_to_idx[word] = len(word_to_idx)
    # for each UNIQUE word in the document...
    for word in set(doc.split()):
        if word not in doc_counts:
            doc_counts[word] = 1
        else:
            doc_counts[word] += 1
            
# Let's see what the 3 statistics look like
print(word_to_idx)
print(word_list)
print(doc_counts)

{'tree': 2, 'climb': 1, 'saw': 3, 'have': 5, 'shed': 4, 'cat': 0, 'dog': 6}
['cat', 'climb', 'tree', 'saw', 'shed', 'have', 'dog']
{'tree': 3, 'climb': 2, 'saw': 2, 'have': 1, 'shed': 3, 'cat': 3, 'dog': 1}
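
The same three statistics can also be built more compactly. The following is only a sketch of an alternative to the loop above, not a replacement for it. It assumes Python 3.7+, where `dict.fromkeys()` preserves insertion order, and uses `collections.Counter`. It also checks the round trip from a word to its index and back.

```python
from collections import Counter

documents = ['cat climb tree',
             'saw cat climb shed',
             'have tree shed shed',
             'dog saw cat tree shed']

# Vocabulary in order of first appearance (dict keys keep insertion order)
word_list = list(dict.fromkeys(w for doc in documents for w in doc.split()))

# The index of a word is simply its position in word_list
word_to_idx = {word: idx for idx, word in enumerate(word_list)}

# Going through set(...) counts each word at most once per document
doc_counts = Counter(w for doc in documents for w in set(doc.split()))

# Round trip: word -> index -> word
idx = word_to_idx['shed']
print(idx, word_list[idx])
```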


## Document Word Matrix 

The most common way to represent a corpus is the document word matrix (also called the document term matrix), which is simply a more self-explanatory name for a feature set. The matrix contains N rows (N = number of documents) and M columns (M = size of vocabulary). A row is also called a *document vector* or *feature vector*.

### Word-count document word matrix (manually)

Here, an element indicates how often a word occurs in a document.

In [4]:
for doc in documents:
    feature_vector = [0] * len(word_list)  # We know that the length of the vector is the size of the vocabulary
    for w in doc.split():
        # We use the index mapping to find the correct index of a word.
        feature_vector[word_to_idx[w]] += 1
            
    print(feature_vector) # Each feature vector represents one document; let's see it
            

[1, 1, 1, 0, 0, 0, 0]
[1, 1, 0, 1, 1, 0, 0]
[0, 0, 1, 0, 2, 1, 0]
[1, 0, 1, 1, 1, 0, 1]


The 4 lists are the four feature vectors and together form the document term matrix. The order of the columns matches the order of the words in the vocabulary with respect to their index. For example, the first word (idx=0) is "cat", the second word (idx=1) is "climb", and so on.

### Word-count document word matrix (using scikit-learn)

`scikit-learn` provides the `CountVectorizer` to easily generate the document word matrix, with the word counts as its elements. `CountVectorizer` supports a lot of parameters for configuration -- as we will see later. Right now, we simply use the default parameter values.

In [5]:
count_vectorizer = CountVectorizer()

The method `fit_transform()` takes the corpus as input and performs the generation of the document word matrix.

In [6]:
tf = count_vectorizer.fit_transform(documents)
vocabulary = count_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

print(vocabulary)

['cat', 'climb', 'dog', 'have', 'saw', 'shed', 'tree']


We use `pandas` to conveniently display the document word matrix.

In [7]:
from pandas import DataFrame

print(DataFrame(tf.toarray(), columns=vocabulary).to_string())

   cat  climb  dog  have  saw  shed  tree
0    1      1    0     0    0     0     1
1    1      1    0     0    1     1     0
2    0      0    0     1    0     2     1
3    1      0    1     0    1     1     1


Note that the order of the columns does not match the manually built matrix above, because `CountVectorizer` sorts its vocabulary alphabetically. For example, here the last column is for the word "tree", while above the last column is for the word "dog".
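
Both matrices hold the same counts; only the column layout differs. As a small pure-Python sketch, with the manual feature vectors copied by hand from above, the manual matrix can be rearranged into the alphabetical column order used by scikit-learn:

```python
word_list = ['cat', 'climb', 'tree', 'saw', 'shed', 'have', 'dog']
manual_matrix = [[1, 1, 1, 0, 0, 0, 0],
                 [1, 1, 0, 1, 1, 0, 0],
                 [0, 0, 1, 0, 2, 1, 0],
                 [1, 0, 1, 1, 1, 0, 1]]

# Alphabetical vocabulary, as scikit-learn reports it
sorted_vocabulary = sorted(word_list)

# For each target column, find the column it occupies in the manual matrix
column_order = [word_list.index(word) for word in sorted_vocabulary]

# Pick the columns of every row in the new order
reordered_matrix = [[row[col] for col in column_order] for row in manual_matrix]

print(sorted_vocabulary)
for row in reordered_matrix:
    print(row)
```

Going the other way is also possible: `CountVectorizer` accepts a `vocabulary` parameter that fixes the column order, although this tutorial does not use it.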

### tf-idf document word matrix (manually)

Here, an element represents the tf-idf value of a word in a document (tf = term frequency; idf = inverse document frequency). The `tf-idf` value of a term $t_i$ in a document $d_j$ is defined as:

$$tfidf(t_i, d_j) = tf(t_i, d_j) \cdot idf(t_i, d_j)$$

with

$$ tf(t_i, d_j) = \frac{number\ of\ times\ t_i\ appears\ in\ d_j}{total\ number\ of\ terms\ in\ d_j} $$

and

$$ idf(t_i, d_j) = \log{\frac{total\ number\ of\ documents}{number\ of\ documents\ containing\ t_i}} $$


**IMPORTANT**: In the following, we calculate the term frequency simply as $tf(t_i, d_j) = number\ of\ times\ t_i\ appears\ in\ d_j$. This is how `scikit-learn` does it, and we want to end up with the same result for comparison. 

In [8]:
for doc in documents:
    feature_vector = [0.0] * len(word_list)  # We know that the length of the vector is the size of the vocabulary
    # Split the document into list of words
    words = doc.split()
    # Generate a set of words from the list (i.e., no duplicates)
    word_set = set(words)
    # For each unique word in the document (i.e., the word set)
    for word in word_set:
        # Calculate tf; count() is a built-in list method that returns the number of occurrences of an item
        tf = words.count(word) # / len(words)   # Here we differ from the "official definition" of tf
        # Look up the document frequency df, which we already computed when building the statistics
        df = doc_counts[word]
        # Calculate the idf; the +1 terms are for smoothing to match the results of the scikit-learn methods
        idf = math.log( (N + 1) / (df + 1) ) + 1
        # Finally, calculate the tf-idf value
        tfidf = tf * idf
        feature_vector[word_to_idx[word]] = round(tfidf, 6)

    print(feature_vector)

[1.223144, 1.510826, 1.223144, 0.0, 0.0, 0.0, 0.0]
[1.223144, 1.510826, 0.0, 1.510826, 1.223144, 0.0, 0.0]
[0.0, 0.0, 1.223144, 0.0, 2.446287, 1.916291, 0.0]
[1.223144, 0.0, 1.223144, 1.510826, 1.223144, 0.0, 1.916291]
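
All values above derive from just three distinct idf values, since every word occurs in 1, 2, or 3 of the 4 documents. As a quick check, the smoothed idf from the code, $\ln\frac{N+1}{df+1} + 1$, can be evaluated for each of these document frequencies. The name `idf_by_df` is introduced here only for illustration.

```python
import math

N = 4  # number of documents in the corpus

# Smoothed idf for every document frequency that occurs in the corpus
idf_by_df = {}
for df in (1, 2, 3):
    idf_by_df[df] = math.log((N + 1) / (df + 1)) + 1
    print(df, round(idf_by_df[df], 6))
```

For example, "shed" occurs twice in the third document and appears in 3 documents overall, which gives 2 * 1.223144 ≈ 2.446287.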


### tf-idf document word matrix (using scikit-learn)

Again, `scikit-learn` provides means to hide all these calculations; here, the `TfidfVectorizer`. By default, `TfidfVectorizer` applies L2 normalization to each document vector (row), so that every row has unit Euclidean length and all values lie between 0 and 1. To get comparable results, we switch this off with `norm=None`.

In [9]:
tfidf_vectorizer = TfidfVectorizer(norm=None)

The method `fit_transform()` takes the corpus as input and performs the generation of the document word matrix.

In [13]:
tfidf_model = tfidf_vectorizer.fit_transform(documents)
vocabulary = tfidf_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print(vocabulary)

['cat', 'climb', 'dog', 'have', 'saw', 'shed', 'tree']


In [14]:
print(DataFrame(tfidf_model.toarray(), columns=vocabulary).to_string())

        cat     climb       dog      have       saw      shed      tree
0  1.223144  1.510826  0.000000  0.000000  0.000000  0.000000  1.223144
1  1.223144  1.510826  0.000000  0.000000  1.510826  1.223144  0.000000
2  0.000000  0.000000  0.000000  1.916291  0.000000  2.446287  1.223144
3  1.223144  0.000000  1.916291  0.000000  1.510826  1.223144  1.223144
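
With its default setting `norm='l2'`, `TfidfVectorizer` would divide each row by its Euclidean length. A minimal pure-Python sketch of that step, applied to the first row of the matrix above (values copied by hand, so the result is only accurate to about six decimals), might look like this:

```python
import math

# First row of the unnormalized tf-idf matrix (document 'cat climb tree')
row = [1.223144, 1.510826, 0.0, 0.0, 0.0, 0.0, 1.223144]

# Euclidean (L2) length of the row
length = math.sqrt(sum(value * value for value in row))

# Divide every entry by the length; the result has unit length
normalized_row = [value / length for value in row]

print([round(value, 6) for value in normalized_row])
```

After this step every entry lies between 0 and 1, and the squared entries of a row sum to 1.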


All vectorizers allow users to specify a wide range of parameters. A very important one is `ngram_range`, which specifies that not only individual words/tokens (1-grams, unigrams) but also n-grams of larger sizes are used as features. Note that n-grams larger than 3 are typically not recommended, since the vocabulary grows quickly with n.

In [16]:
ngram_count_vectorizer = CountVectorizer(ngram_range=(1,2))  # Default is ngram_range=(1,1), i.e., individual words

ngram_count_model = ngram_count_vectorizer.fit_transform(documents)
ngram_vocabulary = ngram_count_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

print(ngram_vocabulary)
#print(DataFrame(ngram_count_model.A, columns=ngram_vocabulary).to_string())

['cat', 'cat climb', 'cat tree', 'climb', 'climb shed', 'climb tree', 'dog', 'dog saw', 'have', 'have tree', 'saw', 'saw cat', 'shed', 'shed shed', 'tree', 'tree shed']


In [17]:
print(DataFrame(ngram_count_model.toarray(), columns=ngram_vocabulary).to_string())

   cat  cat climb  cat tree  climb  climb shed  climb tree  dog  dog saw  have  have tree  saw  saw cat  shed  shed shed  tree  tree shed
0    1          1         0      1           0           1    0        0     0          0    0        0     0          0     1          0
1    1          1         0      1           1           0    0        0     0          0    1        1     1          0     0          0
2    0          0         0      0           0           0    0        0     1          1    0        0     2          1     1          1
3    1          0         1      0           0           0    1        1     0          0    1        1     1          0     1          1
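
To make the n-gram vocabulary less of a black box, here is a rough pure-Python sketch that builds all 1-grams and 2-grams from whitespace-separated tokens and sorts them. It only illustrates the idea: `extract_ngrams` is a hypothetical helper introduced here, not how `CountVectorizer` is implemented internally.

```python
documents = ['cat climb tree',
             'saw cat climb shed',
             'have tree shed shed',
             'dog saw cat tree shed']


def extract_ngrams(tokens, min_n, max_n):
    """Return all n-grams of the token list with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the document
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


# Collect the distinct n-grams over all documents
ngram_set = set()
for doc in documents:
    ngram_set.update(extract_ngrams(doc.split(), 1, 2))

# Sort alphabetically to mirror the column order shown above
manual_ngram_vocabulary = sorted(ngram_set)
print(manual_ngram_vocabulary)
```

For this corpus the sorted list matches the 16 column names printed above.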
