# Feature Extraction

So far, we have seen how to tokenize a document, extract attributes from its tokens, and even create a bag of words. In our second lab, we used the `CountVectorizer` from `scikit-learn` to create a matrix that counts the words in each document. Do you remember how that works?

Text is vectorized because most algorithms are not designed to handle raw text. We therefore need to represent each document in a mathematical form so that calculations can be done on it. There are several ways to vectorize text. The easiest one is the bag-of-words approach, where we create a vocabulary with all the words in our document collection and then build a matrix that records which vocabulary words are present in each document.
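To make the idea concrete, here is a minimal sketch of a bag of words built by hand. The toy corpus and the `tokenize` helper are assumptions made for illustration only; a real vectorizer applies more careful tokenization.

```python
import re

# Hypothetical toy corpus, chosen only for illustration.
toy_docs = ["the cat sat", "the dog sat on the cat"]

def tokenize(text):
    # Lowercase and keep runs of word characters; a simplification of
    # what a real vectorizer does.
    return re.findall(r"\w+", text.lower())

# The vocabulary is the sorted set of all tokens in the collection.
vocabulary = sorted({tok for doc in toy_docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary word, holding counts.
bow_matrix = [[tokenize(doc).count(word) for word in vocabulary]
              for doc in toy_docs]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each column position corresponds to one vocabulary word, which is why the vocabulary has to be printed alongside the matrix to read it.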

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import naive_bayes

**Question**
Use the text provided below to create a bag of words using the `CountVectorizer` class as we did in a previous session.

**Steps** 
Print your vocabulary and the array form of the resulting matrix.

In [3]:
docs = ['Elon Musk wants to build a Gigafactory',
        'UK is too risky after the Brexit for a Gigafactory',
        'Tesla wants to build a Gigafactory in Berlin',
        'Brexit has made it too risky for Tesla to put a Gigafactory in the UK.']

In [4]:
count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(docs)
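# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
print(count_vectorizer.get_feature_names())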

['after', 'berlin', 'brexit', 'build', 'elon', 'for', 'gigafactory', 'has', 'in', 'is', 'it', 'made', 'musk', 'put', 'risky', 'tesla', 'the', 'to', 'too', 'uk', 'wants']


In [5]:
print(X.toarray())

[[0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1]
 [1 0 1 0 0 1 1 0 0 1 0 0 0 0 1 0 1 0 1 1 0]
 [0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1]
 [0 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0]]


However, we also know that counting is not the only way of representing text with numbers. We can also penalize tokens that occur very often. Why?

Because, in general, very frequent tokens are often less relevant than tokens that appear less often. For instance, suppose we want to retrieve documents that are relevant for the query _The President Donald Trump_. One idea would be to retrieve all documents containing all the words in the query. Since this search may still return many documents, we might then count how often the query words appear in each of them. But _the_ and _president_ may still have many occurrences in documents that are not about our topic. We should therefore focus on documents that contain the rarer words _donald_ and _trump_. But how do we get there? By calculating the term frequency–inverse document frequency (tf-idf).

**Tf-Idf**: the term frequency of a token multiplied by its inverse document frequency (the logarithm of the total number of documents divided by the number of documents containing the token).

- Tf: Term frequency: $\frac{\text{freq}(term, doc)}{\#\,\text{terms in } doc}$
- Idf: Inverse document frequency: $\log\frac{|D|}{\#\{d \in D : term \in d\}}$

Notice that we calculate the tf-idf for each term in each document. Let's calculate them for _Elon_ and _Gigafactory_ in the first document.

**Examples:**
- Tf-Idf(Elon) = $\frac{1}{7} \cdot \log_{10}(\frac{4}{1}) \approx 0.143 \cdot 0.602 \approx 0.086$
- Tf-Idf(Gigafactory) = $\frac{1}{7} \cdot \log_{10}(\frac{4}{4}) = 0$
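The two examples can be checked with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization (so the first document has 7 tokens, including _a_) and a base-10 logarithm; the helper names `tokenize`, `tf` and `idf` are made up for this illustration.

```python
import math

example_docs = ['Elon Musk wants to build a Gigafactory',
                'UK is too risky after the Brexit for a Gigafactory',
                'Tesla wants to build a Gigafactory in Berlin',
                'Brexit has made it too risky for Tesla to put a Gigafactory in the UK.']

def tokenize(text):
    # Split on whitespace, strip a trailing period, and lowercase.
    return [word.strip(".").lower() for word in text.split()]

corpus = [tokenize(doc) for doc in example_docs]

def tf(term, tokens):
    # Occurrences of the term divided by the number of terms in the document.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log10 of (number of documents / number of documents containing the term).
    doc_freq = sum(term in tokens for tokens in corpus)
    return math.log10(len(corpus) / doc_freq)

tfidf_elon = tf("elon", corpus[0]) * idf("elon", corpus)
tfidf_giga = tf("gigafactory", corpus[0]) * idf("gigafactory", corpus)

print(round(tfidf_elon, 3))  # 0.086
print(round(tfidf_giga, 3))  # 0.0
```

_Gigafactory_ appears in every document, so its idf is $\log(1) = 0$ and the token carries no weight under this definition.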

**Question**
Use the same data and the `TfidfVectorizer` to create a matrix, and explore several of the vectorizer's parameters.

**Steps** 
Use the `?` to explore the function of the vectorizer and refer to [this link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) for more options about the vectorizer. Increase the range of ngrams and observe the matrix. Can you see any difference?
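As a hint for what to expect, the sketch below shows what a word n-gram range means. The helper `word_ngrams` is a hypothetical stand-in written for this illustration, not the vectorizer's own code: each n-gram becomes its own column in the matrix.

```python
def word_ngrams(tokens, min_n, max_n):
    # Collect every run of n consecutive tokens for n in [min_n, max_n].
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = ["tesla", "wants", "to", "build"]

print(word_ngrams(tokens, 1, 1))  # unigrams only
print(word_ngrams(tokens, 1, 2))  # unigrams followed by bigrams
```

With a range of `(1, 2)` the four tokens yield seven features, so a wider range makes the matrix grow in the number of columns.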

In [6]:
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
train_tfidf_matrix = vectorizer.fit_transform(docs)

# Call the method (note the parentheses) to get the vocabulary.
print(vectorizer.get_feature_names())
print(train_tfidf_matrix.toarray())

['after', 'berlin', 'brexit', 'build', 'elon', 'for', 'gigafactory', 'has', 'in', 'is', 'it', 'made', 'musk', 'put', 'risky', 'tesla', 'the', 'to', 'too', 'uk', 'wants']
[[0.         0.         0.         0.39806    0.50488863 0.
  0.26347183 0.         0.         0.         0.         0.
  0.50488863 0.         0.         0.         0.         0.32226387
  0.         0.         0.39806   ]
 [0.40818453 0.         0.32181737 0.         0.         0.32181737
  0.21300762 0.         0.         0.40818453 0.         0.
  0.         0.         0.32181737 0.         0.32181737 0.
  0.32181737 0.32181737 0.        ]
 [0. 
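The numbers in this matrix differ from the hand-computed examples (for instance, _gigafactory_ is not 0). With the settings shown in this notebook (`smooth_idf=True`, `norm='l2'`, and a token pattern that drops one-letter words), scikit-learn uses raw counts as tf, a smoothed idf of $\ln\frac{1 + |D|}{1 + df} + 1$, and then scales each row to unit length. The following is a minimal sketch of that computation for the first document, written in plain Python; the variable names are made up for illustration.

```python
import math
import re

example_docs = ['Elon Musk wants to build a Gigafactory',
                'UK is too risky after the Brexit for a Gigafactory',
                'Tesla wants to build a Gigafactory in Berlin',
                'Brexit has made it too risky for Tesla to put a Gigafactory in the UK.']

# Same token pattern as in the vectorizer settings: words of 2+ characters.
token_re = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_re.findall(doc.lower()) for doc in example_docs]
n_docs = len(tokenized)

def smooth_idf(term):
    # Smoothed idf: ln((1 + n_docs) / (1 + document frequency)) + 1
    doc_freq = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

first = tokenized[0]
# Raw count times smoothed idf for every term of the first document.
raw = {term: first.count(term) * smooth_idf(term) for term in set(first)}

# L2 normalization: divide by the Euclidean length of the row.
norm = math.sqrt(sum(value * value for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}

for term in sorted(weights):
    print(term, round(weights[term], 4))
```

The resulting weights should agree with the non-zero entries in the first row of the matrix, e.g. about 0.5049 for _elon_ and 0.2635 for _gigafactory_.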

Let's explore these vectorizers with real data. You probably remember our data of reviews on Yelp. Let's use it to compare both vectorizers.

In [None]:
data = pd.read_csv("yelp_polarity.txt", sep="\t", header=None)
display(data)

### Training a classifier

In order to train any model, you need to split your data. The reason is that you want to measure the performance of your classifier at the end, and this can't be done on the same data you trained on. Therefore, you need to hold out a small set of data that you never use until you test. Let's create our train and test sets:

In [None]:
text_train, text_test, label_train, label_test = train_test_split(data[0], data[1], 
                                                                  test_size=0.20, 
                                                                  random_state=1234, shuffle=True)

**Question**
Use only the train set to generate a feature matrix with the `CountVectorizer`. This matrix will be used to train our classifier.

**Hint:** Notice that you need to instantiate a new vectorizer.

In [None]:
polarity_count_vectorizer = CountVectorizer()
polarity_bow_matrix = polarity_count_vectorizer.fit_transform(text_train)

Yay!!! Finally we're ready to train our first model. To do so, we need to input the train features and their labels. In this case we will use a support vector machine (SVM). If you haven't heard anything about SVMs yet, don't worry, this won't be difficult: we will use the one provided by sklearn, so you don't have to implement it yourself. SVMs treat each document as a vector.

Imagine all your documents are sample points in a scatter plot. The job of an SVM is to draw a line (hyperplane) between the two classes so that the gap (margin) between the hyperplane and the closest points of each class is as wide as possible.
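The "widest gap" idea can be sketched with a few 2D points. The example below does not train an SVM; it only compares two hypothetical candidate lines $w \cdot x + b = 0$ by the distance to their closest point, which is the quantity an SVM tries to maximize. All points and line parameters are made up for illustration.

```python
import math

# Hypothetical labelled points: (coordinates, class label in {+1, -1}).
points = [((2.0, 2.0), +1), ((3.0, 3.0), +1),
          ((0.0, 0.0), -1), ((1.0, 0.0), -1)]

def min_margin(w, b):
    # Signed distance of each point to the line w.x + b = 0, multiplied by
    # its label; positive for every point means the line separates the classes.
    norm = math.hypot(*w)
    return min(label * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, label in points)

margin_a = min_margin((1.0, 1.0), -2.5)  # line halfway between the classes
margin_b = min_margin((1.0, 1.0), -3.5)  # line hugging the positive class

print(round(margin_a, 3), round(margin_b, 3))
```

Both lines separate the two classes, but the first one leaves a wider gap to its nearest points, so it is the one an SVM would prefer.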

Let's instantiate our classifier

In [None]:
classifier = svm.LinearSVC()

Now it's time to train...

In [None]:
classifier.fit(polarity_bow_matrix, label_train)

And after training we can test... But first, we need to convert our test data into numerical features. Now it's your turn to test.

**Question**
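Vectorize your test set in the same way we did with the train data. Reuse the vectorizer that was fitted on the train set and call `transform()`, not `fit_transform()`, so that train and test features share the same vocabulary. After vectorizing, use the method `predict()` and put your test matrix in the parentheses. This method returns a class prediction for each document in the matrix.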

Use NumPy to compare the original labels with the ones predicted by our model.

**Hint:** Check np.sum and np.equal
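Why the test set has to go through the vectorizer that was fitted on the train set can be sketched in plain Python. The texts and the `tokenize` and `transform` helpers below are assumptions made for illustration: the vocabulary is built from the train texts only, and the test texts are mapped onto those same columns.

```python
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

# Hypothetical toy reviews.
train_texts = ["great food", "terrible service"]
test_texts = ["great service", "awful food"]

# "Fitting": the vocabulary comes from the training texts only.
vocabulary = sorted({tok for text in train_texts for tok in tokenize(text)})

def transform(texts):
    # Count only words from the fixed vocabulary; unseen words are ignored.
    return [[tokenize(text).count(word) for word in vocabulary]
            for text in texts]

test_matrix = transform(test_texts)

print(vocabulary)
print(test_matrix)
```

The word _awful_ never occurred in the train texts, so it has no column and is simply dropped. Fitting a second vocabulary on the test set would instead produce columns that the trained classifier cannot interpret.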

In [None]:
# Reuse the vectorizer fitted on the train set (counts, not tf-idf).
test_bow_matrix = polarity_count_vectorizer.transform(text_test)

In [None]:
test = classifier.predict(test_bow_matrix)

In [None]:
correct_answers = np.sum(np.equal(test, label_test))

In [None]:
# Percentage of test documents whose predicted label matches the true label.
accuracy = correct_answers / len(test) * 100

In [None]:
print(accuracy)
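To see what the comparison does step by step, here is the same logic in plain Python on a handful of made-up labels (the lists below are hypothetical and not taken from the Yelp data): an elementwise equality check, a count of the matches, and a division by the number of test documents.

```python
# Hypothetical predictions and true labels for five documents.
predicted = [1, 0, 1, 1, 0]
actual = [1, 0, 0, 1, 0]

# Elementwise comparison, the role np.equal plays in the cells above.
matches = [p == a for p, a in zip(predicted, actual)]

# True counts as 1 when summed, the role np.sum plays in the cells above.
toy_correct = sum(matches)
toy_accuracy = toy_correct / len(actual) * 100

print(toy_correct, toy_accuracy)  # 4 80.0
```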

## Using word vectors from spaCy

spaCy offers pretrained vectors for each token. They come inside the models that we load. However, the small model doesn't include vectors. Let's look at what happens...

In [None]:
import spacy

# Load the small English model.
nlp = spacy.load('en_core_web_sm')

In [None]:
raw = "We didn't want to eat."

tokens = nlp(raw)
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov) # oov = Out of Vocabulary

In [None]:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

**Question**
Download `en_core_web_md`, which is the medium model for English and try to test the token similarity again.

**Hint**
Give your object a different name from `nlp`, so that you can compare without overwriting the `nlp` object (maybe `nlp_medium` is a good idea :)). We call the resulting tokens `new_tokens`.

In [None]:
# Uncomment the next line to download the model if it is not installed yet.
# !python -m spacy download en_core_web_md
nlp_medium = spacy.load('en_core_web_md')

In [None]:
new_tokens = nlp_medium(raw)

In [None]:
for token1 in new_tokens:
    for token2 in new_tokens:
        print(token1.text, token2.text, token1.similarity(token2))
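By default, spaCy's `similarity` is the cosine similarity between the two vectors. The sketch below shows that computation in plain Python. The three-dimensional vectors are invented for illustration and are far smaller than real word vectors.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; real vectors come from the loaded model.
vec_eat = [0.9, 0.1, 0.3]
vec_drink = [0.8, 0.2, 0.4]
vec_the = [-0.2, 0.9, 0.0]

print(cosine_similarity(vec_eat, vec_eat))    # a vector compared with itself
print(cosine_similarity(vec_eat, vec_drink))  # vectors pointing in similar directions
print(cosine_similarity(vec_eat, vec_the))    # vectors pointing in different directions
```

Values close to 1 mean the vectors point in nearly the same direction, which is why related words tend to score higher than unrelated ones.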

In [None]:
new_tokens.vector.shape

In [None]:
print(new_tokens[0].vector)

Tada!!! And here we finish another lab session, this time having met some word vectors.