# Feature Extraction

So far, we have seen how to tokenize a document, extract attributes from its tokens, and even create a bag of words. In our second lab, we used the `CountVectorizer` from `scikit-learn` to create a matrix that counts the words in each document. Do you remember how that works?

We vectorize text because most algorithms are not designed to handle raw text. We therefore need to represent each document in a mathematical form so that calculations can be done on it. There are several ways to vectorize text. The simplest is the bag-of-words approach, where we build a vocabulary from all the words in our document collection and then create a matrix that records which vocabulary words are present in each document.
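To make the idea concrete, here is a minimal sketch of a bag of words built by hand in plain Python. The helper name `bag_of_words` and the toy sentences are made up for illustration, and the tokenization (lowercasing, keeping tokens of two or more word characters) only imitates the default behaviour of `CountVectorizer`; it is not how scikit-learn is implemented.

```python
import re

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of raw strings."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    # The vocabulary is the sorted set of all tokens in the collection
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token in tokens:
            row[index[token]] += 1  # count each occurrence
        matrix.append(row)
    return vocabulary, matrix

toy_docs = ["the cat sat", "the cat saw the dog"]  # hypothetical example data
vocabulary, matrix = bag_of_words(toy_docs)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
for row in matrix:
    print(row)     # [1, 0, 1, 0, 1] then [1, 1, 0, 1, 2]
```

Each row is one document and each column is one vocabulary word, which is the same layout the vectorizer produces.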

In [1]:
import pandas as pd
import numpy as np
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import naive_bayes

**Question**
Use the text provided below to create a bag of words using the `CountVectorizer` class, as we did in a previous session.

**Steps** 
Print your vocabulary, then print the array form of the matrix returned by your vectorizer.

In [2]:
docs = ['Elon Musk wants to build a Gigafactory',
        'UK is too risky after the Brexit for a Gigafactory',
        'Tesla wants to build a Gigafactory in Berlin',
        'Brexit has made it too risky for Tesla to put a Gigafactory in the UK.']

In [3]:
count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print(count_vectorizer.get_feature_names())

['after', 'berlin', 'brexit', 'build', 'elon', 'for', 'gigafactory', 'has', 'in', 'is', 'it', 'made', 'musk', 'put', 'risky', 'tesla', 'the', 'to', 'too', 'uk', 'wants']


In [4]:
print(X.toarray())

[[0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1]
 [1 0 1 0 0 1 1 0 0 1 0 0 0 0 1 0 1 0 1 1 0]
 [0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1]
 [0 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0]]


However, we also know that counting is not the only way of representing text with numbers. We can also penalize tokens that occur very often. Why?

Because, in general, very frequent tokens are often less relevant than tokens that appear less often. For instance, suppose we want to retrieve documents that are relevant for the query _The President Donald Trump_. One idea would be to retrieve all documents containing all the words in the query. If the search still returns many documents, we might rank them by how often the query words appear in each one. But _the_ and _president_ may still have many occurrences in documents that are not about our topic. We should therefore give more weight to documents that contain the rarer terms _donald_ and _trump_. But how do we get there? By calculating the term frequency–inverse document frequency (tf-idf).

**Tf-Idf**: the term frequency of a token in a document, multiplied by its inverse document frequency (the logarithm of the total number of documents divided by the number of documents containing the token).

- Tf: Term frequency: $\frac{freq(term, doc)}{\#\,terms \in doc}$
- Idf: Inverse document frequency: $\log\frac{|D|}{\#\,\{d \in D : term \in d\}}$

Notice that we calculate the tf-idf for each term in each document. Let's calculate them for _Elon_ and _Gigafactory_ in the first document.

**Examples:**
- Tf-Idf(Elon) = $\frac{1}{7}\cdot\log_{10}(\frac{4}{1}) \approx 0.143 \cdot 0.602 \approx 0.086$
- Tf-Idf(Gigafactory) = $\frac{1}{7}\cdot\log_{10}(\frac{4}{4}) = 0$
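The two examples can be checked with a few lines of plain Python. This is a sketch of the textbook formula given here, assuming a base-10 logarithm and a simple tokenization that keeps one-letter words such as _a_ (so the first document has 7 terms); the helper names `tf`, `idf` and `tf_idf` are introduced only for this illustration.

```python
import math
import re

example_docs = [
    "Elon Musk wants to build a Gigafactory",
    "UK is too risky after the Brexit for a Gigafactory",
    "Tesla wants to build a Gigafactory in Berlin",
    "Brexit has made it too risky for Tesla to put a Gigafactory in the UK.",
]

# Simple tokenization: lowercase, keep every run of word characters
tokenized = [re.findall(r"\w+", doc.lower()) for doc in example_docs]

def tf(term, tokens):
    # Occurrences of the term divided by the number of terms in the document
    return tokens.count(term) / len(tokens)

def idf(term):
    # log10 of (number of documents / number of documents containing the term)
    containing = sum(term in tokens for tokens in tokenized)
    return math.log10(len(tokenized) / containing)

def tf_idf(term, doc_index):
    return tf(term, tokenized[doc_index]) * idf(term)

print(round(tf_idf("elon", 0), 3))         # about 0.086 without intermediate rounding
print(round(tf_idf("gigafactory", 0), 3))  # 0.0, the term occurs in every document
```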

**Question**
Use the same data and the `TfidfVectorizer` to create a matrix, and explore several of the vectorizer's parameters.

**Steps** 
Use `?` to explore the vectorizer's documentation and refer to [this link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) for more options. Increase the range of n-grams and observe the matrix. Can you see any difference?
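To get a feeling for what a wider n-gram range adds, here is a small sketch in plain Python. The function `word_ngrams` is a hypothetical helper that only imitates the idea behind the `ngram_range` parameter; it is not scikit-learn's implementation.

```python
def word_ngrams(tokens, min_n, max_n):
    """Collect all word n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["tesla", "wants", "to", "build"]
unigrams = word_ngrams(tokens, 1, 1)
uni_and_bigrams = word_ngrams(tokens, 1, 2)
print(unigrams)         # ['tesla', 'wants', 'to', 'build']
print(uni_and_bigrams)  # the unigrams plus 'tesla wants', 'wants to', 'to build'
```

With a range of (1, 2) every pair of neighbouring words becomes an additional feature, so the matrix gets more columns.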

In [5]:
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
train_tfidf_matrix = vectorizer.fit_transform(docs)

# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print(vectorizer.get_feature_names())
print(train_tfidf_matrix.toarray())

['after', 'berlin', 'brexit', 'build', 'elon', 'for', 'gigafactory', 'has', 'in', 'is', 'it', 'made', 'musk', 'put', 'risky', 'tesla', 'the', 'to', 'too', 'uk', 'wants']
[[0.         0.         0.         0.39806    0.50488863 0.
  0.26347183 0.         0.         0.         0.         0.
  0.50488863 0.         0.         0.         0.         0.32226387
  0.         0.         0.39806   ]
 [0.40818453 0.         0.32181737 0.         0.         0.32181737
  0.21300762 0.         0.         0.40818453 0.         0.
  0.         0.         0.32181737 0.         0.32181737 0.
  0.32181737 0.32181737 0.        ]
 [0. 
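The values in this matrix differ from the hand calculation earlier because, with the settings shown in the vectorizer's defaults (`smooth_idf=True`, `norm='l2'`), scikit-learn uses a smoothed idf with a natural logarithm, $\ln\frac{1+|D|}{1+df} + 1$, and then scales each row to unit length. The sketch below reproduces the first row by hand under those assumptions; the tokenization imitates the default token pattern, which drops one-letter words such as _a_.

```python
import math
import re

example_docs = [
    "Elon Musk wants to build a Gigafactory",
    "UK is too risky after the Brexit for a Gigafactory",
    "Tesla wants to build a Gigafactory in Berlin",
    "Brexit has made it too risky for Tesla to put a Gigafactory in the UK.",
]

# Lowercase tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in example_docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    # ln((1 + n_docs) / (1 + document frequency)) + 1
    doc_freq = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

first_doc = tokenized[0]
# Raw count times smoothed idf for every term in the first document
raw = {term: first_doc.count(term) * smoothed_idf(term) for term in set(first_doc)}
# Scale the row so that its Euclidean (l2) length is 1
l2_norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {term: value / l2_norm for term, value in raw.items()}

for term in sorted(weights):
    print(term, round(weights[term], 4))
```

The numbers for _elon_, _gigafactory_, _to_ and _wants_ agree with the first row printed above.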

Let's explore these vectorizers with real data. You probably remember our dataset of Yelp reviews. Let's use it to compare both vectorizers.

In [6]:
data = pd.read_csv("yelp_polarity.txt", sep="\t", header=None)
display(data)

|  | 0 | 1 |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


### Training a classifier

To train any model, you need to split your data. The reason is that you want to test the performance of your classifier at the end, and this can't be done on the same data you trained on. You therefore need to hold out a small set of data that you never use until you test. Let's create our train and test sets:

In [7]:
text_train, text_test, label_train, label_test = train_test_split(data[0], data[1], 
                                                                  test_size=0.20, 
                                                                  random_state=1234, shuffle=True)
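What a shuffled hold-out split does can be sketched in plain Python. The function `holdout_split` below is a hypothetical stand-in written for illustration: it follows the same idea as `train_test_split` (shuffle once with a fixed seed, then cut off a share for testing), but it will not reproduce scikit-learn's exact shuffle.

```python
import random

def holdout_split(items, labels, test_size=0.2, seed=1234):
    """Shuffle the indices once, then reserve a `test_size` share as the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, idx):
        return [sequence[i] for i in idx]

    # Texts and labels are selected with the same indices, so they stay aligned
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

toy_texts = [f"review {i}" for i in range(10)]  # hypothetical data
toy_labels = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = holdout_split(toy_texts, toy_labels)
print(len(x_train), len(x_test))  # 8 2
```

No document ends up in both sets, which is what keeps the final test honest.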

**Question**
Use only the train set to generate a new feature matrix with the `CountVectorizer`. This matrix will be used to train our classifier.

**Hint:** Note that you need to instantiate a new vectorizer.

In [8]:
polarity_count_vectorizer = CountVectorizer()
polarity_bow_matrix = polarity_count_vectorizer.fit_transform(text_train)

Yay!!! We're finally ready to train our first model. To do so, we need to input the train features and their labels. In this case we will use a support vector machine (SVM). If you haven't heard of SVMs yet, don't worry: this won't be difficult. We will use the one provided by sklearn, so you don't have to implement it yourself. SVMs treat each document as a vector.

Imagine all your documents are sample points in a scatter plot. The job of an SVM is to draw a line (a hyperplane) between the two classes so that the gap (the margin) between the hyperplane and the nearest points of each class is as wide as possible.
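As a rough sketch of what such a line looks like in code: a linear classifier scores a point with $w \cdot x + b$ and picks the class from the sign of that score. The weights and bias below are made up by hand for a two-dimensional toy case; a real SVM learns them from the training data. The margin formula $2 / \|w\|$ assumes the usual scaling in which the closest points of each class score exactly $+1$ and $-1$.

```python
import math

# Hypothetical weights and bias, chosen by hand for illustration only
weights = [1.0, -1.0]
bias = 0.0

def decision_value(point):
    # Signed score: positive on one side of the hyperplane, negative on the other
    return sum(w * x for w, x in zip(weights, point)) + bias

def predict(point):
    return 1 if decision_value(point) >= 0 else 0

# Width of the gap between the two classes under the scaling described above
margin_width = 2 / math.sqrt(sum(w * w for w in weights))

print(predict([3.0, 1.0]))     # 1, the score is 2.0
print(predict([1.0, 3.0]))     # 0, the score is -2.0
print(round(margin_width, 3))  # 1.414
```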

Let's instantiate our classifier

In [9]:
classifier = svm.LinearSVC()

Now it's time to train...

In [10]:
classifier.fit(polarity_bow_matrix, label_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

And after training we can test... But first, we need to convert our test data into numerical features. Now it's your turn to test.

**Question**
Vectorize your test set with the same vectorizer we fitted on the train data (call `transform()`, not `fit_transform()`, so that the vocabulary stays the same). After vectorizing, call the classifier's `predict()` method and pass your test matrix in the parentheses. This method returns a class prediction for each document in the matrix.

Use NumPy to compare the original labels with the ones predicted by our model.

**Hint:** Check np.sum and np.equal

In [11]:
# This is a bag-of-words (count) matrix, so we name it accordingly
test_bow_matrix = polarity_count_vectorizer.transform(text_test)

In [12]:
test = classifier.predict(test_bow_matrix)

In [13]:
correct_answers = np.sum(np.equal(test, label_test))

In [14]:
accuracy = correct_answers / (len(test)*1.0) * 100

In [15]:
print(accuracy)

85.5
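The same calculation on a handful of made-up labels shows what `np.equal` and `np.sum` do at each step. The arrays below are hypothetical; `np.mean` is shown as a shortcut because the mean of a boolean array is the share of `True` values.

```python
import numpy as np

predicted = np.array([1, 0, 0, 1, 1])  # hypothetical model output
expected = np.array([1, 0, 1, 1, 0])   # hypothetical true labels

matches = np.equal(predicted, expected)  # element-wise comparison, boolean array
correct = np.sum(matches)                # each True counts as 1
accuracy = correct / len(expected) * 100

print(matches)                 # [ True  True False  True False]
print(correct, accuracy)       # 3 60.0
print(np.mean(matches) * 100)  # same accuracy in one step
```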


## Using word vectors from spaCy

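spaCy offers pretrained vectors for each token. They come inside the models that we load. However, the small model doesn't include real word vectors (in the spaCy version used here it falls back to context-sensitive tensors, which is why `has_vector` is still `True` below). Let's look at what happens...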

In [16]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

In [17]:
raw = "We didn't want to eat."

tokens = nlp(raw)
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov) # oov = Out of Vocabulary

We True 25.470768 True
did True 24.383423 True
n't True 25.47984 True
want True 21.264833 True
to True 24.335705 True
eat True 26.564693 True
. True 23.005432 True


In [18]:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

We We 1.0
We did 0.020710269
We n't 0.09818799
We want 0.10925827
We to -0.034209188
We eat -0.011708566
We . -0.112583846
did We 0.020710269
did did 1.0
did n't 0.11590298
did want 0.16977178
did to 0.023720885
did eat 0.0140754245
did . -0.097842164
n't We 0.09818799
n't did 0.11590298
n't n't 1.0
n't want 0.09714964
n't to 0.06225885
n't eat 0.08804651
n't . -0.13261506
want We 0.10925827
want did 0.16977178
want n't 0.09714964
want want 1.0
want to 0.07914053
want eat 0.5724454
want . -0.01547255
to We -0.034209188
to did 0.023720885
to n't 0.06225885
to want 0.07914053
to to 1.0
to eat 0.06685362
to . 0.20728269
eat We -0.011708566
eat did 0.0140754245
eat n't 0.08804651
eat want 0.5724454
eat to 0.06685362
eat eat 1.0
eat . 0.19326732
. We -0.112583846
. did -0.097842164
. n't -0.13261506
. want -0.01547255
. to 0.20728269
. eat 0.19326732
. . 1.0


**Question**
Download `en_core_web_md`, the medium model for English, and test the token similarity again.

**Hint**
Give your object a different name from `nlp`, so that you can compare both models without overwriting the `nlp` object (maybe `nlp_medium` is a good idea :)). We call the resulting tokens `new_tokens`.

In [19]:
nlp_medium = spacy.load('en_core_web_md')

In [21]:
print(raw)
#print(len(new_tokens))
new_tokens = nlp_medium(raw)

We didn't want to eat.


In [22]:
for token1 in new_tokens:
    for token2 in new_tokens:
        print(token1.text, token2.text, token1.similarity(token2))

We We 1.0
We did 0.69173646
We n't 0.6942249
We want 0.72397226
We to 0.6098714
We eat 0.49724275
We . 0.42758408
did We 0.69173646
did did 1.0
did n't 0.80410373
did want 0.65793914
did to 0.52778155
did eat 0.47160104
did . 0.35375196
n't We 0.6942249
n't did 0.80410373
n't n't 1.0
n't want 0.80974674
n't to 0.57629895
n't eat 0.517009
n't . 0.34257802
want We 0.72397226
want did 0.65793914
want n't 0.80974674
want want 1.0
want to 0.6864664
want eat 0.5250392
want . 0.34755716
to We 0.6098714
to did 0.52778155
to n't 0.57629895
to want 0.6864664
to to 1.0
to eat 0.4138872
to . 0.35827494
eat We 0.49724275
eat did 0.47160104
eat n't 0.517009
eat want 0.5250392
eat to 0.4138872
eat eat 1.0
eat . 0.27974728
. We 0.42758408
. did 0.35375196
. n't 0.34257802
. want 0.34755716
. to 0.35827494
. eat 0.27974728
. . 1.0
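The scores printed by `similarity()` are cosine similarities between the two token vectors. Here is a minimal sketch of that measure in plain Python; the three-dimensional vectors are invented for illustration and have nothing to do with the real 300-dimensional vectors of the model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors
want = [0.9, 0.1, 0.3]
eat = [0.8, 0.2, 0.4]
full_stop = [-0.2, 0.9, 0.1]

print(cosine_similarity(want, want))       # 1.0 up to floating point rounding
print(cosine_similarity(want, eat))        # close to 1: vectors point the same way
print(cosine_similarity(want, full_stop))  # slightly negative: different directions
```

This also explains why every token has a similarity of 1.0 with itself and why the table is symmetric.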


In [29]:
print(new_tokens.vector)
for token in new_tokens:
    print(token.vector)

[ 1.54157141e-02  1.58162847e-01 -2.09189296e-01 -1.48000717e-01
  1.92292836e-02  1.52022853e-01  1.00684287e-02 -4.59917113e-02
 -5.19261286e-02  2.48431444e+00 -3.23732466e-01  7.03217164e-02
  2.87551433e-01  1.11499861e-01 -3.15112829e-01 -5.16908653e-02
 -1.09751567e-01  9.67278600e-01 -3.28650028e-01  9.63080004e-02
  2.25772858e-01  6.71602860e-02  3.80862951e-02  5.82573935e-03
 -2.26128578e-01  7.15370029e-02 -1.75886258e-01 -2.10741282e-01
  1.62428141e-01 -2.70758301e-01 -1.78099707e-01  8.75744373e-02
 -2.17962284e-02  3.26618589e-02  1.71059415e-01 -6.51314333e-02
  2.26513147e-01  1.35602131e-01 -1.07505716e-01 -6.50308281e-03
 -1.84223562e-01  1.20009948e-02  1.04176272e-02 -1.74822569e-01
  4.89668548e-02  4.38821986e-02 -2.09584281e-01  4.87384275e-02
  5.36047556e-02  6.00738600e-02 -1.80440709e-01 -2.76702736e-02
  1.53272718e-01 -8.46770331e-02  4.07411426e-01 -1.01337001e-01
 -1.35096863e-01 -7.91867152e-02  5.74515853e-03  6.83657406e-03
 -3.55541408e-02 -1.35380

In [None]:
print(len(new_tokens[1]))
print(new_tokens[1])
print(new_tokens[1].vector)
print(new_tokens.vector)
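Note that `new_tokens.vector` is the vector of the whole document. By default spaCy computes it as the average of the token vectors. A minimal sketch of that averaging, using invented three-dimensional vectors and a hypothetical helper `average_vector`:

```python
def average_vector(vectors):
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical token vectors, one per token
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
doc_vector = average_vector(toy_token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```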

Tada!!! And with that we finish another lab session, this time having met some word vectors.