[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ignaziogallo/data-mining/blob/aa20-21/tutorials/data/Text_Features.ipynb)


## Text Features

Another common need in feature engineering is to **convert text to** a set of representative **numerical values**.

For example, most automatic mining of social media data relies on some form of encoding the text as numbers.
* One of the **simplest methods** of encoding data is by **word counts**: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.

#### Example
Consider the following set of three phrases:

In [8]:
sample = ['evil problem evil of evil',
          'queen evil queen',
          'horizon problem']

For a vectorization of this data based on word count, we could **construct a column** for each distinct word ("**problem**", "**evil**", "**horizon**", and so on) and record in each row how many times that word occurs in the corresponding phrase.
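To make the idea concrete, the table can be built by hand with the standard library. This is only a sketch: it assumes that words are separated by whitespace and are already lowercase, and the names `counts`, `vocabulary`, and `table` are chosen here for illustration.

```python
from collections import Counter

sample = ['evil problem evil of evil',
          'queen evil queen',
          'horizon problem']

# Simplifying assumption: tokens are whitespace-separated, lowercase words.
counts = [Counter(phrase.split()) for phrase in sample]

# One column per distinct word, in alphabetical order.
vocabulary = sorted(set().union(*counts))

# One row per phrase; a Counter returns 0 for words that are absent.
table = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in table:
    print(row)
```

Each inner list of `table` is one phrase, and each position corresponds to one word of `vocabulary`.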

#### CountVectorizer
While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's ``CountVectorizer``:

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
X

<3x5 sparse matrix of type '<class 'numpy.longlong'>'
	with 7 stored elements in Compressed Sparse Row format>

#### Result
The result is a **sparse matrix** recording the number of times each word appears;   
it is easier to inspect if we convert this to a ``DataFrame`` with labeled columns:

In [10]:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

|   | evil | horizon | of | problem | queen |
|---|------|---------|----|---------|-------|
| 0 | 3 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 2 |
| 2 | 0 | 1 | 0 | 1 | 0 |
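The "7 stored elements" reported for the sparse matrix can be checked against this table by counting its non-zero cells. The sketch below keeps only those cells in a plain dictionary keyed by (row, column). This is an illustration of the idea, not the Compressed Sparse Row layout that the actual matrix uses, and the names `dense` and `stored` are made up for this example.

```python
# Dense count table from the example (rows = phrases, columns = words).
dense = [[3, 0, 1, 1, 0],
         [1, 0, 0, 0, 2],
         [0, 1, 0, 1, 0]]

# Keep only the non-zero entries as {(row, column): count}.
stored = {(i, j): value
          for i, row in enumerate(dense)
          for j, value in enumerate(row)
          if value != 0}

total_cells = len(dense) * len(dense[0])
print(len(stored), "of", total_cells, "cells are non-zero")
```

With a realistic vocabulary of many thousands of words, the fraction of non-zero cells is typically far smaller, which is why a sparse format is the default.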


#### Problem
There are some issues with this approach, however: the raw word counts lead to features which put **too much weight on words that appear very frequently**, and this can be sub-optimal in some classification algorithms.


#### Solution
One approach to fix this is known as **term frequency-inverse document frequency** (**TF–IDF**), which weights each word count by a measure of how rare the word is across the documents, so that words appearing in most documents contribute less.


### Term frequency
The **simplest choice** is to use the **raw count** of a term $t$ in a document $d$, i.e.,   
> tf = the number of times that term $t$ occurs in document $d$.

### Inverse document frequency
The inverse document frequency is a measure of **how much information the word provides**, i.e., whether it is **common or rare** across all documents.

$$ \mathrm{idf}(t, D) =  \log \frac{N}{1 + |\{d \in D: t \in d\}|}$$

with
* $N$: total number of documents in the corpus, $N = |D|$
* $|\{d \in D: t \in d\}|$  : number of documents where the term $t$ appears.
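As a worked example, the formula above can be applied to the three sample phrases. The sketch assumes whitespace tokenization and the natural logarithm (the formula does not fix the base), and the helper `idf` is a name chosen here for illustration.

```python
import math

sample = ['evil problem evil of evil',
          'queen evil queen',
          'horizon problem']

# Each document is reduced to its set of distinct words
# (assumption: whitespace-separated tokens).
docs = [set(phrase.split()) for phrase in sample]
N = len(docs)

def idf(term):
    # Number of documents that contain the term.
    df = sum(term in d for d in docs)
    # idf(t, D) = log(N / (1 + df)), natural logarithm assumed.
    return math.log(N / (1 + df))

for term in sorted(set().union(*docs)):
    print(term, round(idf(term), 3))
```

With this formula, "evil" and "problem", which each occur in two of the three documents, receive an idf of zero, while the words that occur in a single document receive a positive weight.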

#### TF-IDF in Scikit-Learn
The syntax for computing these features is similar to the previous example:

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

|   | evil | horizon | of | problem | queen |
|---|------|---------|----|---------|-------|
| 0 | 0.875976 | 0.0 | 0.383935 | 0.291992 | 0.0 |
| 1 | 0.355432 | 0.0 | 0.0 | 0.0 | 0.934702 |
| 2 | 0.0 | 0.795961 | 0.0 | 0.605349 | 0.0 |
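These values do not follow directly from the idf formula given earlier. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses a smoothed variant, $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$, and then rescales each row to unit Euclidean length. The sketch below reproduces the table by hand under those assumptions, using whitespace tokenization as a stand-in for the vectorizer's own tokenizer; the helper `smoothed_idf` is a name chosen here for illustration.

```python
import math
from collections import Counter

sample = ['evil problem evil of evil',
          'queen evil queen',
          'horizon problem']

# Raw term counts per document (assumption: whitespace-separated tokens).
counts = [Counter(phrase.split()) for phrase in sample]
vocabulary = sorted(set().union(*counts))
n_docs = len(counts)

def smoothed_idf(term):
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1.
    df = sum(term in c for c in counts)
    return math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for c in counts:
    # tf * idf for every word in the vocabulary.
    weights = [c[t] * smoothed_idf(t) for t in vocabulary]
    # L2 normalization: divide by the Euclidean length of the row.
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([round(w / norm, 6) for w in weights])

for row in rows:
    print(row)
```

The added constant keeps words that occur in every document from being zeroed out entirely, and the normalization makes rows comparable regardless of phrase length.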
