# Text featurization

All Rights Reserved © <a href="http://www.louisdorard.com" style="color: #6D00FF;">Louis Dorard</a>

<img src="http://s3.louisdorard.com.s3.amazonaws.com/ML_icon.png">

This notebook shows how to extract numerical features from a textual feature.

## Simple example

Let's first define a small dataset with 4 inputs and just one textual feature:

In [2]:
# see 4.2.3.3. at http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

We can use scikit-learn to extract one feature per word `w`, whose value for a given input is the number of occurrences of `w` in that input:

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(strip_accents='ascii', min_df=1)
vectorizer.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents='ascii', token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Here are the words that each column corresponds to:

In [8]:
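# Note: get_feature_names() was removed in scikit-learn 1.2;
# on recent versions, call get_feature_names_out() instead.
vectorizer.get_feature_names()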

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Try setting `min_df` to higher values (e.g. 2 or 3): words that appear in fewer than `min_df` documents are then dropped from the vocabulary.
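As a minimal sketch of that experiment, the snippet below refits fresh vectorizers on the same four documents and lists the vocabulary that survives. The names `vocab_2` and `vocab_3` are introduced here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# min_df counts documents, not occurrences: 'second' appears twice,
# but only in one document, so it is dropped as soon as min_df=2.
vocab_2 = sorted(CountVectorizer(min_df=2).fit(corpus).vocabulary_)
vocab_3 = sorted(CountVectorizer(min_df=3).fit(corpus).vocabulary_)

print(vocab_2)  # ['document', 'first', 'is', 'the', 'this']
print(vocab_3)  # ['document', 'is', 'the', 'this']
```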

We can featurize each input of our `corpus` dataset with:

In [9]:
X = vectorizer.transform(corpus)

Here is the new (numerical) representation of the corpus, with one row per input and one column per word:

In [10]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
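To make the counting explicit, here is a rough pure-Python sketch that reproduces the matrix above without scikit-learn. It assumes the default settings shown earlier (lowercasing, tokens of at least two word characters); the helper `tokenize` and the variables `counts`, `vocabulary` and `rows` are names chosen for this illustration, not part of any library.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # which is what the default token_pattern does.
    return re.findall(r'\b\w\w+\b', text.lower())

# One word -> count mapping per document.
counts = [Counter(tokenize(doc)) for doc in corpus]

# Columns are the distinct words, in alphabetical order.
vocabulary = sorted({word for c in counts for word in c})

# A Counter returns 0 for words that are absent from a document.
rows = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in rows:
    print(row)
```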

## Application to StumbleUpon data

Load the data:

In [11]:
import pandas as pd
data = pd.read_csv('/data/kaggle-stumbleupon-evergreen.csv', index_col=1)
data[['body']].head()

| url | body |
|---|---|
| http://techflesh.com/eye-controlled-laptop/ | A lot of companies including heavyweights like... |
| http://www.johnnywander.com/comics/163 | A lot of people have been requesting the frost... |
| http://deliciouslyorganic.net/chicken-and-black-bean-quesadillas/ | Our house is aflutter this week Pete pinned on... |
| http://apac2020.the-diplomat.com/ | A decade into The Asian Century The Diplomat l... |
| http://www.news.com.au/business/markets/apples-tim-cook-says-we-have-too-much-money/story-e6frfm30-1226280357290 | CEO Tim Cook s next challenge is to figure out... |


Featurize:

In [12]:
corpus = data['body'].tolist()
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

See how many features we have:

In [13]:
# On scikit-learn 1.2 and later, use get_feature_names_out() here.
len(vectorizer.get_feature_names())

25806

Apply the transformation:

In [14]:
X = vectorizer.transform(corpus)

Look at the new representation of our dataset:

In [15]:
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
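Almost every entry printed above is zero, because `transform` returns a SciPy sparse matrix that stores only the non-zero counts. As a small sketch on the toy corpus from the first section (renamed `toy_corpus` and `X_toy` here so that it does not overwrite the StumbleUpon variables), the shape and the fraction of non-zero entries can be read without converting to a dense array:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

X_toy = CountVectorizer().fit_transform(toy_corpus)

n_docs, n_words = X_toy.shape
# nnz is the number of entries that are actually stored (non-zero counts).
density = X_toy.nnz / (n_docs * n_words)

print(X_toy.shape)        # (4, 9)
print(X_toy.nnz)          # 19
print(round(density, 2))  # 0.53
```

On a vocabulary of tens of thousands of words the density is typically far lower, which is why the dense view shows only zeros.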

## Feature selection

We could make our models lighter and faster to train by keeping only the `k` most "promising" features. One way to rank them is the $\chi^2$ (chi-squared) test, which scores each feature by how strongly its counts depend on the label:

In [16]:
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=20)
y = data['label'].tolist()
selector.fit(X, y)

SelectKBest(k=20, score_func=<function chi2 at 0x7f919ef8a8c8>)
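To give an idea of what the score measures, here is a pure-Python sketch of the $\chi^2$ statistic for count features, written as I understand scikit-learn's `chi2` to compute it: for each feature, the count observed in each class is compared with the count expected if the feature were independent of the label. The data (`toy_X`, `toy_y`) and the helper `chi2_scores` are made up for this illustration.

```python
def chi2_scores(X, y):
    """Return one chi-squared score per column of the count matrix X."""
    classes = sorted(set(y))
    n_docs = len(y)
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)  # total count of feature j
        score = 0.0
        for c in classes:
            # Count of feature j observed in documents of class c.
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Count expected if feature j were independent of the label.
            expected = total * sum(1 for label in y if label == c) / n_docs
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Hypothetical toy data: 4 documents, 3 word-count features, binary labels.
toy_X = [
    [3, 1, 0],
    [2, 1, 2],
    [0, 1, 1],
    [0, 1, 0],
]
toy_y = [1, 1, 0, 0]

scores = chi2_scores(toy_X, toy_y)
print([round(s, 3) for s in scores])  # [5.0, 0.0, 0.333]

# Indices of the 2 highest-scoring features, in column order.
top_2 = sorted(sorted(range(len(scores)), key=lambda j: -scores[j])[:2])
print(top_2)  # [0, 2]
```

The first feature occurs only in class 1 and gets the highest score; the second has the same count in every document, so it carries no information about the label and scores 0.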

Apply feature selection:

In [17]:
X_new = selector.transform(X)

See what we end up with:

In [18]:
X_new.toarray()

array([[ 0,  9,  0, ...,  0,  9,  5],
       [ 1,  3,  0, ...,  0,  3,  5],
       [ 1, 25,  0, ...,  5, 10,  2],
       ...,
       [ 3,  9,  0, ...,  0,  0,  0],
       [ 0,  6,  0, ...,  1,  3,  0],
       [ 0,  0,  0, ...,  0,  1,  0]], dtype=int64)
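The array alone does not say which words were kept. A fitted `SelectKBest` exposes a boolean mask through `get_support()`, which can be used to index the list of feature names. The sketch below shows the idea on made-up data; `toy_words`, `toy_X`, `toy_y` and `toy_selector` are hypothetical names and values, not taken from the StumbleUpon dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical vocabulary and count matrix (4 documents, 3 words).
toy_words = np.array(['recipe', 'the', 'news'])
toy_X = np.array([
    [3, 1, 0],
    [2, 1, 2],
    [0, 1, 1],
    [0, 1, 0],
])
toy_y = [1, 1, 0, 0]

toy_selector = SelectKBest(chi2, k=2).fit(toy_X, toy_y)

# Boolean mask with one entry per input column: True means "kept".
kept = toy_selector.get_support()

print(kept)             # [ True False  True]
print(toy_words[kept])  # ['recipe' 'news']
```

The same indexing should apply to the objects in this notebook, using the vectorizer's feature names (converted to a NumPy array) and `selector.get_support()`.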

See also [this scikit-learn example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html), which classifies text documents from the 20 newsgroups dataset.