There is no shortage of tutorials on the web that show you how to recognize digits from images in the MNIST corpus using only 10 lines of Python. It is certainly remarkable how low the barriers to entry are for constructing and training neural networks. But when you attempt to solve a real-world problem, you may find that these tutorials don't apply. You may not have enough data to do anything useful with neural networks. (Maybe you want to use a simpler regression or tree-based model, but that's a post for another day and another author.) Your data might be messy: a big pile of numeric and text features, unnormalized and unembedded. You need to do __feature engineering__, and you're in luck: `scikit-learn` has a number of utilities to make this easier.

In this post, we'll survey some techniques for pre-processing your data using `scikit-learn`. These techniques are broadly applicable, transforming input data into forms that are useful for various machine learning algorithms.

In [1]:
import itertools
import numpy as np
import warnings; warnings.filterwarnings('ignore')


from sklearn import feature_extraction, preprocessing

# the following is purely for the purposes of pretty printing matrices
from IPython.display import display
import sympy; sympy.init_printing()

def display_matrix(m):
    display(sympy.Matrix(m))

## CountVectorizer

Below, we have a number of opening sentences and phrases from famous novels. To use the parlance of [NLP](https://en.wikipedia.org/wiki/Natural_Language_Processing), each line below is a (short) _document_, and the collection of documents is a _corpus_.

In [2]:
corpus = [
    "Call me Ishmael.",
    "It is a truth universally acknowledged",
    "A screaming comes across the sky.",
    "Many years later, as he faced the firing",
    "Happy families are all alike.",
    "It was a bright cold day in April",
    "I am an invisible man."
]

A simple way to make this data more useful is to count the number of occurrences of each word in each document. To do this, we can use a `CountVectorizer` from the `feature_extraction.text` module of `sklearn`.
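To make the counting concrete, here is a rough hand-rolled equivalent that uses only the standard library. It is a sketch, not scikit-learn's implementation: the helper names `tokenize` and `count_matrix` are made up for illustration, and the regular expression is copied from the default `token_pattern` that appears in the `CountVectorizer` repr in this post.

```python
import re
from collections import Counter

# Same pattern as the default token_pattern shown in the CountVectorizer repr:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(document):
    """Lowercase the document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(document.lower())


def count_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


docs = ["Call me Ishmael.", "I am an invisible man."]
vocabulary, rows = count_matrix(docs)
print(vocabulary)  # ['am', 'an', 'call', 'invisible', 'ishmael', 'man', 'me']
print(rows)        # [[0, 0, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0]]
```

Note that the single-letter word "I" never reaches the vocabulary, because the pattern only matches tokens of at least two characters.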

In [3]:
count_vectorizer = feature_extraction.text.CountVectorizer()

count_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]:
count_vectorizer.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
result = count_vectorizer.transform(corpus)

result

<7x35 sparse matrix of type '<class 'numpy.int64'>'
	with 37 stored elements in Compressed Sparse Row format>

In [6]:
display_matrix(result.todense())

⎡0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0 
⎢                                                                             
⎢1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  1  0  0 
⎢                                                                             
⎢0  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0 
⎢                                                                             
⎢0  0  0  0  0  0  0  0  1  0  0  0  0  0  1  0  1  0  1  0  0  0  0  0  1  0 
⎢                                                                             
⎢0  0  1  1  0  0  0  1  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0  0  0 
⎢                                                                             
⎢0  0  0  0  0  0  1  0  0  1  0  1  0  1  0  0  0  0  0  1  0  0  0  1  0  0 
⎢                                                                             
⎣0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  

In [8]:
' '.join(count_vectorizer.get_feature_names())

'acknowledged across alike all am an april are as bright call cold comes day faced families firing happy he in invisible is ishmael it later man many me screaming sky the truth universally was years'

In [9]:
count_vectorizer = feature_extraction.text.CountVectorizer(
    min_df=1, stop_words='english')

display_matrix(count_vectorizer.fit_transform(corpus).todense())

⎡0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0⎤
⎢                                                          ⎥
⎢1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0⎥
⎢                                                          ⎥
⎢0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  1  1  0  0  0⎥
⎢                                                          ⎥
⎢0  0  0  0  0  0  0  1  0  1  0  0  0  1  0  0  0  0  0  1⎥
⎢                                                          ⎥
⎢0  1  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0  0  0  0⎥
⎢                                                          ⎥
⎢0  0  1  1  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0⎥
⎢                                                          ⎥
⎣0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0⎦

In [10]:
' '.join(count_vectorizer.get_feature_names())

'acknowledged alike april bright cold comes day faced families firing happy invisible ishmael later man screaming sky truth universally years'

In [11]:
repetitive_doc = 'On a cold, cold day, years and years ago'

counts = count_vectorizer.transform(corpus + [repetitive_doc])

display_matrix(counts.todense())

⎡0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0⎤
⎢                                                          ⎥
⎢1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0⎥
⎢                                                          ⎥
⎢0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  1  1  0  0  0⎥
⎢                                                          ⎥
⎢0  0  0  0  0  0  0  1  0  1  0  0  0  1  0  0  0  0  0  1⎥
⎢                                                          ⎥
⎢0  1  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0  0  0  0⎥
⎢                                                          ⎥
⎢0  0  1  1  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0⎥
⎢                                                          ⎥
⎢0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0⎥
⎢                                                          ⎥
⎣0  0  0  0  2  0  1  0  0  0  0  0  0  0  0  0  0  0  0  2⎦

## TfidfTransformer

Take a matrix of counts, like the one above, and transform it into a matrix of tf-idf weights: each count is scaled down the more documents in the corpus contain that term, and each document's row is then normalized.

In [12]:
tfidf_transformer = feature_extraction.text.TfidfTransformer()

display_matrix(tfidf_transformer.fit_transform(counts).todense()[-2:])

⎡0.0  0.0  0.0  0.0         0.0         0.0         0.0         0.0  0.0  0.0  0.0  0.707106781186547  0.0  0.0  0.707106781186547  0.0  0.0  0.0  0.0         0.0       ⎤
⎢                                                                                                                                                                        ⎥
⎣0.0  0.0  0.0  0.0  0.666666666666667  0.0  0.333333333333333  0.0  0.0  0.0  0.0         0.0         0.0  0.0         0.0         0.0  0.0  0.0  0.0  0.666666666666667⎦
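The last row of that output can be reproduced by hand. The sketch below assumes the transformer's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The function name `tfidf_row` and the dictionaries are made up for illustration; the document frequencies were counted from the eight documents passed to the transformer.

```python
import math


def tfidf_row(term_counts, doc_freqs, n_docs):
    """Weight one document's counts by smoothed IDF, then L2-normalize."""
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freqs[term])) + 1)
        for term, tf in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# 'On a cold, cold day, years and years ago', restricted to the fitted
# vocabulary (stop words and unseen words contribute nothing).
term_counts = {"cold": 2, "day": 1, "years": 2}
# Each of these terms occurs in 2 of the 8 documents.
doc_freqs = {"cold": 2, "day": 2, "years": 2}

row = tfidf_row(term_counts, doc_freqs, n_docs=8)
print(row)
```

Because all three terms share the same document frequency, the IDF factor cancels during normalization and the weights come out as roughly 2/3, 1/3, and 2/3.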

## DictVectorizer

Turn a list of feature=>value pairs into a matrix.
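As a rough mental model, the sketch below builds the same kind of `key=value` indicator columns by hand. It is an illustration only, not the library's code: `fit_feature_names` and `vectorize` are hypothetical helpers, and the sketch handles string values only.

```python
def fit_feature_names(records):
    """Collect the sorted 'key=value' names seen in the training records."""
    return sorted(
        {f"{key}={value}" for record in records for key, value in record.items()}
    )


def vectorize(record, feature_names):
    """One boolean per known feature; unseen key=value pairs are ignored."""
    present = {f"{key}={value}" for key, value in record.items()}
    return [name in present for name in feature_names]


records = [
    {"browser": "Chrome", "os": "Mac"},
    {"browser": "Firefox", "os": "Windows"},
]
feature_names = fit_feature_names(records)
print(feature_names)
# ['browser=Chrome', 'browser=Firefox', 'os=Mac', 'os=Windows']
print(vectorize({"browser": "Chrome", "os": "Android"}, feature_names))
# [True, False, False, False]
```

A value that was not present when the feature names were collected, such as `os=Android` here, simply has no column to land in.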

In [13]:
# the built-in bool is used because the np.bool alias was removed in NumPy 1.24
vectorizer = feature_extraction.DictVectorizer(dtype=bool, sparse=False)

In [14]:
user_data = [
    {'browser': 'Chrome', 'os': 'Mac'},
    {'browser': 'Firefox', 'os': 'Windows'},
    {'browser': 'Safari', 'os': 'iOS'},
    {'browser': 'Firefox', 'os': 'Linux'},
    {'browser': 'Chrome', 'os': 'Windows'},
    {'browser': 'Chrome', 'os': 'Mac'}
]

In [15]:
vectorizer.fit(user_data)

DictVectorizer(dtype=<class 'bool'>, separator='=', sort=True, sparse=False)

In [16]:
result = vectorizer.transform(user_data)

result

array([[ True, False, False, False,  True, False, False],
       [False,  True, False, False, False,  True, False],
       [False, False,  True, False, False, False,  True],
       [False,  True, False,  True, False, False, False],
       [ True, False, False, False, False,  True, False],
       [ True, False, False, False,  True, False, False]], dtype=bool)

In [17]:
vectorizer.get_feature_names()

['browser=Chrome',
 'browser=Firefox',
 'browser=Safari',
 'os=Linux',
 'os=Mac',
 'os=Windows',
 'os=iOS']

In [18]:
def vectorize_dict(vectorizer, d):
    return vectorizer.transform([d])[0]

In [19]:
encountered_values = {'browser': 'Chrome', 'os': 'Mac'}

enc_vec = vectorize_dict(vectorizer, encountered_values)

enc_vec

array([ True, False, False, False,  True, False, False], dtype=bool)

In [20]:
def decode_vectorized_dict(vectorizer, v):
    return np.array(vectorizer.get_feature_names())[v]

In [21]:
decode_vectorized_dict(vectorizer, enc_vec)

array(['browser=Chrome', 'os=Mac'],
      dtype='<U15')

In [22]:
unencountered_values = {'browser': 'Chrome', 'os': 'Android'}

unenc_vec = vectorize_dict(vectorizer, unencountered_values)

unenc_vec

array([ True, False, False, False, False, False, False], dtype=bool)

In [23]:
decode_vectorized_dict(vectorizer, unenc_vec)

array(['browser=Chrome'],
      dtype='<U15')

In [24]:
vectorizer.fit(user_data + [unencountered_values])

DictVectorizer(dtype=<class 'bool'>, separator='=', sort=True, sparse=False)

In [28]:
decode_vectorized_dict(vectorizer, unenc_vec)

IndexError: boolean index did not match indexed array along dimension 0; dimension is 8 but corresponding boolean dimension is 7

In [None]:
vectorizer.inverse_transform(unenc_vec)

## FeatureHasher

Transform a list of feature=>value pairs into a matrix, but without needing to know the range of possible values, and with a risk of hash collision.
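The idea behind the hashing trick can be sketched in a few lines. This is a simplified illustration under stated assumptions: scikit-learn uses a signed 32-bit MurmurHash3, whereas the sketch substitutes MD5 from the standard library, so the column positions will not match the library's output. The function name `hash_features` is made up.

```python
import hashlib


def hash_features(record, n_features=10):
    """Map a dict of features to a fixed-length vector via hashing."""
    row = [0.0] * n_features
    for key, value in record.items():
        # String values become a 'key=value' indicator; numbers keep their value.
        if isinstance(value, str):
            name, amount = f"{key}={value}", 1.0
        else:
            name, amount = key, float(value)
        # MD5 is a stand-in hash for this sketch, not what scikit-learn uses.
        digest = int.from_bytes(
            hashlib.md5(name.encode("utf-8")).digest()[:8], "big"
        )
        index = digest % n_features
        # One bit of the hash picks a sign, so colliding features tend to
        # cancel out rather than pile up.
        sign = 1.0 if (digest >> 63) & 1 == 0 else -1.0
        row[index] += sign * amount
    return row


print(hash_features({"name": "Alice", "hobby": "weiqi", "age": 32}))
print(hash_features({"name": "Dan", "age": 54, "shoe_size": 10}))
```

No fitting step is needed: a feature never seen before, such as `shoe_size`, still hashes to one of the ten columns, at the cost that two different features may share a column.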

In [None]:
hasher = feature_extraction.FeatureHasher(n_features=10)

In [None]:
people = [
    {'name': 'Alice', 'hobby': 'weiqi', 'age': 32},
    {'name': 'Bob', 'state': 'PA', 'age': 27},
    {'name': 'Carol', 'hobby': 'chess', 'job': 'Engineer'}
]

In [None]:
r = hasher.transform(people)

display_matrix(r.todense())

In [None]:
display_matrix(hasher.transform([{'name': 'Dan', 'age': 54, 'shoe_size': 10}]).todense())

## Standardization scaling

Scale an array so that its mean is 0 and its variance is 1.
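The arithmetic is simply `(x - mean) / std`, using the population standard deviation. A minimal standard-library sketch (the function name `standardize` is made up for illustration):

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, i.e. ddof=0
    return [(v - mean) / std for v in values]


ages = [21, 78, 36, 61, 56, 30, 64, 25, 60, 17]
scaled = standardize(ages)
# Mean is 0 and variance is 1, up to floating-point error.
print(statistics.fmean(scaled), statistics.pvariance(scaled))
```

Because of floating-point rounding, the mean will usually be a tiny number rather than exactly zero, so compare with a tolerance instead of with `==`.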

In [None]:
ages = np.array([21, 78, 36, 61, 56, 30, 64, 25, 60, 17])

ages.mean(), ages.var()

In [None]:
ages_scaled = preprocessing.scale(ages)

ages_scaled.mean(), ages_scaled.var()

In [None]:
np.isclose(ages_scaled.mean(), 0), np.isclose(ages_scaled.var(), 1)

## Min-max scaling

Scale an array to a given range.
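In formula form, each value is mapped to `(x - min) / (max - min) * (high - low) + low`. A minimal sketch, with the made-up name `min_max_scale` and the assumption that the input contains at least two distinct values:

```python
def min_max_scale(values, feature_range=(0, 1)):
    """Linearly map values so the smallest hits low and the largest hits high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    # Assumes v_max > v_min; a constant input would divide by zero here.
    return [(v - v_min) / (v_max - v_min) * (high - low) + low for v in values]


ages = [21, 78, 36, 61, 56, 30, 64, 25, 60, 17]
scaled = min_max_scale(ages, feature_range=(0, 10))
print(min(scaled), max(scaled))  # 0.0 10.0
```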

In [None]:
ages.min(), ages.max()

In [None]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 10))

# scalers expect a 2D array of shape (n_samples, n_features)
r = min_max_scaler.fit_transform(ages.reshape(-1, 1))

r

In [None]:
r.min(), r.max()

## Binarization

Binarize an array such that values strictly greater than `threshold` map to 1 (true) and all other values, including those equal to `threshold`, map to 0 (false).
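The rule is a single strict comparison per element. A minimal sketch, with the made-up helper name `binarize`:

```python
def binarize(values, threshold=0.0):
    """Return 1 for values strictly above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


ages = [21, 78, 36, 61, 56, 30, 64, 25, 60, 17]
print(binarize(ages, threshold=60))  # [0, 1, 0, 1, 0, 0, 1, 0, 0, 0]
```

The age of exactly 60 maps to 0 because the comparison is strict.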

In [None]:
retirement_binarizer = preprocessing.Binarizer(threshold=60)

In [None]:
display_matrix(retirement_binarizer.fit_transform(ages.reshape(1, -1)))

## One-hot encoding

Encode categorical values such that each unique value is a boolean in an array.
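One detail worth keeping in mind is that a one-hot encoder treats each column of its input as a separate categorical feature. The standard-library sketch below illustrates that behavior; `one_hot_columns` and the two sample rows are made up for illustration and are not the library's implementation.

```python
def one_hot_columns(rows):
    """One-hot encode each column independently."""
    n_columns = len(rows[0])
    # Each column gets its own sorted list of categories.
    categories = [sorted({row[j] for row in rows}) for j in range(n_columns)]
    encoded = [
        [
            int(row[j] == category)
            for j in range(n_columns)
            for category in categories[j]
        ]
        for row in rows
    ]
    return categories, encoded


rows = [
    ["Action", "Drama"],
    ["Drama", "Action"],
]
categories, encoded = one_hot_columns(rows)
print(categories)  # [['Action', 'Drama'], ['Action', 'Drama']]
print(encoded)     # [[1, 0, 0, 1], [0, 1, 1, 0]]
```

Both rows contain the same two genres, yet they encode differently, because "Drama in the first column" and "Drama in the second column" are distinct features.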

In [None]:
one_hot_encoder = preprocessing.OneHotEncoder(sparse=False)

In [None]:
movies_genres = [
    ['Action', 'Horror', 'Sci-Fi'],
    ['Action', 'Drama', 'Romance'],
    ['Comedy', 'Drama', 'Sci-Fi'],
]

try:
    one_hot_encoder.fit(movies_genres)
except ValueError as e:
    print(e)

In [None]:
# sorted() keeps the genre order, and therefore the integer codes, stable between runs
all_genres = sorted(set(itertools.chain.from_iterable(movies_genres)))

all_genres

In [None]:
movies_genres_ints = [
    [all_genres.index(genre) for genre in movie]
    for movie in movies_genres
]

movies_genres_ints

In [None]:
one_hot_encoder.fit_transform(movies_genres_ints)

## LabelEncoder

Encode labels (i.e. categories) such that each label is always represented by the same unique integer in the output.
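Conceptually this is a lookup table from each sorted unique label to its position. A minimal sketch, with the made-up helper name `fit_labels`:

```python
def fit_labels(labels):
    """Return the sorted unique labels and a label -> integer mapping."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


genres = ["Sci-Fi", "Action", "Drama", "Horror", "Romance", "Comedy", "Drama"]
classes, to_index = fit_labels(genres)
print(classes)
# ['Action', 'Comedy', 'Drama', 'Horror', 'Romance', 'Sci-Fi']

encoded = [to_index[genre] for genre in ["Action", "Horror", "Sci-Fi"]]
print(encoded)  # [0, 3, 5]

# Decoding is just indexing back into the class list.
decoded = [classes[index] for index in encoded]
print(decoded)  # ['Action', 'Horror', 'Sci-Fi']
```

The integers carry no meaning beyond identity, so a model that reads them as magnitudes would infer an ordering among genres that does not exist.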

In [None]:
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(all_genres)

In [None]:
np.array([label_encoder.transform(movie) for movie in movies_genres])

## LabelBinarizer

Binarize labels such that a matrix is returned in which each row is a one-hot encoded label.

In [None]:
label_binarizer = preprocessing.LabelBinarizer()

In [None]:
label_binarizer.fit(all_genres)

In [None]:
for movie in movies_genres:
    print(movie)
    print(label_binarizer.transform(movie))
    print()

## MultiLabelBinarizer

Binarize multiple labels per sample, as we did with the one-hot encoder above, but without all the struggle.
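The result has one column per distinct label and one row per sample, with a 1 wherever the sample carries that label. A minimal standard-library sketch of the idea (the name `multi_label_binarize` is made up, and this is not the library's code):

```python
def multi_label_binarize(samples):
    """Return (classes, rows); rows[i][j] is 1 if sample i has label j."""
    classes = sorted({label for sample in samples for label in sample})
    rows = [[int(label in sample) for label in classes] for sample in samples]
    return classes, rows


movies = [
    ["Action", "Horror", "Sci-Fi"],
    ["Action", "Drama", "Romance"],
    ["Comedy", "Drama", "Sci-Fi"],
]
classes, rows = multi_label_binarize(movies)
print(classes)
# ['Action', 'Comedy', 'Drama', 'Horror', 'Romance', 'Sci-Fi']
print(rows)
# [[1, 0, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0], [0, 1, 1, 0, 0, 1]]
```

Unlike a per-column one-hot encoding, the position of a genre within a movie's list does not matter here, only its presence.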

In [None]:
multi_label_binarizer = preprocessing.MultiLabelBinarizer()

In [None]:
multi_label_binarizer.fit_transform(movies_genres)