Based on Jake VanderPlas's *Python Data Science Handbook*.

In [None]:
import numpy as np
import pandas as pd

# Feature Engineering

The previous lecture covered cross-validation, which is important for finding the right model fit without relying on traditional statistical tools such as the p-value.

Fitting a model in sklearn requires the predictor variables (features) to be arranged in a well-formed matrix in which each row represents an observation and each column represents a variable.

Feature engineering is the process of taking your data and turning it into numbers for that matrix.

## Categorical Features

If you have a categorical variable, sklearn cannot use the categorical information directly. We have to make it numeric.

We might be tempted to make an R-style factor out of the variable, and assign each category an integer value.

For example:

- Red becomes 1
- Blue becomes 2
- Yellow becomes 3
- etc.

This will not work well, however: when sklearn sees the values 1, 2, 3, it treats them as quantities, as if 'yellow' had three times as much of some quantity as 'red' does.

Instead, we take these categories and apply *one-hot* encoding. We create one column for each possible category, and each column holds a 1 or a 0.

One-hot encoding:

- A column for Red with 1 if it is red, and 0 for everything else.
- A column for Blue with 1 if it is blue, and 0 for everything else.
- A column for Yellow with 1 if it is yellow, and 0 for everything else.
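
To make the contrast concrete, here is a minimal sketch that builds both encodings by hand with plain NumPy rather than sklearn. The `colors` array is a made-up example, and the integer codes start at 0 (NumPy's convention) instead of 1 as in the list above.

```python
import numpy as np

# Hypothetical toy data, not taken from the Titanic data set
colors = np.array(['Red', 'Blue', 'Yellow', 'Red'])

# Integer coding: each category gets an arbitrary number (sorted alphabetically),
# which a model would wrongly read as an ordered quantity
categories, codes = np.unique(colors, return_inverse=True)
print(categories)  # the distinct categories, sorted
print(codes)       # one integer per observation

# One-hot coding: one column per category, with a single 1 in each row
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no category is treated as larger or smaller than another.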


In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
titanic = pd.read_csv('titanic_train.csv')

In [None]:
titanic.head()

Using OneHotEncoder for one variable:

In [None]:
# sparse_output replaces the older `sparse` argument (scikit-learn 1.2 and later)
enc = OneHotEncoder(sparse_output=False)
enc

In [None]:
enc.fit( titanic[['Sex']] )

In [None]:
enc.transform(titanic[['Sex']])

In [None]:
enc.fit_transform( titanic[['Sex']] )[0:5,:]

In [None]:
enc.categories_

In [None]:
# put the encoded columns into a DataFrame
# categories_ is a list with one array per input column, so take the first
pd.DataFrame(enc.fit_transform(titanic[['Sex']]), columns=enc.categories_[0]).head()

In [None]:
# sparse_output replaces the older `sparse` argument (scikit-learn 1.2 and later)
enc = OneHotEncoder(sparse_output=False, categories='auto')
enc.fit_transform(titanic[['Pclass']])

In [None]:
enc.categories_

One-hot encoding for multiple columns at the same time:

In [None]:
titanic.columns

In [None]:
titanic_subset = titanic[['Pclass', 'Sex']].dropna()

In [None]:
titanic_subset.head()

In [None]:
titanic_subset.shape

In [None]:
# sparse_output replaces the older `sparse` argument (scikit-learn 1.2 and later)
enc = OneHotEncoder(sparse_output=False, categories='auto')
results = enc.fit_transform(titanic_subset)

In [None]:
results.shape

In [None]:
results[0:5, :]

In [None]:
enc.categories_

In [None]:
# flatten the per-column category arrays into a single list of labels
[value for array in enc.categories_ for value in array]

In [None]:
pd.DataFrame(results, columns=[value for array in enc.categories_ for value in array]).head()

# Imputation of Missing Data

In [None]:
from numpy import nan
X = np.array([[ nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   nan, 6  ],
              [ 8,   8,   1  ]])
y = np.array([14, 16, -1,  8, -5])

In [None]:
# Warning: filling in the mean is not always the best strategy.
# Some problems call for a more sophisticated method, such as the EM algorithm.

# SimpleImputer is still convenient for simple rules such as the column mean.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2
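
As a sanity check on what the mean strategy does, here is a minimal NumPy-only sketch of column-mean imputation. It uses the same matrix as above, stored under the hypothetical name `X_demo` so that nothing in the notebook is overwritten; it illustrates the idea and is not sklearn's implementation.

```python
import numpy as np

X_demo = np.array([[np.nan, 0,      3],
                   [3,      7,      9],
                   [3,      5,      2],
                   [4,      np.nan, 6],
                   [8,      8,      1]])

# Mean of each column, ignoring the missing entries
col_means = np.nanmean(X_demo, axis=0)

# Replace each missing entry with the mean of its own column
X_filled = np.where(np.isnan(X_demo), col_means, X_demo)

print(col_means)
print(X_filled)
```

The missing value in the first column is replaced by 4.5 and the one in the second column by 5.0, which are the means of the observed values in those columns.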

In [None]:
imp = SimpleImputer(strategy='constant', fill_value = -99)
X3 = imp.fit_transform(X)
X3

In [None]:
# you can then use the imputed values for a linear regression model:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X2, y)
model.predict(X2)

# Polynomial Features

In [None]:
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
plt.scatter(x, y);

In [None]:
X = x.reshape(5,1)
poly = PolynomialFeatures(degree=3, include_bias=True)
# include_bias=True adds a column of ones (the zeroth-power term)
X2 = poly.fit_transform(X)
print(X2)

In [None]:
model = LinearRegression().fit(X2, y)
yfit = model.predict(X2)
plt.scatter(x, y)
plt.plot(x, yfit);

# Feature Pipelines

With any of the preceding examples, it can quickly become tedious to do the transformations by hand, especially if you wish to string together multiple steps. For example, we might want a processing pipeline that looks something like this:

1. Impute missing values using the mean
2. Transform features to quadratic
3. Fit a linear regression

In [None]:
from sklearn.pipeline import make_pipeline

model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())

In [None]:
X = np.array([[ nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   nan, 6  ],
              [ 8,   8,   1  ]])
y = np.array([14, 16, -1,  8, -5])

In [None]:
model.fit(X, y)  # X with missing values, from above
print(y)
print(model.predict(X))
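
To see what the pipeline saves, the sketch below runs the same three steps by hand and checks that the predictions agree. The names `X_demo`, `y_demo`, `pipe`, and `manual_pred` are introduced here for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[np.nan, 0,      3],
                   [3,      7,      9],
                   [3,      5,      2],
                   [4,      np.nan, 6],
                   [8,      8,      1]])
y_demo = np.array([14, 16, -1, 8, -5])

# The three steps chained in one object
pipe = make_pipeline(SimpleImputer(strategy='mean'),
                     PolynomialFeatures(degree=2),
                     LinearRegression())
pipe_pred = pipe.fit(X_demo, y_demo).predict(X_demo)

# The same three steps done by hand, passing each result to the next step
X_imputed = SimpleImputer(strategy='mean').fit_transform(X_demo)
X_poly = PolynomialFeatures(degree=2).fit_transform(X_imputed)
manual_pred = LinearRegression().fit(X_poly, y_demo).predict(X_poly)

print(np.allclose(pipe_pred, manual_pred))
print(list(pipe.named_steps))  # step names generated by make_pipeline
```

`make_pipeline` names each step after its lowercased class name, so individual steps can be inspected through `pipe.named_steps`.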

In [None]:
x = np.array([1, 2, nan, 4, 5])  # the position of the missing value matters
y = np.array([4, 2, 1, 3, 7])
X = x.reshape(5, 1)

In [None]:
model.fit(X,y)
model.predict(X)

# Text Features

Another common need in feature engineering is to convert text to a set of representative numerical values. For example, most automatic mining of social media data relies on some form of encoding the text as numbers. One of the simplest methods of encoding data is by word counts: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.

In [None]:
sample = ['problem of evil',
          'evil queen',
          'horizon problem']

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


In [None]:
vec = CountVectorizer()
X = vec.fit_transform(sample)
X

In [None]:
X.toarray()

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out()

In [None]:
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

There are some issues with this approach, however: raw word counts produce features that put too much weight on words that appear very frequently, which can be sub-optimal for some classification algorithms. One way to fix this is term frequency-inverse document frequency (TF–IDF), which weights the word counts by a measure of how often the words appear across the documents.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

<https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>

<https://en.wikipedia.org/wiki/Tf%E2%80%93idf>

In [None]:
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
# Each value is weighted by how often the word appears in the document
# and by how rare the word is across documents; each row is then
# normalized, so documents with fewer words give each word more weight.

# The word 'of' appears only in the first document,
# but the first document has three words, so its weight is 0.68.

# The word 'horizon' appears only in the third document.
# The third document has only two words, so each word is weighted more:
# 'horizon' is weighted 0.796.

# 'evil' appears in the first and second documents,
# so each appearance is weighted a bit less.
# It is worth more in the second document, which has only two words,
# and less in the first document, which has three words.
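
The weights quoted in the comments above can be reproduced by hand. The sketch below is a pure-Python illustration that assumes `TfidfVectorizer`'s default settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It splits on whitespace instead of using sklearn's tokenizer, which gives the same tokens for this sample.

```python
import math
from collections import Counter

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

docs = [text.split() for text in sample]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed inverse document frequency (assumed sklearn default)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

tfidf = []
for doc in docs:
    counts = Counter(doc)
    row = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in row))  # L2 norm of the row
    tfidf.append([value / norm for value in row])

print(vocab)
for text, row in zip(sample, tfidf):
    print(text, [round(value, 3) for value in row])
```

Words that occur in only one document ('of', 'horizon', 'queen') get the larger IDF, and rows with fewer words keep more weight per word after normalization, which is why 'horizon' (0.796) outweighs 'of' (0.681).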