<img src='img/logo.png'>
<img src='img/title.png'>

# Table of Contents
* [Feature extraction and selection](#Feature-extraction-and-selection)
	* [Categorical Variables](#Categorical-Variables)
		* [One-Hot-Encoding (Dummy variables)](#One-Hot-Encoding-%28Dummy-variables%29)
			* [Checking string-encoded categorical data](#Checking-string-encoded-categorical-data)
		* [Numbers can encode categoricals](#Numbers-can-encode-categoricals)
	* [Binning, Discretization, Linear Models and Trees](#Binning,-Discretization,-Linear-Models-and-Trees)
	* [Interactions and Polynomials](#Interactions-and-Polynomials)
		* [Scaling before adding polynomial terms](#Scaling-before-adding-polynomial-terms)
	* [Univariate Non-linear transformations](#Univariate-Non-linear-transformations)
	* [Automatic Feature Selection](#Automatic-Feature-Selection)
		* [Univariate statistics](#Univariate-statistics)
		* [Model-based Feature Selection](#Model-based-Feature-Selection)
		* [Recursive Feature Elimination](#Recursive-Feature-Elimination)
* [Summary](#Summary)


# Feature extraction and selection

Feature extraction can include encoding of text and categorical data to a sparse integer matrix, as shown in this notebook.

Feature extraction can also include more specialized text processing, such as word counts, term frequency-inverse document frequency (TF-IDF) transformation, or hashing vectorization, which are all discussed in the Scikit-learn documentation page [Working with Text Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

More documentation of extraction:

http://scikit-learn.org/stable/auto_examples/cluster/plot_dict_face_patches.html

http://scikit-learn.org/stable/modules/feature_extraction.html#patch-extraction

Feature selection consists of using feature statistics to find a subset of the predictor matrix columns that is likely to explain variance in the dependent variable.  Here is an overview of feature selection with scikit-learn:

http://scikit-learn.org/stable/modules/feature_selection.html

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams['image.interpolation'] = "none"
np.set_printoptions(precision=3)

import src.mglearn as mglearn

## Categorical Variables

### One-Hot-Encoding (Dummy variables)

Expansion of categorical string columns to dummy numeric matrices.
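
As a minimal sketch of what this expansion looks like, the block below applies `pd.get_dummies` to a tiny, hypothetical dataframe (the `toy` rows are made up for illustration and are not read from `adult.csv`). The numeric column passes through unchanged, while each distinct string value becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy rows standing in for a few records of the adult dataset
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# Numeric columns pass through; each string category becomes its own 0/1 column
toy_dummies = pd.get_dummies(toy, dtype=int)
print(list(toy_dummies.columns))
print(toy_dummies)
```

The new column names are built as `<original column>_<category value>`, which is why the real dataset below produces names such as `income_ >50K`.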

In [None]:
import os
import pandas as pd

data = pd.read_csv(os.path.join("data", "adult.csv"))
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']]
data.head()

#### Checking string-encoded categorical data

In [None]:
data.gender.value_counts()

Note the expansion of the matrix columns with `pd.get_dummies`:

In [None]:
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))

In [None]:
data_dummies.head()

In [None]:
# Get only the columns containing features, that is all columns from 'age' to 'occupation_ Transport-moving'
# This range contains all the features but not the target

# .loc performs label-based slicing (the older .ix indexer was removed from pandas)
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# extract numpy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print(X.shape, y.shape)

With the categorical string columns now expanded into new indicator integer columns, we can use the matrix as the `X` argument to a model method like `fit`.  Below we pass the encoded categorical matrix to `LogisticRegression`.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))

### Numbers can encode categoricals

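The following cells show how `pd.get_dummies` treats a dataframe with an integer column and a categorical string column with 3 distinct values.  By default only the string column is expanded; casting the integer column to `str` makes `get_dummies` expand it as well.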

In [None]:
# create a dataframe with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1], 'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
demo_df

In [None]:
pd.get_dummies(demo_df)

In [None]:
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
pd.get_dummies(demo_df)
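
An alternative to casting with `astype(str)` is the `columns` argument of `pd.get_dummies`, which names the columns to encode regardless of dtype. The sketch below rebuilds the same small dataframe under a new name (`demo`, chosen here so the block is self-contained) and encodes both columns.

```python
import pandas as pd

demo = pd.DataFrame({
    "Integer Feature": [0, 1, 2, 1],
    "Categorical Feature": ["socks", "fox", "socks", "box"],
})

# Listing a column in `columns` forces it to be one-hot encoded,
# even when its dtype is numeric
encoded = pd.get_dummies(demo, columns=["Integer Feature", "Categorical Feature"])
print(list(encoded.columns))
print(encoded.shape)
```

This leaves the original dtype of the dataframe untouched, which can be convenient when the integer column is still needed as a number elsewhere.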

## Binning, Discretization, Linear Models and Trees

Binning converts a continuous feature to a categorical one and can be useful as a preprocessing step before training / prediction and in statistical analysis more generally.  

The following cells show how a `DecisionTreeRegressor` and `LinearRegression` can give similar results with binning of input data.  See also [the help for `numpy.digitize`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html).

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = mglearn.datasets.make_wave(n_samples=100)
plt.plot(X[:, 0], y, 'o')
line = np.linspace(-3, 3, 1000)[:-1].reshape(-1, 1);

In [None]:
lin_reg = LinearRegression().fit(X, y)
dec_reg = DecisionTreeRegressor(min_samples_split=3).fit(X, y)

In [None]:
plt.plot(X[:, 0], y, 'o')
plt.plot(line, lin_reg.predict(line), label="linear regression")
plt.plot(line, dec_reg.predict(line), label="decision tree")
plt.ylabel("regression output")
plt.xlabel("input feature")
plt.legend(loc="best");

The figure above compares the straight-line fit of `LinearRegression` with the piecewise-constant fit of a `DecisionTreeRegressor` trained with `min_samples_split=3`.

In [None]:
np.set_printoptions(precision=2)
bins = np.linspace(-3, 3, 11)
bins

In [None]:
which_bin = np.digitize(X, bins=bins)
print("\nData points:\n", X[:5])
print("\nBin membership for data points:\n", which_bin[:5])
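
To make the bin numbering concrete, here is a small sketch with three hand-picked points (the `points` values are illustrative assumptions). It also one-hot encodes the bin indices with plain NumPy indexing, which is a swapped-in shortcut for illustration rather than the `OneHotEncoder` approach used in this notebook.

```python
import numpy as np

bins = np.linspace(-3, 3, 11)          # 11 edges define 10 bins
points = np.array([-2.9, 0.1, 2.9])

# digitize returns i such that bins[i-1] <= x < bins[i], so indices start at 1
which_bin = np.digitize(points, bins=bins)
print(which_bin)

# One-hot encode by picking rows of an identity matrix
# (a plain-NumPy alternative to OneHotEncoder, used here only for illustration)
one_hot = np.eye(len(bins) - 1)[which_bin - 1]
print(one_hot.shape)
```

Each row of `one_hot` contains a single 1 in the column of the bin that the point falls into.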

In [None]:
from sklearn.preprocessing import OneHotEncoder
# transform using the OneHotEncoder.
encoder = OneHotEncoder(sparse=False)
# encoder.fit finds the unique values that appear in which_bin
encoder.fit(which_bin)
# transform creates the one-hot encoding
X_binned = encoder.transform(which_bin)
print(X_binned[:5])

In [None]:
X_binned.shape

In [None]:
line_binned = encoder.transform(np.digitize(line, bins=bins))

plt.plot(X[:, 0], y, 'o')
reg = LinearRegression().fit(X_binned, y)
plt.plot(line, reg.predict(line_binned), label='linear regression binned')

reg = DecisionTreeRegressor(min_samples_split=3).fit(X_binned, y)
plt.plot(line, reg.predict(line_binned), label='decision tree binned')
for bin in bins:
    plt.plot([bin, bin], [-3, 3], ':', c='k')
plt.legend(loc="best")
plt.suptitle("linear_binning");

## Interactions and Polynomials

Interaction terms add new columns to the input matrix so that a model can capture the joint effect of two or more features rather than only their independent effects. 

The next few cells demonstrate the effect of an interaction term in linear regression on the binned wave data.
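
As a rough illustration of why an interaction column can matter, the sketch below builds a synthetic target that depends only on the product of two features (the data and the names `X_toy`, `y_toy` are assumptions made for this example). A linear model fit on the raw features explains almost none of the variance, while the same model with the product column added fits the target essentially exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 2))
y_toy = X_toy[:, 0] * X_toy[:, 1]      # target depends only on the product

# Without the product column, a linear model has nothing useful to fit
score_plain = LinearRegression().fit(X_toy, y_toy).score(X_toy, y_toy)

# Appending the interaction column lets the same model fit the target
X_inter = np.hstack([X_toy, (X_toy[:, 0] * X_toy[:, 1]).reshape(-1, 1)])
score_inter = LinearRegression().fit(X_inter, y_toy).score(X_inter, y_toy)

print(score_plain, score_inter)
```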

In [None]:
X_combined = np.hstack([X, X_binned])
print(X_combined.shape)

In [None]:
plt.plot(X[:, 0], y, 'o')

reg = LinearRegression().fit(X_combined, y)

line_combined = np.hstack([line, line_binned])
plt.plot(line, reg.predict(line_combined), label='linear regression combined')

for bin in bins:
    plt.plot([bin, bin], [-3, 3], ':', c='k')
plt.legend(loc="best");

In [None]:
X_product = np.hstack([X_binned, X * X_binned])
print(X_product.shape)

In [None]:
plt.plot(X[:, 0], y, 'o')
    
reg = LinearRegression().fit(X_product, y)

line_product = np.hstack([line_binned, line * line_binned])
plt.plot(line, reg.predict(line_product), label='linear regression product')

for bin in bins:
    plt.plot([bin, bin], [-3, 3], ':', c='k')
plt.legend(loc="best");

Above we created the interaction term ourselves using `numpy` vectorized math.  `sklearn.preprocessing.PolynomialFeatures` does this automatically and [provides more options](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).
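
A minimal sketch of that equivalence, on a two-row array invented for the example: with `interaction_only=True` and `include_bias=False`, `PolynomialFeatures` returns the original columns followed by their pairwise product, matching a column built by hand.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2.0, 3.0], [4.0, 5.0]])

# Manual interaction column: x0 * x1 appended to the original features
manual = np.hstack([X_small, X_small[:, [0]] * X_small[:, [1]]])

# interaction_only=True keeps products of distinct features and skips squares;
# include_bias=False drops the constant column of ones
poly_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
auto = poly_inter.fit_transform(X_small)
print(auto)
```

Leaving `interaction_only` at its default of `False` would also add the squared terms, which is what the `degree=10` example below relies on.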

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# include polynomials up to x ** 10:
poly = PolynomialFeatures(degree=10)
poly.fit(X)
X_poly = poly.transform(X)

In [None]:
X_poly.shape

In [None]:
poly.get_feature_names()

In [None]:
plt.plot(X[:, 0], y, 'o')
    
reg = LinearRegression().fit(X_poly, y)

line_poly = poly.transform(line)    # using the Poly transform
plt.plot(line, reg.predict(line_poly), label='polynomial linear regression')
plt.legend(loc="best");

Comparison of polynomial linear regression with a support vector regressor initialized under different `gamma` values.

In [None]:

from sklearn.svm import SVR
plt.plot(X[:, 0], y, 'o')

for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(X, y)  # radial basis function kernel by default
    plt.plot(line, svr.predict(line), label='SVR gamma=%d' % gamma)
    
plt.legend(loc="best");

### Scaling before adding polynomial terms

If you plan to use scaling to normalize features and also add interaction or polynomial terms, it is generally best to do the scaling before adding features.

In [None]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)

# rescale data:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
print(X_train.shape)
print(X_train_poly.shape)

In [None]:
print(poly.get_feature_names())

In [None]:
from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train_scaled, y_train)
print("score without interactions: %f" % ridge.score(X_test_scaled, y_test))
ridge = Ridge().fit(X_train_poly, y_train)
print("score with interactions: %f" % ridge.score(X_test_poly, y_test))
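
The scale-then-expand order used above can also be bundled into a single scikit-learn `Pipeline`, so the steps always run in the same sequence on training and test data. The sketch below is one possible arrangement on synthetic data from `make_regression` (an assumption made to keep the block self-contained; it is not the Boston housing data).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in data (an assumption; not the Boston housing set used above)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Steps run in order: scale first, then add polynomial terms, then fit Ridge
model = make_pipeline(MinMaxScaler(), PolynomialFeatures(degree=2), Ridge())
model.fit(X_tr, y_tr)
print([name for name, _ in model.steps])
print(model.score(X_te, y_te))
```

Because the scaler is fit inside the pipeline, its minimum and maximum come only from the training split, mirroring the `fit_transform` / `transform` pattern in the cells above.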

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
print("score without interactions: %f" % rf.score(X_test_scaled, y_test))
rf = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
print("score with interactions: %f" % rf.score(X_test_poly, y_test))

In [None]:
rf.apply(X_test_poly)

In [None]:
rf.apply(X_test_poly).shape

## Univariate Non-linear transformations

A common univariate transform on continuous data is to log transform for data that are skewed.

In [None]:
rnd = np.random.RandomState(0)
X_org = rnd.normal(size=(1000, 3))
w = rnd.normal(size=3)

# draw from the seeded generator so the example is reproducible
X = rnd.poisson(10 * np.exp(X_org))
y = np.dot(X_org, w)

In [None]:
np.bincount(X[:, 0])

In [None]:
bins = np.bincount(X[:, 0])
plt.bar(range(len(bins)), bins, color='b')
plt.ylabel("number of appearances")
plt.xlabel("value");

In [None]:
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Ridge().fit(X_train, y_train).score(X_test, y_test)

In [None]:
X_train_log = np.log(X_train + 1)
X_test_log = np.log(X_test + 1)
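
The `+ 1` matters because count data often contains zeros, and `log(0)` is undefined. NumPy provides `np.log1p` for exactly this `log(1 + x)` pattern; the short sketch below (with made-up counts) shows that it matches the expression used above and that `np.expm1` undoes it.

```python
import numpy as np

counts = np.array([0, 1, 10, 100, 1000])

# log1p(x) computes log(1 + x), so zero counts map to 0 instead of -inf
logged = np.log1p(counts)
print(logged)

# expm1 inverts the transform, recovering the original counts
recovered = np.expm1(logged)
print(recovered)
```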

In [None]:
# X_train_log is already log-transformed, so plot it directly
plt.hist(X_train_log[:, 0], bins=25, color='b');

In [None]:
Ridge().fit(X_train_log, y_train).score(X_test_log, y_test)

## Automatic Feature Selection

Feature selection reduces the dimensionality of the input data.  Dimensionality reduction can:
 * Make the model more efficient computationally
 * Improve the fit of the model by removing redundant or low-variance columns
 * Make the model easier to understand and present, with fewer input data requirements

### Univariate statistics

This example uses [SelectPercentile](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html) to keep half of the features.  `SelectPercentile` takes a scoring function, defaulting to [`f_classif`, the ANOVA F-score](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html).

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# get deterministic random numbers
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
# add noise features to the data
# the first 30 features are from the dataset, the next 50 are noise
X_w_noise = np.hstack([cancer.data, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=.5)
# use f_classif (the default) and SelectPercentile to select 50% of features:
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
# transform training set:
X_train_selected = select.transform(X_train)

print(X_train.shape)
print(X_train_selected.shape)

In [None]:
from sklearn.feature_selection import f_classif, f_regression, chi2

In [None]:
F, p = f_classif(X_train, y_train)

In [None]:
plt.figure()
plt.plot(p, 'o');

In [None]:
mask = select.get_support()
print(mask)
# visualize the mask. black is True, white is False
plt.matshow(mask.reshape(1, -1), cmap='gray_r');

In [None]:
from sklearn.linear_model import LogisticRegression

# transform test data:
X_test_selected = select.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
print("Score with all features: %f" % lr.score(X_test, y_test))
lr.fit(X_train_selected, y_train)
print("Score with only selected features: %f" % lr.score(X_test_selected, y_test))
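
To check the univariate approach where the right answer is known, the sketch below builds a small synthetic dataset (an assumption for illustration) in which two columns track the class label and two are pure noise. With `percentile=50`, `SelectPercentile` should keep the two label-tracking columns, since their F-scores are far larger.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(0)
y_toy = np.repeat([0, 1], 100)

# Columns 0-1 track the label (assumed informative); columns 2-3 are pure noise
informative = np.column_stack([
    4.0 * y_toy + rng.normal(scale=0.5, size=200),
    -4.0 * y_toy + rng.normal(scale=0.5, size=200),
])
noise_cols = rng.normal(size=(200, 2))
X_toy = np.hstack([informative, noise_cols])

selector = SelectPercentile(score_func=f_classif, percentile=50).fit(X_toy, y_toy)
print(selector.get_support())    # boolean mask over the 4 columns
print(selector.scores_)          # F-scores behind the mask
```

Because each feature is scored on its own, this method cannot detect features that are only useful in combination with others.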

### Model-based Feature Selection

Use a model to do feature selection.  From [the docs](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html):

`sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False)`

Parameters:
   * `estimator`: An `sklearn` model
   * `threshold`: Threshold of importance determining whether to keep a feature.  Examples: `'median'`, `'mean'`, `'1.25*mean'`
   * `prefit`: `False` by default - set it to `True` if the `estimator` has already been `fit`
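
The `prefit` option can be sketched as follows, again on synthetic data invented for the example (two label-tracking columns and four noise columns). The forest is trained first and then passed to `SelectFromModel` with `prefit=True`, so `transform` can be called without fitting the selector itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
y_toy = np.repeat([0, 1], 100)
# Two label-tracking columns followed by four noise columns (synthetic assumption)
X_toy = np.hstack([
    (4.0 * y_toy + rng.normal(scale=0.5, size=200)).reshape(-1, 1),
    (-4.0 * y_toy + rng.normal(scale=0.5, size=200)).reshape(-1, 1),
    rng.normal(size=(200, 4)),
])

# Fit the estimator first, then hand it over with prefit=True
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)
selector = SelectFromModel(forest, threshold="median", prefit=True)

X_kept = selector.transform(X_toy)   # no selector.fit needed when prefit=True
mask = selector.get_support()
print(mask, X_kept.shape)
```

With `threshold="median"`, features whose importance is at least the median importance are kept, which is roughly half of the columns.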

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42),
                         threshold="median")

Now when we call `transform` we will get a matrix with half the features (columns) of the original (`threshold="median"`).

In [None]:
select.fit(X_train, y_train)
X_train_l1 = select.transform(X_train)
print(X_train.shape)
print(X_train_l1.shape)

In [None]:
mask = select.get_support() # boolean mask marking the columns that were selected
# visualize the mask. black is True, white is False
plt.matshow(mask.reshape(1, -1), cmap='gray_r');

In [None]:
X_test_l1 = select.transform(X_test)
LogisticRegression().fit(X_train_l1, y_train).score(X_test_l1, y_test)

### Recursive Feature Elimination

Recursive feature elimination repeatedly fits the estimator and drops the least important features until only `n_features_to_select` remain.  Read [more here]().

`sklearn.feature_selection.RFE(estimator, n_features_to_select=None, step=1, verbose=0)`

`step=1` in the arguments is the number of features to drop on each iteration.
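
The elimination order can be inspected through the fitted selector's `support_` and `ranking_` attributes. The sketch below uses a small synthetic dataset (an assumption for illustration, with two label-tracking columns and four noise columns) and a `LogisticRegression` estimator, whose coefficients RFE uses to rank features.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
y_toy = np.repeat([0, 1], 100)
# Two label-tracking columns followed by four noise columns (synthetic assumption)
X_toy = np.hstack([
    (4.0 * y_toy + rng.normal(scale=0.5, size=200)).reshape(-1, 1),
    (-4.0 * y_toy + rng.normal(scale=0.5, size=200)).reshape(-1, 1),
    rng.normal(size=(200, 4)),
])

# Drop one feature per iteration until two remain
rfe = RFE(LogisticRegression(), n_features_to_select=2, step=1).fit(X_toy, y_toy)
print(rfe.support_)   # True for the features that survived
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

A larger `step` removes several features per refit, which reduces the number of model fits at the cost of a coarser ranking.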

In [None]:
from sklearn.feature_selection import RFE
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=40)
#select = RFE(LogisticRegression(penalty="l1"), n_features_to_select=40)

select.fit(X_train, y_train)
# visualize the selected features:
mask = select.get_support()
plt.matshow(mask.reshape(1, -1), cmap='gray_r');

In [None]:
X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)

LogisticRegression().fit(X_train_rfe, y_train).score(X_test_rfe, y_test)

In [None]:
select.score(X_test, y_test)

# Summary

In this notebook, we reviewed the following topics in preparation for more advanced topics:

 * [Feature extraction and selection](#Feature-extraction-and-selection)
 * [Categorical Variables](#Categorical-Variables)
 * [Binning, Discretization, Linear Models and Trees](#Binning,-Discretization,-Linear-Models-and-Trees)
 * [Interactions and Polynomials](#Interactions-and-Polynomials)
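 * [Scaling before adding polynomial terms](#Scaling-before-adding-polynomial-terms)
 * [Univariate Non-linear transformations](#Univariate-Non-linear-transformations)
 * [Automatic Feature Selection](#Automatic-Feature-Selection)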

<a href='Feature_Preprocessing_Feature_Selection_Exercises.ipynb' class='btn btn-primary btn-lg'>Exercises</a>

<img src='img/copyright.png'>