<img src='img/logo.png'>
<img src='img/title.png'>

# Feature selection 

Rather than dropping columns by hand, we can use statistical tests and model-based methods to choose which features a linear model should use.

# Table of Contents
* [Feature selection](#Feature-selection)
	* [Data](#Data)
	* [All features](#All-features)
	* [F and p values](#F-and-p-values)
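		* [Select best features: regression](#Select-best-features:-regression)
		* [Select best features: classification](#Select-best-features:-classification)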
		* [Filter features](#Filter-features)
		* [Eliminate worst features](#Eliminate-worst-features)
* [Exercises](#Exercises)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Data

In [None]:
bikes = pd.read_csv('data/bike_day_raw.csv')
bikes.head()

In [None]:
X = bikes.drop(['cnt','weekday'], axis='columns')
y = bikes['cnt']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

## All features

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

## F and p values

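For each feature, the null hypothesis is: *this feature contains no information about the target* (more precisely, it has no linear relationship with it).

`f_regression` returns an F statistic and a p-value per feature. A small p-value is evidence against the null hypothesis.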

In [None]:
from sklearn.feature_selection import f_regression

In [None]:
f, p = f_regression(X, y)

Many features have a p-value at or near zero, so for those features we can reject the null hypothesis.

In [None]:
pd.Series(p, index=X.columns).sort_values()
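
To make the F and p values less of a black box, here is a minimal sketch that recomputes them by hand from each feature's correlation with the target and compares the result with `f_regression`. The data is synthetic and the names `X_demo`, `y_demo`, `f_manual`, and `p_manual` are assumptions for illustration, not part of the bike data set.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data (an assumption, not the bike data):
# column 0 drives the target, column 1 is pure noise.
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=n)

# Pearson correlation of each feature with the target
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(2)])

# F statistic with 1 and n - 2 degrees of freedom, and its p-value
dof = n - 2
f_manual = r**2 / (1 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)

# The same quantities as computed by scikit-learn
f_sklearn, p_sklearn = f_regression(X_demo, y_demo)
print(np.allclose(f_manual, f_sklearn), np.allclose(p_manual, p_sklearn))
```

Each feature is tested on its own here, which is why this kind of selection is called univariate.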

### Select best features: regression

In [None]:
from sklearn.feature_selection import SelectKBest

In [None]:
best3 = SelectKBest(f_regression, k=3)

In [None]:
best3.fit(X_train, y_train)

Here are the three best features.

In [None]:
X.columns[best3.get_support()]

`best3` is a fitted transformer, so we can use it to reduce both the training and the test data to those three columns.

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [None]:
lr.fit(best3.transform(X_train), y_train)
lr.score(best3.transform(X_test), y_test)
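
Calling `transform` by hand on both splits works, but the selector and the model can also be chained. The following is a sketch using `make_pipeline` on synthetic data from `make_regression`; the variable names (`X_demo`, `pipe`, and so on) and the data set are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the bike data (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# The pipeline fits the selector on the training data only,
# then applies the same column selection before every predict/score call.
pipe = make_pipeline(SelectKBest(f_regression, k=3), LinearRegression())
pipe.fit(Xd_train, yd_train)

r2 = pipe.score(Xd_test, yd_test)
n_kept = pipe.named_steps['selectkbest'].get_support().sum()
print(n_kept, r2)
```

Keeping the selector inside the pipeline also means it is refit on each training fold if the pipeline is later cross-validated.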

### Select best features: classification

In [None]:
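from sklearn.feature_selection import chi2

# chi2 is meant for classification: it expects non-negative features
# and a categorical target. Note that this rebinds `best3`.
best3 = SelectKBest(chi2, k=3)

In [None]:
# `cnt` is a count, so each distinct value is treated as its own class here
best3.fit(X_train, y_train)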
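
Since `chi2` is designed for class labels, a sketch on a real classification problem may be clearer. This example uses the iris data set bundled with scikit-learn, which is an assumption for illustration and not part of this notebook's data.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris: four non-negative measurements, three classes
iris = load_iris()

# Keep the two features with the highest chi-squared scores
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(iris.data, iris.target)

# Names of the features that were kept
kept = [
    name
    for name, keep in zip(iris.feature_names, selector.get_support())
    if keep
]
print(X_new.shape, kept)
```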

### Filter features

In [None]:
from sklearn.feature_selection import SelectFpr

Choose a p-value cutoff; 0.05 is the conventional choice.

In [None]:
filter_features = SelectFpr(f_regression, alpha=0.05)

In [None]:
filter_features.fit(X_train, y_train)

The filter dropped `'hum'` and `'holiday'` (`'weekday'` was already removed when `X` was built). The remaining columns are:

In [None]:
X.columns[filter_features.get_support()]

Transform and fit the linear model

In [None]:
lr.fit(filter_features.transform(X_train), y_train)
lr.score(filter_features.transform(X_test), y_test)
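
To see what the filter does internally, the sketch below compares the mask from `get_support()` with a manual comparison of the fitted `pvalues_` attribute against the cutoff. It uses synthetic data, and the names `X_demo`, `selector`, and `manual_mask` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFpr, f_regression

# Synthetic stand-in for the bike data (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0
)

alpha = 0.05
selector = SelectFpr(f_regression, alpha=alpha).fit(X_demo, y_demo)

# A feature is kept when its p-value is below the cutoff
manual_mask = selector.pvalues_ < alpha
print(np.array_equal(selector.get_support(), manual_mask))
```

Unlike `SelectKBest`, the number of features that survive is not fixed in advance; it depends on the data and on `alpha`.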

### Eliminate worst features

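Recursive feature elimination (RFE) repeatedly fits an estimator and drops the weakest feature until the requested number of features remains. `RFE` is itself an estimator that implements `fit` and `score`.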

In [None]:
from sklearn.feature_selection import RFE

In [None]:
rfe = RFE(LinearRegression(), n_features_to_select=6)

In [None]:
rfe.fit(X_train, y_train)

Now we get a different set of features, because RFE ranks features by the coefficients of a model fit on all of them together rather than testing each feature in isolation.

In [None]:
X.columns[rfe.get_support()]

In [None]:
rfe.score(X_test, y_test)
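
A fitted `RFE` also records the order in which features were eliminated. The sketch below inspects the `support_` and `ranking_` attributes on synthetic data; the data set and the name `rfe_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the bike data (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0
)

# Drop one feature per round (the default step) until three remain
rfe_demo = RFE(LinearRegression(), n_features_to_select=3).fit(X_demo, y_demo)

# support_ is a boolean mask of the kept features.
# ranking_ is 1 for kept features; larger values were eliminated earlier.
print(rfe_demo.support_)
print(rfe_demo.ranking_)
```

Because a linear model's coefficients depend on feature scale, standardizing the features before RFE is often worth considering.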

# Exercises

<img src='img/copyright.png'>