# Feature selection
Just because you have a dataset with 30 features (30 variables on the right-hand side of your equation) doesn't mean you have to use all 30 in your model.  Can you think of reasons why it might be beneficial to drop certain variables?

Let's use our breast cancer dataset to experiment with feature selection.

In [None]:
import pandas as pd
from sklearn import linear_model
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv("../../assets/breast-cancer.csv", header=None)
# Encode the diagnosis as 1 (malignant, 'M') or 0 so the column is numeric for df.corr() later
df[1] = (df[1] == 'M').astype(int)
df.head()

# First, perform a logistic regression on all of the features
(Remember, the first column is just the patient ID--you can ignore that.)

In [None]:
y = df.iloc[:,1]
X = df.iloc[:,2:]
model = linear_model.LogisticRegression()
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)

# But do we need all of the features?
What sort of strategy might one take to drop features?  What if we used the correlation between the x variables and the y variable?

In [None]:
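corr = df.corr()
# Row 1 of the correlation matrix is y; columns 2 onward are the features
yXCorr = pd.DataFrame(corr.iloc[1,2:].abs())
# Sort ascending, so the features most correlated with y end up last
yXCorr.sort_values(by=yXCorr.columns[0],inplace=True)
# The index holds column labels, so select with .loc rather than .iloc
X = df.loc[:,yXCorr.index[-3:]]
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)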

In [None]:
yXCorr.tail()

In [None]:
X.head()

### Let's look at the correlations between the "three best features" according to our "most correlated with y" approach.
What can you say about how these features are correlated with each other?

### Let's also add our y variable, and look at its correlation numbers, for future reference.
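
One way to produce both correlation tables is sketched below. This is only an illustration under assumptions: it uses scikit-learn's bundled copy of the breast cancer data (`load_breast_cancer`) in place of the CSV, so the columns have names rather than integer labels, and the variable names `frame`, `top3`, and `corr_top3` are made up for the example.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled dataset stands in for ../../assets/breast-cancer.csv.
# Here "target" is 0 for malignant and 1 for benign, the reverse of the
# M == True encoding used above, so correlations with y have the opposite sign.
frame = load_breast_cancer(as_frame=True).frame  # 30 features plus "target"

# Rank features by absolute correlation with y (ascending order).
corr_with_y = frame.corr()["target"].drop("target").abs().sort_values()
top3 = list(corr_with_y.index[-3:])

# Correlation matrix of the three top-ranked features together with y.
corr_top3 = frame[top3 + ["target"]].corr()
print(corr_top3.round(2))
```

The off-diagonal entries among the three features show how much they overlap with each other, and the last row shows each feature's correlation with y.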

As you can see in the score above, our model score doesn't decrease by much, even though we are using only three features.  To drive home the point that the correlation matters, let's repeat our test with the three least correlated variables.

In [None]:
# yXCorr is sorted ascending, so its first three entries are the least correlated features
X = df.loc[:,yXCorr.index[:3]]
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)

In [None]:
yXCorr.head()

# Use SelectKBest to get the best 3 features using chi2.
Refit the model, predict again, and print the score.

In [None]:
# Don't forget to reset X to the original variables (all of the features)
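
A possible solution is sketched below. It is a sketch under assumptions, not the one intended answer: it loads scikit-learn's bundled copy of the dataset instead of the CSV, and the names `selector`, `X_best`, `kept`, and `clf` are chosen only for this example. Note that `chi2` requires non-negative feature values, which these measurements satisfy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Assumption: the bundled dataset stands in for ../../assets/breast-cancer.csv.
data = load_breast_cancer()
X_all, y_all = data.data, data.target  # all 30 features

# Keep the 3 features with the highest chi-squared statistic against y.
selector = SelectKBest(score_func=chi2, k=3)
X_best = selector.fit_transform(X_all, y_all)

# get_support() returns a boolean mask over the original columns.
kept = data.feature_names[selector.get_support()]

# Refit on the reduced feature set and score on the same data, as above.
# max_iter is raised because the features are not scaled.
clf = LogisticRegression(max_iter=5000)
clf.fit(X_best, y_all)
score = clf.score(X_best, y_all)
print(list(kept), round(score, 3))
```

Compare the features kept here with the three chosen by correlation: the two criteria need not agree.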

# Can we make an improvement in the score?
Let's try a "brute force" (exhaustive) search to see if we can find three features which give us a better score.  Go through every combination of three x variables, fit and score your model on each one, and keep track of the best score and the x's that produced it.

In [None]:
# Don't forget to reset X to the original variables (all of the features)
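
The search could be sketched as follows. Several things here are assumptions made for illustration: the helper name `best_subset` is hypothetical, the data comes from scikit-learn's bundled copy of the dataset rather than the CSV, the features are standardized so the solver converges quickly, and the demo searches only the first 10 columns (120 triples) to stay fast. All 30 columns give 4,060 triples.

```python
from itertools import combinations

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def best_subset(X, y, candidate_cols, k=3):
    """Return (best_score, best_cols) over all k-column subsets of candidate_cols."""
    best_score, best_cols = -1.0, None
    for cols in combinations(candidate_cols, k):
        X_sub = X[:, list(cols)]
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_sub, y)
        score = clf.score(X_sub, y)  # training accuracy, as in the cells above
        if score > best_score:
            best_score, best_cols = score, cols
    return best_score, best_cols


# Assumption: the bundled dataset stands in for ../../assets/breast-cancer.csv.
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
y_all = data.target

# Demo on the first 10 columns only; pass range(X_scaled.shape[1])
# to search every triple of the 30 features.
best_score, best_cols = best_subset(X_scaled, y_all, range(10), k=3)
print(round(best_score, 3), [data.feature_names[i] for i in best_cols])
```

Because every subset is scored on the same data it was fit on, the winning triple is the one that fits this sample best, which is worth keeping in mind when comparing it with the correlation-based choice.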

# How do these best features compare with the first three features we found?
Originally, we chose the three x's that were most correlated with y.  We obtained a solid score, but now we have obtained a better score.  Why do you think that is?
## Let's examine the correlation matrix of our new three best features and y