## Feature Selection Techniques
This notebook shares several techniques for selecting features when building a classification model. The dataset used here ends up with 30 features after one-hot encoding, and the goal is to reduce the number of features that go into the final model.  
The techniques we'll review are:
* Hypothesis testing: chi-squared test of association
* Feature importance from a random forest classifier
* Correlation coefficients with the target variable

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import zipfile
%matplotlib inline

##### Straightforward Data Ingestion and Preprocessing

In [2]:
zf = zipfile.ZipFile('WA_Fn-UseC_-Telco-Customer-Churn.zip') 
df = pd.read_csv(zf.open('WA_Fn-UseC_-Telco-Customer-Churn.csv'))

# Rows with tenure == 0 get a TotalCharges of 0 before the column is cast to float
df.loc[df['tenure']==0, 'TotalCharges'] = 0
df['TotalCharges'] = df['TotalCharges'].astype(float)
df['MonthlyCharges'] = df['MonthlyCharges'].astype(float)
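
The float conversion above only succeeds because the `tenure == 0` rows are overwritten first, presumably because they hold values that cannot be parsed as numbers. A minimal sketch of a more defensive conversion with `pd.to_numeric`, assuming a toy frame with made-up values (the `toy` name and its contents are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values) mimicking a charges column read as strings,
# with a blank entry for a customer whose tenure is 0.
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200"],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Fill the gaps explicitly; the assumption here is that a customer with
# no tenure has not been charged anything yet.
toy["TotalCharges"] = toy["TotalCharges"].fillna(0.0)

print(toy.dtypes)
```

The advantage of this variant is that any other unparseable value would surface as `NaN` rather than as an exception partway through the cast.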

In [3]:
# One-hot encode the categorical columns; drop_first=True drops one level per column
df_dummy = pd.get_dummies(df, columns=['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
                                       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                                       'PaperlessBilling', 'PaymentMethod', 'Churn'], drop_first=True)

In [4]:
df_dummy.drop(labels=['customerID'], axis=1, inplace=True)

In [5]:
df_dummy.shape

(7043, 31)

### Feature Selection using Chi Squared Test

In [6]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# The first 30 columns are the features; the last column (Churn_Yes) is the target
X = df_dummy.iloc[:,0:30]
y = df_dummy.iloc[:,-1]

In [7]:
# Apply SelectKBest with the chi-squared score function to keep the top 15 features
bestfeatures = SelectKBest(score_func=chi2, k=15)
fit = bestfeatures.fit(X,y)
bf_scores = pd.DataFrame(fit.scores_)
bf_columns = pd.DataFrame(X.columns)

# Combine feature names and scores to help visualize the top contributing features
featureScores = pd.concat([bf_columns,bf_scores],axis=1)
featureScores.columns = ['Specs','Score']
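
The fitted selector can also report which columns it kept, which avoids reading them off a sorted table. A minimal sketch, assuming a small synthetic dataset (the `X_toy`, `y_toy`, and column names are hypothetical) and `k=1` so the outcome is easy to check:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic binary target and three non-negative count-like features.
# Only "signal" is built from the target; the other two are pure noise.
y_toy = pd.Series(rng.integers(0, 2, size=300), name="target")
X_toy = pd.DataFrame({
    "signal": y_toy.to_numpy() * 3 + rng.integers(0, 2, size=300),
    "noise_a": rng.integers(0, 5, size=300),
    "noise_b": rng.integers(0, 5, size=300),
})

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns.
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
```

On the notebook's data, the same pattern would be `X.columns[fit.get_support()]`, using the `fit` object created in the cell above.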

In [8]:
featureScores.sort_values('Score', ascending=False).head(15)

    Specs                                         Score
3   TotalCharges                          624292.003004
1   tenure                                 16278.923685
2   MonthlyCharges                          3680.787699
25  Contract_Two year                         488.57809
28  PaymentMethod_Electronic check           426.422767
10  InternetService_Fiber optic              374.476216
11  InternetService_No                       286.520193
18  TechSupport_No internet service          286.520193
16  DeviceProtection_No internet service     286.520193
14  OnlineBackup_No internet service         286.520193
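
The three numeric columns score orders of magnitude higher than the dummy variables. Part of that gap may simply reflect their units: scikit-learn's `chi2` treats feature values like counts, so multiplying a feature by a constant multiplies its score by the same constant. A minimal sketch of that effect, assuming synthetic data (the names and the factor of 1000 are illustrative only):

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Synthetic binary target and one non-negative feature loosely tied to it.
y_toy = rng.integers(0, 2, size=200)
x_toy = (rng.random(200) + 0.5 * y_toy).reshape(-1, 1)

# Score the same feature twice: once as-is, once expressed in units 1000x smaller.
score_raw, _ = chi2(x_toy, y_toy)
score_scaled, _ = chi2(x_toy * 1000, y_toy)

ratio = score_scaled[0] / score_raw[0]
print(ratio)
```

Under this reading, comparing chi-squared scores between `TotalCharges` and a 0/1 dummy is not like-for-like; binning or rescaling the continuous columns first is one way to make the ranking more comparable.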


### Feature Selection by Feature Importance in Random Forest Model

In [9]:
X = df_dummy.iloc[:,0:30]
y = df_dummy.iloc[:,-1]

# Fit RF model using all features
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_depth=2,
                             random_state=0)
clf.fit(X, y)  

# Display top 15 by feature importance
feat_importances = pd.Series(clf.feature_importances_, index=X.columns)
feat_importances.sort_values(ascending=False).head(15)

tenure                                  0.213512
InternetService_Fiber optic             0.135466
Contract_Two year                       0.132184
PaymentMethod_Electronic check          0.099044
TotalCharges                            0.086579
OnlineSecurity_No internet service      0.042248
StreamingMovies_No internet service     0.040365
MonthlyCharges                          0.038775
InternetService_No                      0.034067
TechSupport_No internet service         0.032606
OnlineBackup_No internet service        0.028316
OnlineSecurity_Yes                      0.027269
StreamingTV_No internet service         0.022366
TechSupport_Yes                         0.019083
DeviceProtection_No internet service    0.016870
dtype: float64
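
The ranking above still leaves the cut-off to judgement. One option is to keep the smallest set of features whose importances add up to a chosen share of the total. A minimal sketch, assuming synthetic data from `make_classification` and a hypothetical 90% threshold (neither comes from this notebook):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 10 features, 3 of them informative.
X_arr, y_toy = make_classification(n_samples=500, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

clf_toy = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances are normalized to sum to 1.
importances = pd.Series(clf_toy.feature_importances_,
                        index=X_toy.columns).sort_values(ascending=False)
cumulative = importances.cumsum()

# Keep the shortest prefix of the ranking that reaches the threshold.
threshold = 0.90  # hypothetical cut-off, tune to taste
n_keep = int((cumulative < threshold).sum()) + 1
kept = importances.index[:n_keep].tolist()
print(kept)
```

The threshold replaces the fixed "top 15" with a rule tied to how concentrated the importances are, though the right value remains a modelling choice.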

### Feature Selection Using Correlation Coefficient with Target Variable

In [10]:
X = df_dummy.iloc[:,0:30]
y = df_dummy.iloc[:,-1]

# Compute pairwise correlations between all columns in the dataset, including the target
corrmat = df_dummy.corr()

In [19]:
# Rank features by the absolute value of their correlation with the target
corrmat['abs_Churn_Yes'] = corrmat['Churn_Yes'].abs()
tmp = corrmat[['Churn_Yes', 'abs_Churn_Yes']].sort_values(by='abs_Churn_Yes', ascending=False).head(16)
tmp.columns = ['corr_coef_Churn', 'abs_corr_coef_Churn']
# Skip the first row, which is Churn_Yes correlated with itself
tmp.iloc[1:,:]

                                      corr_coef_Churn  abs_corr_coef_Churn
tenure                                      -0.352229             0.352229
InternetService_Fiber optic                   0.30802              0.30802
Contract_Two year                           -0.302253             0.302253
PaymentMethod_Electronic check               0.301919             0.301919
StreamingTV_No internet service              -0.22789              0.22789
DeviceProtection_No internet service         -0.22789              0.22789
OnlineBackup_No internet service             -0.22789              0.22789
OnlineSecurity_No internet service           -0.22789              0.22789
InternetService_No                           -0.22789              0.22789
StreamingMovies_No internet service          -0.22789              0.22789
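
The six "no internet service" columns share exactly the same coefficient, which suggests they carry identical values and that keeping more than one adds no information. A minimal sketch of dropping exact duplicate columns, assuming a toy frame with made-up values (the `toy` frame is hypothetical; only the column names echo the notebook):

```python
import pandas as pd

# Toy frame where three dummy columns are exact copies of each other.
toy = pd.DataFrame({
    "InternetService_No": [1, 0, 0, 1],
    "TechSupport_No internet service": [1, 0, 0, 1],
    "OnlineBackup_No internet service": [1, 0, 0, 1],
    "tenure": [2, 30, 12, 5],
})

# Transpose so columns become rows, then flag rows that repeat an earlier one.
is_dup = toy.T.duplicated()

# Keep only the first occurrence of each distinct column.
deduped = toy.loc[:, ~is_dup]
print(deduped.columns.tolist())
```

Applied to `X` before any of the three methods, this would stop the duplicated dummies from crowding the top of each ranking.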
