### Feature Selection in Python 

This notebook demonstrates the most common feature selection methods, namely:
* Recursive Feature Elimination (RFE) with the option to select the number of features
* Recursive Feature Elimination using Cross-Validation (RFECV), which automatically tunes the number of features to select
* Embedded Methods: Feature Ranking using algorithms such as SVM, Random Forest, Lasso and Ridge

Our main goal is to generate insights on the most important features in the dataset using the above methods. We want to understand, compare and contrast the rankings that these methods assign to the features.

We wish to identify the features that are consistently ranked high by all of the above methods; if there are such features, they are strong candidates for inclusion in the model. Conversely, we also want to identify the features that are consistently rejected by the majority of the methods, and categorize them as less important.



In [1]:
# Import the basic libraries
import pandas as pd
import os
import numpy as np

In [2]:
# import the frequently used functions from the Sklearn package
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import RFE,RFECV
# RandomizedLasso is not imported: it is no longer available in recent scikit-learn releases and is not used below
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV,Lasso,LassoCV,RidgeClassifierCV
from sklearn.model_selection import RepeatedStratifiedKFold,GridSearchCV,StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC,LinearSVC

In [3]:
os.chdir("c:\\analytics\\data")

In [5]:
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [8]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), obj

#### Data Cleansing
This section is specific to the dataset that we read. It involves tasks such as cleaning the data, dealing with missing values, changing the data types and recoding the class labels to integer values.

We also convert the dataset from a pandas data frame to a numpy array, as required by the sklearn libraries.
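
The cleaning steps described above can be sketched on a small toy frame. This is only an illustration: the values in `toy` are made up, and the names `toy`, `toy_labels` and `toy_values` are hypothetical and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy frame standing in for the churn data (values are made up for illustration).
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15", "1889.5"],
    "Churn": ["No", "Yes", "No", "Yes"],
})

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isnull().sum())

# Fill the gaps with the median value; note the call parentheses on median().
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].median())

# Recode the class labels to integers and move to numpy arrays.
toy_labels = toy["Churn"].map({"Yes": 1, "No": 0}).to_numpy()
toy_values = toy["TotalCharges"].to_numpy()

print(n_missing, toy_values, toy_labels)
```

Coercing first and filling afterwards keeps the column numeric throughout, so no second conversion is needed.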



In [11]:
# clean the data
# recode the class labels: 'Yes' -> 1, anything else -> 0
labels = df.Churn.map(lambda x: 1 if x=='Yes' else 0)
labels = labels.values

nominal_cols =[]
numeric_cols =[]
# columns left out of the feature matrix
drop_cols = ['customerID','Churn','TotalCharges']


# TotalCharges is read as text; values that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'],errors='coerce')
df.TotalCharges.isnull().sum()
# fill the missing values with the column median (median() must be called, not passed as a method)
df['TotalCharges'] = df['TotalCharges'].fillna(value=df.TotalCharges.median())
df.TotalCharges.isnull().sum()

In [12]:
# transform the variables as required by sklearn

def transform_frame_sklearn(features,drop_cols=None):
    """ Drops the unwanted columns and one-hot encodes the nominal (object) columns """
    nominal_cols = []
    
    # drop_cols defaults to None to avoid a mutable default argument
    if drop_cols:
        features = features.drop(drop_cols,axis=1)
    
    # every column with object dtype is treated as nominal
    for col in features.columns:
        if features[col].dtype == 'O':
            nominal_cols.append(col)

    # drop_first=True removes one redundant dummy per nominal column;
    # dtype=int keeps the dummies numeric regardless of the pandas version
    features_t = pd.get_dummies(data=features,columns=nominal_cols,drop_first=True,dtype=int)
    
    return features_t
         

In [13]:
# transform the data frame into the form required by the sklearn libraries

feature_trans = transform_frame_sklearn(df,drop_cols)
feature_names = feature_trans.columns
features = np.array(feature_trans)

In [15]:
# count the distribution of the class labels
df.Churn.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [16]:
# split the data into train and test data
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score,recall_score,precision_score,accuracy_score

In [17]:
features_train,features_test,labels_train,labels_test = train_test_split(features,labels,random_state=42)


### Recursive Feature Elimination 

We will perform recursive feature elimination using multiple algorithms. We are interested in the following:

* What are the top features that each algorithm selects?

* What is the validation score for the subset of features selected by each algorithm? We use this score to judge whether that subset is worth keeping.

We will perform the feature selection using linear, ensemble and SVM models. The models we have selected for this exercise are:
* Linear: Logistic, Lasso, Ridge
* Ensemble: RandomForest
* SVM: Support Vector Classifier
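
To make the difference between the two recursive variants concrete, here is a minimal sketch on synthetic data. It assumes a plain `LogisticRegression` as the base estimator and uses `make_classification` as a stand-in for the churn features; the names `X_demo`, `y_demo`, `rfe_demo` and `rfecv_demo` are hypothetical, and the sample sizes and fold counts are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 12 features, 4 of them informative (assumed values).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2, random_state=42
)

base = LogisticRegression(max_iter=1000)

# RFE: the number of features to keep is fixed up front.
rfe_demo = RFE(estimator=base, n_features_to_select=6, step=1).fit(X_demo, y_demo)

# RFECV: the number of features is chosen by the cross-validated score.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rfecv_demo = RFECV(estimator=base, step=1, cv=cv, scoring="recall").fit(X_demo, y_demo)

print("RFE kept:", rfe_demo.n_features_)
print("RFECV kept:", rfecv_demo.n_features_)
print("RFECV ranking:", rfecv_demo.ranking_)
```

In both objects a rank of 1 marks a selected feature; larger ranks indicate features that were eliminated earlier.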

In [18]:
# set some tuning parameters
kfold=10
score_metric = 'recall'

In [27]:
# instantiate the classifiers
# the l1 penalty needs a solver that supports it, e.g. liblinear
lr = LogisticRegressionCV(penalty='l1',solver='liblinear',n_jobs=-1)
lasso = LassoCV(n_jobs=-1,cv=kfold,random_state=42)
ridge = RidgeClassifierCV(cv=kfold)
rf = RandomForestClassifier(criterion='gini',n_jobs=-1,n_estimators=100,max_features='sqrt',random_state=42)
svc = SVC(C=0.1,kernel='linear')

#estimators = ['lasso','ridge','random_forest','svc']
rfe_estimators = {'lasso':lasso,'ridge':ridge,'svc':svc,'lr':'','rf':rf}   

# number of features to select
#num_features = round(np.sqrt(len(feature_names)))

# we are selecting half of the total features 
num_features = round(len(feature_names)/2)

In [23]:
features_rankscore = pd.DataFrame(feature_names,columns=['feature_name'])

print(" Total Features: {}".format(len(feature_names)))

for clf,value in rfe_estimators.items():
    print("\n{}".format(clf))
    if value != '':
        rfe = RFE(estimator=value,step=2,n_features_to_select=num_features)
        rfe = rfe.fit(features_train, labels_train)
        print(" Estimator {} selected features = {}".format(str.upper(clf),rfe.n_features_))
        print(" Features selected are: {}".format(feature_names[rfe.support_]))
        features_rankscore[clf]=rfe.ranking_
        
        # Predict on the held-out data using only the selected features.
        # Note: rfe.predict returns hard class labels for the classifiers and
        # continuous values for the lasso regressor, so the ROC scores below
        # are not directly comparable across estimators.
        pred_labels = rfe.predict(features_test)
        val_roc_score = roc_auc_score(labels_test,pred_labels)
        # Print the results
        print("Using the model selected {} features, the validation scores are: ".format(rfe.n_features_))
        print(" ROC: {}:  ".format(round(val_roc_score,3)))


 Total Features: 29

lasso
 Estimator LASSO selected features = 14
 Features selected are: Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'Dependents_Yes',
       'PhoneService_Yes', 'InternetService_Fiber optic', 'OnlineSecurity_Yes',
       'OnlineBackup_Yes', 'DeviceProtection_Yes', 'TechSupport_Yes',
       'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Electronic check'],
      dtype='object')
Using the model selected 14 features, the validation scores are: 
 ROC: 0.857:  

ridge
 Estimator RIDGE selected features = 14
 Features selected are: Index(['PhoneService_Yes', 'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_Yes', 'OnlineBackup_Yes',
       'DeviceProtection_No internet service',
       'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No internet service',
       'StreamingMovies_No internet service', 'Contract_One year',
       'Contract_Two year', 'PaperlessBilling_Yes',
  

In [25]:
# Display the rank score of the features
features_rankscore

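| | feature_name | lasso | ridge | svc | rf |
|---|---|---|---|---|---|
| 0 | SeniorCitizen | 1 | 4 | 5 | 3 |
| 1 | tenure | 1 | 9 | 7 | 1 |
| 2 | MonthlyCharges | 1 | 8 | 9 | 1 |
| 3 | gender_Male | 7 | 6 | 7 | 1 |
| 4 | Partner_Yes | 7 | 9 | 8 | 1 |
| 5 | Dependents_Yes | 1 | 4 | 1 | 2 |
| 6 | PhoneService_Yes | 1 | 1 | 3 | 7 |
| 7 | MultipleLines_No phone service | 6 | 6 | 3 | 8 |
| 8 | MultipleLines_Yes | 5 | 7 | 6 | 1 |
| 9 | InternetService_Fiber optic | 1 | 1 | 1 | 1 |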
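
One simple way to look for features that the methods agree on is to count how often each feature receives rank 1 and to average its ranks. The sketch below re-enters the ten rows displayed above by hand so that it is self-contained; the names `ranks` and `summary` and the two summary columns are hypothetical, and this aggregation is only one of several reasonable choices.

```python
import pandas as pd

# The ten rows displayed above, copied by hand (not the full 29-feature table).
ranks = pd.DataFrame(
    {
        "feature_name": [
            "SeniorCitizen", "tenure", "MonthlyCharges", "gender_Male",
            "Partner_Yes", "Dependents_Yes", "PhoneService_Yes",
            "MultipleLines_No phone service", "MultipleLines_Yes",
            "InternetService_Fiber optic",
        ],
        "lasso": [1, 1, 1, 7, 7, 1, 1, 6, 5, 1],
        "ridge": [4, 9, 8, 6, 9, 4, 1, 6, 7, 1],
        "svc": [5, 7, 9, 7, 8, 1, 3, 3, 6, 1],
        "rf": [3, 1, 1, 1, 1, 2, 7, 8, 1, 1],
    }
).set_index("feature_name")

summary = pd.DataFrame(
    {
        # rank 1 means the estimator kept the feature
        "times_selected": (ranks == 1).sum(axis=1),
        "mean_rank": ranks.mean(axis=1),
    }
).sort_values(["times_selected", "mean_rank"], ascending=[False, True])

print(summary)
```

On these ten rows, `InternetService_Fiber optic` is the only feature selected by all four estimators, while `MultipleLines_No phone service` is selected by none.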


#### Embedded Methods

Now we perform the feature selection using the feature ranking property of algorithms such as Random Forest (feature importances) and the coefficient weights of Lasso, Ridge etc. We use the same set of algorithms as above.
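
As a side note, scikit-learn also offers `SelectFromModel`, which turns such importances or coefficients into a selection mask. It is not used in this notebook; the sketch below swaps it in on synthetic data purely to illustrate the idea. The names `X_demo`, `y_demo`, `forest`, `selector`, `importances` and `kept` are hypothetical, and the `"mean"` threshold is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, 3 of them informative (assumed values).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold="mean").fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_
kept = selector.get_support()

print("importances:", importances.round(2))
print("kept mask  :", kept)
print("reduced shape:", selector.transform(X_demo).shape)
```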

In [70]:
from sklearn.linear_model import RidgeClassifier

lasso = LassoCV(n_jobs=-1,cv=kfold,random_state=42)#,alphas=[0.01,0.001,0.1,1.0,10,100])
ridge = RidgeClassifierCV(alphas=[0.01,0.001,0.1, 1.0, 10.0], cv=3, fit_intercept=True)
rf = RandomForestClassifier(criterion='gini',n_jobs=-1,n_estimators=100,max_features='sqrt',random_state=42)
svc = SVC(C=0.1,kernel='linear')

#estimators = ['lasso','ridge','random_forest','svc']
estimators = {'lasso':lasso,'ridge':ridge,'svc':svc,'lr':'','rf':rf,'rand_lasso':''}

In [73]:
#%%timeit
from sklearn.base import clone

features_from_model = pd.DataFrame(feature_names,columns=['feature_name'])

print(" Total Features: {}".format(len(feature_names)))

for clf,value in estimators.items():
    # skip the placeholder entries that have no estimator attached
    if value == '':
        continue
    # fit a fresh copy of the estimator (replaces eval on the estimator's repr)
    model = clone(value).fit(features_train, labels_train)
    if clf in ['lasso']:
        # LassoCV is a regressor: coef_ is one-dimensional
        scores = np.abs(model.coef_)
    elif clf in ['ridge','svc']:
        # binary classifiers expose coef_ with shape (1, n_features)
        scores = np.abs(model.coef_[0])
    else:
        # tree ensembles expose feature importances instead of coefficients
        scores = np.abs(model.feature_importances_)
    # truncate to two decimals for display (numeric replacement for str(x)[0:4])
    features_from_model[clf] = np.trunc(scores*100)/100

 Total Features: 29


In [74]:
# Display the feature scores (absolute coefficients or importances) from each algorithm
features_from_model


| | feature_name | lasso | ridge | svc | rf |
|---|---|---|---|---|---|
| 0 | SeniorCitizen | 0.02 | 0.07 | 0.15 | 0.02 |
| 1 | tenure | 0.0 | 0.0 | 0.03 | 0.23 |
| 2 | MonthlyCharges | 0.0 | 0.0 | 0.01 | 0.22 |
| 3 | gender_Male | 0.0 | 0.01 | 0.05 | 0.03 |
| 4 | Partner_Yes | 0.0 | 0.0 | 0.02 | 0.02 |
| 5 | Dependents_Yes | 0.01 | 0.04 | 0.11 | 0.02 |
| 6 | PhoneService_Yes | 0.08 | 0.05 | 0.34 | 0.0 |
| 7 | MultipleLines_No phone service | 0.0 | 0.05 | 0.34 | 0.0 |
| 8 | MultipleLines_Yes | 0.0 | 0.07 | 0.17 | 0.02 |
| 9 | InternetService_Fiber optic | 0.02 | 0.28 | 0.79 | 0.04 |
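
When reading such a table, keep in mind that coefficient magnitudes of linear models depend on the scale of each feature, whereas tree-based importances do not. This may be one reason why `tenure` and `MonthlyCharges` score near zero for the linear models yet high for the random forest, although that has not been verified on the churn data here. The sketch below illustrates the effect on made-up data; all names in it (`small`, `large`, `X_demo`, `y_demo`, `raw_coef`, `scaled_coef`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Two equally informative features on very different scales (made-up data).
small = rng.normal(0.0, 1.0, n)      # unit scale
large = rng.normal(0.0, 100.0, n)    # same signal strength, 100x the scale
y_demo = (small + large / 100.0 + rng.normal(0.0, 0.5, n) > 0).astype(int)
X_demo = np.column_stack([small, large])

# Coefficients on the raw features: the large-scale feature looks unimportant.
raw_coef = np.abs(RidgeClassifier().fit(X_demo, y_demo).coef_[0])

# Coefficients after standardization: both features look similar.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_coef = np.abs(RidgeClassifier().fit(X_scaled, y_demo).coef_[0])

print("raw   :", raw_coef)
print("scaled:", scaled_coef)
```

Standardizing the numeric columns before fitting the linear models would make their coefficient magnitudes easier to compare with each other.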
