# Task 7: AutoFeatureSelector Tool
## This task tests your understanding of the feature selection methods covered in the lecture, your ability to apply them to a real-world dataset to select the best features, and your ability to build an automated feature selection tool for your toolkit

### Use your knowledge of different feature selector methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: Garments Worker Productivity (`garments_worker_productivity.csv`)
#### Attributes: date, quarter, department, day, team, targeted_productivity, smv, wip, over_time, incentive, idle_time, idle_men, no_of_style_change, no_of_workers, and actual_productivity (the prediction target).

In [123]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [124]:
player_df = pd.read_csv('garments_worker_productivity.csv')
player_df.info()
df=player_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   1197 non-null   object 
 1   quarter                1197 non-null   object 
 2   department             1197 non-null   object 
 3   day                    1197 non-null   object 
 4   team                   1197 non-null   int64  
 5   targeted_productivity  1197 non-null   float64
 6   smv                    1197 non-null   float64
 7   wip                    691 non-null    float64
 8   over_time              1197 non-null   int64  
 9   incentive              1197 non-null   int64  
 10  idle_time              1197 non-null   float64
 11  idle_men               1197 non-null   int64  
 12  no_of_style_change     1197 non-null   int64  
 13  no_of_workers          1197 non-null   float64
 14  actual_productivity    1197 non-null   float64
dtypes: f

In [60]:
numcols = ['team','smv','wip','over_time','incentive', 'idle_time','idle_men','no_of_style_change', 'no_of_workers','actual_productivity']
catcols = ['date','quarter','department','day']

player_df.loc[:,'department'] = player_df.loc[:,'department'].str.strip()

In [61]:
player_df = player_df[numcols+catcols]
player_df

Unnamed: 0,team,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity,date,quarter,department,day
0,8,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725,1/1/2015,Quarter1,sweing,Thursday
1,1,3.94,,960,0,0.0,0,0,8.0,0.886500,1/1/2015,Quarter1,finishing,Thursday
2,11,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570,1/1/2015,Quarter1,sweing,Thursday
3,12,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570,1/1/2015,Quarter1,sweing,Thursday
4,6,25.90,1170.0,1920,50,0.0,0,0,56.0,0.800382,1/1/2015,Quarter1,sweing,Thursday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,10,2.90,,960,0,0.0,0,0,8.0,0.628333,3/11/2015,Quarter2,finishing,Wednesday
1193,8,3.90,,960,0,0.0,0,0,8.0,0.625625,3/11/2015,Quarter2,finishing,Wednesday
1194,7,3.90,,960,0,0.0,0,0,8.0,0.625625,3/11/2015,Quarter2,finishing,Wednesday
1195,9,2.90,,1800,0,0.0,0,0,15.0,0.505889,3/11/2015,Quarter2,finishing,Wednesday


In [62]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.fillna(0)
traindf

Unnamed: 0,team,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity,...,quarter_Quarter4,quarter_Quarter5,department_finishing,department_sweing,day_Monday,day_Saturday,day_Sunday,day_Thursday,day_Tuesday,day_Wednesday
0,8,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725,...,0,0,0,1,0,0,0,1,0,0
1,1,3.94,0.0,960,0,0.0,0,0,8.0,0.886500,...,0,0,1,0,0,0,0,1,0,0
2,11,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570,...,0,0,0,1,0,0,0,1,0,0
3,12,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570,...,0,0,0,1,0,0,0,1,0,0
4,6,25.90,1170.0,1920,50,0.0,0,0,56.0,0.800382,...,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,10,2.90,0.0,960,0,0.0,0,0,8.0,0.628333,...,0,0,1,0,0,0,0,0,0,1
1193,8,3.90,0.0,960,0,0.0,0,0,8.0,0.625625,...,0,0,1,0,0,0,0,0,0,1
1194,7,3.90,0.0,960,0,0.0,0,0,8.0,0.625625,...,0,0,1,0,0,0,0,0,0,1
1195,9,2.90,0.0,1800,0,0.0,0,0,15.0,0.505889,...,0,0,1,0,0,0,0,0,0,1


In [63]:
traindf = pd.DataFrame(traindf,columns=features)

In [64]:
#y = df['actual_productivity']
X = traindf.copy()
del X['actual_productivity']

In [118]:
#y.columns=['actual_productivity']
y=player_df[['actual_productivity']]
X

Unnamed: 0,quarter,department,day,team,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers
0,0,1,3,8,26.16,1108.0,7080,98,0.0,0,0,59.0
1,0,0,3,1,3.94,0.0,960,0,0.0,0,0,8.0
2,0,1,3,11,11.41,968.0,3660,50,0.0,0,0,30.5
3,0,1,3,12,11.41,968.0,3660,50,0.0,0,0,30.5
4,0,1,3,6,25.90,1170.0,1920,50,0.0,0,0,56.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1192,1,0,5,10,2.90,0.0,960,0,0.0,0,0,8.0
1193,1,0,5,8,3.90,0.0,960,0,0.0,0,0,8.0
1194,1,0,5,7,3.90,0.0,960,0,0.0,0,0,8.0
1195,1,0,5,9,2.90,0.0,1800,0,0.0,0,0,15.0


In [66]:
len(X.columns)

81

In [125]:
data = pd.read_csv('garments_worker_productivity.csv')

# 'wip' is the only column with missing values (691 of 1197 non-null); fill them with 0
data['wip'] = data['wip'].fillna(0)
# strip stray whitespace from the department labels
data['department'] = data['department'].str.strip()

from sklearn.preprocessing import LabelEncoder
# encode each categorical column as integer codes
label_encoder = LabelEncoder()
for col in ['department', 'day', 'quarter']:
    data[col] = label_encoder.fit_transform(data[col])
data.drop('date', inplace=True, axis=1)

# 'targeted_productivity' is kept aside and removed from the features
target_productivity = data['targeted_productivity']
data.drop('targeted_productivity', inplace=True, axis=1)
# 'actual_productivity' is the target
y = data[['actual_productivity']]
data.drop('actual_productivity', inplace=True, axis=1)
X = data
X.shape, y.shape

((1197, 12), (1197, 1))
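
The integer codes in the encoded table follow from how label encoding works: the distinct labels are sorted and each one is replaced by its position. Below is a minimal, dependency-free sketch of that idea; the helper `label_encode` is hypothetical and only mimics the behaviour, and the day names are the ones that appear as one-hot columns earlier in the notebook.

```python
def label_encode(values):
    """Map each distinct label to its rank in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[value] for value in values], mapping

# Day labels present in the dataset (there is no Friday column in the one-hot output).
days = ["Thursday", "Saturday", "Sunday", "Monday", "Tuesday", "Wednesday"]
codes, mapping = label_encode(days)
print(mapping)
print(codes)
```

Note that the codes follow alphabetical order, not calendar order, so Thursday becomes 3 and Wednesday becomes 5. Correlation-based and linear selectors will treat these codes as if they were a meaningful numeric scale, which is worth keeping in mind when reading their results.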

In [126]:
X

Unnamed: 0,quarter,department,day,team,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers
0,0,1,3,8,26.16,1108.0,7080,98,0.0,0,0,59.0
1,0,0,3,1,3.94,0.0,960,0,0.0,0,0,8.0
2,0,1,3,11,11.41,968.0,3660,50,0.0,0,0,30.5
3,0,1,3,12,11.41,968.0,3660,50,0.0,0,0,30.5
4,0,1,3,6,25.90,1170.0,1920,50,0.0,0,0,56.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1192,1,0,5,10,2.90,0.0,960,0,0.0,0,0,8.0
1193,1,0,5,8,3.90,0.0,960,0,0.0,0,0,8.0
1194,1,0,5,7,3.90,0.0,960,0,0.0,0,0,8.0
1195,1,0,5,9,2.90,0.0,1800,0,0.0,0,0,15.0


In [133]:
y


Unnamed: 0,actual_productivity
0,0.940725
1,0.886500
2,0.800570
3,0.800570
4,0.800382
...,...
1192,0.628333
1193,0.625625
1194,0.625625
1195,0.505889


### Set some fixed set of features

In [75]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=10

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [85]:
def cor_selector(X, y, num_feats):
    feature_name = X.columns.tolist()
    # flatten y so that a Series and a single-column DataFrame both work
    y = np.ravel(y)
    # calculate the Pearson correlation with y for each feature
    cor_list = [np.corrcoef(X[col], y)[0, 1] for col in feature_name]
    # replace NaN (e.g. from a constant column) with 0
    cor_list = [0 if np.isnan(cor) else cor for cor in cor_list]
    # names of the num_feats features with the largest absolute correlation
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # boolean mask: True if the feature was selected, False otherwise
    cor_support = [name in cor_feature for name in feature_name]
    return cor_support, cor_feature

In [86]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

10 selected features


### List the selected features from Pearson Correlation

In [87]:
cor_feature

['quarter',
 'over_time',
 'no_of_workers',
 'incentive',
 'idle_time',
 'department',
 'smv',
 'team',
 'idle_men',
 'no_of_style_change']
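
To make the ranking step concrete, here is a minimal, dependency-free sketch of the same idea: compute Pearson's r between each feature and the target, then keep the features with the largest absolute value. The toy columns and the helper names `pearson_r` and `top_k_by_abs_corr` are illustrative assumptions, not part of the notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def top_k_by_abs_corr(columns, target, k):
    """Names of the k columns with the largest |r| against the target."""
    scores = {name: abs(pearson_r(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (hypothetical): one rising column, one falling column, one constant column.
toy_columns = {
    "up": [1, 2, 3, 4, 5],
    "down": [5, 4, 3, 2, 2],
    "flat": [7, 7, 7, 7, 7],
}
toy_target = [1.0, 2.0, 3.0, 4.0, 5.0]
selected = top_k_by_abs_corr(toy_columns, toy_target, k=2)
print(selected)
```

Because the ranking uses the absolute value, a strongly negative relationship ("down") counts as much as a strongly positive one, and a constant column scores 0, which mirrors the NaN-to-0 replacement in `cor_selector`.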

## Filter Feature Selection - Chi-Square

In [136]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import MinMaxScaler

### Mutual information regression selector function

In [137]:
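def chi_squared_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # chi2 needs a categorical target; actual_productivity is continuous,
    # so mutual_info_regression is used as the scoring function instead
    X_scaled = MinMaxScaler().fit_transform(X)
    # k must be passed by keyword in recent scikit-learn versions
    selector = SelectKBest(mutual_info_regression, k=num_feats)
    # np.ravel gives the 1-D target that scikit-learn expects
    selector.fit(X_scaled, np.ravel(y))
    chi_support = selector.get_support()
    chi_features = X.loc[:, chi_support].columns.tolist()
    # Your code ends here
    return chi_support, chi_features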

In [138]:
chi_support, chi_features = chi_squared_selector(X, y,num_feats)
print(str(len(chi_features)), 'selected features')

10 selected features



### List the selected features from Chi-Square 

In [140]:
chi_features

['quarter',
 'department',
 'team',
 'smv',
 'wip',
 'over_time',
 'incentive',
 'idle_men',
 'no_of_style_change',
 'no_of_workers']
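
Mutual information can pick up dependencies that Pearson correlation misses. The sketch below illustrates this with a simple histogram (binned) estimate, which is a different and cruder estimator than the one behind scikit-learn's `mutual_info_regression`; the helpers `bin_index`, `binned_mutual_info` and `pearson_r` and the toy data are assumptions made for illustration, and the binning assumes non-constant inputs.

```python
import math
from collections import Counter

def bin_index(value, low, high, bins):
    """Index of the equal-width bin holding value (the top edge goes in the last bin)."""
    width = (high - low) / bins
    return min(int((value - low) / width), bins - 1)

def binned_mutual_info(xs, ys, bins=4):
    """Histogram estimate of the mutual information between xs and ys, in nats."""
    n = len(xs)
    x_low, x_high, y_low, y_high = min(xs), max(xs), min(ys), max(ys)
    x_bins = [bin_index(x, x_low, x_high, bins) for x in xs]
    y_bins = [bin_index(y, y_low, y_high, bins) for y in ys]
    joint, px, py = Counter(zip(x_bins, y_bins)), Counter(x_bins), Counter(y_bins)
    # sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over the occupied cells
    return sum(
        (count / n) * math.log(count * n / (px[bx] * py[by]))
        for (bx, by), count in joint.items()
    )

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: y depends on x completely, but symmetrically, so the linear correlation vanishes.
xs = list(range(-50, 51))
ys = [x * x for x in xs]
r = pearson_r(xs, ys)
mi = binned_mutual_info(xs, ys)
print(f"pearson r = {r:.3f}, binned MI = {mi:.3f} nats")
```

Here the correlation is zero while the mutual information is clearly positive, which is one reason the two filter methods in this notebook can disagree about a feature such as `wip`.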

## Wrapper Feature Selection - Recursive Feature Elimination

In [148]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [149]:
def rfe_selector(X, y, num_feats):
    X_norm = MinMaxScaler().fit_transform(X)
    # step=10 allows up to 10 features to be dropped per round, but never below num_feats
    selector = RFE(estimator=RandomForestRegressor(), n_features_to_select=num_feats, step=10, verbose=5)
    # np.ravel gives the 1-D target that scikit-learn expects
    selector.fit(X_norm, np.ravel(y))
    rfe_support = selector.get_support()
    rfe_feature = X.loc[:, rfe_support].columns.tolist()
    return rfe_support, rfe_feature


In [150]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

Fitting estimator with 12 features.




10 selected features


### List the selected features from RFE

In [151]:
rfe_feature

['quarter',
 'day',
 'team',
 'smv',
 'wip',
 'over_time',
 'incentive',
 'idle_men',
 'no_of_style_change',
 'no_of_workers']
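
The verbose log above prints "Fitting estimator with 12 features." only once, even though `step=10`. A small sketch of the elimination schedule shows why, assuming (as scikit-learn's RFE appears to do) that each round removes at most `step` features and never goes below the requested count. The helper `rfe_fit_schedule` is hypothetical and only models the loop; it does not fit any estimator.

```python
def rfe_fit_schedule(n_features, n_features_to_select, step):
    """Feature counts seen at each elimination round of a recursive-elimination loop."""
    schedule = []
    remaining = n_features
    while remaining > n_features_to_select:
        schedule.append(remaining)
        # never remove more than is needed to land exactly on the target count
        remaining -= min(step, remaining - n_features_to_select)
    return schedule

print(rfe_fit_schedule(12, 10, step=10))  # [12]
print(rfe_fit_schedule(12, 10, step=1))   # [12, 11]
```

With 12 features and 10 to keep, a single round drops the two least important features at once. Using `step=1` would refit after each removal, which is slower but lets the importances be re-estimated before the second feature is dropped.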

## Embedded Selection - Linear Regression: SelectFromModel

In [157]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

In [163]:
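def embedded_log_reg_selector(X, y, num_feats):
    # the name comes from the template; the estimator used here is LinearRegression
    # normalize=True was removed in scikit-learn 1.2; for plain least squares the
    # coefficients are reported on the original feature scale either way
    embedded_lr_selector = SelectFromModel(LinearRegression(), max_features=num_feats)
    # np.ravel gives the 1-D target that scikit-learn expects
    embedded_lr_selector.fit(X, np.ravel(y))
    embedded_lr_support = embedded_lr_selector.get_support()
    embedded_lr_feature = X.loc[:,embedded_lr_support].columns.tolist()
    return embedded_lr_support, embedded_lr_feature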

In [164]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

2 selected features


In [165]:
embedded_lr_feature

['department', 'no_of_style_change']
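
Only 2 features come back even though `max_features=10` was requested. A likely explanation is that `SelectFromModel` also applies an importance threshold (by default the mean importance for estimators without an L1 penalty), so `max_features` acts only as an upper bound. The sketch below mimics that rule in plain Python; the helper `select_from_importances` and the coefficient values are hypothetical and chosen only to illustrate the effect.

```python
def select_from_importances(importances, max_features=None):
    """Keep features whose importance is at least the mean, capped at max_features."""
    threshold = sum(importances.values()) / len(importances)
    ranked = sorted(importances, key=importances.get, reverse=True)
    if max_features is not None:
        ranked = ranked[:max_features]
    return [name for name in ranked if importances[name] >= threshold]

# Hypothetical absolute coefficients: two large values pull the mean above the rest.
toy_importances = {"a": 0.90, "b": 0.60, "c": 0.05, "d": 0.03, "e": 0.02}
print(select_from_importances(toy_importances, max_features=4))  # ['a', 'b']
```

If the intent is to always return exactly `num_feats` features, the scikit-learn documentation suggests passing `threshold=-np.inf` so that `max_features` alone decides. Also note that unscaled coefficients favour features measured on small numeric scales.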

## Tree based(Random Forest): SelectFromModel

In [208]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

In [209]:
def embedded_rf_selector(X, y, num_feats):
    selector = SelectFromModel(RandomForestRegressor(n_estimators=100), max_features=num_feats)
    # np.ravel gives the 1-D target that scikit-learn expects
    selector.fit(X, np.ravel(y))
    embedded_rf_support = selector.get_support()
    embedded_rf_feature = X.loc[:,embedded_rf_support].columns.tolist()
    return embedded_rf_support, embedded_rf_feature

In [210]:
embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embedded_rf_feature)), 'selected features')

In [88]:
embedded_rf_feature

## Tree based(Light GBM): SelectFromModel

In [173]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMRegressor

In [176]:
def embedded_lgbm_selector(X, y, num_feats):
    lgbm = LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
            reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)

    selector = SelectFromModel(lgbm, max_features=num_feats)
    # np.ravel gives a 1-D target
    selector.fit(X, np.ravel(y))
    embedded_lgbm_support = selector.get_support()
    embedded_lgbm_feature = X.loc[:,embedded_lgbm_support].columns.tolist()
    return embedded_lgbm_support, embedded_lgbm_feature

In [178]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')

5 selected features


In [179]:
embedded_lgbm_feature

['team', 'smv', 'over_time', 'incentive', 'no_of_workers']

## Putting all of it together: AutoFeatureSelector Tool

In [183]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'mutual info regression':chi_support, 'RFE':rfe_support, 'Linear Regression':embedded_lr_support,
                                    'LightGBM':embedded_lgbm_support})
# count how many methods selected each feature (sum over the boolean columns only)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
# sort by vote count, breaking ties by feature name
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
# display the top num_feats features
feature_selection_df.head(num_feats)

Unnamed: 0,Feature,Pearson,mutual info regression,RFE,Linear Regression,LightGBM,Total
1,team,True,True,True,False,True,4
2,smv,True,True,True,False,True,4
3,over_time,True,True,True,False,True,4
4,no_of_workers,True,True,True,False,True,4
5,no_of_style_change,True,True,True,True,False,4
6,incentive,True,True,True,False,True,4
7,quarter,True,True,True,False,False,3
8,idle_men,True,True,True,False,False,3
9,department,True,True,False,True,False,3
10,wip,False,True,True,False,False,2
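
The vote count in this table can be reproduced without pandas. The sketch below feeds the selected-feature lists printed earlier in the notebook into a `Counter` and applies the same ordering (votes first, then feature name, both descending); the variable names are illustrative.

```python
from collections import Counter

# Selected features per method, copied from the outputs earlier in the notebook.
selected_by_method = {
    "Pearson": ["quarter", "over_time", "no_of_workers", "incentive", "idle_time",
                "department", "smv", "team", "idle_men", "no_of_style_change"],
    "mutual info regression": ["quarter", "department", "team", "smv", "wip", "over_time",
                               "incentive", "idle_men", "no_of_style_change", "no_of_workers"],
    "RFE": ["quarter", "day", "team", "smv", "wip", "over_time", "incentive",
            "idle_men", "no_of_style_change", "no_of_workers"],
    "Linear Regression": ["department", "no_of_style_change"],
    "LightGBM": ["team", "smv", "over_time", "incentive", "no_of_workers"],
}

# One vote per method that selected the feature.
votes = Counter(name for features in selected_by_method.values() for name in features)
# Same ordering as sort_values(['Total', 'Feature'], ascending=False).
ranking = sorted(votes.items(), key=lambda item: (item[1], item[0]), reverse=True)
best_features = [name for name, _ in ranking[:5]]
print(best_features)
```

Six features tie at 4 votes, so which five make the final cut depends on the alphabetical tie-break rather than on any measure of importance.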


## Can you build a Python script that takes a dataset and a list of feature selection methods to try, and outputs the best (maximum votes) features across all methods?

In [205]:
def preprocess_dataset(dataset_path):
    # Your code starts here (Multiple lines)
    from sklearn.preprocessing import LabelEncoder

    data = pd.read_csv(dataset_path)
    # 'wip' has missing values; fill them with 0
    data['wip'] = data['wip'].fillna(0)
    # strip stray whitespace from the department labels
    data['department'] = data['department'].str.strip()

    # encode the categorical columns as integer codes
    label_encoder = LabelEncoder()
    for col in ['department', 'day', 'quarter']:
        data[col] = label_encoder.fit_transform(data[col])

    # the target is actual_productivity; date and targeted_productivity are not used as features
    y = data['actual_productivity']
    X = data.drop(columns=['date', 'targeted_productivity', 'actual_productivity'])
    num_feats = 10
    # Your code ends here
    return X, y, num_feats

In [206]:
def autoFeatureSelector(dataset_path, methods=None):
    # Parameters
    # dataset_path - path to the dataset to be analyzed (csv file)
    # methods - names of the feature selection methods to run (list)

    # method name -> (column label, selector function defined above)
    selectors = {
        'pearson': ('Pearson', cor_selector),
        'chi-square': ('mutual info regression', chi_squared_selector),
        'rfe': ('RFE', rfe_selector),
        'log-reg': ('Linear Regression', embedded_log_reg_selector),
        'rf': ('Random Forest', embedded_rf_selector),
        'lgbm': ('LightGBM', embedded_lgbm_selector),
    }
    # avoid a mutable default argument
    methods = methods or []

    # preprocessing
    X, y, num_feats = preprocess_dataset(dataset_path)

    # Run only the requested methods and collect the boolean support mask of each
    votes = {'Feature': X.columns.tolist()}
    for method in methods:
        label, selector = selectors[method]
        support, _ = selector(X, y, num_feats)
        votes[label] = support

    # Combine the masks and count how many methods selected each feature
    #### Your Code starts here (Multiple lines)
    feature_selection_df = pd.DataFrame(votes)
    feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
    # sort by vote count, breaking ties by feature name
    feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
    feature_selection_df.index = range(1, len(feature_selection_df)+1)
    # keep the 5 features with the most votes
    best_features = feature_selection_df['Feature'].tolist()[:5]
    #### Your Code ends here
    return best_features

In [207]:
best_features = autoFeatureSelector(dataset_path="garments_worker_productivity.csv", methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'lgbm'])




Fitting estimator with 12 features.


### Last, can you turn this notebook into a Python script, run it, and submit the Python (.py) file that takes a dataset and a list of methods as inputs and outputs the best features?

In [201]:
best_features

['team', 'smv', 'over_time', 'no_of_workers', 'no_of_style_change']
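
One possible shape for the requested script is a thin command-line wrapper around `autoFeatureSelector`. The sketch below is only an outline under stated assumptions: the module name `auto_feature_selector` is hypothetical (it stands for wherever the notebook's functions are saved), and the method names are the five used in the call above.

```python
import argparse

# Method names used in the notebook's final call to autoFeatureSelector.
AVAILABLE_METHODS = ["pearson", "chi-square", "rfe", "log-reg", "lgbm"]

def parse_args(argv=None):
    """Parse the dataset path and the list of selection methods."""
    parser = argparse.ArgumentParser(description="Vote-based automatic feature selection")
    parser.add_argument("dataset_path", help="path to the CSV dataset")
    parser.add_argument("--methods", nargs="+", choices=AVAILABLE_METHODS,
                        default=AVAILABLE_METHODS, help="selection methods to run")
    return parser.parse_args(argv)

def main(argv=None, selector=None):
    """Run the selector on the parsed arguments and print the winning features."""
    args = parse_args(argv)
    if selector is None:
        # Assumption: the notebook's functions were saved to auto_feature_selector.py
        from auto_feature_selector import autoFeatureSelector as selector
    best_features = selector(dataset_path=args.dataset_path, methods=args.methods)
    print(best_features)
    return best_features
```

In the saved script, `main()` would be called under the usual `if __name__ == "__main__":` guard, and the tool could then be run as, for example, `python script.py garments_worker_productivity.csv --methods pearson rfe lgbm` (the script name is a placeholder). Passing the selector in as a parameter keeps the argument handling testable without loading the dataset.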