# Task 7: AutoFeatureSelector Tool
### This task tests your understanding of the feature selection methods outlined in the lecture and your ability to apply them to a real-world dataset: select the best features and build an automated feature selection tool for your toolkit

### Use your knowledge of different feature selector methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats
import warnings
warnings.filterwarnings('ignore')  # Ignore notebook warnings to keep it clearer

In [2]:
player_df = pd.read_csv("fifa19.csv")

In [3]:
player_df.head(5)

| | Unnamed: 0 | ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 158023 | L. Messi | 31 | https://cdn.sofifa.org/players/4/19/158023.png | Argentina | https://cdn.sofifa.org/flags/52.png | 94 | 94 | FC Barcelona | ... | 96.0 | 33.0 | 28.0 | 26.0 | 6.0 | 11.0 | 15.0 | 14.0 | 8.0 | €226.5M |
| 1 | 1 | 20801 | Cristiano Ronaldo | 33 | https://cdn.sofifa.org/players/4/19/20801.png | Portugal | https://cdn.sofifa.org/flags/38.png | 94 | 94 | Juventus | ... | 95.0 | 28.0 | 31.0 | 23.0 | 7.0 | 11.0 | 15.0 | 14.0 | 11.0 | €127.1M |
| 2 | 2 | 190871 | Neymar Jr | 26 | https://cdn.sofifa.org/players/4/19/190871.png | Brazil | https://cdn.sofifa.org/flags/54.png | 92 | 93 | Paris Saint-Germain | ... | 94.0 | 27.0 | 24.0 | 33.0 | 9.0 | 9.0 | 15.0 | 15.0 | 11.0 | €228.1M |
| 3 | 3 | 193080 | De Gea | 27 | https://cdn.sofifa.org/players/4/19/193080.png | Spain | https://cdn.sofifa.org/flags/45.png | 91 | 93 | Manchester United | ... | 68.0 | 15.0 | 21.0 | 13.0 | 90.0 | 85.0 | 87.0 | 88.0 | 94.0 | €138.6M |
| 4 | 4 | 192985 | K. De Bruyne | 27 | https://cdn.sofifa.org/players/4/19/192985.png | Belgium | https://cdn.sofifa.org/flags/7.png | 91 | 92 | Manchester City | ... | 88.0 | 68.0 | 58.0 | 51.0 | 15.0 | 13.0 | 5.0 | 10.0 | 13.0 | €196.4M |


In [4]:
# player_df.info()

In [5]:
#     for col in player_df.columns:
#         if player_df[col].nunique() == player_df.shape[0]:
#             print(col)        

In [6]:
# counts = player_df['Name'].nunique()
# counts

In [7]:
# counts = player_df.nunique()
# counts

In [8]:
# player_df['Photo']

In [9]:
# player_df.columns

In [10]:
# player_df.columns.values

In [11]:
player_df.shape

(18207, 89)

In [12]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [13]:
# counts = player_df.nunique()
# counts

In [14]:
player_df = player_df[numcols+catcols]

In [15]:
# columns_with_missing_values = player_df.columns[player_df.isnull().any()]
# columns_with_missing_values

In [16]:
# player_df.shape

In [17]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [18]:
# traindf.shape

In [19]:
traindf = pd.DataFrame(traindf,columns=features)

In [20]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [21]:
X.head()

| | Crossing | Finishing | ShortPassing | Dribbling | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Stamina | ... | Nationality_Uganda | Nationality_Ukraine | Nationality_United Arab Emirates | Nationality_United States | Nationality_Uruguay | Nationality_Uzbekistan | Nationality_Venezuela | Nationality_Wales | Nationality_Zambia | Nationality_Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84.0 | 95.0 | 90.0 | 97.0 | 87.0 | 96.0 | 91.0 | 86.0 | 91.0 | 72.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 84.0 | 94.0 | 81.0 | 88.0 | 77.0 | 94.0 | 89.0 | 91.0 | 87.0 | 88.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 79.0 | 87.0 | 84.0 | 96.0 | 78.0 | 95.0 | 94.0 | 90.0 | 96.0 | 81.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 17.0 | 13.0 | 50.0 | 18.0 | 51.0 | 42.0 | 57.0 | 58.0 | 60.0 | 43.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 93.0 | 82.0 | 92.0 | 86.0 | 91.0 | 91.0 | 78.0 | 76.0 | 79.0 | 90.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [22]:
X.shape

(18159, 223)

In [23]:
len(X.columns)

223

In [24]:
y.shape

(18159,)

In [25]:
player_df.shape

(18207, 25)

In [26]:
traindf.shape

(18159, 224)

In [27]:
traindf.to_csv('fifa_traindf.csv')

### Set some fixed set of features

In [28]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=30

In [29]:
# feature_name

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [30]:
def cor_selector(X, y,num_feats):
    # Your code goes here (Multiple lines)
    
    # Make a feature list
    feature_name = X.columns.tolist()
    
    # Calculate the correlation with y for each feature and collect all correlation values in a list
    cor_list = []  
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    
    # Replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    
    # Keep the num_feats features with the largest absolute correlation
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    
    # Boolean mask over all features: True if the feature was selected
    cor_support = [True if i in cor_feature else False for i in feature_name]
    
    # Your code ends here
    return cor_support, cor_feature

In [31]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features


### List the selected features from Pearson Correlation

In [32]:
cor_feature

['Nationality_Costa Rica',
 'Position_LAM',
 'Nationality_Uruguay',
 'Acceleration',
 'SprintSpeed',
 'Strength',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Stamina',
 'Weak Foot',
 'Agility',
 'Crossing',
 'Nationality_Belgium',
 'Dribbling',
 'ShotPower',
 'LongShots',
 'Finishing',
 'BallControl',
 'FKAccuracy',
 'LongPassing',
 'Volleys',
 'ShortPassing',
 'Position_RF',
 'Position_LF',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Body Type_Courtois',
 'Body Type_Neymar',
 'Body Type_Messi',
 'Body Type_C. Ronaldo',
 'Reactions']
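
The per-column loop in `cor_selector` can also be written with pandas' `DataFrame.corrwith`. The sketch below is one possible vectorized variant, not the notebook's implementation. The name `cor_selector_vectorized` is hypothetical, and it returns the selected names in descending order of absolute correlation rather than ascending.

```python
import numpy as np
import pandas as pd

def cor_selector_vectorized(X, y, num_feats):
    # Hypothetical helper: align the target with X's index and cast it to float
    target = pd.Series(np.asarray(y, dtype=float), index=X.index)
    # Pearson correlation of every column with the target;
    # constant columns yield NaN, which is replaced with 0
    cor = X.astype(float).corrwith(target).fillna(0.0)
    # Names of the num_feats columns with the largest absolute correlation
    cor_feature = cor.abs().nlargest(num_feats).index.tolist()
    selected = set(cor_feature)
    # Boolean mask over all columns, in the original column order
    cor_support = [name in selected for name in X.columns]
    return cor_support, cor_feature
```

Called as `cor_selector_vectorized(X, y, num_feats)`, it should pick the same set of columns as the loop version whenever there are no ties at the cutoff.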

## Filter Feature Selection - Chi-Square

In [33]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [34]:
def chi_squared_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Make a feature list
    feature_name = X.columns.tolist()
    
    # Apply SelectKBest class to score every feature with the chi-squared test
    # (chi2 requires non-negative feature values)
    bestfeatures = SelectKBest(score_func=chi2, k= num_feats)
    
    # Fit the selector to compute the chi-squared scores
    topfeatures = bestfeatures.fit(X,y)
    
    # Keep the num_feats features with the highest scores and build the boolean support mask
    chi_feature = X.iloc[:,np.argsort(np.abs(topfeatures.scores_))[-num_feats:]].columns.tolist()
    chi_support = [True if i in chi_feature else False for i in feature_name]
    
    # Your code ends here
    return chi_support, chi_feature


In [35]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features


In [36]:
len(chi_feature)

30

In [37]:
len(chi_support)

223

### List the selected features from Chi-Square 

In [38]:
# chi_support

In [39]:
chi_feature

['Nationality_Uruguay',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Balance',
 'Nationality_Belgium',
 'Strength',
 'Aggression',
 'Interceptions',
 'Acceleration',
 'SprintSpeed',
 'Position_RF',
 'Position_LF',
 'Stamina',
 'Agility',
 'Crossing',
 'Dribbling',
 'ShotPower',
 'ShortPassing',
 'BallControl',
 'LongPassing',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Body Type_Neymar',
 'Body Type_Messi',
 'Body Type_Courtois',
 'Body Type_C. Ronaldo',
 'LongShots',
 'FKAccuracy',
 'Finishing',
 'Volleys',
 'Reactions']
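
scikit-learn's `chi2` only accepts non-negative feature values. The FIFA features used here satisfy that, but on other datasets the selector would fail unless the features are rescaled first. The sketch below shows one way to do that with the `MinMaxScaler` imported above. The name `chi_squared_selector_scaled` is hypothetical, and scaling changes the scores, so the selection may differ from the unscaled version.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

def chi_squared_selector_scaled(X, y, num_feats):
    # Rescale every column to [0, 1] so that chi2 never sees negative values
    X_scaled = MinMaxScaler().fit_transform(X)
    # Score the scaled features and keep the num_feats best ones
    selector = SelectKBest(score_func=chi2, k=num_feats).fit(X_scaled, y)
    support = selector.get_support()
    # Map the boolean mask back to the original column names
    features = X.columns[support].tolist()
    return support.tolist(), features
```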

## Wrapper Feature Selection - Recursive Feature Elimination

In [40]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [41]:
def rfe_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Build a Logistic Regression model
    lr = LogisticRegression(solver='lbfgs')
    
    # Build RFE model with Logistic Regression as Learning Algorithm / Estimator
    rfe_lr = RFE(estimator=lr,
                 n_features_to_select=num_feats,
                 step=1,
                 verbose=0  # verbose=5
                 )
    # Train the RFE model
    rfe_lr = rfe_lr.fit(X, y)
    
    # Get Support from the model
    rfe_lr_support = rfe_lr.get_support()
    
    # Best features from the model
    rfe_lr_feature = X.loc[:, rfe_lr_support].columns.tolist()
    
    # Your code ends here
    return rfe_lr_support, rfe_lr_feature

In [42]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

30 selected features


### List the selected features from RFE

In [43]:
rfe_feature

['Reactions',
 'Position_CAM',
 'Position_CDM',
 'Position_CF',
 'Position_CM',
 'Position_GK',
 'Position_LAM',
 'Position_LCB',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RCB',
 'Position_RS',
 'Position_RW',
 'Body Type_Courtois',
 'Body Type_Lean',
 'Nationality_Bosnia Herzegovina',
 'Nationality_Brazil',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Denmark',
 'Nationality_Gabon',
 'Nationality_Germany',
 'Nationality_Greece',
 'Nationality_Netherlands',
 'Nationality_Portugal',
 'Nationality_Senegal',
 'Nationality_Serbia',
 'Nationality_Slovenia',
 'Nationality_Wales']
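
The RFE result above is dominated by one-hot columns. One plausible reason (an assumption, not verified on this dataset) is that the 0-100 skill ratings and the 0/1 dummies are on very different scales, and RFE ranks features by the magnitude of the logistic regression coefficients. The sketch below scales the features first and exposes `step` to trade accuracy of the elimination order for speed. The name `rfe_selector_scaled` and the defaults `step=10` and `max_iter=1000` are hypothetical choices.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

def rfe_selector_scaled(X, y, num_feats, step=10):
    # Put all features on the same [0, 1] scale so coefficients are comparable
    X_scaled = MinMaxScaler().fit_transform(X)
    lr = LogisticRegression(solver='lbfgs', max_iter=1000)
    # Remove `step` features per iteration until num_feats remain
    rfe = RFE(estimator=lr, n_features_to_select=num_feats, step=step)
    rfe.fit(X_scaled, y)
    support = rfe.get_support()
    return support, X.columns[support].tolist()
```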

## Embedded Selection - Lasso: SelectFromModel

In [44]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [45]:
def embedded_log_reg_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Build an L1-penalized (Lasso-style) Logistic Regression model
    logreg = LogisticRegression(penalty='l1', solver='liblinear', max_iter=50000)
    
    # Build Embedded model with Logistic Regression as Learning Algorithm / Estimator
    embedded_lr_selector = SelectFromModel(logreg, max_features=num_feats)
    
    # Train the embedded selector
    embedded_lr_selector = embedded_lr_selector.fit(X, y)
    
    # Get Support from the model
    embedded_lr_support = embedded_lr_selector.get_support()
    
    # Best features from the model
    embedded_lr_feature = X.loc[:, embedded_lr_support].columns.tolist()
    
    # Your code ends here
    return embedded_lr_support, embedded_lr_feature

In [46]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

30 selected features


In [47]:
embedded_lr_feature

['Crossing',
 'Finishing',
 'Dribbling',
 'LongPassing',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CM',
 'Position_LCB',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RW',
 'Body Type_Lean',
 'Nationality_Brazil',
 'Nationality_Croatia',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Germany',
 'Nationality_Italy',
 'Nationality_Netherlands',
 'Nationality_Uruguay']

## Tree based(Random Forest): SelectFromModel

In [48]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [49]:
def embedded_rf_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Build the Random Forest model
    rf = RandomForestClassifier(n_estimators=100)
    
    # Build the Embedded model with Random Forest as the Learning Algorithm
    embedded_rf_selector = SelectFromModel(rf, max_features=num_feats)
    
    # Train the model
    embedded_rf_selector = embedded_rf_selector.fit(X, y)
    
    # Get support
    embedded_rf_support = embedded_rf_selector.get_support()
    
    # Best Features
    embedded_rf_feature = X.loc[:, embedded_rf_support].columns.tolist()
    
    # Your code ends here
    return embedded_rf_support, embedded_rf_feature

In [50]:
embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embedded_rf_feature)), 'selected features')

26 selected features


In [51]:
len(embedded_rf_support)

223

In [52]:
len(embedded_rf_feature)

26

In [53]:
embedded_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Body Type_Courtois',
 'Body Type_Normal',
 'Nationality_Belgium',
 'Nationality_Germany',
 'Nationality_Slovenia',
 'Nationality_Spain']
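
The Random Forest selector returned 26 features rather than 30 because `max_features` is only an upper bound: `SelectFromModel` also applies its `threshold`, which for a tree ensemble defaults to the mean feature importance. The sketch below shows the documented way to select purely on `max_features`, by setting `threshold=-np.inf`. The name `embedded_rf_selector_topk` and the `random_state` argument are hypothetical additions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def embedded_rf_selector_topk(X, y, num_feats, random_state=0):
    # Fixed seed so that repeated runs select the same features
    rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    # threshold=-np.inf disables the importance cutoff, so exactly
    # num_feats features are kept (given at least that many columns)
    selector = SelectFromModel(rf, threshold=-np.inf, max_features=num_feats)
    selector.fit(X, y)
    support = selector.get_support()
    return support, X.columns[support].tolist()
```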

## Tree based(Light GBM): SelectFromModel

In [54]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [55]:
def embedded_lgbm_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Choose LightGBM as your learning algorithm
    lgbmc = LGBMClassifier(n_estimators=500,
                           learning_rate=0.05,
                           num_leaves=32,
                           colsample_bytree=0.2,
                           reg_alpha=3,
                           reg_lambda=1,
                           min_split_gain=0.01,
                           min_child_weight=40
                          )
    
    # Build an embedded model with LGBM as learning algorithm
    embedded_lgbm_selector = SelectFromModel(lgbmc, max_features=num_feats)
    
    # Train the model
    embedded_lgbm_selector = embedded_lgbm_selector.fit(X, y)
    
    # Get the support
    embedded_lgbm_support = embedded_lgbm_selector.get_support()
    
    # Get the feature names
    embedded_lgbm_feature = X.loc[:, embedded_lgbm_support].columns.tolist()
    
    # Your code ends here
    return embedded_lgbm_support, embedded_lgbm_feature

In [56]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature )), 'selected features')

30 selected features


In [57]:
embedded_lgbm_feature 

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Left',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CB',
 'Position_CDM',
 'Position_CF',
 'Position_CM',
 'Position_GK',
 'Position_LAM',
 'Position_LB']

## Putting all of it together: AutoFeatureSelector Tool

In [58]:
pd.set_option('display.max_rows', None)

# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 
                                     'Pearson':cor_support, 
                                     'Chi-2':chi_support, 
                                     'RFE':rfe_support, 
                                     'Logistics':embedded_lr_support,
                                     'Random Forest':embedded_rf_support, 
                                     'LightGBM':embedded_lgbm_support})

# count the selected times for each feature (sum only the boolean method columns)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)

# display the top num_feats features
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)

| | Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | LightGBM | Total |
|---|---|---|---|---|---|---|---|---|
| 1 | Reactions | True | True | True | True | True | True | 6 |
| 2 | SprintSpeed | True | True | False | True | True | True | 5 |
| 3 | LongPassing | True | True | False | True | True | True | 5 |
| 4 | Finishing | True | True | False | True | True | True | 5 |
| 5 | FKAccuracy | True | True | False | True | True | True | 5 |
| 6 | Dribbling | True | True | False | True | True | True | 5 |
| 7 | Crossing | True | True | False | True | True | True | 5 |
| 8 | Agility | True | True | False | True | True | True | 5 |
| 9 | Acceleration | True | True | False | True | True | True | 5 |
| 10 | Weak Foot | True | False | False | True | True | True | 4 |
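
The voting step can be isolated into a small helper. The sketch below, with the hypothetical name `count_votes`, builds the same kind of table from a dictionary of boolean support masks and sums only the method columns, so the string `Feature` column never takes part in the arithmetic.

```python
import pandas as pd

def count_votes(feature_names, supports):
    # supports maps a method name to a boolean mask over feature_names
    df = pd.DataFrame({'Feature': feature_names, **supports})
    # One vote per method that selected the feature
    df['Total'] = df.drop(columns='Feature').sum(axis=1)
    # Most votes first; ties are broken by feature name, descending
    df = df.sort_values(['Total', 'Feature'], ascending=False)
    return df.reset_index(drop=True)

votes = count_votes(['a', 'b', 'c'],
                    {'m1': [True, False, True],
                     'm2': [True, True, False]})
```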


## Can you build a Python script that takes a dataset and a list of feature selection methods to try, and outputs the best features (those with the most votes across all methods)?

In [59]:
def preprocess_dataset(dataset_path, num_feats = 30):
    
    # Your code starts here (Multiple lines)

    #     Assumptions:
    #         - The label is the last numeric column of the dataset.
    #         - A categorical feature with MAX_ONE_HOT_LENGTH_ALLOWED or more unique values is not useful to the model in its raw form
    #           (it might be useful after some other transformation), so it is dropped.
    #         - The cleaner the dataset, the better the results:
    #               - drop obviously useless features in advance;
    #               - dropping the less obvious ones is even better;
    #               - store currency columns as float dtype (not as 'object'), etc.
 
    import re
    MAX_ONE_HOT_LENGTH_ALLOWED = 35  # Drop categorical variables with too many unique values
    DROP_INFO_GAIN_FACTOR =0.9       # Drop features with poor or no information gain (IDs, ticket numbers, correlatives, etc.)
    
    # Read the dataset   
    dataset = pd.read_csv(dataset_path)
    
    #  Rename the feature columns with no special characters to avoid errors in algorithms (e.g. "Light GBM")
    dataset = dataset.rename(columns = lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
    
    # Drop the columns with almost no "Information Gain":
    # a column whose number of unique values is equal or very close to the number of rows identifies each row almost uniquely,
    # so it carries little information that generalizes. This removes features such as IDs, sequential numbers, ticket numbers, etc.
    # Set DROP_INFO_GAIN_FACTOR to 1 for a more conservative approach.
    for col in dataset.columns:
        if dataset[col].nunique() >= dataset.shape[0]*DROP_INFO_GAIN_FACTOR:
            dataset.drop(col, axis='columns', inplace=True)
    
    # Create a list to hold the names of the data types: int16, int32, ... 
    num_cols = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    
    # Separate all numeric columns
    dataset_numcols_only = dataset.select_dtypes(include=num_cols)

    # Initializing MinMaxScaler
    min_max = MinMaxScaler()

    # Scaling down the numeric variables
    # We exclude the last column (labels) using iloc()
    dataset_numcols = pd.DataFrame( min_max.fit_transform(dataset_numcols_only.iloc[:,0:-1]), 
                                                          columns=dataset_numcols_only.iloc[:,0:-1].columns.tolist() )
    
    # Separate all categorical columns
    dataset_catcol = dataset.select_dtypes(exclude=num_cols)
    
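    # Fill NaNs with the mode for categorical columns
    # (mode() returns a DataFrame, so take its first row to get one fill value per column)
    if dataset_catcol.shape[1] > 0:
        dataset_catcol = dataset_catcol.fillna(dataset_catcol.mode().iloc[0])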
    
    # Remove features with too many unique values >= MAX_ONE_HOT_LENGTH_ALLOWED
    for col in dataset_catcol.columns.values:
        if dataset_catcol[col].nunique() >= MAX_ONE_HOT_LENGTH_ALLOWED:
            dataset_catcol.drop([col],axis=1, inplace=True)
            
    # Transform categorical features to one-hot encoding 
    # Steps to one-hot encoding:
    # - Iterate through each categorical column name
    # - Create encoded variables for each categorical columns
    # - Concatenate the encoded variables column to the data frame
    # - Remove the original categorical variable column
    for col in dataset_catcol.columns.values:
        one_hot_encoded_variables = pd.get_dummies(dataset_catcol[col], prefix=col,sparse=False)
        dataset_catcol = pd.concat([dataset_catcol, one_hot_encoded_variables], axis=1)
        dataset_catcol.drop([col],axis=1, inplace=True)

    # Concatenate both numeric and one-hot encoded columns in the data frame
    dataset_almost_final = pd.concat([dataset_numcols, dataset_catcol], axis=1)
    
    # Concatenate 'label column' (last column) to the final data frame
    dataset_final = pd.concat([dataset_almost_final, dataset_numcols_only.iloc[:,-1]], axis=1)
    
    # Delete NaNs as an initial rough approach for feature selection
    dataset_final.dropna(axis=0, inplace=True)
    
    # Split dataset in Features and Labels
    X = dataset_final.iloc[:,0:-1]
    y = dataset_final.iloc[:,-1] 

    # Your code ends here
    return X, y, num_feats
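
When filling categorical NaNs with the mode, note that `DataFrame.mode()` returns a DataFrame with one row per tied mode, and `fillna` aligns a DataFrame argument on the row index. Passing `df.mode()` directly therefore only fills rows whose index labels also appear in the mode frame. The sketch below, with the hypothetical name `fill_with_mode`, takes the first mode row so that every NaN in a column receives that column's mode.

```python
import pandas as pd

def fill_with_mode(df):
    # mode().iloc[0] is a Series indexed by column name: one fill value per column
    return df.fillna(df.mode().iloc[0])

toy = pd.DataFrame({'foot': ['Right', 'Right', None, 'Left', None]})
filled = fill_with_mode(toy)
```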

In [60]:
# pd.options.display.max_columns = None
# # pd.options.display.max_rows = None
# pd.options.display.max_rows = 10
# X.head()

In [61]:
def autoFeatureSelector(dataset_path, methods=None):
    # Parameters
    # dataset_path - path to the dataset to be analyzed (csv file)
    # methods - names of the feature selection methods to run (list);
    #           valid names are the keys of `selectors` below
    methods = list(methods) if methods else []
    
    # preprocessing
    X, y, num_feats = preprocess_dataset(dataset_path)
    feature_name = list(X.columns)
    
    # Map each method name to the selector function defined earlier in the notebook
    selectors = {'pearson': cor_selector,
                 'chi-square': chi_squared_selector,
                 'rfe': rfe_selector,
                 'log-reg': embedded_log_reg_selector,
                 'rf': embedded_rf_selector,
                 'lgbm': embedded_lgbm_selector}
    
    # Run every requested method and store its support mask under the method's own name,
    # so each column is labeled correctly whatever the order of `methods`
    dict_of_selectors = {'Feature': feature_name}
    for method in methods:
        if method not in selectors:
            raise ValueError('Unknown feature selection method: ' + str(method))
        support, _ = selectors[method](X, y, num_feats)
        dict_of_selectors[method] = support
    
    # Combine the support masks and count how many methods selected each feature
    
    #### Your Code starts here (Multiple lines)
    
    # put all selection together
    feature_selection_df = pd.DataFrame(dict_of_selectors)
    
    # count the selected times for each feature (sum only the boolean method columns)
    feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)

    # keep the num_feats features with the most votes
    feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
    feature_selection_df.index = range(1, len(feature_selection_df)+1)
    best_features = feature_selection_df.head(num_feats)['Feature'].to_list()
    
    #### Your Code ends here
    return best_features

In [62]:
%%time
best_features = autoFeatureSelector(dataset_path="fifa19.csv", 
                                    methods=['pearson', 
                                             'chi-square', 
                                             'rfe', 
                                             'log-reg', 
                                             'rf', 
                                             'lgbm'])
best_features
# Wall time: 5min 36s

CPU times: total: 26min 29s
Wall time: 5min 36s


['Stamina',
 'Special',
 'Marking',
 'HeadingAccuracy',
 'GKPositioning',
 'GKHandling',
 'SprintSpeed',
 'ShotPower',
 'Positioning',
 'Penalties',
 'GKKicking',
 'GKDiving',
 'Finishing',
 'Dribbling',
 'Curve',
 'Agility',
 'Aggression',
 'Acceleration',
 'Volleys',
 'SlidingTackle',
 'ShortPassing',
 'Position_GK',
 'LongShots',
 'JerseyNumber',
 'FKAccuracy',
 'Crossing',
 'BallControl',
 'Balance',
 'Vision',
 'Strength']

### Last, can you turn this notebook into a Python script that takes a dataset and a list of methods as inputs and outputs the best features? Run it and submit the Python (.py) file

Provided in a separate file.
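
The submitted script is not shown here. As a rough sketch of what its command-line entry point could look like, the block below uses `argparse` to read the dataset path and the list of methods. The names `parse_args`, `main` and `METHODS` are hypothetical, and `selector` stands in for the notebook's `autoFeatureSelector`, which would have to be defined or imported in the script.

```python
import argparse

# Method names accepted by the notebook's autoFeatureSelector
METHODS = ['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm']

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description='Vote-based automatic feature selection')
    parser.add_argument('dataset_path', help='path to the CSV dataset')
    parser.add_argument('--methods', nargs='+', choices=METHODS,
                        default=METHODS,
                        help='feature selection methods to run')
    return parser.parse_args(argv)

def main(selector, argv=None):
    # selector is a callable with the signature of autoFeatureSelector
    args = parse_args(argv)
    best_features = selector(args.dataset_path, methods=args.methods)
    for feature in best_features:
        print(feature)
    return best_features
```

In the script, `main(autoFeatureSelector)` would be called under an `if __name__ == '__main__':` guard, for example as `python script.py fifa19.csv --methods pearson rf` (the file name `script.py` is a placeholder).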

In [63]:
# !pip install catboost