# Task 7: AutoFeatureSelector Tool
## This task tests your understanding of the feature selection methods outlined in the lecture and your ability to apply them to a real-world dataset, both to select the best features and to build an automated feature selection tool for your toolkit

### Use your knowledge of different feature selection methods to build an Automatic Feature Selection tool:
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pwd
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/GBC')
!pwd

/content
/content/drive/My Drive/Colab Notebooks/GBC


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [None]:
player_df = pd.read_csv("fifa19.csv")

In [None]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [None]:
player_df = player_df[numcols+catcols]

In [None]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [None]:
traindf = pd.DataFrame(traindf,columns=features)

In [None]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [None]:
X.head()

Unnamed: 0,Crossing,Finishing,ShortPassing,Dribbling,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Stamina,...,Nationality_Uganda,Nationality_Ukraine,Nationality_United Arab Emirates,Nationality_United States,Nationality_Uruguay,Nationality_Uzbekistan,Nationality_Venezuela,Nationality_Wales,Nationality_Zambia,Nationality_Zimbabwe
0,84.0,95.0,90.0,97.0,87.0,96.0,91.0,86.0,91.0,72.0,...,False,False,False,False,False,False,False,False,False,False
1,84.0,94.0,81.0,88.0,77.0,94.0,89.0,91.0,87.0,88.0,...,False,False,False,False,False,False,False,False,False,False
2,79.0,87.0,84.0,96.0,78.0,95.0,94.0,90.0,96.0,81.0,...,False,False,False,False,False,False,False,False,False,False
3,17.0,13.0,50.0,18.0,51.0,42.0,57.0,58.0,60.0,43.0,...,False,False,False,False,False,False,False,False,False,False
4,93.0,82.0,92.0,86.0,91.0,91.0,78.0,76.0,79.0,90.0,...,False,False,False,False,False,False,False,False,False,False


In [None]:
len(X.columns)

223

### Set some fixed set of features

In [None]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=30

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [None]:
def cor_selector(X, y,num_feats):
    cor_list = []
    feature_list = []

    for i in range(len(X.columns)):
        # Calculate correlation between numerical features and target
        cor = np.corrcoef(X.iloc[:, i], y)[0, 1]
        cor_list.append(cor)
        feature_list.append(X.columns[i])

    # Sort features by absolute correlation value
    zipped_lists = zip(cor_list, feature_list)
    sorted_pairs = sorted(zipped_lists, key=lambda x: abs(x[0]))

    # Select top num_feats features based on absolute correlation
    selected_features = [x[1] for x in sorted_pairs[:num_feats]]

    # Return support for selected features and list of selected features
    cor_support = [True if i in selected_features else False for i in X.columns]
    cor_feature = selected_features

    return cor_support, cor_feature


In [None]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features


### List the selected features from Pearson Correlation

In [None]:
cor_feature

['Nationality_Denmark',
 'Nationality_Portugal',
 'Nationality_Poland',
 'Nationality_Fiji',
 'Nationality_Oman',
 'Nationality_São Tomé & Príncipe',
 'Nationality_Malta',
 'Nationality_South Sudan',
 'Nationality_Qatar',
 'Nationality_United Arab Emirates',
 'Body Type_Shaqiri',
 'Nationality_Kuwait',
 'Nationality_New Caledonia',
 'Nationality_Belize',
 'Nationality_Guam',
 'Nationality_Grenada',
 'Nationality_Indonesia',
 'Nationality_Liberia',
 'Nationality_Mauritius',
 'Nationality_Palestine',
 'Nationality_Jordan',
 'Nationality_Nicaragua',
 'Nationality_Lebanon',
 'Nationality_Puerto Rico',
 'Nationality_Rwanda',
 'Body Type_Akinfenwa',
 'Nationality_Botswana',
 'Nationality_Andorra',
 'Nationality_St Lucia',
 'Nationality_Ethiopia']
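
Almost every name in the list above is a rare nationality dummy, which is what an ascending sort would produce: `sorted(..., key=lambda x: abs(x[0]))` orders from the smallest absolute correlation upward, so `sorted_pairs[:num_feats]` keeps the weakest features rather than the strongest. The sketch below is one possible variant that keeps the largest |r| instead. The name `cor_selector_desc`, the toy columns, and the choice to score constant columns as 0 are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def cor_selector_desc(X, y, num_feats):
    """Hypothetical variant: keep the num_feats features with the largest |r|."""
    y_arr = np.asarray(y, dtype=float)
    cors = []
    for col in X.columns:
        x = X[col].to_numpy(dtype=float)
        if x.std() == 0:
            # Correlation is undefined for a constant column; score it 0 (assumption)
            cors.append(0.0)
        else:
            cors.append(np.corrcoef(x, y_arr)[0, 1])
    abs_cors = np.abs(np.nan_to_num(np.array(cors)))
    # argsort is ascending, so the last num_feats positions hold the strongest features
    top_idx = set(np.argsort(abs_cors)[-num_feats:].tolist())
    support = [i in top_idx for i in range(X.shape[1])]
    feature = [col for col, keep in zip(X.columns, support) if keep]
    return support, feature

# Toy data: one column tracks the target, one is noise, one is constant
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy_X = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=200),
    "weak": rng.normal(size=200),
    "constant": np.zeros(200),
})
toy_y = signal > 0
toy_support, toy_feature = cor_selector_desc(toy_X, toy_y, num_feats=1)
```

On this toy data the selector keeps the column that tracks the target and drops the noise and constant columns.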

## Filter Feature Selection - Chi-Square

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [None]:
def chi_squared_selector(X, y, num_feats):
    # Apply SelectKBest class to extract top 'num_feats' best features using Chi-Square test
    bestfeatures = SelectKBest(score_func=chi2, k=num_feats)
    fit = bestfeatures.fit(X, y)

    # Get the scores for each feature
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)

    # Concatenate two dataframes for better visualization
    featureScores = pd.concat([dfcolumns, dfscores], axis=1)
    featureScores.columns = ['Feature', 'Score']  # Naming the dataframe columns

    # Get the 'num_feats' best features
    chi_support = bestfeatures.get_support()
    chi_feature = X.loc[:,chi_support].columns.tolist()

    return chi_support, chi_feature


In [None]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features


### List the selected features from Chi-Square

In [None]:
chi_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LF',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
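
`chi2` in scikit-learn only accepts non-negative feature values. The FIFA ratings and one-hot dummies already satisfy that, so the function above can pass `X` through unscaled. For data with negative values, one option is to rescale first with the `MinMaxScaler` imported above. A minimal sketch on made-up data (the column names `informative` and `noise` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=300).astype(bool)
toy = pd.DataFrame({
    # Centred around -1 or +1 depending on the label, so it contains negatives
    "informative": np.where(label, 1.0, -1.0) + 0.3 * rng.normal(size=300),
    "noise": rng.normal(size=300),
})

# Rescale every column into [0, 1] so chi2 receives non-negative inputs
scaled = MinMaxScaler().fit_transform(toy)

selector = SelectKBest(score_func=chi2, k=1).fit(scaled, label)
chi_demo_feature = toy.columns[selector.get_support()].tolist()
```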

## Wrapper Feature Selection - Recursive Feature Elimination

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [None]:
def rfe_selector(X, y, num_feats):
    # Initialize the MinMaxScaler
    scaler = MinMaxScaler()

    # Scale the features
    X_scaled = scaler.fit_transform(X)

    # Initialize the model
    model = LogisticRegression()

    # Initialize RFE
    rfe = RFE(estimator=model, n_features_to_select=num_feats)

    # Fit RFE on scaled features
    fit = rfe.fit(X_scaled, y)

    # Get the selected features
    rfe_support = rfe.support_
    rfe_feature = X.loc[:,rfe_support].columns.tolist()

    return rfe_support, rfe_feature

In [None]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

30 selected features


### List the selected features from RFE

In [None]:
rfe_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Strength',
 'Weak Foot',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LM',
 'Position_RB',
 'Position_RCB',
 'Position_RF',
 'Position_RM',
 'Position_RW',
 'Body Type_Courtois',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Gabon',
 'Nationality_Netherlands',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
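
Besides `support_`, a fitted `RFE` exposes `ranking_`, which records the order of elimination: selected features get rank 1, and larger ranks were dropped earlier. The sketch below shows this on synthetic data from `make_classification`; the data, the variable names, and `max_iter=1000` (raised from the default to reduce the chance of convergence warnings) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the FIFA features: 8 columns, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = MinMaxScaler().fit_transform(X_demo)

# step=1 removes one feature per iteration until 3 remain
rfe_demo = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    step=1,
)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 marks the kept features; ranks 2..6 give the reverse elimination order
demo_ranking = rfe_demo.ranking_.tolist()
```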

## Embedded Selection - Lasso: SelectFromModel

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [None]:
def embedded_log_reg_selector(X, y, num_feats):
    # Initialize the MinMaxScaler
    scaler = MinMaxScaler()

    # Scale the features
    X_scaled = scaler.fit_transform(X)

    # Initialize the model
    model = LogisticRegression(penalty="l1", solver='liblinear')

    # Initialize SelectFromModel
    sfm = SelectFromModel(estimator=model, max_features=num_feats)

    # Fit SelectFromModel
    fit = sfm.fit(X_scaled, y)

    # Get the selected features
    embedded_lr_support = sfm.get_support()
    embedded_lr_feature = X.loc[:,embedded_lr_support].columns.tolist()

    return embedded_lr_support, embedded_lr_feature

In [None]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

27 selected features


In [None]:
embedded_lr_feature

['LongPassing',
 'Reactions',
 'Balance',
 'Aggression',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RCB',
 'Position_RW',
 'Body Type_Lean',
 'Body Type_Stocky',
 'Nationality_Belgium',
 'Nationality_Brazil',
 'Nationality_Croatia',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Germany',
 'Nationality_Italy',
 'Nationality_Netherlands',
 'Nationality_Portugal',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
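
The cell above returns 27 features even though `num_feats` is 30. In `SelectFromModel`, `max_features` is only an upper bound: a feature must also clear `threshold`, which scikit-learn documents as defaulting to `1e-5` for L1-penalised estimators. Coefficients that the L1 penalty shrinks to zero are therefore dropped regardless of the cap. A small sketch on synthetic data, with the threshold written out explicitly and a hypothetical `C=0.05` chosen to make the penalty strong:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X_l1, y_l1 = make_classification(
    n_samples=400, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# Small C means a strong L1 penalty, which zeroes out many coefficients
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, random_state=0)

# threshold=1e-5 is the documented default for L1 estimators, spelled out for clarity
sfm_l1 = SelectFromModel(estimator=l1_model, max_features=15, threshold=1e-5)
sfm_l1.fit(X_l1, y_l1)

n_selected = int(sfm_l1.get_support().sum())
n_nonzero = int((np.abs(sfm_l1.estimator_.coef_[0]) > 1e-5).sum())
```

Here `n_selected` equals the smaller of `n_nonzero` and the cap, which is the same mechanism that can leave the FIFA selector short of `num_feats`.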

## Tree-based (Random Forest): SelectFromModel

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [None]:
def embedded_rf_selector(X, y, num_feats):
    # Initialize the model
    model = RandomForestClassifier(n_estimators=100)

    # Initialize SelectFromModel
    sfm = SelectFromModel(estimator=model, max_features=num_feats)

    # Fit SelectFromModel
    fit = sfm.fit(X, y)

    # Get the selected features
    embedded_rf_support = sfm.get_support()
    embedded_rf_feature = X.loc[:,embedded_rf_support].columns.tolist()

    return embedded_rf_support, embedded_rf_feature

In [None]:
embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embedded_rf_feature)), 'selected features')

25 selected features


In [None]:
embedded_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Body Type_Courtois',
 'Body Type_Lean',
 'Body Type_Normal',
 'Nationality_Costa Rica',
 'Nationality_Slovenia']
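
The random forest selector returns 25 features for a similar reason: with a non-L1 estimator the default `threshold` is the mean importance, and `max_features` only caps what passes it. The scikit-learn documentation suggests `threshold=-np.inf` when selection should depend on `max_features` alone. The following sketch compares the two on synthetic data; fixing `random_state` is an added assumption to keep runs repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_rf, y_rf = make_classification(
    n_samples=300, n_features=12, n_informative=3, n_redundant=0, random_state=0
)
rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Default threshold ("mean"): at most 6 features, possibly fewer
capped = SelectFromModel(rf, max_features=6).fit(X_rf, y_rf)

# threshold=-np.inf disables the importance cut, so exactly 6 features are kept
exact = SelectFromModel(rf, max_features=6, threshold=-np.inf).fit(X_rf, y_rf)

n_capped = int(capped.get_support().sum())
n_exact = int(exact.get_support().sum())
```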

## Tree-based (Light GBM): SelectFromModel

In [None]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [None]:
def embedded_lgbm_selector(X, y, num_feats):
    # Initialize the model
    model = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
            reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)

    # Fit the model
    model.fit(X, y)

    # Initialize SelectFromModel
    sfm = SelectFromModel(estimator=model, max_features=num_feats, prefit=True)

    # Get the selected features
    embedded_lgbm_support = sfm.get_support()
    embedded_lgbm_feature = X.loc[:,embedded_lgbm_support].columns.tolist()

    return embedded_lgbm_support, embedded_lgbm_feature

In [None]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')

[LightGBM] [Info] Number of positive: 55, number of negative: 18104
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006051 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1812
[LightGBM] [Info] Number of data points in the train set: 18159, number of used features: 124
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003029 -> initscore=-5.796555
[LightGBM] [Info] Start training from score -5.796555
30 selected features


In [None]:
embedded_lgbm_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Left',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CB',
 'Position_CDM',
 'Position_CF',
 'Position_CM',
 'Position_GK',
 'Position_LAM',
 'Position_LB']

## Putting all of it together: AutoFeatureSelector Tool

In [None]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embedded_lr_support,
                                    'Random Forest':embedded_rf_support, 'LightGBM':embedded_lgbm_support})

# count the selected times for each feature
# Only sum across the feature selection results, not the 'Feature' column
feature_selection_df['Total'] = np.sum(feature_selection_df[['Pearson', 'Chi-2', 'RFE', 'Logistics', 'Random Forest', 'LightGBM']], axis=1)

# display the top num_feats features
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)

Unnamed: 0,Feature,Pearson,Chi-2,RFE,Logistics,Random Forest,LightGBM,Total
1,Reactions,False,True,True,True,True,True,5
2,LongPassing,False,True,True,True,True,True,5
3,Volleys,False,True,True,False,True,True,4
4,Strength,False,True,True,False,True,True,4
5,SprintSpeed,False,True,True,False,True,True,4
6,ShortPassing,False,True,True,False,True,True,4
7,Nationality_Slovenia,False,True,True,True,True,False,4
8,Finishing,False,True,True,False,True,True,4
9,FKAccuracy,False,True,True,False,True,True,4
10,BallControl,False,True,True,False,True,True,4
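
The `Total` column is a simple vote count, so ties are common (eight of the rows above share a total of 4). A minimal sketch of the same tally on a hypothetical four-feature table, with two optional choices that are not in the notebook: breaking ties alphabetically and keeping only features that reach a minimum number of votes.

```python
import pandas as pd

# Hypothetical support masks from three selectors
votes = pd.DataFrame({
    "Feature": ["Agility", "Balance", "Curve", "Dribbling"],
    "MethodA": [True, True, False, False],
    "MethodB": [True, False, True, False],
    "MethodC": [True, True, True, False],
})

# Sum only the boolean method columns, never the 'Feature' column
method_cols = [c for c in votes.columns if c != "Feature"]
votes["Total"] = votes[method_cols].sum(axis=1)

# Most votes first; ties broken alphabetically (an illustrative choice)
ranked = votes.sort_values(["Total", "Feature"], ascending=[False, True])

top_two = ranked.head(2)["Feature"].tolist()
majority = ranked.loc[ranked["Total"] >= 2, "Feature"].tolist()
```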


## Can you build a Python script that takes a dataset and a list of feature selection methods to try, and outputs the best features (those with the most votes) across all methods?

In [None]:
def preprocess_dataset(dataset_path):
    # Load the dataset
    df = pd.read_csv(dataset_path)

    # Drop unnecessary columns
    df = df.drop(columns=['Unnamed: 0', 'ID', 'Name', 'Photo', 'Nationality', 'Flag',
                          'Club', 'Club Logo', 'Value', 'Wage', 'Special', 'Preferred Foot',
                          'International Reputation', 'Weak Foot', 'Skill Moves', 'Work Rate',
                          'Body Type', 'Real Face', 'Position', 'Jersey Number', 'Joined',
                          'Loaned From', 'Contract Valid Until', 'Height', 'Weight', 'LS',
                          'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM', 'LM',
                          'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM', 'CDM', 'RDM', 'RWB', 'LB', 'LCB',
                          'CB', 'RCB', 'RB', 'Release Clause'])

    # Handle missing values
    # This will depend on your specific dataset
    # For example, you might want to fill missing values with the mean or median,
    # or drop rows/columns with a high percentage of missing values
    df = df.dropna()

    # Encode categorical variables
    # This will depend on your specific dataset
    # For example, you might want to use one-hot encoding or label encoding
    # ...

    # Separate features and target
    # This will depend on what your target variable is
    X = df.drop('Overall', axis=1)
    y = df['Overall']

    # Number of features to select
    num_feats = X.shape[1]

    return X, y, num_feats

In [None]:
def autoFeatureSelector(dataset_path, methods=None):
    # Avoid a mutable default argument; None means no methods were requested
    methods = methods or []

    # Preprocessing
    X, y, num_feats = preprocess_dataset(dataset_path)

    # Initialize an empty DataFrame to store feature importance
    feature_selection_df = pd.DataFrame(X.columns, columns=['Feature'])

    # Run every method we outlined above from the methods list and collect returned best features from every method
    if 'pearson' in methods:
        cor_support, cor_feature = cor_selector(X, y,num_feats)
        feature_selection_df['Pearson'] = cor_support
    if 'chi-square' in methods:
        chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
        feature_selection_df['Chi-2'] = chi_support
    if 'rfe' in methods:
        rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
        feature_selection_df['RFE'] = rfe_support
    if 'log-reg' in methods:
        embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
        feature_selection_df['Logistics'] = embedded_lr_support
    if 'rf' in methods:
        embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
        feature_selection_df['Random Forest'] = embedded_rf_support
    if 'lgbm' in methods:
        embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
        feature_selection_df['LightGBM'] = embedded_lgbm_support

    # Count the selected times for each feature
    feature_selection_df['Total'] = np.sum(feature_selection_df.iloc[:, 1:], axis=1)

    # Display the top 'num_feats' features
    best_features = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False).head(num_feats)['Feature'].tolist()

    return best_features

In [None]:
best_features = autoFeatureSelector(dataset_path="fifa19.csv", methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])
best_features

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Streaming output truncated to the last 5000 lines.


['StandingTackle',
 'Reactions',
 'Potential',
 'Marking',
 'Interceptions',
 'Composure',
 'BallControl',
 'SlidingTackle',
 'ShotPower',
 'ShortPassing',
 'Positioning',
 'HeadingAccuracy',
 'Dribbling',
 'Crossing',
 'Age',
 'Volleys',
 'Vision',
 'Strength',
 'Stamina',
 'SprintSpeed',
 'Penalties',
 'LongShots',
 'LongPassing',
 'Jumping',
 'GKReflexes',
 'GKPositioning',
 'GKKicking',
 'GKHandling',
 'GKDiving',
 'Finishing',
 'FKAccuracy',
 'Curve',
 'Balance',
 'Agility',
 'Aggression',
 'Acceleration']
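
The list above contains all 36 columns left after preprocessing, because `preprocess_dataset` sets `num_feats = X.shape[1]` and so lets each selector keep up to every feature. One way to make the vote discriminate is to pass `num_feats` in explicitly and drive the selectors from a dictionary. The sketch below assumes selectors with the same `(X, y, num_feats) -> (support, names)` shape as the functions above; `vote`, `top_variance`, and `top_mean_gap` are hypothetical stand-ins, not the notebook's selectors.

```python
import numpy as np
import pandas as pd

def top_variance(X, y, num_feats):
    # Stand-in selector: keep the columns with the largest variance
    keep = set(X.var().nlargest(num_feats).index)
    return [col in keep for col in X.columns], [col for col in X.columns if col in keep]

def top_mean_gap(X, y, num_feats):
    # Stand-in selector: keep the columns whose class means differ the most
    gap = (X[y].mean() - X[~y].mean()).abs()
    keep = set(gap.nlargest(num_feats).index)
    return [col in keep for col in X.columns], [col for col in X.columns if col in keep]

def vote(X, y, selectors, num_feats):
    """Run each selector with an explicit num_feats and rank features by votes."""
    table = pd.DataFrame({"Feature": X.columns})
    for name, selector in selectors.items():
        support, _ = selector(X, y, num_feats)
        table[name] = np.asarray(support, dtype=bool)
    table["Total"] = table[list(selectors)].sum(axis=1)
    ranked = table.sort_values(["Total", "Feature"], ascending=[False, True])
    return ranked.head(num_feats)["Feature"].tolist()

demo_selectors = {"variance": top_variance, "mean-gap": top_mean_gap}
toy_X = pd.DataFrame({"a": [0, 0, 10, 10], "b": [1, 2, 1, 2], "c": [5, 5, 5, 5]})
toy_y = pd.Series([False, False, True, True])
demo_best = vote(toy_X, toy_y, demo_selectors, num_feats=1)
```

With `num_feats=1` both stand-in selectors agree on column `a`, so it is the only feature returned.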

### Last, can you turn this notebook into a Python script, run it, and submit the Python (.py) file that takes a dataset and a list of methods as inputs and outputs the best features?

In [None]:
pip install ipynb-py-convert

Collecting ipynb-py-convert
  Downloading ipynb-py-convert-0.4.6.tar.gz (3.9 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ipynb-py-convert
  Building wheel for ipynb-py-convert (setup.py) ... done
  Created wheel for ipynb-py-convert: filename=ipynb_py_convert-0.4.6-py3-none-any.whl size=4623 sha256=6373cb316db7a5a8d89fcc96e025dad9bd13b1377412731820fe51c41abe28c4
  Stored in directory: /root/.cache/pip/wheels/69/a2/b7/2816fda86a647adbe8c4e7b7f4cc72cdc37720ad11b1611af5
Successfully built ipynb-py-convert
Installing collected packages: ipynb-py-convert
Successfully installed ipynb-py-convert-0.4.6
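
Once the notebook is converted, the script still needs a way to receive the dataset path and the list of methods. A minimal sketch using `argparse` from the standard library; the function names and the `--methods` flag are assumptions, and the method keys are the ones `autoFeatureSelector` checks for.

```python
import argparse

# Keys recognised by autoFeatureSelector in the notebook
METHOD_CHOICES = ["pearson", "chi-square", "rfe", "log-reg", "rf", "lgbm"]

def build_parser():
    parser = argparse.ArgumentParser(
        description="Vote-based feature selection over several methods."
    )
    parser.add_argument("dataset_path", help="Path to the CSV dataset")
    parser.add_argument(
        "--methods",
        nargs="+",
        choices=METHOD_CHOICES,
        default=METHOD_CHOICES,
        help="Feature selection methods to run (default: all)",
    )
    return parser

def main(argv=None):
    # argv=None makes argparse read sys.argv; a list can be passed for testing
    args = build_parser().parse_args(argv)
    # In the converted script, autoFeatureSelector(args.dataset_path, args.methods)
    # would be called here; this sketch only returns the parsed inputs.
    return args.dataset_path, args.methods

parsed_path, parsed_methods = main(["fifa19.csv", "--methods", "rfe", "rf"])
```

Assuming the script is saved under a hypothetical name such as `auto_feature_selector.py` and calls `main()` under an `if __name__ == "__main__":` guard, it could be run as `python auto_feature_selector.py fifa19.csv --methods rfe rf`.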
