# AutoFeatureSelector Tool

Author: Mohamed Oussama NAJI

Date: Jan 26, 2024

## Introduction

This notebook demonstrates the implementation of an Automatic Feature Selection tool using various feature selection methods. The tool is designed to select the best features from a real-world dataset, specifically the FIFA 19 Player Skills dataset. The feature selection methods used include Pearson Correlation, Chi-Square, Recursive Feature Elimination (RFE), Embedded (Logistic Regression), Tree-based (Random Forest), and Tree-based (Light GBM).


## Table of Contents
1. [Dataset Overview](#dataset-overview)
2. [Data Preprocessing](#data-preprocessing)
3. [Feature Selection Methods](#feature-selection-methods)
   - [Pearson Correlation](#pearson-correlation)
   - [Chi-Square](#chi-square)
   - [Recursive Feature Elimination (RFE)](#recursive-feature-elimination)
   - [Embedded (Logistic Regression)](#embedded-logistic-regression)
   - [Tree-based (Random Forest)](#tree-based-random-forest)
   - [Tree-based (Light GBM)](#tree-based-light-gbm)
4. [AutoFeatureSelector Tool](#autofeature-selector-tool)
5. [Conclusion](#conclusion)


## Dataset Overview <a id="dataset-overview"></a>

The FIFA 19 Player Skills dataset contains attributes of FIFA 2019 players, including age, nationality, overall rating, potential, club, value, wage, preferred foot, international reputation, weak foot, skill moves, work rate, position, and various skill ratings such as crossing, finishing, heading accuracy, short passing, volleys, dribbling, curve, free kick accuracy, long passing, ball control, acceleration, sprint speed, agility, reactions, balance, shot power, jumping, stamina, strength, long shots, aggression, interceptions, positioning, vision, penalties, composure, marking, standing tackle, sliding tackle, goalkeeper diving, goalkeeper handling, goalkeeper kicking, goalkeeper positioning, and goalkeeper reflexes.


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

player_df = pd.read_csv("fifa19.csv")

## Data Preprocessing <a id="data-preprocessing"></a>

In [None]:
numcols = ['Overall', 'Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys', 'FKAccuracy', 'Reactions', 'Balance', 'ShotPower', 'Strength', 'LongShots', 'Aggression', 'Interceptions']
catcols = ['Preferred Foot', 'Position', 'Body Type', 'Nationality', 'Weak Foot']

player_df = player_df[numcols + catcols]

traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])], axis=1)

# Remove 'Nationality' related columns
traindf = traindf[[col for col in traindf.columns if not col.startswith('Nationality_')]]

# Drop rows with missing values
traindf = traindf.dropna()

y = traindf['Overall'] >= 87
X = traindf.copy()
del X['Overall']

X.head()

len(X.columns)

feature_name = list(X.columns)
num_feats = 30
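
The preprocessing above can be illustrated on a tiny frame. This is a minimal sketch, assuming a hypothetical three-column table (`toy`) with made-up values rather than the real `fifa19.csv`; it shows one-hot encoding, dropping incomplete rows, and deriving the binary target from `Overall`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the FIFA 19 data (values are made up)
toy = pd.DataFrame({
    "Overall": [94, 70, 88, 65],
    "Finishing": [95, 40, 85, None],
    "Preferred Foot": ["Left", "Right", "Right", "Left"],
})

# Keep numeric columns as-is and one-hot encode the categorical column
encoded = pd.concat(
    [toy[["Overall", "Finishing"]], pd.get_dummies(toy[["Preferred Foot"]])],
    axis=1,
).dropna()  # the last row has a missing Finishing value and is dropped

# Binary target: is the player rated 87 or higher?
y_toy = encoded["Overall"] >= 87
X_toy = encoded.drop(columns="Overall")

print(X_toy.columns.tolist())  # ['Finishing', 'Preferred Foot_Left', 'Preferred Foot_Right']
print(y_toy.tolist())          # [True, False, True]
```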

## Feature Selection Methods <a id="feature-selection-methods"></a>

### Pearson Correlation <a id="pearson-correlation"></a>

In [None]:
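def cor_selector(X, y, num_feats):
    """Keep the num_feats columns with the largest absolute Pearson correlation to y."""
    feature_name = X.columns.tolist()

    # Correlation of each column with the target
    cor_list = [np.corrcoef(X[col], y)[0, 1] for col in feature_name]

    # Constant columns yield NaN; treat them as uncorrelated
    cor_list = [0 if np.isnan(cor) else cor for cor in cor_list]

    # argsort is ascending, so the last num_feats indices are the strongest
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    cor_support = [name in cor_feature for name in feature_name]

    return cor_support, cor_feature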

cor_support, cor_feature = cor_selector(X, y, num_feats)
print(str(len(cor_feature)), 'selected features')
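
To see how the correlation ranking behaves, here is a small sketch on synthetic data. The column names (`strong`, `weak`, `constant`) and the data are assumptions made for illustration; the ranking logic mirrors the selector above, including the replacement of NaN correlations by 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)

# Hypothetical features: one closely tied to the target, one noisy, one constant
toy_X = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=n),
    "weak": signal + 3.0 * rng.normal(size=n),
    "constant": np.ones(n),  # zero variance, so its correlation is NaN
})
toy_y = signal > 0

# Silence the divide-by-zero warning raised for the constant column
with np.errstate(invalid="ignore", divide="ignore"):
    cors = [np.corrcoef(toy_X[col], toy_y)[0, 1] for col in toy_X.columns]

# NaN correlations are treated as 0, as in cor_selector
cors = [0 if np.isnan(c) else c for c in cors]

# Rank columns from strongest to weakest absolute correlation
order = np.argsort(np.abs(cors))[::-1]
ranked = toy_X.columns[order].tolist()
print(ranked)
```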

### Chi-Square <a id="chi-square"></a>

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

def chi_squared_selector(X, y, num_feats):
    # chi2 requires non-negative inputs, so rescale every column to [0, 1]
    X_norm = MinMaxScaler().fit_transform(X)
    chi_selector = SelectKBest(chi2, k=num_feats)
    chi_selector.fit(X_norm, y)

    chi_support = chi_selector.get_support()
    chi_feature = X.loc[:, chi_support].columns.tolist()

    return chi_support, chi_feature

chi_support, chi_feature = chi_squared_selector(X, y, num_feats)
print(str(len(chi_feature)), 'selected features')
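
A minimal sketch of what `SelectKBest(chi2, ...)` does, assuming a synthetic two-column matrix: one column mirrors the label exactly and the other is uniform noise. With `k=1`, the selector is expected to keep only the informative column.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Balanced synthetic labels
y_toy = np.array([0] * 50 + [1] * 50)

# Column 0 mirrors the label; column 1 is non-negative noise
informative = y_toy.astype(float)
noise = rng.random(100)
X_toy = np.column_stack([informative, noise])

selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

print(selector.scores_)         # chi-square statistic per column
print(selector.get_support())   # boolean mask of kept columns
```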

### Recursive Feature Elimination (RFE) <a id="recursive-feature-elimination"></a>

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

def rfe_selector(X, y, num_feats):
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    logreg = LogisticRegression()
    rfe = RFE(estimator=logreg, n_features_to_select=num_feats)
    rfe.fit(X_scaled, y)

    rfe_support = rfe.get_support()
    rfe_feature = X.columns[rfe_support].tolist()

    return rfe_support, rfe_feature

rfe_support, rfe_feature = rfe_selector(X, y, num_feats)
print(str(len(rfe_feature)), 'selected features')
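
RFE repeatedly fits the estimator and discards the weakest feature until the requested number remains. The sketch below, on synthetic data from `make_classification` (an assumption, not the FIFA data), shows the `ranking_` attribute that records the elimination order: selected features get rank 1 and higher ranks were eliminated earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 8 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Remove one feature per iteration until 3 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X_toy, y_toy)

print(rfe.support_)   # True for the 3 surviving features
print(rfe.ranking_)   # 1 = selected, larger = eliminated earlier
```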

### Embedded (Logistic Regression) <a id="embedded-logistic-regression"></a>

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

def embedded_log_reg_selector(X, y, num_feats):
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    logreg = LogisticRegression(penalty='l1', solver='saga', max_iter=1000)

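    # max_features is an upper bound: with an L1 penalty, SelectFromModel keeps
    # only coefficients above 1e-5, so fewer than num_feats features may be returned
    embedded_lr_selector = SelectFromModel(logreg, max_features=num_feats)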
    embedded_lr_selector.fit(X_scaled, y)

    embedded_lr_support = embedded_lr_selector.get_support()
    embedded_lr_feature = X.loc[:, embedded_lr_support].columns.tolist()

    return embedded_lr_support, embedded_lr_feature

embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')
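
The embedded method relies on the L1 penalty driving some coefficients to exactly zero. As a rough illustration on synthetic data (the helper `count_nonzero_coefs` and the values of `C` are assumptions chosen for the example), a stronger penalty, meaning a smaller `C`, should leave no more non-zero coefficients than a weaker one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 10 features, only 2 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=2, n_redundant=0, random_state=0
)
X_toy = MinMaxScaler().fit_transform(X_toy)

def count_nonzero_coefs(C):
    """Fit an L1-penalized model and count its non-zero coefficients (hypothetical helper)."""
    model = LogisticRegression(
        penalty="l1", solver="saga", C=C, max_iter=5000, random_state=0
    )
    model.fit(X_toy, y_toy)
    return int(np.count_nonzero(model.coef_))

# Smaller C means a stronger penalty and a sparser model
nonzero_strong_penalty = count_nonzero_coefs(0.01)
nonzero_weak_penalty = count_nonzero_coefs(10.0)
print(nonzero_strong_penalty, nonzero_weak_penalty)
```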

### Tree-based (Random Forest) <a id="tree-based-random-forest"></a>

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

def embedded_rf_selector(X, y, num_feats):
    rf = RandomForestClassifier(n_estimators=100)

    # max_features is an upper bound: the default threshold keeps only features
    # whose importance is at least the mean, so fewer than num_feats may be returned
    selector = SelectFromModel(rf, max_features=num_feats)
    selector.fit(X, y)

    embedded_rf_support = selector.get_support()
    embedded_rf_feature = X.loc[:, embedded_rf_support].columns.tolist()

    return embedded_rf_support, embedded_rf_feature

embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embedded_rf_feature)), 'selected features')
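
Note that `SelectFromModel` treats `max_features` as a cap, not a target: by default it also applies a mean-importance threshold for tree models. The sketch below, on synthetic data, contrasts the default behaviour with `threshold=-np.inf`, which scikit-learn documents as the way to select purely by `max_features`. Whether an exact top-k selection is preferable for this notebook is a design choice left open here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, only 2 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=2, n_redundant=0, random_state=0
)
rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Default threshold ("mean" importance) combined with a cap of 6 features
capped = SelectFromModel(rf, max_features=6).fit(X_toy, y_toy)

# Disabling the threshold makes the selector return exactly the top 6 features
exact = SelectFromModel(rf, max_features=6, threshold=-np.inf).fit(X_toy, y_toy)

n_capped = int(capped.get_support().sum())
n_exact = int(exact.get_support().sum())
print(n_capped, n_exact)
```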

### Tree-based (Light GBM) <a id="tree-based-light-gbm"></a>

In [None]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

def embedded_lgbm_selector(X, y, num_feats):
    lgbm = LGBMClassifier(n_estimators=100)

    # max_features is an upper bound: the default threshold keeps only features
    # whose importance is at least the mean, so fewer than num_feats may be returned
    selector = SelectFromModel(lgbm, max_features=num_feats)
    selector.fit(X, y)

    embedded_lgbm_support = selector.get_support()
    embedded_lgbm_feature = X.loc[:, embedded_lgbm_support].columns.tolist()

    return embedded_lgbm_support, embedded_lgbm_feature

embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')

## AutoFeatureSelector Tool <a id="autofeature-selector-tool"></a>

In [None]:
pd.set_option('display.max_rows', None)

feature_selection_df = pd.DataFrame({'Feature': feature_name, 'Pearson': cor_support, 'Chi-2': chi_support, 'RFE': rfe_support, 'Logistics': embedded_lr_support,
                                     'Random Forest': embedded_rf_support, 'LightGBM': embedded_lgbm_support})

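# Count votes across the method columns only; the 'Feature' column holds strings
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)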

feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df) + 1)
feature_selection_df.head(num_feats)
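
The voting step can be checked on a small hand-made table. The feature names are borrowed from the dataset, but the boolean votes below are hypothetical values chosen for illustration, not results from the selectors.

```python
import pandas as pd

# Hypothetical support masks from three methods for three features
votes = pd.DataFrame({
    "Feature": ["Reactions", "Volleys", "Balance"],
    "Pearson": [True, True, False],
    "Chi-2": [True, False, False],
    "RFE": [True, True, False],
})

# Sum the boolean method columns row by row to get a vote count
method_cols = votes.columns.drop("Feature")
votes["Total"] = votes[method_cols].sum(axis=1)

# Most-voted features first
votes = votes.sort_values(["Total", "Feature"], ascending=False)
print(votes[["Feature", "Total"]])
```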

In [None]:
def preprocess_dataset(dataset_path):
    player_df = pd.read_csv(dataset_path)

    # Drop unwanted columns related to nationality
    player_df.drop(player_df.filter(regex='(?i)nationality').columns, axis=1, inplace=True)

    numcols = ['Overall', 'Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl',
               'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys', 'FKAccuracy', 'Reactions',
               'Balance', 'ShotPower', 'Strength', 'LongShots', 'Aggression', 'Interceptions']

    catcols = ['Preferred Foot', 'Position', 'Body Type', 'Weak Foot']

    player_df = player_df[numcols + catcols]

    traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])], axis=1)
    features = traindf.columns
    traindf = traindf.dropna()

    X = traindf.copy()
    y = X['Overall'] >= 87
    del X['Overall']

    num_feats = 30

    return X, y, num_feats

def autoFeatureSelector(dataset_path, methods=None):
    # Avoid a mutable default argument; None means no methods were requested
    methods = methods or []
    feature_selection_dict = {}

    X, y, num_feats = preprocess_dataset(dataset_path)

    if 'pearson' in methods:
        feature_selection_dict['pearson'] = cor_selector(X, y, num_feats)

    if 'chi-square' in methods:
        feature_selection_dict['chi-square'] = chi_squared_selector(X, y, num_feats)

    if 'rfe' in methods:
        feature_selection_dict['rfe'] = rfe_selector(X, y, num_feats)

    if 'log-reg' in methods:
        feature_selection_dict['log-reg'] = embedded_log_reg_selector(X, y, num_feats)

    if 'rf' in methods:
        feature_selection_dict['rf'] = embedded_rf_selector(X, y, num_feats)

    if 'lgbm' in methods:
        feature_selection_dict['lgbm'] = embedded_lgbm_selector(X, y, num_feats)

    def debug_feature_lengths(X, feature_selection_dict):
        print("Number of features in the preprocessed dataset:", X.shape[1])
        for method, (support, selected) in feature_selection_dict.items():
            print(f"Number of selected features by {method}:", len(selected))
            # Each support mask must have one entry per column of X
            if len(support) != X.shape[1]:
                print(f"Warning: Mismatch in feature length for {method}")

    debug_feature_lengths(X, feature_selection_dict)

    results_dict = feature_selection_dict

    pd.set_option('display.max_rows', None)

    # Use the columns of the dataset just preprocessed, not a notebook-level global
    display_df = pd.DataFrame({'Feature': X.columns.tolist()})

    for method, (support, _) in results_dict.items():
        display_df[method] = support

    display_df['Total'] = display_df.iloc[:, 1:].sum(axis=1)

    display_df = display_df.sort_values(['Total', 'Feature'], ascending=False)
    display_df.index = range(1, len(display_df) + 1)
    display(display_df.head(num_feats))
    best_features = feature_selection_dict
    return best_features

best_features = autoFeatureSelector(dataset_path="fifa19.csv", methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])
best_features
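
The returned dictionary maps each method name to a `(support, feature_names)` pair, so it can be post-processed into a consensus list. The helper below, `consensus_features`, is a hypothetical addition and not part of the tool; the toy dictionary uses made-up selections in the same shape as the real return value.

```python
from collections import Counter

def consensus_features(results, min_votes):
    """Return features chosen by at least min_votes methods (hypothetical helper)."""
    counts = Counter()
    for _support, names in results.values():
        counts.update(names)
    return sorted(name for name, n in counts.items() if n >= min_votes)

# Made-up results shaped like the output of autoFeatureSelector
toy_results = {
    "pearson": ([True, True, False], ["Reactions", "Volleys"]),
    "rfe": ([True, False, True], ["Reactions", "Balance"]),
}

print(consensus_features(toy_results, min_votes=2))  # ['Reactions']
print(consensus_features(toy_results, min_votes=1))  # ['Balance', 'Reactions', 'Volleys']
```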

## Conclusion <a id="conclusion"></a>

In this notebook, we developed an Automatic Feature Selection tool that employs various feature selection methods to select the best features from the FIFA 19 Player Skills dataset. The tool combines the results of different feature selection methods, including Pearson Correlation, Chi-Square, Recursive Feature Elimination (RFE), Embedded (Logistic Regression), Tree-based (Random Forest), and Tree-based (Light GBM).

The feature selection process began with data preprocessing: nationality columns were removed, the remaining categorical features were one-hot encoded, and rows with missing values were dropped. The target variable is binary and indicates whether a player's 'Overall' rating is at least 87.

The autoFeatureSelector function takes the dataset path and a list of desired feature selection methods as input. It applies each requested method to the preprocessed dataset, displays a consolidated DataFrame showing which methods selected each feature together with a vote total, and returns a dictionary that maps each method name to its support mask and list of selected features.

The tool lets the caller choose which feature selection methods to run. The number of features to select is currently fixed at 30 inside preprocess_dataset. The consolidated vote count helps identify the most relevant features for the given dataset and can be used as a starting point for further analysis or modeling tasks.

Overall, the Automatic Feature Selection tool simplifies the process of feature selection by automating the application of multiple methods and providing a consolidated view of the selected features. It can be a valuable addition to the data scientist's toolkit for efficient and effective feature selection.