## Boruta - feature selection method

1. Take original input dataset
2. For every predictor, create a shadow feature: a copy of the column with its values randomly shuffled
3. Fit a tree ensemble estimator (Random Forest is the most common) that is allowed to pick from original **and** shadow features
4. Compare the importance of each original feature with that of the shadow features; an original feature that does not beat its shuffled counterparts provides no significant predictive power and is dropped
5. End up with only those features that perform stronger than their randomly shuffled counterparts
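
The steps above can be sketched as a single Boruta iteration. This is a simplified illustration, not BorutaPy's implementation: the real algorithm repeats the comparison many times and applies a statistical test to the accumulated hits before confirming or rejecting a feature. The helper name `boruta_iteration` and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def boruta_iteration(X, y, perc=100, seed=0):
    """Run one shadow-feature comparison and return a boolean 'hit' mask.

    Illustrative helper (not part of BorutaPy): a feature scores a hit when
    its importance exceeds the `perc`-th percentile of shadow importances.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]

    # Shadow features: each column copied and shuffled independently,
    # which breaks any relationship with the target.
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(n_features)]
    )

    # The forest sees real and shadow features side by side.
    forest = RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=seed
    )
    forest.fit(np.hstack([X, shadows]), y)

    importances = forest.feature_importances_
    real, shadow = importances[:n_features], importances[n_features:]

    # perc=100 means "beat the best shadow feature"; lower values are more lenient.
    threshold = np.percentile(shadow, perc)
    return real > threshold


# Synthetic example: only the first column carries signal.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

hits = boruta_iteration(X_demo, y_demo)
print(hits)
```

Lowering `perc` below 100, as the notebook does later with `perc = 75`, makes the threshold easier to beat, so more features survive.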

<img src="https://miro.medium.com/max/1130/0*bf8a63w2zfrCGJoN" width="700"/>

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

The data is scraped from ESPN's NFL database and includes seasonal team stats from all phases of the game.

<img src="https://upload.wikimedia.org/wikipedia/en/thumb/a/a2/National_Football_League_logo.svg/1200px-National_Football_League_logo.svg.png" width="150"/>

In [24]:
data = pd.read_csv('../../scraping_projects/nfl_model/data/scraped_for_modeling_labeled.csv')
data.fillna(0, inplace = True)
data.head(3)

| | Team | offense_points_per_game | season | games_played | offense_downs_Third Downs_PCT | offense_downs_Fourth Downs_PCT | offense_passing_CMP% | offense_passing_AVG | offense_passing_YDS/G | offense_passing_RTG | ... | offense_receiving_FUM_per_game | offense_rushing_FUM_per_game | defense_receiving_FUM_per_game | defense_rushing_FUM_per_game | offense_receiving_LST_FUM_ratio | offense_rushing_LST_FUM_ratio | defense_receiving_LST_FUM_ratio | defense_rushing_LST_FUM_ratio | winner | played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kansas City Chiefs | 30.2 | 2004 | 16 | 47.2 | 28.6 | 66.0 | 8.3 | 275.4 | 94.9 | ... | 0.125 | 0.4375 | 0.4375 | 0.5625 | 1.0 | 0.571429 | 0.428571 | 0.444444 | 0 | 0 |
| 1 | Indianapolis Colts | 32.6 | 2004 | 16 | 42.7 | 57.1 | 67.0 | 9.0 | 288.9 | 119.7 | ... | 0.25 | 0.5625 | 0.375 | 0.625 | 0.75 | 0.444444 | 0.666667 | 0.4 | 0 | 0 |
| 2 | Green Bay Packers | 26.5 | 2004 | 16 | 47.3 | 57.1 | 63.9 | 7.6 | 278.1 | 93.9 | ... | 0.3125 | 0.5625 | 0.1875 | 0.3125 | 0.6 | 0.666667 | 0.666667 | 0.4 | 0 | 0 |


In [25]:
data.shape

(544, 68)

### Problem: too many features for the number of data points (especially after a train-test split)

Which features to use?
- Correlation
- Predictive Power Scores
- Industry knowledge
- Use all of them in a tree ensemble model and let the trees decide

**A better solution is the Boruta selection method**

In [73]:
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [27]:
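# Drop identifiers and label columns; pass axis by keyword (the positional form was removed in pandas 2.0)
X = data.drop(['winner', 'played', 'games_played', 'season', 'Team'], axis = 1).copy()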
y = data['played'].copy()

Random Forest classifier as estimator

In [115]:
rf = RandomForestClassifier(max_depth = 5, max_features = 1/3, n_jobs = -1)

boruta = BorutaPy(estimator = rf, 
                  n_estimators = 250, 
                  perc = 75, 
                  max_iter = 200, 
                  random_state = 56)

In [116]:
%%time

boruta.fit(np.array(X), np.array(y))

Wall time: 2min 43s


BorutaPy(estimator=RandomForestClassifier(max_depth=5,
                                          max_features=0.3333333333333333,
                                          n_estimators=250, n_jobs=-1,
                                          random_state=RandomState(MT19937) at 0x2B003EE8740),
         max_iter=200, n_estimators=250, perc=75,
         random_state=RandomState(MT19937) at 0x2B003EE8740)

Check supported features

In [117]:
boruta_selection = X.columns[boruta.support_].to_list()

print('Number of selected features', boruta.n_features_, 'out of', X.shape[1])
boruta.n_features_ == len(boruta_selection)

Number of selected features 15 out of 63


True

In [118]:
print('Boruta suggests keeping:\n\n', '\n'.join(boruta_selection))

Boruta suggests keeping:

 offense_points_per_game
offense_downs_Fourth Downs_PCT
offense_passing_AVG
offense_passing_YDS/G
offense_passing_RTG
offense_receiving_AVG
defense_points_per_game
defense_passing_AVG
defense_passing_YDS/G
defense_receiving_AVG
defense_passing_SYL_per_game
defense_downs_First Downs_penalty_ratio
defense_rushing_TD_per_game
defense_pass_TD_per_rush_TD
offense_pass_TD_to_INT


Models with only these 15 features perform close to models that use all 63: as the cells below show, training AUC drops only from 0.952 to 0.930. Reducing from 63 to 15 features costs little performance and substantially reduces complexity, which is always positive!
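
AUC measured on the same rows used for fitting tends to favour the larger feature set, so the comparison is more convincing on held-out data. Below is a minimal sketch of a cross-validated comparison. It assumes synthetic data from `make_classification` and a hypothetical subset `selected_idx` standing in for the Boruta selection; `cv_auc` is an illustrative helper, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def cv_auc(X, y, folds=5):
    """Mean ROC AUC of a logistic regression over stratified CV folds."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=folds, scoring="roc_auc").mean()


# Synthetic stand-in for the NFL data: 63 features, few of them informative.
X_syn, y_syn = make_classification(
    n_samples=544, n_features=63, n_informative=8, n_redundant=0,
    shuffle=False, random_state=0,
)

# Hypothetical selection: with shuffle=False the informative columns come first.
selected_idx = np.arange(8)

auc_all = cv_auc(X_syn, y_syn)
auc_selected = cv_auc(X_syn[:, selected_idx], y_syn)
print(f"CV AUC, all features:      {auc_all:.3f}")
print(f"CV AUC, selected features: {auc_selected:.3f}")
```

Selecting features on the full dataset and then cross-validating still leaks some information; running the selection inside each fold avoids that.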

In [124]:
logreg = LogisticRegression().fit(X, y)
print('AUC with ALL features:', roc_auc_score(y, logreg.predict_proba(X)[:,1]).round(5))

AUC with ALL features: 0.95242


In [123]:
logreg = LogisticRegression().fit(X[boruta_selection], y)
print('AUC with Boruta selected features:', roc_auc_score(y, logreg.predict_proba(X[boruta_selection])[:,1]).round(5))

AUC with Boruta selected features: 0.9301


Check ranking (from best to worst features for prediction)

In [121]:
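# Pair each feature name with its Boruta rank; axis must be a keyword argument in pandas 2.0+
boruta_rankings = pd.concat([pd.Series(X.columns), pd.Series(boruta.ranking_)], axis = 1).rename(columns = {0 : 'feature', 1 : 'rank'})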
boruta_rankings.sort_values('rank', inplace = True)
boruta_rankings.reset_index(drop = True, inplace = True)

In [122]:
boruta_rankings.head(30)

| | feature | rank |
|---|---|---|
| 0 | offense_points_per_game | 1 |
| 1 | defense_passing_SYL_per_game | 1 |
| 2 | defense_receiving_AVG | 1 |
| 3 | defense_passing_AVG | 1 |
| 4 | offense_pass_TD_to_INT | 1 |
| 5 | defense_points_per_game | 1 |
| 6 | defense_rushing_TD_per_game | 1 |
| 7 | defense_passing_YDS/G | 1 |
| 8 | offense_passing_RTG | 1 |
| 9 | offense_receiving_AVG | 1 |
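
In BorutaPy's `ranking_`, confirmed features get rank 1 and tentative features rank 2; higher ranks belong to rejected features. The sketch below turns ranks into readable labels. It uses made-up feature names and ranks rather than the fitted model above, and `label_ranks` is a hypothetical helper.

```python
import pandas as pd


def label_ranks(features, ranks):
    """Build a ranking table with a status column derived from Boruta ranks."""
    table = pd.DataFrame({"feature": list(features), "rank": list(ranks)})
    # Rank 1 = confirmed, rank 2 = tentative, anything higher = rejected.
    table["status"] = table["rank"].map(
        lambda r: "confirmed" if r == 1 else "tentative" if r == 2 else "rejected"
    )
    # A stable sort keeps the original column order within each rank.
    return table.sort_values("rank", kind="stable").reset_index(drop=True)


# Made-up names and ranks, purely for illustration.
demo = label_ranks(["feat_a", "feat_b", "feat_c", "feat_d"], [3, 1, 2, 1])
print(demo)
```

With the fitted selector from this notebook, the call would presumably be `label_ranks(X.columns, boruta.ranking_)`.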
