# __Automatic Feature Selection with Boruta__

Let's use Boruta to learn about automatic feature selection.

## Step 1: Import Required Libraries

- Import pandas, NumPy, RandomForestClassifier, BorutaPy, train_test_split, and accuracy_score


In [2]:
# !pip install boruta

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Step 2: Load and Prepare the Dataset

- Load the dataset using pandas
- Split the dataset into features and target variables
- Split the data into training and testing sets


In [4]:
URL = "https://raw.githubusercontent.com/Aditya1001001/English-Premier-League/master/pos_modelling_data.csv"
data = pd.read_csv(URL)
data.info()
X = data.drop('Position', axis = 1)
y = data['Position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 1) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1793 entries, 0 to 1792
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Position             1793 non-null   object 
 1   Clean sheets         1793 non-null   float64
 2   Goals conceded       1793 non-null   float64
 3   Tackles              1793 non-null   float64
 4   Tackle success %     1793 non-null   int64  
 5   Blocked shots        1793 non-null   float64
 6   Interceptions        1793 non-null   float64
 7   Clearances           1793 non-null   float64
 8   Recoveries           1793 non-null   float64
 9   Successful 50/50s    1793 non-null   float64
 10  Own goals            1793 non-null   float64
 11  Assists              1793 non-null   int64  
 12  Passes               1793 non-null   int64  
 13  Passes per match     1793 non-null   float64
 14  Big chances created  1793 non-null   float64
 15  Crosses              1793 non-null   f

__Observations__
- As shown above, the dataset has 35 columns and 1793 observations.
- There are no missing values: every column listed has 1793 non-null entries.

Now, let's split the data into X and y columns.

In [5]:
X = data.drop('Position', axis = 1)
y = data['Position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
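
As a quick check on the split sizes, the arithmetic below works out how many rows land in each set. It is a sketch that assumes scikit-learn rounds a fractional `test_size` up to the next whole sample; the variable names are illustrative.

```python
import math

n_samples = 1793   # rows reported by data.info()
test_size = 0.2    # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption, the test set is small enough that a single prediction moves the accuracy by roughly 0.3 percentage points.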

## Step 4: Train a RandomForest Classifier

- Train a RandomForest classifier using all features
- Calculate the accuracy score on the test set


In [6]:
rf_all_features = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
rf_all_features.fit(X_train, y_train) 

accuracy_score(y_test, rf_all_features.predict(X_test))

0.7298050139275766

__Observation__
- The model's accuracy on the test set is about 73.0%.

## Step 5: Perform Feature Selection Using Boruta

- Train a RandomForest classifier for Boruta feature selection
- Perform feature selection using Boruta
- Display the ranking and the number of significant features


In [9]:
# Workaround: older boruta releases use the np.int, np.float, and np.bool aliases removed in NumPy 1.24.
np.int, np.float, np.bool = int, float, bool
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)
boruta_selector.fit(np.array(X_train), np.array(y_train))

__Note__
- Without the workaround, this cell fails on NumPy 1.24 or later with `AttributeError: module 'numpy' has no attribute 'int'`. `np.int` was a deprecated alias for the builtin `int` (deprecated in NumPy 1.20) and has since been removed.
- For details, see the NumPy 1.20 release notes: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

__Observations__
- With verbose=2, Boruta prints a progress line for each iteration, up to the default maximum of 100 iterations (max_iter).
- Each line reports how many features are confirmed, tentative, and rejected so far; the final ranking is built from these decisions.
- BorutaPy has no n_features_to_select parameter; how strictly features are selected is controlled through its perc and alpha parameters.
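
Boruta reaches its confirmed, tentative, and rejected decisions by comparing each real feature against shuffled "shadow" copies of the features. The sketch below is a simplified illustration of that idea, not BorutaPy's implementation: it uses toy data, swaps in absolute correlation for random-forest importance, and skips the statistical test that Boruta applies to the hit counts.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_iter = 500, 20

# Toy data (hypothetical): column 0 tracks the target, columns 1-2 are pure noise.
y = rng.integers(0, 2, size=n_samples)
X = np.column_stack([
    y + rng.normal(scale=0.5, size=n_samples),
    rng.normal(size=n_samples),
    rng.normal(size=n_samples),
])

def importance(features, target):
    """Stand-in importance: absolute correlation of each column with the target."""
    return np.array([abs(np.corrcoef(col, target)[0, 1]) for col in features.T])

hits = np.zeros(X.shape[1], dtype=int)
for _ in range(n_iter):
    # Shadow features: each real column shuffled independently, breaking its link to y.
    shadows = np.column_stack([rng.permutation(col) for col in X.T])
    scores = importance(np.hstack([X, shadows]), y)
    real, shadow = scores[:X.shape[1]], scores[X.shape[1]:]
    # A real feature scores a "hit" when it beats the best shadow feature.
    hits += real > shadow.max()

print(hits)
```

A feature that beats the best shadow in nearly every iteration is a candidate for confirmation, while one that rarely does is a candidate for rejection.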


In [None]:
print("Ranking: ", boruta_selector.ranking_)
print("No. of significant features: ", boruta_selector.n_features_) 

Ranking:  [1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 2 1 1 1 1 1 1]
No. of significant features:  31


__Observation__

- Out of 34 features (the 35 columns minus the target), 31 are confirmed as significant and have a ranking of 1.
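
The counts can be checked directly from the printed ranking. The snippet below copies the values from the output above into a plain list (the variable names are illustrative) and tallies them.

```python
from collections import Counter

# Ranking values copied from the output above.
ranking = [1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]

counts = Counter(ranking)
not_confirmed = [i for i, r in enumerate(ranking) if r != 1]

print(len(ranking))     # total number of features ranked
print(counts[1])        # features with rank 1
print(not_confirmed)    # column positions of the remaining features
```

The positions in `not_confirmed` index into `X_train.columns`, so they can be mapped back to feature names.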

## Step 6: Display the Ranking of Features

- Create a DataFrame with the feature ranking
- Sort the DataFrame based on the ranking


In [None]:
selected_rf_features = pd.DataFrame({'Feature':list(X_train.columns),'Ranking':boruta_selector.ranking_})
selected_rf_features.sort_values(by='Ranking') 

| | Feature | Ranking |
|---|---|---|
| 0 | Clean sheets | 1 |
| 31 | Arial Saves | 1 |
| 30 | overall | 1 |
| 29 | value_eur | 1 |
| 28 | age | 1 |
| 26 | Saves | 1 |
| 25 | Shooting accuracy % | 1 |
| 24 | Shots | 1 |
| 23 | Goals per match | 1 |
| 22 | Goals | 1 |


__Observation__
- Sorting by rank lists the confirmed features (rank 1) first; any feature with a rank above 1 was not confirmed by Boruta.

## Step 7: Train a RandomForest Classifier Using the Selected Features

- Transform the training and testing sets using the Boruta selector
- Train a RandomForest classifier using the selected features
- Calculate the accuracy score on the test set

In [None]:
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test)) 
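
`transform` keeps only the columns that Boruta confirmed. Assuming the default `weak=False`, this should be equivalent to indexing the array with the selector's boolean `support_` mask. The sketch below imitates that with the ranking values printed earlier and a hypothetical placeholder matrix, so it runs without a fitted selector.

```python
import numpy as np

# Ranking values copied from the earlier output; rank 1 marks a confirmed feature.
ranking = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1])
support = ranking == 1                  # plays the role of boruta_selector.support_

X_demo = np.zeros((5, ranking.size))    # hypothetical stand-in for np.array(X_train)
X_selected = X_demo[:, support]         # keep only the confirmed columns

print(X_demo.shape, X_selected.shape)
```

With a fitted selector, `X_train.columns[boruta_selector.support_]` should give the names of the retained features.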

- Using the selected features, let's fit the RandomForestClassifier model.

In [None]:
rf_boruta = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
rf_boruta.fit(X_important_train, y_train) 

In [None]:
accuracy_score(y_test, rf_boruta.predict(X_important_test))

0.7325905292479109

__Observation__

- As seen above, the accuracy is essentially unchanged (about 73.3% versus 73.0%) even though Boruta eliminated 3 of the 34 features.
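
To put the two scores in perspective, the arithmetic below converts them back into counts of correct predictions. It assumes the test set holds 359 rows (20% of 1793, rounded up); the variable names are illustrative.

```python
import math

n_test = 359  # assumed test-set size: 20% of 1793 rows, rounded up

acc_all = 0.7298050139275766      # accuracy with all features
acc_boruta = 0.7325905292479109   # accuracy with Boruta-selected features

correct_all = round(acc_all * n_test)
correct_boruta = round(acc_boruta * n_test)

# Sanity check: the counts reproduce the reported accuracies.
assert math.isclose(correct_all / n_test, acc_all)
assert math.isclose(correct_boruta / n_test, acc_boruta)

print(correct_all, correct_boruta, correct_boruta - correct_all)
```

Under that assumption the two models differ by a single test prediction, so the scores are best read as equivalent rather than as an improvement.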