# Feature Selection and Balancing

With such a wide set of data, it's important for us to determine which features are truly important. In the notebook below, we use Boruta to analyze our features further.

#### 1. [Installation and Importing of Libraries](#eda_import)
#### 2. [Data Retrieval](#eda_retrieval)
#### 3. [Train / Test Split](#eda_traintest)
#### 4. [Write a CSV of cleaned data](#eda_csv) (TO-DO: dependent on outcome variable)

### <a name="eda_import"></a>Installation and Importing of Libraries
In order to both explore and visualize the data, it's necessary for us to load various libraries. The first cell installs packages necessary for Boruta.

In [64]:
!pip install boruta
!pip install cmake
!pip install scikit-learn
!pip install imbalanced-learn



In [65]:
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE  # used for resampling further below
from boruta import BorutaPy

### <a name="eda_retrieval"></a>Data Retrieval

Below is the code to retrieve the cleaned CSV from our merged_data folder.

In [66]:
# Set file path
file_path = "/dsa/groups/casestudy2023su/team03/merged_data/mros_merged_clean.csv"

# Load dataframe from CSV
merged_df = pd.read_csv(file_path)

In [67]:
merged_df["GIEDUC"] = merged_df["GIEDUC"].str.extract(r'^(\d+)', expand=False).astype(int)
merged_df["GIERACE"] = merged_df["GIERACE"].str.extract(r'^(\d+)', expand=False).astype(int)
merged_df.describe()

| Statistic | B1TRD | B1ITD | B1FND | B1L1D | B1L3D | B1TBD | B1HDD | B1LAD | B1RAD | B1LRD | ... | HEIGHTCHANGEFROM25 | WEIGHTCHANGEFROM25 | FAFXN | FAFXNT | GIEDUC | GIERACE | GIAGE1 | FAFXN_BIN | FAFXNT_BIN | outlier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | ... | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 |
| mean | 0.76499 | 1.111712 | 0.78417 | 0.979764 | 1.098752 | 1.167219 | 2.132447 | 0.848691 | 0.861511 | 0.706571 | ... | 3.705292 | 10.339891 | 0.432432 | 0.343577 | 5.799132 | 1.214882 | 73.657658 | 0.25342 | 0.214715 | 0.8999 |
| std | 0.127367 | 0.166538 | 0.128116 | 0.176903 | 0.20732 | 0.126578 | 0.332499 | 0.091894 | 0.092162 | 0.135757 | ... | 2.96009 | 11.375341 | 1.002012 | 0.813297 | 1.646366 | 0.707665 | 5.872264 | 0.435006 | 0.410659 | 0.436133 |
| min | 0.167047 | 0.389357 | 0.272729 | 0.298691 | 0.382238 | 0.764788 | 1.05692 | 0.515547 | 0.515601 | 0.443625 | ... | -30.18 | -42.3937 | 0.0 | 0.0 | 1.0 | 1.0 | 64.0 | 0.0 | 0.0 | -1.0 |
| 25% | 0.677239 | 0.998989 | 0.696196 | 0.859821 | 0.955975 | 1.083122 | 1.905628 | 0.791598 | 0.804045 | 0.636937 | ... | 1.97 | 2.994512 | 0.0 | 0.0 | 5.0 | 1.0 | 69.0 | 0.0 | 0.0 | 1.0 |
| 50% | 0.758539 | 1.104805 | 0.773544 | 0.96803 | 1.08187 | 1.15755 | 2.11395 | 0.843987 | 0.856622 | 0.691454 | ... | 3.59 | 9.36535 | 0.0 | 0.0 | 6.0 | 1.0 | 73.0 | 0.0 | 0.0 | 1.0 |
| 75% | 0.847302 | 1.221518 | 0.860129 | 1.086195 | 1.219765 | 1.241125 | 2.339317 | 0.898742 | 0.913262 | 0.751444 | ... | 5.24 | 17.1256 | 1.0 | 0.0 | 7.0 | 1.0 | 78.0 | 1.0 | 0.0 | 1.0 |
| max | 1.69903 | 1.98445 | 1.59835 | 1.97685 | 2.24568 | 2.04658 | 4.6952 | 1.94582 | 2.13936 | 4.66774 | ... | 30.87 | 69.2577 | 12.0 | 12.0 | 8.0 | 5.0 | 100.0 | 1.0 | 1.0 | 1.0 |


### <a name="eda_traintest"></a>Train/Test Data Split

It is important to split the data into training, validation, and test sets before resampling, so that synthetic samples derived from validation or test rows never leak into the training set.

In [68]:
# Split into X and y: drop all four target columns from X and use one target at a time for y.
# get_dummies one-hot encodes any remaining categorical features.
X = pd.get_dummies(merged_df.drop(['FAFXN', 'FAFXNT', 'FAFXN_BIN', 'FAFXNT_BIN'], axis=1)) 
y = merged_df['FAFXNT_BIN'].astype(int)

In [69]:
# Display the count of each class in the target prior to rebalancing
print(y.value_counts())

0    4707
1    1287
Name: FAFXNT_BIN, dtype: int64
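
To make the degree of imbalance explicit, the counts printed above can be turned into a minority fraction and a majority-to-minority ratio. This is only a small arithmetic sketch: the two counts are copied from the output above, and the variable names are illustrative rather than part of the notebook.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4707, 1: 1287}

n_total = sum(class_counts.values())
minority_fraction = class_counts[1] / n_total
imbalance_ratio = class_counts[0] / class_counts[1]

print(f"total rows: {n_total}")
print(f"minority fraction: {minority_fraction:.4f}")
print(f"majority:minority ratio: {imbalance_ratio:.2f}")
```

The minority fraction agrees with the mean of `FAFXNT_BIN` reported by `describe()` earlier, which is a quick sanity check that the target was encoded as 0/1.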


In [70]:
# Initial split: hold out 10% of the data for testing
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Second split: hold out a further 10% of the total for validation, leaving about 80% for training
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.1111, random_state=42) # 0.1111 * 0.9 ≈ 0.1
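
The sizes produced by the two-stage split can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole row, which is consistent with the 600-row support shown in the classification reports later on. The variable names are illustrative.

```python
import math

n_total = 5994  # rows in merged_df, from the describe() output

# First split: 10% held out for testing (assumed to round up)
n_test = math.ceil(0.1 * n_total)
n_temp = n_total - n_test

# Second split: 11.11% of the remainder held out for validation
n_val = math.ceil(0.1111 * n_temp)
n_train = n_temp - n_val

print(n_train, n_val, n_test)
print(f"train share: {n_train / n_total:.3f}")
```

Under that assumption the split comes out at roughly 80/10/10, as intended.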

## Train a simple logistic regression model to evaluate performance
1. First, we look at the current performance of the logistic regression model.
2. Next, we review the performance after resampling with SMOTE.
3. Then, we evaluate SMOTE in conjunction with Boruta for feature reduction.
4. Finally, we look at using just Boruta.

In [71]:
# Scaling data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic regression model
model = LogisticRegression(max_iter=1000)

# Fit the logistic regression model on the original (non-resampled), scaled training data
model.fit(X_train_scaled, y_train)

# Predictions on the test set from the model trained on the original data
y_test_pred_orig = model.predict(X_test_scaled)

# Evaluate model performance on original data
print("Classification report for original training data:")
print(classification_report(y_test, y_test_pred_orig))

Classification report for original training data:
              precision    recall  f1-score   support

           0       0.79      0.95      0.87       471
           1       0.36      0.10      0.16       129

    accuracy                           0.77       600
   macro avg       0.58      0.53      0.51       600
weighted avg       0.70      0.77      0.71       600
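
To see how the class 1 row of the report is derived, the metrics can be recomputed from confusion-matrix counts. The notebook does not print the confusion matrix, so the four counts below are hypothetical values chosen only because they reproduce the rounded figures in the report above. The formulas themselves are the standard definitions.

```python
# Hypothetical counts for class 1 (not printed by the notebook);
# chosen to be consistent with the rounded report above.
tp, fp, fn, tn = 13, 23, 116, 448

precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Accuracy looks acceptable here only because the majority class dominates. Recall for class 1 is the number to watch when judging the resampling steps that follow.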



In [72]:
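# Fit SMOTE on the training split only and create resampled X and y
# (requires `from imblearn.over_sampling import SMOTE`)
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
# Note: this re-fits the scaler on the resampled data, while X_test_scaled used below still comes
# from the earlier fit on X_train, so train and test are scaled with slightly different statistics.
X_res_scaled = scaler.fit_transform(X_res)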

# Fit resampled and scaled training data
model.fit(X_res_scaled, y_res)

# Predictions on the test data using resampled data
y_test_pred_resampled = model.predict(X_test_scaled)

# Evaluate model performance on resampled data
print("Classification report for SMOTE resampled training data:")
print(classification_report(y_test, y_test_pred_resampled))

Classification report for SMOTE resampled training data:
              precision    recall  f1-score   support

           0       0.85      0.56      0.68       471
           1       0.29      0.65      0.40       129

    accuracy                           0.58       600
   macro avg       0.57      0.61      0.54       600
weighted avg       0.73      0.58      0.62       600
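
For intuition about what SMOTE adds to the training set, the core step is an interpolation between a minority-class row and one of its minority-class neighbours. The sketch below shows only that interpolation step in plain Python. It is not imbalanced-learn's implementation, it skips the nearest-neighbour search, and the feature values and function name are hypothetical.

```python
import random

def interpolate_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # random position along the segment, in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
minority_point = [0.70, 1.05]     # hypothetical feature values for one minority row
minority_neighbor = [0.80, 1.15]  # hypothetical nearby minority row
synthetic = interpolate_sample(minority_point, minority_neighbor, rng)

print(synthetic)
```

Because every synthetic row is built from existing minority rows, resampling before the split would let information from test rows shape the training data, which is why the split comes first in this notebook.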



In [73]:
# Perform Boruta analysis for feature selection on the SMOTE resampled dataset
boruta_selector_smote = BorutaPy(estimator=RandomForestClassifier(n_estimators=75, max_depth=6), verbose=2, max_iter=75, random_state=20)
boruta_selector_smote.fit(X_res.values, y_res.values)

# Select the relevant features using the Boruta mask
X_res_selected = X_res.loc[:, boruta_selector_smote.support_]
X_test_selected = X_test.loc[:, boruta_selector_smote.support_]

# Fit the logistic regression model on the resampled and feature-selected training data
model.fit(X_res_selected, y_res)

# Predictions on the test data using resampled and feature-selected data
y_test_pred_resampled_boruta = model.predict(X_test_selected)

Iteration: 	1 / 75
Confirmed: 	0
Tentative: 	267
Rejected: 	0
Iteration: 	2 / 75
Confirmed: 	0
Tentative: 	267
Rejected: 	0
Iteration: 	3 / 75
Confirmed: 	0
Tentative: 	267
Rejected: 	0
Iteration: 	4 / 75
Confirmed: 	0
Tentative: 	267
Rejected: 	0
Iteration: 	5 / 75
Confirmed: 	0
Tentative: 	267
Rejected: 	0
Iteration: 	6 / 75
Confirmed: 	0
Tentative: 	267
Rejected: 	0
Iteration: 	7 / 75
Confirmed: 	0
Tentative: 	267
Rejected: 	0
Iteration: 	8 / 75
Confirmed: 	118
Tentative: 	20
Rejected: 	129
Iteration: 	9 / 75
Confirmed: 	118
Tentative: 	20
Rejected: 	129
Iteration: 	10 / 75
Confirmed: 	118
Tentative: 	20
Rejected: 	129
Iteration: 	11 / 75
Confirmed: 	118
Tentative: 	20
Rejected: 	129
Iteration: 	12 / 75
Confirmed: 	121
Tentative: 	17
Rejected: 	129
Iteration: 	13 / 75
Confirmed: 	121
Tentative: 	17
Rejected: 	129
Iteration: 	14 / 75
Confirmed: 	121
Tentative: 	17
Rejected: 	129
Iteration: 	15 / 75
Confirmed: 	121
Tentative: 	17
Rejected: 	129
Iteration: 	16 / 75
Confirmed: 	122
Tent

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
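
The warning above appears because the logistic regression in the previous cell was fit on unscaled, feature-selected data. One possible way to address it, sketched here on synthetic stand-in data rather than the notebook's own frames, is to wrap the scaler and the model in a scikit-learn pipeline so that the scaler fitted on the training data is reused unchanged at prediction time. The names `X_demo`, `y_demo`, and `clf` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the MrOS data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then applies that same fitted scaler when predicting.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(y_pred[:10])
```

In the notebook, the equivalent call would be to fit such a pipeline on `X_res_selected` and predict on `X_test_selected`. Results would then differ from the report shown below, which comes from the unscaled fit.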


In [76]:
# The run stopped without converging and 3 features were still unresolved, but the feature set has been reduced.
# Evaluate model performance on resampled and feature-selected data
print("Classification report for SMOTE resampled data with Boruta feature reduction:")
print(classification_report(y_test, y_test_pred_resampled_boruta))

# Get the confirmed and tentative features
confirmed_features_sb = np.array(X_res.columns)[boruta_selector_smote.support_]
selected_features_sb = np.array(X_res.columns)[boruta_selector_smote.support_weak_]

print(f"Confirmed features ({len(confirmed_features_sb)}):")
print(confirmed_features_sb)

print(f"\nTentative features ({len(selected_features_sb)}):")
print(selected_features_sb)

Classification report for SMOTE resampled data with Boruta feature reduction:
              precision    recall  f1-score   support

           0       0.85      0.62      0.72       471
           1       0.31      0.61      0.41       129

    accuracy                           0.62       600
   macro avg       0.58      0.62      0.56       600
weighted avg       0.74      0.62      0.65       600

Confirmed features (125):
['B1TRD' 'B1ITD' 'B1FND' 'B1L1D' 'B1L3D' 'B1TBD' 'B1HDD' 'B1LAD' 'B1RAD'
 'B1LRD' 'B1RRD' 'B1TSD' 'B1PED' 'B1LLD' 'B1RLD' 'NPSEAT' 'NPLEFT1'
 'NPRIGHT6' 'MHDIAB' 'MHBP' 'MHMI' 'MHANGIN' 'MHANGINT' 'MHPROST'
 'MHPROSTT' 'MHGLAU' 'MHCAT' 'MHSTOM' 'MHARTH' 'MHOSTART' 'MHGOUT'
 'MHARTDK' 'MHHIP' 'MHKNEE' 'MHHAND' 'MHBACK' 'MHNECK' 'MHSHOULD' 'MHFOOT'
 'MHARTHMD' 'MHKDNY' 'MHCANCER' 'MHSC' 'MHPC' 'MHDIZZY' 'MHDZBAL' 'MHFALL'
 'MHBRUISE' 'MHNOINJR' 'MHBW' 'MHWEIGHT' 'MHWGTAGE' 'MHHGTCM' 'MHWGTMKG'
 'MHKNEEOA' 'MHHANDOA' 'FFFX50' 'SECONDSTOCOMPLETE5STANDS?'
 'TIMETOCOMP

In [44]:
# Perform Boruta analysis for feature selection on the original dataset
boruta_selector_orig = BorutaPy(estimator=RandomForestClassifier(n_estimators=50, max_depth=6), verbose=2, max_iter=50, random_state=20)
boruta_selector_orig.fit(X_train.values, y_train.values)

# Select the relevant features using the Boruta mask
X_train_selected = X_train.loc[:, boruta_selector_orig.support_]
X_test_selected_orig = X_test.loc[:, boruta_selector_orig.support_]

# Fit the logistic regression model on the original and feature-selected training data
model.fit(X_train_selected, y_train)

# Predictions on the test set using original and feature-selected data
y_test_pred_orig_boruta = model.predict(X_test_selected_orig)

Iteration: 	1 / 50
Confirmed: 	0
Tentative: 	278
Rejected: 	0
Iteration: 	2 / 50
Confirmed: 	0
Tentative: 	278
Rejected: 	0
Iteration: 	3 / 50
Confirmed: 	0
Tentative: 	278
Rejected: 	0
Iteration: 	4 / 50
Confirmed: 	0
Tentative: 	278
Rejected: 	0
Iteration: 	5 / 50
Confirmed: 	0
Tentative: 	278
Rejected: 	0
Iteration: 	6 / 50
Confirmed: 	0
Tentative: 	278
Rejected: 	0
Iteration: 	7 / 50
Confirmed: 	0
Tentative: 	278
Rejected: 	0
Iteration: 	8 / 50
Confirmed: 	0
Tentative: 	18
Rejected: 	260
Iteration: 	9 / 50
Confirmed: 	11
Tentative: 	7
Rejected: 	260
Iteration: 	10 / 50
Confirmed: 	11
Tentative: 	7
Rejected: 	260
Iteration: 	11 / 50
Confirmed: 	11
Tentative: 	7
Rejected: 	260
Iteration: 	12 / 50
Confirmed: 	11
Tentative: 	7
Rejected: 	260
Iteration: 	13 / 50
Confirmed: 	11
Tentative: 	6
Rejected: 	261
Iteration: 	14 / 50
Confirmed: 	11
Tentative: 	6
Rejected: 	261
Iteration: 	15 / 50
Confirmed: 	11
Tentative: 	6
Rejected: 	261
Iteration: 	16 / 50
Confirmed: 	12
Tentative: 	5
Rejecte

In [53]:
# Evaluate model performance on original and feature-selected data
print("Classification report for original data with Boruta feature reduction:")
print(classification_report(y_test, y_test_pred_orig_boruta))

# Get the confirmed and tentative features (use the columns of the data Boruta was fit on)
confirmed_features = np.array(X_train.columns)[boruta_selector_orig.support_]
selected_features = np.array(X_train.columns)[boruta_selector_orig.support_weak_]

print(f"Confirmed features ({len(confirmed_features)}):")
print(confirmed_features)

print(f"\nTentative features ({len(selected_features)}):")
print(selected_features)

Classification report for original data with Boruta feature reduction:
              precision    recall  f1-score   support

           0       0.79      1.00      0.88       471
           1       0.33      0.01      0.02       129

    accuracy                           0.78       600
   macro avg       0.56      0.50      0.45       600
weighted avg       0.69      0.78      0.69       600

Confirmed features (13):
['B1TRD' 'B1ITD' 'B1FND' 'B1L1D' 'B1TBD' 'B1LAD' 'B1RAD' 'B1LRD' 'B1RRD'
 'B1PED' 'B1RLD' 'NPSEAT' 'MHHGTCM']

Tentative features (2):
['B1L3D' 'B1HDD']
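
If the tentative features are to be kept alongside the confirmed ones, the two Boruta masks can be combined with a logical OR. The sketch below does not call Boruta. It rebuilds stand-in masks from the feature names printed above, plus two other column names from the dataset standing in for rejected features, so the column order and the masks themselves are hypothetical.

```python
import numpy as np

# Feature names copied from the output above
confirmed_names = ['B1TRD', 'B1ITD', 'B1FND', 'B1L1D', 'B1TBD', 'B1LAD', 'B1RAD',
                   'B1LRD', 'B1RRD', 'B1PED', 'B1RLD', 'NPSEAT', 'MHHGTCM']
tentative_names = ['B1L3D', 'B1HDD']

# Hypothetical column list: the 15 names above plus two stand-ins for rejected columns
columns = np.array(confirmed_names + tentative_names + ['MHDIAB', 'MHBP'])

# Stand-ins for boruta_selector_orig.support_ and .support_weak_
support = np.isin(columns, confirmed_names)
support_weak = np.isin(columns, tentative_names)

# Keep a column if it is either confirmed or tentative
keep_mask = support | support_weak
kept_columns = columns[keep_mask]

print(len(kept_columns), kept_columns)
```

With the real selector, the same mask could be applied as `X_train.loc[:, keep_mask]` to build the 15-feature training frame.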


Working with the original (non-resampled) training data, Boruta narrowed the feature set to 15 (13 confirmed plus 2 tentative), but recall for the positive class collapsed to 0.01. Combining Boruta with SMOTE left 125 confirmed features out of the 267 it started with, which is progress, and positive-class recall was 0.61, close to the 0.65 obtained with SMOTE alone.