# Balancing Notebook

### The target variables in our dataset are quite imbalanced. This notebook uses the Synthetic Minority Over-sampling Technique (SMOTE) to create new synthetic samples that resemble existing observations in the minority class. By training on a dataset with balanced target classes, we hope to improve model performance and, in particular, predictive ability on the minority class (participants who experienced a fracture).
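To make the idea of "synthetic samples similar to existing observations" concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE. It is an illustration only, not imblearn's implementation: the function name `smote_like_sample`, the toy array `X_min`, and the choice of `k` are all hypothetical, and the sketch assumes the minority points contain no duplicates.

```python
import numpy as np

def smote_like_sample(X_minority, k=5, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(X_minority))
    base = X_minority[i]
    # Euclidean distances from the base point to every minority sample
    dists = np.linalg.norm(X_minority - base, axis=1)
    # Skip position 0 of the sort, which is the base point itself
    # (assumes no duplicate points)
    neighbour_idx = np.argsort(dists)[1:k + 1]
    neighbour = X_minority[rng.choice(neighbour_idx)]
    # Random position on the line segment between base and neighbour
    gap = rng.random()
    return base + gap * (neighbour - base)

# Toy minority class: six points in two dimensions (made-up values)
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
synthetic = smote_like_sample(X_min, k=3, rng=0)
print(synthetic)
```

Because every synthetic point lies on a segment between two real minority observations, it always stays inside the region already occupied by the minority class.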

#### 1. [Installation and Importing of Libraries](#eda_import)
#### 2. [SMOTE](#eda_smote)
#### 3. [Write a CSV of cleaned data](#eda_csv) (TO-DO: dependent on outcome variable)

### <a name="eda_import"></a>Installation and Importing of Libraries
This notebook relies on several libraries for data handling, resampling, and modeling. The first cell installs the packages needed for imblearn.

In [1]:
!pip install cmake

!pip install scikit-learn

!pip install imblearn



In [2]:
# Import libraries required for analysis
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler


### <a name="eda_smote"></a>SMOTE

The cell below loads the cleaned CSV from our merged_data folder.

In [3]:
# Set file path
file_path = "/dsa/groups/casestudy2023su/team03/merged_data/mros_merged_clean.csv"

# Load dataframe from CSV
merged_df = pd.read_csv(file_path)



In [4]:
merged_df.describe()

| | B1TRD | B1ITD | B1FND | B1L1D | B1L3D | B1TBD | B1HDD | B1LAD | B1RAD | B1LRD | ... | RADIALPULSE_MEAS1 | BMI | HEIGHTCHANGEFROM25 | WEIGHTCHANGEFROM25 | FAFXN | FAFXNT | GIAGE1 | FAFXN_BIN | FAFXNT_BIN | outlier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | ... | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 | 5994.0 |
| mean | 0.76499 | 1.111712 | 0.78417 | 0.979764 | 1.098752 | 1.167219 | 2.132447 | 0.848691 | 0.861511 | 0.706571 | ... | 64.349891 | 27.378324 | 3.705292 | 10.339891 | 0.432432 | 0.343577 | 73.657658 | 0.25342 | 0.214715 | 0.8999 |
| std | 0.127367 | 0.166538 | 0.128116 | 0.176903 | 0.20732 | 0.126578 | 0.332499 | 0.091894 | 0.092162 | 0.135757 | ... | 9.989382 | 3.82994 | 2.96009 | 11.375341 | 1.002012 | 0.813297 | 5.872264 | 0.435006 | 0.410659 | 0.436133 |
| min | 0.167047 | 0.389357 | 0.272729 | 0.298691 | 0.382238 | 0.764788 | 1.05692 | 0.515547 | 0.515601 | 0.443625 | ... | 36.0 | 17.211 | -30.18 | -42.3937 | 0.0 | 0.0 | 64.0 | 0.0 | 0.0 | -1.0 |
| 25% | 0.677239 | 0.998989 | 0.696196 | 0.859821 | 0.955975 | 1.083122 | 1.905628 | 0.791598 | 0.804045 | 0.636937 | ... | 58.0 | 24.77785 | 1.97 | 2.994512 | 0.0 | 0.0 | 69.0 | 0.0 | 0.0 | 1.0 |
| 50% | 0.758539 | 1.104805 | 0.773544 | 0.96803 | 1.08187 | 1.15755 | 2.11395 | 0.843987 | 0.856622 | 0.691454 | ... | 64.0 | 26.90575 | 3.59 | 9.36535 | 0.0 | 0.0 | 73.0 | 0.0 | 0.0 | 1.0 |
| 75% | 0.847302 | 1.221518 | 0.860129 | 1.086195 | 1.219765 | 1.241125 | 2.339317 | 0.898742 | 0.913262 | 0.751444 | ... | 70.0 | 29.490625 | 5.24 | 17.1256 | 1.0 | 0.0 | 78.0 | 1.0 | 0.0 | 1.0 |
| max | 1.69903 | 1.98445 | 1.59835 | 1.97685 | 2.24568 | 2.04658 | 4.6952 | 1.94582 | 2.13936 | 4.66774 | ... | 198.0 | 50.6687 | 30.87 | 69.2577 | 12.0 | 12.0 | 100.0 | 1.0 | 1.0 | 1.0 |


In [5]:
# Split into X and y: drop all target variables from X and use one target at a time as y.
# get_dummies one-hot encodes the categorical features.
X = pd.get_dummies(merged_df.drop(['FAFXN', 'FAFXNT', 'FAFXN_BIN', 'FAFXNT_BIN'], axis=1)) 
y = merged_df['FAFXNT_BIN'].astype(int)


In [6]:
# Display the count of each class in the target prior to rebalancing
print(y.value_counts())


0    4707
1    1287
Name: FAFXNT_BIN, dtype: int64
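To put a number on the imbalance, the counts printed above can be turned into a minority share and a majority-to-minority ratio. This is plain arithmetic on the two counts shown; the variable names are illustrative only.

```python
# Class counts printed by the cell above
counts = {0: 4707, 1: 1287}

total = sum(counts.values())
minority_share = counts[1] / total
imbalance_ratio = counts[0] / counts[1]

print(f"total observations: {total}")
print(f"minority share:     {minority_share:.3f}")
print(f"majority:minority:  {imbalance_ratio:.2f} : 1")
```

The minority share computed this way agrees with the mean of `FAFXNT_BIN` reported by `describe()` earlier (0.214715), which is a quick consistency check on the target column.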


### It is important to split the data into training, validation, and test sets before resampling, so that synthetic samples derived from validation or test observations do not leak into the training set.

In [7]:
# First split: hold out 10% of the data for testing
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Second split: hold out a further 10% of the full data for validation, leaving about 80% for training
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.1111, random_state=42)  # 0.1111 * 0.9 ≈ 0.1
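As a sanity check on the two-step split, the sketch below reproduces it on stand-in labels with the same class counts as `FAFXNT_BIN` and prints the resulting set sizes. It also passes `stratify`, which the notebook does not use; that is an assumption added here to show one way of keeping the minority rate similar across the three sets. `X_demo` and `y_demo` are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as FAFXNT_BIN (4707 zeros, 1287 ones)
y_demo = np.array([0] * 4707 + [1] * 1287)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Same two-step split as above, with stratify added (an assumption, not in the notebook)
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.1111, random_state=42, stratify=y_tmp)

# Report the size, share of the full data, and minority rate of each set
for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(part)}, share={len(part) / len(y_demo):.3f}, "
          f"minority rate={part.mean():.3f}")
```

Without `stratify`, the set sizes are the same but the minority rate in each set can drift from the overall rate by chance.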

In [8]:
# Fit SMOTE on the training set only and create the resampled X and y
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

print(y_res.value_counts())

1    3748
0    3748
Name: FAFXNT_BIN, dtype: int64


## Train a simple logistic regression model to compare test-set performance when training on the data before and after resampling

In [14]:
# Initialize StandardScaler
scaler = StandardScaler()

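# Scaling data
# The scaler is fit on X_train, then used to transform X_test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Note: fit_transform refits the same scaler on X_res, so the resampled data is scaled
# with its own statistics while X_test_scaled keeps the X_train statistics
X_res_scaled = scaler.fit_transform(X_res)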

# logistic regression model
model = LogisticRegression(max_iter=1000)

# Fit resampled and scaled training data
model.fit(X_res_scaled, y_res)

# predictions on the test set
y_test_pred = model.predict(X_test_scaled)

# evaluate model performance
print("Classification report for resampled training data:")
print(classification_report(y_test, y_test_pred))

# Refit the same model on the original (non-resampled) scaled training data
model.fit(X_train_scaled, y_train)

# predictions on the test set
y_test_pred_orig = model.predict(X_test_scaled)

# evaluate model performance
print("Classification report for original training data:")
print(classification_report(y_test, y_test_pred_orig))



Classification report for resampled training data:
              precision    recall  f1-score   support

           0       0.91      0.15      0.26       471
           1       0.23      0.95      0.38       129

    accuracy                           0.32       600
   macro avg       0.57      0.55      0.32       600
weighted avg       0.77      0.32      0.29       600

Classification report for original training data:
              precision    recall  f1-score   support

           0       0.80      0.95      0.87       471
           1       0.38      0.12      0.18       129

    accuracy                           0.77       600
   macro avg       0.59      0.53      0.52       600
weighted avg       0.71      0.77      0.72       600
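One detail of the comparison cell is worth noting: the single `scaler` is refit on `X_res` after `X_test_scaled` has already been computed from the `X_train` fit, so the resampled model is trained under one scaling and tested under another. The sketch below shows one possible way to keep the two consistent by giving each model its own scaler inside a pipeline. It runs on synthetic data from `make_classification`, and it uses simple random oversampling as a stand-in for SMOTE (a different technique), so its numbers say nothing about the results above; all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly an 80/20 class split
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Simple random oversampling as a stand-in for SMOTE (not the same technique)
rng = np.random.default_rng(42)
minority_idx = np.flatnonzero(y_tr == 1)
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = rng.choice(minority_idx, size=n_extra, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

# Each pipeline owns its scaler, so the test set is always transformed
# with the statistics of the data that particular model was trained on
model_orig = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_bal = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_orig.fit(X_tr, y_tr)
model_bal.fit(X_bal, y_bal)

pred_orig = model_orig.predict(X_te)
pred_bal = model_bal.predict(X_te)
print("predicted positives (original): ", int(pred_orig.sum()))
print("predicted positives (resampled):", int(pred_bal.sum()))
```

Using two separate model objects also keeps both fitted models available for later inspection, whereas refitting a single `model` overwrites the first fit.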

