# Improved Biomedical Data Analysis: Handling Imbalanced Data
### Teaching Notebook (Extended)

**Objective:** Improve diabetes prediction performance using class balancing techniques.\
	• Identify class imbalance in real-world data\
	• Apply **SMOTE (Synthetic Minority Over-sampling Technique)** to create a balanced training set\
	• Train a Random Forest on resampled data\
	• Evaluate model improvement for minority class detection

Dataset source: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

In [3]:
# Install required packages (run only if needed)
# !pip install pandas matplotlib seaborn scikit-learn imbalanced-learn

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

## Load the Dataset

In [4]:
df = pd.read_csv('diabetes_binary_health_indicators_BRFSS2015.csv')
df.head()

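| | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |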
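
Before resampling, it helps to quantify the imbalance directly. The sketch below is a minimal example, assuming the target column is named `Diabetes_binary` as in the preview above. `class_balance` is a hypothetical helper, and the small `toy` frame stands in for `df` so the snippet runs on its own.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and shares for a label column."""
    counts = labels.value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


# Toy stand-in for df (hypothetical values, not the BRFSS data)
toy = pd.DataFrame({"Diabetes_binary": [0.0] * 6 + [1.0] * 1})
balance = class_balance(toy["Diabetes_binary"])
print(balance)

# Ratio of majority to minority class
imbalance_ratio = balance["count"].max() / balance["count"].min()
print(f"Imbalance ratio: {imbalance_ratio:.1f} to 1")
```

On the real data the call would be `class_balance(df['Diabetes_binary'])`. A bar chart such as `sns.countplot(x='Diabetes_binary', data=df)` is another way to show the same thing.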


## Addressing Class Imbalance with SMOTE (Synthetic Minority Over-sampling Technique)

In [7]:
X = df.drop(columns='Diabetes_binary')
y = df['Diabetes_binary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply SMOTE to training data only
sm = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_train)
print('Training class counts before SMOTE:', y_train.value_counts().to_dict())
print('Training class counts after SMOTE:', pd.Series(y_train_resampled).value_counts().to_dict())

Training class counts before SMOTE: {0.0: 152729, 1.0: 24847}
Training class counts after SMOTE: {1.0: 152729, 0.0: 152729}
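
The split above is not stratified, so the diabetic share can differ slightly between the training and test sets. One option, shown here as a sketch on toy data rather than as this notebook's method, is to pass `stratify=y` to `train_test_split`, which keeps the class proportions the same in both splits. Resampling would still happen afterwards, on the training portion only, so the test set keeps the real-world class mix. The names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 negatives and 20 positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Both splits keep the 20% positive share of the full toy set
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

Using a stratified split in this notebook would change the class counts printed above, so the later results would need to be rerun.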


## Train Model on Resampled Data

In [10]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_resampled, y_train_resampled)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.88      0.96      0.92     65605
         1.0       0.45      0.20      0.27     10499

    accuracy                           0.86     76104
   macro avg       0.67      0.58      0.60     76104
weighted avg       0.82      0.86      0.83     76104
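
The averages in the report follow directly from the per-class rows, and recomputing them shows why accuracy alone is a weak guide on imbalanced data. The sketch below uses only the rounded values printed above, so its results are approximate. The variable names are illustrative.

```python
# Per-class values copied from the classification report above (rounded)
f1 = {0.0: 0.92, 1.0: 0.27}
recall = {0.0: 0.96, 1.0: 0.20}
support = {0.0: 65605, 1.0: 10499}
total = sum(support.values())

# Macro average: unweighted mean, so both classes count equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class weighted by its number of test rows
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

# Accuracy of a model that always predicts class 0.0 ("no diabetes")
majority_baseline = support[0.0] / total

# Approximate number of diabetic test rows the model catches
caught = recall[1.0] * support[1.0]

print(f"macro F1              ~ {macro_f1:.3f}")
print(f"weighted F1           ~ {weighted_f1:.3f}")
print(f"majority baseline     ~ {majority_baseline:.3f}")
print(f"diabetic cases caught ~ {caught:.0f} of {support[1.0]}")
```

The majority baseline comes out at about 0.862, essentially the same as the reported 0.86 accuracy. Minority-class recall and macro F1 are therefore the more informative numbers to compare between runs.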



## Summary
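This notebook applied **SMOTE** to balance the training set and then trained a Random Forest on the resampled data.
With SMOTE applied, the model is trained on an equal number of diabetic and non-diabetic cases (152,729 each). The aim is to improve the model's ability to identify diabetic patients, particularly recall and F1-score for the minority class. On the untouched test set, the diabetic class reaches precision 0.45, recall 0.20, and F1-score 0.27, so most diabetic cases are still missed.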

### Before vs. After: Why We Used SMOTE

Originally, the training data had far more people without diabetes than with it: 152,729 versus 24,847, roughly 6 to 1. The model mostly learned to say “no diabetes” and was bad at catching actual diabetes cases. That’s because it didn’t have enough examples to learn from the smaller group (the diabetic patients).

**SMOTE** helps fix this by creating new synthetic diabetic examples in the training set, each one interpolated between an existing diabetic row and one of its nearest diabetic neighbors. Now the model sees equal numbers of diabetic and non-diabetic people during training, which gives it a fair chance to learn both.
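
To make "synthetic examples" concrete: SMOTE picks a minority sample, picks one of its nearest minority-class neighbors, and places a new point somewhere on the line segment between them. The following is a minimal NumPy sketch of that interpolation step only, not the `imblearn` implementation. `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between
    minority samples and their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))    # random minority sample
        nb = rng.choice(neighbors[base])   # one of its k neighbors
        lam = rng.random()                 # position along the segment
        synthetic[i] = X_min[base] + lam * (X_min[nb] - X_min[base])
    return synthetic


# Toy minority class (hypothetical values): columns could be BMI and Age
X_minority = np.array([[30.0, 9.0], [32.0, 10.0], [28.0, 11.0], [35.0, 8.0]])
new_points = smote_like_samples(X_minority, n_new=5)
print(new_points)
```

Because the new points are interpolated, binary columns such as `HighBP` can take fractional values in synthetic rows. `imblearn` also provides `SMOTENC` for data that mixes categorical and continuous features, which may be worth considering for this dataset.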

### So What Improved?
• The model is now better at identifying diabetes, especially in recall (catching more of the real diabetes cases), although recall for the diabetic class is still only 0.20 on the test set.\
• The overall accuracy stayed at 86%, but now it’s less biased toward predicting “no diabetes.”\
• The macro-average F1-score increased from 0.59 to 0.60. That is a small gain, but one that can matter in healthcare, where missed cases are costly.