# Working with Imbalanced Datasets

* [imbalanced-learn documentation](https://imbalanced-learn.org/stable/)

Imbalanced-learn (imported as `imblearn`) is an open-source, MIT-licensed library that builds on scikit-learn (imported as `sklearn`) and provides tools for classification problems with imbalanced classes.

In [1]:
# Libraries
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Remove warning logging
warnings.filterwarnings("ignore")

# Set the seaborn style
sns.set(style="darkgrid")

In [3]:
# Load Dataset
df = pd.read_csv("diabetes.csv")
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,Pedigree,Age,Diabetes
0,10,129,76,28,122,35.9,0.28,39,0
1,4,84,90,23,56,39.5,0.159,25,0
2,0,84,82,31,125,38.2,0.233,23,0
3,9,134,74,33,60,25.9,0.46,81,0
4,0,124,56,13,105,21.8,0.452,21,0


In [4]:
df.shape

(550, 9)

## Data Preprocessing

In [5]:
# Drop duplicates
# df.duplicated().sum() # Check for duplicates
df.drop_duplicates(inplace=True)

# Check for missing values per column (nothing is dropped here)
print(f"Summary of Missing Values: {df.isnull().sum()}")

Summary of Missing Values: Pregnancies      0
Glucose          0
BloodPressure    0
SkinThickness    0
Insulin          0
BMI              0
Pedigree         0
Age              0
Diabetes         0
dtype: int64


## View Class Balance

In [6]:
df.groupby('Diabetes').size()

Diabetes
0    500
1     21
dtype: int64
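
To quantify the imbalance, the class shares and the majority-to-minority ratio can be computed from the counts above. This is a minimal sketch that rebuilds the label column from the printed counts (500 and 21) rather than loading `diabetes.csv`; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the target column from the counts printed above (500 zeros, 21 ones)
y = pd.Series([0] * 500 + [1] * 21, name="Diabetes")

counts = y.value_counts()                 # absolute count per class
shares = y.value_counts(normalize=True)   # relative share per class
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(f"Imbalance ratio (majority / minority): {imbalance_ratio:.1f}")
```

Class 1 makes up roughly 4% of the rows, so a model can reach high accuracy without ever predicting it.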

## Classic Approach

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix


In [8]:
X = df.drop(columns=['Diabetes'])
y = df['Diabetes']

seed = 15
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

# Count the 0s and 1s in the target variable y_test
print(f"Count of 0s: {y_test.value_counts()[0]}")
print(f"Count of 1s: {y_test.value_counts()[1]}")

# # Other option to count the 0s and 1s in the target variable y_test
# print(f"Count of 0s: {np.count_nonzero(y_test == 0)}")
# print(f"Count of 1s: {np.count_nonzero(y_test == 1)}")


Count of 0s: 100
Count of 1s: 5


In [9]:
# Define the models in a list
models = []
models.append(('LR', LogisticRegression(random_state=seed)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier(random_state=seed)))
models.append(('RFC', RandomForestClassifier(random_state=seed)))

In [10]:
# Fit each model on the training data (X_train, y_train),
# then evaluate its predictions on the held-out test set
for name, model in models:
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    acc = accuracy_score(y_pred=y_hat, y_true=y_test)
    pre = precision_score(y_pred=y_hat, y_true=y_test)
    rec = recall_score(y_pred=y_hat, y_true=y_test)
    print(f"- Model {name} - Accuracy: {acc:.2f} - Precision: {pre:.2f} - Recall: {rec:.2f}")
    print(f"  Confusion Matrix:\n  {confusion_matrix(y_pred=y_hat, y_true=y_test)}")


- Model LR - Accuracy: 0.95 - Precision: 0.00 - Recall: 0.00
  Confusion Matrix:
  [[100   0]
 [  5   0]]
- Model KNN - Accuracy: 0.95 - Precision: 0.00 - Recall: 0.00
  Confusion Matrix:
  [[100   0]
 [  5   0]]
- Model DTC - Accuracy: 0.94 - Precision: 0.40 - Recall: 0.40
  Confusion Matrix:
  [[97  3]
 [ 3  2]]
- Model RFC - Accuracy: 0.93 - Precision: 0.00 - Recall: 0.00
  Confusion Matrix:
  [[98  2]
 [ 5  0]]


### Results Summary

The previous results showcase the performance of multiple classification models on the imbalanced diabetes dataset. The models evaluated include Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree Classifier (DTC), and Random Forest Classifier (RFC). The dataset was split into training and testing sets, with a random seed of 15 to ensure reproducibility.

Key observations:
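- After duplicates are dropped, the dataset contains 521 rows with 8 features and a target variable (`Diabetes`), which is highly imbalanced (500 instances of class 0 and 21 instances of class 1).
- After training, the models were evaluated on the test set (`X_test`, `y_test`) using accuracy, precision, recall, and confusion matrices.
- Logistic Regression (LR) and KNN reached the highest accuracy (0.95), but only by predicting class 0 for every test sample, so their precision and recall are 0.00. The Random Forest Classifier (RFC) had the lowest accuracy (0.93) and also missed all 5 positive cases. Only the Decision Tree Classifier (DTC) identified any positives (2 of 5, recall 0.40).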

The class imbalance clearly hurts prediction of the minority class (class 1): three of the four models do not correctly identify a single positive sample, yet all of them score at least 0.93 in accuracy. Accuracy alone is therefore misleading here. Further techniques, such as resampling or specialized algorithms, could be explored to address this issue.
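
The gap between accuracy and recall can be reproduced from the LR confusion matrix alone. This is a minimal sketch that rebuilds the labels from that matrix (100 true negatives, 5 false negatives) instead of using real model output. `balanced_accuracy_score` is an additional scikit-learn metric that the notebook above does not use.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Test labels as in the first split: 100 negatives, 5 positives
y_true = np.array([0] * 100 + [1] * 5)
# A model that always predicts the majority class (what LR and KNN did)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print(f"Accuracy: {acc:.2f} - Recall: {rec:.2f} - Balanced accuracy: {bal:.2f}")
```

Balanced accuracy averages the recall of each class, so an always-negative predictor scores 0.50 instead of 0.95.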

## Sampling Strategy 1: Stratify

What is stratified sampling?

Stratified sampling is a sampling technique in which the population is divided into subgroups (strata) based on characteristics relevant to the problem before sampling. Samples are drawn from each subgroup with sample sizes proportional to the subgroup's size in the population, then combined to form the final sample. The purpose is to ensure that every subgroup is represented proportionally in the final sample.

Stratified sampling is particularly useful when there are known variations within the population that could significantly impact the model results.

* [Scikit Learn - cross-validation-iterators-with-stratification-based-on-class-labels](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels)
* [Scikit Learn - train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
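
The effect of `stratify` can be checked on labels with the same class counts as this dataset. This is a minimal sketch, assuming synthetic labels (500 zeros, 21 ones) and a placeholder feature matrix rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the diabetes data
y = np.array([0] * 500 + [1] * 21)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=99, stratify=y
)

# Both splits keep roughly the original share of class 1
print(f"Overall share of 1s: {y.mean():.3f}")
print(f"Train share of 1s:   {y_train.mean():.3f}")
print(f"Test share of 1s:    {y_test.mean():.3f}")
```

Stratification fixes how many minority samples land in each split, but it does not create new ones: the test set still contains only a handful of positives.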



In [15]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [19]:
X = df.drop(columns=['Diabetes'])
y = df['Diabetes']

seed = 99

# Stratified sampling ensures that the proportion of classes in the target variable is maintained in both training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)

# Count the 0s and 1s in the target variable y_test
print("y_test:")
print(f"  Count of 0s: {y_test.value_counts()[0]}")
print(f"  Count of 1s: {y_test.value_counts()[1]}")

X_test

y_test:
  Count of 0s: 101
  Count of 1s: 4


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,Pedigree,Age
56,1,97,70,15,0,18.2,0.147,21
296,6,96,0,0,0,23.7,0.190,28
519,1,0,74,20,23,27.7,0.299,21
203,2,56,56,28,45,24.2,0.332,22
211,3,129,92,49,155,36.4,0.968,32
...,...,...,...,...,...,...,...,...
314,2,112,66,22,0,25.0,0.307,24
375,6,92,92,0,0,19.9,0.188,28
47,0,102,86,17,105,29.3,0.695,27
107,6,183,94,0,0,40.8,1.461,45


In [None]:
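# Fit each model with stratified 5-fold cross-validation on the training data.
# The empty param_grid means only the default hyperparameters are evaluated;
# the refit best estimator is then scored on the held-out test set.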
for name, alg in models:
    grid_model = GridSearchCV(
            estimator=alg,
            param_grid = {},
            #param_grid={'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']},
            scoring='f1',
            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
            #verbose=0,
            #n_jobs=-1
        )
    grid_model.fit(X_train, y_train)
    y_hat = grid_model.predict(X_test)
    acc = accuracy_score(y_pred=y_hat, y_true=y_test)
    pre = precision_score(y_pred=y_hat, y_true=y_test)
    rec = recall_score(y_pred=y_hat, y_true=y_test)
    print(f"- Model {name} - Accuracy: {acc:.2f} - Precision: {pre:.2f} - Recall: {rec:.2f}")
    #print(f"  Confusion Matrix:\n  {confusion_matrix(y_pred=y_hat, y_true=y_test)}")

- Model LR - Accuracy: 0.95 - Precision: 0.00 - Recall: 0.00
- Model KNN - Accuracy: 0.95 - Precision: 0.00 - Recall: 0.00
- Model DTC - Accuracy: 0.95 - Precision: 0.00 - Recall: 0.00
- Model RFC - Accuracy: 0.95 - Precision: 0.00 - Recall: 0.00


## Sampling Strategy 2: Random Over-Sampling

Over-sampling is a technique for addressing class imbalance in datasets, where one class (e.g., "fraudulent transactions") is significantly underrepresented compared to another (e.g., "legitimate transactions").

What it does: over-sampling increases the number of instances in the minority class to balance the dataset. Random over-sampling does this by duplicating randomly chosen minority samples. This helps models learn the characteristics of both classes more effectively and reduces bias toward the majority class.

* [Imbalanced-learn - over_sampling](https://imbalanced-learn.org/stable/over_sampling.html)
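
The idea behind random over-sampling can be sketched in plain NumPy: draw minority-class rows with replacement until both classes have the same size. This is an illustrative hand-rolled version on synthetic data, not the imbalanced-learn implementation, and the helper name `random_over_sample` is hypothetical. In imbalanced-learn, the corresponding tool is `RandomOverSampler` with its `fit_resample` method.

```python
import numpy as np

def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            cls_idx = np.flatnonzero(y == cls)
            # Sample with replacement from this class to fill the gap
            keep.append(rng.choice(cls_idx, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Synthetic data with the same class counts as the diabetes data
y = np.array([0] * 500 + [1] * 21)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_res, y_res = random_over_sample(X, y)
print(np.bincount(y_res))  # class counts after resampling
```

Resampling should be applied to the training split only, so that duplicated rows cannot leak into the test set.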