# X4: Handling Imbalanced Data

Techniques for learning from datasets with skewed class distributions.

## Introduction

Many real-world problems have imbalanced classes:

- Fraud detection: 0.1% fraud
- Disease screening: 1% positive
- Manufacturing defects: 0.01% defective

A naive model can achieve high accuracy on such data simply by always predicting the majority class, while never detecting a single minority-class example!
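To make the accuracy problem concrete, here is a minimal sketch using only the standard library. It assumes a hypothetical set of 100,000 transactions with the 0.1% fraud rate from the list above and a "model" that never flags fraud; the counts are illustrative, not real data.

```python
# Accuracy paradox: always predicting the majority class at a 0.1% fraud rate.
n_total = 100_000
n_fraud = 100  # 0.1% positive class (illustrative count)

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # naive model: never flags fraud

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_total  # fraction of all predictions that are right
recall = true_pos / n_fraud   # fraction of fraud cases actually caught

print(f'Accuracy: {accuracy:.3f}')
print(f'Recall (fraud): {recall:.3f}')
```

The accuracy looks excellent even though no fraud case is caught, which is why the rest of this notebook reports per-class metrics.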

## Table of Contents

1. Problem Understanding
2. Evaluation Metrics
3. Resampling Techniques
4. Algorithm-Level Approaches
5. Cost-Sensitive Learning

In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

np.random.seed(42)

## 1. Create Imbalanced Dataset

In [None]:
# Synthetic binary problem with a nominal 95/5 class split
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           n_redundant=5, weights=[0.95, 0.05], random_state=42)

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

counts = Counter(y_train)
print(f'Class distribution: {counts}')
print(f'Imbalance ratio: {counts[0] / counts[1]:.1f}:1')

## 2. Baseline Model (Fails on Imbalanced Data)

In [None]:
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print('Baseline (without handling imbalance):')
print(classification_report(y_test, y_pred))
print(f'F1 Score (minority class): {f1_score(y_test, y_pred):.3f}')

## 3. SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE **creates synthetic examples** of the minority class by interpolating between an existing minority sample and one of its nearest minority-class neighbors.
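The interpolation step can be illustrated with a small standard-library sketch. This is a simplified illustration of the idea, not the `imblearn` implementation: the function name `smote_like_samples`, the toy points, and the choice of `k=3` are all assumptions made for the example.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        # New point lies on the line segment between base and neighbour
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority-class points
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0), (2.5, 2.5)]
new_points = smote_like_samples(minority_points, n_new=4)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies. That is also the source of the "unrealistic samples" risk: a segment can pass through territory that really belongs to the majority class.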

In [None]:
# Resample the training set only; the test set keeps its original distribution
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f'After SMOTE: {Counter(y_resampled)}')

model_smote = RandomForestClassifier(random_state=42)
model_smote.fit(X_resampled, y_resampled)
y_pred_smote = model_smote.predict(X_test)

print('\nWith SMOTE:')
print(classification_report(y_test, y_pred_smote))
print(f'F1 Score: {f1_score(y_test, y_pred_smote):.3f}')

## 4. Class Weights (Algorithm-Level)

In [None]:
model_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)
model_weighted.fit(X_train, y_train)
y_pred_weighted = model_weighted.predict(X_test)

print('With Class Weights:')
print(classification_report(y_test, y_pred_weighted))
print(f'F1 Score: {f1_score(y_test, y_pred_weighted):.3f}')
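For intuition about what `class_weight='balanced'` does, scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula with the standard library on a hypothetical label list with an exact 95/5 split; the helper name `balanced_class_weights` is made up for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 950 + [1] * 50  # hypothetical exact 95/5 split
weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives a weight 19 times larger than the majority class here, so misclassifying a minority sample costs the model correspondingly more during training.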

## Conclusion

**Techniques Comparison:**

**SMOTE:**

- ✅ Often works well
- ⚠️ Can create unrealistic samples
- Use for moderate imbalance (less extreme than 1:100)

**Under-sampling:**

- ✅ Fast
- ⚠️ Loses information
- Use when data is abundant

**Class Weights:**

- ✅ No data modification
- ✅ Works with most algorithms
- ✅ Try this first!

**Best Practice:**

1. Try `class_weight='balanced'` first
2. If insufficient, add SMOTE
3. Always evaluate with F1 and ROC-AUC, not accuracy
4. Use stratified splits
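Under-sampling appears in the comparison but is not demonstrated in the cells above. As a rough standard-library sketch of the idea (the notebook imports `RandomUnderSampler` from `imblearn`, which would be the practical choice), the hypothetical helper below keeps every sample of the rarest class and an equally sized random subset of each other class. The toy data is made up for illustration.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=42):
    """Keep all samples of the rarest class and an equal-sized random subset of the others."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_keep = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()  # preserve the original sample order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: 5 minority samples out of 100
X_toy = [[float(i)] for i in range(100)]
y_toy = [1 if i < 5 else 0 for i in range(100)]

X_small, y_small = random_undersample(X_toy, y_toy)
print(Counter(y_small))
```

In this toy case 90 of the 95 majority samples are thrown away, which illustrates the "loses information" caveat: the approach is only attractive when the majority class is large enough that the discarded samples are redundant.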