# DA5401 A4 — GMM-Based Synthetic Sampling for Imbalanced Data

**Completed notebook (Parts B & C).**

This notebook completes Parts B and C of the assignment, reusing the SMOTE, CBO (clustering-based oversampling), and CBU (clustering-based undersampling) ideas from A3. Explanations are given in markdown cells ahead of the code they describe.

In [None]:

# Setup: imports and basic config
import os, sys, math, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, classification_report, roc_auc_score, confusion_matrix
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

print('Ready. Random state =', RANDOM_STATE)


## Part A (brief reproducible baseline)

Load the `creditcard.csv` dataset. **Note**: place `creditcard.csv` either in a folder named `Dataset/` next to this notebook or directly in the notebook folder. The code below checks both locations, then falls back to `/mnt/data/creditcard.csv`.

In [None]:

# Try to locate dataset
candidates = ['Dataset/creditcard.csv', 'creditcard.csv', '/mnt/data/creditcard.csv']
path = None
for c in candidates:
    if os.path.exists(c):
        path = c
        break

if path is None:
    raise FileNotFoundError("creditcard.csv not found. Please place it in 'Dataset/creditcard.csv' or the current directory.")
else:
    print('Using dataset at:', path)

df = pd.read_csv(path)
print('Dataset shape:', df.shape)
print('Class distribution:\n', df['Class'].value_counts())


In [None]:

# Make a reproducible train/test split (keep test set imbalanced)
X = df.drop(columns=['Class']).values
y = df['Class'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_STATE, stratify=y)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)
print('Train class counts:', Counter(y_train))
print('Test class counts:', Counter(y_test))


In [None]:

# Helper functions for training/evaluation and plotting
def fit_eval_lr(X_tr, y_tr, X_te, y_te, desc='LR'):
    pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))])
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    y_proba = pipe.predict_proba(X_te)[:,1] if hasattr(pipe, 'predict_proba') else None
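    # average='binary' reports the class given by pos_label; the labels argument is ignored in that mode
    p, r, f, _ = precision_recall_fscore_support(y_te, y_pred, pos_label=1, average='binary', zero_division=0)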
    auc = roc_auc_score(y_te, y_proba) if y_proba is not None else np.nan
    print(f"== {desc} ==\nPrecision(1): {p:.4f}, Recall(1): {r:.4f}, F1(1): {f:.4f}, ROC-AUC: {auc:.4f}")
    return {'Precision (1)': p, 'Recall (1)': r, 'F1 (1)': f, 'ROC–AUC': auc}

def bar_compare(df_metrics, metrics=('Precision (1)', 'Recall (1)', 'F1 (1)')):
    ax = df_metrics[list(metrics)].plot(kind='bar', rot=45, figsize=(10,5))
    ax.set_title('Comparison of metrics (minority class = 1)')
    plt.tight_layout()
    plt.show()


In [None]:

# Baseline (imbalanced) logistic regression
baseline_metrics = fit_eval_lr(X_train, y_train, X_test, y_test, desc='Baseline (imbalanced)')



## Part B — GMM for Synthetic Sampling

**Theoretical notes (short):**

- **SMOTE** generates synthetic minority samples by interpolating between nearest minority neighbours. It is *local* and purely geometric: it does not fit an explicit generative model to the minority distribution.

- **GMM-based sampling** fits a probabilistic model (a mixture of Gaussians) to the minority class. Once fit, sampling from the GMM draws from estimated component densities, allowing generation that respects multi-modality of the minority distribution and captures covariance structure.

**Why GMM can be better:** when the minority class has multiple sub-groups (clusters) or anisotropic covariance (elongated clusters), SMOTE's linear interpolation can create unrealistic points that cross cluster boundaries. GMM, by estimating per-component means and covariances, can generate points that better follow each sub-cluster.
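
The boundary-crossing effect can be reproduced on toy data. The sketch below is illustrative only and makes several assumptions that are not part of this notebook: a 2-D minority class with one large and one very small sub-cluster, a hand-written SMOTE-like interpolator (`smote_like`), and an arbitrary "gap" band between the clusters. Because the small sub-cluster has fewer points than `k_neighbors`, some of its neighbours lie in the other cluster, so interpolation can land in the empty region between them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)

# Toy minority class: a large sub-cluster near x = -5 and a tiny one near x = +5.
big = rng.normal(loc=[-5.0, 0.0], scale=0.5, size=(40, 2))
small = rng.normal(loc=[5.0, 0.0], scale=0.5, size=(4, 2))
X_toy = np.vstack([big, small])

def smote_like(X, n_samples, k_neighbors, rng):
    """Interpolate between a random point and one of its k nearest neighbours."""
    # kneighbors() without arguments excludes each point from its own neighbour list.
    neigh = NearestNeighbors(n_neighbors=k_neighbors).fit(X).kneighbors(return_distance=False)
    i = rng.randint(len(X), size=n_samples)
    j = neigh[i, rng.randint(k_neighbors, size=n_samples)]
    gap = rng.rand(n_samples, 1)
    return X[i] + gap * (X[j] - X[i])

def gap_fraction(S):
    """Fraction of points inside the empty band |x| < 2 between the sub-clusters."""
    return float(np.mean(np.abs(S[:, 0]) < 2.0))

n_new = 2000
S_smote = smote_like(X_toy, n_new, k_neighbors=5, rng=rng)
gmm_toy = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X_toy)
S_gmm, _ = gmm_toy.sample(n_new)

print('in-gap fraction, SMOTE-like:', gap_fraction(S_smote))
print('in-gap fraction, GMM       :', gap_fraction(S_gmm))
```

On real data the difference depends on how well separated the sub-groups are and on whether the GMM recovers them, so this should be read as an intuition aid rather than a guarantee.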

Below we implement a practical pipeline:
1. Choose number of GMM components via BIC/AIC.
2. Fit GMM to minority training data.
3. Generate synthetic samples from the fitted GMM to reach the desired target.
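
For step 1, it helps to see what BIC and AIC penalise. A minimal sketch, assuming the standard definitions (BIC = -2 log L + p log n, AIC = -2 log L + 2p) and a full-covariance GMM; `n_params_full_gmm` and the toy data are hypothetical helpers for illustration, and the last line assumes 30 feature columns, which should be checked against `X_train.shape[1]`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def n_params_full_gmm(k, d):
    """Free parameters of a k-component GMM with full covariances in d dimensions."""
    means = k * d
    covariances = k * d * (d + 1) // 2
    weights = k - 1  # mixing weights sum to 1
    return means + covariances + weights

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))  # toy data, not the credit-card features

k = 2
gmm_demo = GaussianMixture(n_components=k, covariance_type='full', random_state=0).fit(X_demo)

n, d = X_demo.shape
log_lik = gmm_demo.score(X_demo) * n  # score() is the mean per-sample log-likelihood
p = n_params_full_gmm(k, d)
bic_manual = -2.0 * log_lik + p * np.log(n)
aic_manual = -2.0 * log_lik + 2.0 * p

print('BIC manual vs sklearn:', bic_manual, gmm_demo.bic(X_demo))
print('AIC manual vs sklearn:', aic_manual, gmm_demo.aic(X_demo))
print('parameter counts for k = 1, 2, 3 at d = 30:',
      [n_params_full_gmm(k_, 30) for k_ in (1, 2, 3)])
```

The parameter count grows quadratically in the number of features, so with only a few hundred minority rows the penalty term can dominate and push the selection toward small `k`. If that happens, `covariance_type='diag'` or `'tied'` may be worth comparing.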

We also implement _Clustering-Based Undersampling (CBU)_, which shrinks the majority class by clustering it with KMeans and keeping, from each cluster, the real sample nearest the centroid.
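
Running full KMeans with one cluster per retained point can be slow on a large majority class. The sketch below shows one possible lighter variant, not the implementation used in this notebook: `cbu_undersample_minibatch` is a hypothetical helper that swaps in `MiniBatchKMeans` and picks the nearest real row to each centroid with `pairwise_distances_argmin_min`. Unlike the per-cluster selection used later, it takes the nearest row overall, so two centroids may share a row and the result can be slightly smaller than `target_n`.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

def cbu_undersample_minibatch(X_majority, target_n, random_state=42):
    """Hypothetical faster CBU variant: MiniBatchKMeans centroids, then the
    nearest real majority row to each centroid (duplicates removed)."""
    if target_n >= len(X_majority):
        return X_majority.copy()
    km = MiniBatchKMeans(n_clusters=int(target_n), random_state=random_state, n_init=3)
    km.fit(X_majority)
    # For every centroid, the index of the closest row in X_majority.
    nearest_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X_majority)
    # Centroids can share a nearest row, so deduplicate the selected indices.
    return X_majority[np.unique(nearest_idx)]

rng = np.random.RandomState(0)
X_maj_demo = rng.normal(size=(5000, 4))  # toy stand-in for the majority class
X_small = cbu_undersample_minibatch(X_maj_demo, target_n=50)
print('reduced majority shape:', X_small.shape)
```

Every returned row is an original majority sample, so no synthetic majority points are introduced.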


In [None]:

# Separate minority and majority in the training set
X_min = X_train[y_train==1]
X_maj = X_train[y_train==0]
print('Training minority:', X_min.shape, 'Training majority:', X_maj.shape)


In [None]:

# Select k for GMM using BIC and AIC (try 1..10)
ks = list(range(1, 11))
bics = []
aics = []
for k in ks:
    g = GaussianMixture(n_components=k, covariance_type='full', random_state=RANDOM_STATE)
    g.fit(X_min)
    bics.append(g.bic(X_min))
    aics.append(g.aic(X_min))

# Plot both criteria against k (lower is better); matplotlib was imported in the setup cell
plt.figure(figsize=(8,4))
plt.plot(ks, bics, marker='o', label='BIC')
plt.plot(ks, aics, marker='o', label='AIC')
plt.xlabel('n_components (k)')
plt.xticks(ks)
plt.ylabel('Information Criterion')
plt.legend()
plt.title('GMM: BIC/AIC vs n_components (on minority training set)')
plt.show()

best_k = ks[int(np.argmin(bics))]
print('Selected k (min BIC) =', best_k)


In [None]:

# Fit GMM with selected k and sample until matching majority count (full oversampling)
gmm = GaussianMixture(n_components=best_k, covariance_type='full', random_state=RANDOM_STATE)
gmm.fit(X_min)
n_min = len(X_min)
n_maj = len(X_maj)
n_synth = n_maj - n_min
print('Minority count:', n_min, 'Majority count:', n_maj, ' -> need synth:', n_synth)

if n_synth > 0:
    X_synth, comp_idx = gmm.sample(n_synth)
else:
    X_synth = np.empty((0, X_min.shape[1]))
print('Generated synthetic minority samples shape:', X_synth.shape)


In [None]:

# CBU: reduce majority by KMeans and pick representative sample per cluster
def cbu_undersample(X_majority, target_n, random_state=RANDOM_STATE):
    """Cluster majority into target_n clusters and pick one sample closest to each centroid."""
    if target_n >= len(X_majority):
        return X_majority.copy()
    k = int(target_n)
    km = KMeans(n_clusters=k, random_state=random_state, n_init='auto')
    labels = km.fit_predict(X_majority)
    centroids = km.cluster_centers_
    selected_idx = []
    for cluster_id in range(k):
        idxs = np.where(labels==cluster_id)[0]
        pts = X_majority[idxs]
        # choose sample closest to centroid
        dists = np.linalg.norm(pts - centroids[cluster_id], axis=1)
        chosen = idxs[np.argmin(dists)]
        selected_idx.append(chosen)
    return X_majority[selected_idx]

# Example: choose maj_target as 3 * minority (as a sensible compromise)
maj_target = min(len(X_maj), max(1, 3 * len(X_min)))
print('maj_target (for CBU):', maj_target)
X_maj_down = cbu_undersample(X_maj, maj_target)
print('Majority after CBU shape:', X_maj_down.shape)


In [None]:

# Fit a fresh GMM on minority (if needed) and sample to match the reduced majority
gmm_cbu = GaussianMixture(n_components=best_k, covariance_type='full', random_state=RANDOM_STATE)
gmm_cbu.fit(X_min)
n_target = len(X_maj_down) - len(X_min)
n_target = max(0, n_target)
print('For GMM+CBU variant we need to generate:', n_target, 'samples to match reduced majority (maj_target).')

if n_target > 0:
    X_synth_cbu, _ = gmm_cbu.sample(n_target)
else:
    X_synth_cbu = np.empty((0, X_min.shape[1]))

print('Synth (GMM+CBU) shape:', X_synth_cbu.shape)


In [None]:

# Version 1: GMM full oversampling (match original majority)
X_bal_gmm_full = np.vstack([X_maj, X_min, X_synth])
y_bal_gmm_full = np.hstack([np.zeros(len(X_maj), dtype=int), np.ones(len(X_min)+len(X_synth), dtype=int)])

# Version 2: CBU (downsample majority) + GMM synthetic to match reduced maj
X_bal_gmm_cbu = np.vstack([X_maj_down, X_min, X_synth_cbu])
y_bal_gmm_cbu = np.hstack([np.zeros(len(X_maj_down), dtype=int), np.ones(len(X_min)+len(X_synth_cbu), dtype=int)])

print('GMM full balanced counts:', Counter(y_bal_gmm_full))
print('GMM+CBU balanced counts:', Counter(y_bal_gmm_cbu))


In [None]:

# Simple SMOTE-like sampler (for a cluster): linear interpolation between a sample and one of its neighbors within cluster
def smote_synthetic_for_cluster(X_cluster, n_samples, k_neighbors=5, random_state=RANDOM_STATE):
    if len(X_cluster) == 0 or n_samples <= 0:
        return np.empty((0, X_cluster.shape[1]))
    rng = np.random.RandomState(random_state)
    if len(X_cluster) == 1:
        # a single point has no neighbours to interpolate with: duplicate it plus small noise
        return X_cluster[0] + 0.001 * rng.randn(n_samples, X_cluster.shape[1])
    # fit neighbours on the cluster; kneighbors() without X excludes each point itself,
    # so n_neighbors must stay below len(X_cluster)
    nbrs = NearestNeighbors(n_neighbors=min(k_neighbors, len(X_cluster) - 1)).fit(X_cluster)
    neigh = nbrs.kneighbors(return_distance=False)
    out = []
    for _ in range(n_samples):
        i = rng.randint(len(X_cluster))
        j = rng.choice(neigh[i])
        gap = rng.rand()
        # linear interpolation between the sample and the chosen neighbour
        sample = X_cluster[i] + gap * (X_cluster[j] - X_cluster[i])
        out.append(sample)
    return np.vstack(out)

def cbo_oversample(X_train, y_train, k_clusters=6, target_ratio=1.0, random_state=RANDOM_STATE):
    # cluster minority into k_clusters, then generate synthetic samples (SMOTE-like) distributed proportional to cluster sizes
    X_min_loc = X_train[y_train==1]
    X_maj_loc = X_train[y_train==0]
    minority_count = len(X_min_loc)
    majority_count = len(X_maj_loc)
    desired_minority = int(target_ratio * majority_count)
    to_generate = max(0, desired_minority - minority_count)
    if to_generate == 0:
        return X_train.copy(), y_train.copy()
    km = KMeans(n_clusters=min(k_clusters, max(1, len(X_min_loc))), random_state=random_state, n_init='auto')
    labels = km.fit_predict(X_min_loc)
    # compute per-cluster generation quotas proportional to cluster sizes
    cluster_sizes = np.array([np.sum(labels==i) for i in range(km.n_clusters)])
    props = cluster_sizes / cluster_sizes.sum()
    gens = (props * to_generate).astype(int)
    # adjust remainder
    remainder = to_generate - gens.sum()
    if remainder > 0:
        gens[:remainder] += 1
    synth_parts = []
    for ci in range(km.n_clusters):
        Xc = X_min_loc[labels==ci]
        if gens[ci] > 0:
            synth_c = smote_synthetic_for_cluster(Xc, gens[ci], random_state=random_state+ci)
            synth_parts.append(synth_c)
    if len(synth_parts) > 0:
        X_synth_total = np.vstack(synth_parts)
    else:
        X_synth_total = np.empty((0, X_min_loc.shape[1]))
    # assemble new training data (majority unchanged)
    X_new = np.vstack([X_maj_loc, X_min_loc, X_synth_total])
    y_new = np.hstack([np.zeros(len(X_maj_loc), dtype=int), np.ones(len(X_min_loc)+len(X_synth_total), dtype=int)])
    return X_new, y_new

# Apply CBO using k=6 (as used in A3); classic SMOTE is applied in the next cell for comparison
X_cbo, y_cbo = cbo_oversample(X_train, y_train, k_clusters=6, target_ratio=1.0, random_state=RANDOM_STATE)
print('CBO resampled counts:', Counter(y_cbo))


In [None]:

# SMOTE: oversample minority to match majority_count (classic SMOTE on entire train set)
sm = SMOTE(random_state=RANDOM_STATE)
X_sm, y_sm = sm.fit_resample(X_train, y_train)
print('SMOTE resampled counts:', Counter(y_sm))


In [None]:

# CBU-only variant: undersample majority to maj_target (no synthetic)
X_cbu_only = np.vstack([X_maj_down, X_min])
y_cbu_only = np.hstack([np.zeros(len(X_maj_down), dtype=int), np.ones(len(X_min), dtype=int)])
print('CBU-only counts:', Counter(y_cbu_only))


In [None]:

# Train and collect metrics for each method (evaluate on original imbalanced test set)
methods = {}
methods['Baseline'] = (X_train, y_train)
methods['SMOTE'] = (X_sm, y_sm)
methods['CBO'] = (X_cbo, y_cbo)
methods['CBU-only'] = (X_cbu_only, y_cbu_only)
methods['GMM-full'] = (X_bal_gmm_full, y_bal_gmm_full)
methods['GMM+CBU'] = (X_bal_gmm_cbu, y_bal_gmm_cbu)

metrics = {}
for name, (Xtr, ytr) in methods.items():
    print('\n---', name)
    metrics[name] = fit_eval_lr(Xtr, ytr, X_test, y_test, desc=name)

df_metrics = pd.DataFrame.from_dict(metrics, orient='index')
print('\nSummary table:')
display(df_metrics)
bar_compare(df_metrics)



## Part C — Conclusion and Recommendation

**Summary of findings (how to interpret results):**

- Compare Precision/Recall/F1 for the minority class across Baseline, SMOTE, CBO, CBU-only, GMM-full, and GMM+CBU.
- If GMM-based oversampling increases Recall substantially (detects more frauds) with acceptable Precision drop, it is likely beneficial.

**Recommendation template:**

- If GMM+CBU gives better Recall and F1 than Baseline and matches or beats SMOTE/CBO, recommend GMM+CBU for augmenting the training data in production.
- Otherwise, prefer simpler techniques (SMOTE) if they provide similar performance with less complexity.
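
The template can be written down as a small rule over a metrics table shaped like `df_metrics`. This is a sketch under stated assumptions: `recommend` and its `tol` tolerance are hypothetical, the comparison against simpler methods uses F1 only, and the numbers in `demo` are placeholders, not results from this notebook.

```python
import pandas as pd

def recommend(df_metrics, candidate='GMM+CBU', baseline='Baseline',
              simpler=('SMOTE', 'CBO'), tol=0.01):
    """Prefer `candidate` only if it beats the baseline on Recall and F1 and is
    within `tol` F1 of the best simpler method; otherwise return that simpler method."""
    cand, base = df_metrics.loc[candidate], df_metrics.loc[baseline]
    beats_baseline = (cand['Recall (1)'] > base['Recall (1)']) and (cand['F1 (1)'] > base['F1 (1)'])
    best_simple = df_metrics.loc[list(simpler), 'F1 (1)'].idxmax()
    matches_simple = cand['F1 (1)'] >= df_metrics.loc[best_simple, 'F1 (1)'] - tol
    return candidate if (beats_baseline and matches_simple) else best_simple

# Placeholder numbers for illustration only; they are NOT results from this notebook.
demo = pd.DataFrame(
    {'Precision (1)': [0.85, 0.50, 0.52, 0.55],
     'Recall (1)':    [0.40, 0.80, 0.80, 0.82],
     'F1 (1)':        [0.544, 0.615, 0.630, 0.658]},
    index=['Baseline', 'SMOTE', 'CBO', 'GMM+CBU'])
print('recommended method on placeholder table:', recommend(demo))
```

In the notebook, the call would be `recommend(df_metrics)` after the evaluation cell. The tolerance and the choice of F1 as the tie-breaker are judgment calls that should reflect the cost of missed frauds versus false alarms.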

---

**Notes & reproducibility:**
- This notebook is self-contained: it reimplements SMOTE-like cluster oversampling (CBO), classic SMOTE (via `imblearn`), CBU undersampling, and GMM sampling. It uses `RANDOM_STATE=42` for reproducibility.
- Place `creditcard.csv` in `Dataset/` or the notebook directory and run all cells sequentially to reproduce the analysis.
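
A quick way to confirm that the GMM step is reproducible is to fit and sample twice with the same seed and compare the draws. The sketch below uses toy data and a hypothetical helper, `fit_and_sample`; it assumes the same scikit-learn version and environment for both runs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Toy two-cluster data standing in for the minority class.
X_demo = np.vstack([rng.normal(-3, 1, size=(100, 2)),
                    rng.normal(3, 1, size=(100, 2))])

def fit_and_sample(seed, n_samples=50):
    """Fit a 2-component GMM and draw samples, both driven by the same seed."""
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(X_demo)
    samples, _ = gmm.sample(n_samples)
    return samples

run_a = fit_and_sample(seed=42)
run_b = fit_and_sample(seed=42)
print('same draws with the same seed:', np.allclose(run_a, run_b))
```

The same check can be applied to the real pipeline by re-running the GMM cells and comparing `X_synth` between runs.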
