# Oddball Assessment

In [None]:
from __future__ import annotations

from collections.abc import Iterator
import json
import itertools
from pprint import pprint
from typing import TYPE_CHECKING

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.cluster import KMeans, HDBSCAN    # pyright: ignore [reportAttributeAccessIssue]  HDBSCAN not recognized.
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline

from umap import UMAP

if TYPE_CHECKING:
    from sklearn.base import BaseEstimator

In [None]:
data_path = 'archive/CC GENERAL.csv'
# see https://www.kaggle.com/datasets/arjunbhasin2013/ccdata?resource=download for dataset description.
df = pd.read_csv(data_path)

## Data Exploration and Discussion

The data is very clean. Missing values occur only in CREDIT_LIMIT and MINIMUM_PAYMENTS: 1 and ~300 (of 8950 records), respectively.
These are easy to explain: a no-limit card and card-holders with no payment history to date.
There are some huge outliers.
The variables are not normally distributed; their histograms tend to pile up at the left or right edge.
A few have a bathtub shape.
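
The observations above (missing values, outliers, skew) can be checked numerically. The following is a minimal sketch on toy data, not on the credit-card data; the helper name `distribution_summary` and the 1.5 × IQR fence are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def distribution_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count, skewness, and share of rows outside the 1.5 * IQR fences."""
    numeric = frame.select_dtypes(include='number')
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against NaN are False, so missing values never count as outliers.
    outside = numeric.lt(q1 - 1.5 * iqr) | numeric.gt(q3 + 1.5 * iqr)
    return pd.DataFrame({
        'n_missing': numeric.isna().sum(),
        'skew': numeric.skew(),
        'outlier_share': outside.mean(),
    })


# Toy data (hypothetical): one right-skewed column and one roughly symmetric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'right_skewed': rng.lognormal(size=1000),
    'symmetric': rng.normal(size=1000),
})
toy.loc[:4, 'symmetric'] = np.nan  # label-based slice, so this blanks rows 0 through 4

summary = distribution_summary(toy)
print(summary)
```

On the real data, the same call would be `distribution_summary(df)`; a large positive `skew` corresponds to a histogram piled up on the left.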

CUST_ID : Identification of Credit Card holder (Categorical)

BALANCE : Balance amount left in their account to make purchases

BALANCE_FREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

PURCHASES : Amount of purchases made from account

ONEOFF_PURCHASES : Maximum purchase amount done in one-go

INSTALLMENTS_PURCHASES : Amount of purchase done in installment

CASH_ADVANCE : Cash in advance given by the user

PURCHASES_FREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)

ONEOFF_PURCHASES_FREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)

PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)

CASH_ADVANCE_FREQUENCY : How frequently the cash in advance is being paid

CASH_ADVANCE_TRX : Number of Transactions made with "Cash in Advance"

PURCHASES_TRX : Number of purchase transactions made

CREDIT_LIMIT : Limit of Credit Card for user

PAYMENTS : Amount of Payment done by user

MINIMUM_PAYMENTS : Minimum amount of payments made by user

PRC_FULL_PAYMENT : Percent of full payment paid by user

TENURE : Tenure of credit card service for user


In [None]:
print('Missing Value Counts')
pprint({key: value for key, value in df.isna().sum().items() if value > 0})
print("Duplicate Record Count")
print(len(df) - len(df.drop_duplicates()))

In [None]:
df.head()

In [None]:
df.describe()

## Evaluation Process

Per my preference, I define evaluation before modeling. This is because:
- It's a bad idea to create a model you don't know how to evaluate. You can end up crediting a model with whatever virtues you happen to find rather than the virtues you set out to achieve.
- It's a good idea to have a conversation about how a model will be assessed before you talk about its strengths and weaknesses. (Teams take responsibility for what counts as good, expressing requirements, etc.)
- It is intellectually honest about the quality of evaluation you can produce. (Clustering, for example, does not have an obvious right answer; the right answer depends on the application.)
- It encourages writing flexible evaluation functions, which makes model development and iteration faster in the long run. (Promotes code re-use.)


In [None]:
def simple_logs(new_data: dict | list | str, path: str):
    """Append new data to an existing json, or create one if it does not exist."""
    try:
        with open(path, 'r') as f:
            data = json.loads(f.read())
    except FileNotFoundError:
        data = []
    data.append(new_data)
    with open(path, 'w') as f:
        f.write(json.dumps(data, indent=4))

def evaluate_unsupervised_clustering(model: BaseEstimator, data: np.ndarray) -> dict:
    """
    Evaluate an unsupervised clustering model using common clustering metrics.

    Parameters
    ----------
    model : BaseEstimator
        A fitted or unfitted scikit-learn compatible clustering model (e.g., KMeans, HDBSCAN).
        The model must implement a `.fit_predict` method.

    data : np.ndarray
        The input data array of shape (n_samples, n_features) to cluster.

    Returns
    -------
    scores : dict
        A dictionary containing the following metrics:
        - 'n_labels': int
            The number of non-noise clusters found (ignores label -1).
        - 'noise_percent': float
            Proportion of data points labeled as noise (-1).
        - 'silhouette_score': float
            Silhouette score of the clustering (0 if fewer than 2 clusters).
        - 'calinski_harabasz_score': float
            Calinski-Harabasz index of the clustering (0 if fewer than 2 clusters).

    Notes
    -----
    If the model finds fewer than 2 clusters (excluding noise), the silhouette and
    Calinski-Harabasz scores will be set to 0, as these metrics are undefined in that case.
    """
    scores = {}
    clustered_data = model.fit_predict(data)
    labels = np.unique(clustered_data)
    scores['n_labels'] = labels.size - (-1 in labels)
    scores['noise_percent'] = np.sum((clustered_data == -1)) / clustered_data.size
    if scores['n_labels'] > 1:
        scores['silhouette_score'] = silhouette_score(data, clustered_data, metric='euclidean')
        scores['calinski_harabasz_score'] = calinski_harabasz_score(data, clustered_data)
    else:
        scores['silhouette_score'] = 0
        scores['calinski_harabasz_score'] = 0
    return scores


def test_evaluate_unsupervised_clustering():
    from sklearn.cluster import KMeans

    model = KMeans(n_clusters=3, random_state=123)
    data, *_ = make_blobs(random_state=567)

    result = evaluate_unsupervised_clustering(model, data)
    assert result['silhouette_score'] > 0
    assert result['calinski_harabasz_score'] > 0
    assert result['noise_percent'] >= 0 and result['noise_percent'] <= 1.
    assert result['n_labels'] == 3


test_evaluate_unsupervised_clustering()
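
One design choice is worth flagging: the function above passes the labels to `silhouette_score` unchanged, so HDBSCAN's noise label `-1` is scored as if it were one more cluster. The sketch below shows an alternative that masks noise points before scoring. The helper name `silhouette_without_noise` and the toy blobs are assumptions for illustration; this is not the scoring used for the results later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_without_noise(data: np.ndarray, labels: np.ndarray) -> float:
    """Silhouette score over non-noise points only; 0.0 if fewer than 2 clusters remain."""
    mask = labels != -1
    kept = labels[mask]
    if np.unique(kept).size < 2:
        return 0.0
    return float(silhouette_score(data[mask], kept, metric='euclidean'))


# Three well-separated toy blobs with fixed centers.
data, true_labels = make_blobs(
    n_samples=150, centers=[[0, 0], [10, 0], [0, 10]], cluster_std=0.5, random_state=0
)
labels = true_labels.copy()
labels[:10] = -1  # pretend the first ten points were flagged as noise

with_noise = float(silhouette_score(data, labels))      # -1 treated as a fourth cluster
without_noise = silhouette_without_noise(data, labels)  # noise points dropped first
print(with_noise, without_noise)
```

Which variant is appropriate depends on whether noise points should count against a model; reporting `noise_percent` alongside the masked score keeps both pieces of information visible.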

## Clustering Model

In [None]:
def clean_data(df: pd.DataFrame, selected_columns: list[str] | None = None) -> np.ndarray:
    """Select columns and impute missing values with sentinel constants."""
    # None (rather than a mutable [] default) or an empty list means "all columns except CUST_ID".
    if selected_columns:
        df = df[selected_columns]
    else:
        df = df.drop(columns=['CUST_ID'])
    data = ColumnTransformer(
        [step for step in [
            ('min_payment', SimpleImputer(strategy='constant', fill_value=-1.), ['MINIMUM_PAYMENTS']),
            ('credit_limit', SimpleImputer(strategy='constant', fill_value=200_000.), ['CREDIT_LIMIT']),
        ]
        # Keep an imputer only if its column survived the selection above.
        if not selected_columns or step[2][0] in selected_columns],
        remainder='passthrough'
    ).fit_transform(df)
    return data

def fit_and_evaluate_model(model: BaseEstimator, data_transform: Pipeline, df: pd.DataFrame = df):
    data = clean_data(df)
    transformed_data = data_transform.fit_transform(data)
    return evaluate_unsupervised_clustering(model, transformed_data)

def param_grid_to_parameters(param_grid: dict[str, list]) -> Iterator:
    parameters = list(param_grid)
    for param_options in itertools.product(*param_grid.values()):
        param_spec = {parameters[i]: param_value for i, param_value in enumerate(param_options)}
        yield param_spec

def tune(model_name: str, pipeline: Pipeline, df: pd.DataFrame, param_grid: dict):
    """Extract parameters from a parameters grid and trained the re-parameterized model on the data in df."""
    parameter_grid = param_grid[model_name]
    for parameters in param_grid_to_parameters(parameter_grid):
        pipeline.set_params(**parameters)
        data_transform = pipeline.named_steps['data_transform']
        cluster_model = pipeline.named_steps['clustering']
        scores = fit_and_evaluate_model(cluster_model, data_transform, df)
        scores['model_name'] = model_name
        results = {'parameters': parameters, 'scores': scores}
        simple_logs(results, 'model_scores.json')


In [None]:
# Define Data Preprocessing and Model Architecture
# This will be passed through a tuning process, so we can mostly use default values now.

data_transform_standard_pca = Pipeline([
    ('scale_data', StandardScaler()),
    ('PCA', PCA(n_components=4)),
])
data_transform_umap = Pipeline([
    ('scale_data', StandardScaler()),
    ('UMAP', UMAP(n_components=30)),
])
data_transform_robust_pca = Pipeline([
    ('scale_data', RobustScaler()),
    ('PCA', PCA(n_components=3)),
])

clustering_models: dict[str, Pipeline] = {
    'kmeans_model': Pipeline([
        ('data_transform', data_transform_standard_pca),
        ('clustering', KMeans(n_clusters=4))
    ]),
    'normalize_kmeans': Pipeline([
        ('data_transform', data_transform_standard_pca),
        ('clustering', KMeans(n_clusters=4))
    ]),
    'hdb_model': Pipeline([
        ('data_transform', data_transform_standard_pca),
        ('clustering', HDBSCAN())
    ]),
    'robust_hdb_model': Pipeline([
        ('data_transform', data_transform_robust_pca),
        ('clustering', HDBSCAN(min_cluster_size=50))
    ]),
    'umap_hdb_model': Pipeline([
        ('data_transform', data_transform_umap),
        ('clustering', HDBSCAN(min_cluster_size=50))
    ]),
}

In [None]:
# Define Hyperparameters

hdb_param_grid = {
    'data_transform__PCA__n_components': [2, 3, 4, 5, .95],
    'clustering__min_cluster_size': [5, 25, 40, 50, 60],
    'clustering__min_samples': [None, 3, 7, 20],
}
kmeans_param_grid = {
    'data_transform__PCA__n_components': [2, 3, 4, 5, .95],
    'clustering__n_clusters': [2, 3, 4, 5]
}
umap_param_grid = {
    'data_transform__UMAP__n_neighbors': [30],
    'data_transform__UMAP__n_components': [2, 3],
    'clustering__min_cluster_size': [5, 25, 40],
    'clustering__min_samples': [None, 3, 7, 20],
}

clustering_models_params = {
    'kmeans_model': kmeans_param_grid,
    'normalize_kmeans': kmeans_param_grid,
    'hdb_model': hdb_param_grid,
    'robust_hdb_model': hdb_param_grid,
    'umap_hdb_model': umap_param_grid
}
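
The grid keys above rely on scikit-learn's double-underscore convention for addressing parameters of nested steps. The sketch below, using a small stand-in pipeline rather than the ones defined in this notebook, shows how `set_params` resolves such names, that unknown names are rejected, and that two pipelines built around the same transform object share its parameters.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('data_transform', Pipeline([
        ('scale_data', StandardScaler()),
        ('PCA', PCA(n_components=4)),
    ])),
    ('clustering', KMeans(n_clusters=4)),
])

# '<step>__<nested step>__<parameter>' walks down the nesting.
pipeline.set_params(**{
    'data_transform__PCA__n_components': 2,
    'clustering__n_clusters': 3,
})
pca = pipeline.named_steps['data_transform'].named_steps['PCA']
kmeans = pipeline.named_steps['clustering']

# A name that does not match any parameter raises ValueError.
try:
    pipeline.set_params(clustering__no_such_param=1)
except ValueError:
    rejected = True
else:
    rejected = False

# Pipelines store the objects they are given, so a shared transform is shared state.
shared = Pipeline([('PCA', PCA(n_components=4))])
first = Pipeline([('data_transform', shared), ('clustering', KMeans())])
second = Pipeline([('data_transform', shared), ('clustering', KMeans())])
first.set_params(data_transform__PCA__n_components=2)
shared_value = second.named_steps['data_transform'].named_steps['PCA'].n_components
print(pca.n_components, kmeans.n_clusters, rejected, shared_value)
```

In this notebook several models reuse `data_transform_standard_pca`, so the last point applies; it is harmless as long as parameters are set immediately before each fit, which is what `tune` does.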


In [None]:
for model_name, pipeline in clustering_models.items():
    ...
    # UNCOMMENT TO RUN. Takes a few minutes.
    # tune(model_name, pipeline, df, clustering_models_params)
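
To get a sense of how much work the loop above does, the grids can be expanded and counted without fitting anything. This sketch copies the grids from the cell above and uses a hypothetical helper, `expand`, that mirrors `param_grid_to_parameters`.

```python
import itertools

hdb_param_grid = {
    'data_transform__PCA__n_components': [2, 3, 4, 5, .95],
    'clustering__min_cluster_size': [5, 25, 40, 50, 60],
    'clustering__min_samples': [None, 3, 7, 20],
}
kmeans_param_grid = {
    'data_transform__PCA__n_components': [2, 3, 4, 5, .95],
    'clustering__n_clusters': [2, 3, 4, 5],
}
umap_param_grid = {
    'data_transform__UMAP__n_neighbors': [30],
    'data_transform__UMAP__n_components': [2, 3],
    'clustering__min_cluster_size': [5, 25, 40],
    'clustering__min_samples': [None, 3, 7, 20],
}
clustering_models_params = {
    'kmeans_model': kmeans_param_grid,
    'normalize_kmeans': kmeans_param_grid,
    'hdb_model': hdb_param_grid,
    'robust_hdb_model': hdb_param_grid,
    'umap_hdb_model': umap_param_grid,
}


def expand(param_grid: dict[str, list]) -> list[dict]:
    """Return every combination of the grid's values as a list of parameter dicts."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in itertools.product(*param_grid.values())]


grid_sizes = {name: len(expand(grid)) for name, grid in clustering_models_params.items()}
total_fits = sum(grid_sizes.values())
print(grid_sizes)
print(total_fits)  # 20 + 20 + 100 + 100 + 24 = 264
```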


In [None]:
def quality_models() -> list[dict]:
    """
    Load and filter clustering model results based on quality criteria.

    This function reads model evaluation results from a JSON file (`model_scores.json`)
    and returns a list of models that meet specific quality thresholds. These thresholds
    are used to filter out low-quality or poorly performing clustering models.

    The filtering criteria are:
    - The number of clusters (`n_labels`) is between 2 and 7 (inclusive).
    - The proportion of noise points (`noise_percent`) is less than or equal to 0.25.
    - The silhouette score is at least 0.3.

    Returns
    -------
    list of dict
        A list of model evaluation dictionaries that pass the quality filter.
        Each dictionary contains a 'scores' field and other metadata.

    Notes
    -----
    The input file `model_scores.json` must exist in the current working directory
    and must contain a JSON array of model evaluation results, each with a 'scores'
    dictionary containing the keys:
    - 'n_labels'
    - 'noise_percent'
    - 'silhouette_score'
    """
    with open("model_scores.json", 'r') as f:
        tuning_data = json.loads(f.read())

    def is_acceptable(score_data: dict) -> bool:
        score = score_data['scores']
        return (
            score['n_labels'] >= 2
            and score['n_labels'] <= 7
            and score['noise_percent'] <= .25
            and score['silhouette_score'] >= .3
        )
    return [score_data for score_data in tuning_data if is_acceptable(score_data)]
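
The thresholds in `quality_models` are fixed inside the function. A possible variation, sketched here with a hypothetical helper `passes_quality_filter`, exposes them as keyword arguments so the filter can be tightened or relaxed without editing the function. The records below are made up purely to exercise each rule; they are not tuning results.

```python
def passes_quality_filter(
    scores: dict,
    *,
    min_labels: int = 2,
    max_labels: int = 7,
    max_noise: float = 0.25,
    min_silhouette: float = 0.3,
) -> bool:
    """Apply the same three checks as is_acceptable, with adjustable thresholds."""
    return (
        min_labels <= scores['n_labels'] <= max_labels
        and scores['noise_percent'] <= max_noise
        and scores['silhouette_score'] >= min_silhouette
    )


# Illustrative records only (hypothetical values).
records = [
    {'parameters': {}, 'scores': {'n_labels': 3, 'noise_percent': 0.04, 'silhouette_score': 0.66}},
    {'parameters': {}, 'scores': {'n_labels': 1, 'noise_percent': 0.00, 'silhouette_score': 0.00}},
    {'parameters': {}, 'scores': {'n_labels': 4, 'noise_percent': 0.40, 'silhouette_score': 0.55}},
    {'parameters': {}, 'scores': {'n_labels': 5, 'noise_percent': 0.10, 'silhouette_score': 0.10}},
]

kept = [record for record in records if passes_quality_filter(record['scores'])]
strict = [record for record in records if passes_quality_filter(record['scores'], min_silhouette=0.7)]
print(len(kept), len(strict))
```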


In [23]:
# Extract results to a dataframe. List best models with more than 2 labels.
acceptable_models = quality_models()
scores = pd.DataFrame.from_records([a['scores'] | a['parameters'] for a in acceptable_models])
scores.sort_values('silhouette_score', ascending=False).query("n_labels > 2").head()

| | `n_labels` | `noise_percent` | `silhouette_score` | `calinski_harabasz_score` | `model_name` | `data_transform__PCA__n_components` | `clustering__n_clusters` | `clustering__min_cluster_size` | `clustering__min_samples` |
|---|---|---|---|---|---|---|---|---|---|
| 29 | 3 | 0.039777 | 0.659379 | 1375.822506 | robust_hdb_model | 2.0 | | 25.0 | 3.0 |
| 24 | 4 | 0.016089 | 0.624716 | 439.578199 | hdb_model | 3.0 | | 5.0 | |
| 32 | 3 | 0.075866 | 0.569218 | 1158.880716 | robust_hdb_model | 2.0 | | 40.0 | 7.0 |
| 12 | 3 | 0.0 | 0.448912 | 5314.578994 | normalize_kmeans | 2.0 | 3.0 | | |
| 1 | 3 | 0.0 | 0.443876 | 5315.265851 | kmeans_model | 2.0 | 3.0 | | |


In [24]:
scores.sort_values('silhouette_score', ascending=False).query("n_labels == 2").head()

| | `n_labels` | `noise_percent` | `silhouette_score` | `calinski_harabasz_score` | `model_name` | `data_transform__PCA__n_components` | `clustering__n_clusters` | `clustering__min_cluster_size` | `clustering__min_samples` |
|---|---|---|---|---|---|---|---|---|---|
| 35 | 2 | 0.008827 | 0.822311 | 972.796604 | robust_hdb_model | 5.0 | | 5.0 | 7.0 |
| 25 | 2 | 0.003575 | 0.799203 | 591.894524 | hdb_model | 3.0 | | 5.0 | 7.0 |
| 27 | 2 | 0.004134 | 0.785955 | 515.49906 | hdb_model | 4.0 | | 5.0 | 7.0 |
| 28 | 2 | 0.004804 | 0.75528 | 369.457868 | hdb_model | 0.95 | | 5.0 | 7.0 |
| 33 | 2 | 0.027598 | 0.70318 | 1009.086536 | robust_hdb_model | 4.0 | | 5.0 | 20.0 |
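
The two views above can be condensed into one "best model per label count" view. The sketch below rebuilds a small frame from only the ten rows shown in the two outputs above (index, `n_labels`, `silhouette_score`, `model_name`) and picks the top silhouette score within each `n_labels` group; on the full `scores` frame the same `groupby` call would apply.

```python
import pandas as pd

# The ten rows displayed above, restricted to the columns needed here.
shown = pd.DataFrame(
    {
        'n_labels': [3, 4, 3, 3, 3, 2, 2, 2, 2, 2],
        'silhouette_score': [
            0.659379, 0.624716, 0.569218, 0.448912, 0.443876,
            0.822311, 0.799203, 0.785955, 0.75528, 0.70318,
        ],
        'model_name': [
            'robust_hdb_model', 'hdb_model', 'robust_hdb_model', 'normalize_kmeans', 'kmeans_model',
            'robust_hdb_model', 'hdb_model', 'hdb_model', 'hdb_model', 'robust_hdb_model',
        ],
    },
    index=[29, 24, 32, 12, 1, 35, 25, 27, 28, 33],
)

# Index label of the highest-scoring row within each label count.
best_index = shown.groupby('n_labels')['silhouette_score'].idxmax()
best = shown.loc[best_index]
print(best)
```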


## Assessment

### Four Good Models
Depending on our needs, there are four cluster analysis models which I recommend pursuing.
We can pursue them as a basis for further modeling or use them directly for our business needs.
- Model 29, an HDBSCAN model using a Robust Scaler, produced 3 labels and a silhouette score of 0.659.
- Model 24, an HDBSCAN model, produced 4 labels and a silhouette score of 0.625.
- Model 35, a robust HDBSCAN model, produced 2 labels and a silhouette score of 0.822.
- Model 25, an HDBSCAN model, produced 2 labels and a silhouette score of 0.799.

All of these models are low-noise (<5%).

All other models producing 2 labels were either similar to these in model character or significantly lower in score.
No models with more than 2 labels scored close to those named above.

#### Which of these 4 models to use?

The two 2-label models are likely to be similar in application.
Between these two, the higher-scoring model 35 should be used.
Both merit further exploration as a basis for improvement.
However, 2 is a small number of labels, and the score likely benefits from that simplicity.
Having more labels, as in models 29 and 24, may provide more useful information even if it is less certain.
Model 24, with 4 labels and a score nearly as high as that of model 29 (3 labels), is a standout.
_I would recommend model 24 as the first candidate for use and further exploration_.

Finally, we should explore ensemble usage. In particular, since the silhouette scores are high,
combining a 2-label model with a 3-label or 4-label model (creating up to 6 or 8 clusters, respectively) may
provide further insight.
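
One simple way to combine two labelings is to cross them: each distinct pair of labels becomes its own combined cluster. The sketch below is one possible reading of "ensemble" here, not an established method from this notebook; the helper `combine_labels`, the toy label arrays, and the choice to keep a point as noise if either model marks it `-1` are all assumptions.

```python
import numpy as np


def combine_labels(first, second) -> np.ndarray:
    """Cross two label arrays; a point that is noise (-1) in either input stays noise."""
    first = np.asarray(first)
    second = np.asarray(second)
    combined = np.full(first.shape, -1, dtype=int)
    keep = (first != -1) & (second != -1)
    if keep.any():
        # Encode each (first, second) pair as a single integer, then renumber from 0.
        width = second[keep].max() + 1
        codes = first[keep] * width + second[keep]
        _, inverse = np.unique(codes, return_inverse=True)
        combined[keep] = inverse
    return combined


# Toy labels (hypothetical): a 2-label model and a 3-label model over seven points.
two_label = np.array([0, 0, 0, 1, 1, 1, -1])
three_label = np.array([0, 1, 2, 0, 1, 2, 0])

combined = combine_labels(two_label, three_label)
n_combined = np.unique(combined[combined != -1]).size
print(combined, n_combined)  # [ 0  1  2  3  4  5 -1] 6
```

The product of the label counts is only an upper bound: pairs that never occur together produce no cluster, so a 2-label and a 3-label model can yield fewer than 6 combined clusters.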

### Process Assessment

We tuned over a wide range of hyperparameters to find the above models.
It is noteworthy that seemingly adjacent models in the parameter space sometimes diverged in scores.
I have encountered many cases with structured data models where this kind of hyperparameter tuning produced no benefit beyond chance differences.
That was not the result here.
Models 24 and 29 are standouts; there was no photo finish for these models.
Further investigation with a more granular hyperparameter space is strongly recommended.
Additionally, more careful consideration of hyperparameter effects is recommended.
These models fit this data quickly, and there is little concern that more exploration would be too computationally expensive.
Along these lines, investigation of DBSCAN and other clustering models is also recommended.
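
As a starting point for the DBSCAN follow-up, the sketch below sweeps `eps` on toy blobs and records the same kinds of metrics used earlier. It is illustrative only: the data, the `eps` values, and `min_samples=5` are assumptions, and suitable values for the credit-card data would need to be found by tuning on the transformed features.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated toy blobs with fixed centers.
data, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 0], [0, 10]], cluster_std=0.5, random_state=0
)

results = {}
for eps in (0.5, 1.0, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(data)
    n_clusters = int(np.unique(labels[labels != -1]).size)
    results[eps] = {
        'n_labels': n_clusters,
        'noise_percent': float(np.mean(labels == -1)),
        # Same convention as evaluate_unsupervised_clustering: 0 when fewer than 2 clusters.
        'silhouette_score': float(silhouette_score(data, labels)) if n_clusters > 1 else 0.0,
    }

for eps, scores in results.items():
    print(eps, scores)
```

Because `DBSCAN` implements `fit_predict`, an instance could presumably be placed in the `'clustering'` step of the pipelines above with a grid over `clustering__eps` and `clustering__min_samples`.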
