__<h1 style="text-align: center;font-size: 3rem">Model Exploration</h1><p style="text-align: center;font-size: 1.3rem">(Notebook IV)</p>__

## Imports

This notebook begins model exploration: classification models that predict whether a transaction is genuine or fraudulent. The models come primarily from _'Scikit-Learn'_.

In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    make_scorer,
    accuracy_score,
    roc_auc_score,
    precision_score,
    recall_score,
    f1_score,
)
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    GridSearchCV,
    train_test_split as tts,
    StratifiedKFold,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks
from imblearn.pipeline import Pipeline
from dotenv import load_dotenv
from os import getenv

from typing import NamedTuple
import pandas as pd
import numpy as np

## Setup

Loading the random state to be used throughout the notebook and project as a whole.

In [2]:
load_dotenv()

RANDOM_STATE = int(getenv("RANDOM_STATE", 0))
RANDOM_STATE

39105

`FeatureTarget` is a named tuple that holds the features (`X`) and the targets (`y`) together, so that the training and testing data are each organized in a single object.

In [3]:
class FeatureTarget(NamedTuple):
    X: pd.DataFrame
    y: pd.Series
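
A brief illustration of why a named tuple is convenient here: its fields can be read by name or unpacked positionally like any tuple. The `Split` class and the toy lists below are hypothetical stand-ins for `FeatureTarget` and the real DataFrame/Series, used only to keep the sketch dependency-free.

```python
from typing import NamedTuple


class Split(NamedTuple):
    """Hypothetical stand-in for FeatureTarget, using plain lists."""

    X: list[list[float]]
    y: list[int]


toy = Split(X=[[0.1, 2.0], [0.3, 1.5], [0.9, 0.2]], y=[0, 0, 1])

# Access by field name, in the same way as `train.X` and `train.y` ...
n_rows = len(toy.X)
n_fraud = sum(toy.y)

# ... or unpack positionally like any other tuple.
features, targets = toy
```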

Transactions are loaded from the minimally processed parquet file.

In [4]:
transactions: pd.DataFrame = pd.read_parquet(r"../data/processed/creditcard.parquet")

Separating the numeric features from the categorical features helps with transforming the data appropriately for the models. The features of this dataset are all continuous, so categorical transformations are not necessary.

In [5]:
numeric_feats = transactions.select_dtypes(
    include=["float64", "int64"]
).columns.tolist()

categorical_feats = transactions.select_dtypes(include=["object"]).columns.tolist()

The transactions are split into training and testing sets so that a model's performance can be observed on unseen data.

In [6]:
X_train, X_test, y_train, y_test = tts(
    transactions.drop(columns=["is_fraud"]),
    transactions["is_fraud"],
    test_size=0.3,
    random_state=RANDOM_STATE,
    stratify=transactions["is_fraud"],
)

Using the `FeatureTarget` named tuple, a `FeatureTarget` named `train` holds the features and targets of the training split, and another named `test` holds the features and targets of the testing split.

In [7]:
train: FeatureTarget = FeatureTarget(
    X=X_train,
    y=y_train,
)

train.X.shape, train.y.shape

((199364, 30), (199364,))

In [8]:
test: FeatureTarget = FeatureTarget(
    X=X_test,
    y=y_test,
)

test.X.shape, test.y.shape

((85443, 30), (85443,))

The fraud proportions of the training and testing sets are checked to ensure that the stratified split preserved the proportion found in the complete dataset. Both match the overall proportion to 3 significant figures.

In [9]:
def get_is_fraud_prop(y: pd.Series) -> float:
    # Fraction of fraudulent transactions: positives / non-null rows.
    return y.sum() / y.count()


print(f"Training Proportion: {get_is_fraud_prop(train.y):4%}")
print(f"Testing Proportion: {get_is_fraud_prop(test.y):4%}")
print(f"Overall Proportion: {get_is_fraud_prop(transactions['is_fraud']):4%}")

Training Proportion: 0.172549%
Testing Proportion: 0.173215%
Overall Proportion: 0.172749%
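
The printed proportions can be reproduced by hand from the class counts that appear in the `support` columns of the classification reports in this notebook. The sketch below is only that arithmetic; the variable names are illustrative. It also shows that the format spec `4%` sets a minimum width and keeps the default six decimal places, whereas `.3%` limits the output to three decimals.

```python
# Class counts taken from the "support" columns of the reports.
train_fraud, train_total = 344, 199_364
test_fraud, test_total = 148, 85_443

train_prop = train_fraud / train_total
test_prop = test_fraud / test_total
overall_prop = (train_fraud + test_fraud) / (train_total + test_total)

# "4%" is a minimum field width of 4 with the default precision of 6.
as_printed = [f"{p:4%}" for p in (train_prop, test_prop, overall_prop)]

# ".3%" keeps three decimals, i.e. 3 significant figures at this magnitude.
rounded = [f"{p:.3%}" for p in (train_prop, test_prop, overall_prop)]

print(as_printed)
print(rounded)
```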


In [10]:
def pretty_print_metrics(
    y_true: np.ndarray | pd.Series,
    y_pred: np.ndarray | pd.Series,
):
    cm_df = pd.DataFrame(
        confusion_matrix(y_true, y_pred),
        index=["Actual 0", "Actual 1"],
        columns=["Predicted 0", "Predicted 1"],
    )

    print(
        "Confusion Matrix",
        "----------------",
        cm_df,
        "\n",
        "Classification Report",
        "---------------------",
        classification_report(y_true, y_pred, digits=4),
        sep="\n",
    )

In [11]:
preproc = ColumnTransformer(
    [
        (
            "numeric",
            StandardScaler(),
            numeric_feats,
        )
    ]
)

## Baseline Model

Two baselines are used: a dummy classifier that ignores the features entirely and a simple logistic regression. Tuned models are evaluated alongside these baselines to ensure that they do better than both random selection and a simple model.

### Random baseline model using a Dummy Classifier

In [None]:
stratified_base = DummyClassifier(strategy="stratified", random_state=RANDOM_STATE)
stratified_base.fit(train.X, train.y)

| Parameter | Value |
| --- | --- |
| `strategy` | `'stratified'` |
| `random_state` | `39105` |
| `constant` | |


The random baseline model's performance on training data

In [None]:
y_pred = stratified_base.predict(train.X)
pretty_print_metrics(train.y, y_pred)

Confusion Matrix
----------------
          Predicted 0  Predicted 1
Actual 0       198694          326
Actual 1          344            0


Classification Report
---------------------
              precision    recall  f1-score   support

       False     0.9983    0.9984    0.9983    199020
        True     0.0000    0.0000    0.0000       344

    accuracy                         0.9966    199364
   macro avg     0.4991    0.4992    0.4992    199364
weighted avg     0.9965    0.9966    0.9966    199364



The random baseline model's performance on testing data

In [None]:
y_pred = stratified_base.predict(test.X)
pretty_print_metrics(test.y, y_pred)

Confusion Matrix
----------------
          Predicted 0  Predicted 1
Actual 0        85142          153
Actual 1          148            0


Classification Report
---------------------
              precision    recall  f1-score   support

       False     0.9983    0.9982    0.9982     85295
        True     0.0000    0.0000    0.0000       148

    accuracy                         0.9965     85443
   macro avg     0.4991    0.4991    0.4991     85443
weighted avg     0.9965    0.9965    0.9965     85443
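
The zero true positives above are roughly what a stratified dummy is expected to produce at this level of imbalance. As a sketch, assuming its predictions are drawn independently of the features at the training fraud rate `p`, the expected accuracy is `p**2 + (1 - p)**2` and each actual fraud is flagged only with probability `p`. The counts are taken from the training report; the variable names are illustrative.

```python
# Training-set class counts from the report above.
n_fraud, n_total = 344, 199_364
p = n_fraud / n_total  # fraud rate the stratified dummy samples from

# A prediction is correct when the label and the random guess agree:
# both fraud (p * p) or both genuine ((1 - p) * (1 - p)).
expected_accuracy = p**2 + (1 - p) ** 2

# Each actual fraud is flagged with probability p, so the expected
# number of true positives is well below one.
expected_true_positives = n_fraud * p

print(f"expected accuracy:       {expected_accuracy:.4f}")
print(f"expected true positives: {expected_true_positives:.2f}")
```

Under this assumption, the high accuracy of the random baseline comes almost entirely from the genuine class.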



In [32]:
majority_base = DummyClassifier(strategy="most_frequent", random_state=RANDOM_STATE)
majority_base.fit(train.X, train.y)

| Parameter | Value |
| --- | --- |
| `strategy` | `'most_frequent'` |
| `random_state` | `39105` |
| `constant` | |


In [34]:
y_pred = majority_base.predict(train.X)
pretty_print_metrics(train.y, y_pred)

Confusion Matrix
----------------
          Predicted 0  Predicted 1
Actual 0       199020            0
Actual 1          344            0


Classification Report
---------------------
              precision    recall  f1-score   support

       False     0.9983    1.0000    0.9991    199020
        True     0.0000    0.0000    0.0000       344

    accuracy                         0.9983    199364
   macro avg     0.4991    0.5000    0.4996    199364
weighted avg     0.9966    0.9983    0.9974    199364





In [35]:
y_pred = majority_base.predict(test.X)
pretty_print_metrics(test.y, y_pred)

Confusion Matrix
----------------
          Predicted 0  Predicted 1
Actual 0        85295            0
Actual 1          148            0


Classification Report
---------------------
              precision    recall  f1-score   support

       False     0.9983    1.0000    0.9991     85295
        True     0.0000    0.0000    0.0000       148

    accuracy                         0.9983     85443
   macro avg     0.4991    0.5000    0.4996     85443
weighted avg     0.9965    0.9983    0.9974     85443





### Simple baseline using a Logistic Regression

In [15]:
log_base = Pipeline(
    steps=[
        ("preprocessor", preproc),
        (
            "classifier",
            LogisticRegression(
                random_state=RANDOM_STATE,
                max_iter=1500,
                class_weight="balanced",
            ),
        ),
    ]
)

log_base.fit(train.X, train.y)

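`Pipeline`

| Parameter | Value |
| --- | --- |
| `steps` | `[('preprocessor', ...), ('classifier', ...)]` |
| `transform_input` | |
| `memory` | |
| `verbose` | `False` |

`ColumnTransformer`

| Parameter | Value |
| --- | --- |
| `transformers` | `[('numeric', ...)]` |
| `remainder` | `'drop'` |
| `sparse_threshold` | `0.3` |
| `n_jobs` | |
| `transformer_weights` | |
| `verbose` | `False` |
| `verbose_feature_names_out` | `True` |
| `force_int_remainder_cols` | `'deprecated'` |

`StandardScaler`

| Parameter | Value |
| --- | --- |
| `copy` | `True` |
| `with_mean` | `True` |
| `with_std` | `True` |

`LogisticRegression`

| Parameter | Value |
| --- | --- |
| `penalty` | `'l2'` |
| `dual` | `False` |
| `tol` | `0.0001` |
| `C` | `1.0` |
| `fit_intercept` | `True` |
| `intercept_scaling` | `1` |
| `class_weight` | `'balanced'` |
| `random_state` | `39105` |
| `solver` | `'lbfgs'` |
| `max_iter` | `1500` |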


The simple baseline model's performance on training data

In [16]:
train_preds = log_base.predict(train.X)
pretty_print_metrics(train.y, train_preds)

Confusion Matrix
----------------
          Predicted 0  Predicted 1
Actual 0       194820         4200
Actual 1           26          318


Classification Report
---------------------
              precision    recall  f1-score   support

       False     0.9999    0.9789    0.9893    199020
        True     0.0704    0.9244    0.1308       344

    accuracy                         0.9788    199364
   macro avg     0.5351    0.9517    0.5600    199364
weighted avg     0.9983    0.9788    0.9878    199364



The simple baseline model's performance on testing data

In [17]:
test_preds = log_base.predict(test.X)
pretty_print_metrics(test.y, test_preds)

Confusion Matrix
----------------
          Predicted 0  Predicted 1
Actual 0        83492         1803
Actual 1           18          130


Classification Report
---------------------
              precision    recall  f1-score   support

       False     0.9998    0.9789    0.9892     85295
        True     0.0673    0.8784    0.1249       148

    accuracy                         0.9787     85443
   macro avg     0.5335    0.9286    0.5571     85443
weighted avg     0.9982    0.9787    0.9877     85443



__Notes__

The model's accuracy is again extremely high, which is attributable to the extreme imbalance in the dataset. Accuracy is misleading in this scenario, so precision and recall are highlighted instead to understand how the model handles the imbalance and how sensitive it is to it.

It does perform better than the random baseline, indicating that there are discernible trends in the data. However, due to the imbalance, more care will have to be given to the sampling techniques used.
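
To make the point about accuracy concrete, the sketch below recomputes the metrics by hand from the testing confusion matrices reported above, for the logistic regression baseline and for the majority-class dummy. The helper `metrics_from_counts` is a hypothetical name introduced only for this illustration.

```python
def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Testing-set counts copied from the confusion matrices above.
log_reg = metrics_from_counts(tn=83_492, fp=1_803, fn=18, tp=130)
majority = metrics_from_counts(tn=85_295, fp=0, fn=148, tp=0)

for name, m in [("logistic regression", log_reg), ("majority dummy", majority)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

The majority dummy reaches a higher accuracy than the logistic regression while detecting no fraud at all, which is why accuracy is set aside in favour of precision and recall.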

## Resampling the Training Data

In [18]:
num_fraud = train.y.sum()

In [19]:
sm = SMOTE(
    sampling_strategy={1: num_fraud * 2},  # type: ignore (type checking)
    random_state=RANDOM_STATE,
    k_neighbors=3,
)
us = RandomUnderSampler(
    sampling_strategy=0.1,  # type: ignore (type checking)
    random_state=RANDOM_STATE,
)
nm = NearMiss(
    sampling_strategy=0.1,  # type: ignore (type checking)
    version=1,
)
tl = TomekLinks()
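
As a rough sketch of what these settings imply for class sizes, assume the whole training split is resampled at once (inside cross-validation each fold starts from fewer rows). A dictionary strategy gives SMOTE an absolute target count for the fraud class, and a float strategy for an under-sampler is the desired ratio of minority to majority samples after resampling. Only the arithmetic is shown here, not a call into imbalanced-learn, and the library's exact rounding may differ by one sample; the variable names are illustrative.

```python
n_fraud = 344        # fraud rows in the training split
n_genuine = 199_020  # genuine rows in the training split

# SMOTE with sampling_strategy={1: n_fraud * 2}: absolute target for class 1.
fraud_after_smote = n_fraud * 2

# Under-sampling with sampling_strategy=0.1: minority / majority == 0.1
# after resampling, so the majority class is cut to about minority / 0.1.
ratio = 0.1
genuine_after_undersampling = round(fraud_after_smote / ratio)

fraud_share = fraud_after_smote / (fraud_after_smote + genuine_after_undersampling)
print(fraud_after_smote, genuine_after_undersampling, f"{fraud_share:.2%}")
```

Under these assumptions the classifier would see roughly one fraud for every ten genuine transactions, instead of about one in 580 in the raw training split.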

In [20]:
rfe = RFE(
    DecisionTreeClassifier(
        random_state=RANDOM_STATE, max_depth=3, class_weight="balanced"
    ),
    n_features_to_select=10,
)

In [21]:
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

In [22]:
log = LogisticRegression(
    random_state=RANDOM_STATE,
    max_iter=1500,
    fit_intercept=True,
    class_weight="balanced",
)
rfc = RandomForestClassifier(
    random_state=RANDOM_STATE,
    class_weight="balanced",
)
dtc = DecisionTreeClassifier(
    random_state=RANDOM_STATE,
    class_weight="balanced",
)
svm = SVC(
    random_state=RANDOM_STATE,
    max_iter=-1,
    class_weight="balanced",
)
knn = KNeighborsClassifier()
sgd = SGDClassifier(
    random_state=RANDOM_STATE,
    class_weight="balanced",
    max_iter=5000,
)

In [23]:
non_classifier_params = {
    "under_sampler__sampling_strategy": [0.1],
    "under_sampler__n_neighbors": [3, 5, 7, 9],
}

In [24]:
knn_params = {
    "classifier__n_neighbors": [3, 5, 7, 9],
    "classifier__weights": ["uniform", "distance"],
    "classifier__algorithm": ["auto", "ball_tree", "kd_tree"],
}

In [25]:
log_params = {
    "classifier__C": [0.1, 1, 10],
    "classifier__solver": ["lbfgs", "liblinear"],
}

In [26]:
sgd_params = {
    "classifier__loss": ["hinge", "squared_error", "modified_huber"],
    "classifier__alpha": [0.001, 0.01, 0.1],
    "classifier__penalty": ["l2", "l1", "elasticnet"],
    "classifier__learning_rate": ["optimal", "adaptive", "constant"],
}

In [27]:
pipeline = Pipeline(
    steps=[
        ("preprocessor", preproc),
        ("over_sampler", sm),
        ("under_sampler", nm),
        ("feature_selector", rfe),
        ("classifier", knn),
    ],
)

In [None]:
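grid = GridSearchCV(
    estimator=pipeline,
    param_grid=knn_params | non_classifier_params,
    cv=cv,
    # Accuracy is misleading on data this imbalanced, so candidates are
    # ranked by F1 on the fraud class (assumed from the recorded output).
    scoring=make_scorer(f1_score),
    n_jobs=6,
)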
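
For a sense of the search cost, the grid size can be worked out from the dictionaries defined above. The `|` operator merges two dictionaries (Python 3.9+), and an exhaustive grid search fits every combination once per fold. This is a small sketch of that arithmetic with the values copied from the cells above, not a measurement of runtime.

```python
from math import prod

knn_params = {
    "classifier__n_neighbors": [3, 5, 7, 9],
    "classifier__weights": ["uniform", "distance"],
    "classifier__algorithm": ["auto", "ball_tree", "kd_tree"],
}
non_classifier_params = {
    "under_sampler__sampling_strategy": [0.1],
    "under_sampler__n_neighbors": [3, 5, 7, 9],
}

# Dict union: keys from both sides, the right side wins on clashes.
param_grid = knn_params | non_classifier_params

n_candidates = prod(len(values) for values in param_grid.values())
n_splits = 3  # matches StratifiedKFold(n_splits=3)
n_fits = n_candidates * n_splits  # plus one final refit on the full training split

print(n_candidates, n_fits)
```

Because every fit reruns the over-sampler, under-sampler and feature selector as well as the classifier, each extra value added to a parameter list multiplies the total work.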

In [29]:
grid.fit(
    train.X,
    train.y,
)

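`GridSearchCV`

| Parameter | Value |
| --- | --- |
| `estimator` | `Pipeline(step...lassifier())])` |
| `param_grid` | `{'classifier__algorithm': ['auto', 'ball_tree', ...], 'classifier__n_neighbors': [3, 5, ...], 'classifier__weights': ['uniform', 'distance'], 'under_sampler__n_neighbors': [3, 5, ...], ...}` |
| `scoring` | `make_scorer(f...hod='predict')` |
| `n_jobs` | `6` |
| `refit` | `True` |
| `cv` | `StratifiedKFo... shuffle=True)` |
| `verbose` | `0` |
| `pre_dispatch` | `'2*n_jobs'` |
| `error_score` | |
| `return_train_score` | `False` |

`ColumnTransformer`

| Parameter | Value |
| --- | --- |
| `transformers` | `[('numeric', ...)]` |
| `remainder` | `'drop'` |
| `sparse_threshold` | `0.3` |
| `n_jobs` | |
| `transformer_weights` | |
| `verbose` | `False` |
| `verbose_feature_names_out` | `True` |
| `force_int_remainder_cols` | `'deprecated'` |

`StandardScaler`

| Parameter | Value |
| --- | --- |
| `copy` | `True` |
| `with_mean` | `True` |
| `with_std` | `True` |

`SMOTE`

| Parameter | Value |
| --- | --- |
| `sampling_strategy` | `{1: np.int64(688)}` |
| `random_state` | `39105` |
| `k_neighbors` | `3` |

`NearMiss`

| Parameter | Value |
| --- | --- |
| `sampling_strategy` | `0.1` |
| `version` | `1` |
| `n_neighbors` | `3` |
| `n_neighbors_ver3` | `3` |
| `n_jobs` | |

`RFE`

| Parameter | Value |
| --- | --- |
| `estimator` | `DecisionTreeC...m_state=39105)` |
| `n_features_to_select` | `10` |
| `step` | `1` |
| `verbose` | `0` |
| `importance_getter` | `'auto'` |

`DecisionTreeClassifier`

| Parameter | Value |
| --- | --- |
| `criterion` | `'gini'` |
| `splitter` | `'best'` |
| `max_depth` | `3` |
| `min_samples_split` | `2` |
| `min_samples_leaf` | `1` |
| `min_weight_fraction_leaf` | `0.0` |
| `max_features` | |
| `random_state` | `39105` |
| `max_leaf_nodes` | |
| `min_impurity_decrease` | `0.0` |

`KNeighborsClassifier`

| Parameter | Value |
| --- | --- |
| `n_neighbors` | `9` |
| `weights` | `'uniform'` |
| `algorithm` | `'auto'` |
| `leaf_size` | `30` |
| `p` | `2` |
| `metric` | `'minkowski'` |
| `metric_params` | |
| `n_jobs` | |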


In [30]:
preds = grid.predict(train.X)
pretty_print_metrics(train.y, preds)

Confusion Matrix
----------------
          Predicted 0  Predicted 1
Actual 0       191987         7033
Actual 1           41          303


Classification Report
---------------------
              precision    recall  f1-score   support

       False     0.9998    0.9647    0.9819    199020
        True     0.0413    0.8808    0.0789       344

    accuracy                         0.9645    199364
   macro avg     0.5205    0.9227    0.5304    199364
weighted avg     0.9981    0.9645    0.9804    199364



In [31]:
preds = grid.predict(test.X)
pretty_print_metrics(test.y, preds)

Confusion Matrix
----------------
          Predicted 0  Predicted 1
Actual 0        82196         3099
Actual 1           20          128


Classification Report
---------------------
              precision    recall  f1-score   support

       False     0.9998    0.9637    0.9814     85295
        True     0.0397    0.8649    0.0759       148

    accuracy                         0.9635     85443
   macro avg     0.5197    0.9143    0.5286     85443
weighted avg     0.9981    0.9635    0.9798     85443

