# Introduction

This notebook is based on this [notebook](https://gist.github.com/jedrz/ab2b00dacec8c049a7a54a9ebc004867), which is part of the ["ML models inference in fraud detection"](https://nussknacker.io/blog/ml-models-inference-in-fraud-detection/) blog post, and uses the [Kaggle dataset](https://www.kaggle.com/datasets/neharoychoudhury/credit-card-fraud-data) containing credit card fraud data. The exploratory data analysis (EDA) and feature engineering for this dataset are omitted here - refer to the blog post for details on those topics.

The goal of this notebook is to present a short step-by-step guide that goes from training a simple ML model to deploying it to **Databricks Managed MLflow**.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

import pandas as pd
import sklearn.model_selection
import sklearn.tree
import sklearn.metrics
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.compose
import mlflow
import mlflow.models
import mlflow.types.schema
import imblearn.over_sampling

random_state = 42

# Dataset preparation

In this section, we clean up and prepare the [Kaggle dataset](https://www.kaggle.com/datasets/neharoychoudhury/credit-card-fraud-data) for model training. As mentioned above, we do not go into the details of this process and mostly reuse the code from this [notebook](https://gist.github.com/jedrz/ab2b00dacec8c049a7a54a9ebc004867).

In [2]:
data = pd.read_csv("fraud_data.csv")
data = data.drop('trans_num', axis='columns', errors='ignore')
data = data[(data['is_fraud'] == '0') | (data['is_fraud'] == '1')]
data = data.map(lambda x: x.strip('"') if isinstance(x, str) else x)
data['trans_date_trans_time'] = data['trans_date_trans_time'].apply(lambda x: pd.to_datetime(x, dayfirst=True))
data['dob'] = data['dob'].apply(lambda x: pd.to_datetime(x, dayfirst=True))
data = data.astype({
    'merchant': 'category',
    'category': 'category',
    'city': 'category',
    'state': 'category',
    'job': 'category',
    'is_fraud': 'int',
})
data = data.astype({'is_fraud': 'boolean',})

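Next, we split the prepared dataset into training and testing datasets. Since no split size is given, `train_test_split` uses its default and holds out 25% of the rows for testing: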

In [3]:
train_data, test_data = sklearn.model_selection.train_test_split(data, random_state=random_state)
train_data_input = train_data.drop('is_fraud', axis='columns')
test_data_input = test_data.drop('is_fraud', axis='columns')
train_data_output = train_data['is_fraud']
test_data_output = test_data['is_fraud']

Note that the training dataset is heavily imbalanced toward the negative class:

In [4]:
train_data_output.value_counts()

is_fraud
False    9456
True     1377
Name: count, dtype: Int64

Therefore, to improve the training process, we oversample the minority class in the training dataset:

In [5]:
over_sampler = imblearn.over_sampling.RandomOverSampler(random_state=random_state)
train_data_input, train_data_output = over_sampler.fit_resample(train_data_input, train_data_output)
train_data_output.value_counts()

is_fraud
False    9456
True     9456
Name: count, dtype: Int64
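
To make the effect of random oversampling concrete, here is a minimal pure-Python sketch of the idea: rows of the smaller class are drawn at random, with replacement, until both classes have the same size. It is a simplified stand-in for illustration, not the actual `RandomOverSampler` implementation; the helper `random_oversample` and the toy row ids are hypothetical, and only the class counts are taken from the output above.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until
    every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # All rows that belong to the current class.
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        # Draw the missing number of rows with replacement.
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data with the same class counts as the training dataset above.
toy_labels = [False] * 9456 + [True] * 1377
toy_rows = list(range(len(toy_labels)))  # hypothetical row ids

resampled_rows, resampled_labels = random_oversample(toy_rows, toy_labels)
print(Counter(resampled_labels))
```

Note that the added rows are exact copies of existing fraud samples, so oversampling rebalances the classes without adding new information.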

The last thing to notice is that the testing dataset is imbalanced as well. This, however, is not a problem - we just need to keep it in mind during model evaluation by using metric functions suited to imbalanced datasets (e.g. `balanced_accuracy_score`):

In [6]:
test_data_output.value_counts()

is_fraud
False    3144
True      467
Name: count, dtype: Int64
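
To see why plain accuracy is misleading on such a test set, consider a trivial classifier that always predicts "not fraud". The sketch below uses only the class counts shown above and hand-written metric helpers (`accuracy` and `balanced_accuracy` are illustrative names, not library functions). Balanced accuracy is computed here as the mean of per-class recall, which is how `balanced_accuracy_score` is defined.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Same class counts as the testing dataset above.
y_true = [False] * 3144 + [True] * 467
# A useless classifier that never flags fraud.
always_negative = [False] * len(y_true)

print(round(accuracy(y_true, always_negative), 4))  # 0.8707
print(balanced_accuracy(y_true, always_negative))   # 0.5
```

The classifier catches no fraud at all, yet plain accuracy still reports about 87%, while balanced accuracy drops to 0.5, the score of random guessing.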

# Model training and deployment to **Databricks Managed MLflow**

With the training and testing datasets prepared, we can train the ML model. Since the model is for demonstration purposes only, we pick a simple decision tree classifier whose hyperparameters are selected by 5-fold cross-validation over a small grid. The best model is then evaluated on the testing dataset, and the resulting performance metrics, the model's hyperparameters, and the model itself are all logged to **Databricks Managed MLflow**.
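
As a rough guide to the cost of this search, the sketch below counts how many hyperparameter combinations an exhaustive grid search has to evaluate. The `grid` dictionary is assumed to mirror the values used in the training cell of this section (without the `decisiontreeclassifier__` prefix), and the variable names are illustrative only.

```python
from itertools import product

# Assumed to mirror the hyperparameter grid of the training cell.
grid = {
    "criterion": ["gini", "entropy"],
    "min_impurity_decrease": [0.0, 0.05, 0.3],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
}
n_folds = 5

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 384 1920
```

With the default `refit=True`, `GridSearchCV` performs one more fit after the search, training the best candidate on the whole training dataset.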

In [7]:
model_name = "credit-card-fraud-classifier"

model = sklearn.pipeline.make_pipeline(
    sklearn.compose.make_column_transformer(
        (sklearn.preprocessing.OneHotEncoder(sparse_output=True, handle_unknown="ignore"), ['merchant', 'category', 'city', 'state', 'job']),
    ),
    sklearn.tree.DecisionTreeClassifier(random_state=random_state)
)

parameter_grid = {
    'decisiontreeclassifier__criterion': ['gini', 'entropy'],
    'decisiontreeclassifier__min_impurity_decrease': [0.0, 0.05, 0.3],
    'decisiontreeclassifier__max_depth': [None, 3, 5, 10],
    'decisiontreeclassifier__min_samples_split': [2, 5, 10, 20],
    'decisiontreeclassifier__min_samples_leaf': [1, 2, 5, 10],
}

mlflow.set_registry_uri("databricks")
with mlflow.start_run():
    grid_search = sklearn.model_selection.GridSearchCV(
        model, 
        parameter_grid,
        scoring='accuracy',
        cv=5,
        n_jobs=-1,
        verbose=1
    )

    # Select the best model from the given hyperparameter grid.
    grid_search.fit(train_data_input, train_data_output)

    best_params = grid_search.best_params_
    print("Best model parameters:", best_params)
    print("Best training accuracy:", grid_search.best_score_)

    best_model = grid_search.best_estimator_
    test_data_predicted = best_model.predict(test_data_input)

    # Note: we use balanced_accuracy_score and average='weighted' to account for the imbalanced testing dataset.
    accuracy = sklearn.metrics.balanced_accuracy_score(test_data_output, test_data_predicted)
    precision = sklearn.metrics.precision_score(test_data_output, test_data_predicted, average='weighted')
    recall = sklearn.metrics.recall_score(test_data_output, test_data_predicted, average='weighted')

    print(f"For parameters: {best_params}, accuracy: {accuracy}, precision: {precision} and recall: {recall} was achieved")
    
    # Log all the classifier hyperparameters.
    mlflow.log_param("criterion", best_params['decisiontreeclassifier__criterion'])
    mlflow.log_param("min_impurity_decrease", best_params['decisiontreeclassifier__min_impurity_decrease'])
    mlflow.log_param("max_depth", best_params['decisiontreeclassifier__max_depth'])
    mlflow.log_param("min_samples_split", best_params['decisiontreeclassifier__min_samples_split'])
    mlflow.log_param("min_samples_leaf", best_params['decisiontreeclassifier__min_samples_leaf'])
    
    # Log metrics of the trained classifier.
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    
    # Log the trained classifier itself.
    model_signature = mlflow.models.infer_signature(model_input=train_data_input.iloc[:1], model_output=test_data_predicted[:1])
    model_signature.outputs = mlflow.types.schema.Schema([mlflow.types.schema.ColSpec("double")])
    mlflow.sklearn.log_model(best_model, artifact_path=model_name, signature=model_signature, registered_model_name=model_name)

Fitting 5 folds for each of 384 candidates, totalling 1920 fits
Best model parameters: {'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__max_depth': None, 'decisiontreeclassifier__min_impurity_decrease': 0.0, 'decisiontreeclassifier__min_samples_leaf': 2, 'decisiontreeclassifier__min_samples_split': 2}
Best training accuracy: 0.9548432947474529
For parameters: {'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__max_depth': None, 'decisiontreeclassifier__min_impurity_decrease': 0.0, 'decisiontreeclassifier__min_samples_leaf': 2, 'decisiontreeclassifier__min_samples_split': 2}, accuracy: 0.6140478311565893, precision: 0.8369260917731867 and recall: 0.8565494322902243 was achieved
Successfully registered model 'credit-card-fraud-classifier'.
2025/03/20 13:47:52 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: credit-card-fraud-classifier, version 1
Created version '1' of mod