[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/fraud-classifier-sklearn.ipynb)


# Fraud classification using sklearn

This notebook illustrates how sklearn models can be uploaded to the Unbox platform.

## Importing the modules and loading the dataset

In [1]:
import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

We have stored a sample of the original dataset in the S3 bucket referenced below. If you get an error reading the CSV directly from it, copy and paste the URL into your browser, download the CSV file, and read it from your local copy instead. You can also find the full dataset [on Kaggle](https://www.kaggle.com/datasets/kartik2112/fraud-detection?select=fraudTrain.csv). The dataset in our example corresponds to the first 10,000 lines of the original Kaggle dataset.
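
The fallback described above can be sketched as a small helper. This is only an illustration: `load_dataset` and the local file name are hypothetical and not part of the original notebook, and the sketch assumes the CSV has already been downloaded by hand to `local_path`.

```python
import pandas as pd


def load_dataset(url, local_path):
    """Read the CSV from `url`; fall back to `local_path` if that fails.

    Hypothetical helper, not part of the original notebook.
    """
    try:
        return pd.read_csv(url)
    except OSError:
        # OSError covers network failures (urllib's URLError subclasses it)
        # as well as missing files. We assume the CSV was downloaded manually.
        return pd.read_csv(local_path)
```

With this helper, the loading cell could read `data = load_dataset(DATASET_URL, "fraudTrainSample.csv")`, where the second argument is wherever you saved the file.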

In [4]:
DATASET_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/fraudTrainSample.csv"

In [5]:
data = pd.read_csv(DATASET_URL)

In [6]:
# Relevant columns
feature_names = ['amt', 'cc_num', 'merchant', 'category','state','job']
label = ['is_fraud']

# Outputs
class_names = ["normal", "fraudulent"]

clean_raw_data = data[feature_names + label]

In [7]:
X = clean_raw_data.drop('is_fraud', axis=1)
y = clean_raw_data['is_fraud']


In [8]:
X.head()

Unnamed: 0,amt,cc_num,merchant,category,state,job
0,4.97,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,NC,"Psychologist, counselling"
1,107.23,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,WA,Special educational needs teacher
2,220.11,38859492057661,fraud_Lind-Buckridge,entertainment,ID,Nature conservation officer
3,45.0,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,MT,Patent attorney
4,41.96,375534208663984,fraud_Keeling-Crist,misc_pos,VA,Dance movement psychotherapist


## Pre-processing the data and splitting it into training and validation sets

In [16]:
def data_encode_one_hot(df, encoders):
    """ Encodes categorical features using one-hot encoding. """
    df = df.copy(True)
    df.reset_index(drop=True, inplace=True) # Causes NaNs otherwise
    enc_dfs = []
    for feature, enc in encoders.items():
        enc_df = pd.DataFrame(enc.transform(df[[feature]]).toarray(), columns=enc.get_feature_names_out([feature]))
        enc_dfs.append(enc_df)
    df = pd.concat([df] + enc_dfs, axis=1)
    df.drop(list(encoders.keys()), axis=1, inplace=True)
    return df

In [17]:
def create_encoder_dict(df, categorical_feature_names):
    """ Creates encoders for each of the categorical features. 
        The predict function will need these encoders. 
    """
    from sklearn.preprocessing import OneHotEncoder
    encoders = {}
    for feature in categorical_feature_names:
        enc = OneHotEncoder(handle_unknown='error')
        enc.fit(df[[feature]])
        encoders[feature] = enc
    return encoders

In [27]:
categorical_feature_names = ['cc_num', 'merchant', 'category', 'state', 'job']

In [28]:
encoders = create_encoder_dict(X, categorical_feature_names)

In [29]:
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 0)
x_train_one_hot = data_encode_one_hot(x_train, encoders)
x_val_one_hot = data_encode_one_hot(x_val, encoders)

x_val_one_hot

Unnamed: 0,amt,cc_num_60416207185,cc_num_60422928733,cc_num_60423098130,cc_num_60427851591,cc_num_60487002085,cc_num_60490596305,cc_num_60495593109,cc_num_501802953619,cc_num_501828204849,...,job_Video editor,job_Visual merchandiser,job_Volunteer coordinator,job_Warden/ranger,job_Waste management officer,job_Water engineer,job_Water quality scientist,job_Web designer,job_Wellsite geologist,job_Writer
0,1.86,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,59.99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,141.97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,74.69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9996,6.68,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9997,12.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9998,2.84,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Training and evaluating the model's performance

In [22]:
sklearn_model = GradientBoostingClassifier(random_state=1300)
sklearn_model.fit(x_train_one_hot, y_train)

In [23]:
print(classification_report(y_val, sklearn_model.predict(x_val_one_hot)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      9905
           1       0.75      0.60      0.67        95

    accuracy                           0.99     10000
   macro avg       0.87      0.80      0.83     10000
weighted avg       0.99      0.99      0.99     10000



## Unbox part!

### Instantiating the client

In [24]:
import unboxapi

client = unboxapi.UnboxClient("YOUR_API_KEY_HERE")

### Creating a project on the platform

In [25]:
project = client.create_project(name="Fraud classification", 
                                description="Evaluation of ML approaches to detect frauds")

Creating project on Unbox! Check out https://unbox.ai/projects to have a look!


### Uploading the validation set

In [31]:
# Add the ground truths to the original dataset for Unbox
x_val['is_fraud'] = y_val.values
x_train['is_fraud'] = y_train.values

In [32]:
from unboxapi.tasks import TaskType

dataset = project.add_dataframe(
    df=x_val.sample(1000),
    class_names=class_names,
    label_column_name='is_fraud',
    name="Fraud detection",
    description='this is my fraud dataset',
    task_type=TaskType.TabularClassification,
    feature_names=feature_names,
    categorical_feature_names=categorical_feature_names,
)

Uploading dataset to Unbox! Check out https://unbox.ai/datasets to have a look!


### Uploading the model

First, we need to define a `predict_proba` function, which is how Unbox interacts with your model. It receives the model and the raw input features, and returns the predicted class probabilities.

In [34]:
def predict_proba(model, input_features: np.ndarray, col_names, one_hot_encoder, encoders):
    """Convert the raw input_features into one-hot encoded features
    using our one hot encoder and each feature's encoder. """
    df = pd.DataFrame(input_features, columns=col_names)
    encoded_df = one_hot_encoder(df, encoders)
    return model.predict_proba(encoded_df.to_numpy())

Let's test the `predict_proba` function to make sure the input-output format is consistent with what Unbox expects:

In [35]:
# Test the predict function
predict_proba(sklearn_model, x_val[:3][feature_names].to_numpy(), feature_names, data_encode_one_hot, encoders)



array([[9.99496347e-01, 5.03653403e-04],
       [9.99496347e-01, 5.03653403e-04],
       [9.99496347e-01, 5.03653403e-04]])
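
The visual check above can also be written as a few assertions. The sketch below assumes that the expected format is an array with one row per input and one column per class, with each row summing to 1. That contract is inferred from the output shown here, not taken from the Unbox documentation, and `check_proba_output` is a hypothetical helper.

```python
import numpy as np


def check_proba_output(probs, n_rows, n_classes):
    """Sanity-check a predict_proba result against the assumed contract."""
    probs = np.asarray(probs)
    assert probs.shape == (n_rows, n_classes), f"unexpected shape {probs.shape}"
    assert np.all((probs >= 0) & (probs <= 1)), "probabilities must lie in [0, 1]"
    assert np.allclose(probs.sum(axis=1), 1.0), "each row must sum to 1"
    return True


# Values copied from the output above: 3 rows, 2 classes
example = np.array([[9.99496347e-01, 5.03653403e-04]] * 3)
check_proba_output(example, n_rows=3, n_classes=2)
```

In the notebook, the same check could be applied to the array returned by `predict_proba`, with `n_classes=len(class_names)`.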

Now, we can upload the model:

In [37]:
from unboxapi.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    task_type=TaskType.TabularClassification,
    class_names=class_names,
    name='Fraud detection',
    description='this is my fraud classification model',
    feature_names=feature_names,
    train_sample_df=x_train,
    train_sample_label_column_name='is_fraud',
    categorical_feature_names=categorical_feature_names,
    col_names=feature_names,
    one_hot_encoder=data_encode_one_hot,
    encoders=encoders,
)

Bundling model and artifacts...
Uploading model to Unbox! Check out https://unbox.ai/models to have a look!
