<a href="https://colab.research.google.com/github/project4sharing/pycaret_exp/blob/main/direct_pycaret_application_to_creditcard_fraud_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

This example demonstrates the direct application of the PyCaret framework to the Kaggle Credit Card Fraud dataset.


## Data - Data Sources
The Credit Card Fraud dataset is hosted on Kaggle and was contributed by Dhanush NaraYanan R.

https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud/data

This dataset is licensed under "CC0: Public Domain".

The data source is a single ~76 MB file containing 1 million samples with 8 columns (7 features plus the target):
1. distance_from_home - numeric - Distance from the credit card registration address
2. distance_from_last_transaction - numeric - Distance of the current transaction from the previous transaction on the same credit card
3. ratio_to_median_purchase_price - numeric - Ratio of the current charge to the median purchase price
4. repeat_retailer - categorical - whether the current charge was made at a frequently used store / retailer
5. used_chip - categorical - whether an IC chip was used to authorize the charge
6. used_pin_number - categorical - whether a PIN was used to authorize the charge
7. online_order - categorical - whether the current charge was made for an online purchase
8. fraud - categorical - whether the charge is fraudulent - this is the target variable that we would like to predict

At first glance, features #1, #2, #5, and #6 appear to describe "card present" purchases, i.e., the cardholder makes the purchase in person.
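One way to probe this intuition is to compare the fraud rate across the values of each binary flag. The sketch below is only an illustration: `fraud_rate_by_flag` is a hypothetical helper name, and the tiny `toy` frame is made up, not taken from the dataset. The same call could be applied to the full DataFrame once it is loaded.

```python
import pandas as pd


def fraud_rate_by_flag(df: pd.DataFrame, flag: str, target: str = "fraud") -> pd.Series:
    """Return the mean of the target (the fraud rate) for each value of a binary flag."""
    return df.groupby(flag)[target].mean()


# Tiny made-up sample, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "used_chip":       [1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "used_pin_number": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "fraud":           [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
})

for flag in ["used_chip", "used_pin_number"]:
    print(fraud_rate_by_flag(toy, flag))
```

A large gap between the two rates for a flag would suggest that the flag carries signal about fraud; similar rates would suggest it does not, at least on its own.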

In [4]:
# Acquire prerequisite packages
!pip install gdown
!pip install --upgrade seaborn
!pip install pycaret[full]



In [5]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

from imblearn.over_sampling import SMOTENC

from pycaret.classification import *

from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px

import warnings

import gdown

warnings.filterwarnings('ignore')

In [6]:
# To keep this notebook self-contained, the dataset has already been copied to my Google Drive.
# This avoids having to store my Kaggle key in the notebook to download the dataset on the fly.

gdown.download('https://drive.google.com/uc?id=1cq3EBN238kBUW4R0u4rDYlyv9HqvqsDo', './card_transdata.csv')

Downloading...
From: https://drive.google.com/uc?id=1cq3EBN238kBUW4R0u4rDYlyv9HqvqsDo
To: /home/jovyan/workspace/pycaret_exp/card_transdata.csv
100%|██████████| 76.3M/76.3M [00:06<00:00, 11.4MB/s]


'./card_transdata.csv'

In [7]:
df_credit_card_fraud_org = pd.read_csv('./card_transdata.csv', sep=',', header=0, index_col=False, engine='python')

In [8]:
df_credit_card_fraud_org.info()

numerical_features = [
    'distance_from_home',
    'distance_from_last_transaction',
    'ratio_to_median_purchase_price'
    ]
categorical_features = [
    'repeat_retailer',
    'used_chip',
    'used_pin_number',
    'online_order'
    ]
target_feature = 'fraud'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB
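
The reported memory usage follows from the shape and dtypes: 1,000,000 rows of 8 `float64` columns at 8 bytes each. The sketch below reproduces that figure and, purely as a hypothetical, estimates the footprint if the four flags and the target were stored as 1-byte integers. The notebook does not perform that cast, and the estimate ignores the index.

```python
rows = 1_000_000
float64_bytes = 8
int8_bytes = 1
mib = 2 ** 20  # pandas divides by 1024 per unit even though it labels the result "MB"

# As loaded: all 8 columns are float64.
current_mib = rows * 8 * float64_bytes / mib

# Hypothetical: 3 numeric columns stay float64, 4 flags + target become int8.
hypothetical_mib = rows * (3 * float64_bytes + 5 * int8_bytes) / mib

print(f"as loaded:    {current_mib:.1f} MB")
print(f"hypothetical: {hypothetical_mib:.1f} MB")
```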


In [9]:
# constants
random_seed = 12345

In [10]:
from pycaret.classification import ClassificationExperiment

s = ClassificationExperiment()
s.setup(
    data = df_credit_card_fraud_org,
    target = target_feature,
    numeric_features = numerical_features,
    categorical_features = categorical_features,
    fix_imbalance = True,
    pca = False,
    feature_selection = True,
    data_split_shuffle = True,
    data_split_stratify = True,
    n_jobs = -1,
    # log_experiment = True,
    experiment_name = "202407191340",
    session_id = random_seed)

[LightGBM] [Info] Number of positive: 638818, number of negative: 638818
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.039020 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1785
[LightGBM] [Info] Number of data points in the train set: 1277636, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


| | Description | Value |
|---|---|---|
| 0 | Session id | 12345 |
| 1 | Target | fraud |
| 2 | Target type | Binary |
| 3 | Original data shape | (1000000, 8) |
| 4 | Transformed data shape | (1577636, 2) |
| 5 | Transformed train set shape | (1277636, 2) |
| 6 | Transformed test set shape | (300000, 2) |
| 7 | Numeric features | 3 |
| 8 | Categorical features | 4 |
| 9 | Preprocess | True |


<pycaret.classification.oop.ClassificationExperiment at 0x7fe330655c90>
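
The shapes in the setup summary can be cross-checked with a little arithmetic. The sketch below uses only figures printed above and assumes that the oversampling step triggered by `fix_imbalance = True` leaves the majority class untouched and adds synthetic minority rows until both classes match, which is what the equal LightGBM positive and negative counts suggest. The 2 columns in the transformed shapes also hint that, counting the target, only one feature survived `feature_selection = True`; that reading is an assumption, not something the log states.

```python
# Figures copied from the setup log and summary table.
original_rows = 1_000_000
test_rows = 300_000                 # "Transformed test set shape"
balanced_train_rows = 1_277_636     # "Transformed train set shape"
majority_after_balancing = 638_818  # LightGBM "number of negative"

# Rows in the training split before resampling.
train_rows = original_rows - test_rows

# Assumption: the majority (non-fraud) class is not resampled,
# so the remainder of the training split is the fraud class.
minority_before = train_rows - majority_after_balancing

# Rows added to the training split by oversampling.
synthetic_rows = balanced_train_rows - train_rows

fraud_rate = minority_before / train_rows

print(train_rows, minority_before, synthetic_rows, round(fraud_rate, 4))
# prints: 700000 61182 577636 0.0874
```

Under these assumptions, roughly 8.7% of the training rows are fraudulent, and the original plus synthetic fraud rows add up to the 638,818 positives reported by LightGBM.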

In [11]:
# Compare the models built into PyCaret
best = s.compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| nb | Naive Bayes | 0.9046 | 0.5354 | 0.0902 | 0.3326 | 0.1418 | 0.1086 | 0.1352 | 10.332 |
| qda | Quadratic Discriminant Analysis | 0.9046 | 0.5354 | 0.0902 | 0.3326 | 0.1418 | 0.1086 | 0.1352 | 10.647 |
| svm | SVM - Linear Kernel | 0.8908 | 0.5287 | 0.0972 | 0.2418 | 0.1294 | 0.0855 | 0.1007 | 14.95 |
| ridge | Ridge Classifier | 0.8276 | 0.5363 | 0.1758 | 0.1327 | 0.1512 | 0.0574 | 0.0581 | 11.726 |
| lr | Logistic Regression | 0.8033 | 0.5363 | 0.2028 | 0.1225 | 0.1527 | 0.0491 | 0.0511 | 13.071 |
| ada | Ada Boost Classifier | 0.7078 | 0.5385 | 0.3089 | 0.1064 | 0.1558 | 0.0309 | 0.0368 | 18.288 |
| knn | K Neighbors Classifier | 0.5318 | 0.5322 | 0.5043 | 0.094 | 0.1584 | 0.013 | 0.0219 | 14.116 |
| dt | Decision Tree Classifier | 0.5271 | 0.5173 | 0.5052 | 0.0932 | 0.1574 | 0.0115 | 0.0195 | 19.491 |


Processing:   0%|          | 0/69 [00:00<?, ?it/s]

KeyboardInterrupt: 
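
To put the Accuracy column in context, it helps to compare it with a trivial baseline that always predicts "not fraud". The sketch below derives that baseline from the setup log (1,000,000 rows minus 300,000 test rows gives 700,000 training rows, of which 638,818 are non-fraud) and checks it against the accuracies in the partial comparison table. It assumes that the oversampling left the majority class unchanged and that the validation folds keep the original class ratio; neither is stated explicitly in the output.

```python
# Class counts derived from the setup log (see the assumptions above).
train_rows = 700_000
non_fraud_rows = 638_818

# Accuracy of a model that always predicts "not fraud".
baseline_accuracy = non_fraud_rows / train_rows

# Accuracy column copied from the comparison table.
reported_accuracy = {
    "nb": 0.9046, "qda": 0.9046, "svm": 0.8908, "ridge": 0.8276,
    "lr": 0.8033, "ada": 0.7078, "knn": 0.5318, "dt": 0.5271,
}

beats_baseline = [name for name, acc in reported_accuracy.items()
                  if acc > baseline_accuracy]

print(round(baseline_accuracy, 4), beats_baseline)
# prints: 0.9126 []
```

Under these assumptions, none of the models evaluated before the run was interrupted beats the trivial baseline on accuracy, which is consistent with AUC values close to 0.5. On a dataset this imbalanced, Recall, Precision, F1, and AUC are more informative than Accuracy.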