# ML for detecting credit card transaction fraud

In this notebook I try to optimize ML algorithms for identifying credit card transaction fraud. This problem was proposed by Professor Leandro Maciel from FEA-USP (Faculdade de Economia e Administração da Universidade de São Paulo). He provided a dataset of credit card transactions, each labeled as fraudulent or not.

The main problem to overcome is the unbalanced dataset. The vast majority of records (99.78%) are not fraudulent, i.e. Class = 0, and using the data as is can bias the algorithm toward Class = 0. We tried two techniques for dealing with this, SMOTE and undersampling, and fitted two types of algorithm, Logistic Regression (LogReg) and Random Forest (RF). As recommended for unbalanced datasets, we used the AUC and AUPRC metrics to evaluate model performance.

Undersampling seemed to be the best approach for dealing with this unbalanced dataset and, since LogReg and RF performed similarly in terms of AUC and AUPRC, we chose LogReg as our final model, mainly because of its simplicity.

Professor Leandro provided the final test data but kept its Class labels secret. We then ran the selected model on the test dataset, and this classification was the final product of the project. Professor Leandro received the classifications from all students' algorithms and compared them in terms of the AUC statistic. Our model had the third-best AUC value on the test dataset.

In [1]:
import pandas as pd
import numpy as np
# import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import time

In [2]:
df = pd.read_excel('treino.xlsx')

In [3]:
from imblearn.over_sampling import SMOTE

In [4]:
df.columns

Index(['id', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'Class'],
      dtype='object')

Since the meaning of the columns is not specified, we won't attempt feature selection.

In [5]:
df.isna().sum()

id       0
V1       0
V2       0
V3       0
V4       0
V5       0
V6       0
V7       0
V8       0
V9       0
V10      0
V11      0
V12      0
V13      0
V14      0
V15      0
V16      0
V17      0
V18      0
V19      0
V20      0
V21      0
Class    0
dtype: int64

The dataset does not contain any NaN values.

In [6]:
df.describe()

statistic,id,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V13,V14,V15,V16,V17,V18,V19,V20,V21,Class
count,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,...,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0,164231.0
mean,82116.0,0.020816,0.029036,0.27489,0.023903,-0.063158,0.087255,-0.061199,0.081759,-0.048293,...,-0.01187,0.008342,0.050925,-0.005003,0.011869,-0.002755,-0.00182,0.00529,-0.012604,0.002149
std,47409.550367,0.709996,0.592437,0.580047,0.63171,0.547314,0.61586,0.41405,0.371658,0.520265,...,0.503129,0.387392,0.450866,0.39518,0.340751,0.395404,0.376164,0.226855,0.219044,0.046312
min,1.0,-14.903862,-19.75852,-9.861436,-2.631825,-18.795629,-8.178917,-10.938095,-14.285038,-4.378475,...,-2.186889,-6.544954,-2.077864,-5.869785,-9.552718,-2.992795,-2.140814,-9.339533,-7.34481,0.0
25%,41058.5,-0.445887,-0.296836,-0.048311,-0.398327,-0.422133,-0.311252,-0.326514,-0.04678,-0.369624,...,-0.341358,-0.178984,-0.24197,-0.242971,-0.217476,-0.237657,-0.234916,-0.087263,-0.09183,0.0
50%,82116.0,-0.02629,0.023532,0.359776,0.018997,-0.087875,-0.027857,-0.048578,0.069393,-0.087079,...,-0.021518,0.007464,0.085418,0.015863,-0.01793,-0.004414,0.002972,-0.017977,-0.01841,0.0
75%,123173.5,0.590285,0.42275,0.651813,0.456228,0.209398,0.269618,0.20148,0.217418,0.262494,...,0.318814,0.206405,0.398215,0.257137,0.200453,0.241715,0.236165,0.06787,0.060383,0.0
max,164231.0,1.215247,8.034236,3.072789,6.134471,11.412148,8.116984,19.912049,9.135293,6.711957,...,1.996423,3.03743,2.346662,2.917496,3.663483,1.941749,2.045487,6.792875,11.031473,1.0


In [7]:
df = df.set_index('id')

In [8]:
df.head()

id,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V13,V14,V15,V16,V17,V18,V19,V20,V21,Class
1,-0.299468,0.533339,0.592928,0.094916,0.015414,0.019142,0.045814,0.17743,-0.040845,-0.201868,...,-0.006907,0.082039,0.783691,0.046401,-0.1819,0.201659,0.186339,0.052175,0.081669,0
2,0.529493,-0.094837,0.341711,0.568313,-0.318345,0.201856,-0.302414,0.20824,0.290601,-0.005605,...,-0.934574,0.248362,0.231307,-0.12139,0.045858,-0.088862,-0.320489,-0.132945,0.051778,0
3,0.60563,-0.024632,-0.042535,-0.023267,-0.135464,-0.141456,0.093773,-0.093751,-0.586908,0.282549,...,0.681119,-0.082467,-0.037362,-0.923501,0.024895,0.135814,-0.367899,-0.084424,-0.273891,0
4,-0.346173,0.647783,0.473604,-0.165712,0.12778,-0.221239,0.295904,0.076857,-0.220782,-0.110628,...,0.560604,0.001884,0.409169,0.23617,-0.362497,-0.085438,0.210089,0.112295,-0.136954,0
5,0.24005,-0.688908,0.525568,0.434648,-0.371745,0.799044,-0.332958,0.227423,0.783561,-0.423634,...,0.561863,-0.483898,0.041915,-0.61644,0.548586,-0.956886,-0.379454,0.235959,0.035924,0


Class = 0 marks licit transactions and Class = 1 illicit ones. The main problem with this dataset is its unbalanced classes. Let's check the proportions.

In [9]:
(len(df) - len(df[df['Class'] == 1]))/len(df)

0.997850588500344

99.78% of the dataset has Class = 0. Training a model on such an unbalanced dataset can lead to bias toward the majority class, poor generalization, and overfitting.
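
To make the risk concrete, here is a small sketch (not part of the original analysis) of a model that always predicts Class = 0. The count of 353 fraud rows is an assumption, inferred from the fraction printed above and the 164,231 rows reported by `df.describe()`.

```python
# Assumed counts: 164,231 rows in total, of which 353 are fraud (inferred, not read from the file)
n_total = 164_231
n_fraud = 353

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_always_licit = [0] * n_total  # a "model" that never flags fraud

# Accuracy: share of rows where the prediction matches the label
accuracy = sum(t == p for t, p in zip(y_true, y_always_licit)) / n_total
# Recall: share of fraud rows that were actually flagged
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_always_licit)) / n_fraud

print(f"accuracy = {accuracy:.4f}, fraud recall = {recall:.4f}")
```

Accuracy stays close to 99.8% even though no fraud is caught, which is why accuracy is not used as a metric in this notebook.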

In [10]:
# Split dataset into features and labels
labels = df.Class
features = df.drop('Class', axis =1) # axis=1 is used to drop column

In [11]:
features.head()

id,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21
1,-0.299468,0.533339,0.592928,0.094916,0.015414,0.019142,0.045814,0.17743,-0.040845,-0.201868,...,-0.165179,-0.006907,0.082039,0.783691,0.046401,-0.1819,0.201659,0.186339,0.052175,0.081669
2,0.529493,-0.094837,0.341711,0.568313,-0.318345,0.201856,-0.302414,0.20824,0.290601,-0.005605,...,0.176963,-0.934574,0.248362,0.231307,-0.12139,0.045858,-0.088862,-0.320489,-0.132945,0.051778
3,0.60563,-0.024632,-0.042535,-0.023267,-0.135464,-0.141456,0.093773,-0.093751,-0.586908,0.282549,...,0.354953,0.681119,-0.082467,-0.037362,-0.923501,0.024895,0.135814,-0.367899,-0.084424,-0.273891
4,-0.346173,0.647783,0.473604,-0.165712,0.12778,-0.221239,0.295904,0.076857,-0.220782,-0.110628,...,0.101733,0.560604,0.001884,0.409169,0.23617,-0.362497,-0.085438,0.210089,0.112295,-0.136954
5,0.24005,-0.688908,0.525568,0.434648,-0.371745,0.799044,-0.332958,0.227423,0.783561,-0.423634,...,0.81657,0.561863,-0.483898,0.041915,-0.61644,0.548586,-0.956886,-0.379454,0.235959,0.035924


# SMOTE

In [12]:
X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size=0.25, random_state = 5)

In [13]:
train_class_distribution = y_train.value_counts(normalize=True)
print("Training set class distribution:")
print(train_class_distribution)

Training set class distribution:
Class
0    0.997849
1    0.002151
Name: proportion, dtype: float64


In [14]:
val_class_distribution = y_val.value_counts(normalize=True)
print("\nValidating set class distribution:")
print(val_class_distribution)


Validating set class distribution:
Class
0    0.997857
1    0.002143
Name: proportion, dtype: float64


The training and validation sets have similar class distributions.
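
The split in `In [12]` is purely random, so the proportions match only approximately. If an exact match is wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up labels (the arrays below are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative labels: 20 positives among 1,000 rows (2% positive share)
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single-feature matrix

# stratify=y_demo keeps the positive share the same in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=5
)

print("train positive share:", y_tr.mean())
print("validation positive share:", y_va.mean())
```

With very few fraud cases, stratifying also guards against a validation set that happens to receive almost none of them.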

In [15]:
y_val[y_val == 1]

id
34129     1
95194     1
24066     1
139050    1
139596    1
         ..
15256     1
141980    1
148882    1
9595      1
32185     1
Name: Class, Length: 88, dtype: int64

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# create the classifier
clf = LogisticRegression(random_state=0, max_iter=10000).fit(X_train, y_train)

In [17]:
y_pred = clf.predict(X_val)

In [18]:
confusion_matrix(y_val, y_pred)

array([[40965,     5],
       [   86,     2]])

In [19]:
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score

# Note: y_pred holds hard 0/1 labels, so both metrics are evaluated at a single threshold
auprc = average_precision_score(y_val, y_pred)
auc = roc_auc_score(y_val, y_pred)

print("AUPRC:", auprc)
print("AUC:", auc)

AUPRC: 0.008588104379423976
AUC: 0.5113026161049105
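
Because `y_pred` contains hard 0/1 labels rather than probabilities, both scores above can be reproduced directly from the confusion matrix: the AUC reduces to the mean of recall and specificity, and average precision reduces to a two-step sum. The helper below is our own illustration (`hard_label_metrics` is a hypothetical name), fed with the counts from `In [18]`.

```python
def hard_label_metrics(tn, fp, fn, tp):
    """Reproduce roc_auc_score and average_precision_score for 0/1 predictions."""
    recall = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    prevalence = (tp + fn) / (tn + fp + fn + tp)

    # ROC curve with one interior point: trapezoid area = mean of TPR and TNR
    auc = (recall + specificity) / 2
    # Step-wise average precision over the two thresholds (predict 1, then predict all)
    ap = recall * precision + (1 - recall) * prevalence
    return auc, ap

# Counts taken from the confusion matrix printed in In [18]
auc_check, ap_check = hard_label_metrics(tn=40965, fp=5, fn=86, tp=2)
print("AUC:", auc_check)
print("AUPRC:", ap_check)
```

The same arithmetic applies to any confusion matrix in this notebook, which makes it a quick sanity check on the printed scores.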


In [20]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 101)
X_resampled, y_resampled = smote.fit_resample(X_train,y_train)

In [21]:
train_class_distribution = y_resampled.value_counts(normalize=True)
print("Training set class distribution:")
print(train_class_distribution)                              

Training set class distribution:
Class
0    0.5
1    0.5
Name: proportion, dtype: float64
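
For intuition on how SMOTE reaches this 50/50 balance: it creates synthetic minority rows by interpolating between an existing minority row and one of its minority-class neighbours. The sketch below is a deliberately simplified version using only the single nearest neighbour (imblearn's `SMOTE` picks among `k_neighbors`, 5 by default); the function name and the toy points are our own.

```python
import math
import random


def smote_like_sample(minority, rng):
    """Return one synthetic point on the segment between a random
    minority point and its nearest minority neighbour."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    nearest = min(others, key=lambda p: math.dist(base, p))
    gap = rng.random()  # position along the segment, in [0, 1)
    synthetic = tuple(b + gap * (n - b) for b, n in zip(base, nearest))
    return synthetic, base, nearest


rng = random.Random(101)
# Toy 2-D minority points (illustrative values only)
minority_points = [(0.0, 0.0), (1.0, 0.5), (4.0, 4.0), (5.0, 3.5)]
synthetic, base, nearest = smote_like_sample(minority_points, rng)
print("base:", base, "neighbour:", nearest, "synthetic:", synthetic)
```

Every synthetic row lies between two real fraud rows, so SMOTE adds no information beyond what the original minority rows already contain.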


In [22]:
# Second attempt: train on the SMOTE-resampled data (X_resampled, y_resampled)
clf_2 = LogisticRegression(random_state=0, max_iter=10000).fit(X_resampled, y_resampled)
y_pred_2 = clf_2.predict(X_val)

In [23]:
confusion_matrix(y_val, y_pred_2)

array([[31316,  9654],
       [   25,    63]])

In [24]:
auprc_2 = average_precision_score(y_val, y_pred_2)
auc_2 = roc_auc_score(y_val, y_pred_2)

In [25]:
print("AUPRC:", auprc)
print("AUC:", auc)

print("AUPRC 2:", auprc_2)
print("AUC 2:", auc_2)

AUPRC: 0.008588104379423976
AUC: 0.5113026161049105
AUPRC 2: 0.005250478837115914
AUC 2: 0.7401366299065835
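
All scores so far are computed from `predict`, i.e. from labels cut at the default 0.5 threshold. An alternative worth knowing is to score the predicted probabilities, which evaluates the whole ranking instead of one threshold. A sketch on synthetic stand-in data (generated with `make_classification`, not the notebook's dataset; all variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 3% positives)
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_tr_syn, X_va_syn, y_tr_syn, y_va_syn = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr_syn, y_tr_syn)
hard_labels = model.predict(X_va_syn)               # 0/1 labels
fraud_scores = model.predict_proba(X_va_syn)[:, 1]  # probability of class 1

print("AUC from labels:       ", roc_auc_score(y_va_syn, hard_labels))
print("AUC from probabilities:", roc_auc_score(y_va_syn, fraud_scores))
print("AUPRC from labels:       ", average_precision_score(y_va_syn, hard_labels))
print("AUPRC from probabilities:", average_precision_score(y_va_syn, fraud_scores))
```

The two variants generally give different numbers, so scores should only be compared when they were computed the same way.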


# Random Forest

In [26]:
# Using Grid Search to find the best parameters
param_grid = {
    'n_estimators': [50],
    'max_depth' : [None,5,10],
    'class_weight': [None, 'balanced', {0: 1, 1: 10}]
}

In [27]:
rf_models = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=5, verbose=1,scoring='roc_auc')
rf_models.fit(X_resampled, y_resampled)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
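
The "9 candidates, 45 fits" in this message follows from the grid itself: one value for `n_estimators`, three for `max_depth` and three for `class_weight`, each combination fitted once per fold. A short sketch that enumerates the same combinations with `itertools.product` (an illustration of the counting, not of how `GridSearchCV` is implemented internally):

```python
from itertools import product

param_grid = {
    "n_estimators": [50],
    "max_depth": [None, 5, 10],
    "class_weight": [None, "balanced", {0: 1, 1: 10}],
}
cv_folds = 5

# Cartesian product of all parameter values: one dict per candidate
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on all the data passed to `fit`, which is the model used by `predict` afterwards.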


In [28]:
predictions = rf_models.predict(X_val)

In [29]:
confusion_matrix(y_val, predictions)

array([[40923,    47],
       [   88,     0]])

In [30]:
auprc_3 = average_precision_score(y_val,predictions)
auc_3 = roc_auc_score(y_val, predictions)

In [31]:
print("AUPRC:", auprc)
print("AUC:", auc)

print("AUPRC 2:", auprc_2)
print("AUC 2:", auc_2)

print("AUPRC 3:", auprc_3)
print("AUC 3:", auc_3)


AUPRC: 0.008588104379423976
AUC: 0.5113026161049105
AUPRC 2: 0.005250478837115914
AUC 2: 0.7401366299065835
AUPRC 3: 0.00214330946465975
AUC 3: 0.49942640956797657


# Undersampling

In [32]:
from sklearn.utils import resample

In [33]:
# Separate the majority and minority classes
majority_class = df[df['Class'] == 0]
minority_class = df[df['Class'] == 1]

# Determine the undersampling ratio
undersampling_ratio = 1  # Adjust this based on desired ratio

# Undersample the majority class
undersampled_majority = resample(majority_class,
                                 replace=False,  # Set to False for undersampling
                                 n_samples=len(minority_class) * undersampling_ratio,
                                 random_state=42)

# Combine the minority class and the undersampled majority class
undersampled_data = pd.concat([undersampled_majority, minority_class])

# Shuffle the dataset
undersampled_data = undersampled_data.sample(frac=1, random_state=42)

In [34]:
labels_us = undersampled_data.Class
X_undersampled = undersampled_data.drop('Class', axis=1)

In [35]:
X_train_us, X_val_us, y_train_us, y_val_us = train_test_split(X_undersampled, labels_us, test_size=0.25, random_state = 5)

In [36]:
clf_us = LogisticRegression(random_state=0, max_iter=10000).fit(X_train_us, y_train_us)

In [37]:
y_pred_us = clf_us.predict(X_val_us)

In [38]:
confusion_matrix(y_val_us, y_pred_us)

array([[61, 16],
       [36, 64]])

In [39]:
auprc_4 = average_precision_score(y_val_us,y_pred_us)
auc_4 = roc_auc_score(y_val_us, y_pred_us)

In [40]:
print("AUPRC:", auprc)
print("AUC:", auc)

print("AUPRC 2:", auprc_2)
print("AUC 2:", auc_2)

print("AUPRC 3:", auprc_3)
print("AUC 3:", auc_3)

print("AUPRC 4:", auprc_4)
print("AUC 4:", auc_4)


AUPRC: 0.008588104379423976
AUC: 0.5113026161049105
AUPRC 2: 0.005250478837115914
AUC 2: 0.7401366299065835
AUPRC 3: 0.00214330946465975
AUC 3: 0.49942640956797657
AUPRC 4: 0.7153898305084746
AUC 4: 0.7161038961038961
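
When reading AUPRC 4 next to AUPRC 1 to 3, it may help to keep in mind that the two validation sets have very different fraud shares, and that a classifier giving every row the same score already attains an average precision equal to that share. A small sketch (our own helper, `positive_share`, with counts read off the confusion matrices in `In [18]` and `In [38]`):

```python
def positive_share(n_negative, n_positive):
    """Average precision of a classifier that gives every row the same score."""
    return n_positive / (n_negative + n_positive)


# Original validation set: 40,970 licit and 88 fraud rows
floor_full_val = positive_share(n_negative=40970, n_positive=88)
# Undersampled validation set: 77 licit and 100 fraud rows
floor_balanced_val = positive_share(n_negative=77, n_positive=100)

print(f"AUPRC floor, original validation set:     {floor_full_val:.4f}")
print(f"AUPRC floor, undersampled validation set: {floor_balanced_val:.4f}")
```

AUC does not depend on the class share in this way, so it is the easier of the two metrics to compare across the two validation sets.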


# Random Forest with undersampling

In [41]:
# Using Grid Search to find the best parameters
param_grid = {
    'n_estimators': [50],
    'max_depth' : [None,5,10],
    'class_weight': [None, 'balanced', {0: 1, 1: 10}]
}

In [42]:
rf_models_us = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=5, verbose=1,scoring='roc_auc')
rf_models_us.fit(X_train_us, y_train_us)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


In [43]:
predictions_us = rf_models_us.predict(X_val_us)

In [44]:
confusion_matrix(y_val_us, predictions_us)

array([[60, 17],
       [35, 65]])

In [45]:
auprc_5 = average_precision_score(y_val_us,predictions_us)
auc_5 = roc_auc_score(y_val_us, predictions_us)

In [46]:
print("AUPRC:", auprc)
print("AUC:", auc)

print("AUPRC 2:", auprc_2)
print("AUC 2:", auc_2)

print("AUPRC 3:", auprc_3)
print("AUC 3:", auc_3)

print("AUPRC 4:", auprc_4)
print("AUC 4:", auc_4)

print("AUPRC 5:", auprc_5)
print("AUC 5:", auc_5)

AUPRC: 0.008588104379423976
AUC: 0.5113026161049105
AUPRC 2: 0.005250478837115914
AUC 2: 0.7401366299065835
AUPRC 3: 0.00214330946465975
AUC 3: 0.49942640956797657
AUPRC 4: 0.7153898305084746
AUC 4: 0.7161038961038961
AUPRC 5: 0.7129840154333746
AUC 5: 0.7146103896103897


Model of choice: Logistic Regression with undersampling. Although the random forest model with grid search and undersampling may score higher (depending on chance), we chose logistic regression because it is simpler, which leads us to believe it generalizes better. Both models score better with undersampling than with SMOTE (AUC and AUPRC values closer to 1).

In [49]:
# Series.iteritems() was removed in pandas 2.0; Series.items() is the replacement
type(X_train_us['V1'].items())