# Fraudulent Transaction Monitoring

This notebook provides an end-to-end approach to fraud detection in credit card transactions.

The data comes from Kaggle and is available here:
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

Thank you for visiting my page!

# Analytical Steps

Below are the steps I take to conduct this analysis.

* <b>Step 1:</b> Packages and Dataset Import
* <b>Step 2:</b> Target, Predictors, and Time Series Split
* <b>Step 3:</b> Exploratory Data Analysis
* <b>Step 4:</b> Baseline Model Development


# Step 1. Packages and Dataset Import

In [None]:
####Import required python packages
from collections import Counter
from matplotlib import pyplot as plt
import plotly.express as px
import seaborn as sb

import numpy as np
import pandas as pd

import sklearn
from sklearn.model_selection import train_test_split,TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score,recall_score,f1_score,classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler,RobustScaler

# Oversampling and under sampling
# from imblearn.over_sampling import RandomOverSampler, SMOTE
# from imblearn.under_sampling import RandomUnderSampler, NearMiss

import scipy.stats as stats

In [None]:
%%sql @noteable
/*Using SQL to query the dataset*/
SELECT *
FROM 'Scratch Data Sets, Articles and Ideas/Credit Card Fraud.csv'

In [None]:
##read in the data set using pandas
fraud_df=pd.read_csv(r"Scratch Data Sets, Articles and Ideas/Credit Card Fraud.csv")

In [None]:
fraud_df.set_index('Time',inplace=True)
fraud_df.sort_index(inplace=True)

y=fraud_df['Class']
X=fraud_df.drop(labels='Class',axis=1)

# Step 2. Target, Predictors, and Time Series Split

The <b>Time</b> column records when each transaction occurred (in this dataset, the number of seconds elapsed since the first transaction). We divide the train and test samples on this column because it mirrors how fraud is detected in practice: a model is trained on past transactions and applied to future ones.

Additionally, the <b>Class</b> column contains the fraud indicator. The classes are heavily imbalanced, so we may have to resample or otherwise adjust our sample to achieve reasonable model performance scores.
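
One way to quantify the imbalance is to look at class counts and class shares. The sketch below uses a small made-up label series (`y_demo` is hypothetical, not the Kaggle data); on the real data the same calls could be applied to `fraud_df['Class']`.

```python
import pandas as pd

# Hypothetical labels for illustration only: 1 = fraud, 0 = not fraud
y_demo = pd.Series([0] * 98 + [1] * 2, name="Class")

# Absolute number of rows per class
class_counts = y_demo.value_counts()
# Share of rows per class (sums to 1)
class_share = y_demo.value_counts(normalize=True)

print(class_counts)
print(class_share)
```

A very small share for class 1 is a signal that plain accuracy will be a misleading metric later on.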

<b>Time</b> is set as the index of the dataframe and <b>Class</b> as the target (y), as done in the cell above.

First we'll create a train and test time-series split on the sample.

In [None]:
tss=TimeSeriesSplit(n_splits=2)

##split() yields one (train, test) index pair per fold; keep the last fold,
##which trains on the earliest observations and tests on the most recent ones
train_indices,test_indices=list(tss.split(X))[-1]
X_train,X_test=X.iloc[train_indices,:],X.iloc[test_indices,:]
y_train,y_test=y.iloc[train_indices],y.iloc[test_indices]
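
To make the behaviour of `TimeSeriesSplit` concrete, here is a minimal sketch on a toy array of nine time-ordered rows (`X_demo` is a hypothetical stand-in, not the fraud data). It shows that `split()` yields one `(train, test)` index pair per fold, and that the last fold trains on everything before the final test block.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: 9 observations already sorted by time
X_demo = np.arange(9).reshape(-1, 1)

tss_demo = TimeSeriesSplit(n_splits=2)

# Each element of folds is a (train_indices, test_indices) pair
folds = list(tss_demo.split(X_demo))
for fold_number, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {fold_number}: train={train_idx}, test={test_idx}")

# The last fold uses the earliest rows for training and the latest for testing
last_train_idx, last_test_idx = folds[-1]
```

Training indices always precede test indices, so the model is never evaluated on transactions older than the ones it was trained on.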

The plot below shows the mean of <b>Class</b> (the fraud rate) over time for the train and test samples.

In [None]:
y_train.groupby('Time').mean().plot()
y_test.groupby('Time').mean().plot()

Now we'll create the time series cross-validation splits on the sample. We will use these for hyperparameter tuning once we have selected a model.

In [None]:
##TIME SERIES CROSS VALIDATION
cross_val_tss=TimeSeriesSplit(n_splits=5)

##these are the 5 folds we created based on the time
for i, (train_index, test_index) in enumerate(cross_val_tss.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
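
As a rough sketch of how these folds could later feed hyperparameter tuning, a `TimeSeriesSplit` object can be passed as the `cv` argument of `GridSearchCV`. Everything below is illustrative: the data is synthetic (`make_classification`), and the parameter grid and the F1 scoring choice are assumptions rather than decisions made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced data, treated as if rows were already in time order
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=0
)

# Illustrative grid only; these are not tuned or recommended values
param_grid = {"max_depth": [2, 4, 6]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # every fold validates on later rows only
    scoring="f1",                    # assumption: F1 suits the imbalanced target
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```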


# Step 3. Exploratory Data Analysis
I will now begin the exploratory data analysis. The goal is to understand the basic structure of the data, check for missing values and anomalies, and learn what each column represents.

In [None]:
fraud_df.describe()

In [None]:
##Checking for missing values
fraud_df.isnull().values.any()

There are no missing values.

In [None]:
fraud_df.shape

In [None]:
fraud_df.columns

<b>Observations</b>:
Apart from the binary <b>Class</b> target, all variables appear to be continuous, and there are no missing values. Because the original feature names were removed (the Kaggle dataset description states that V1 to V28 are the result of a PCA transformation), we cannot make any judgments about what these columns represent. Additionally, we want to retain each outlier, because it could be an indicator of fraudulent activity.

## Correlations Between Features
We'll now examine the correlations between features, as well as between each feature and the target variable.

In [None]:
fraud_df.corr()

Now we can observe correlations through a heatmap.

In [None]:
dataplot= sb.heatmap(fraud_df.corr(),cmap="YlGnBu",cbar=False)
plt.show()

We note that there is not much correlation between the "V..." features. However, several of the features correlate with <b>Amount</b> and <b>Class</b> (which is our target).

Before selecting the features that correlate strongly with the target, we are going to scale every feature. Because the data likely contains outliers, we need a scaling approach that is robust to them. We will use <b>RobustScaler</b>, which centers each feature on its median and scales it by its interquartile range.

In [None]:
##Fit the scaler on the training set only, then apply the same
##median/IQR to the test set so no test information leaks into the scaling
robust_scaler=RobustScaler()

X_train_standardized=robust_scaler.fit_transform(X_train)
X_train_standardized = pd.DataFrame(X_train_standardized, columns=X_train.columns)

X_test_standardized=robust_scaler.transform(X_test)
X_test_standardized = pd.DataFrame(X_test_standardized, columns=X_test.columns)
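
The following sketch illustrates what `RobustScaler` does on a tiny, made-up feature with one extreme value (`train_demo` and `test_demo` are hypothetical). It follows the common convention of learning the median and interquartile range from the training data and reusing them on the test data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme value in the training data
train_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
test_demo = np.array([[3.0], [5.0]])

scaler_demo = RobustScaler()
# Learn the median (center_) and interquartile range (scale_) from train only
train_scaled = scaler_demo.fit_transform(train_demo)
# Reuse the training statistics on the test data
test_scaled = scaler_demo.transform(test_demo)

print(scaler_demo.center_, scaler_demo.scale_)
print(test_scaled.ravel())
```

The extreme value of 100 barely affects the median or the interquartile range, which is why this scaler suits data where outliers must be kept.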


Now we will assess the correlation of each feature with the target variable using the point-biserial correlation coefficient, which measures the correlation between a binary variable and a continuous variable. We will set a threshold of <b>.2 as the minimum absolute correlation for a feature to be considered in our model</b>. We may adjust this value later.

In [None]:

point_bi_serial_threshold = .2
pointbiserialr=stats.pointbiserialr

##Collect one row per feature, then build the dataframe once
##(DataFrame.append was removed in pandas 2.0)
corr_rows=[]
for feature in X_train_standardized.columns:
    pbc=pointbiserialr(y_train,X_train_standardized[feature])
    corr_rows.append([feature,pbc[0],"point_bi_serial"])
corr_data=pd.DataFrame(corr_rows,columns=['Feature','Correlation','Correlation_Type'])

# Filter NA and sort based on absolute correlation
corr_data = corr_data.iloc[corr_data.Correlation.abs().argsort()]
corr_data = corr_data[corr_data['Correlation'].notna()]
corr_data = corr_data.loc[corr_data['Correlation'] != 1]

# Add thresholds

# initialize list of lists
data = [['point_bi_serial', point_bi_serial_threshold]]
threshold_df=pd.DataFrame(data,columns=["Correlation_Type","Threshold"])
corr_data=pd.merge(corr_data,threshold_df,on=["Correlation_Type"],how="left")
corr_data2 = corr_data.loc[corr_data['Correlation'].abs() > corr_data['Threshold']]
corr_top_features = corr_data2['Feature'].tolist()

corr_top_features
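
For intuition, the point-biserial coefficient is numerically the same as the Pearson correlation computed with a 0/1 variable. The sketch below checks this on randomly generated data (`label_demo` and `feature_demo` are hypothetical and unrelated to the fraud set).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: a binary label and a feature that tends to be lower when label == 1
label_demo = rng.integers(0, 2, size=200)
feature_demo = rng.normal(loc=-1.5 * label_demo, scale=1.0)

pbs_result = stats.pointbiserialr(label_demo, feature_demo)
pearson_result = stats.pearsonr(label_demo, feature_demo)

# The first element of each result is the correlation coefficient
print(pbs_result[0], pearson_result[0])
```

A negative coefficient here means the feature tends to be lower for the class coded as 1, which is the same reading applied to the fraud features.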

Now we'll use the scaled features to create some boxplots that show the relationship between each selected feature and our target. First we have to convert the 0/1 values in <b>y_train</b> to class labels.

In [None]:
y_train

In [None]:
##Map the 0/1 indicator to readable labels in a single pass
y_train=np.where(y_train==1,"Fraud","Not fraud")
y_test=np.where(y_test==1,"Fraud","Not fraud")

##Work on a copy so the Class column is not added to the scaled features
corr_train=X_train_standardized.copy()
corr_train['Class']=y_train

In [None]:
corr_train['Class']

In [None]:
class_nf = corr_train[corr_train['Class'] == "Not fraud"]
class_f = corr_train[corr_train['Class'] == "Fraud"]

for feature in corr_top_features:
    sb.boxplot(data=[class_nf[feature], class_f[feature]])
    plt.title("Fraud Class by "+feature)
    plt.ylim(-25,5)
    plt.show()

These box plots suggest an inverse relationship between each of these selected features and the target <b>Class</b>.

Fraudulent transactions tend to have lower values of <b>['V3', 'V16', 'V10', 'V7', 'V12', 'V14', 'V17']</b>. We cannot interpret what this means in context, as we don't have any description of the features. However, if I did know what the features represent, I would conduct a sanity check to ensure that the inverse relationship makes sense.
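
A simple numeric complement to the box plots is to compare per-class medians. This is a minimal sketch on made-up values (`demo` and the column `V_demo` are hypothetical); on the notebook's data the same `groupby` could be applied to `corr_train`.

```python
import pandas as pd

# Hypothetical scaled feature values with string class labels
demo = pd.DataFrame({
    "Class": ["Not fraud"] * 4 + ["Fraud"] * 3,
    "V_demo": [0.1, -0.2, 0.3, 0.0, -4.0, -6.5, -3.2],
})

# Median of the feature within each class
medians_by_class = demo.groupby("Class")["V_demo"].median()
print(medians_by_class)
```

A lower median for the fraud class is consistent with the inverse relationship seen in the plots.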

# Step 4. Baseline Model Development

Now we will build a baseline model to iterate on. We will start with a basic <b>decision tree</b>.

In [None]:

####MODEL TYPE: DECISION TREE
##Train a baseline decision tree classifier
##No cross validation yet; that comes when tuning the hyperparameters using grid search
classifier = DecisionTreeClassifier()
classifier.fit(X_train_standardized[corr_top_features], y_train)

y_pred = classifier.predict(X_test_standardized[corr_top_features])
df = pd.DataFrame({'Real Values':y_test, 'Predicted Values':y_pred})

# The score method returns the accuracy of the model; score on the scaled
# test features, since the model was trained on scaled features
score = classifier.score(X_test_standardized[corr_top_features], y_test)
print(score)
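
Because the classes are imbalanced, accuracy alone can look strong even when few frauds are caught. The sketch below shows one way to report precision and recall alongside it. The data is synthetic (`make_classification`) and the split simply keeps row order, so the numbers it prints say nothing about the real fraud data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced data standing in for the fraud set (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# Keep row order for the split, mimicking a time-based holdout
X_tr, X_te = X_demo[:1500], X_demo[1500:]
y_tr, y_te = y_demo[:1500], y_demo[1500:]

clf_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = clf_demo.predict(X_te)

# Precision: share of flagged rows that are truly positive
precision = precision_score(y_te, y_hat, zero_division=0)
# Recall: share of truly positive rows that were flagged
recall = recall_score(y_te, y_hat, zero_division=0)

print(classification_report(y_te, y_hat, zero_division=0))
```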

https://medium.com/towards-data-science/task-cheatsheet-for-almost-every-machine-learning-project-d0946861c6d0

Keep thinking about the "problem": perhaps I want to "tier" the models, so that one model catches more fraudulent transactions (higher recall) and another is more accurate when it does flag a transaction (higher precision).
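
One possible reading of the tiering idea, sketched under assumptions: rather than two separate models, a single probabilistic model with two thresholds, a low one that casts a wide net and a high one that flags only confident cases. The data, the logistic regression model, and both threshold values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te = X_demo[:1500], X_demo[1500:]
y_tr, y_te = y_demo[:1500], y_demo[1500:]

model_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Predicted probability of the positive (fraud-like) class
fraud_proba = model_demo.predict_proba(X_te)[:, 1]

# Illustrative thresholds, not tuned values
review_threshold, block_threshold = 0.2, 0.8
flag_for_review = fraud_proba >= review_threshold  # wider net, favors recall
flag_to_block = fraud_proba >= block_threshold     # stricter, favors precision

print(flag_for_review.sum(), flag_to_block.sum())
```

Every transaction in the strict tier is also in the wide tier, so the two tiers could map to different actions, for example manual review versus automatic blocking.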

To explore: Shapley values.