# Fraudulent Transaction Monitoring

This notebook provides an end-to-end approach to fraud detection in credit card transactions.

The data comes from Kaggle and is available here:
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

Thank you for visiting my page!

# Analytical Steps

Below are the steps I take to conduct this analysis.

* <b>Step 1:</b> Packages and Dataset Import
* <b>Step 2:</b> Target, Predictors, and Train-Test Split
* <b>Step 3:</b> Exploratory Data Analysis
* <b>Step 4:</b> Baseline Model Development


# Step 1. Packages and Dataset Import

In [None]:
####Import required python packages
from collections import Counter
from matplotlib import pyplot as plt
import plotly.express as px
import seaborn as sb

import numpy as np
import pandas as pd

import sklearn as sklearn
from sklearn import svm
from sklearn.model_selection import train_test_split,KFold,TimeSeriesSplit,cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score,recall_score,f1_score,classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler,RobustScaler

# Oversampling and under sampling
# from imblearn.over_sampling import RandomOverSampler, SMOTE
# from imblearn.under_sampling import RandomUnderSampler, NearMiss

import scipy.stats as stats

In [None]:
%%sql @noteable
/*Using SQL to query the dataset*/
SELECT *
FROM 'Scratch Data Sets, Articles and Ideas/Credit Card Fraud.csv'

In [None]:
##read in the data set using pandas
fraud_df=pd.read_csv(r"Scratch Data Sets, Articles and Ideas/Credit Card Fraud.csv")

In [None]:
fraud_df.set_index('Time',inplace=True)
fraud_df.sort_index(inplace=True)

y=fraud_df['Class']
X=fraud_df.drop(labels='Class',axis=1)

# Step 2. Target, Predictors, and Train-Test Split

The <b>Time</b> column records when each transaction occurred (in this dataset, the seconds elapsed since the first transaction). We split the train and test samples on this column because it mirrors how fraud is detected in practice: a model is trained on past transactions and applied to later ones.

The <b>Class</b> column contains the fraud indicator (1 for fraud, 0 otherwise). The classes are heavily imbalanced, so we may have to resample the data to achieve reasonable model performance scores.
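
A quick way to quantify the imbalance is to count the classes. The sketch below uses a made-up label series (<b>toy_class</b> is a hypothetical stand-in, not the real data) to show the idea.

```python
import pandas as pd

# Hypothetical toy labels: 998 legitimate transactions and 2 fraudulent ones
toy_class = pd.Series([0] * 998 + [1] * 2, name="Class")

class_counts = toy_class.value_counts()               # absolute count per class
class_share = toy_class.value_counts(normalize=True)  # share of each class

print(class_counts)
print(class_share)
```

In this notebook the same two calls could be applied to <b>y</b>.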

We set <b>Time</b> as the index of the dataframe and <b>Class</b> as the target variable y in the cell above.

First we'll create a train and test split on the sample.

In [None]:
tss=TimeSeriesSplit(n_splits=2)

# TimeSeriesSplit yields one (train, test) index pair per split;
# keep the last pair so the test rows come strictly after the training rows
train_indices,test_indices=list(tss.split(X))[-1]
X_train,X_test=X.iloc[train_indices,:],X.iloc[test_indices,:]
y_train,y_test=y.iloc[train_indices],y.iloc[test_indices]
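
To make the behaviour of <b>TimeSeriesSplit</b> concrete, here is a small sketch on a made-up array of nine time-ordered rows (<b>toy_X</b> and the other <b>toy_</b> names are hypothetical). Each split is a (train, test) pair of index arrays, and the training window grows with every split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: 9 observations already sorted by time
toy_X = np.arange(9).reshape(-1, 1)

toy_tss = TimeSeriesSplit(n_splits=2)
toy_splits = list(toy_tss.split(toy_X))

for fold, (train_idx, test_idx) in enumerate(toy_splits):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")

# The last split trains on the most history and tests on the latest block
last_train_idx, last_test_idx = toy_splits[-1]
```

Every test index is later than every training index, so the model is never evaluated on transactions that precede its training data.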

In [None]:
##Potential Delete
# X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.2,random_state=42)

We can see this time series split in a plot below.

In [None]:
y_train.groupby('Time').mean().plot()
y_test.groupby('Time').mean().plot()

Now we'll create the cross-validation splits on the sample. We will use these for hyperparameter tuning after we select a model.

In [None]:
##TIME SERIES CROSS VALIDATION
cross_val_tss=TimeSeriesSplit(n_splits=5)

##these are the 5 folds we created based on the time
for i, (train_index, test_index) in enumerate(cross_val_tss.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")


In [None]:
##POTENTIAL DELETE##
##CROSS VALIDATION
# kf=KFold(n_splits=5)

# ##these are the 5 folds we created based on the time
# for i, (train_index, test_index) in enumerate(kf.split(X)):
#     print(f"Fold {i}:")
#     print(f"  Train: index={train_index}")
#     print(f"  Test:  index={test_index}")


# Step 3. Exploratory Data Analysis
I will now begin the exploratory data analysis. The goal is to understand the basic structure of the data, check for missing values and anomalies, and learn what each column represents.

In [None]:
fraud_df.describe()

In [None]:
##Checking for missing values
fraud_df.isnull().values.any()

There are no missing values.

In [None]:
fraud_df.shape

In [None]:
fraud_df.columns

<b>Observations</b>:
It appears that there are no missing values and that every predictor is continuous. Because the original variable names were removed, we cannot make any judgements about what these columns mean. Additionally, we want to retain every outlier, because it could be an indicator of fraudulent activity.

## Correlations Between Features
We'll now understand the correlations between features, as well as between each feature and the target variable.

In [None]:
fraud_df.corr()

Now we can observe correlations through a heatmap.

In [None]:
dataplot= sb.heatmap(fraud_df.corr(),cmap="YlGnBu",cbar=False)
plt.show()

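We note that there is not much correlation between the "V..." features, which is expected because the dataset description states that they are principal components from a PCA transformation. However, several of the features correlate with <b>Amount</b> and <b>Class</b> (which is our target).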
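
The heatmap can be hard to read for a single column, so it may help to pull out just the correlations with the target. This is a minimal sketch on made-up data (<b>toy</b>, <b>V_signal</b> and <b>V_noise</b> are hypothetical names), not the notebook's actual result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy frame: one feature tied to the target, one pure noise
toy = pd.DataFrame({"Class": rng.integers(0, 2, size=500)})
toy["V_signal"] = -2.0 * toy["Class"] + rng.normal(size=500)
toy["V_noise"] = rng.normal(size=500)

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()["Class"].drop("Class")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)
print(target_corr)
```

The same pattern should work on <b>fraud_df</b>, since <b>Class</b> is one of its columns.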

Before we move to selecting the features that have high correlations with the target, we are going to standardize every feature. Because we will likely have outliers, we need to use a scaling approach that can be robust to outliers. We will use the <b>RobustScaler</b> approach.

In [None]:
# Fit the scaler on the training data only, then apply the same
# medians and IQRs to the test data so both sets share one scale
robust_scaler=RobustScaler()

X_train_standardized=robust_scaler.fit_transform(X_train)
X_train_standardized = pd.DataFrame(X_train_standardized, columns=X_train.columns)

X_test_standardized=robust_scaler.transform(X_test)
X_test_standardized = pd.DataFrame(X_test_standardized, columns=X_test.columns)
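
To see why <b>RobustScaler</b> copes well with outliers, here is a small sketch on a made-up feature (the <b>toy_</b> names are hypothetical). With its default settings the scaler subtracts the median and divides by the interquartile range, so a single extreme value barely affects the scale it learns.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy feature with one extreme value in the training data
toy_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
toy_test = np.array([[3.0], [5.0]])

toy_scaler = RobustScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learn median and IQR on train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuse the training statistics

print(toy_scaler.center_, toy_scaler.scale_)  # median 3.0 and IQR 2.0
print(toy_test_scaled.ravel())                # (3-3)/2 and (5-3)/2
```

The value 100 does not change the median or the IQR here, whereas it would inflate a mean and standard deviation.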


Now we will assess the correlation of these features with the target variable, using the point-biserial correlation coefficient, which measures the correlation between a binary variable and a continuous variable. We will set a threshold of <b>.2 as the minimum absolute correlation for a feature to be considered in our model</b>. We may adjust this value later.

In [None]:

point_bi_serial_threshold = .2
corr_rows=[]
for feature in X_train_standardized.columns:
    # pointbiserialr returns (correlation, p-value); we only keep the correlation
    correlation,_=stats.pointbiserialr(y_train,X_train_standardized[feature])
    corr_rows.append([feature,correlation,"point_bi_serial"])

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
corr_data=pd.DataFrame(corr_rows,columns=['Feature','Correlation','Correlation_Type'])

# Filter NA and sort based on absolute correlation
corr_data = corr_data.iloc[corr_data.Correlation.abs().argsort()]
corr_data = corr_data[corr_data['Correlation'].notna()]
corr_data = corr_data.loc[corr_data['Correlation'] != 1]

# Add thresholds

# initialize list of lists
data = [['point_bi_serial', point_bi_serial_threshold]]
threshold_df=pd.DataFrame(data,columns=["Correlation_Type","Threshold"])
corr_data=pd.merge(corr_data,threshold_df,on=["Correlation_Type"],how="left")
corr_data2 = corr_data.loc[corr_data['Correlation'].abs() > corr_data['Threshold']]
corr_top_features = corr_data2['Feature'].tolist()

corr_top_features
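
As a sanity check on the method, the point-biserial coefficient is mathematically the Pearson correlation computed with a 0/1 variable. The sketch below illustrates this on made-up data (<b>toy_label</b> and <b>toy_feature</b> are hypothetical).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical toy data: a binary label and a feature that is lower when the label is 1
toy_label = rng.integers(0, 2, size=200)
toy_feature = rng.normal(size=200) - 1.5 * toy_label

r_pb, _ = stats.pointbiserialr(toy_label, toy_feature)
r_pearson, _ = stats.pearsonr(toy_label, toy_feature)

print(r_pb, r_pearson)  # the two coefficients agree
```

A negative coefficient means the feature tends to be lower for the positive class, which is the pattern examined in the boxplots that follow.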

Now we'll use the scaled features to create boxplots that show the relationship between each selected feature and our target. To do this, we scale the feature columns of the full dataset and split the rows into fraud and non-fraud groups.

In [None]:
#potential delete
# fraud_df

In [None]:
#potential delete
# X_train_standardized

In [None]:
#potential delete
# corr_train

In [None]:
# corr_train=X_train_standardized
# fraud_df.columns
standardized_cols=[col for col in fraud_df.columns if "V" in col or "Amount" in col]
fraud_df_subset=fraud_df[standardized_cols]
rs=RobustScaler()
fraud_df_subset_scaled=rs.fit_transform(fraud_df_subset)
# copy so that scaling does not overwrite the original fraud_df
fraud_df_standardized=fraud_df.copy()
fraud_df_standardized[standardized_cols]=fraud_df_subset_scaled

In [None]:
class_nf = fraud_df_standardized[fraud_df_standardized['Class'] == 0]
class_f = fraud_df_standardized[fraud_df_standardized['Class'] == 1]

In [None]:
for feature in corr_top_features:
    sb.boxplot(data=[class_nf[feature], class_f[feature]])
    plt.title("Fraud Class by "+feature)
    plt.xticks([0,1], ["Not fraud", "Fraud"])
    plt.ylim(-25,5)
    plt.show()

These box plots suggest an inverse relationship between each of the selected features and the target <b>Class</b>.

Fraudulent transactions seem to have lower values of <b>['V3', 'V16', 'V10', 'V7', 'V12', 'V14', 'V17']</b>. We cannot interpret what this means in context, as we don't have any description of the features. However, if I did have knowledge of what the features are, I would conduct a sanity check to ensure that the inverse relationship makes sense.

# Step 4. Baseline Model Development

Now we will build baseline models to iterate on. We will start with a <b>decision tree</b>, and then we will move to a basic <b>logistic regression</b>. To end our initial analysis, we will build a <b>support vector machine</b>.

## Model Type: Decision Tree

In [None]:
model_scores={}
####MODEL TYPE: DECISION TREE
##Now time to train a decision tree model
##Not going to do any cross validation yet until I work on tuning the hyperparameters using grid search
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_standardized[corr_top_features], y_train)

y_pred = decision_tree.predict(X_test_standardized[corr_top_features])
model_df = pd.DataFrame({'Real Values':y_test, 'Decision Tree Predicted Values':y_pred})

# The score method returns the accuracy of the model
score = decision_tree.score(X_test_standardized[corr_top_features], y_test)
model_scores['Decision Tree']={'Accuracy':score}
print(score)

## Model Type: Logistic Regression

In [None]:
####MODEL TYPE: Logistic Regression
logistic=LogisticRegression()
logistic.fit(X_train_standardized[corr_top_features], y_train)

y_pred = logistic.predict(X_test_standardized[corr_top_features])
model_df['Logistic Predicted Values']=y_pred
score=logistic.score(X_test_standardized[corr_top_features],y_test)
model_scores['Logistic Regression']={'Accuracy':score}
print(score)

## Model Type: Support Vector Machine

In [None]:
####MODEL TYPE: Support Vector Machine
svm_model=svm.SVC(kernel='linear')
svm_model.fit(X_train_standardized[corr_top_features],y_train)

y_pred = svm_model.predict(X_test_standardized[corr_top_features])
model_df['SVM Predicted Values']=y_pred
score=svm_model.score(X_test_standardized[corr_top_features],y_test)
model_scores['SVM']={'Accuracy':score}
print(score)

## Assessing Baseline Model Performance Through Cross Validation
Now we will conduct time series cross validation for each of the models. I'm creating a function called <b>CV_report</b> to report the cross-validation metrics.

#### Definitions of Performance Metrics

<b>Accuracy</b>: Accuracy is calculated as the number of true positives (accurately identified fraud) plus true negatives (accurately identified non-fraud) divided by the total number of predictions.

<b>Precision</b>: Precision is calculated as the number of true positives divided by the number of true positives plus false positives (incorrectly identified fraud).

<b>Recall</b>: Recall is calculated as the number of true positives divided by the number of true positives plus false negatives (fraudulent transactions that were missed).

<b>F1</b>: F1 balances precision and recall in a single score. It is calculated as the harmonic mean of precision and recall.
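
The definitions above can be checked on a tiny worked example. The labels and predictions below are made up (<b>toy_true</b> and <b>toy_pred</b> are hypothetical) and give 2 true positives, 1 false positive, 2 false negatives, and 5 true negatives.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical toy labels: 1 = fraud, 0 = not fraud
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]   # TP=2, FP=1, FN=2, TN=5

toy_accuracy = accuracy_score(toy_true, toy_pred)    # (2 + 5) / 10
toy_precision = precision_score(toy_true, toy_pred)  # 2 / (2 + 1)
toy_recall = recall_score(toy_true, toy_pred)        # 2 / (2 + 2)
toy_f1 = f1_score(toy_true, toy_pred)                # harmonic mean of precision and recall

print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

Accuracy is the highest of the four numbers here even though half of the fraud cases were missed, which is why it is not a sufficient metric on its own for imbalanced data.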


In [None]:
##Write Function
def CV_report(n_folds,model):
    tscv = TimeSeriesSplit(n_splits=n_folds)
    scores_accuracy=cross_val_score(model,X,y,cv=tscv)
    scores_precision=cross_val_score(model,X,y,cv=tscv,scoring='precision')
    scores_recall=cross_val_score(model,X,y,cv=tscv,scoring='recall')
    scores_f1=cross_val_score(model,X,y,cv=tscv,scoring='f1')
    performance_metrics={'Accuracy':
                         {'Mean':scores_accuracy.mean(),
                          'Standard Deviation':scores_accuracy.std()},
                        'Precision':
                         {'Mean':scores_precision.mean(),
                          'Standard Deviation':scores_precision.std()},
                        'Recall':
                         {'Mean':scores_recall.mean(),
                          'Standard Deviation':scores_recall.std()},
                        'F1':
                         {'Mean':scores_f1.mean(),
                         'Standard Deviation':scores_f1.std()}  
                                  }
    return performance_metrics
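
As a possible alternative, scikit-learn's <b>cross_validate</b> can compute several metrics in a single pass instead of refitting the model once per metric. The sketch below is only an illustration on synthetic data: the <b>toy_</b> names are hypothetical, and wrapping the model in a pipeline with <b>RobustScaler</b> is my assumption about how scaling could be kept inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data standing in for X and y
toy_X, toy_y = make_classification(n_samples=600, n_features=8, random_state=0)

# The scaler inside the pipeline is refit on each training fold
toy_model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))

toy_metrics = ["accuracy", "precision", "recall", "f1"]
toy_cv = cross_validate(
    toy_model, toy_X, toy_y,
    cv=TimeSeriesSplit(n_splits=5),
    scoring=toy_metrics,
)

# Summarise each metric as mean and standard deviation across folds
toy_report = {
    metric: {"Mean": toy_cv[f"test_{metric}"].mean(),
             "Standard Deviation": toy_cv[f"test_{metric}"].std()}
    for metric in toy_metrics
}
print(toy_report)
```

The resulting dictionary has the same Mean and Standard Deviation layout as the one returned by <b>CV_report</b>, with lower-case metric names as keys.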

In [None]:
##Decision Tree Cross Validation Report
cv_decision_tree=DecisionTreeClassifier()
d_tree_report= CV_report(5,cv_decision_tree)
print(d_tree_report)

In [None]:
##Logistic Regression Cross Validation Report
cv_logistic=LogisticRegression()
logistic_report=CV_report(5,cv_logistic)
print(logistic_report)

In [None]:
##Support Vector Machine Cross Validation Report
cv_svm=svm.SVC(kernel='linear')
svm_report=CV_report(5,cv_svm)
print(svm_report)

Below you will find a chart that demonstrates the performance metrics for each of the basic model types.

In [None]:
accuracy_numbers=[d_tree_report['Accuracy']['Mean'],logistic_report['Accuracy']['Mean'],svm_report['Accuracy']['Mean']]
precision_numbers=[d_tree_report['Precision']['Mean'],logistic_report['Precision']['Mean'],svm_report['Precision']['Mean']]
recall_numbers=[d_tree_report['Recall']['Mean'],logistic_report['Recall']['Mean'],svm_report['Recall']['Mean']]
F1_numbers=[d_tree_report['F1']['Mean'],logistic_report['F1']['Mean'],svm_report['F1']['Mean']]


fig,ax=plt.subplots()

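##offset each metric so the bars sit side by side instead of overlapping
x=np.arange(3)
width=0.2
ax.bar(x-1.5*width,accuracy_numbers,width,label="Accuracy")
ax.bar(x-0.5*width,precision_numbers,width,label="Precision")
ax.bar(x+0.5*width,recall_numbers,width,label="Recall")
ax.bar(x+1.5*width,F1_numbers,width,label="F1")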

labels=['Decision Tree','Logistic Regression','Support Vector Machine']

plt.xticks(range(len(labels)),labels)
plt.xlabel('Model Types')
plt.ylabel('Values')
plt.title("Model Performance Metrics")
plt.legend()
plt.show()

#### Conclusion
We want to capture as many fraudulent transactions as possible, even if that means flagging some legitimate transactions as fraud. Therefore we should prioritize Recall. The best performing model in terms of Recall is _____.
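
One way to favour Recall, beyond choosing between models, is to lower the probability cut-off at which a transaction is flagged. The sketch below is an assumption-laden illustration on synthetic data: the <b>toy_</b> names, the 0.5 and 0.2 cut-offs, and the random (not time-based) split are all hypothetical choices made for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data (about 5% positives)
toy_X, toy_y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=0
)

toy_clf = LogisticRegression(max_iter=1000).fit(toy_X_tr, toy_y_tr)
fraud_prob = toy_clf.predict_proba(toy_X_te)[:, 1]  # estimated probability of class 1

recall_by_threshold = {}
for threshold in (0.5, 0.2):  # illustrative cut-offs, not tuned values
    toy_pred = (fraud_prob >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(toy_y_te, toy_pred)
    print(threshold,
          precision_score(toy_y_te, toy_pred, zero_division=0),
          recall_by_threshold[threshold])
```

Lowering the cut-off can only add flagged transactions, so Recall never decreases, while Precision usually falls. That is the trade-off this conclusion accepts.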




TO DO: 
1. Add a definition of each of the model performance metrics (accuracy, precision, recall, F1)
2. Bar graph for model performance on each of the metrics
3. Identify the features that have the most impact on the target using Shapley values


https://medium.com/towards-data-science/task-cheatsheet-for-almost-every-machine-learning-project-d0946861c6d0

Note: keep thinking about the "problem". Perhaps I want to "tier" the models so that there is one model that catches more fraudulent transactions (higher recall) and one that raises fewer false alarms (higher precision).

Note: boosting may work better than bagging for imbalanced classes.

SHAPLEY VALUES

In [None]:
##GBM
GB=GradientBoostingClassifier()
gb_model=GB.fit(X_train_standardized,y_train)
##predict with the same columns the model was fitted on (all features)
gb_prediction=gb_model.predict(X_test_standardized)
