# Credit Fraud Detection

Anonymized credit card transactions labeled as fraudulent or genuine

* Source: https://www.kaggle.com/datasets/whenamancodes/fraud-detection

## About Data
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.


It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount, which can be used, for example, in example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.


Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
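
As a hedged illustration of the recommendation above, the sketch below uses a small synthetic label vector (not the Kaggle data) and scikit-learn's `average_precision_score`, one common summary of the precision-recall curve. The names `y_true`, `always_genuine` and `scores` are made up for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Synthetic, illustrative labels: 2 frauds among 1000 transactions
y_true = np.zeros(1000, dtype=int)
y_true[[10, 500]] = 1

# A useless "model" that labels every transaction as genuine
always_genuine = np.zeros(1000, dtype=int)
acc = accuracy_score(y_true, always_genuine)

# Constant scores carry no ranking information, so average precision
# collapses to the fraud prevalence
ap_constant = average_precision_score(y_true, np.zeros(1000))

# Scores that rank both frauds above every genuine transaction
scores = np.zeros(1000)
scores[[10, 500]] = 1.0
ap_perfect = average_precision_score(y_true, scores)

print(acc, ap_constant, ap_perfect)
```

The do-nothing classifier reaches an accuracy of 0.998 while its average precision only equals the 0.2% fraud rate, which is why accuracy alone says little on data this unbalanced.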

# 1. Importing relevant libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

from collections import Counter

sns.set()
%matplotlib inline

# 2. Loading raw data

In [None]:
df_raw = pd.read_csv('../input/fraud-detection/creditcard.csv')

# 3. EDA

In [None]:
df_raw.describe(include='all')

In [None]:
# Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.
# It does not look important for this analysis, so we drop it.

df=df_raw.drop(['Time'], axis=1)

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
# We have no null values

In [None]:
df.info()

In [None]:
# All features are numerical; the target 'Class' is a binary 0/1 label

In [None]:
len(df[df['Class'] == 1])

In [None]:
len(df[df['Class'] == 0])

In [None]:
# The classes are hugely imbalanced: only 0.17% of the transactions are frauds.

In [None]:
ax = sns.countplot(x='Class',data=df)
total = float(len(df))
for p in ax.patches:
    percentage="{:.2f}%".format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar, to match ha="center"
    y = p.get_height()
    ax.annotate(percentage, (x, y),ha="center")
plt.show()

## Correlations and most important features

In [None]:
df.corr()['Class'].sort_values()

In [None]:
plt.figure(figsize=(15,8))
d = df.corr()['Class'][:-1].abs().sort_values().plot(kind='bar', title='Most important features')

plt.show()

In [None]:
# Let's pick all features whose absolute correlation with 'Class' is above 0.15

c = df.corr()['Class'][:-1].abs() > 0.15

print (c)

In [None]:
sns.jointplot(x='V17', y='V14',hue='Class', data=df, palette = 'dark')

In [None]:
sns.jointplot(x='V17', y='V12',hue='Class', data=df, palette = 'dark')

In [None]:
sns.jointplot(x='V17', y='V10',hue='Class', data=df, palette = 'dark')

In [None]:
sns.jointplot(x='V14', y='V12',hue='Class', data=df, palette = 'dark')

# 4. Feature engineering

## Outlier detection

Let's check the distributions of the features whose absolute correlation with 'Class' is 0.13 or higher.
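
The feature list can also be derived from the correlations instead of being typed by hand. This is a minimal sketch on a toy frame, not the notebook's own code: the helper `features_above` and the columns `A` and `B` are hypothetical.

```python
import pandas as pd

def features_above(frame, target, threshold):
    """Return feature names whose absolute correlation with target is >= threshold."""
    corr = frame.corr()[target].drop(target).abs()
    return corr[corr >= threshold].sort_values(ascending=False).index.tolist()

toy = pd.DataFrame({
    "A": [0.1, 0.2, 0.3, 0.9, 1.0, 1.1],      # tracks the label closely
    "B": [1.0, -1.0, 1.0, 1.0, -1.0, 1.0],    # unrelated to the label
    "Class": [0, 0, 0, 1, 1, 1],
})
selected = features_above(toy, "Class", 0.5)
print(selected)
```

On the real data the same call would presumably look like `features_above(df, 'Class', 0.13)`.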

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=3,figsize=(13,8))

axes[0,0].hist(df['V17'], bins=60, linewidth=0.5, edgecolor="white")
axes[0,0].set_title("V17 distribution");

axes[0,1].hist(df['V10'], bins=60, linewidth=0.5, edgecolor="white")
axes[0,1].set_title("V10 distribution");

axes[0,2].hist(df['V12'], bins=60, linewidth=0.5, edgecolor="white")
axes[0,2].set_title("V12 distribution");

axes[1,0].hist(df['V16'], bins=60, linewidth=0.5, edgecolor="white")
axes[1,0].set_title("V16 distribution");

axes[1,1].hist(df['V14'], bins=60, linewidth=0.5, edgecolor="white")
axes[1,1].set_title("V14 distribution");

axes[1,2].hist(df['V3'], bins=60, linewidth=0.5, edgecolor="white")
axes[1,2].set_title("V3 distribution");

axes[2,0].hist(df['V7'], bins=60, linewidth=0.5, edgecolor="white")
axes[2,0].set_title("V7 distribution");

axes[2,1].hist(df['V11'], bins=60, linewidth=0.5, edgecolor="white")
axes[2,1].set_title("V11 distribution");

axes[2,2].hist(df['V4'], bins=60, linewidth=0.5, edgecolor="white")
axes[2,2].set_title("V4 distribution");

plt.tight_layout()

It looks like we have a lot of outliers here. We can try to get rid of them.

## Tukey's IQR method

Tukey's (1977) technique is used to detect outliers in univariate distributions, for symmetric as well as slightly skewed data sets. The general rule is that anything outside the range from (Q1 - 1.5 IQR) to (Q3 + 1.5 IQR) is an outlier and can be removed.
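
To make the rule concrete, here is a small worked example on made-up numbers (the array `values` is illustrative only). It uses `np.percentile` with its default linear interpolation, as the function below does.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: anything outside [lower, upper] counts as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Here Q1 = 3 and Q3 = 7, so the fences are -3 and 13 and only the value 100 is flagged.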

In [None]:
def detect_outliers(df,n,features):
    """
    Takes a dataframe df of features and returns an index list corresponding to the observations 
    containing more than n outliers according to the Tukey IQR method.
    """
    outlier_indices = []
    
    # iterating over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determining a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        
        # appending the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        
    # selecting observations flagged as outliers in more than n features
    outlier_indices = Counter(outlier_indices)        
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    
    return multiple_outliers   

# detecting outliers
Outliers_IQR = detect_outliers(df,2,['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount'])

# dropping outliers
df_out = df.drop(Outliers_IQR, axis = 0).reset_index(drop=True)

In [None]:
len(Outliers_IQR)

In [None]:
df_out

In [None]:
# Checking distributions of most important features after dropping outliers

fig, axes = plt.subplots(nrows=3, ncols=3,figsize=(13,8))

axes[0,0].hist(df_out['V17'], bins=60, linewidth=0.5, edgecolor="white")
axes[0,0].set_title("V17 distribution");

axes[0,1].hist(df_out['V10'], bins=60, linewidth=0.5, edgecolor="white")
axes[0,1].set_title("V10 distribution");

axes[0,2].hist(df_out['V12'], bins=60, linewidth=0.5, edgecolor="white")
axes[0,2].set_title("V12 distribution");

axes[1,0].hist(df_out['V16'], bins=60, linewidth=0.5, edgecolor="white")
axes[1,0].set_title("V16 distribution");

axes[1,1].hist(df_out['V14'], bins=60, linewidth=0.5, edgecolor="white")
axes[1,1].set_title("V14 distribution");

axes[1,2].hist(df_out['V3'], bins=60, linewidth=0.5, edgecolor="white")
axes[1,2].set_title("V3 distribution");

axes[2,0].hist(df_out['V7'], bins=60, linewidth=0.5, edgecolor="white")
axes[2,0].set_title("V7 distribution");

axes[2,1].hist(df_out['V11'], bins=60, linewidth=0.5, edgecolor="white")
axes[2,1].set_title("V11 distribution");

axes[2,2].hist(df_out['V4'], bins=60, linewidth=0.5, edgecolor="white")
axes[2,2].set_title("V4 distribution");

plt.tight_layout()

Now features look much more "normal"!
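
A histogram is only a visual check. One hedged way to put a number on it is excess kurtosis, which is about 0 for a normal distribution and large for heavy tails. The sketch below runs on a synthetic heavy-tailed sample (a stand-in, not a real PCA feature) and, as a simplification, trims a single series by its own Tukey fences rather than dropping rows with several outlying features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heavy-tailed sample standing in for one feature
raw = pd.Series(rng.standard_t(df=3, size=10_000))

# Keep only the values inside the Tukey fences
q1, q3 = raw.quantile([0.25, 0.75])
step = 1.5 * (q3 - q1)
trimmed = raw[(raw >= q1 - step) & (raw <= q3 + step)]

# Series.kurt() reports excess kurtosis
print("before:", raw.kurt(), "after:", trimmed.kurt())
```

The same comparison could be made per column with `df.kurt()` and `df_out.kurt()`.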

In [None]:
# Let's check that we didn't accidentally drop too much important information

In [None]:
print ('The number of frauds in df before dropping outliers: ', len(df[df['Class'] == 1]))

In [None]:
print ('The number of frauds in df after dropping outliers: ', len(df_out[df_out['Class'] == 1]))

It looks like most fraudulent transactions were flagged as outliers, so we dropped most of them!
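
One way to quantify this is to count frauds separately among the kept and the dropped rows. The following is a sketch on a toy frame: the labels in `flagged` are hypothetical stand-ins for the list returned by `detect_outliers`.

```python
import pandas as pd

toy = pd.DataFrame({"Class": [0, 0, 0, 0, 1, 1, 1, 0]})
flagged = [4, 5, 7]  # hypothetical row labels flagged as outliers

# Boolean mask: True for rows that would be dropped
is_outlier = toy.index.isin(flagged)

# Frauds and row counts per group (False = kept, True = dropped)
summary = toy.groupby(is_outlier)["Class"].agg(["sum", "count"])
summary.index = ["kept", "dropped"]
summary.columns = ["frauds", "rows"]
print(summary)
```

In this toy case two of the three frauds end up in the dropped group, mirroring the pattern observed above.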

Let's create a new data frame containing only the dropped outliers.

In [None]:
Outliers_df2 = df.loc[df.index[Outliers_IQR]]

In [None]:
len(Outliers_df2)

In [None]:
Outliers_df2

# 5. Modelling

## 1st data frame (dropped outliers only)

In [None]:
# Train/Test split

X = Outliers_df2.drop('Class',axis=1).values
y = Outliers_df2['Class'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42)

In [None]:
# Scaling data

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)

In [None]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout

In [None]:
X_train.shape

In [None]:
model = Sequential()

# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

model.add(Dense(units=29,activation='relu'))
model.add(Dropout(0.3))

model.add(Dense(units=14,activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(units=7,activation='relu'))
#model.add(Dropout(0.1))

model.add(Dense(units=2,activation='relu'))

model.add(Dense(units=1,activation='sigmoid'))

# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=2, patience=35)

In [None]:
model.fit(x=X_train, 
          y=y_train, 
          epochs=3000,
          validation_data=(X_test, y_test), verbose=2,
          callbacks=[early_stop]
          )

In [None]:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

In [None]:
predictions = (model.predict(X_test) > 0.5)*1

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
# https://en.wikipedia.org/wiki/Precision_and_recall
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))

In [None]:
cm_df = pd.DataFrame(confusion_matrix(y_test,predictions))
cm_df.columns = ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0',1:'Actual 1'})
cm_df

In [None]:
from sklearn.metrics import f1_score

In [None]:
f1 = f1_score(y_test, predictions)
print (f1)

In [None]:
CM = confusion_matrix(y_test,predictions)

TN = CM[0][0]
FN = CM[1][0]
TP = CM[1][1]
FP = CM[0][1]

In [None]:
# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP) 
# Precision or positive predictive value
PPV = TP/(TP+FP)
# Negative predictive value
NPV = TN/(TN+FN)
# Fall out or false positive rate
FPR = FP/(FP+TN)
# False negative rate
FNR = FN/(TP+FN)
# False discovery rate
FDR = FP/(TP+FP)

# Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)

In [None]:
ACC

## 2nd data frame (outliers removed)

In [None]:
# Train/Test split

X = df_out.drop('Class',axis=1).values
y = df_out['Class'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42)

In [None]:
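# Refit the scaler on this split; the earlier one was fit on the outlier rows only
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)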

In [None]:
X_train.shape

In [None]:
model = Sequential()

# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

model.add(Dense(units=29,activation='relu'))
model.add(Dropout(0.3))

model.add(Dense(units=14,activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(units=7,activation='relu'))
#model.add(Dropout(0.1))

model.add(Dense(units=2,activation='relu'))

model.add(Dense(units=1,activation='sigmoid'))

# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=2, patience=35)

In [None]:
model.fit(x=X_train, 
          y=y_train, 
          epochs=3000,
          validation_data=(X_test, y_test), verbose=2,
          callbacks=[early_stop]
          )

In [None]:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

In [None]:
predictions_2nd = (model.predict(X_test) > 0.5)*1

In [None]:
# https://en.wikipedia.org/wiki/Precision_and_recall
print(classification_report(y_test,predictions_2nd))

In [None]:
print(confusion_matrix(y_test,predictions_2nd))

In [None]:
cm_df = pd.DataFrame(confusion_matrix(y_test,predictions_2nd))
cm_df.columns = ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0',1:'Actual 1'})
cm_df

In [None]:
CM = confusion_matrix(y_test,predictions_2nd)

TN_2nd = CM[0][0]
FN_2nd = CM[1][0]
TP_2nd = CM[1][1]
FP_2nd = CM[0][1]

In [None]:
# Sensitivity, hit rate, recall, or true positive rate
TPR_2nd = TP_2nd/(TP_2nd+FN_2nd)
# Specificity or true negative rate
TNR_2nd = TN_2nd/(TN_2nd+FP_2nd) 
# Precision or positive predictive value
PPV_2nd = TP_2nd/(TP_2nd+FP_2nd)
# Negative predictive value
NPV_2nd = TN_2nd/(TN_2nd+FN_2nd)
# Fall out or false positive rate
FPR_2nd = FP_2nd/(FP_2nd+TN_2nd)
# False negative rate
FNR_2nd = FN_2nd/(TP_2nd+FN_2nd)
# False discovery rate
FDR_2nd = FP_2nd/(TP_2nd+FP_2nd)

# Overall accuracy
ACC_2nd = (TP_2nd+TN_2nd)/(TP_2nd+FP_2nd+FN_2nd+TN_2nd)

In [None]:
ACC_2nd

In [None]:
f1_2nd = f1_score(y_test, predictions_2nd)
print (f1_2nd)

# 6. Combining results

In [None]:
# Combining both confusion matrices

TN_final = TN + TN_2nd
FN_final = FN + FN_2nd
TP_final = TP + TP_2nd
FP_final = FP + FP_2nd

# Sensitivity, hit rate, recall, or true positive rate
TPR_final = TP_final/(TP_final+FN_final)

# Precision or positive predictive value
PPV_final = TP_final/(TP_final+FP_final)

# Overall accuracy
ACC_final = (TP_final+TN_final)/(TP_final+FP_final+FN_final+TN_final)

F1_score = 2*((PPV_final*TPR_final)/(PPV_final+TPR_final))

In [None]:
cm_df = pd.DataFrame(np.array([[TN_final, FP_final], [FN_final, TP_final]]), columns=['Predicted 0', 'Predicted 1'])
cm_df = cm_df.rename(index={0: 'Actual 0',1:'Actual 1'})

cm_df

Combined confusion matrix
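
Summing the cells of the two confusion matrices pools both test sets, so the resulting scores are computed over all test transactions at once rather than averaged per model. The sketch below shows the arithmetic with made-up counts. The helper `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Recall, precision, accuracy and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}

# Made-up counts for two test sets, listed as (TN, FP, FN, TP)
first = (90, 10, 5, 45)
second = (900, 0, 5, 5)

# Element-wise sum of the two confusion matrices
pooled = tuple(a + b for a, b in zip(first, second))
result = metrics_from_counts(*pooled)
print(pooled, result)
```

With these counts the pooled matrix is (990, 10, 10, 50), giving precision and recall of 50/60 each.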

In [None]:
print('Overall accuracy final score: ', ACC_final)

In [None]:
print('Overall F1 final score: ', F1_score)