# Fraud Detection from Credit Card History with Machine Learning Algorithms
This Jupyter notebook builds a binary classifier for fraud detection from a credit card transaction history. To do so, we explore the following three machine learning algorithms:
* Logistic Regression
* Decision Tree
* Linear Support Vector Machine

This notebook showcases the following:
1. Confusion matrix for each of the models.
2. Cross-validation metrics (precision, recall, F1 score, accuracy).
3. Plots comparing the probability distribution of the real test data with each model's predictions.
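
As a reminder of how the listed metrics relate to a confusion matrix, here is a minimal sketch on made-up toy labels (the `y_true` and `y_pred` values are illustrative assumptions, not data from this notebook):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels, invented for illustration: 1 = fraud, 0 = legitimate
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# The four metrics written out from the confusion-matrix counts
precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The hand-computed values agree with scikit-learn's `precision_score`, `recall_score` and `f1_score` on the same inputs.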

In [2]:
import sys
import os
import pandas as pd
import copy

# Get the root project path
root_project_path = os.path.abspath(os.path.join(os.getcwd(), os.pardir))

#Append it to sys.path so that the utils module can be imported
sys.path.append(root_project_path)

#Import the necessary modules
from utils import DataLoader, CreditCardPreprocesser

### Load the data

In [3]:
#Set the folder name and data folder
folder_name = "data"
data_holder_path = os.path.join(os.getcwd(), os.pardir)

#Create the data loader
data_loader = DataLoader(data_folder_name=folder_name,
    data_folder_path=data_holder_path)

#Get the data
df_data = data_loader.get_dataset()

### Get the preprocessed dataframe

In [4]:
#Create an instance of the Credit card processer
credit_card_processer = CreditCardPreprocesser(df_data=df_data)

#Obtain the df_preprocessed
df_preprocessed = credit_card_processer.fetch_preprocessed_dataframe()

In [5]:
df_preprocessed.shape

(1296675, 96)

In [6]:
df_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 96 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   cc_num                   1296675 non-null  int64  
 1   amt                      1296675 non-null  float64
 2   gender                   1296675 non-null  int64  
 3   zip                      1296675 non-null  int64  
 4   lat                      1296675 non-null  float64
 5   long                     1296675 non-null  float64
 6   city_pop                 1296675 non-null  int64  
 7   unix_time                1296675 non-null  int64  
 8   merch_lat                1296675 non-null  float64
 9   merch_long               1296675 non-null  float64
 10  is_fraud                 1296675 non-null  int64  
 11  merch_zipcode            1296675 non-null  float64
 12  transaction_year         1296675 non-null  int32  
 13  transaction_month        1296675 non-null 

### Let's select the features X and the target y

In [7]:
X: pd.DataFrame = df_preprocessed[[col for col in df_preprocessed.columns if col != "is_fraud"]]
y: pd.Series = df_preprocessed["is_fraud"]  # a single column is a Series, not a DataFrame

In [8]:
y.value_counts()

is_fraud
0    1289169
1       7506
Name: count, dtype: int64
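
The counts above imply that fraud makes up well under 1% of the transactions, which is why accuracy alone is a weak metric here and why the notebook oversamples next. The arithmetic can be checked with a few lines (the counts are copied from the output above; the variable names are illustrative):

```python
# Class counts copied from y.value_counts()
n_legit = 1_289_169
n_fraud = 7_506
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the raw data
fraud_rate = n_fraud / n_total

# Accuracy of a trivial model that always predicts "not fraud"
baseline_accuracy = n_legit / n_total

print(f"fraud rate: {fraud_rate:.2%}")
print(f"accuracy of always predicting 0: {baseline_accuracy:.2%}")
```

A model that never flags fraud would already score above 99% accuracy on the raw data, so precision, recall and F1 are the more informative metrics.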

### Let's now oversample the minority class using SMOTE

In [9]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()

#Obtain the over sampled new values
X_smote, y_smote = smote.fit_resample(X.astype("float"), y)

In [10]:
y_smote.value_counts()

is_fraud
0    1289169
1    1289169
Name: count, dtype: int64
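
To make the oversampling step less of a black box, the sketch below illustrates the core idea behind SMOTE: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its nearest minority neighbours. This is a simplified NumPy illustration on made-up 2-D points, not the imbalanced-learn implementation; the helper name `synthesize` and the choice `k=2` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in 2-D (made-up values for illustration)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, k, rng):
    """Create one synthetic sample by interpolating towards a random near neighbour."""
    base = points[rng.integers(len(points))]
    # Distances from the base point to every point; after sorting, position 0 is the base itself
    dists = np.linalg.norm(points - base, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base)

# Generate a handful of synthetic minority samples
synthetic = np.array([synthesize(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two existing minority points, it always lies inside the region already spanned by the minority class.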

### Select the continuous columns:

Next we select the continuous columns and apply scikit-learn's `StandardScaler` to ONLY those features; the binary and one-hot encoded columns are left unchanged.

In [11]:
#List all the continuous features
continous_features = ["cc_num", "amt", "zip", "lat", "long", "city_pop", "unix_time",\
    "merch_lat", "merch_long", "merch_zipcode", "transaction_year", "transaction_month",\
    "transaction_day", "transaction_hour", "transaction_minute", "transaction_second",\
    "birth_year", "birth_month", "birth_day", "merchant_encoded", "merchant_freq",\
    "first_encoded", "first_freq", "last_encoded", "last_freq", "street_encoded",\
    "street_freq", "city_encoded", "city_freq", "job_encoded", "job_freq"]

#Split the columns into continuous and non-continuous features
X_smote_continous = X_smote[continous_features]
X_smote_discontinous = X_smote[[c for c in X_smote.columns if c not in continous_features]]

In [12]:
X_smote_discontinous.head(3)

Unnamed: 0,gender,category_food_dining,category_gas_transport,category_grocery_net,category_grocery_pos,category_health_fitness,category_home,category_kids_pets,category_misc_net,category_misc_pos,...,state_SD,state_TN,state_TX,state_UT,state_VA,state_VT,state_WA,state_WI,state_WV,state_WY
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Let's apply the StandardScaler to the continuous features

In [13]:
# Importing standard scaler
from sklearn.preprocessing import StandardScaler

#Create a standard scaler object and fit it on the continuous features
standard_scaler = StandardScaler()
standard_scaler.fit(X_smote_continous)

#Transform the continuous features
X_continous_scaled = standard_scaler.transform(X_smote_continous)

In [14]:
# Wrap the scaled array in a pandas DataFrame, keeping the original column names and index
X_continous_scaled = pd.DataFrame(X_continous_scaled,
    columns=X_smote_continous.columns, index=X_smote_continous.index)

In [15]:
X_continous_scaled.shape

(2578338, 31)

In [16]:
X_continous_scaled.head(3)

Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,merch_zipcode,...,first_encoded,first_freq,last_encoded,last_freq,street_encoded,street_freq,city_encoded,city_freq,job_encoded,job_freq
0,-0.314421,-0.84924,-0.730185,-0.493048,0.634479,-0.285429,-1.758447,-0.501972,0.571804,-0.748038,...,-0.081305,1.394393,-0.301495,-0.634992,-0.348338,0.579336,-0.335488,0.202155,-0.255537,-0.058785
1,-0.316509,-0.55298,1.874596,2.008311,-2.006284,-0.296081,-1.758446,2.048599,-2.0024,-0.014631,...,-0.201056,0.424444,-0.227235,-0.472664,-0.348338,1.796959,-0.335488,1.620929,-0.243254,0.686546
2,-0.31648,-0.225952,1.286889,0.698559,-1.5821,-0.283331,-1.758445,0.883023,-1.572728,1.583258,...,-0.27451,-0.748417,-0.054371,0.176166,-0.348338,-1.273834,-0.335488,-1.224102,0.113473,-1.513956
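
The scaled values above are z-scores: each column has its mean subtracted and is divided by its standard deviation, $z = (x - \mu) / \sigma$. A minimal check on a made-up toy array (the values in `X_toy` are assumptions for illustration) shows that `StandardScaler` matches the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with two columns on very different scales
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(X_toy)

# Scale by hand; StandardScaler uses the population standard deviation (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, which puts features such as `amt` and `city_pop` on a comparable scale.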


### Now let's create the final datasets

In [17]:
# These are the new datasets
X_data = pd.concat([X_continous_scaled, X_smote_discontinous], axis=1)
y_data = copy.copy(y_smote)

In [18]:
#Lets get the shapes
print(X_data.shape)
print(y_data.shape)

(2578338, 95)
(2578338,)


### Let's initialize the following classifiers
Now we initialize each of the classifiers; cross-validation is performed afterwards to obtain the different metrics.

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [20]:
ml_classifiers = {
    "Logistic Regression": LogisticRegression(random_state = 42, max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state = 42),
    "Linear Support Vector Machine": LinearSVC(C=1.0, max_iter=1000)
}

### Let's split the data here
We would like to perform the following:
1. Split our data into train_validation and test sets; the test set is held out for a final analysis.
2. Run repeated, shuffled train/validation splits on the train_validation set and record the metrics of each split.
3. Save the resulting scores for later analysis.

In [None]:
#Split the data into train_val and test
x_train_val, x_test, y_train_val, y_test = train_test_split(X_data, y_data,\
    test_size=0.2, shuffle=True, random_state=42)
print(x_train_val.shape)
print(x_test.shape)
print(y_train_val.shape)
print(y_test.shape)

(2062670, 95)
(515668, 95)
(2062670,)
(515668,)


### Perform cross-validation for all ML algorithms
Let's now perform cross-validation for each of the ML algorithms and collect its results.

In [22]:
#Holder for the metrics of every classifier
ml_metrics = {}

for name, clf in ml_classifiers.items():
    print(f"\n==========={name}============ Starting")
    #Initialize the per-classifier metric lists
    accuracies = []
    precisions = []
    recalls = []
    f1_scores = []

    #Repeat the random train/validation split 20 times
    for i in range(20):
        print(f"iteration: {i}: {name}")
        #Draw a new shuffled train/validation split
        x_train, x_val, y_train, y_val = train_test_split(x_train_val, y_train_val, shuffle=True)

        #Fit the classifier on the training part
        clf.fit(X=x_train, y=y_train)

        #Obtain the predictions on the validation part
        y_pred = clf.predict(x_val)

        #Store the metrics of this split
        accuracies.append(accuracy_score(y_val, y_pred))
        precisions.append(precision_score(y_val, y_pred))
        recalls.append(recall_score(y_val, y_pred))
        f1_scores.append(f1_score(y_val, y_pred))

    #Add the metrics of this classifier to the holder
    ml_metrics[name + "_accuracies"] = accuracies
    ml_metrics[name + "_precisions"] = precisions
    ml_metrics[name + "_recalls"] = recalls
    ml_metrics[name + "_f1_scores"] = f1_scores

    #Save all metrics collected so far to a CSV file
    df_results = pd.DataFrame(ml_metrics)
    df_results.to_csv('my_data_' + name + '.csv', index=False)
    print(f"\n==========={name}============ Ending")


iteration: 0: Logistic Regression
iteration: 1: Logistic Regression


KeyboardInterrupt: 
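
The loop above is a repeated random hold-out scheme. As a hedged alternative sketch, scikit-learn's `ShuffleSplit` together with `cross_validate` expresses the same idea in a few lines. The example runs on a small synthetic dataset from `make_classification`, and `n_splits=5` is an arbitrary choice for illustration, so it says nothing about the results on the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Small synthetic binary classification problem (stand-in for the real data)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)

# Five shuffled splits, each holding out 25% for validation
# (25% is also the default hold-out size of train_test_split)
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=42)

# Fit a fresh clone of the estimator on every split and score it four ways
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_toy,
    y_toy,
    cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
)

# One value per split for each metric
for metric in ["accuracy", "precision", "recall", "f1"]:
    print(metric, scores[f"test_{metric}"].round(3))
```

With a fixed `random_state` on the splitter, every classifier is evaluated on the same splits, which makes the collected metrics directly comparable and reproducible.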