<h1><center> Credit Card Fraud Detection </center></h1>

<b>About the project:</b>

The dataset consists of simulated credit card transactions that occurred in the United States between 2019 and 2020, taken from Kaggle.

The aim of this project is to investigate:
* What types of purchases are most likely to be fraud?
* What is the relationship between fraud and consumer demographics (age, gender, location)?
* Is it possible to predict fraudulent credit card activity?

N.B.:
All visualisations are interactive and were created with plotly.express and plotly.graph_objects.

In [1]:
# Import packages:
import pandas as pd
import numpy as np
import os
import kagglehub
import plotly.express as px
import plotly.graph_objects as go
import re
import warnings
warnings.filterwarnings('ignore')
from datetime import datetime
from scipy.stats import gaussian_kde

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from category_encoders import TargetEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.utils import resample
import time

# Import the credit card fraud detection dataset
path = kagglehub.dataset_download("kartik2112/fraud-detection")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Princess.Domingo\.cache\kagglehub\datasets\kartik2112\fraud-detection\versions\1


In [2]:
# Loop through any .csv files in the Kaggle folder
for file_name in os.listdir(path):
    
    if file_name.endswith('.csv'):
        # Define the full file path
        file_path = os.path.join(path, file_name)
        
        # Load CSV file into a DataFrame
        df = pd.read_csv(file_path)
        
        # Derive a snake_case variable name from the file name
        df_name = os.path.splitext(file_name)[0]
        df_name = re.sub(r'(?<!^)(?=[A-Z])', '_', df_name).lower()  # Convert to snake_case
        
        # Create the DataFrame with the modified name
        globals()[df_name] = df
        print(f"DataFrame '{df_name}' has been created.")

DataFrame 'fraud_test' has been created.
DataFrame 'fraud_train' has been created.
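
The loop above registers each DataFrame as a module-level variable through `globals()`. As a minimal sketch of an alternative, the frames can be kept in a dictionary keyed by the same snake_case name, which makes it explicit which datasets were loaded. The helper names `to_snake_case` and `load_csv_folder` are hypothetical, and the file names `fraudTest.csv` and `fraudTrain.csv` are assumed from the names printed above; the demo writes tiny stand-in files to a temporary folder instead of using the Kaggle download.

```python
import os
import re
import tempfile

import pandas as pd


def to_snake_case(name: str) -> str:
    """Insert an underscore before each interior capital letter, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()


def load_csv_folder(folder: str) -> dict:
    """Return {snake_case_name: DataFrame} for every .csv file in `folder`."""
    frames = {}
    for file_name in sorted(os.listdir(folder)):
        if file_name.endswith('.csv'):
            stem = os.path.splitext(file_name)[0]
            frames[to_snake_case(stem)] = pd.read_csv(os.path.join(folder, file_name))
    return frames


# Demo on a temporary folder holding two tiny stand-in files (assumed file names)
with tempfile.TemporaryDirectory() as tmp:
    for stem in ('fraudTest', 'fraudTrain'):
        pd.DataFrame({'amt': [1.0, 2.5]}).to_csv(os.path.join(tmp, f'{stem}.csv'), index=False)
    frames = load_csv_folder(tmp)

print(sorted(frames))
```

With this layout the rest of the notebook would refer to `frames['fraud_test']` and `frames['fraud_train']` rather than to variables created implicitly.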


In [3]:
def explore_metadata(df):
    print(df.info())
    print(df.head(2))
    print(df.describe())

In [4]:
explore_metadata(fraud_test)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             555719 non-null  int64  
 1   trans_date_trans_time  555719 non-null  object 
 2   cc_num                 555719 non-null  int64  
 3   merchant               555719 non-null  object 
 4   category               555719 non-null  object 
 5   amt                    555719 non-null  float64
 6   first                  555719 non-null  object 
 7   last                   555719 non-null  object 
 8   gender                 555719 non-null  object 
 9   street                 555719 non-null  object 
 10  city                   555719 non-null  object 
 11  state                  555719 non-null  object 
 12  zip                    555719 non-null  int64  
 13  lat                    555719 non-null  float64
 14  long                   555719 non-nu

In [5]:
explore_metadata(fraud_train)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [6]:
def clean_data(df, df2):
    df3 = pd.concat([df, df2], axis=0)

    df3['date'] = pd.to_datetime(df3['trans_date_trans_time'].str[:10])


    df3['name'] = df3['first'] + ' ' + df3['last']

    df3 = df3.drop(['Unnamed: 0', 'first', 'last', 'unix_time'], axis=1, errors='ignore')

    df3['month'] = df3['date'].dt.to_period('M').astype(str)

    df3['is_fraud'] = df3['is_fraud'].replace({1: 'true', 0: 'false'})

    df3['dob'] = pd.to_datetime(df3['dob'], errors='coerce')
    current_date = datetime.now()
    df3['age'] = ((current_date - df3['dob']).dt.days / 365.25).astype('Int64', errors='ignore')

    age_bins = [18, 25, 35, 45, 55, 65, 100]
    age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
    df3['age_bracket'] = pd.cut(df3['age'], bins=age_bins, labels=age_labels, right=False)

    return df3


In [7]:
fraud_data = clean_data(fraud_test, fraud_train)

In [8]:
fraud_data.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,gender,street,city,state,zip,...,dob,trans_num,merch_lat,merch_long,is_fraud,date,name,month,age,age_bracket
0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,M,351 Darlene Green,Columbia,SC,29209,...,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,33.986391,-81.200714,False,2020-06-21,Jeff Elliott,2020-06,56.659822,55-64
1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,F,3638 Marsh Union,Altonah,UT,84002,...,1990-01-17,324cc204407e99f51b0d6ca0055005e7,39.450498,-109.960431,False,2020-06-21,Joanne Williams,2020-06,34.8282,25-34
2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,9333 Valentine Point,Bellmore,NY,11710,...,1970-10-21,c81755dbbbea9d5c77f094348a7579be,40.49581,-74.196111,False,2020-06-21,Ashley Lopez,2020-06,54.069815,45-54
3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,M,32941 Krystal Mill Apt. 552,Titusville,FL,32780,...,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,28.812398,-80.883061,False,2020-06-21,Brian Williams,2020-06,37.311431,35-44
4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,M,5783 Evan Roads Apt. 465,Falmouth,MI,49632,...,1955-07-06,57ff021bd3f328f8738bb535c302a31b,44.959148,-85.884734,False,2020-06-21,Nathan Massey,2020-06,69.36345,65+


In [9]:
explore_metadata(fraud_data)

<class 'pandas.core.frame.DataFrame'>
Index: 1852394 entries, 0 to 1296674
Data columns (total 24 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   trans_date_trans_time  object        
 1   cc_num                 int64         
 2   merchant               object        
 3   category               object        
 4   amt                    float64       
 5   gender                 object        
 6   street                 object        
 7   city                   object        
 8   state                  object        
 9   zip                    int64         
 10  lat                    float64       
 11  long                   float64       
 12  city_pop               int64         
 13  job                    object        
 14  dob                    datetime64[ns]
 15  trans_num              object        
 16  merch_lat              float64       
 17  merch_long             float64       
 18  is_fraud               obje

#### How many transactions have been categorised as fraudulent?

In [10]:
def count_fraud_transactions(df):
    df2 = df.groupby('is_fraud')['cc_num'].count().reset_index(name='count')
    total_transactions = len(df)
    df2['percentage'] = ((df2['count'] / total_transactions)*100).round(1)
    return df2

fraud_count = count_fraud_transactions(fraud_data)
fraud_count.head()

Unnamed: 0,is_fraud,count,percentage
0,False,1842743,99.5
1,True,9651,0.5


> Only 0.5% of transactions recorded were fraudulent.
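
A class split this uneven matters for the modelling later on: plain accuracy says very little here. The short calculation below, using only the counts from the table above, shows the accuracy of a trivial rule that labels every transaction as legitimate. It is an illustration of the baseline, not part of the original analysis.

```python
# Counts taken from the table above
n_legit = 1_842_743
n_fraud = 9_651
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total
# A "model" that labels every transaction as legitimate is right on all legitimate rows
baseline_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Accuracy of always predicting 'not fraud': {baseline_accuracy:.2%}")
```

Because this do-nothing rule is already about 99.5% accurate, precision, recall and AUC-ROC for the fraud class are more informative than accuracy.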

In [11]:
fraud_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   is_fraud    2 non-null      object 
 1   count       2 non-null      int64  
 2   percentage  2 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 180.0+ bytes


#### What is the distribution of fraudulent purchases?

In [12]:
def dist_fraud_payments(df):
    df2 = df[df['is_fraud'] == 'true']

    # Calculate KDE for fraudulent payments
    kde = gaussian_kde(df2['amt'])
    amounts = np.linspace(df2['amt'].min(), df2['amt'].max(), 100)
    density = kde(amounts)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=amounts, y=density, mode='lines', line_color='red'))
    fig.update_layout(
        title='Distribution of Fraudulent Payment Amounts', 
        xaxis_title='Amount', 
        yaxis_title='Density'
    )
    return fig.show()

dist_fraud_payments(fraud_data)

> The KDE plot shows several peaks in fraudulent payment amounts, notably around $30 and $300, which suggests that fraudulent payments cluster around these values. Density is low between $400 and $600, and the sharpest decline occurs after $1,000. The credit card company can infer that fraudulent transactions tend to occur below $400. The recommendation is to increase checks at those price points and then monitor whether fraudulent activity at those amounts declines.
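
Peaks in a KDE curve can be located numerically instead of being read off the chart. The sketch below is a hedged illustration on synthetic data: the two clusters near $30 and $300 are invented stand-ins, not the real fraud amounts, and the 10% prominence threshold is an arbitrary assumption. It uses `scipy.signal.find_peaks` on the density evaluated over a grid.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

# Synthetic stand-in for fraudulent amounts: two clusters near $30 and $300
rng = np.random.default_rng(0)
amounts = np.concatenate([
    rng.normal(30, 10, 1000),
    rng.normal(300, 30, 1000),
])

# Evaluate the KDE on an evenly spaced grid, as in dist_fraud_payments
kde = gaussian_kde(amounts)
grid = np.linspace(amounts.min(), amounts.max(), 500)
density = kde(grid)

# Keep only peaks that stand out by at least 10% of the tallest peak
peak_idx, _ = find_peaks(density, prominence=0.1 * density.max())
peak_amounts = grid[peak_idx]

print(np.round(peak_amounts))
```

Applied to `df2['amt']` inside `dist_fraud_payments`, the same two calls would return the amounts at which the curve peaks, which can then be quoted in the commentary.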

#### Which types of purchases are typically fraudulent?

In [13]:
def find_fraudulent_transactions_by_category(df):
    df2 = df.groupby(['is_fraud', 'category'])['cc_num'].count().reset_index(name='count')

    total_category_transactions = df2.groupby('category')['count'].transform('sum')

    df2['fraud_percentage'] = ((df2['count'] / total_category_transactions) * 100).round(1)

    df3 = df2[df2['is_fraud'] == 'true']

    # Create bar plot
    fig = px.bar(df3, x='category', y='fraud_percentage', 
                 title='Percentage of Fraudulent Transactions by Category',
                 text='fraud_percentage')

    fig.update_traces(marker_color='red', texttemplate='%{text}%', textposition='inside')

    # Customize hover template to show percentage
    fig.update_traces(hovertemplate='%{x}:%{y:.1f}%')

    # Remove gridlines for a cleaner look
    fig.update_layout(
        xaxis=dict(showgrid=False),
        yaxis=dict(showgrid=False),
        yaxis_title="Fraud Percentage (%)"
    )

    # Show the figure
    fig.show()
    
# Call the function with your data
find_fraudulent_transactions_by_category(fraud_data)


> The graph shows that every category has a fraud rate below 2%, which is positive. In-store grocery shopping and online shopping have the highest shares of fraudulent payments (1.3% and 1.6% respectively). The recommendation is to invest in technology that can identify trends in fraudulent transactions in these categories and deter this type of activity.
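
The per-category fraud percentage can also be computed in one step, because the mean of a boolean column is the share of `True` values. The sketch below uses an invented toy frame (the rows are made up; only the category names and the 'true'/'false' labels follow the notebook) and is meant as an alternative formulation, not as the method used above.

```python
import pandas as pd

# Toy transactions; 'is_fraud' uses the 'true'/'false' strings produced by clean_data
toy = pd.DataFrame({
    'category': ['misc_net'] * 4 + ['grocery_pos'] * 5 + ['travel'] * 2,
    'is_fraud': ['true', 'false', 'false', 'false',
                 'true', 'false', 'false', 'false', 'false',
                 'false', 'false'],
})

fraud_rate_pct = (
    toy['is_fraud'].eq('true')       # boolean Series: True where fraudulent
       .groupby(toy['category'])     # group the booleans by category
       .mean()                       # mean of booleans = share of True
       .mul(100)
       .round(1)
       .sort_values(ascending=False)
)

print(fraud_rate_pct)
```

One difference from the filtered approach: a category with no fraudulent rows still appears here with a rate of 0.0, whereas filtering on `is_fraud == 'true'` would drop it from the chart.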

#### Which gender is associated with more fraudulent transactions?

In [14]:
def find_fraudulent_transactions_by_gender(df):
    df2 = df.groupby(['is_fraud', 'gender'])['cc_num'].count().reset_index(name='count')
    
    total_fraud_transactions = df2.groupby('is_fraud')['count'].transform('sum')

    df2['fraud_percentage'] = ((df2['count'] / total_fraud_transactions)*100).round(1)
    
    fig = px.bar(df2, x='is_fraud', y='fraud_percentage', color='gender', 
                  title='Percentage of Fraudulent Transactions by Gender', 
                  color_discrete_sequence=["#de77f7", "#65c2f7"])

    for trace in fig.data:
        gender = trace.name
        trace_data = df2[df2['gender'] == gender]
        
        trace.text = trace_data['fraud_percentage'].astype(str) + '%'
        trace.textposition = 'inside'
        trace.hovertemplate = '%{y:.1f}%'
        
    fig.update_layout(
        xaxis=dict(showgrid=False),    
        yaxis=dict(showgrid=False)    
    )

    
    return fig.show()

find_fraudulent_transactions_by_gender(fraud_data)

> Slightly more fraudulent transactions are registered to women, who account for 50.8% of all transactions categorised as fraud.

In [15]:
# Is there a relationship between seasonality and fraudulent transactions?
def find_fraudulent_transactions_by_date(df):
    df2 = df.groupby(['month', 'is_fraud'])['cc_num'].count().reset_index(name='count')
    
    total_transactions = df2.groupby('month')['count'].transform('sum')
    
    df2['fraud_percentage'] = ((df2['count'] / total_transactions) * 100).round(1)

    df3 = df2[df2['is_fraud'] == 'true']
    
    fig = px.bar(df3, x='month', y='fraud_percentage',
                  title='Monthly Fraudulent Transaction Percentage', 
                  color_discrete_sequence=['#65c2f7'])
    
    for trace in fig.data:
        trace.hovertemplate = '%{x}<br>Fraud percentage: %{y:.1f}%'

    average_percentage_month_on_month = df3['fraud_percentage'].mean().round(1)

    fig.add_trace(
        go.Scatter(
            x=df3['month'], 
            y=[average_percentage_month_on_month] * len(df3['month']),
            mode='lines',
            line=dict(dash='dash', color='red'),
            name='Monthly Average',
            hovertemplate='%{y:.1f}%'
        )
    )

    return fig.show()

find_fraudulent_transactions_by_date(fraud_data)

> There is a relationship between seasonality and the percentage of fraudulent transactions. Between January and March 2019, the fraud rate rose above the monthly average (0.6%). In the summer it declined to 0.4% and stayed at that level until September 2019. A similar trend appeared in 2020, when the rate stayed at 0.4-0.5% between June and August. The most noticeable increase occurred between December 2019 and January 2020, a rise of 0.3 percentage points month-on-month.

#### Is there a relationship between fraudulent activity and the consumer's age?

In [16]:
def fraud_transactions_by_age(df):
    df2 = df.groupby(['age_bracket', 'is_fraud'])['cc_num'].count().reset_index(name='count')
    # Create a DataFrame that stores the true values
    df3 = df2[df2['is_fraud'] == 'true']

    fig = px.bar(df3, x='age_bracket', y='count',
                  title='Fraudulent Transactions Categorized by Age', 
                  text='count')
    
    fig.update_traces(texttemplate='%{text}', textposition='inside')
    
    for trace in fig.data:
        trace.hovertemplate = '%{x}<br>Fraudulent transactions: %{y}'

    return fig.show()

age_fraud = fraud_transactions_by_age(fraud_data)

> The 18-24 bracket has the least fraudulent activity, accounting for 1.78% of all fraudulent purchases, compared with 27.5% for the 65+ bracket. It may be worth analysing the typical transaction amount in the 65+ bracket to identify which preventative measures could be applied.
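
The chart above compares raw counts, so a bracket with many transactions can show many fraudulent ones even if its fraud rate is modest. The sketch below illustrates the difference between counts and rates using the same `age_bins` and `age_labels` as `clean_data`. The ten rows are invented purely for illustration and say nothing about the real dataset.

```python
import pandas as pd

age_bins = [18, 25, 35, 45, 55, 65, 100]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']

# Invented toy data: 2 transactions by young cardholders, 8 by older ones
toy = pd.DataFrame({
    'age': [19, 22, 66, 70, 71, 75, 80, 85, 90, 95],
    'is_fraud': ['true', 'false', 'true', 'true', 'false',
                 'false', 'false', 'false', 'false', 'false'],
})
# right=False makes each bin include its lower edge, e.g. [18, 25)
toy['age_bracket'] = pd.cut(toy['age'], bins=age_bins, labels=age_labels, right=False)

# Raw counts per bracket versus the share of each bracket's transactions
counts = pd.crosstab(toy['age_bracket'], toy['is_fraud'])
rates = pd.crosstab(toy['age_bracket'], toy['is_fraud'], normalize='index').mul(100).round(1)

print(counts)
print(rates)
```

In this toy example the 65+ bracket has more fraudulent rows, but the 18-24 bracket has the higher fraud rate. Running the same two `crosstab` calls on `fraud_data` would show whether the real age pattern reflects volume or risk.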

In [17]:
fraud_data.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,gender,street,city,state,zip,...,dob,trans_num,merch_lat,merch_long,is_fraud,date,name,month,age,age_bracket
0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,M,351 Darlene Green,Columbia,SC,29209,...,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,33.986391,-81.200714,False,2020-06-21,Jeff Elliott,2020-06,56.659822,55-64
1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,F,3638 Marsh Union,Altonah,UT,84002,...,1990-01-17,324cc204407e99f51b0d6ca0055005e7,39.450498,-109.960431,False,2020-06-21,Joanne Williams,2020-06,34.8282,25-34
2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,9333 Valentine Point,Bellmore,NY,11710,...,1970-10-21,c81755dbbbea9d5c77f094348a7579be,40.49581,-74.196111,False,2020-06-21,Ashley Lopez,2020-06,54.069815,45-54
3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,M,32941 Krystal Mill Apt. 552,Titusville,FL,32780,...,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,28.812398,-80.883061,False,2020-06-21,Brian Williams,2020-06,37.311431,35-44
4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,M,5783 Evan Roads Apt. 465,Falmouth,MI,49632,...,1955-07-06,57ff021bd3f328f8738bb535c302a31b,44.959148,-85.884734,False,2020-06-21,Nathan Massey,2020-06,69.36345,65+


In [18]:
def dist_fraud_payments_plus_sixty(df):
    # Exclude any activity that is not relevant to the 65+ cohort
    df2 = df[(df['is_fraud'] == 'true') & (df['age_bracket'] == '65+')]

    # Calculate KDE for fraudulent payments
    kde = gaussian_kde(df2['amt'])
    amounts = np.linspace(df2['amt'].min(), df2['amt'].max(), 100)
    density = kde(amounts)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=amounts, y=density, mode='lines', line_color='red'))
    fig.update_layout(
        title='Distribution of Fraudulent Payment Amounts in 65+ age bracket', 
        xaxis_title='Amount', 
        yaxis_title='Density'
    )
    return fig.show()

dist_fraud_payments_plus_sixty(fraud_data)

> This KDE has fewer peaks than the one for the wider population. The notable peaks are around $300 and $900, suggesting that fraudulent payments on older cardholders' cards tend to involve larger purchases.

#### How is the fraudulent activity geographically distributed?

In [19]:

def fraud_by_state(df):
    df2 = df.groupby(['state', 'is_fraud'])['cc_num'].count().reset_index(name='count')

    df3 = df2[df2['is_fraud'] == 'true']

    # Create the choropleth map
    fig = px.choropleth(df3, 
                        locations="state", 
                        locationmode="USA-states", 
                        color="count",
                        hover_name='state', 
                        hover_data={"state": True, "count": True},  
                        color_continuous_scale=px.colors.sequential.Plasma[::-1], 
                        title="Fraudulent Activity by U.S. State")

    fig.update_traces(hovertemplate='%{location}<br>%{customdata[0]}',  
                      customdata=df3[['count']].values)

    fig.update_geos(scope="usa")

    return fig.show()

fraud_by_state(fraud_data)

> The choropleth map shows the distribution of fraudulent purchases across the United States. Fraudulent activity is most concentrated on the East Coast, namely in New York (730) and Pennsylvania (572). There are also high counts of fraudulent transactions in Texas (592) and California (402). With the increased popularity of VPNs, the recommendation is to identify whether consumers used a VPN, in order to get an accurate picture of the geographical distribution of fraudulent purchases.

In [20]:
def data_processing(df):
    # Work on a copy so the input DataFrame is left untouched and the cell can be re-run
    df = df.copy()

    # Data pre-processing: split the transaction timestamp into its components
    df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
    df['trans_year'] = df['trans_date_trans_time'].dt.year
    df['trans_month'] = df['trans_date_trans_time'].dt.month
    df['trans_day'] = df['trans_date_trans_time'].dt.day
    df['trans_hour'] = df['trans_date_trans_time'].dt.hour

    # Calculate age (in whole years) at the time of the transaction:
    df['dob'] = pd.to_datetime(df['dob'])
    df['age'] = (df['trans_date_trans_time'] - df['dob']).dt.days // 365

    # Approximate distance between the cardholder's and the merchant's location.
    # Note: this is a Euclidean distance in degrees of latitude/longitude, not kilometres.
    df['distance'] = np.sqrt((df['lat'] - df['merch_lat'])**2 + (df['long'] - df['merch_long'])**2)

    # Drop identifiers and raw columns that are not used as model features
    df.drop(columns=['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'first', 'last', 'street', 'city', 
                 'state', 'zip', 'dob', 'trans_num', 'unix_time', 'lat', 'long', 'merch_lat', 
                 'merch_long'], inplace=True)
    return df
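
The `distance` feature above is computed directly on latitude and longitude differences, so it is measured in degrees, and a degree of longitude covers less ground the further a point is from the equator. If a distance in kilometres is preferred, the haversine formula is a common choice. The function below is a sketch of that alternative, not what the notebook uses; `haversine_km` is a hypothetical name and the Earth radius of 6371 km is the usual mean-radius approximation.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# One degree of latitude versus one degree of longitude at 60 degrees north
one_deg_lat = haversine_km(0.0, 0.0, 1.0, 0.0)
one_deg_lon_at_60 = haversine_km(60.0, 0.0, 60.0, 1.0)

print(f"1 degree of latitude: {one_deg_lat:.1f} km")
print(f"1 degree of longitude at 60N: {one_deg_lon_at_60:.1f} km")
```

The function also accepts pandas Series, so inside `data_processing` it could be called as `haversine_km(df['lat'], df['long'], df['merch_lat'], df['merch_long'])` before the coordinate columns are dropped.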

In [21]:
fraud_test_df = data_processing(fraud_test)
fraud_test_df.head()

Unnamed: 0,merchant,category,amt,gender,city_pop,job,is_fraud,trans_year,trans_month,trans_day,trans_hour,age,distance
0,fraud_Kirlin and Sons,personal_care,2.86,M,333497,Mechanical engineer,0,2020,6,21,12,52,0.266004
1,fraud_Sporer-Keebler,personal_care,29.84,F,302,"Sales professional, IT",0,2020,6,21,12,30,0.991674
2,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,34496,"Librarian, public",0,2020,6,21,12,49,0.68297
3,fraud_Haley Group,misc_pos,60.05,M,54767,Set designer,0,2020,6,21,12,32,0.250985
4,fraud_Johnston-Casper,travel,3.19,M,1126,Furniture designer,0,2020,6,21,12,65,1.118816


In [22]:
fraud_train_df = data_processing(fraud_train)
fraud_train_df.head()

Unnamed: 0,merchant,category,amt,gender,city_pop,job,is_fraud,trans_year,trans_month,trans_day,trans_hour,age,distance
0,"fraud_Rippin, Kub and Mann",misc_net,4.97,F,3495,"Psychologist, counselling",0,2019,1,1,0,30,0.87283
1,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,F,149,Special educational needs teacher,0,2019,1,1,0,40,0.27231
2,fraud_Lind-Buckridge,entertainment,220.11,M,4154,Nature conservation officer,0,2019,1,1,0,56,0.975845
3,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,M,1939,Patent attorney,0,2019,1,1,0,52,0.919802
4,fraud_Keeling-Crist,misc_pos,41.96,M,99,Dance movement psychotherapist,0,2019,1,1,0,32,0.868505


In [23]:
# 1. Downsample Majority Class for Training
def downsample_data(df):
    majority = df[df['is_fraud'] == 0]
    minority = df[df['is_fraud'] == 1]

    majority_downsampled = resample(majority, 
                                    replace=False, 
                                    n_samples=len(minority), 
                                    random_state=42)
    return pd.concat([majority_downsampled, minority]).sample(frac=1, random_state=42)

# Downsample fraud_train_df
fraud_train_downsampled = downsample_data(fraud_train_df)

# 2. Define Feature Columns
numeric_features = ['amt', 'city_pop', 'age', 'distance', 'trans_year', 'trans_month', 'trans_day', 'trans_hour']
categorical_features = ['category', 'gender', 'job']

# 3. Preprocessing for Features
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('encoder', TargetEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# 4. Split Training Data into Train and Validation Sets
X_train = fraud_train_downsampled.drop(columns=['is_fraud'])
y_train = fraud_train_downsampled['is_fraud']
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

X_test = fraud_test_df.drop(columns=['is_fraud'])
y_test = fraud_test_df['is_fraud']

# 5. Define and Train Optimized Random Forest Model
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(
                          n_estimators=50, 
                          max_depth=10, 
                          max_features='sqrt', 
                          random_state=42, 
                          n_jobs=-1))])

start_time = time.time()
clf.fit(X_train, y_train)
print("Training Time: {:.2f} seconds".format(time.time() - start_time))

# 6. Evaluate on Validation Data
y_val_pred = clf.predict(X_val)
y_val_pred_proba = clf.predict_proba(X_val)[:, 1]

print("\nValidation Metrics:")
print("Classification Report:\n", classification_report(y_val, y_val_pred))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred))
print("AUC-ROC Score:", roc_auc_score(y_val, y_val_pred_proba))

# 7. Evaluate on Test Data
y_test_pred = clf.predict(X_test)
y_test_pred_proba = clf.predict_proba(X_test)[:, 1]

print("\nTest Metrics:")
print("Classification Report:\n", classification_report(y_test, y_test_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))
print("AUC-ROC Score:", roc_auc_score(y_test, y_test_pred_proba))


Training Time: 0.41 seconds

Validation Metrics:
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.97      0.96      1495
           1       0.97      0.96      0.96      1508

    accuracy                           0.96      3003
   macro avg       0.96      0.96      0.96      3003
weighted avg       0.96      0.96      0.96      3003

Confusion Matrix:
 [[1444   51]
 [  58 1450]]
AUC-ROC Score: 0.9948994437692396

Test Metrics:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.96      0.98    553574
           1       0.09      0.93      0.17      2145

    accuracy                           0.96    555719
   macro avg       0.55      0.95      0.57    555719
weighted avg       1.00      0.96      0.98    555719

Confusion Matrix:
 [[533733  19841]
 [   156   1989]]
AUC-ROC Score: 0.9868901640328769


> <b> Validation Metrics:</b>

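* The model's precision is high on the validation set (96% for non-fraudulent transactions and 97% for fraudulent transactions), and false positives were low (51 of 1,495 non-fraudulent transactions). An AUC-ROC score close to 1 (0.995) suggests that the model distinguishes between fraud and non-fraud very well. Note that the validation set comes from the downsampled training data, so it is roughly balanced (1,495 non-fraud vs 1,508 fraud) and does not reflect the real fraud rate.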




> <b> Test Metrics:</b>

* While the model is nearly perfect at identifying non-fraudulent activity (precision 100%, recall 96%, F1-score 98%), precision for fraud cases is very low (9%): of the 21,830 transactions flagged as fraud, only 1,989 were actually fraudulent. Recall for fraud remains high (93%), so the low F1-score (17%) is driven by false positives. The model needs improvement before it can identify fraudulent payments without flagging large numbers of legitimate ones.




* The exploratory analysis above can be used to decide which types of transactions, such as online shopping, should go through more stringent screening.
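
To make the gap between the validation and test figures concrete, the sketch below recomputes precision and recall from the test confusion matrix printed above, then estimates what precision the same true-positive and false-positive rates would give on a 50/50 class split. `precision_at_prevalence` is a hypothetical helper written for this illustration; it assumes the model's error rates stay fixed while only the share of fraud changes.

```python
# Test-set confusion matrix reported above: rows = actual, columns = predicted
tn, fp = 533_733, 19_841
fn, tp = 156, 1_989

precision = tp / (tp + fp)            # share of flagged transactions that are fraud
recall = tp / (tp + fn)               # share of fraud that gets flagged (true positive rate)
false_positive_rate = fp / (fp + tn)  # share of legitimate transactions that get flagged


def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision for fixed TPR/FPR when `prevalence` is the fraud share."""
    flagged_fraud = tpr * prevalence
    flagged_legit = fpr * (1 - prevalence)
    return flagged_fraud / (flagged_fraud + flagged_legit)


test_prevalence = (tp + fn) / (tn + fp + fn + tp)
precision_test = precision_at_prevalence(recall, false_positive_rate, test_prevalence)
precision_balanced = precision_at_prevalence(recall, false_positive_rate, 0.5)

print(f"precision={precision:.3f} recall={recall:.3f} fpr={false_positive_rate:.3f}")
print(f"precision at test prevalence: {precision_test:.3f}")
print(f"precision at 50/50 prevalence: {precision_balanced:.3f}")
```

On a balanced split the same error rates give a precision of roughly 0.96, close to the validation figure. This suggests the drop to 9% on the test set comes mainly from how rare fraud is there, not from the model behaving differently, which is why evaluating on data with the real class balance matters.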

