# Fraud detection using Machine Learning

# This project builds several models to classify credit card transactions as fraudulent or non-fraudulent.
# This is the first version of the code and the main version for the project.

# Models Used:
# - Logistic Regression
# - Random Forest
# - XGBoost
# - Decision Tree

# Evaluation Metrics:
# - ROC AUC Score (primary metric)
# - Precision, Recall, F1-Score

In [None]:
#loading the data
import pandas as pd
data = pd.read_csv('creditcard.csv')

In [None]:
#view the data
data.head()

In [None]:
#information about the data
data.info()

In [None]:
#check missing values
print(data.isnull().sum())

In [None]:
#check the class balance and visualise it
import matplotlib.pyplot as plt
import seaborn as sns

print(data["Class"].value_counts())

sns.countplot(x="Class", data=data)
plt.title("Fraud transactions vs non-fraud transactions")
plt.show()
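
# With so few fraud cases, accuracy on its own says very little: a model that labels every transaction as non-fraud is almost always right. The sketch below illustrates this using the class counts commonly quoted for the Kaggle credit card dataset (284,807 rows, 492 fraud). These counts are an assumption here and should be swapped for the value_counts() output above.

```python
# Assumed class counts (placeholders; use data["Class"].value_counts() for the real ones)
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

# Share of fraud rows in the whole dataset
fraud_share = n_fraud / n_total

# A trivial classifier that always predicts non-fraud is correct on every legitimate row...
baseline_accuracy = n_legit / n_total
# ...but it never flags a fraud case, so its recall on the fraud class is zero
baseline_recall = 0 / n_fraud

print(f"Fraud share: {fraud_share:.4%}")
print(f"Accuracy of always predicting non-fraud: {baseline_accuracy:.4%}")
print(f"Fraud recall of that baseline: {baseline_recall:.0%}")
```

# Any model worth keeping has to beat this baseline on recall and precision, not on accuracy.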

In [None]:
#preprocess the data
#I can't see the significance of Time, so I'm dropping the Time column
data = data.drop(["Time"], axis=1)

#scale the Amount column using StandardScaler
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
data["Amount"] = scale.fit_transform(data[["Amount"]])
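
# Note that fit_transform above sees every row, including rows that later end up in the test set, so test-set statistics influence the scaling. A minimal leakage-free sketch, assuming synthetic stand-in data (the amount and labels arrays below are made up for illustration), fits the scaler on the training split only and reuses it on the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Amount column and the Class labels (hypothetical values)
rng = np.random.default_rng(42)
amount = rng.exponential(scale=80.0, size=(1000, 1))
labels = np.array([1] * 20 + [0] * 980)

# Split first, so the test rows stay unseen during fitting
amount_train, amount_test, labels_train, labels_test = train_test_split(
    amount, labels, random_state=42, stratify=labels
)

# Learn mean and standard deviation from the training rows only
amount_scaler = StandardScaler()
amount_train_scaled = amount_scaler.fit_transform(amount_train)

# Apply the training statistics to the test rows without refitting
amount_test_scaled = amount_scaler.transform(amount_test)

print("Mean learned from training rows:", amount_scaler.mean_)
print("Scaled training mean:", amount_train_scaled.mean())
print("Scaled test mean:", amount_test_scaled.mean())
```

# In this notebook that would mean splitting first and then scaling the Amount column of x_train and x_test separately. For a single column the effect is usually small, but the habit matters in larger pipelines.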

In [None]:
#define the data
x = data.drop(["Class"], axis = 1)
y = data["Class"]

In [None]:
#split the data (default test_size=0.25); stratify=y keeps the fraud ratio the same in both splits
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, stratify=y)
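
# Stratification preserves the imbalance in both splits rather than removing it, so the models still see very few fraud rows. The Logistic regression model in the next cell compensates with class_weight="balanced", which scikit-learn documents as n_samples / (n_classes * count_of_class). A small pure-Python sketch of that formula on made-up labels (balanced_class_weights is a hypothetical helper name):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels with a 0.2% positive rate (illustrative only, not the real data)
toy_labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(toy_labels)

print(weights)
```

# The rare class gets a weight several hundred times larger than the common one, so each misclassified fraud row costs the model far more during training.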

In [None]:
#approach 1: train and evaluate every model in one loop
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report

models = {"Logistic regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
          "Random forest": RandomForestClassifier(),
          "XGBoost": XGBClassifier(),
          "Decision tree": DecisionTreeClassifier()}
for modelname, model in models.items():
    print(f"{modelname.upper()} RESULTS:")
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    y_proba = model.predict_proba(x_test)[:, 1]  #predicted probability of the fraud class
    accuracy = accuracy_score(y_test, y_predict)
    roc_auc = roc_auc_score(y_test, y_proba)  #ROC AUC is used because accuracy is misleading on imbalanced data
    cm = confusion_matrix(y_test, y_predict)
    print(f"Accuracy = {accuracy}\nROC AUC score = {roc_auc}\nConfusion matrix =\n{cm}")

    #heat map to visualize the confusion matrix
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f"{modelname} Confusion matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()

    #show the other evaluation metrics (precision, recall, F1-score)
    print(f"{modelname} Classification report:")
    print(classification_report(y_test, y_predict))
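
# The fraud-class numbers in the classification report follow directly from the confusion matrix. The sketch below works through the arithmetic on a made-up matrix in scikit-learn's layout (rows = actual, columns = predicted); the counts are illustrative and are not results from this notebook.

```python
# Hypothetical confusion matrix: [[true negatives, false positives],
#                                 [false negatives, true positives]]
cm_example = [[9_950, 10],
              [10, 30]]
(tn, fp), (fn, tp) = cm_example

# Precision: of the transactions flagged as fraud, how many really were fraud
precision = tp / (tp + fp)
# Recall: of the real fraud transactions, how many were flagged
recall = tp / (tp + fn)
# F1-score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: share of all transactions classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

# Accuracy stays close to 1 here even though a quarter of the fraud cases are missed, which is why the fraud-class precision and recall are the numbers to watch.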

In [None]:
#drawing graphs for roc
from sklearn.metrics import roc_curve
for modelname,model in models.items():
    x_axis_fp, y_axis_tp, _ = roc_curve(y_test, model.predict_proba(x_test)[:,1])
    plt.plot( x_axis_fp, y_axis_tp, label = modelname)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC Curve for all models")
plt.legend()
plt.grid(True)
plt.show()
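
# The area under each of these curves can be read as the probability that a randomly chosen fraud transaction receives a higher score than a randomly chosen non-fraud one, with ties counted as half. The helper below (pairwise_auc, a hypothetical name) is a brute-force illustration on toy data, not a replacement for roc_auc_score:

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy labels and scores (illustrative only)
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"Pairwise AUC on toy data: {toy_auc}")
```

# Because the score depends only on how the transactions are ranked, it does not change with the decision threshold, unlike precision and recall.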

# Conclusion
# In this project, I built a machine learning fraud detection system using the credit card fraud dataset from Kaggle.
# The dataset presented a problem due to its extreme class imbalance, with fraud transactions being less than 0.2% of the entire data.
# I trained four models: Logistic regression, Decision tree, Random forest and XGBoost.
# To keep the evaluation representative despite the extreme imbalance, I used a stratified train-test split so that both the training and test sets have the same proportion of fraud and non-fraud cases as the original dataset.
# To handle the imbalance further, I applied class weighting to the Logistic regression model.
# The models were assessed primarily with the ROC-AUC score, alongside precision, recall and F1-score, because accuracy alone is misleading on such a highly imbalanced dataset.
# Among the models, XGBoost and Random Forest performed strongly, achieving high ROC-AUC scores with balanced precision and recall. Logistic regression with balanced class weights was also competitive and has the added benefit of interpretability.
# Despite these strong results, no model detected fraud perfectly. For instance, the best models correctly identified approximately 74% of fraudulent transactions, which is good but still leaves some fraud undetected. This highlights the trade-off between catching fraud and avoiding false alarms in real-world applications.
# Future work will explore additional techniques to improve performance further.
# Overall, this small project shows that machine learning can provide a valuable foundation for real-time fraud prevention systems.
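
# The trade-off between catching fraud and avoiding false alarms can be made concrete by moving the decision threshold applied to the predicted probabilities. The sketch below uses made-up labels and scores, and precision_recall_at is a hypothetical helper; it is an illustration of the idea rather than part of the analysis above.

```python
def precision_recall_at(threshold, labels, scores):
    """Precision and recall of the positive class when flagging scores >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data: six non-fraud and four fraud transactions with illustrative scores
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.80, 0.95]

for threshold in (0.25, 0.50, 0.75):
    precision, recall = precision_recall_at(threshold, toy_labels, toy_scores)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

# Lowering the threshold catches more fraud at the cost of more false alarms, and raising it does the opposite, so the right setting depends on how costly each kind of error is.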