# FraudSense — Model Comparison (Google Colab)

This Colab notebook trains and **compares four models** (Logistic Regression, Decision Tree, Random Forest, XGBoost) on the **Kaggle 2013 Credit Card Fraud** dataset (`creditcard.csv`) and reports **Accuracy**, **Recall**, and **F1-score** for each model.  

**How to use in Colab**
1. Upload `creditcard.csv` to the Colab session (use the upload cell below).
2. Run all cells sequentially.

The notebook will:
- show basic EDA (class distribution, amount histogram),
- preprocess and optionally balance the data with SMOTE,
- train each model,
- compute and display metrics,
- and plot a comparison chart for Accuracy, Recall, and F1.
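
Fraud is a small minority of the transactions in this dataset, so Accuracy on its own can look excellent for a model that never flags fraud. The sketch below works through the arithmetic on made-up labels (1,000 transactions, 2 of them fraudulent; these numbers are purely illustrative and do not come from `creditcard.csv`). `binary_metrics` is a hypothetical helper written for this illustration; the notebook itself uses scikit-learn's metric functions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall and F1 for 0/1 labels, treating 1 (fraud) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Illustrative labels only: 998 legitimate transactions followed by 2 frauds.
y_true = [0] * 998 + [1, 1]

# A "model" that never flags fraud.
always_legit = [0] * 1000

# A model with one false alarm that catches one of the two frauds.
catches_one = [0] * 997 + [1] + [1, 0]

for label, y_pred in [("always_legit", always_legit), ("catches_one", catches_one)]:
    print(label, binary_metrics(y_true, y_pred))
```

Both sets of predictions get the same number of transactions right, so their Accuracy is identical; only Recall and F1 separate them.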


In [None]:
# Install required packages (run once in Colab)
!pip install -q imbalanced-learn xgboost joblib


In [None]:
# Upload dataset (click the file chooser) OR mount Drive and set DATA_PATH accordingly.
from google.colab import files
import os

DATA_PATH = "creditcard.csv"

if not os.path.exists(DATA_PATH):
    print("No local creditcard.csv found — please upload (file chooser will open).")
    uploaded = files.upload()
    for fn in uploaded.keys():
        print("Uploaded:", fn)
else:
    print("Found local", DATA_PATH)

# If you prefer to mount Google Drive:
# from google.colab import drive
# drive.mount('/content/drive')
# DATA_PATH = '/content/drive/MyDrive/path/to/creditcard.csv'


In [None]:
# Imports and helper functions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 42
TEST_SIZE = 0.2
USE_SMOTE = True  # set False to disable SMOTE


In [None]:
# Load dataset (DATA_PATH was set in the upload cell; it is not reset here, so a Drive path is kept)
df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
df.head()

In [None]:
# Basic EDA
print("Class distribution:")
print(df['Class'].value_counts())
print("\nPercentage of frauds:")
print(df['Class'].value_counts(normalize=True) * 100)

# Plot class distribution and Amount histogram
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
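# sort_index() pins the order to class 0 (legit), then class 1 (fraud), so the labels always match
plt.bar(['Legit','Fraud'], df['Class'].value_counts().sort_index().values)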
plt.title('Class distribution (counts)')
plt.subplot(1,2,2)
plt.hist(df['Amount'], bins=50)
plt.title('Transaction Amount distribution')
plt.tight_layout()
plt.show()

In [None]:
# Prepare features and target
X = df.drop(columns=['Class'])
y = df['Class']

# Keep feature column order for later
feature_columns = X.columns.tolist()

# Train-test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
                                                    stratify=y, random_state=RANDOM_STATE)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

In [None]:
# Scale numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Optionally apply SMOTE to balance training set
if USE_SMOTE:
    # n_jobs is omitted: recent imbalanced-learn releases no longer accept it in SMOTE
    sm = SMOTE(random_state=RANDOM_STATE)
    X_train_bal, y_train_bal = sm.fit_resample(X_train_scaled, y_train)
    print("After SMOTE, counts:", pd.Series(y_train_bal).value_counts().to_dict())
else:
    X_train_bal, y_train_bal = X_train_scaled, y_train



In [None]:
# Define models
models = {
    'LogisticRegression': LogisticRegression(max_iter=2000, class_weight='balanced', random_state=RANDOM_STATE),
    'DecisionTree': DecisionTreeClassifier(class_weight='balanced', random_state=RANDOM_STATE),
    'RandomForest': RandomForestClassifier(n_estimators=200, class_weight='balanced', n_jobs=-1, random_state=RANDOM_STATE),
    # use_label_encoder is dropped: it is deprecated and unused in current XGBoost releases
    'XGBoost': XGBClassifier(eval_metric='logloss', n_jobs=-1, random_state=RANDOM_STATE)
}

results = []

for name, model in models.items():
    print(f"Training {name} ...")
    model.fit(X_train_bal, y_train_bal)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    results.append({'model': name, 'accuracy': acc, 'recall': rec, 'f1': f1})
    print(f"{name} -> Accuracy: {acc:.4f}, Recall: {rec:.4f}, F1: {f1:.4f}")
    print(classification_report(y_test, y_pred, digits=4))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("-"*50)

metrics_df = pd.DataFrame(results).set_index('model')
metrics_df

In [None]:
# Plot comparison bar charts
metrics_df = metrics_df.sort_values('f1', ascending=False)
metrics_df_plot = metrics_df[['accuracy','recall','f1']]

metrics_df_plot.plot(kind='bar', figsize=(10,6))
plt.title('Model comparison: Accuracy / Recall / F1')
plt.ylabel('Score')
plt.ylim(0,1)
plt.grid(axis='y', linestyle='--', alpha=0.4)
plt.show()

In [None]:
# Save best model by F1
import joblib, os
best_name = metrics_df['f1'].idxmax()
best_f1 = metrics_df['f1'].max()
print("Best model by F1:", best_name, best_f1)

best_model = models[best_name]
os.makedirs('models', exist_ok=True)
joblib.dump({'model_name': best_name, 'model': best_model, 'scaler': scaler, 'feature_columns': feature_columns}, 'models/final_model_colab.pkl')
print("Saved to models/final_model_colab.pkl")

## Conclusion

This notebook compared four models (Logistic Regression, Decision Tree, Random Forest, XGBoost) on the 2013 credit card fraud dataset.  
- The metrics table above shows **Accuracy**, **Recall**, and **F1** for each model.  
- We saved the best model by **F1**, together with the fitted scaler and the feature column order, as `models/final_model_colab.pkl` for quick deployment.
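
As a minimal sketch of how the saved bundle might be used for scoring, the helper below selects the stored `feature_columns` in training order, applies the stored scaler, and returns the model's fraud probabilities. `score_transactions` is a hypothetical helper name, not part of the notebook, and the sketch assumes new transactions arrive as a DataFrame with the same feature columns as `creditcard.csv`.

```python
import os

import joblib
import pandas as pd

BUNDLE_PATH = "models/final_model_colab.pkl"  # written by the save cell
DATA_CSV = "creditcard.csv"                   # used here only to borrow a few sample rows


def score_transactions(bundle, transactions):
    """Return the fraud probability for each row of `transactions`.

    `bundle` is the dict saved by the notebook, with the keys
    'model', 'scaler' and 'feature_columns'.
    """
    # Select the training columns in training order; extra columns such as
    # 'Class' are ignored, and a missing column raises a KeyError.
    features = transactions[bundle["feature_columns"]]
    scaled = bundle["scaler"].transform(features)
    # Column 1 of predict_proba is the probability of the positive (fraud) class.
    return bundle["model"].predict_proba(scaled)[:, 1]


# Demo, run only when both files from the notebook are present.
if os.path.exists(BUNDLE_PATH) and os.path.exists(DATA_CSV):
    bundle = joblib.load(BUNDLE_PATH)
    sample = pd.read_csv(DATA_CSV, nrows=5)
    print(bundle["model_name"], score_transactions(bundle, sample))
```

Keeping the scaler and column order in the same file as the model means a serving process does not have to reconstruct the preprocessing separately.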

You can now:
- Use the saved model in a Flask app or API.
- Tune hyperparameters or run cross-validation for improved performance.
- Add more visualizations (ROC curve, precision-recall curve) if needed.
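
For the cross-validation suggestion, one possible starting point is sketched below. It assumes a synthetic dataset from `make_classification` as a stand-in for `creditcard.csv`, and it uses `class_weight='balanced'` rather than SMOTE to handle the imbalance. The scaler sits inside the pipeline so that each fold fits it on that fold's training part only. If SMOTE is wanted, the pipeline class from imbalanced-learn (`imblearn.pipeline.Pipeline`) accepts samplers and could take the place of the scikit-learn pipeline here, so that resampling also happens within each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Scaling is part of the pipeline, so it is refit on every training fold.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, class_weight="balanced", random_state=42),
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(pipeline, X_demo, y_demo, cv=cv, scoring=["recall", "f1"])

for metric in ("recall", "f1"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.4f}, std={fold_scores.std():.4f}")
```

The spread across folds gives a rough sense of how much a single train/test split, like the one used in this notebook, could vary.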
