# Real-Time Fraud Detection Pipeline

This notebook demonstrates the end-to-end pipeline for a real-time fraud detection system. It covers:
1. Data Ingestion: Download and load raw transaction data from Kaggle.
2. Data Transformation: Clean the data and engineer additional features.
3. Model Training: Train a fraud detection model (Logistic Regression in this example).
4. Model Evaluation: Evaluate the model's performance on test data.
5. Prediction: Run a sample prediction using the trained model.
6. API Testing: Send HTTP requests to the FastAPI endpoints for prediction and history retrieval.

Ensure that your environment is set up (e.g., with a fresh virtual environment and required packages installed) before running the notebook.

In [1]:
import sys
import os
# Add the parent directory (which contains data_pipeline, model, api, etc.) to the Python path.
sys.path.append(os.path.abspath(".."))

## Data Ingestion

In [2]:
# Ingest the raw data (downloads if not already available)
from data_pipeline.ingest import ingest_data

df_raw = ingest_data()
print("Raw data preview:")
print(df_raw.head())

Raw data preview:
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26   

## Data Transformation

In [3]:
# Transform the raw data and save the transformed CSV
from data_pipeline.transform import transform_data

input_csv = "data/raw_transactions.csv"
output_csv = "data/transformed_data.csv"
df_transformed = transform_data(input_csv, output_csv)
print("Transformed data preview:")
print(df_transformed.head())

Transformed data saved to data/transformed_data.csv.
Transformed data preview:
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V23       V24       V25       V26       V27  \
0  0.098698  0.363787  ... -0.110474  0.066928  0.128539 -0.189115  0.133558   
1  0.085102 -0.255425  ...  0.101288 -0.339846  0.167170  0.125895 -0.008983   
2  0.247676 -1.514654  ...  0.909412 -0.689281 -0.327642 -0.139097 -0.055353   
3  0.377436 -1.387024  ... -0.190321 -1.175575  0.647376 -0.221929  0.062723   
4 -0.270533  0.817739  ... -0.1374

## Model Training

In [4]:
# Train the fraud detection model using the transformed data.
from model.train import train_model

trained_model = train_model("data/transformed_data.csv")

# The model is saved as 'model/fraud_model.pkl' after training.

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.83      0.61      0.71        98

    accuracy                           1.00     56962
   macro avg       0.92      0.81      0.85     56962
weighted avg       1.00      1.00      1.00     56962

Model saved to model/fraud_model.pkl.


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
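
The warning above indicates that the solver stopped at its iteration limit before converging. A minimal sketch of the two remedies it suggests, scaling the features and raising `max_iter`, follows. It assumes scikit-learn's `StandardScaler` and `LogisticRegression` combined in a pipeline and runs on synthetic data; the contents of `model/train.py` are not shown in this notebook, so treat this as an illustration rather than the project's implementation.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): three columns on very different
# scales, which is the kind of input that slows down the lbfgs solver.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 1.0, 2000),        # small-scale feature
    rng.uniform(0.0, 170000.0, 2000),  # large-scale feature
    rng.exponential(90.0, 2000),       # skewed feature
])
y = (X[:, 0] + 0.01 * X[:, 2] + rng.normal(0.0, 0.5, 2000) > 2.0).astype(int)

# Standardize inside the pipeline so the same scaling is reused at predict time.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed.
    warnings.simplefilter("error", category=ConvergenceWarning)
    scaled_model.fit(X, y)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_
print("lbfgs iterations used:", n_iter)
```

Keeping the scaler in the same pipeline as the classifier means a pickled model would carry its preprocessing with it, so callers could pass raw feature values.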


## Model Evaluation

In [5]:
from model.evaluate import evaluate_model

# Evaluate the model on the full transformed dataset. This includes the rows used for
# training, so the scores are optimistic; use a separate test CSV if available.
evaluation_report = evaluate_model("model/fraud_model.pkl", "data/transformed_data.csv")
print(evaluation_report)

Evaluation Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    284315
           1       0.87      0.70      0.77       492

    accuracy                           1.00    284807
   macro avg       0.93      0.85      0.89    284807
weighted avg       1.00      1.00      1.00    284807
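
Because fraud cases are rare (492 of 284,807 rows in the report above), accuracy stays close to 1.00 regardless of how well fraud is caught. The sketch below shows one way to score only held-out rows and to report average precision, a summary of the precision-recall curve that is often more informative for rare positives. It uses synthetic data and standard scikit-learn calls; it is not the code in `model/evaluate.py`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced stand-in data (about 2% positives).
rng = np.random.default_rng(42)
n = 5000
y = (rng.random(n) < 0.02).astype(int)
X = rng.normal(0.0, 1.0, size=(n, 5))
X[y == 1] += 1.5  # shift the positive class so it is learnable

# Stratify so the rare class appears in both splits at the same rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score only rows the model has never seen.
test_scores = clf.predict_proba(X_test)[:, 1]
ap = average_precision_score(y_test, test_scores)

print(classification_report(y_test, clf.predict(X_test), zero_division=0))
print("Average precision:", round(ap, 3))
```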



## Model Prediction

In [6]:
from model.predict import predict

# Create a sample feature vector. Note: the model was trained on all columns except "Class".
# You must supply a list of numeric values with the same length as the training features.
# Here, we assume a dummy vector of zeros (adjust as necessary):
sample_features = [0.0] * (len(df_transformed.columns) - 1)  # minus target column "Class"

fraud_probability = predict("model/fraud_model.pkl", sample_features)
print("Sample Fraud Probability:", fraud_probability)

Sample Fraud Probability: 0.022274062758189307
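
A vector of zeros only checks that the call works. To score a real transaction, the feature list can be built from a row of the transformed data with the target column removed. The sketch below uses a tiny stand-in frame (`df_example`, a hypothetical name) with a few of the columns shown earlier; the `Class` values are placeholders, and the real `df_transformed` has many more columns.

```python
import pandas as pd

# Tiny stand-in for df_transformed (an assumption for illustration).
df_example = pd.DataFrame(
    {
        "Time": [0.0, 0.0],
        "V1": [-1.359807, 1.191857],
        "V2": [-0.072781, 0.266151],
        "Class": [0, 0],  # placeholder labels
    }
)

# Keep every column except the target, in the original column order.
feature_columns = [c for c in df_example.columns if c != "Class"]

# Take the first row and convert its values to plain Python floats.
row_features = [float(v) for v in df_example.loc[0, feature_columns]]

print(feature_columns)
print(row_features)
```

The resulting list could then be passed to `predict` in place of `sample_features`, provided the column order matches the one used during training.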




## API Testing

In [None]:
%pip install requests
import requests
import json

# Base URL for the API. The FastAPI server must already be running, e.g.:
#   uvicorn api.app:app --host 0.0.0.0 --port 8000
API_URL = 'http://localhost:8000'
API_KEY = 'secret-key'  # placeholder; must match the key the server expects
headers = {"x-api-key": API_KEY}

print("Testing /predict endpoint...")
predict_endpoint = f"{API_URL}/predict"

# Replace with actual features from your transformed data; here we use the same dummy vector
payload = {
    "features": sample_features,
    "transaction_id": "txn_001"
}

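# A timeout keeps the cell from hanging indefinitely if the server is not reachable.
response = requests.post(predict_endpoint, json=payload, headers=headers, timeout=10)
print("Prediction API Response (status", response.status_code, "):")
print(json.dumps(response.json(), indent=2))

print("\nTesting /history endpoint...")
history_endpoint = f"{API_URL}/history/txn_001"
response_history = requests.get(history_endpoint, headers=headers, timeout=10)
print("History API Response (status", response_history.status_code, "):")
print(json.dumps(response_history.json(), indent=2))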
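
Before sending real feature vectors to `/predict`, it can help to validate the payload on the client side. The helper below is a hypothetical sketch (`build_predict_payload` is not part of this project) that checks the vector length and rejects NaN or infinite values, which are not valid JSON numbers. The expected length of 5 in the usage line is arbitrary; in practice it would be the number of training features.

```python
import math
from typing import Sequence


def build_predict_payload(
    features: Sequence[float], transaction_id: str, expected_len: int
) -> dict:
    """Validate inputs and return a JSON-serializable payload (hypothetical helper)."""
    if len(features) != expected_len:
        raise ValueError(f"expected {expected_len} features, got {len(features)}")
    cleaned = [float(v) for v in features]
    if not all(math.isfinite(v) for v in cleaned):
        # NaN and infinity cannot be represented as JSON numbers.
        raise ValueError("features must be finite numbers")
    if not transaction_id:
        raise ValueError("transaction_id must be a non-empty string")
    return {"features": cleaned, "transaction_id": transaction_id}


# Usage with an arbitrary five-element vector.
payload_example = build_predict_payload([0.0] * 5, "txn_001", expected_len=5)
print(payload_example["transaction_id"], len(payload_example["features"]))
```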