# 🔒 FraudGuard Training Notebook

**AD-RL-GNN Fraud Detection** | Full training pipeline with mini-batch processing

This notebook trains the FraudGuard model on the IEEE-CIS fraud detection dataset using:
- **NeighborLoader** for memory-efficient mini-batch training
- **FAISS** for similarity graph construction (GPU if available, CPU fallback)
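- **FocalLoss** for class-imbalanced learning (note: the training cell in section 5 takes `trainer.criterion`, which its comment describes as weighted CrossEntropy with a 15x weight)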
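
The list above names FocalLoss, while the training cell later describes its criterion as weighted CrossEntropy (15x). As a rough illustration of how the two differ on a single prediction, here is a minimal pure-Python sketch. The `gamma=2.0` and `alpha=0.25` defaults are the commonly cited focal-loss values, not settings taken from this repository, and both function names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def weighted_ce(p_fraud: float, label: int, fraud_weight: float = 15.0) -> float:
    """Class-weighted cross-entropy for a single binary prediction."""
    p_true = p_fraud if label == 1 else 1.0 - p_fraud
    weight = fraud_weight if label == 1 else 1.0
    return -weight * math.log(max(p_true, EPS))


def focal_loss(p_fraud: float, label: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_true = p_fraud if label == 1 else 1.0 - p_fraud
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_true) ** gamma * math.log(max(p_true, EPS))


# Compare an easy legitimate transaction with a fraud case the model misses.
for name, p, y in [("easy legit", 0.05, 0), ("missed fraud", 0.10, 1)]:
    print(f"{name:>12}: weighted CE = {weighted_ce(p, y):.4f}, focal = {focal_loss(p, y):.4f}")
```

Weighted cross-entropy scales every fraud example by the same factor, whereas focal loss shrinks the contribution of examples the model already classifies confidently.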

**Target Metrics:**
- Specificity: 98.72%
- G-Means Improvement: 18.11%
- P95 Latency: <100ms
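
For reference, the metrics above can be computed from confusion-matrix counts and a list of latency samples. The sketch below uses only the standard library; the counts and helper names are hypothetical, and the percentile uses linear interpolation, which is the default behaviour of `np.percentile` used later in the notebook.

```python
import math


def imbalance_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Specificity (TNR), recall (TPR) and their geometric mean."""
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "specificity": specificity,
        "recall": recall,
        "g_means": math.sqrt(specificity * recall),
    }


def percentile(values, q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical counts, chosen only to exercise the functions.
metrics = imbalance_metrics(tn=980, fp=20, fn=30, tp=70)
print({name: round(value, 4) for name, value in metrics.items()})
print(percentile(range(1, 101), 95))
```

G-Means rewards a model only when both classes are handled well: a classifier that labels everything as legitimate has 100% specificity but a G-Means of zero.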

## 1️⃣ Setup Environment

In [1]:
# Mount Google Drive for data storage
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Clone repository
!git clone https://github.com/govind104/fraudguard.git
%cd fraudguard

Cloning into 'fraudguard'...
remote: Enumerating objects: 154, done.
remote: Counting objects: 100% (154/154), done.
remote: Compressing objects: 100% (106/106), done.
remote: Total 154 (delta 60), reused 125 (delta 42), pack-reused 0 (from 0)
Receiving objects: 100% (154/154), 104.58 KiB | 5.23 MiB/s, done.
Resolving deltas: 100% (60/60), done.
/content/fraudguard


In [None]:
# Install dependencies
# Note: faiss-gpu may not be available on Python 3.12
# The code will fall back to faiss-cpu automatically
# GNN training STILL runs on GPU - only graph building uses CPU FAISS
!pip install -q torch torch-geometric pandas numpy scikit-learn pyyaml structlog

# Try faiss-gpu first, fall back to faiss-cpu
import subprocess
import sys
# Use the kernel's own interpreter so pip installs into the active environment
result = subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'faiss-gpu'], capture_output=True)
if result.returncode != 0:
    print('⚠️ faiss-gpu not available, using faiss-cpu')
    print('   (Graph building on CPU, but GNN training still runs on GPU!)')
    !pip install -q faiss-cpu
else:
    print('✓ faiss-gpu installed')

import torch

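# 1. Get exact versions
pt_version = torch.__version__.split('+')[0]  # e.g., 2.5.1
# torch.version.cuda is None on CPU-only builds, so fall back to the "cpu" wheel index
cuda_version = "cu" + torch.version.cuda.replace('.', '') if torch.version.cuda else "cpu"  # e.g., cu124
wheel_url = f"https://data.pyg.org/whl/torch-{pt_version}+{cuda_version}.html"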

print(f"PyTorch: {pt_version}, CUDA: {cuda_version}")
print(f"Downloading from: {wheel_url}")

# 2. Install with visible output (force reinstall to fix broken partial installs)
!pip install --force-reinstall torch-scatter torch-sparse -f $wheel_url

# Install repo in editable mode
!pip install -e .

print('\n✓ Environment setup complete')

⚠️ faiss-gpu not available, using faiss-cpu
   (Graph building on CPU, but GNN training still runs on GPU!)
PyTorch: 2.9.0, CUDA: cu126
Downloading from: https://data.pyg.org/whl/torch-2.9.

In [None]:
import torch
try:
    import torch_scatter
    import torch_sparse
    import fraudguard
    print("✅ Success! Libraries are installed and loaded.")
except ImportError as e:
    print(f"❌ Still missing libraries: {e}")
    # Only if you see this error should you go back and install again.

✅ Success! Libraries are installed and loaded.


## 2️⃣ Configuration

In [None]:
import os

# ==============================================
# CONFIGURATION - UPDATE THESE PATHS AS NEEDED
# ==============================================

# Data paths - Point to your Google Drive folders
DATA_DIR = "/content/drive/MyDrive/ieee-fraud-detection"
MODELS_DIR = "/content/drive/MyDrive/fraudguard-models"
LOGS_DIR = "/content/drive/MyDrive/fraudguard-logs"

# Training parameters
SAMPLE_FRAC = 1.0      # Use full dataset (1.0 = 100%)
MAX_EPOCHS = 30
BATCH_SIZE = 4096      # Reduce to 2048 or 1024 if OOM
NUM_NEIGHBORS = [25, 10]  # 2-hop neighborhood sampling

# Create directories
os.makedirs(MODELS_DIR, exist_ok=True)
os.makedirs(LOGS_DIR, exist_ok=True)

print(f"Data: {DATA_DIR}")
print(f"Models: {MODELS_DIR}")
print(f"Logs: {LOGS_DIR}")
print(f"\nBatch size: {BATCH_SIZE}")
print(f"Sample fraction: {SAMPLE_FRAC*100:.0f}%")

Data: /content/drive/MyDrive/ieee-fraud-detection
Models: /content/drive/MyDrive/fraudguard-models
Logs: /content/drive/MyDrive/fraudguard-logs

Batch size: 4096
Sample fraction: 100%
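
To see why reducing `BATCH_SIZE` helps with out-of-memory errors, it can be useful to bound how many nodes a 2-hop sampled mini-batch may touch. The sketch below assumes each hop samples at most the listed number of neighbors per frontier node and ignores de-duplication and low-degree nodes, so it is a loose upper bound rather than what `NeighborLoader` actually materializes. The function name is hypothetical.

```python
def max_sampled_nodes(batch_size: int, num_neighbors: list) -> int:
    """Upper bound on nodes in one sampled subgraph (seed nodes plus every hop)."""
    frontier = batch_size  # nodes whose neighbors are sampled at the current hop
    total = batch_size     # seed nodes are always part of the batch
    for fanout in num_neighbors:
        frontier *= fanout  # each frontier node contributes at most `fanout` neighbors
        total += frontier
    return total


for batch_size in (4096, 2048, 1024):
    print(batch_size, max_sampled_nodes(batch_size, [25, 10]))
```

The bound scales linearly with the batch size, so halving the batch size roughly halves the worst-case memory used by node features in a batch.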


## 3️⃣ Verify GPU and FAISS

In [None]:
import torch
import faiss

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print("\n✓ GNN training will run on GPU")
else:
    print("\n⚠️ WARNING: No GPU detected. Go to Runtime > Change runtime type > GPU")

# Check FAISS GPU
faiss_gpus = faiss.get_num_gpus() if hasattr(faiss, 'get_num_gpus') else 0
print(f"\nFAISS GPUs: {faiss_gpus}")
if faiss_gpus == 0:
    print("   (Using CPU FAISS for graph building - this is OK)")

PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4
VRAM: 15.8 GB

✓ GNN training will run on GPU

FAISS GPUs: 0
   (Using CPU FAISS for graph building - this is OK)


## 4️⃣ Load and Preprocess Data

In [None]:
import sys
sys.path.insert(0, '/content/fraudguard')

from pathlib import Path
from src.data.loader import FraudDataLoader
from src.data.preprocessor import FeaturePreprocessor
from src.data.graph_builder import GraphBuilder
from src.utils.config import load_data_config
from src.utils.device_utils import set_seed, get_device

set_seed(42)
device = get_device()
print(f"Using device: {device}")

# Load config and override path with notebook variable
data_cfg = load_data_config()
data_cfg.paths.raw_data_dir = Path(DATA_DIR)

# Load data with corrected path
loader = FraudDataLoader(config=data_cfg)
df = loader.load_train_data(sample_frac=SAMPLE_FRAC)
train_df, val_df, test_df = loader.create_splits(df)

print(f"\nData loaded:")
print(f"  Train: {len(train_df):,}")
print(f"  Val: {len(val_df):,}")
print(f"  Test: {len(test_df):,}")
print(f"  Fraud rate: {df['isFraud'].mean()*100:.2f}%")

Loading faiss with CPU support (no GPU detected).
Using device: cuda

Data loaded:
  Train: 354,324
  Val: 118,108
  Test: 118,108
  Fraud rate: 3.50%


## 5️⃣ Run Full AD-RL-GNN Pipeline

We use the `FraudTrainer` class to orchestrate the full pipeline, including:
1. **AdaptiveMCD**: Intelligent majority downsampling
2. **RL Agent**: Dynamic subgraph selection (Random Walk, K-Hop, K-Ego)
3. **Graph Enhancement**: Adding semantic edges
4. **GNN Training**: CrossEntropyLoss (15x weight)
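
As a rough illustration of step 3, similarity-based ("semantic") edges can be thought of as links between transactions whose feature vectors are close. The brute-force sketch below assumes cosine similarity and reuses the 0.80 value that the training cell assigns to `similarity_threshold`. How the repository's `GraphBuilder` actually measures similarity is not shown in this notebook, so the metric, the `>=` comparison and the function names are all assumptions.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(features, threshold: float = 0.80):
    """Undirected edges (i, j), i < j, between rows at least `threshold` similar."""
    edges = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if cosine(features[i], features[j]) >= threshold:
                edges.append((i, j))
    return edges


# Three toy transactions: rows 0 and 1 point in nearly the same direction.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(similarity_edges(toy_features))
```

The double loop is quadratic in the number of transactions, which is why a nearest-neighbor index such as FAISS is used for a dataset of this size.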

In [None]:
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader
from src.training.trainer import FraudTrainer
from src.utils.config import load_model_config, load_data_config
from sklearn.metrics import confusion_matrix, f1_score
import numpy as np
import time

# Load Configs & Apply Pro Overrides
model_cfg = load_model_config()
data_cfg = load_data_config()

model_cfg.training["max_epochs"] = MAX_EPOCHS  # defined in the Configuration cell
model_cfg.adaptive_mcd["alpha"] = 0.5
model_cfg.rl_agent["reward_scaling"] = 2.0
model_cfg.graph.similarity_threshold = 0.80

# Initialize Trainer (Wrapper)
print("🚀 Initializing Hybrid AD-RL-GNN Pipeline...")
trainer = FraudTrainer(
    model_config=model_cfg,
    data_config=data_cfg,
    device=device
)

# Run the Data Prep Steps
print("⚙️ Preprocessing & Building Graph...")
trainer._preprocess(train_df, val_df, test_df)
trainer._build_graph()
trainer._prepare_labels(train_df, val_df, test_df)

# A. AdaptiveMCD: Fixes the Class Imbalance (Updates train_mask)
print("\n🧠 Training AdaptiveMCD (The Smart Filter)...")
trainer._train_mcd()

# B. RL Agent: Fixes the Graph Topology (Updates edge_index)
print("\n🤖 Training RL Agent (The Graph Explorer)...")
trainer._train_rl_and_enhance()

print("\n📦 Setting up Mini-Batch Loaders (Fixing Memory Crash)...")

# Create PyG Data Object from the OPTIMIZED Trainer State
optimized_data = Data(
    x=trainer.X_full,
    edge_index=trainer.edge_index, # <--- Contains RL-enhanced edges
    y=trainer.all_labels
)
optimized_data.train_mask = trainer.train_mask
optimized_data.val_mask = trainer.val_mask
optimized_data.test_mask = trainer.test_mask

# Create Loaders (mini-batches keep VRAM usage low; values come from the Configuration cell)
train_loader = NeighborLoader(
    optimized_data,
    num_neighbors=NUM_NEIGHBORS,
    batch_size=BATCH_SIZE,
    input_nodes=optimized_data.train_mask,
    shuffle=True
)

val_loader = NeighborLoader(
    optimized_data,
    num_neighbors=NUM_NEIGHBORS,
    batch_size=BATCH_SIZE,
    input_nodes=optimized_data.val_mask,
    shuffle=False
)

# Initialize Model & Optimizer
trainer._init_model()
model = trainer.model
optimizer = trainer.optimizer
criterion = trainer.criterion # Already set to Weighted CrossEntropy (15x)

# Training Loop
print(f"\n🚀 Starting GNN Training (Mini-Batch) on {device}...")
best_gmeans = 0
history = []

for epoch in range(MAX_EPOCHS):  # set in the Configuration cell
    model.train()
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index)
        loss = criterion(out[:batch.batch_size], batch.y[:batch.batch_size])
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)

    # Validation
    model.eval()
    all_preds, all_true = [], []
    with torch.no_grad():
        for batch in val_loader:
            batch = batch.to(device)
            out = model(batch.x, batch.edge_index)
            pred = out[:batch.batch_size].argmax(dim=1)
            all_preds.extend(pred.cpu().numpy())
            all_true.extend(batch.y[:batch.batch_size].cpu().numpy())

    cm = confusion_matrix(all_true, all_preds, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0
    gmeans = np.sqrt(tpr * tnr)

    print(f"Epoch {epoch+1:>3} | Loss: {avg_loss:.4f} | Spec: {tnr*100:.2f}% | Recall: {tpr*100:.2f}% | G-Means: {gmeans*100:.2f}%")

    if gmeans > best_gmeans:
        best_gmeans = gmeans
        torch.save(model.state_dict(), f"{MODELS_DIR}/fraudguard_best_pro.pt")

print(f"\nTraining Complete. Best G-Means: {best_gmeans*100:.2f}%")

🚀 Initializing Hybrid AD-RL-GNN Pipeline...
⚙️ Preprocessing & Building Graph...

🧠 Training AdaptiveMCD (The Smart Filter)...

🤖 Training RL Agent (The Graph Explorer)...

📦 Setting up Mini-Batch Loaders (Fixing Memory Crash)...

🚀 Starting GNN Training (Mini-Batch) on cuda...
Epoch   1 | Loss: 0.7742 | Spec: 95.09% | Recall: 4.97% | G-Means: 21.73%
Epoch   2 | Loss: 0.6708 | Spec: 79.46% | Recall: 20.91% | G-Means: 40.76%
Epoch   3 | Loss: 0.6652 | Spec: 91.50% | Recall: 8.78% | G-Means: 28.35%
Epoch   4 | Loss: 0.6603 | Spec: 88.69% | Recall: 11.30% | G-Means: 31.66%
Epoch   5 | Loss: 0.6508 | Spec: 83.75% | Recall: 15.03% | G-Means: 35.48%
Epoch   6 | Loss: 0.6387 | Spec: 81.91% | Recall: 19.08% | G-Means: 39.54%
Epoch   7 | Loss: 0.6275 | Spec: 75.81% | Recall: 22.86% | G-Means: 41.63%
Epoch   8 | Loss: 0.6220 | Spec: 69.37% | Recall: 28.24% | G-Means: 44.26%
Epoch   9 | Loss: 0.6169 | Spec: 71.01% | Recall: 27.09% | G-Means: 43.86%
Epoch  10 | Loss: 0.6160 | Spec: 63.20% | Recall: 34.31% | G-Means: 46.57%
Epoch  11 | Loss: 0.6119 | Spec: 88.26% | Recall: 14.77% | G-Means: 36.11%
Epoch  12 | Loss: 0.6125 | Spec: 70.31% | Recall: 27.93% | G-Means: 44.32%
Epoch  13 | Loss: 0.6083 | Spec: 86.61% | Recall: 

## 6️⃣ Evaluation & Claims Verification

In [None]:
# Safe Evaluation (Mini-Batch)
print("🔍 Benchmarking & Evaluation...")
model.load_state_dict(torch.load(f"{MODELS_DIR}/fraudguard_best_pro.pt", map_location=device))
model.eval()

# Use the same neighbor fan-out as training for fast inference (target: P95 latency < 100 ms)
test_loader = NeighborLoader(
    optimized_data,
    num_neighbors=NUM_NEIGHBORS,
    batch_size=BATCH_SIZE,
    input_nodes=optimized_data.test_mask,
    shuffle=False
)

latencies = []
all_preds, all_true = [], []

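with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        # CUDA kernels run asynchronously: synchronize so the timer covers the full forward pass
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        out = model(batch.x, batch.edge_index)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        latencies.append((time.perf_counter() - start) * 1000)  # ms per mini-batch, not per transaction
        pred = out[:batch.batch_size].argmax(dim=1)
        all_preds.extend(pred.cpu().numpy())
        all_true.extend(batch.y[:batch.batch_size].cpu().numpy())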

# Metrics
cm = confusion_matrix(all_true, all_preds, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
tnr = tn / (tn + fp) if (tn + fp) > 0 else 0
tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
gmeans = np.sqrt(tpr * tnr)
p95_latency = np.percentile(latencies, 95)

print("="*60)
print("FINAL RESULTS (Pro Framework)")
print("="*60)
print(f"Specificity: {tnr*100:.2f}%")
print(f"Recall:      {tpr*100:.2f}%")
print(f"G-Means:     {gmeans*100:.2f}%")
print(f"P95 Latency: {p95_latency:.2f} ms")

🔍 Benchmarking & Evaluation...

FINAL RESULTS (Pro Framework)
Specificity: 63.24%
Recall:      34.47%
G-Means:     46.69%
P95 Latency: 45.51 ms
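
One way to make the "claims verification" explicit is to compare the printed results against the target metrics listed at the top of the notebook. The sketch below hard-codes the numbers from this run and from that list. The helper name and constants are hypothetical, and the G-Means improvement target is left out because the notebook does not state the baseline it is measured against.

```python
import math

# Targets copied from the "Target Metrics" list at the top of the notebook.
TARGET_SPECIFICITY_PCT = 98.72
TARGET_P95_LATENCY_MS = 100.0


def check_claims(specificity_pct: float, p95_latency_ms: float) -> dict:
    """Return, per claim, whether the measured value meets the target."""
    return {
        "specificity": specificity_pct >= TARGET_SPECIFICITY_PCT,
        "p95_latency": p95_latency_ms < TARGET_P95_LATENCY_MS,
    }


# Values printed by the evaluation cell in this run.
report = check_claims(specificity_pct=63.24, p95_latency_ms=45.51)
for claim, met in report.items():
    print(f"{claim:<12} {'met' if met else 'NOT met'}")

# Consistency check: G-Means should equal sqrt(specificity * recall).
recomputed_gmeans_pct = 100 * math.sqrt(0.6324 * 0.3447)
print(f"Recomputed G-Means: {recomputed_gmeans_pct:.2f}%")
```

Since the latency samples are per mini-batch forward pass, the P95 comparison is only meaningful if the target is defined at the same granularity.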


## 7️⃣ Save Model

In [None]:
trainer.save(f"{MODELS_DIR}/fraudguard_full_pipeline.pt")
print(f"Model saved to {MODELS_DIR}/fraudguard_full_pipeline.pt")

Model saved to /content/drive/MyDrive/fraudguard-models/fraudguard_full_pipeline.pt


In [None]:
from google.colab import runtime
runtime.unassign()