# Fraud Detection Exercise

## Objective
Build and deploy a fraud detection model using AWS SageMaker, with a focus on handling imbalanced data.

## Tasks Overview
1. Data Generation and Exploration
2. Feature Engineering
3. Data Preparation and Splitting
4. Handle Imbalanced Data
5. Model Training
6. Model Evaluation
7. Model Explainability
8. SageMaker Deployment Preparation
9. Deploy to SageMaker (Advanced)

## Setup

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')
%matplotlib inline

## Task 1: Data Generation and Exploration

### TODO:
- Generate synthetic fraud detection dataset
- Explore the dataset structure
- Analyze class distribution
- Visualize key features
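
As a starting hint, the generation and class-distribution steps can be sketched on a much smaller frame: fix the number of fraud rows from `fraud_ratio`, then draw features conditional on the label. The helper name `sketch_fraud_frame`, the log-normal amounts, and the shift applied to fraud rows are illustrative assumptions, not the intended solution.

```python
import numpy as np
import pandas as pd


def sketch_fraud_frame(n_samples=1000, fraud_ratio=0.02, seed=42):
    """Hypothetical helper: build a tiny imbalanced frame with two features."""
    rng = np.random.default_rng(seed)

    # Fix the exact number of fraud rows, then shuffle the labels.
    n_fraud = int(round(n_samples * fraud_ratio))
    is_fraud = np.zeros(n_samples, dtype=int)
    is_fraud[:n_fraud] = 1
    rng.shuffle(is_fraud)

    # Assumption: amounts are log-normal and fraud rows are shifted higher.
    amount = rng.lognormal(mean=3.0 + 1.0 * is_fraud, sigma=1.0)
    hour = rng.integers(0, 24, size=n_samples)

    return pd.DataFrame(
        {
            "transaction_amount": amount,
            "hour_of_day": hour,
            "is_fraud": is_fraud,
        }
    )


toy_df = sketch_fraud_frame()

# Share of each class; the fraud share should match fraud_ratio.
class_share = toy_df["is_fraud"].value_counts(normalize=True)
print(class_share)
```

Fixing the fraud count up front, rather than sampling each label independently, keeps the realized ratio equal to `fraud_ratio`, which makes later comparisons easier to reason about.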

In [None]:
# Generate synthetic fraud detection data
def generate_fraud_dataset(n_samples=100000, fraud_ratio=0.02):
    """
    TODO: Complete this function to generate synthetic fraud data
    
    Features to include:
    - transaction_amount: Transaction amount
    - hour_of_day: Hour when transaction occurred
    - day_of_week: Day of week
    - merchant_category: Type of merchant
    - distance_from_home: Distance from home address
    - distance_from_last_transaction: Distance from previous transaction
    - transaction_velocity: Number of transactions in last hour
    - is_fraud: Target variable
    
    Returns:
        pandas.DataFrame: Generated dataset
    """
    # YOUR CODE HERE
    pass

# Generate dataset
df = generate_fraud_dataset(n_samples=100000, fraud_ratio=0.02)

# Display basic information
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
# TODO: Analyze class distribution
# YOUR CODE HERE

In [None]:
# TODO: Visualize feature distributions for fraud vs non-fraud
# Create visualizations comparing features between fraud and legitimate transactions
# YOUR CODE HERE

## Task 2: Feature Engineering

### TODO:
- Create additional temporal features
- Encode categorical variables
- Create interaction features
- Handle missing values if any
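
The example features named in this task can be sketched on a toy frame. The conventions used here are assumptions to adapt to your own data: `day_of_week` runs 0 (Monday) to 6 (Sunday), "night" means before 06:00 or from 22:00, and the ratio adds 1 to the velocity to avoid division by zero.

```python
import pandas as pd

# Toy rows using the column names from Task 1.
toy = pd.DataFrame(
    {
        "transaction_amount": [20.0, 950.0, 45.5],
        "hour_of_day": [14, 2, 23],
        "day_of_week": [1, 6, 4],
        "merchant_category": ["grocery", "online", "travel"],
        "transaction_velocity": [1, 5, 0],
    }
)

# Temporal flags (assumed conventions, see the note above).
toy["is_weekend"] = toy["day_of_week"].isin([5, 6]).astype(int)
toy["is_night_transaction"] = (
    (toy["hour_of_day"] < 6) | (toy["hour_of_day"] >= 22)
).astype(int)

# Interaction feature; the +1 guards against a velocity of zero.
toy["amount_vs_velocity_ratio"] = toy["transaction_amount"] / (
    toy["transaction_velocity"] + 1
)

# One-hot encode the categorical column.
toy = pd.get_dummies(toy, columns=["merchant_category"], prefix="merchant")
print(toy.head())
```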

In [None]:
# TODO: Feature engineering
# Create new features that might be useful for fraud detection
# Examples:
# - is_weekend
# - is_night_transaction
# - amount_vs_velocity_ratio
# YOUR CODE HERE

## Task 3: Data Preparation and Splitting

### TODO:
- Split data into train/validation/test sets
- Scale numerical features
- Ensure stratification for imbalanced data
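
A 60/20/20 split with stratification can be obtained with two calls to `train_test_split`: first hold out 40%, then halve the held-out part. The sketch below uses random toy data; the point is the split proportions and that the scaler is fitted on the training split only, so no validation or test statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 1000 rows, 2% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# Step 1: 60% train, 40% held out, preserving the class ratio.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Step 2: split the held-out 40% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

# Fit the scaler on training data only, then apply it everywhere.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```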

In [None]:
# TODO: Prepare features and target
# X = ...
# y = ...
# YOUR CODE HERE

# TODO: Split data (60% train, 20% validation, 20% test)
# Remember to use stratification!
# YOUR CODE HERE

# TODO: Scale features
# YOUR CODE HERE

## Task 4: Handle Imbalanced Data

### TODO:
- Apply SMOTE for oversampling minority class
- Try class weight adjustments
- Compare different strategies
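
For the class-weight strategy, a minimal sketch with scikit-learn's `compute_class_weight` is shown below on toy labels with a 2% fraud rate. The resulting dictionary can be passed as `class_weight` to `LogisticRegression`, and the negative-to-positive ratio is a common starting value for the `scale_pos_weight` parameter of XGBoost and LightGBM. Whichever strategy you choose, apply resampling such as SMOTE to the training split only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy training labels: 980 legitimate, 20 fraud.
y_train = np.array([0] * 980 + [1] * 20)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# Ratio of negatives to positives, a common starting point for boosting models.
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

print(class_weight)
print(scale_pos_weight)
```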

In [None]:
from imblearn.over_sampling import SMOTE

# TODO: Apply SMOTE to training data
# YOUR CODE HERE

# TODO: Check new class distribution
# YOUR CODE HERE

## Task 5: Model Training

### TODO:
- Train baseline Logistic Regression
- Train XGBoost model
- Train LightGBM model
- Compare performance
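
One way to keep the comparison fair is to fit every model on the same training split and score it with the same metric on the same validation split. The sketch below shows that loop on synthetic data. To stay self-contained it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost and LightGBM, which is a substitution and not the models this task asks for; average precision is one reasonable metric for imbalanced data, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) for illustration only.
X, y = make_classification(
    n_samples=2000,
    n_features=8,
    n_informative=4,
    weights=[0.95, 0.05],
    random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stand-in models; swap in xgb.XGBClassifier / lgb.LGBMClassifier in the exercise.
models = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Fit each model and score it on the same validation split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    fraud_probability = model.predict_proba(X_val)[:, 1]
    scores[name] = average_precision_score(y_val, fraud_probability)

for name, score in scores.items():
    print(f"{name}: average precision = {score:.3f}")
```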

In [None]:
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import lightgbm as lgb

# TODO: Train Logistic Regression (baseline)
# YOUR CODE HERE

# TODO: Train XGBoost
# YOUR CODE HERE

# TODO: Train LightGBM
# YOUR CODE HERE

## Task 6: Model Evaluation

### TODO:
- Calculate ROC-AUC scores
- Generate precision-recall curves
- Create confusion matrices
- Calculate cost-sensitive metrics
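
For the cost-sensitive metric, a minimal sketch is to count false positives and false negatives at a given threshold and weight them by the costs assumed in this task ($5 per false positive, $100 per false negative). The helper name `total_cost`, the toy scores, and the candidate thresholds are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FALSE_POSITIVE = 5     # investigating a legitimate transaction
COST_FALSE_NEGATIVE = 100   # missing a fraudulent transaction


def total_cost(y_true, y_score, threshold):
    """Hypothetical helper: total misclassification cost at a threshold."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE)


# Toy labels and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.6, 0.2, 0.7, 0.3])

# Compare a few candidate thresholds and keep the cheapest one.
costs = {t: total_cost(y_true, y_score, t) for t in (0.25, 0.5, 0.65)}
best_threshold = min(costs, key=costs.get)

print(costs)
print(best_threshold)
```

Because a missed fraud costs 20 times as much as a false alarm here, the cheapest threshold tends to sit well below 0.5.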

In [None]:
# TODO: Evaluate all models on validation set
# YOUR CODE HERE

# TODO: Plot ROC curves
# YOUR CODE HERE

# TODO: Plot Precision-Recall curves
# YOUR CODE HERE

In [None]:
# TODO: Calculate cost-sensitive metrics
# Assume: 
# - Cost of false positive (investigating legitimate transaction): $5
# - Cost of false negative (missing fraud): $100
# YOUR CODE HERE

## Task 7: Model Explainability

### TODO:
- Calculate feature importance
- Generate SHAP values
- Create SHAP summary plots
- Explain individual predictions
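
As a model-agnostic cross-check for the feature-importance step, permutation importance from scikit-learn measures how much the score drops when one feature is shuffled. This is a different technique from SHAP and does not replace the SHAP steps in this task. The sketch uses toy data in which only the first feature carries signal; in the exercise you would compute it on the validation split.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Toy data: the label depends only on the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ["signal", "noise_a", "noise_b"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Shuffle each feature several times and record the mean drop in accuracy.
result = permutation_importance(model, X, y, n_repeats=5, random_state=42)

# Rank features from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```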

In [None]:
import shap

# TODO: Calculate SHAP values for best model
# YOUR CODE HERE

# TODO: Create SHAP summary plot
# YOUR CODE HERE

# TODO: Explain a single fraud prediction
# YOUR CODE HERE

## Task 8: SageMaker Deployment Preparation

### TODO:
- Save the best model
- Create inference script
- Prepare model artifacts for SageMaker
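
SageMaker expects model artifacts as a `model.tar.gz` archive. The sketch below saves a toy model with `joblib`, packages it, and defines a `model_fn` loader of the kind used in an inference script. It assumes the SageMaker scikit-learn framework container; the file name `model.joblib` and the temporary directory are illustrative choices. Uploading the archive to S3 needs AWS credentials and is left out.

```python
import os
import tarfile
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression


def model_fn(model_dir):
    """Loader for inference.py: return the model stored in model_dir."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Toy model standing in for the best model from Task 5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Save the model, then package it at the root of model.tar.gz.
work_dir = tempfile.mkdtemp()
model_path = os.path.join(work_dir, "model.joblib")
joblib.dump(model, model_path)

archive_path = os.path.join(work_dir, "model.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(model_path, arcname="model.joblib")

# Sanity checks: list the archive contents and reload the model.
with tarfile.open(archive_path, "r:gz") as tar:
    archive_members = tar.getnames()
restored = model_fn(work_dir)

print(archive_members)
print(restored.predict(X))
```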

In [None]:
import joblib
import boto3
import sagemaker

# TODO: Save the best performing model
# YOUR CODE HERE

# TODO: Prepare for SageMaker deployment
# - Create inference.py script
# - Package model artifacts
# - Upload to S3
# YOUR CODE HERE

## Task 9: Deploy to SageMaker (Advanced)

### TODO:
- Create SageMaker model
- Deploy to real-time endpoint
- Test endpoint with sample data
- Set up monitoring
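
Testing and cleanup can be sketched with `boto3`, assuming an endpoint that accepts `text/csv` input. The helper names and the endpoint name passed in are hypothetical. Only the payload builder runs here, because the other two functions need AWS credentials and a deployed endpoint.

```python
def to_csv_payload(rows):
    """Serialize feature rows as CSV text, one transaction per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def invoke_fraud_endpoint(endpoint_name, rows):
    """Send rows to a deployed endpoint and return the raw response text."""
    import boto3  # imported lazily so the payload helper works without AWS

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_payload(rows),
    )
    return response["Body"].read().decode("utf-8")


def delete_fraud_endpoint(endpoint_name):
    """Delete the endpoint to stop incurring charges."""
    import boto3

    boto3.client("sagemaker").delete_endpoint(EndpointName=endpoint_name)


# Build a payload for two sample transactions (feature order is an assumption).
sample_payload = to_csv_payload([[1, 2.5], [3, 4]])
print(sample_payload)
```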

In [None]:
# TODO: Create and deploy SageMaker endpoint
# YOUR CODE HERE

# TODO: Test endpoint
# YOUR CODE HERE

# TODO: Clean up (delete endpoint when done)
# YOUR CODE HERE

## Reflection Questions

1. What was the impact of handling class imbalance on model performance?
2. Which model performed best and why?
3. What are the trade-offs between precision and recall in fraud detection?
4. How would you handle concept drift in production?
5. What additional features might improve the model?

**Write your answers here:**

1. 
2. 
3. 
4. 
5. 