# 🏦 Synthetic Financial Data: A Privacy-Preserving Walkthrough

**Goal:** Demonstrate how to generate high-fidelity synthetic data from sensitive financial records.

In this notebook, we will:
1.  **Load "Sensitive" Data:** A generated stand-in for real-world financial records, containing PII (Names, SSNs).
2.  **Train a Generative Model:** Use CTGAN to learn the statistical patterns.
3.  **Generate Synthetic Data:** Create a new dataset that looks real but contains NO real users.
4.  **Evaluate Quality:** Compare the distributions and correlations of Real vs. Synthetic data.
5.  **Test Utility:** Check whether a Machine Learning model trained on synthetic data performs comparably to one trained on real data when both are tested on held-out real data.

In [None]:
# Install requirements if you haven't already
# !pip install -r ../requirements.txt

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import torch
import random
from faker import Faker

# SDV (Synthetic Data Vault) Imports
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata
from sdv.evaluation.single_table import evaluate_quality

# ML Utility
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# --- SEEDING FOR REPRODUCIBILITY ---
def set_seed(seed=42):
    """Locks all random number generators for consistent results."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    Faker.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    print(f"Random seed set to: {seed}")

set_seed(42)

# Setup Visuals
warnings.filterwarnings('ignore')
sns.set_style("whitegrid")
%matplotlib inline

print("Libraries loaded and seeds locked! 🔒")

## 1. Load the "Sensitive" Data
First, we load the dataset we generated. This represents the **raw, private data** that banks possess but cannot share.

*Note: If you haven't run `src/data_generator.py` yet, this cell will fail.*

In [None]:
# Load the data generated by our script
try:
    real_data = pd.read_csv('../data/raw/sensitive_financial_data.csv')
    print(f"Loaded {len(real_data)} rows of sensitive data.")
except FileNotFoundError:
    print("⚠️ Data not found! Please run 'python src/data_generator.py' in your terminal first.")
    raise  # Stop here: every later cell needs real_data

# Peek at the data (Notice the PII!)
real_data.head()

## 2. Visualize the Real Data (EDA)
Before we synthesize anything, we need to understand what we are modeling.

Let's look at the distribution of **Income**, **Credit Score**, and **Default Rates**. These are the patterns our AI needs to learn.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot Income (Log-Normal Distribution)
sns.histplot(real_data['Income'], kde=True, ax=axes[0], color='blue')
axes[0].set_title('Real Income Distribution')

# Plot Credit Score (Normal-ish Distribution)
sns.histplot(real_data['CreditScore'], kde=True, ax=axes[1], color='green')
axes[1].set_title('Real Credit Score Distribution')

# Plot Default Rates (Target Variable)
sns.countplot(x='Default', data=real_data, ax=axes[2], palette='viridis')
axes[2].set_title('Loan Default Counts (0=No, 1=Yes)')

plt.tight_layout()
plt.show()
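
The plot comment above describes Income as log-normal. A quick numeric way to back up that reading is to compare skewness before and after a log transform: a log-normal column is right-skewed, while its logarithm should be roughly symmetric. The sketch below runs on made-up data (the `income` series and its parameters are assumptions, not values from `real_data`).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for real_data['Income']; the notebook's values will differ.
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=11.0, sigma=0.5, size=5000), name="Income")

raw_skew = income.skew()          # clearly positive for a log-normal column
log_skew = np.log(income).skew()  # near 0 if log(Income) is roughly normal

print(f"Skew of Income:      {raw_skew:.3f}")
print(f"Skew of log(Income): {log_skew:.3f}")
```

On the notebook's data, swapping `income` for `real_data['Income']` would give the same comparison, assuming all incomes are strictly positive.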

## 3. Train the CTGAN Model
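Now for the magic. We will use **CTGAN (Conditional Tabular GAN)**, a generative adversarial network designed for tables that mix numerical and categorical columns.

**Crucial Step:** We must `drop` the PII columns (Name, SSN, Email, Address) before training. We want the model to learn the *financial math*, not the *people*.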

In [None]:
# 1. Drop PII
training_data = real_data.drop(columns=['Name', 'SSN', 'Email', 'Address'])
print("Columns for training:", training_data.columns.tolist())

# 2. Auto-detect metadata (Categorical vs Numerical)
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(training_data)

# 3. Initialize CTGAN
# Note: epochs=300 is a reasonable starting point for a demo; larger or more complex data may need more.
synthesizer = CTGANSynthesizer(metadata, epochs=300, verbose=True)

# 4. Train!
print("Starting training... (This might take a minute)")
synthesizer.fit(training_data)
print("Training Complete!")
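
Dropping columns by name, as in step 1 of the cell above, only removes PII that lives in the columns you listed. A cheap extra guard is to scan the remaining columns for values that look like identifiers before fitting. The sketch below is one hedged way to do that; the helper name `columns_with_ssn_like_values` and the `123-45-6789` pattern are assumptions for illustration, and other identifier formats would need their own patterns.

```python
import pandas as pd

# Assumed format: US SSNs written as 123-45-6789.
SSN_REGEX = r"\b\d{3}-\d{2}-\d{4}\b"

def columns_with_ssn_like_values(df: pd.DataFrame) -> list[str]:
    """Return the names of columns holding at least one SSN-like string."""
    flagged = []
    for col in df.columns:
        values = df[col].dropna().astype(str)
        if values.str.contains(SSN_REGEX, regex=True).any():
            flagged.append(col)
    return flagged

# Toy table: the PII is hiding in a free-text column, not in a column named 'SSN'.
toy = pd.DataFrame({
    "Income": [52000, 87000],
    "Notes": ["ok", "customer 123-45-6789 called"],
})
leaky = columns_with_ssn_like_values(toy)
print("Columns with SSN-like values:", leaky)
```

In the notebook this could be called on `training_data` just before `synthesizer.fit(...)`, stopping the run if the returned list is not empty.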

## 4. Generate Synthetic Data
Now we can generate as many "fake" financial records as we want. Let's generate 1,000 rows.

In [None]:
synthetic_data = synthesizer.sample(num_rows=1000)
synthetic_data.head()
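
The claim that the synthetic table contains no real users is worth checking rather than assuming. A minimal sketch, using a hypothetical helper `count_exact_copies` and toy tables, counts synthetic rows that reproduce a real row exactly across the shared columns. Zero exact copies is a necessary sanity check, not a privacy guarantee, since near-duplicates can still leak information.

```python
import pandas as pd

def count_exact_copies(real: pd.DataFrame, synthetic: pd.DataFrame) -> int:
    """Count synthetic rows identical to some real row across shared columns."""
    shared = [c for c in synthetic.columns if c in real.columns]
    # De-duplicate the real side so each synthetic row is counted at most once.
    real_unique = real[shared].drop_duplicates()
    merged = synthetic[shared].merge(real_unique, on=shared, how="inner")
    return int(len(merged))

# Toy tables standing in for training_data and synthetic_data.
real_toy = pd.DataFrame({"Income": [50000, 62000, 71000],
                         "CreditScore": [640, 700, 720]})
synthetic_toy = pd.DataFrame({"Income": [50000, 58000],
                              "CreditScore": [640, 690]})

copies = count_exact_copies(real_toy, synthetic_toy)
print(f"Synthetic rows that exactly match a real row: {copies}")
```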

## 5. Visual Evaluation: Real vs. Synthetic
A first check: **Does the fake data look like the real data?**

We will overlay the distributions. If the curves closely overlap, the model has captured each column's marginal distribution.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Compare Income
sns.kdeplot(real_data['Income'], label='Real', fill=True, ax=axes[0], color='blue')
sns.kdeplot(synthetic_data['Income'], label='Synthetic', fill=True, ax=axes[0], color='orange')
axes[0].set_title('Income Distribution Comparison')
axes[0].legend()

# Compare Credit Score
sns.kdeplot(real_data['CreditScore'], label='Real', fill=True, ax=axes[1], color='green')
sns.kdeplot(synthetic_data['CreditScore'], label='Synthetic', fill=True, ax=axes[1], color='red')
axes[1].set_title('Credit Score Distribution Comparison')
axes[1].legend()

plt.tight_layout()
plt.show()
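
The overlays compare one column at a time, so they say nothing about relationships between columns, which the introduction also lists under quality evaluation. The notebook imports SDV's `evaluate_quality` for this kind of report; as a library-independent sketch, the hypothetical helper `correlation_gap` below takes the absolute difference between two Pearson correlation matrices. The toy "synthetic" table is built by shuffling one column, so every marginal matches the real table perfectly while the correlation is destroyed.

```python
import numpy as np
import pandas as pd

def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Absolute difference between the Pearson correlation matrices of two tables."""
    numeric_cols = real.select_dtypes(include="number").columns.intersection(synthetic.columns)
    return (real[numeric_cols].corr() - synthetic[numeric_cols].corr()).abs()

# Toy "real" data with a built-in Income -> CreditScore relationship (made-up numbers).
rng = np.random.default_rng(0)
n = 2000
income = rng.normal(60000, 15000, n)
real_toy = pd.DataFrame({
    "Income": income,
    "CreditScore": 600 + 0.002 * income + rng.normal(0, 20, n),
})

# Toy "synthetic" data: identical marginals, but the relationship is shuffled away.
shuffled_toy = real_toy.copy()
shuffled_toy["CreditScore"] = rng.permutation(shuffled_toy["CreditScore"].to_numpy())

gap = correlation_gap(real_toy, shuffled_toy)
print(gap.round(3))
```

A KDE overlay of `real_toy` against `shuffled_toy` would look perfect, yet the off-diagonal gap is large. On the notebook's data, the call would be `correlation_gap(training_data, synthetic_data)`.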

## 6. Machine Learning Utility Test
Visuals are nice, but can we **use** this data?

**The Experiment:**
1. Train Model A on **REAL** data.
2. Train Model B on **SYNTHETIC** data.
3. Test both on a held-out **REAL** test set.

If Model B performs similarly to Model A, that is evidence the synthetic data preserves the signal needed to predict defaults, so downstream teams could build financial models on it without access to the sensitive records.

In [None]:
# Prepare the Real Test Set (The "Exam")
X = real_data.drop(columns=['Name', 'SSN', 'Email', 'Address', 'Default'])
y = real_data['Default']

# Split: 80% Train, 20% Test
X_train_real, X_test_real, y_train_real, y_test_real = train_test_split(X, y, test_size=0.2, random_state=42)

# --- MODEL A: Trained on Real Data ---
model_real = RandomForestClassifier(random_state=42)
model_real.fit(X_train_real, y_train_real)
acc_real = model_real.score(X_test_real, y_test_real)

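# --- MODEL B: Trained on Synthetic Data ---
# We use the synthetic data we generated earlier.
# Caveat: the synthesizer was fit on all real rows, including this test split.
# For a stricter test, fit the synthesizer on the training split only.
X_train_syn = synthetic_data.drop(columns=['Default'])
y_train_syn = synthetic_data['Default']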

model_syn = RandomForestClassifier(random_state=42)
model_syn.fit(X_train_syn, y_train_syn)
acc_syn = model_syn.score(X_test_real, y_test_real) # TEST ON REAL DATA

# --- RESULTS ---
print(f"Accuracy (Trained on Real):      {acc_real:.4f}")
print(f"Accuracy (Trained on Synthetic): {acc_syn:.4f}")
print("-" * 40)
print(f"Difference: {abs(acc_real - acc_syn):.4f}")

if abs(acc_real - acc_syn) < 0.1:
    print("✅ SUCCESS: Synthetic data retained the utility of the original data!")
else:
    print("❌ WARNING: The model lost too much information. Try training for more epochs.")
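
Accuracy alone can flatter both models when defaults are rare: a model that always predicts "no default" may already score high, and two such models would pass the 0.1 threshold trivially. A hedged sanity check is to compare both accuracies against a majority-class baseline. The helper name `majority_baseline_accuracy` and the toy labels below are assumptions for illustration.

```python
import numpy as np

def majority_baseline_accuracy(y_true) -> float:
    """Accuracy obtained by always predicting the most common class in y_true."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return float(counts.max() / counts.sum())

# Toy labels: 9 non-defaults and 1 default (made up, not taken from the notebook's data).
y_test_toy = np.array([0] * 9 + [1])
baseline = majority_baseline_accuracy(y_test_toy)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

In the notebook, `majority_baseline_accuracy(y_test_real)` gives the number that `acc_real` and `acc_syn` both need to beat before the comparison between them means much. The already imported `classification_report` is another option for looking at per-class precision and recall.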