# 🏋️ Training & Evaluation on Kaggle
## Compression-Aware Video Deepfake Detection

**Prerequisite:** Face crops uploaded as a Kaggle Dataset (`ffpp-faces-deepfake`)

📌 **Settings → Accelerator → GPU T4 x2** (or P100)

📌 **Add Data → Your Datasets → ffpp-faces-deepfake**

## 1️⃣ Setup

In [None]:
# ⚠️ REPLACE with your actual GitHub repo URL
GITHUB_REPO = 'https://github.com/YOUR_USERNAME/compression_aware_deepfake.git'

!git clone {GITHUB_REPO} /kaggle/working/project
%cd /kaggle/working/project
!pip install -q -r requirements.txt

In [None]:
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA:    {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU:     {torch.cuda.get_device_name(0)}')

## 2️⃣ Verify Dataset

Your uploaded dataset should be at:
```
/kaggle/input/ffpp-faces-deepfake/ffpp_faces/
```

⚠️ If your Kaggle dataset name is different, update `KAGGLE_INPUT` below. `DATA_ROOT` is derived from the location of `metadata.csv`.

In [None]:
import os, glob

# ⚠️ UPDATE this path if your Kaggle dataset has a different name
KAGGLE_INPUT = '/kaggle/input/ffpp-faces-deepfake'

# The face crops might be directly in the dataset or inside a subfolder
print('Kaggle input contents:')
for item in os.listdir(KAGGLE_INPUT):
    print(f'  {item}')

# Find the metadata.csv
csv_candidates = glob.glob(f'{KAGGLE_INPUT}/**/metadata.csv', recursive=True)
if csv_candidates:
    METADATA_CSV = csv_candidates[0]
    DATA_ROOT = os.path.dirname(METADATA_CSV)
    print(f'\n✅ Found metadata at: {METADATA_CSV}')
    print(f'   Data root: {DATA_ROOT}')
else:
    print('\n❌ metadata.csv not found! Check your Kaggle dataset.')
    METADATA_CSV = None
    DATA_ROOT = None
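The lookup above takes the first match returned by `glob.glob`, which makes no ordering guarantee, so a dataset containing several `metadata.csv` files could resolve differently between runs. A minimal sketch of a deterministic lookup, assuming the shallowest match is the intended one (the helper name `find_metadata` is hypothetical and not part of the project):

```python
from pathlib import Path
from typing import Optional


def find_metadata(root: str, name: str = "metadata.csv") -> Optional[Path]:
    """Return the shallowest file called `name` under `root`, or None."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    # Sort by depth first, then by path string, so ties resolve the same way every run.
    matches = sorted(root_path.rglob(name), key=lambda p: (len(p.parts), str(p)))
    return matches[0] if matches else None
```

In the notebook this could be used as `meta = find_metadata(KAGGLE_INPUT)`, followed by `DATA_ROOT = str(meta.parent)` when `meta` is not `None`.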

In [None]:
import pandas as pd

if METADATA_CSV:
    df = pd.read_csv(METADATA_CSV)
    print(f'Total face crops: {len(df)}')
    print(f'\nBy split:       {dict(df["split"].value_counts())}')
    print(f'By label:       {dict(df["label"].value_counts())}')
    print(f'By compression: {dict(df["compression"].value_counts())}')
    
    # Verify a sample image exists
    sample_path = os.path.join(DATA_ROOT, df.iloc[0]['frame_path'])
    print(f'\nSample image exists: {os.path.exists(sample_path)}')
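The check above only confirms that the first row points at a real file. A small sketch for checking more rows, assuming `frame_path` values are relative to `DATA_ROOT` as in the cell above (the helper `missing_frames` is illustrative, not part of the project):

```python
import os
from typing import Iterable, List


def missing_frames(data_root: str, frame_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not resolve to a file under data_root."""
    return [
        rel for rel in frame_paths
        if not os.path.isfile(os.path.join(data_root, rel))
    ]

# Possible notebook usage (checks a random subset to keep it fast):
#   subset = df['frame_path'].sample(min(len(df), 500), random_state=0)
#   print(f'Missing in subset: {len(missing_frames(DATA_ROOT, subset))}')
```

Sampling keeps the check quick on large datasets; passing the full `df['frame_path']` column gives an exhaustive count at the cost of one filesystem lookup per row.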

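## 3️⃣ Copy splits.json

Kaggle's input directory is read-only, so copy `splits.json` into the writable project directory. If the dataset does not include one, it is regenerated from the metadata CSV.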

In [None]:
# Copy splits if it exists in the dataset, otherwise regenerate
os.makedirs('data/faceforensics', exist_ok=True)

splits_candidates = glob.glob(f'{KAGGLE_INPUT}/**/splits.json', recursive=True)
if splits_candidates:
    !cp {splits_candidates[0]} data/faceforensics/splits.json
    print('✅ Copied existing splits.json')
else:
    # Generate from the metadata CSV
    if METADATA_CSV:
        unique_splits = df['split'].unique()
        splits_dict = {}
        for s in unique_splits:
            # Get unique source video IDs per split
            split_df = df[df['split'] == s]
            vid_ids = sorted(split_df['video_id'].apply(lambda x: x.split('_')[0]).unique().tolist())
            splits_dict[s] = vid_ids
        
        import json
        with open('data/faceforensics/splits.json', 'w') as f:
            json.dump(splits_dict, f, indent=2)
        print('✅ Generated splits.json from metadata')
        for k, v in splits_dict.items():
            print(f'  {k}: {len(v)} videos')
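Whether `splits.json` was copied or regenerated, it may be worth confirming that no video ID appears in more than one split, since any overlap would leak training identities into evaluation. A minimal sketch, assuming the file maps split names to lists of video IDs as the generation code above produces (`split_overlaps` is a hypothetical helper):

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def split_overlaps(splits: Dict[str, List[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of split names to the video IDs they share (non-empty pairs only)."""
    overlaps = {}
    for a, b in combinations(sorted(splits), 2):
        shared = set(splits[a]) & set(splits[b])
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

After loading the file with `json.load`, an empty result from `split_overlaps(...)` indicates the splits are disjoint.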

---
## 4️⃣ Train Hybrid Model (Main Experiment)

Training on c23 + c40 with the hybrid (spatial + frequency) architecture.

⏱️ **~1–2 hours** on a Kaggle T4 GPU

In [None]:
# Output directory (Kaggle writable area)
!mkdir -p /kaggle/working/results/csv
!mkdir -p /kaggle/working/results/checkpoints
!mkdir -p /kaggle/working/results/plots

!python src/training/train_ffpp.py \
    --metadata_csv {METADATA_CSV} \
    --data_root {DATA_ROOT} \
    --mode hybrid \
    --compressions c23 c40 \
    --epochs 15 \
    --batch_size 16 \
    --lr 1e-4 \
    --output_dir /kaggle/working/results \
    --experiment_name hybrid_c23_c40

## 5️⃣ Train Baseline Models (Ablation)

In [None]:
# Spatial-only baseline
!python src/training/train_ffpp.py \
    --metadata_csv {METADATA_CSV} \
    --data_root {DATA_ROOT} \
    --mode spatial \
    --compressions c23 c40 \
    --epochs 15 \
    --batch_size 16 \
    --output_dir /kaggle/working/results \
    --experiment_name spatial_c23_c40

In [None]:
# Frequency-only baseline
!python src/training/train_ffpp.py \
    --metadata_csv {METADATA_CSV} \
    --data_root {DATA_ROOT} \
    --mode frequency \
    --compressions c23 c40 \
    --epochs 15 \
    --batch_size 16 \
    --output_dir /kaggle/working/results \
    --experiment_name frequency_c23_c40
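The three training cells differ only in `--mode`, `--experiment_name`, and whether `--lr` is passed. The baseline cells omit `--lr`, so they train with whatever default `train_ffpp.py` defines, which may or may not match the `1e-4` used for the hybrid run. A sketch of a helper that builds the command from one place, using only the flags shown in the cells above (`build_train_cmd` is a hypothetical name, and the defaults mirror the values in this notebook):

```python
from typing import List, Optional, Sequence


def build_train_cmd(
    mode: str,
    metadata_csv: str,
    data_root: str,
    compressions: Sequence[str] = ("c23", "c40"),
    epochs: int = 15,
    batch_size: int = 16,
    lr: Optional[float] = None,
    output_dir: str = "/kaggle/working/results",
) -> List[str]:
    """Build the train_ffpp.py argument list for one ablation run."""
    tag = "_".join(compressions)
    cmd = [
        "python", "src/training/train_ffpp.py",
        "--metadata_csv", metadata_csv,
        "--data_root", data_root,
        "--mode", mode,
        "--compressions", *compressions,
        "--epochs", str(epochs),
        "--batch_size", str(batch_size),
        "--output_dir", output_dir,
        "--experiment_name", f"{mode}_{tag}",
    ]
    # Only pass --lr when set; otherwise the script's own default applies.
    if lr is not None:
        cmd += ["--lr", str(lr)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing `lr=1e-4` for all three modes would keep the learning rate identical across the ablation.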

## 6️⃣ Evaluate on Each Compression Level

In [None]:
# Evaluate hybrid model
!python src/training/evaluate_compression_levels.py \
    --checkpoint /kaggle/working/results/checkpoints/best_hybrid_c23_c40.pth \
    --metadata_csv {METADATA_CSV} \
    --data_root {DATA_ROOT} \
    --mode hybrid \
    --compressions c0 c23 c40 \
    --output_csv /kaggle/working/results/csv/compression_eval_hybrid.csv

In [None]:
# Evaluate spatial model
!python src/training/evaluate_compression_levels.py \
    --checkpoint /kaggle/working/results/checkpoints/best_spatial_c23_c40.pth \
    --metadata_csv {METADATA_CSV} \
    --data_root {DATA_ROOT} \
    --mode spatial \
    --compressions c0 c23 c40 \
    --output_csv /kaggle/working/results/csv/compression_eval_spatial.csv

In [None]:
# Evaluate frequency model
!python src/training/evaluate_compression_levels.py \
    --checkpoint /kaggle/working/results/checkpoints/best_frequency_c23_c40.pth \
    --metadata_csv {METADATA_CSV} \
    --data_root {DATA_ROOT} \
    --mode frequency \
    --compressions c0 c23 c40 \
    --output_csv /kaggle/working/results/csv/compression_eval_frequency.csv
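Each evaluation cell expects a checkpoint named `best_{mode}_c23_c40.pth`, so a training run that failed or was skipped will only surface as an error at this stage. A short sketch for checking which checkpoints are present before evaluating, assuming the naming pattern used in the cells above (`missing_checkpoints` is an illustrative helper):

```python
import os
from typing import List, Sequence


def missing_checkpoints(
    ckpt_dir: str,
    modes: Sequence[str] = ("hybrid", "spatial", "frequency"),
    tag: str = "c23_c40",
) -> List[str]:
    """Return the modes whose best_{mode}_{tag}.pth file is absent from ckpt_dir."""
    return [
        mode for mode in modes
        if not os.path.isfile(os.path.join(ckpt_dir, f"best_{mode}_{tag}.pth"))
    ]
```

Calling `missing_checkpoints('/kaggle/working/results/checkpoints')` returns an empty list when all three models are ready to evaluate.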

## 7️⃣ Generate Plots

In [None]:
# Merge evaluation CSVs into ablation summary format
import pandas as pd
import os

all_results = []
for mode in ['hybrid', 'spatial', 'frequency']:
    csv_path = f'/kaggle/working/results/csv/compression_eval_{mode}.csv'
    if os.path.exists(csv_path):
        df_eval = pd.read_csv(csv_path)
        df_eval['mode'] = mode
        df_eval['train_compressions'] = 'c23_c40'
        df_eval['experiment'] = f'{mode}_c23_c40'
        all_results.append(df_eval)

if all_results:
    df_all = pd.concat(all_results, ignore_index=True)
    df_all.to_csv('/kaggle/working/results/csv/ablation_summary.csv', index=False)
    print('✅ Ablation summary created')
    display(df_all)
else:
    print('No evaluation results found yet.')

In [None]:
# Generate all paper-ready plots
!python scripts/plot_results.py \
    --results_dir /kaggle/working/results/csv \
    --output_dir /kaggle/working/results/plots

In [None]:
# Display the generated plots
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

plot_dir = '/kaggle/working/results/plots'
for fname in sorted(os.listdir(plot_dir)):
    if fname.endswith('.png'):
        print(f'\n--- {fname} ---')
        img = mpimg.imread(os.path.join(plot_dir, fname))
        fig, ax = plt.subplots(figsize=(10, 6))
        ax.imshow(img)
        ax.axis('off')
        plt.tight_layout()
        plt.show()

## 8️⃣ Results Summary

In [None]:
# Print final results table
summary_csv = '/kaggle/working/results/csv/ablation_summary.csv'
if os.path.exists(summary_csv):
    df_summary = pd.read_csv(summary_csv)
    
    # Pivot table: AUC by mode × compression
    pivot_auc = df_summary.pivot_table(values='auc', index='mode', columns='compression')
    print('\n📊 AUC by Model × Compression Level:')
    print('='*50)
    display(pivot_auc.round(4))
    
    # Pivot table: F1
    pivot_f1 = df_summary.pivot_table(values='f1', index='mode', columns='compression')
    print('\n📊 F1 by Model × Compression Level:')
    print('='*50)
    display(pivot_f1.round(4))
    
    # LaTeX for paper
    print('\n📝 LaTeX table (AUC) for paper:')
    print(pivot_auc.round(4).to_latex())
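Beyond the raw pivot tables, a single number per model for how much AUC is lost under heavier compression can make the comparison easier to read. The sketch below assumes a nested mapping of mode to per-compression AUC, such as the one `pivot_auc.to_dict(orient='index')` would return; the helper `auc_drop` and the choice of `c23` as the reference level are illustrative assumptions:

```python
from typing import Dict


def auc_drop(
    auc_by_mode: Dict[str, Dict[str, float]],
    reference: str = "c23",
    degraded: str = "c40",
) -> Dict[str, float]:
    """Return reference AUC minus degraded AUC per mode, skipping incomplete rows."""
    drops = {}
    for mode, scores in auc_by_mode.items():
        if reference in scores and degraded in scores:
            drops[mode] = scores[reference] - scores[degraded]
    return drops
```

A smaller drop suggests the model is more robust to the stronger compression level, although it says nothing about absolute accuracy, so it is best read alongside the pivot tables.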

## 9️⃣ Save Outputs

Kaggle saves everything in `/kaggle/working/` as output. You can:
1. **Download** the `results/` folder from the notebook output
2. **Download** the model checkpoints from `results/checkpoints/`
3. Use these in your Streamlit demo and paper

In [None]:
# List all output files
print('üì¶ Output files for download:')
for root, dirs, files in os.walk('/kaggle/working/results'):
    for f in files:
        full = os.path.join(root, f)
        size = os.path.getsize(full) / (1024*1024)  # MB
        rel = os.path.relpath(full, '/kaggle/working')
        print(f'  {rel}  ({size:.1f} MB)')
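Downloading many small files from the output panel can be tedious, so bundling `results/` into a single archive is one option. A minimal sketch using the standard library's `shutil.make_archive`; the helper name and the archive location are assumptions, and the archive is written outside `results/` so that it does not end up inside itself:

```python
import shutil


def archive_results(results_dir: str, archive_base: str) -> str:
    """Zip the contents of results_dir into archive_base + '.zip' and return its path."""
    return shutil.make_archive(archive_base, "zip", root_dir=results_dir)
```

For example, `archive_results('/kaggle/working/results', '/kaggle/working/results_bundle')` would produce `/kaggle/working/results_bundle.zip`. Checkpoints can be large, so the archive may take a while to build.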