# 01 — Data Preprocessing
## Compression-Aware Video Deepfake Detection

This notebook handles:
1. Setting up the environment
2. Preparing FF++ splits
3. Extracting face crops from videos
4. Verifying and visualizing the extracted crops

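**Run on:** Google Colab (GPU) or Kaggle. The cells below use Colab paths (`/content/drive/MyDrive/...`); on Kaggle, skip Step 1 (`google.colab` is Colab-only) and point the paths at your dataset location.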

In [None]:
# ── Step 1: Mount Google Drive ──
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# ── Step 2: Clone your GitHub repo ──
# Replace with your actual repo URL
!git clone https://github.com/YOUR_USERNAME/compression_aware_deepfake.git
%cd compression_aware_deepfake

In [None]:
# ── Step 3: Install dependencies ──
!pip install -q -r requirements.txt

In [None]:
# ── Step 4: Verify GPU ──
import torch
print(f'PyTorch:   {torch.__version__}')
print(f'CUDA:      {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU:       {torch.cuda.get_device_name(0)}')

## Prepare Splits

The FaceForensics++ (FF++) videos are expected at `/content/drive/MyDrive/FFPP_raw/`. If yours are stored elsewhere, change `--data_root` in the commands below.
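
Before generating the splits, it can help to confirm that the videos are actually visible under the data root. The following is a minimal sketch, assuming the videos are `.mp4` files; `count_videos` is a hypothetical helper for illustration and is not part of the repo.

```python
from collections import Counter
from pathlib import Path


def count_videos(data_root, ext=".mp4"):
    """Return {relative_dir: n_videos} for each directory under data_root that holds videos.

    Assumption: videos are stored as files ending in `ext` (".mp4" by default).
    """
    root = Path(data_root)
    if not root.is_dir():
        raise FileNotFoundError(f"Data root not found: {root}")
    counts = Counter()
    for path in root.rglob(f"*{ext}"):
        if path.is_file():
            # Group by the folder that contains the video, relative to the root.
            counts[path.parent.relative_to(root).as_posix()] += 1
    return dict(counts)


# Example (in Colab, after mounting Drive):
# for folder, n in sorted(count_videos('/content/drive/MyDrive/FFPP_raw').items()):
#     print(f'{n:5d}  {folder}')
```

An empty result usually means Drive is not mounted or the path is wrong, which is cheaper to find out here than partway through extraction.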

In [None]:
# ── Step 5: Generate split JSON ──
!python scripts/prepare_ffpp_splits.py \
    --data_root /content/drive/MyDrive/FFPP_raw \
    --output data/faceforensics/splits.json
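
A quick sanity check on the generated file is to count the entries per split and confirm that no entry appears in more than one split. The sketch below assumes that `splits.json` is a JSON object mapping split names to lists of entries; the actual schema written by `prepare_ffpp_splits.py` may differ, and `summarize_splits` is a hypothetical helper.

```python
import json


def summarize_splits(splits_path):
    """Return ({split: n_entries}, {(split_a, split_b): shared_entries}).

    Assumption: the file has the form {"train": [...], "val": [...], ...}.
    """
    with open(splits_path) as f:
        splits = json.load(f)

    sizes = {name: len(entries) for name, entries in splits.items()}

    # Serialize each entry so list-valued entries (e.g. video pairs) become hashable.
    keyed = {
        name: {json.dumps(entry, sort_keys=True) for entry in entries}
        for name, entries in splits.items()
    }

    overlaps = {}
    names = list(keyed)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = keyed[a] & keyed[b]
            if shared:
                overlaps[(a, b)] = sorted(shared)
    return sizes, overlaps


# Example:
# sizes, overlaps = summarize_splits('data/faceforensics/splits.json')
# print(sizes)
# print('Overlapping entries:', overlaps or 'none')
```

Any overlap between splits would leak test videos into training, so `overlaps` should be empty.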

## Extract Face Crops

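This step extracts faces from the videos, saves them as PNG crops, and writes a `metadata.csv` index to the output directory.

**Note:** The full extraction is slow (roughly 1-3 hours, depending on the subset). Run the quick test in Step 6a first; it limits the work with small `--max_videos` and `--max_frames` values.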

In [None]:
# ── Step 6a: Quick test (5 videos, 10 frames each) ──
!python scripts/extract_faces_ffpp.py \
    --data_root /content/drive/MyDrive/FFPP_raw \
    --output_dir /content/drive/MyDrive/ffpp_faces \
    --splits_json data/faceforensics/splits.json \
    --compressions c23 c40 \
    --manipulations Deepfakes FaceSwap \
    --max_videos 5 --max_frames 10 \
    --device cuda

In [None]:
# ── Step 6b: Full extraction (run this for real experiments) ──
# Uncomment and run when ready:

# !python scripts/extract_faces_ffpp.py \
#     --data_root /content/drive/MyDrive/FFPP_raw \
#     --output_dir /content/drive/MyDrive/ffpp_faces \
#     --splits_json data/faceforensics/splits.json \
#     --compressions c0 c23 c40 \
#     --manipulations Deepfakes FaceSwap \
#     --target_fps 5 --max_frames 50 \
#     --device cuda

In [None]:
# ── Step 7: Verify extraction ──
import pandas as pd

csv_path = '/content/drive/MyDrive/ffpp_faces/metadata.csv'
df = pd.read_csv(csv_path)
print(f'Total face crops: {len(df)}')
print('\nSplit distribution:')
print(df['split'].value_counts())
print('\nLabel distribution:')
print(df['label'].value_counts())
print('\nCompression distribution:')
print(df['compression'].value_counts())
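
The counts above do not show whether every row in the index points to a crop that exists on disk. The following is a minimal sketch of such a check, assuming `frame_path` is relative to the output directory (as Step 8 treats it); `check_metadata` is a hypothetical helper and uses only the standard library.

```python
import csv
import os
from collections import Counter


def check_metadata(csv_path, root):
    """Count rows per (split, label) and list frame_path values with no file on disk."""
    per_group = Counter()
    missing = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            per_group[(row["split"], row["label"])] += 1
            # Assumption: frame_path is relative to the extraction output directory.
            if not os.path.isfile(os.path.join(root, row["frame_path"])):
                missing.append(row["frame_path"])
    return per_group, missing


# Example:
# root = '/content/drive/MyDrive/ffpp_faces'
# per_group, missing = check_metadata(os.path.join(root, 'metadata.csv'), root)
# print(dict(per_group))
# print(f'Missing crops: {len(missing)}')
```

The per-group counts also make class imbalance within each split visible, which the separate split and label totals hide.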

In [None]:
# ── Step 8: Visualize sample crops ──
import matplotlib.pyplot as plt
from PIL import Image
import os

root = '/content/drive/MyDrive/ffpp_faces'
# Sample at most 8 rows so this also works on very small test extractions
samples = df.sample(min(8, len(df)), random_state=42)

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, (_, row) in zip(axes.flat, samples.iterrows()):
    img = Image.open(os.path.join(root, row['frame_path']))
    ax.imshow(img)
    ax.set_title(f"{row['label']} / {row['compression']}")
    ax.axis('off')
plt.tight_layout()
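# Create the output folder first; savefig fails if it does not exist
os.makedirs('results/plots', exist_ok=True)
plt.savefig('results/plots/sample_faces.png', dpi=100)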
plt.show()