# 🔧 Step 1: Face Extraction from FaceForensics++ (Colab)

**Purpose:** Extract face crops from your FF++ videos on Google Drive.

**Your FF++ data:** `/content/drive/MyDrive/FFPP_raw/`

**Output:** Face crops + metadata CSV → later uploaded to Kaggle for training.

---
⏱️ **Estimated time:** Quick test = 5 min, Full extraction = 1–3 hours

📌 **Runtime:** Go to **Runtime → Change runtime type → T4 GPU**

## 1️⃣ Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Verify your FF++ data exists
import os
ffpp_root = '/content/drive/MyDrive/FFPP_raw'
print('FF++ root exists:', os.path.isdir(ffpp_root))
if os.path.isdir(ffpp_root):
    print('Contents:', os.listdir(ffpp_root))

## 2️⃣ Clone Your GitHub Repo & Install Dependencies

In [None]:
# ⚠️ REPLACE with your actual GitHub repo URL
GITHUB_REPO = 'https://github.com/YOUR_USERNAME/compression_aware_deepfake.git'

!git clone {GITHUB_REPO} /content/project
%cd /content/project

In [None]:
!pip install -q -r requirements.txt

In [None]:
# Verify GPU + key packages
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA:    {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU:     {torch.cuda.get_device_name(0)}')

from facenet_pytorch import MTCNN
print('MTCNN:   OK')

import pywt
print('PyWavelets: OK')

## 3️⃣ Check Your FF++ Folder Structure

Your data should look like:
```
FFPP_raw/
├── original_sequences/youtube/{raw,c23,c40}/videos/*.mp4
└── manipulated_sequences/{Deepfakes,...}/{raw,c23,c40}/videos/*.mp4
```

In [None]:
import glob
import os  # imported here too so this cell still runs after a runtime restart

ffpp_root = '/content/drive/MyDrive/FFPP_raw'

# Check originals
for comp in ['raw', 'c23', 'c40']:
    path = f'{ffpp_root}/original_sequences/youtube/{comp}/videos'
    if os.path.isdir(path):
        vids = glob.glob(f'{path}/*.mp4')
        print(f'  original/{comp}: {len(vids)} videos')
    else:
        print(f'  original/{comp}: NOT FOUND at {path}')

print()

# Check manipulated
for manip in ['Deepfakes', 'Face2Face', 'FaceSwap', 'NeuralTextures']:
    for comp in ['raw', 'c23', 'c40']:
        path = f'{ffpp_root}/manipulated_sequences/{manip}/{comp}/videos'
        if os.path.isdir(path):
            vids = glob.glob(f'{path}/*.mp4')
            print(f'  {manip}/{comp}: {len(vids)} videos')
        else:
            print(f'  {manip}/{comp}: NOT FOUND at {path}')

## 4️⃣ Generate Train/Val/Test Splits

In [None]:
!python scripts/prepare_ffpp_splits.py \
    --data_root /content/drive/MyDrive/FFPP_raw \
    --output data/faceforensics/splits.json

In [None]:
# Verify splits
import json
with open('data/faceforensics/splits.json') as f:
    splits = json.load(f)
for k, v in splits.items():
    print(f'  {k}: {len(v)} videos')
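
The counts above do not show whether a video landed in more than one split, which would leak test data into training. The sketch below is one way to check for that. It assumes `splits.json` maps each split name to a list whose entries are either strings or short lists of strings (such as ID pairs); the real file produced by `prepare_ffpp_splits.py` may be shaped differently. The helper names and the toy IDs are hypothetical.

```python
from itertools import combinations


def _as_key(item):
    """Make a split entry hashable: lists (e.g. ID pairs) become tuples."""
    return tuple(item) if isinstance(item, list) else item


def find_split_overlaps(splits):
    """Return {(split_a, split_b): shared_entries} for every overlapping pair."""
    keyed = {name: {_as_key(v) for v in entries} for name, entries in splits.items()}
    overlaps = {}
    for a, b in combinations(sorted(keyed), 2):
        shared = keyed[a] & keyed[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Toy data standing in for the loaded splits.json (hypothetical video IDs)
demo = {"train": ["000", "001"], "val": ["002"], "test": ["001", "003"]}
print(find_split_overlaps(demo))  # {('test', 'train'): {'001'}}
```

On the real file, `find_split_overlaps(splits)` should return an empty dict; anything else is worth investigating before extraction.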

## 5️⃣ Extract Face Crops

### Quick Test First (5 videos, 10 frames; takes ~2 minutes)
Run the test cell below to make sure everything works before the full extraction.

In [None]:
# Output on Google Drive (persistent storage)
OUTPUT_DIR = '/content/drive/MyDrive/ffpp_faces'

# ── QUICK TEST (run this first!) ──
!python scripts/extract_faces_ffpp.py \
    --data_root /content/drive/MyDrive/FFPP_raw \
    --output_dir {OUTPUT_DIR} \
    --splits_json data/faceforensics/splits.json \
    --compressions c23 c40 \
    --manipulations Deepfakes FaceSwap \
    --max_videos 5 --max_frames 10 \
    --device cuda

In [None]:
# Verify test output
import pandas as pd

csv_path = f'{OUTPUT_DIR}/metadata.csv'
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
    print(f'✅ Total face crops extracted: {len(df)}')
    print(f'\nBy split: {dict(df["split"].value_counts())}')
    print(f'By label: {dict(df["label"].value_counts())}')
    print(f'By compression: {dict(df["compression"].value_counts())}')
    print(f'\nSample rows:')
    display(df.head())
else:
    print('❌ metadata.csv not found. Check the extraction output above for errors.')
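
Beyond eyeballing the counts, a structural check of `metadata.csv` can catch a broken extraction early. The sketch below looks for the columns this notebook reads (`frame_path`, `split`, `label`, `compression`, `manipulation`), then for empty cells and repeated `frame_path` values. It uses only the standard `csv` module, the function name is hypothetical, and the inline CSV is toy data; whether duplicates or empty cells are actually errors depends on how `extract_faces_ffpp.py` writes its rows.

```python
import csv
import io

# Columns this notebook reads from metadata.csv; the real file may contain more.
EXPECTED_COLUMNS = ["frame_path", "split", "label", "compression", "manipulation"]


def check_metadata(rows, fieldnames, expected=EXPECTED_COLUMNS):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    missing = [c for c in expected if c not in fieldnames]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in expected:
        if col in fieldnames:
            n_empty = sum(1 for r in rows if not (r.get(col) or "").strip())
            if n_empty:
                problems.append(f"{col}: {n_empty} empty values")
    if "frame_path" in fieldnames:
        paths = [r["frame_path"] for r in rows]
        n_dup = len(paths) - len(set(paths))
        if n_dup:
            problems.append(f"frame_path: {n_dup} duplicate rows")
    return problems


# Toy CSV with hypothetical values, standing in for the real metadata.csv
demo_csv = "frame_path,split,label\na.png,train,real\na.png,train,fake\nb.png,,fake\n"
reader = csv.DictReader(io.StringIO(demo_csv))
rows = list(reader)
problems = check_metadata(rows, reader.fieldnames)
print(problems)
```

To run it on the real file, open `csv_path` with `open(csv_path, newline='')` and pass it to `csv.DictReader` in place of the `io.StringIO` object.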

In [None]:
# Visualize a few face crops
import matplotlib.pyplot as plt
from PIL import Image

if os.path.exists(csv_path):
    samples = df.sample(min(8, len(df)), random_state=42)
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    # Hide every axis up front so unused slots stay blank when fewer than 8 crops exist
    for ax in axes.flat:
        ax.axis('off')
    for ax, (_, row) in zip(axes.flat, samples.iterrows()):
        img_path = os.path.join(OUTPUT_DIR, row['frame_path'])
        if os.path.exists(img_path):
            img = Image.open(img_path)
            ax.imshow(img)
            ax.set_title(f"{row['label']} / {row['compression']}", fontsize=10)
    plt.suptitle('Sample Face Crops', fontsize=14)
    plt.tight_layout()
    plt.show()

## 6️⃣ Full Extraction (after test looks good)

⚠️ **Only run this after the quick test above succeeds!**

This will extract faces from **all** videos for the selected manipulations and compressions.

⏱️ **Takes 1–3 hours** depending on the number of videos.

In [None]:
# ⚠️ DELETE the test output first (so we get a clean full extraction)
!rm -rf /content/drive/MyDrive/ffpp_faces

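# ── FULL EXTRACTION ──
# We extract: original + Deepfakes + FaceSwap at c0, c23, c40
# (c0 is FF++ shorthand for the uncompressed videos, stored under raw/ in the
# layout above; if the script expects the folder name, pass raw instead of c0)
# Sampling: 5 fps, max 50 frames per video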
!python scripts/extract_faces_ffpp.py \
    --data_root /content/drive/MyDrive/FFPP_raw \
    --output_dir /content/drive/MyDrive/ffpp_faces \
    --splits_json data/faceforensics/splits.json \
    --compressions c0 c23 c40 \
    --manipulations Deepfakes FaceSwap \
    --target_fps 5 --max_frames 50 \
    --device cuda
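
The total reported after this run can be sanity-checked against a simple upper bound: sources × compressions × videos per source × `--max_frames`. The sketch below assumes the standard FF++ release size of 1,000 videos per source and at most one crop per sampled frame; neither is guaranteed for the data on your Drive, so the counts printed in section 3 are the numbers to trust. The function name is hypothetical.

```python
def max_face_crops(videos_per_source, n_sources, n_compressions, max_frames):
    """Upper bound on crops, assuming at most one face crop per sampled frame."""
    return videos_per_source * n_sources * n_compressions * max_frames


# original + Deepfakes + FaceSwap = 3 sources; c0, c23, c40 = 3 compressions
upper = max_face_crops(videos_per_source=1000, n_sources=3,
                       n_compressions=3, max_frames=50)
print(f"At most {upper:,} crops")  # At most 450,000 crops
```

The real total will usually be lower, since short videos yield fewer than 50 sampled frames and some frames may have no detected face.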

In [None]:
# Final verification
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/ffpp_faces/metadata.csv')
print(f'✅ TOTAL face crops: {len(df)}')
print(f'\nBy split:')
print(df['split'].value_counts())
print(f'\nBy label:')
print(df['label'].value_counts())
print(f'\nBy compression:')
print(df['compression'].value_counts())
print(f'\nBy manipulation:')
print(df['manipulation'].value_counts())
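
The counts above describe the CSV, not the files on disk. Because Drive writes can fail silently during a long session, it may be worth confirming that every `frame_path` row points at a real file. The sketch below assumes `frame_path` is relative to the output directory, as in the visualization cell; the function name and the demo file names are hypothetical.

```python
import os
import tempfile


def find_missing_files(root, relative_paths):
    """Return the relative paths that do not exist as files under root."""
    return [p for p in relative_paths if not os.path.isfile(os.path.join(root, p))]


# Demo in a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "present.png"), "wb").close()
    missing = find_missing_files(root, ["present.png", "absent.png"])
    print(missing)  # ['absent.png']
```

On the real data this would be called as `find_missing_files('/content/drive/MyDrive/ffpp_faces', df['frame_path'])`. Expect it to be slow on Drive, since it makes one filesystem lookup per crop.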

## 7️⃣ Zip for Kaggle Upload

Create a zip file of the face crops to upload to Kaggle as a dataset.

In [None]:
# Check size first
!du -sh /content/drive/MyDrive/ffpp_faces/

In [None]:
# Create zip (saved to Google Drive so it persists)
!cd /content/drive/MyDrive && zip -r ffpp_faces.zip ffpp_faces/ -x '*.DS_Store'
!ls -lh /content/drive/MyDrive/ffpp_faces.zip
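
Before downloading a multi-gigabyte archive, one option is to confirm that it is readable and count its entries. The sketch below uses the standard `zipfile` module; `testzip()` reads every member and verifies its CRC, so it can take a while on a large archive stored on Drive. The function name and the demo archive contents are hypothetical.

```python
import os
import tempfile
import zipfile


def check_zip(zip_path):
    """Return (number_of_files, first_corrupt_member_or_None)."""
    with zipfile.ZipFile(zip_path) as zf:
        n_files = sum(1 for info in zf.infolist() if not info.is_dir())
        bad = zf.testzip()  # None when every member passes its CRC check
    return n_files, bad


# Demo on a tiny archive with hypothetical contents
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("ffpp_faces/metadata.csv", "frame_path,label\n")
        zf.writestr("ffpp_faces/a.png", b"not a real image")
    result = check_zip(demo_zip)
    print(result)  # (2, None)
```

For the real archive, call `check_zip('/content/drive/MyDrive/ffpp_faces.zip')`. The file count should be the number of rows in `metadata.csv` plus any extra files the extraction script writes, such as the CSV itself.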

## ✅ Done! Next Steps:

1. **Download** `ffpp_faces.zip` from Google Drive to your Mac
2. **Upload** to Kaggle as a new private dataset (see instructions below)
3. **Create a Kaggle notebook** using the training notebook from your repo

### How to upload to Kaggle:
1. Go to [kaggle.com/datasets](https://www.kaggle.com/datasets)
2. Click **"+ New Dataset"**
3. Name it: `ffpp-faces-deepfake`
4. Upload `ffpp_faces.zip`
5. Set visibility to **Private**
6. Click **Create**

The data will be available in Kaggle notebooks at:
```
/kaggle/input/ffpp-faces-deepfake/ffpp_faces/
```