The project started as a course assignment to build an end-to-end deepfake detection system using a Vision Transformer. Over time it evolved into a serious attempt at building a real-world generalizable detector, going through 3 complete model implementations and 2 preprocessing pipelines.
"M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection" arXiv 2104.09770
The central idea: combine frequency-domain features (FFT on feature maps) with spatial RGB features (transformer backbone) and fuse them through a Cross Modality Fusion (CMF) block, so the model can detect manipulation artifacts in the spatial and frequency domains at the same time.
Architecture:
- Backbone: Swin-Tiny (pretrained ImageNet) — adapted from paper's EfficientNet-b4
- Frequency branch: FFT on raw pixels → small CNN → 256-d embedding
- Fusion: concatenate spatial + frequency → dense layers → binary output
- Trainable params: ~10M
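
A minimal sketch of the frequency-branch input described above, assuming "FFT on raw pixels" means a per-channel 2D FFT reduced to a centered log-magnitude spectrum before it reaches the small CNN. The function name and the log scaling are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def log_magnitude_spectrum(image: np.ndarray) -> np.ndarray:
    """Return the centered log-magnitude spectrum of an (H, W, C) image, per channel."""
    # 2D FFT over the two spatial axes; the channel axis is left untouched.
    spectrum = np.fft.fft2(image, axes=(0, 1))
    # Move the zero-frequency (DC) component to the center of the map.
    spectrum = np.fft.fftshift(spectrum, axes=(0, 1))
    # log1p compresses the large dynamic range of the magnitudes (assumed choice).
    return np.log1p(np.abs(spectrum))

# Stand-in for one 224x224 RGB face crop.
crop = np.random.default_rng(0).random((224, 224, 3)).astype(np.float32)
freq_input = log_magnitude_spectrum(crop)  # same shape as the crop
```

The resulting map has the same shape as the crop, so it can be fed to a small CNN that produces the 256-d embedding and then concatenated with the spatial features.
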
Dataset used for training:
- FaceForensics++ C23, Deepfakes folder only: 1000 real videos + 1000 fake videos
- Preprocessing: MTCNN face detection, 224×224 crops, 20 frames per video
- Split: 70/15/15 at video level
- Training images: ~27,838
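
Splitting at video level matters because frames from one video are near-duplicates; if they land in both train and test, the test score is inflated. The sketch below shows one way to do such a split. The function name, the seed, and the ID format are hypothetical, not taken from the project.

```python
import random

def split_by_video(video_ids, train=0.70, val=0.15, seed=42):
    """Split unique video IDs into train/val/test so no video spans two splits."""
    ids = sorted(set(video_ids))          # deterministic starting order
    random.Random(seed).shuffle(ids)      # reproducible shuffle
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],    # remainder, about 15%
    }

# Hypothetical IDs standing in for 1000 real + 1000 fake videos.
splits = split_by_video([f"vid_{i:04d}" for i in range(2000)])
```

Every frame then inherits the split of its source video, so the 20 crops from one video always stay together.
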
Evaluation dataset:
- FF++ C23 held-out test set (same distribution as training)
Results:
| Metric | Value |
|---|---|
| Accuracy | 96.98% |
| AUC | 0.9915 |
| F1 | 0.9695 |
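
For reference, the three metrics in the table can be computed from per-sample fake probabilities roughly as follows. This is a dependency-free sketch; treating "fake" as the positive class (label 1) and thresholding at 0.5 are assumptions, since the evaluation code is not shown here.

```python
def accuracy_and_f1(labels, scores, threshold=0.5):
    """Accuracy and F1 with 'fake' (label 1) as the positive class."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    accuracy = correct / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, f1

def roc_auc(labels, scores):
    """AUC as the probability that a random fake outscores a random real (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
acc, f1 = accuracy_and_f1(labels, scores)
auc = roc_auc(labels, scores)
```

AUC is threshold-free, while accuracy and F1 depend on the chosen cutoff, which is one reason the three numbers can move independently.
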
Real-world problem: Tested on internet videos (Obama deepfake PSA, Tom Cruise face swaps, YouTube Shorts). All came back as REAL. The model learned FF++ Deepfakes-specific pixel artifacts from 2019-era autoencoder face swapping. Internet deepfakes use completely different generation methods — the model had never seen them.
After identifying the generalization failure, rebuilt with the paper's exact architecture.
Architecture:
- Backbone: EfficientNet-b4 (pretrained, paper-exact)
- Frequency filter: Learnable complex Hadamard filter (G_real + G_imag) on feature maps via FFT → iFFT
- Multi-scale transformer: 4 patch scales [1, 2, 4, 7] with separate attention heads
- Cross Modality Fusion: QKV cross-attention (RGB → Q, Frequency → K, V)
- Loss: L_cls + L_con (contrastive with margin) — L_seg skipped (no masks in dataset)
- Optimizer: Adam, StepLR ÷10 every 40 epochs, 90 epochs total
- Trainable params: ~10M
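
The frequency filter step (FFT, element-wise product with a complex filter, inverse FFT) can be sketched in NumPy as below. In training, `g_real` and `g_imag` would be learnable parameters; here they are plain arrays. Keeping only the real part of the inverse transform is an assumption of this sketch.

```python
import numpy as np

def frequency_filter(features, g_real, g_imag):
    """Apply a complex Hadamard filter in the frequency domain.

    features, g_real, g_imag: arrays of shape (C, H, W).
    """
    # Forward 2D FFT over the spatial axes of each channel.
    spectrum = np.fft.fft2(features, axes=(-2, -1))
    # Hadamard (element-wise) product with the complex filter G = G_real + i * G_imag.
    filtered = spectrum * (g_real + 1j * g_imag)
    # Back to the spatial domain; the real part is kept (assumption).
    return np.fft.ifft2(filtered, axes=(-2, -1)).real

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 14, 14))
# An all-pass filter (G = 1 + 0i) leaves the features unchanged.
identity_out = frequency_filter(feats, np.ones_like(feats), np.zeros_like(feats))
```

Initializing the filter as all-pass is a convenient sanity check: the output should match the input before any training has moved the filter weights.
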
Dataset used for training:
- FaceForensics++ C23 — all 4 manipulation types: Deepfakes, Face2Face, FaceSwap, NeuralTextures
- 1000 real + 4000 fake videos
- Preprocessing: MTCNN, 224×224, 30 frames per video
- Split: 70/15/15 automatic from original.csv IDs
- Training images: ~104,812
Evaluation dataset:
- FF++ C23 held-out test set
Results:
| Metric | Value |
|---|---|
| Accuracy | 93.5% |
| AUC | 0.9708 |
| F1 | 0.8456 |
(Lower than in 1a because only 20 of the planned 90 epochs were run; the model was still converging.)
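
One consequence of stopping at 20 epochs follows directly from the schedule: with StepLR dividing the learning rate by 10 every 40 epochs, the run ended before the first decay, so the model never trained at the lower rates. The sketch below works this out; the base learning rate of 1e-4 is a placeholder assumption, since the actual value is not stated.

```python
def step_lr(epoch, base_lr, step_size=40, gamma=0.1):
    """Learning rate under a step schedule: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

BASE_LR = 1e-4  # placeholder assumption, not the project's documented value

# Learning rate at the last epoch actually run and at the two planned decay points.
lr_at_stop = step_lr(19, BASE_LR)   # still the base rate
lr_after_40 = step_lr(40, BASE_LR)  # base rate / 10
lr_after_80 = step_lr(80, BASE_LR)  # base rate / 100
```
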
Real-world problem: Same issue. All internet videos returned REAL. Root cause confirmed: M2TR learns low-level pixel artifact signatures specific to its training distribution. Obama BuzzFeed deepfake (2018, FakeApp+Adobe), Tom Cruise deepfakes (DeepFaceLab 2021), TikTok videos — all use generation methods completely different from FF++ training data.
Hypothesis: more diverse training data would improve real-world generalization.
Architecture: Same M2TR (EfficientNet-b4 + freq filter + CMF)
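
As a reminder of what the CMF block does, here is a single-head NumPy sketch of cross-attention with RGB tokens as queries and frequency tokens as keys and values. The residual connection, the token count, and all dimensions are assumptions for illustration; the real block may differ in heads, normalization, and projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_fusion(rgb_tokens, freq_tokens, w_q, w_k, w_v):
    """RGB tokens attend to frequency tokens; returns fused tokens and attention."""
    q = rgb_tokens @ w_q                      # (N, d_k) queries from RGB
    k = freq_tokens @ w_k                     # (N, d_k) keys from frequency
    v = freq_tokens @ w_v                     # (N, D) values from frequency
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    fused = rgb_tokens + attn @ v             # residual connection (assumption)
    return fused, attn

rng = np.random.default_rng(0)
n_tokens, dim, d_k = 49, 64, 32               # illustrative sizes only
rgb = rng.standard_normal((n_tokens, dim))
freq = rng.standard_normal((n_tokens, dim))
w_q = rng.standard_normal((dim, d_k)) * 0.1
w_k = rng.standard_normal((dim, d_k)) * 0.1
w_v = rng.standard_normal((dim, dim)) * 0.1
fused, attn = cross_modality_fusion(rgb, freq, w_q, w_k, w_v)
```
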
Dataset used for training:
- FaceForensics++ C23: all 4 types (1000 real + 4000 fake videos)
- Celeb-DF v2: Celeb-real (~590 videos) + YouTube-real (~300) + Celeb-synthesis (~5639 fake videos)
- Preprocessing: MTCNN, 224×224, 30 frames per video
- Split: automatic 70/15/15, built from face_crops folder names
- Training images: ~149,748 total
Evaluation dataset:
- Combined FF++ C23 + Celeb-DF v2 held-out test set
Results:
| Metric | Value |
|---|---|
| Accuracy | 99.02% |
| AUC | 0.9983 |
| F1 | 0.9601 |
Real-world problem: Still failed on internet videos. Adding Celeb-DF improved benchmark metrics significantly but did not fix real-world generalization. The fundamental issue is architectural — M2TR looks for low-level artifact signatures, and no matter how many benchmark datasets you add, it cannot generalize to generation methods it has never seen. This is a known field-wide problem called closed-set bias.
Confirmed with testing:
- data2/Celeb-synthesis/id0_id16_0000.mp4 → correctly FAKE ✅ (in training distribution)
- Obama deepfake PSA → REAL ❌ (different generation method)
- Tom Cruise YouTube → REAL ❌ (different generation method)
- Personal real video → FAKE ❌ (false positive — model over-sensitive)
M2TR detects manipulation by finding specific low-level signals:
- Frequency domain artifacts left by specific GAN/autoencoder architectures
- Multi-scale spatial inconsistencies at patch boundaries
Both of these are generation-method specific. FF++ autoencoder fakes from 2019 leave completely different frequency signatures than:
- FakeApp (2018) — expression transfer
- DeepFaceLab (2021) — high quality face swap
- Diffusion models (2023+) — no face swap artifacts at all
No amount of data from benchmark datasets fixes this because the model architecture itself is biased toward low-level artifacts.
"Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model" CVPR 2025 | Academia Sinica + Microsoft
DFD-FCG uses CLIP ViT (frozen) as the backbone. CLIP was pretrained on 400 million image-text pairs from the internet — it already has semantic understanding of what real human faces look like, independent of any specific deepfake generation method.
Key difference from M2TR:
- M2TR: learns FF++ artifact signatures → fails on unseen methods
- DFD-FCG: CLIP's weights never change → semantic knowledge is preserved → generalizes to unseen methods
Paper reports 87.2% AUC on WildDeepfake (real internet deepfakes) and 95.0% AUC on Celeb-DF — trained only on FF++. This is the cross-dataset generalization M2TR could never achieve.
Architecture:
- Backbone: CLIP ViT-L/14 in the paper, implemented here as ViT-B/32 to fit in 8GB VRAM; completely frozen
- L=12 transformer layers, each layer feeds a decoder block
- Spatial Module: N=4 learnable queries per layer, cross-attention with CLIP key attributes (γs=k), guided by FCG loss to focus on eyes/nose/mouth/skin
- Temporal Module: Patch-Temporal MHSA (PT-MHSA) captures temporal inconsistencies across frames, C1 and C2 convolutional kernels (size=5, paper-exact)
- FCG Loss: InfoNCE contrastive loss aligning learnable queries to facial component regions (τ=0.07, w=0.15)
- 3 FC heads: spatial (FC_s), temporal (FC_t), spatio-temporal (FC_st) — averaged for final prediction
- Trainable params: ~283K (paper: ~250K) — CLIP 151M params all frozen
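
The FCG loss is listed above as an InfoNCE contrastive loss with τ=0.07. The sketch below is a generic InfoNCE in NumPy, where query i is pulled toward key i and pushed away from the other keys. How the paper builds the facial-component keys and how the w=0.15 weight enters the total loss are not reproduced here; the pairing of one query per component is an assumption.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """Generic InfoNCE: queries[i] is the positive match for keys[i]."""
    # L2-normalize so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal as the target class.
    return float(-np.mean(np.diag(log_probs)))

# Four queries and four component features (e.g. eyes, nose, mouth, skin).
components = np.eye(4)
aligned_loss = info_nce(components, components)            # queries match their keys
shuffled_loss = info_nce(components[[1, 2, 3, 0]], components)  # queries match the wrong keys
```

With a small temperature such as 0.07, the loss is close to zero when each query matches its own component and grows sharply when it matches a different one.
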
Dataset used for training:
- FaceForensics++ C23, all 4 manipulation types (paper trains on FF++ only)
- 1000 real + 4000 fake videos
- Preprocessing (paper-exact):
  - 2D-FAN landmark detection (face_alignment library, GPU)
  - LRW mean face alignment (20words_mean_face.npy from official DFD-FCG repo)
  - 150×150 aligned face crops (paper-exact size)
  - 2–4 second non-overlapping clips, T=10 frames uniformly sampled
- Split: automatic 70/15/15 (no official JSON split files in Kaggle download)
- Training clips: 7,536 (real=4,241, fake=3,295)
- Val clips: 1,640 | Test clips: 1,682
Evaluation dataset:
- FF++ C23 held-out test set (same distribution as training)
- Paper evaluates cross-dataset on Celeb-DF, DFDC, FaceShifter, DeeperForensics, and WildDeepfake; these would be the true generalization test
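
The clip sampling in the preprocessing step (T=10 frames uniformly sampled from each clip) can be sketched as follows. Including both the first and last frame is an assumption of this sketch; the official pipeline may place the samples differently.

```python
def uniform_frame_indices(num_frames, num_samples=10):
    """Pick num_samples frame indices spread evenly over a clip of num_frames frames."""
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # Evenly spaced positions from the first to the last frame, rounded to indices.
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

# A 4-second clip at 25 fps has 100 frames.
indices = uniform_frame_indices(100)
```

Clips shorter than T frames produce repeated indices under this scheme, so very short clips may be better dropped or padded explicitly.
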
| Step | Status |
|---|---|
| Preprocessing | ✅ Complete — 5000 videos, 150×150 aligned crops |
| CLIP loading | ✅ Fixed (CPU load → float32 → GPU) |
| Feature extraction hooks | ✅ Fixed (LND format handled) |
| Spatial + Temporal modules | ✅ Verified correct shapes |
| FCG loss | ❌ CUDA assert — tensor indexing bug being fixed |
| Training | ⏳ Not started yet |
Current blocker: in the FCG loss, `patches[idx]` needs `idx` converted with `torch.tensor(idx, device=patches.device)` so that the index tensor lives on the same device as `patches`.
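
A hedged sketch of that fix is below. It also adds an explicit range check, because CUDA device-side asserts during indexing are frequently caused by out-of-range indices, which are hard to locate once the assert fires asynchronously. The helper name `gather_patches` and the (N, D) layout are hypothetical; the real FCG loss code may index differently.

```python
import torch

def gather_patches(patches: torch.Tensor, idx) -> torch.Tensor:
    """Select rows of `patches` (N, D) by index, on the same device as `patches`."""
    # Build the index as a long tensor on the same device as the data.
    index = torch.as_tensor(idx, dtype=torch.long, device=patches.device)
    # Validate the range up front so a bad index raises a readable Python error
    # instead of a device-side assert. Negative indices are rejected for simplicity.
    if index.numel() > 0 and (index.min() < 0 or index.max() >= patches.shape[0]):
        raise IndexError(
            f"patch index out of range for {patches.shape[0]} patches: {index.tolist()}"
        )
    return patches[index]

# Works the same on CPU and GPU; CPU is used here so the example runs anywhere.
patches = torch.arange(12.0).reshape(4, 3)
selected = gather_patches(patches, [0, 2])
```

If the assert persists after this change, running once with the environment variable `CUDA_LAUNCH_BLOCKING=1` makes the failing call show up at the correct line in the traceback.
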
| Phase | Paper | Backbone | Training Data | Test AUC | Internet Videos |
|---|---|---|---|---|---|
| 1a | M2TR | Swin-Tiny | FF++ Deepfakes only | 0.9915 | ❌ All REAL |
| 1b | M2TR | EfficientNet-b4 | FF++ all 4 types | 0.9708 | ❌ All REAL |
| 1c | M2TR | EfficientNet-b4 | FF++ + Celeb-DF | 0.9983 | ❌ All REAL |
| 2 | DFD-FCG | CLIP ViT-B/32 (frozen) | FF++ all 4 types | In progress | Not yet tested (paper reports 87.2% AUC on WildDeepfake) |
- High benchmark accuracy ≠ real-world performance. A 99% AUC on the FF++ test set says little if the model fails on videos from outside FF++.
- More data from the same distribution doesn't fix generalization. Adding Celeb-DF improved benchmark numbers but didn't help on internet videos, because both datasets use similar generation methods.
- Architecture matters more than data for generalization. CLIP's frozen semantic knowledge is the real solution, not more face-swap training data.
- Cross-dataset generalization is an open research problem. Even DFD-FCG (CVPR 2025 state-of-the-art) only achieves 87% on WildDeepfake and 81% on DFDC. No model gets 99% on unseen internet videos.
- Preprocessing matters for paper reproduction. MTCNN vs 2D-FAN + LRW alignment produces very different face crops that affect model performance significantly.
- Fix FCG loss CUDA indexing bug (one line change)
- Run training overnight (~8–10 hours for 30 epochs)
- Evaluate on FF++ test set
- Update Flask app inference to use DFD-FCG checkpoint
- Test on internet videos — expected to detect Tom Cruise / political deepfakes correctly