Deepfake Detection Project — Complete Journey

Prj-25 | Group G-05 | CV Course Spring 2026

Overview

The project started as a course assignment to build an end-to-end deepfake detection system using a Vision Transformer. Over time it evolved into a serious attempt at building a real-world generalizable detector, going through 3 complete model implementations and 2 preprocessing pipelines.

Phase 1 — M2TR Paper (First Implementation)

Paper

"M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection" arXiv 2104.09770

Core Idea Taken From Paper

The central idea: combine frequency-domain features (FFT on feature maps) with spatial RGB features (transformer backbone) and fuse them via a Cross Modality Fusion (CMF) block, so that the model looks for manipulation artifacts in the spatial and frequency domains simultaneously.
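As a rough illustration of the frequency branch, the sketch below applies an element-wise (Hadamard) complex filter to a feature map in the Fourier domain and transforms the result back. It uses NumPy rather than the project's training framework, and the names `frequency_filter`, `g_real`, and `g_imag` are illustrative assumptions, not code from the paper or from this repository.

```python
import numpy as np

def frequency_filter(feat, g_real, g_imag):
    """Filter a (C, H, W) feature map in the 2D Fourier domain."""
    # Per-channel 2D FFT over the spatial axes.
    spectrum = np.fft.fft2(feat, axes=(-2, -1))
    # Element-wise product with a complex filter built from two real arrays.
    filtered = spectrum * (g_real + 1j * g_imag)
    # Back to the spatial domain; only the real part is kept.
    return np.fft.ifft2(filtered, axes=(-2, -1)).real

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 14, 14))  # hypothetical feature-map shape

# An identity filter (real part 1, imaginary part 0) leaves the features unchanged.
identity_out = frequency_filter(feat, np.ones_like(feat), np.zeros_like(feat))
```

In a trainable version the two filter arrays would be learnable parameters, which is what lets the network emphasize or suppress individual frequency bands.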

Experiment 1a — Quick prototype (Swin-Tiny + frequency branch)

Architecture:

Backbone: Swin-Tiny (pretrained ImageNet) — adapted from paper's EfficientNet-b4
Frequency branch: FFT on raw pixels → small CNN → 256-d embedding
Fusion: concatenate spatial + frequency → dense layers → binary output
Trainable params: ~10M

Dataset used for training:

FaceForensics++ C23 — Deepfakes folder only
1000 real videos + 1000 fake videos
Preprocessing: MTCNN face detection, 224×224 crops, 20 frames per video
Split: 70/15/15 at video level
Training images: ~27,838
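
Splitting at the video level matters because frames from one video are near-duplicates, and letting them land in both train and test would inflate the test metrics. A minimal sketch of such a split follows; the function name `split_by_video`, the seed, and the ID format are assumptions made for illustration.

```python
import random

def split_by_video(video_ids, train=0.70, val=0.15, seed=42):
    """Assign whole videos to train/val/test so no video spans two splits."""
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder goes to test
    }

# 1000 real + 1000 fake videos, with hypothetical ID strings.
splits = split_by_video([f"video_{i:04d}" for i in range(2000)])
```

Every extracted frame then inherits the split of its parent video.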

Evaluation dataset:

FF++ C23 held-out test set (same distribution as training)

Results:

Metric	Value
Accuracy	96.98%
AUC	0.9915
F1	0.9695
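
For reference, the three metrics can be computed from per-image scores as sketched below. This assumes label 1 means fake, a 0.5 decision threshold, and toy scores invented for the example; the project's own evaluation code may differ.

```python
def binary_metrics(labels, scores, threshold=0.5):
    """Return (accuracy, F1, AUC) for binary labels and predicted scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    accuracy = (tp + tn) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # AUC as the probability that a random positive outscores a random
    # negative, with ties counted as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return accuracy, f1, auc

labels = [1, 1, 1, 0, 0, 0]               # toy data, not project results
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
acc, f1, auc = binary_metrics(labels, scores)
```

AUC is threshold-free, while accuracy and F1 depend on the chosen threshold, which is one reason the three numbers can move independently.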

Real-world problem: Tested on internet videos (Obama deepfake PSA, Tom Cruise face swaps, YouTube Shorts). All came back as REAL. The model had learned pixel artifacts specific to the FF++ Deepfakes subset, which was produced with 2019-era autoencoder face swapping. Internet deepfakes use different generation methods that the model had never seen.

Experiment 1b — Full M2TR architecture (paper-exact)

After identifying the generalization failure, rebuilt with the paper's exact architecture.

Architecture:

Backbone: EfficientNet-b4 (pretrained, paper-exact)
Frequency filter: Learnable complex Hadamard filter (G_real + G_imag) on feature maps via FFT → iFFT
Multi-scale transformer: 4 patch scales [1, 2, 4, 7] with separate attention heads
Cross Modality Fusion: QKV cross-attention (RGB → Q, Frequency → K, V)
Loss: L_cls + L_con (contrastive with margin) — L_seg skipped (no masks in dataset)
Optimizer: Adam, StepLR ÷10 every 40 epochs, 90 epochs total
Trainable params: ~10M
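
The step schedule listed above can be written as a one-line formula. The sketch below assumes a base learning rate of 1e-4, which is a placeholder because the value is not stated here; only the step size of 40 and the division by 10 come from the list above.

```python
def step_lr(epoch, base_lr=1e-4, step_size=40, gamma=0.1):
    """Learning rate at a given epoch under a step-decay schedule."""
    # Integer division counts how many decay steps have been passed.
    return base_lr * gamma ** (epoch // step_size)

# Learning rate at a few epochs of a 90-epoch run.
schedule = {epoch: step_lr(epoch) for epoch in (0, 39, 40, 80, 89)}
```

Under this schedule the first decay happens at epoch 40, so any run stopped earlier trains at the base learning rate throughout.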

Dataset used for training:

FaceForensics++ C23 — all 4 manipulation types: Deepfakes, Face2Face, FaceSwap, NeuralTextures
1000 real + 4000 fake videos
Preprocessing: MTCNN, 224×224, 30 frames per video
Split: 70/15/15 automatic from original.csv IDs
Training images: ~104,812

Evaluation dataset:

FF++ C23 held-out test set

Results:

Metric	Value
Accuracy	93.5%
AUC	0.9708
F1	0.8456

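(Lower than 1a, most likely because only 20 of the planned 90 epochs were run and the model was still converging. The 1b test set also covers all 4 manipulation types, so the two results are not directly comparable.)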

Real-world problem: Same issue. All internet videos returned REAL. This is consistent with the suspected root cause: M2TR learns low-level pixel artifact signatures specific to its training distribution. The Obama BuzzFeed deepfake (2018, FakeApp+Adobe), the Tom Cruise deepfakes (DeepFaceLab, 2021), and the TikTok videos all use generation methods different from those in the FF++ training data.

Experiment 1c — Adding Celeb-DF v2 to training data

Hypothesis: more diverse training data would improve real-world generalization.

Architecture: Same M2TR (EfficientNet-b4 + freq filter + CMF)

Dataset used for training:

FaceForensics++ C23: all 4 types (1000 real + 4000 fake videos)
Celeb-DF v2: Celeb-real (~590 videos) + YouTube-real (~300) + Celeb-synthesis (~5639 fake videos)
Preprocessing: MTCNN, 224×224, 30 frames per video
Split: automatic 70/15/15, built from face_crops folder names
Training images: ~149,748 total

Evaluation dataset:

Combined FF++ C23 + Celeb-DF v2 held-out test set

Results:

Metric	Value
Accuracy	99.02%
AUC	0.9983
F1	0.9601

Real-world problem: Still failed on internet videos. Adding Celeb-DF improved benchmark metrics significantly but did not fix real-world generalization. The issue appears to be architectural: M2TR looks for low-level artifact signatures, so adding more benchmark datasets does not help it with generation methods it has never seen. This is a widely reported problem in the field, usually described as poor cross-dataset (or cross-manipulation) generalization.

Confirmed with testing:

data2/Celeb-synthesis/id0_id16_0000.mp4 → correctly FAKE ✅ (in training distribution)
Obama deepfake PSA → REAL ❌ (different generation method)
Tom Cruise YouTube → REAL ❌ (different generation method)
Personal real video → FAKE ❌ (false positive — model over-sensitive)

Why M2TR Cannot Solve Real-World Generalization

M2TR detects manipulation by finding specific low-level signals:

Frequency domain artifacts left by specific GAN/autoencoder architectures
Multi-scale spatial inconsistencies at patch boundaries

Both of these are specific to the generation method. FF++ autoencoder fakes from 2019 leave frequency signatures that differ from those of:

FakeApp (2018): face swap, combined with Adobe compositing in the Obama PSA
DeepFaceLab (2021): high-quality face swap
Diffusion models (2023+): no face-swap artifacts at all

No amount of data from benchmark datasets fixes this because the model architecture itself is biased toward low-level artifacts.

Phase 2 — DFD-FCG Paper (Second Implementation)

Paper

"Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model" CVPR 2025 | Academia Sinica + Microsoft

Why This Paper Solves The Problem

DFD-FCG uses CLIP ViT (frozen) as the backbone. CLIP was pretrained on 400 million image-text pairs from the internet — it already has semantic understanding of what real human faces look like, independent of any specific deepfake generation method.

Key difference from M2TR:

M2TR: learns FF++ artifact signatures → fails on unseen methods
DFD-FCG: CLIP's weights never change → semantic knowledge is preserved → generalizes to unseen methods

Paper reports 87.2% AUC on WildDeepfake (real internet deepfakes) and 95.0% AUC on Celeb-DF — trained only on FF++. This is the cross-dataset generalization M2TR could never achieve.

Architecture (paper-exact)

Backbone: CLIP ViT-L/14 in the paper; implemented here as ViT-B/32 to fit in 8 GB of VRAM. Completely frozen.
L=12 transformer layers (ViT-B/32); each layer feeds a decoder block
Spatial Module: N=4 learnable queries per layer, cross-attention with CLIP key attributes (γs=k), guided by FCG loss to focus on eyes/nose/mouth/skin
Temporal Module: Patch-Temporal MHSA (PT-MHSA) captures temporal inconsistencies across frames, C1 and C2 convolutional kernels (size=5, paper-exact)
FCG Loss: InfoNCE contrastive loss aligning learnable queries to facial component regions (τ=0.07, w=0.15)
3 FC heads: spatial (FC_s), temporal (FC_t), spatio-temporal (FC_st) — averaged for final prediction
Trainable params: ~283K (paper: ~250K) — CLIP 151M params all frozen
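
To make the FCG loss term more concrete, here is a generic InfoNCE computation for a single query. It is a sketch of the standard loss, not the paper's exact formulation: the similarity values are invented, and the idea that each query has one positive facial-component region among several candidates is an assumption made for the example. Only the temperature 0.07 and the weight 0.15 come from the list above.

```python
import math

def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE loss for one query given its similarity to each candidate."""
    logits = [s / temperature for s in similarities]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of the positive candidate under a softmax.
    return log_sum - logits[positive_index]

# Query most similar to its positive region (index 0): small loss.
aligned = info_nce([0.9, 0.1, 0.0, -0.2], positive_index=0)
# Query most similar to a wrong region: large loss.
misaligned = info_nce([0.1, 0.9, 0.0, -0.2], positive_index=0)
# Contribution to the total loss with the weight w = 0.15.
weighted = 0.15 * aligned
```

The low temperature sharpens the softmax, so a query that attends to the wrong region is penalized heavily.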

Dataset used for training

FaceForensics++ C23 — all 4 manipulation types (paper trains on FF++ only)
1000 real + 4000 fake videos
Preprocessing (paper-exact):
- 2D-FAN landmark detection (face_alignment library, GPU)
- LRW mean face alignment (20words_mean_face.npy from official DFD-FCG repo)
- 150×150 aligned face crops (paper-exact size)
- 2–4 second non-overlapping clips, T=10 frames uniformly sampled
Split: automatic 70/15/15 (no official JSON split files in Kaggle download)
Training clips: 7,536 (real=4,241, fake=3,295)
Val clips: 1,640 | Test clips: 1,682
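
Uniform sampling of T=10 frames from a clip can be sketched as follows. The helper name `uniform_frame_indices` and the 75-frame example (3 seconds at an assumed 25 fps) are illustrative and not taken from the repository.

```python
def uniform_frame_indices(num_frames, t=10):
    """Pick t evenly spaced frame indices, including the first and last frame."""
    if num_frames < t:
        raise ValueError(f"clip has {num_frames} frames, need at least {t}")
    if t == 1:
        return [0]
    step = (num_frames - 1) / (t - 1)  # fractional spacing between samples
    return [round(i * step) for i in range(t)]

indices = uniform_frame_indices(75)
```

Clips shorter than T frames are rejected here; padding or repeating frames would be an alternative policy.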

Evaluation dataset

FF++ C23 held-out test set (same distribution as training)
Paper evaluates cross-dataset on: Celeb-DF, DFDC, FaceShifter, DeeperForensics, WildDeepfake — these would be the true generalization test

Current Status

Step	Status
Preprocessing	✅ Complete — 5000 videos, 150×150 aligned crops
CLIP loading	✅ Fixed (CPU load → float32 → GPU)
Feature extraction hooks	✅ Fixed (LND format handled)
Spatial + Temporal modules	✅ Verified correct shapes
FCG loss	❌ CUDA assert — tensor indexing bug being fixed
Training	⏳ Not started yet

Current blocker: the FCG loss indexes patches[idx], and idx needs to be built as torch.tensor(idx, device=patches.device) for proper CUDA indexing.
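
A device-side assert raised during indexing is often caused by an index that falls outside the tensor, which surfaces as an opaque CUDA error rather than a Python IndexError. One way to narrow this down, sketched below under that assumption, is to validate the indices on the CPU before building the index tensor. The helper `checked_indices` and the patch count of 49 are hypothetical and not from the repository.

```python
def checked_indices(idx, num_patches):
    """Return idx as a list of ints, raising a readable error if any is out of range."""
    idx = [int(i) for i in idx]
    bad = [i for i in idx if not 0 <= i < num_patches]
    if bad:
        raise IndexError(
            f"patch indices out of range for {num_patches} patches: {bad}"
        )
    return idx

valid = checked_indices([0, 5, 48], num_patches=49)

# An out-of-range index now fails early with a clear message.
try:
    checked_indices([49], num_patches=49)
    raised = False
except IndexError:
    raised = True
```

The validated list can then be converted with torch.as_tensor(valid, dtype=torch.long, device=patches.device) and used to index patches.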

Summary Table

Phase	Paper	Backbone	Training Data	Test AUC	Internet Videos
1a	M2TR	Swin-Tiny	FF++ Deepfakes only	0.9915	❌ All REAL
1b	M2TR	EfficientNet-b4	FF++ all 4 types	0.9708	❌ All REAL
1c	M2TR	EfficientNet-b4	FF++ + Celeb-DF	0.9983	❌ All REAL
2	DFD-FCG	CLIP ViT-B/32 (frozen)	FF++ all 4 types	In progress	Expected: 87%+

Key Lessons Learned

High benchmark accuracy ≠ real-world performance. A 99% AUC on the FF++ test set says little if the model fails on videos from outside FF++.
More data from the same distribution doesn't fix generalization. Adding Celeb-DF improved benchmark numbers but didn't help on internet videos — both datasets use similar generation methods.
Architecture matters more than data for generalization. CLIP's frozen semantic knowledge is the real solution — not more face-swap training data.
Cross-dataset generalization is an open research problem. Even DFD-FCG (CVPR 2025 state-of-the-art) only achieves 87% AUC on WildDeepfake and 81% AUC on DFDC. No model gets 99% on unseen internet videos.
Preprocessing matters for paper reproduction. MTCNN vs 2D-FAN + LRW alignment produces very different face crops that affect model performance significantly.

What Needs To Happen Next

Fix FCG loss CUDA indexing bug (one line change)
Run training overnight (~8–10 hours for 30 epochs)
Evaluate on FF++ test set
Update Flask app inference to use DFD-FCG checkpoint
Test on internet videos — expected to detect Tom Cruise / political deepfakes correctly
