mrt786/DeepFake-CV-Project

Deepfake Detection Project — Complete Journey

Prj-25 | Group G-05 | CV Course Spring 2026


Overview

The project started as a course assignment to build an end-to-end deepfake detection system using a Vision Transformer. Over time it evolved into a serious attempt at building a real-world generalizable detector, going through 3 complete model implementations and 2 preprocessing pipelines.


Phase 1 — M2TR Paper (First Implementation)

Paper

"M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection" arXiv 2104.09770

Core Idea Taken From Paper

The central idea: combine frequency-domain features (FFT on feature maps) with spatial RGB features (transformer backbone) and fuse them via a Cross Modality Fusion (CMF) block. The model looks for manipulation artifacts in the spatial and frequency domains at the same time.
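
As a rough illustration of the fusion step, the sketch below implements single-head cross-attention in plain Python, with RGB tokens as queries and frequency tokens as both keys and values. The learned Q/K/V projections, multiple heads, and any residual connection are left out, and the function names are hypothetical, so this is a simplified picture of the idea rather than the paper's CMF block.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(rgb_tokens, freq_tokens):
    """Each RGB token (query) attends over the frequency tokens.

    Simplification (assumption): frequency tokens serve as keys and
    values directly, with no learned projection matrices.
    """
    dim = len(rgb_tokens[0])
    fused = []
    for q in rgb_tokens:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in freq_tokens
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append([
            sum(w * v[d] for w, v in zip(weights, freq_tokens))
            for d in range(dim)
        ])
    return fused


rgb = [[1.0, 0.0]]
freq = [[0.0, 2.0], [0.0, 4.0]]
fused = cross_attention(rgb, freq)
```

In the example, both frequency tokens are orthogonal to the query, so they receive equal attention weight and the fused token is their average.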

Experiment 1a — Quick prototype (Swin-Tiny + frequency branch)

Architecture:

  • Backbone: Swin-Tiny (pretrained ImageNet) — adapted from paper's EfficientNet-b4
  • Frequency branch: FFT on raw pixels → small CNN → 256-d embedding
  • Fusion: concatenate spatial + frequency → dense layers → binary output
  • Trainable params: ~10M

Dataset used for training:

  • FaceForensics++ C23 — Deepfakes folder only
  • 1000 real videos + 1000 fake videos
  • Preprocessing: MTCNN face detection, 224×224 crops, 20 frames per video
  • Split: 70/15/15 at video level
  • Training images: ~27,838
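
Splitting at the video level means every frame of a given video lands in exactly one split, so near-duplicate frames cannot leak from training into test. A minimal sketch of such a split follows; the function name, the seed, and the ID format are assumptions made for illustration, not the project's actual script.

```python
import random


def split_videos(video_ids, train=0.70, val=0.15, seed=0):
    """Shuffle video IDs once and cut them into train/val/test.

    Frames are assigned to a split through their parent video ID,
    which keeps all frames of one video together.
    """
    ids = sorted(video_ids)              # stable starting order
    random.Random(seed).shuffle(ids)     # reproducible shuffle
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],   # remainder, about 15%
    }


# Hypothetical IDs standing in for the real video names.
splits = split_videos([f"video_{i:04d}" for i in range(1000)])
```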

Evaluation dataset:

  • FF++ C23 held-out test set (same distribution as training)

Results:

Metric Value
Accuracy 96.98%
AUC 0.9915
F1 0.9695
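
For reference, the three metrics can be computed from per-sample scores as in the sketch below. It assumes that fake is the positive class (label 1) and that accuracy and F1 use a 0.5 threshold; the README does not state either choice, so treat both as assumptions.

```python
def accuracy_f1(labels, scores, threshold=0.5):
    """Accuracy and F1 after thresholding scores (1 = fake, assumed)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    accuracy = correct / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def auc(labels, scores):
    """ROC AUC as the fraction of (fake, real) pairs ranked correctly.

    Ties count as half a win. This pairwise form is O(n^2) and meant
    for illustration on small inputs only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
acc, f1 = accuracy_f1(labels, scores)
area = auc(labels, scores)
```

AUC is threshold-free, while accuracy and F1 depend on the chosen threshold and on the real/fake ratio of the test set, which is why the three numbers can move in different directions between experiments.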

Real-world problem: Tested on internet videos (Obama deepfake PSA, Tom Cruise face swaps, YouTube Shorts). All came back as REAL. The model learned FF++ Deepfakes-specific pixel artifacts from 2019-era autoencoder face swapping. Internet deepfakes use completely different generation methods — the model had never seen them.


Experiment 1b — Full M2TR architecture (paper-exact)

After the generalization failure was identified, the model was rebuilt with the paper's exact architecture.

Architecture:

  • Backbone: EfficientNet-b4 (pretrained, paper-exact)
  • Frequency filter: Learnable complex Hadamard filter (G_real + G_imag) on feature maps via FFT → iFFT
  • Multi-scale transformer: 4 patch scales [1, 2, 4, 7] with separate attention heads
  • Cross Modality Fusion: QKV cross-attention (RGB → Q, Frequency → K, V)
  • Loss: L_cls + L_con (contrastive with margin) — L_seg skipped (no masks in dataset)
  • Optimizer: Adam, StepLR ÷10 every 40 epochs, 90 epochs total
  • Trainable params: ~10M
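
The frequency filter in the list above can be pictured as: transform to the frequency domain, multiply element-wise by a complex weight (G_real + i·G_imag), and transform back. The sketch below shows that on a 1-D signal with a naive DFT in plain Python. The real model applies it to 2-D feature maps with learnable weights; keeping only the real part of the output is an assumption of this sketch.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(
            x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        out.append(s / n if inverse else s)
    return out


def frequency_filter(signal, g_real, g_imag):
    """FFT -> element-wise (Hadamard) complex filter -> inverse FFT."""
    spectrum = dft(signal)
    filtered = [
        coeff * complex(gr, gi)
        for coeff, gr, gi in zip(spectrum, g_real, g_imag)
    ]
    # Assumption: return only the real part of the reconstruction.
    return [v.real for v in dft(filtered, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]
# An all-ones real filter passes the signal through unchanged.
unchanged = frequency_filter(signal, [1, 1, 1, 1], [0, 0, 0, 0])
# Keeping only the DC bin leaves the mean of the signal.
dc_only = frequency_filter(signal, [1, 0, 0, 0], [0, 0, 0, 0])
```

In training, g_real and g_imag would be parameters updated by backpropagation, so the network learns which frequency bands to emphasize or suppress.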

Dataset used for training:

  • FaceForensics++ C23 — all 4 manipulation types: Deepfakes, Face2Face, FaceSwap, NeuralTextures
  • 1000 real + 4000 fake videos
  • Preprocessing: MTCNN, 224×224, 30 frames per video
  • Split: 70/15/15 automatic from original.csv IDs
  • Training images: ~104,812

Evaluation dataset:

  • FF++ C23 held-out test set

Results:

Metric Value
Accuracy 93.5%
AUC 0.9708
F1 0.8456

(Lower than 1a because only 20 of the planned 90 epochs were run, so the model was still converging.)
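
One consequence of stopping at 20 epochs is that the step schedule never fired: with a ÷10 decay every 40 epochs, the learning rate is still at its initial value at epoch 20. The sketch below reproduces the closed form of a step schedule; the base learning rate of 1e-4 is a placeholder, since the README does not give the actual value.

```python
def step_lr(base_lr, epoch, step_size=40, gamma=0.1):
    """Learning rate under a step schedule: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


BASE_LR = 1e-4  # hypothetical value, not taken from the project

schedule = {epoch: step_lr(BASE_LR, epoch) for epoch in (0, 19, 40, 80, 89)}
```

Epochs 0 to 39 use the base rate, 40 to 79 use one tenth of it, and 80 to 89 use one hundredth, so the 20-epoch run covered only the first, high-learning-rate stage.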

Real-world problem: Same issue. All internet videos returned REAL. Root cause confirmed: M2TR learns low-level pixel artifact signatures specific to its training distribution. Obama BuzzFeed deepfake (2018, FakeApp+Adobe), Tom Cruise deepfakes (DeepFaceLab 2021), TikTok videos — all use generation methods completely different from FF++ training data.


Experiment 1c — Adding Celeb-DF v2 to training data

Hypothesis: more diverse training data would improve real-world generalization.

Architecture: Same M2TR (EfficientNet-b4 + freq filter + CMF)

Dataset used for training:

  • FaceForensics++ C23: all 4 types (1000 real + 4000 fake videos)
  • Celeb-DF v2: Celeb-real (~590 videos) + YouTube-real (~300) + Celeb-synthesis (~5639 fake videos)
  • Preprocessing: MTCNN, 224×224, 30 frames per video
  • Split: automatic 70/15/15, built from face_crops folder names
  • Training images: ~149,748 total

Evaluation dataset:

  • Combined FF++ C23 + Celeb-DF v2 held-out test set

Results:

Metric Value
Accuracy 99.02%
AUC 0.9983
F1 0.9601

Real-world problem: Still failed on internet videos. Adding Celeb-DF improved benchmark metrics significantly but did not fix real-world generalization. The fundamental issue is architectural — M2TR looks for low-level artifact signatures, and no matter how many benchmark datasets you add, it cannot generalize to generation methods it has never seen. This is a known field-wide problem called closed-set bias.

Confirmed with testing:

  • data2/Celeb-synthesis/id0_id16_0000.mp4 → correctly FAKE ✅ (in training distribution)
  • Obama deepfake PSA → REAL ❌ (different generation method)
  • Tom Cruise YouTube → REAL ❌ (different generation method)
  • Personal real video → FAKE ❌ (false positive — model over-sensitive)

Why M2TR Cannot Solve Real-World Generalization

M2TR detects manipulation by finding specific low-level signals:

  1. Frequency domain artifacts left by specific GAN/autoencoder architectures
  2. Multi-scale spatial inconsistencies at patch boundaries

Both of these are generation-method specific. FF++ autoencoder fakes from 2019 leave completely different frequency signatures than:

  • FakeApp (2018) — expression transfer
  • DeepFaceLab (2021) — high quality face swap
  • Diffusion models (2023+) — no face swap artifacts at all

No amount of data from benchmark datasets fixes this because the model architecture itself is biased toward low-level artifacts.


Phase 2 — DFD-FCG Paper (Second Implementation)

Paper

"Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model" CVPR 2025 | Academia Sinica + Microsoft

Why This Paper Solves The Problem

DFD-FCG uses CLIP ViT (frozen) as the backbone. CLIP was pretrained on 400 million image-text pairs from the internet — it already has semantic understanding of what real human faces look like, independent of any specific deepfake generation method.

Key difference from M2TR:

  • M2TR: learns FF++ artifact signatures → fails on unseen methods
  • DFD-FCG: CLIP's weights never change → semantic knowledge is preserved → generalizes to unseen methods

Paper reports 87.2% AUC on WildDeepfake (real internet deepfakes) and 95.0% AUC on Celeb-DF — trained only on FF++. This is the cross-dataset generalization M2TR could never achieve.

Architecture (follows the paper, with a smaller backbone)

  • Backbone: CLIP ViT-L/14 in the paper; implemented here as CLIP ViT-B/32 to fit in 8 GB of VRAM. Completely frozen.
  • L=12 transformer layers (the depth of ViT-B/32); each layer feeds a decoder block
  • Spatial Module: N=4 learnable queries per layer, cross-attention with CLIP key attributes (γs=k), guided by FCG loss to focus on eyes/nose/mouth/skin
  • Temporal Module: Patch-Temporal MHSA (PT-MHSA) captures temporal inconsistencies across frames, C1 and C2 convolutional kernels (size=5, paper-exact)
  • FCG Loss: InfoNCE contrastive loss aligning learnable queries to facial component regions (τ=0.07, w=0.15)
  • 3 FC heads: spatial (FC_s), temporal (FC_t), spatio-temporal (FC_st) — averaged for final prediction
  • Trainable params: ~283K (paper: ~250K) — CLIP 151M params all frozen
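
To make the FCG loss concrete, the sketch below computes an InfoNCE loss for one query in plain Python: the query should be closer (in cosine similarity) to its positive, for example a facial-component feature, than to the negatives. How positives and negatives are built from facial regions is specific to the paper and not shown. Treating w=0.15 as the weight of the FCG term in a summed total loss is an assumption of this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE: -log softmax of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]


def total_loss(cls_loss, fcg_loss, w=0.15):
    """Assumed combination: classification loss plus weighted FCG loss."""
    return cls_loss + w * fcg_loss


aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

With the low temperature τ=0.07, small differences in cosine similarity are amplified, so a query that matches a negative instead of its positive is penalized heavily.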

Dataset used for training

  • FaceForensics++ C23 — all 4 manipulation types (paper trains on FF++ only)
  • 1000 real + 4000 fake videos
  • Preprocessing (paper-exact):
    • 2D-FAN landmark detection (face_alignment library, GPU)
    • LRW mean face alignment (20words_mean_face.npy from official DFD-FCG repo)
    • 150×150 aligned face crops (paper-exact size)
    • 2–4 second non-overlapping clips, T=10 frames uniformly sampled
  • Split: automatic 70/15/15 (no official JSON split files in Kaggle download)
  • Training clips: 7,536 (real=4,241, fake=3,295)
  • Val clips: 1,640 | Test clips: 1,682
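
The clip construction can be sketched as two steps: cut each video into non-overlapping clips, then pick T=10 evenly spaced frames from each clip. The version below assumes a fixed 3-second clip length, keeps a trailing clip only if it is at least 2 seconds long, and uses hypothetical function names; the official preprocessing may choose clip lengths differently within the 2–4 second range.

```python
def split_clips(n_frames, fps, clip_seconds=3.0, min_seconds=2.0):
    """Cut a video into non-overlapping (start, end) frame ranges.

    `end` is exclusive. A shorter final clip is kept only if it
    still spans at least `min_seconds`.
    """
    clip_len = int(round(clip_seconds * fps))
    min_len = int(round(min_seconds * fps))
    clips = []
    start = 0
    while n_frames - start >= min_len:
        end = min(start + clip_len, n_frames)
        clips.append((start, end))
        start = end
    return clips


def sample_frames(start, end, t=10):
    """Pick t frame indices evenly spaced across [start, end)."""
    length = end - start
    if length < t:
        raise ValueError("clip is shorter than the number of frames requested")
    return [start + round(i * (length - 1) / (t - 1)) for i in range(t)]


# A 10-second video at 25 fps (illustrative numbers).
clips = split_clips(n_frames=250, fps=25)
frames_per_clip = [sample_frames(s, e) for s, e in clips]
```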

Evaluation dataset

  • FF++ C23 held-out test set (same distribution as training)
  • Paper evaluates cross-dataset on: Celeb-DF, DFDC, FaceShifter, DeeperForensics, WildDeepfake — these would be the true generalization test

Current Status

Step | Status
Preprocessing | ✅ Complete: 5000 videos, 150×150 aligned crops
CLIP loading | ✅ Fixed (CPU load → float32 → GPU)
Feature extraction hooks | ✅ Fixed (LND format handled)
Spatial + Temporal modules | ✅ Verified correct shapes
FCG loss | ❌ CUDA assert: tensor indexing bug being fixed
Training | ⏳ Not started yet

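Current blocker: in the FCG loss, patches[idx] triggers a CUDA assert. The planned fix is to index with torch.tensor(idx, device=patches.device) so the index tensor lives on the same device as patches. Device-side asserts during indexing are usually caused by out-of-range indices, so it is also worth checking that every value in idx is valid for the first dimension of patches.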


Summary Table

Phase | Paper | Backbone | Training Data | Test AUC | Internet Videos
1a | M2TR | Swin-Tiny | FF++ Deepfakes only | 0.9915 | ❌ All REAL
1b | M2TR | EfficientNet-b4 | FF++ all 4 types | 0.9708 | ❌ All REAL
1c | M2TR | EfficientNet-b4 | FF++ + Celeb-DF | 0.9983 | ❌ All REAL
2 | DFD-FCG | CLIP ViT-B/32 (frozen) | FF++ all 4 types | In progress | Expected: 87%+

Key Lessons Learned

  1. High benchmark accuracy ≠ real-world performance. A 99% AUC on the FF++ test set says little if the model fails on videos from outside FF++.

  2. More data from the same distribution doesn't fix generalization. Adding Celeb-DF improved benchmark numbers but didn't help on internet videos — both datasets use similar generation methods.

  3. Architecture matters more than data for generalization. CLIP's frozen semantic knowledge is the real solution — not more face-swap training data.

  4. Cross-dataset generalization is an open research problem. Even DFD-FCG (CVPR 2025 state-of-the-art) only achieves 87% on WildDeepfake and 81% on DFDC. No model gets 99% on unseen internet videos.

  5. Preprocessing matters for paper reproduction. MTCNN vs 2D-FAN + LRW alignment produces very different face crops that affect model performance significantly.


What Needs To Happen Next

  1. Fix FCG loss CUDA indexing bug (one line change)
  2. Run training overnight (~8–10 hours for 30 epochs)
  3. Evaluate on FF++ test set
  4. Update Flask app inference to use DFD-FCG checkpoint
  5. Test on internet videos — expected to detect Tom Cruise / political deepfakes correctly
