# INFO 290T Final Project: Detecting AI-Generated Frames

## Team Members
Abhijith Varma Mudunuri,
Atharva Jayesh Patel

## Project Overview

This project aims to develop a custom classifier that can distinguish between real and AI-generated (fake) videos. With the rise of deepfake technology, being able to detect synthetic media is becoming increasingly important. We'll implement and compare multiple feature extraction techniques and classification methods to identify patterns that differentiate authentic videos from those created using AI manipulation.

### Dataset

We use videos produced by the FaceFusion face-swap diffusion model from the DeepSpeak v2 dataset, which pairs real videos with their AI-generated counterparts. The dataset contains face videos organized into train and test sets, with both "real" and "fake" labels. The videos are portrait-oriented facial recordings, with the AI-generated ones created using sophisticated face-swapping or face-manipulation technologies.

After extracting the FaceFusion videos, we converted them to MOV format and used InsightFace to sample them at 2 fps, up to a maximum of 60 frames per video, keeping only the face region of each frame. This produced around 53,000 face frames in total.

*The Python code snippets below are not runnable as-is; they were taken from the code files we used. To see the specific files, refer to the Code folder within the submission.*

In [None]:
# Import necessary libraries
import os
import numpy as np
import cv2
import random
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
from scipy import signal
from scipy.stats import pearsonr
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

## Sample Frame Comparison

### Real vs FaceFusion Frames

#### Frame 0
<div style="display: flex; flex-direction: row;">
    <div style="margin-right: 10px;">
        <p>Real</p>
        <img src="img/real/frame_000.jpg" alt="Real Frame 0" width="320"/>
    </div>
    <div>
        <p>FaceFusion</p>
        <img src="img/fake/frame_000.jpg" alt="Fake Frame 0" width="320"/>
    </div>
</div>

#### Frame 1
<div style="display: flex; flex-direction: row;">
    <div style="margin-right: 10px;">
        <p>Real</p>
        <img src="img/real/frame_001.jpg" alt="Real Frame 1" width="320"/>
    </div>
    <div>
        <p>FaceFusion</p>
        <img src="img/fake/frame_001.jpg" alt="Fake Frame 1" width="320"/>
    </div>
</div>

#### Frame 2
<div style="display: flex; flex-direction: row;">
    <div style="margin-right: 10px;">
        <p>Real</p>
        <img src="img/real/frame_002.jpg" alt="Real Frame 2" width="320"/>
    </div>
    <div>
        <p>FaceFusion</p>
        <img src="img/fake/frame_002.jpg" alt="Fake Frame 2" width="320"/>
    </div>
</div>

#### Frame 3
<div style="display: flex; flex-direction: row;">
    <div style="margin-right: 10px;">
        <p>Real</p>
        <img src="img/real/frame_003.jpg" alt="Real Frame 3" width="320"/>
    </div>
    <div>
        <p>FaceFusion</p>
        <img src="img/fake/frame_003.jpg" alt="Fake Frame 3" width="320"/>
    </div>
</div>

#### Frame 4
<div style="display: flex; flex-direction: row;">
    <div style="margin-right: 10px;">
        <p>Real</p>
        <img src="img/real/frame_004.jpg" alt="Real Frame 4" width="320"/>
    </div>
    <div>
        <p>FaceFusion</p>
        <img src="img/fake/frame_004.jpg" alt="Fake Frame 4" width="320"/>
    </div>
</div>

## Data Pre-processing

Our first step involves extracting facial regions from the videos to focus on the most relevant parts for detecting manipulation. We process the raw video files to extract face frames and organize them into a structured dataset.

We used InsightFace for face detection to extract face frames from the videos at a specified sampling rate, ensuring we capture the temporal aspects of the videos while focusing on the facial regions where manipulations would be most apparent.

In [None]:
# Code for face extraction from videos
def create_directory_structure(base_path):
    """Create the directory structure for extracted face frames"""
    for dataset in ['train', 'test']:
        for label in ['fake', 'real']:
            os.makedirs(os.path.join(base_path, dataset, label), exist_ok=True)
    print("Directory structure created successfully.")

def extract_face_frames(source_dir, target_dir, sample_rate=2, max_frames=60):
    """Extract face frames from videos at specified sampling rate with a maximum cap"""
    import insightface
    from insightface.app import FaceAnalysis
    
    # Initialize InsightFace face detector
    face_analyzer = FaceAnalysis(name='buffalo_l', providers=['CPUExecutionProvider'])
    face_analyzer.prepare(ctx_id=0, det_size=(640, 640))
    
    # Process videos for each dataset (train/test) and label (fake/real)
    for dataset in ['train', 'test']:
        for label in ['fake', 'real']:
            source_path = os.path.join(source_dir, dataset, label)
            target_path = os.path.join(target_dir, dataset, label)
            
            # Skip if source directory doesn't exist
            if not os.path.exists(source_path):
                print(f"Source directory not found: {source_path}")
                continue
            
            # Get list of video files
            video_files = [f for f in os.listdir(source_path) if f.endswith('.MOV') or f.endswith('.mov')]
            print(f"Processing {len(video_files)} videos in {dataset}/{label}")
            
            # Process each video to extract face frames
            # Implementation details omitted for brevity

# Note: This code is provided for reference but won't be executed in this notebook
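
The per-video frame loop is omitted above. As a minimal sketch of the sampling rule described in the text (2 fps, capped at 60 frames), the helper below computes which frame indices would be kept. The function name `sample_frame_indices` and its `native_fps` argument are hypothetical and are not taken from our code files; the actual extraction may have selected frames differently.

```python
def sample_frame_indices(total_frames, native_fps, sample_rate=2, max_frames=60):
    """Return the indices of frames to keep when sampling at `sample_rate` fps."""
    if total_frames <= 0 or native_fps <= 0:
        return []
    # Keep one frame every `step` frames; never step by less than one frame.
    step = max(int(round(native_fps / sample_rate)), 1)
    indices = list(range(0, total_frames, step))
    # Enforce the per-video cap.
    return indices[:max_frames]


# A 10-second clip recorded at 30 fps keeps every 15th frame.
print(sample_frame_indices(total_frames=300, native_fps=30))
```

With these defaults, any clip longer than 30 seconds hits the 60-frame cap, so longer videos contribute frames only from their first half minute.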

## Feature Extraction Methods

We implemented multiple feature extraction methods to capture various aspects of the videos that might indicate manipulation:

1. **Histogram of Oriented Gradients (HOG)**: Captures the distribution of gradient directions in localized portions of an image
2. **Fourier Transform Analysis**: Examines frequency domain characteristics to reveal manipulation artifacts
3. **Laplacian Pyramid Decomposition**: Provides multi-scale representation to analyze details at different resolutions
4. **Temporal Synchronization Analysis**: Examines the consistency of motion between facial regions over time
5. **Flicker Inconsistency Detection**: Analyzes frame-to-frame inconsistencies that might indicate manipulation

### 1. Histogram of Oriented Gradients (HOG)

HOG features capture the distribution of gradient directions in localized portions of an image. They are particularly effective at capturing shape information and are robust to illumination changes, making them suitable for detecting facial manipulations.

In [None]:
def compute_hog_features(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
    """Compute HOG features for a single image"""
    # Convert to grayscale if needed
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image
    
    # Resize to a fixed size to ensure consistency
    gray = cv2.resize(gray, (128, 128))
    
    # Create HOG descriptor
    win_size = (128, 128)
    cell_size = pixels_per_cell
    block_size = (cells_per_block[0] * cell_size[0], cells_per_block[1] * cell_size[1])
    block_stride = cell_size
    
    hog = cv2.HOGDescriptor(win_size, block_size, block_stride, cell_size, orientations)
    hog_features = hog.compute(gray)
    
    return hog_features.flatten()

def extract_hog_temporal_features(video_dir):
    """Extract HOG features from frames and compute temporal features"""
    # Implementation details omitted for brevity
    # This function loads all frames from a video directory,
    # computes HOG features for each frame, and then analyzes
    # temporal patterns in these features
    pass
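
To give a sense of how large each per-frame HOG vector is, the sketch below computes the descriptor length implied by the parameters in `compute_hog_features` (128x128 window, 8x8 cells, 2x2-cell blocks, block stride of one cell, 9 orientations). The helper `hog_descriptor_length` is illustrative only and is not part of our code files.

```python
def hog_descriptor_length(win_size=(128, 128), pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2), orientations=9):
    """Number of values in a HOG descriptor for one window."""
    block_w = cells_per_block[0] * pixels_per_cell[0]
    block_h = cells_per_block[1] * pixels_per_cell[1]
    # The block stride equals the cell size, as in compute_hog_features.
    stride_w, stride_h = pixels_per_cell
    blocks_x = (win_size[0] - block_w) // stride_w + 1
    blocks_y = (win_size[1] - block_h) // stride_h + 1
    values_per_block = cells_per_block[0] * cells_per_block[1] * orientations
    return blocks_x * blocks_y * values_per_block


print(hog_descriptor_length())  # 15 * 15 blocks * 36 values per block = 8100
```

A descriptor of this size per frame is far larger than the number of test videos, which is one reason two PCA components can explain so little of the variance.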

#### HOG Feature Visualization

Here we visualize how the HOG features separate real and fake videos through dimensionality reduction techniques.

![HOG t-SNE Visualization](HOG/all_tsne.png)

The t-SNE visualization of HOG features reveals substantial overlap between real facial videos (orange X's) and FaceFusion-generated videos (blue dots). While HOG captures gradient orientations that should theoretically detect inconsistencies in facial structure, the visualization shows that FaceFusion effectively replicates these gradient patterns. The few clustered regions of same-class points (particularly in the upper right) suggest that certain videos share distinctive gradient signatures, but these aren't consistently aligned with the real/fake classification. The extensive mixing throughout the central region demonstrates that relying solely on gradient-based features provides limited discriminative power for detecting modern deepfakes.

![HOG PCA Visualization](HOG/all_pca.png)

The PCA projection of HOG features shows the first two principal components capturing only about 0.98% of the total variance, with extensive class overlap. This extremely low explained variance indicates that the discriminative information is distributed across many higher dimensions that aren't visualized here. The pattern of data points shows a primary cluster near the origin with several outliers (particularly labeled with specific file identifiers like "faceproject-v2_21_facetalker") that deviate significantly along the x-axis. This suggests that while some videos exhibit distinctive gradient patterns, these aren't consistently associated with either real or FaceFusion-generated content.

### 2. Fourier Transform Analysis

Fourier transform analysis helps us examine the frequency domain characteristics of the videos. This can reveal manipulation artifacts that might be invisible in the spatial domain, as AI-generated content often exhibits different frequency patterns compared to authentic videos.

In [None]:
def compute_fourier_features(image):
    """Compute Fourier transform features for a single image"""
    # Convert to grayscale if needed
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image
    
    # Resize to a fixed size to ensure consistency
    gray = cv2.resize(gray, (128, 128))
    
    # Convert to float for FFT
    gray_float = gray.astype(np.float32) / 255.0
    
    # Apply 2D FFT
    f_transform = np.fft.fft2(gray_float)
    
    # Shift zero frequency component to center
    f_transform_shifted = np.fft.fftshift(f_transform)
    
    # Compute magnitude spectrum (log scale for better visualization)
    magnitude_spectrum = np.log(1 + np.abs(f_transform_shifted))
    
    # Compute phase spectrum
    phase_spectrum = np.angle(f_transform_shifted)
    
    # Extract features from the frequency domain
    # Implementation details omitted for brevity
    
    return np.array([]) # Placeholder for the actual feature vector
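
The feature step is left as a placeholder above. One common way to turn a centered magnitude spectrum into a short feature vector is to average it over concentric frequency bands. The sketch below shows that idea in plain Python on a nested list; `radial_band_means` and `n_bands` are hypothetical names, and this is not necessarily the feature set used in our code files.

```python
import math


def radial_band_means(spectrum, n_bands=4):
    """Average a centered 2D magnitude spectrum over concentric frequency bands."""
    rows, cols = len(spectrum), len(spectrum[0])
    center_r, center_c = rows // 2, cols // 2
    # Largest distance from the center (reached at the top-left corner).
    max_radius = math.hypot(center_r, center_c) or 1.0
    sums = [0.0] * n_bands
    counts = [0] * n_bands
    for r in range(rows):
        for c in range(cols):
            radius = math.hypot(r - center_r, c - center_c)
            band = min(int(n_bands * radius / max_radius), n_bands - 1)
            sums[band] += spectrum[r][c]
            counts[band] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]


# Toy 4x4 "spectrum" with all energy at the center (the DC term after fftshift).
toy = [[0.0] * 4 for _ in range(4)]
toy[2][2] = 8.0
print(radial_band_means(toy, n_bands=2))
```

Low-index bands summarize coarse structure and high-index bands summarize fine detail, so a generator that smooths or sharpens faces would shift energy between bands.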

#### Fourier Transform Visualization

These visualizations show how the frequency domain features differentiate between real and AI-generated videos.

![Fourier Transform t-SNE Visualization](FOURIER/all_tsne.png)

The t-SNE visualization of Fourier features shows extensive intermixing between real and FaceFusion-generated videos along an S-shaped manifold structure. Despite transforming the data into the frequency domain, the visualization shows no clear separation between classes. This suggests that FaceFusion maintains frequency-domain characteristics nearly identical to real videos, likely because it preserves the overall spatial frequencies while manipulating facial features. The consistent mixing throughout all regions of the plot indicates that spectral analysis alone is insufficient for detecting these sophisticated deepfakes.

![Fourier Transform PCA Visualization](FOURIER/all_pca.png)

The PCA projection of Fourier features shows nearly complete class overlap across a wide value range (-15000 to +20000 on x-axis), with the explained variance of approximately 1.00% confirming that the most discriminative information isn't captured in these first two components. The scattered distribution pattern, with both classes showing similar variance and central tendency, indicates that FaceFusion successfully replicates the frequency characteristics of real videos. This widespread overlap suggests that distinguishing between real and fake videos based solely on these two principal components of frequency domain features would be extremely challenging.

### 3. Laplacian Pyramid Decomposition

Laplacian pyramid decomposition provides a multi-scale representation of the images, capturing details at different resolutions. This helps us analyze both fine-grained and coarse details in the video frames, which is useful for detecting inconsistencies introduced by AI manipulation.

In [None]:
def build_laplacian_pyramid(image, levels=4):
    """Build a Laplacian pyramid for an image"""
    # Ensure image is grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image
    
    # Resize to ensure consistent dimensions
    gray = cv2.resize(gray, (128, 128))
    
    # Convert to float32 for calculations
    gray = gray.astype(np.float32) / 255.0
    
    # Initialize pyramid
    pyramid = []
    current = gray.copy()
    
    # Build Gaussian pyramid
    gaussian_pyramid = [current]
    for i in range(levels - 1):
        current = cv2.pyrDown(current)
        gaussian_pyramid.append(current)
    
    # Build Laplacian pyramid
    for i in range(levels - 1):
        size = (gaussian_pyramid[i].shape[1], gaussian_pyramid[i].shape[0])
        expanded = cv2.pyrUp(gaussian_pyramid[i + 1], dstsize=size)
        laplacian = gaussian_pyramid[i] - expanded
        pyramid.append(laplacian)
    
    # Add the smallest level of the Gaussian pyramid
    pyramid.append(gaussian_pyramid[-1])
    
    return pyramid

def extract_pyramid_features(pyramid):
    """Extract statistical features from each level of the Laplacian pyramid"""
    # Implementation details omitted for brevity
    pass
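
Since `extract_pyramid_features` is elided, the following is only a sketch of what per-level statistics could look like: mean, standard deviation, and mean energy for each level, concatenated into one vector. The function names and the choice of statistics are assumptions for illustration, not a record of our implementation. Levels are treated as nested lists here; a NumPy level could be passed as `level.tolist()`.

```python
import statistics


def level_statistics(level):
    """Summarize one pyramid level (a 2D grid of floats) with three numbers."""
    values = [v for row in level for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    energy = sum(v * v for v in values) / len(values)
    return [mean, std, energy]


def pyramid_statistics(pyramid):
    """Concatenate the per-level summaries, finest level first."""
    features = []
    for level in pyramid:
        features.extend(level_statistics(level))
    return features


toy_pyramid = [
    [[0.5, -0.5], [-0.5, 0.5]],  # a detail (Laplacian) level
    [[0.25]],                    # the coarsest (Gaussian) level
]
print(pyramid_statistics(toy_pyramid))
```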

#### Laplacian Pyramid Visualization

![Laplacian Pyramid t-SNE Visualization](LAPLACIAN/all_tsne.png)

The t-SNE visualization of Laplacian pyramid features shows both classes following the same curved distribution pattern with significant overlap. Despite the multi-scale decomposition capturing information at different resolutions, the visualization shows that FaceFusion-generated videos maintain consistent characteristics across scales. The lack of separation indicates that even the fine details preserved in upper levels of the pyramid and the broader structures in lower levels share similar patterns between real and deepfake videos, highlighting FaceFusion's effectiveness at maintaining coherent structure across scales.

![Laplacian Pyramid PCA Visualization](LAPLACIAN/all_pca.png)

The PCA projection of Laplacian pyramid features shows substantial class overlap with extreme value ranges (PC1 spans from -400,000 to +1,000,000). The significant mixing of points suggests that the first two principal components alone cannot cleanly distinguish between real and FaceFusion-generated content. The variance explained by these components is moderate, indicating that while they capture some relevant information, much of the discriminative power likely lies in higher dimensions or more complex relationships among features.

### 4. Temporal Synchronization Analysis (SYNCHRO)

This method analyzes the temporal synchronization between different facial regions over time. In real videos, facial movements are naturally synchronized, while AI-generated videos often exhibit subtle inconsistencies in the timing and coordination of these movements.
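
No code for this method is shown here, so the snippet below is only a sketch of the core idea: measure how strongly the motion of two facial regions moves together over time using a Pearson correlation. The motion values and the names `lip_motion` and `jaw_motion` are made-up illustrations, not data from our experiments.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal has no defined correlation; report 0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-frame motion magnitudes for two facial regions.
lip_motion = [0.1, 0.4, 0.9, 0.5, 0.2]
jaw_motion = [0.2, 0.5, 1.0, 0.6, 0.3]  # same pattern, offset by a constant
print(pearson_correlation(lip_motion, jaw_motion))
```

Values near 1 indicate regions that move in lockstep; a swapped face whose mouth lags the underlying jaw would be expected to score lower.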

#### Synchronization Analysis Visualization

![Synchronization t-SNE Visualization](SYNCHRO/tsne_visualization.png)

The t-SNE visualization of temporal synchronization features shows partial clustering with considerable overlap between real and fake videos. While some regions show concentration of same-class points, the classes intermix significantly in other areas. This partial separation suggests that temporal coherence between facial movements provides some discriminative information, but is not sufficient by itself for perfect classification. The mixing indicates that FaceFusion successfully maintains temporal relationships between facial regions that closely mimic those in authentic content, making detection through motion synchronization challenging.

![Synchronization PCA Visualization](SYNCHRO/pca_visualization.png)

The PCA projection shows moderate separation along the principal components of synchronization features, but with extensive overlap in the central region. The distribution suggests that temporal coherence patterns share similarities between real and FaceFusion-generated videos, making complete separation difficult based on these two dimensions alone. The outlier points visible at the extremes of the plot suggest that certain videos exhibit distinctive temporal patterns, but these aren't consistently associated with either class.

### 5. Flicker Inconsistency Detection (FLICKER)

This method focuses on detecting inconsistencies in the flickering patterns between frames. AI-generated videos often have different patterns of brightness and texture changes between consecutive frames compared to authentic videos.
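
As a rough sketch of what a flicker measure can look like, the helper below summarizes absolute brightness changes between consecutive frames. It assumes each frame has already been reduced to a single mean-brightness value; `flicker_features` and the sample values are hypothetical, and our actual features also considered texture changes.

```python
def flicker_features(frame_brightness):
    """Summarize frame-to-frame brightness changes for one video."""
    diffs = [abs(b - a) for a, b in zip(frame_brightness, frame_brightness[1:])]
    if not diffs:
        return {"mean_abs_diff": 0.0, "max_abs_diff": 0.0}
    return {
        "mean_abs_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(diffs),
    }


steady = [120, 121, 120, 121, 120]    # small, regular changes
flickery = [120, 135, 118, 140, 119]  # large swings between frames
print(flicker_features(steady))
print(flicker_features(flickery))
```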

#### Flicker Analysis Visualization

![Flicker Analysis t-SNE Visualization](FLICKER/tsne_visualization.png)

The t-SNE visualization of flicker features shows minimal separation between real and fake videos, with extensive mixing throughout the projection. This substantial overlap suggests that frame-to-frame brightness and texture changes are remarkably similar between real and FaceFusion-generated videos. This finding indicates that FaceFusion has become highly sophisticated at maintaining consistent inter-frame transitions without introducing detectable flickering artifacts. The lack of clear clustering demonstrates that flicker detection alone would be insufficient for reliable classification of modern deepfakes.

## Classification Results for Each Feature Extraction Method

We implemented two primary classification methods to distinguish between real and AI-generated videos:

1. **Support Vector Machines (SVM)**: A robust classification method that finds the optimal hyperplane separating the two classes
2. **Logistic Regression**: A probabilistic classification approach that models the probability of a video being fake

### HOG Feature Classification Results

![HOG Classification Confusion Matrix](HOG/confusion_matrix.png)

The confusion matrix for HOG features shows moderate classification performance with equal misclassifications in both directions. Specifically, 4 videos in each class are classified correctly, with 2 false positives and 2 false negatives, yielding an overall accuracy of 67% (8/12). This balanced error distribution indicates that the classifier doesn't show strong bias toward either class, but also doesn't achieve the high accuracy that would be needed for deployment in sensitive applications. The moderate performance aligns with the limited separation observed in the HOG feature visualizations.

![HOG Feature Importance](HOG/feature_importance.png)

The feature importance plot for HOG shows a relatively flat distribution across many gradient features, with no single feature dominating the classification decision. The most important feature contributes only about 0.11 of the overall importance, with a gradual decline across subsequent features. This distributed importance pattern suggests that the model relies on subtle combinations of gradient features rather than obvious distinguishing characteristics. The mix of feature importances indicates that various gradient properties contribute moderately to the classification, but none are strongly indicative of FaceFusion manipulation.


### Fourier Transform Classification Results

![Fourier SVM Confusion Matrix](FOURIER/svm_confusion_matrix.png)

The SVM confusion matrix for Fourier features shows limited classification performance with substantial misclassifications. Specifically, 43 real videos and 33 AI-generated videos are classified correctly, alongside 27 false negatives (AI-generated videos classified as real) and 17 false positives (real videos classified as AI-generated). This yields an accuracy of 63.3% (76/120). The higher rate of false negatives suggests that the frequency domain patterns in many FaceFusion-generated videos closely resemble those of authentic content, making them difficult to detect.
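
The arithmetic behind these figures can be checked with a few lines. The sketch below recomputes accuracy and per-class recall from the four counts quoted above; `confusion_metrics` and its argument names are illustrative helpers, not functions from our code files.

```python
def confusion_metrics(real_as_real, real_as_fake, fake_as_real, fake_as_fake):
    """Accuracy and per-class recall from the four cells of a 2x2 confusion matrix."""
    total = real_as_real + real_as_fake + fake_as_real + fake_as_fake
    return {
        "accuracy": (real_as_real + fake_as_fake) / total,
        "real_recall": real_as_real / (real_as_real + real_as_fake),
        "fake_recall": fake_as_fake / (fake_as_fake + fake_as_real),
    }


# Fourier + SVM counts quoted above.
metrics = confusion_metrics(real_as_real=43, real_as_fake=17,
                            fake_as_real=27, fake_as_fake=33)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.633, 'real_recall': 0.717, 'fake_recall': 0.55}
```

The gap between the two recall values makes the false-negative bias explicit: about 72% of real videos are recognized, but only 55% of AI-generated ones.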

![Fourier Logistic Regression Confusion Matrix](FOURIER/logisticreg_confusion_matrix.png)

The Logistic Regression confusion matrix for Fourier features shows similar limitations to the SVM approach, with considerable classification errors. The model correctly identifies 35 real videos and 41 AI-generated videos, while misclassifying 25 real videos as AI-generated and 19 AI-generated videos as real. This gives an accuracy of 63.3% (76/120). The error pattern differs slightly from the SVM model, with more balanced mistakes between the classes, suggesting that the probabilistic approach handles the overlapping feature distributions differently.

![Fourier Feature Importance](FOURIER/feature_importance.png)

The feature importance plot for Fourier features shows scattered significance across multiple frequency bands and spectral characteristics. The most important frequency feature contributes approximately 0.085 to the model, with a gradual decline across subsequent features. The modest magnitude of most bars corresponds to the limited discriminative power observed in the classification results. While some frequency components show stronger association with one class over the other, no single feature emerges as definitively discriminative.

### Laplacian Pyramid Classification Results

![Laplacian SVM Confusion Matrix](LAPLACIAN/svm_confusion_matrix.png)

The SVM confusion matrix for Laplacian pyramid features shows moderate classification performance with a significant number of misclassifications. The model correctly identifies 34 real videos and 39 AI-generated videos, while misclassifying 26 real videos as AI-generated and 21 AI-generated videos as real. This yields an accuracy of 60.8% (73/120). The multi-scale analysis captures some distinguishing characteristics between real and FaceFusion-generated videos, but the number of incorrect predictions in both classes indicates limitations in its discriminative power.

![Laplacian Logistic Regression Confusion Matrix](LAPLACIAN/logreg_confusion_matrix.png)

The Logistic Regression confusion matrix for Laplacian features shows performance challenges similar to the SVM approach. The model correctly identifies 38 real videos and 38 AI-generated videos, while misclassifying 22 real videos as AI-generated and 22 AI-generated videos as real. This gives an accuracy of 63.3% (76/120). The balanced error patterns suggest that some FaceFusion-generated videos successfully replicate the multi-scale characteristics of authentic content, creating feature overlap that challenges both classification approaches.

![Laplacian Feature Importance](LAPLACIAN/feature_importance.png)

The feature importance plot for Laplacian pyramid features shows distributed significance across scales and statistical properties. The top two features each contribute approximately 0.05 to the model, with a gradual decline across subsequent features. No single feature dominates the classification decision, with modest contributions from multiple features across different scales. This pattern aligns with the moderate classification performance and suggests that while the multi-scale approach captures relevant information, the discriminative cues are subtle and distributed rather than concentrated in specific scales.

### Synchronization Analysis Classification Results

![Synchronization Confusion Matrix](SYNCHRO/confusion_matrix.png)

The confusion matrix for temporal synchronization features shows moderate classification performance with a notable bias toward false negatives. Specifically, we see 5 real videos correctly classified as real and 2 AI-generated videos correctly classified as AI-generated, alongside 3 false negatives (AI-generated videos classified as real) and 0 false positives. This gives an accuracy of 70% (7/10) on this smaller test set. The higher number of false negatives indicates that many FaceFusion-generated videos achieve temporal coherence similar enough to real videos to fool the classifier.

![Synchronization Feature Importance](SYNCHRO/feature_importance.png)

The feature importance plot for temporal synchronization measures shows the most important feature contributing approximately 0.115 to the model, with a gradual decline across subsequent features. The modest contributions from multiple features rather than dominant individual indicators suggests that while synchronization features capture some relevant differences, these differences are subtle and distributed across multiple aspects of temporal behavior. The top features appear to capture lip-jaw correlations, suggesting that mouth movements may provide the most useful temporal cues for detection.

![SYNCHRO ROC Curve Comparison](SYNCHRO/roc_curve_comparison.png)

The ROC curve shows moderate performance with an area under the curve (AUC) that exceeds random chance but falls significantly short of ideal classification. The curve's distance from the top-left corner indicates limitations in simultaneously achieving high sensitivity and specificity. This performance reflects the fundamental challenge of balancing false positive and false negative rates when the feature distributions show significant overlap between classes. The multiple curves shown represent different feature combinations, with the best performing combination (Top 5 Features, AUC ≈ 0.718) still showing substantial room for improvement.
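
For readers less familiar with AUC, it can be read as the probability that a randomly chosen fake video receives a higher score than a randomly chosen real one. The sketch below computes it that way on made-up scores; `roc_auc` is an illustrative helper and the labels and scores are not from our experiments.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]               # 1 = fake, 0 = real
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]   # hypothetical classifier scores
print(roc_auc(labels, scores))            # 7 of 9 pairs ranked correctly
```

On this reading, an AUC of about 0.718 means that roughly 72 out of 100 fake/real pairs would be ordered correctly.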

## Advanced Neural Network Approach: Vision Transformer (ViT)

In addition to the traditional computer vision features, we experimented with a Vision Transformer (ViT) approach to directly classify videos without explicit feature engineering. The ViT model processes sequences of video frames and learns to identify patterns characteristic of AI manipulation.

### Vision Transformer Results

![ViT Training Loss](VIT/loss_plot.png)

The ViT training loss curve shows a steady decrease from approximately 0.715 to 0.690 that eventually levels off, with the validation loss generally following but maintaining a gap above the training loss. This persistent gap suggests some degree of overfitting despite the overall learning progress. The training loss shows more volatility than the validation loss, with several spikes and dips, indicating potential challenges in optimization. The convergence pattern indicates that while the model is learning useful patterns, it may be partially memorizing training examples rather than fully generalizing to novel videos.
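
One useful reference point for these loss values, assuming the plotted loss is binary cross-entropy with the natural logarithm (the loss function is not stated here, so this is an assumption): a model that always predicts 0.5 scores ln 2, about 0.693. The sketch below computes that baseline and the average true-class probability implied by a loss of 0.690.

```python
import math


def binary_cross_entropy(p_true_class):
    """Loss for one example given the probability assigned to its true class."""
    return -math.log(p_true_class)


chance_loss = binary_cross_entropy(0.5)
print(round(chance_loss, 3))  # 0.693

# Geometric-mean probability assigned to the true class at a loss of 0.690.
implied_probability = math.exp(-0.690)
print(round(implied_probability, 3))  # 0.502
```

Under that assumption, the final training loss sits only slightly below the chance baseline, which is consistent with the near-chance accuracy reported for the ViT.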

![ViT Training Accuracy](VIT/accuracy_plot.png)

The accuracy plot shows improvement over training epochs but with significant fluctuations and inconsistent patterns between training and validation performance. Training accuracy improves from approximately 0.45 to 0.53, while validation accuracy shows peaks but no consistent upward trend, fluctuating between approximately 0.48 and 0.54. This unstable learning trajectory suggests that the network struggles to develop robust discriminative representations for the subtle differences between real and FaceFusion-generated videos. The lack of clear improvement in validation accuracy by the final epochs indicates limitations in how well the learned patterns generalize to unseen examples.

![ViT Confusion Matrix](VIT/confusion_matrix.png)

The confusion matrix for the ViT model shows limited classification performance despite its architectural complexity. The model correctly identifies 18 real videos and 9 fake videos, while misclassifying 7 real videos as fake and 16 fake videos as real. This yields an accuracy of 54% (27/50), which is only slightly better than random chance. The high rate of false negatives (fake videos classified as real) indicates that the model struggles to identify distinctive patterns in FaceFusion-generated content, despite leveraging complex spatial and temporal attention mechanisms that should theoretically capture subtle manipulation artifacts.

## Generalizability Considerations

A critical limitation of our study is the modest sample size, which raises concerns about generalizability. With only 12 samples for HOG evaluation, 10 for synchronization analysis, and 50-120 for other methods, our findings should be interpreted cautiously. Moreover, our dataset consists exclusively of FaceFusion-generated deepfakes from the DeepSpeak v2 dataset, which may not represent the full spectrum of deepfake technologies currently available. 
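
To make the sample-size concern concrete, the sketch below computes a 95% Wilson score interval for an accuracy estimated from a given number of test videos. The helper `wilson_interval` is an illustration added here, not an analysis we ran as part of the project; the counts are the ones reported in our confusion matrices.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


for name, correct, total in [("SYNCHRO", 7, 10), ("HOG", 8, 12), ("Fourier SVM", 76, 120)]:
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct}/{total} -> [{low:.2f}, {high:.2f}]")
```

For the 10-video synchronization test set, the interval runs from roughly 0.40 to 0.89, so the 70% figure cannot be distinguished from chance on this evidence alone.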

The consistent challenge across all methods is that even our best performance (70% accuracy) falls significantly short of what would be required for real-world deployment, especially in high-stakes contexts like legal evidence or news verification. These limitations are particularly concerning given that deepfake technologies continue to evolve rapidly, likely outpacing the detection approaches evaluated here.

We did not evaluate across multiple deepfake generation methods, which would be essential to determine whether these findings generalize beyond FaceFusion to other face-swapping or manipulation techniques. Future work should incorporate a more diverse dataset spanning multiple generation algorithms, video qualities, and manipulation types to build more robust and generalizable detection systems.

## Overall Performance Comparison

Comparing the performance of our different feature extraction methods and classification approaches reveals varying degrees of effectiveness in distinguishing between real and FaceFusion-generated videos. Based on our confusion matrices, we can determine the actual accuracies:

- HOG + SVM: 67% (8/12 correct classifications)
- Fourier + SVM: 63.3% (76/120 correct classifications)
- Fourier + LogReg: 63.3% (76/120 correct classifications)
- Laplacian + SVM: 60.8% (73/120 correct classifications)
- Laplacian + LogReg: 63.3% (76/120 correct classifications)
- SYNCHRO: 70% (7/10 correct classifications, smaller sample)
- ViT: 54% (27/50 correct classifications)

Notably, none of the methods achieved high accuracy, with even our best-performing approach (Temporal Synchronization) reaching only 70% on a test set of 10 videos, which is too small for the result to be statistically meaningful. The traditional feature extraction methods (HOG, Fourier, Laplacian) performed similarly in the 60-67% range, while the more complex Vision Transformer model unexpectedly underperformed at 54%. This generally modest performance across all methods highlights the significant challenge posed by modern deepfake detection.

Our project demonstrates that detecting FaceFusion-generated videos remains a challenging problem requiring multiple complementary approaches. Among our methods, the Temporal Synchronization approach achieved the best results on a small sample, with traditional feature extraction methods showing moderate effectiveness. The substantial overlap between classes across most feature spaces highlights the increasing sophistication of AI generation techniques.

Each feature extraction method revealed different aspects of the detection challenge:

- HOG features captured gradient and edge information with moderate success (67% accuracy), but showed significant overlap between classes in the visualization space.
- Laplacian pyramid decomposition provided multi-scale analysis that revealed some discriminative patterns (60-63% accuracy), but many fake videos still resembled real ones across scales.
- Temporal synchronization analysis demonstrated potential (70% accuracy on a small sample), but showed that modern AI can increasingly maintain convincing temporal relationships.
- Fourier transform analysis showed substantial overlap in frequency patterns between real and AI content (63% accuracy), indicating sophisticated frequency domain replication.
- The Vision Transformer approach underperformed expectations (54% accuracy), suggesting that even complex deep learning architectures struggle with the subtle artifacts in modern deepfakes.

The consistent finding across all methods is that no single approach provides robust detection, as FaceFusion has become increasingly sophisticated at replicating authentic visual characteristics across multiple domains. The limited accuracy across all approaches (54-70%) demonstrates the fundamental challenge facing deepfake detection systems.

## Contributions

- **Abhijith Varma Mudunuri**: Feature extraction implementation (HOG, Fourier), dataset preparation, classification modeling, experimental design, ViT development
- **Atharva Jayesh Patel**: Laplacian pyramid implementation, experimental design, ViT development, SYNCHRO and FLICKER methods, analysis and visualization

Dataset: Faridlab DeepSpeak v2

AI aid: Claude 3.7 Sonnet was used throughout the code files. Its aid was not limited to specific sections that we could cite individually; we used it as needed, and its contributions are interwoven throughout the code.