<a href="https://colab.research.google.com/github/kingloogie/QTM-347-Machine-Learning-Final-Project/blob/main/Final_Documentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">

#  Detecting AI-Generated Faces Using ResNet50: A Transfer Learning Approach

**Authors:** Emily Ni, Aaron Tse, Alan Yang, Cynthia Zhang
</div>


##  Abstract

In this project, we investigate whether machine learning models can reliably distinguish between real and AI-generated face images using a limited dataset of 1800 images. We apply ResNet50 and conduct three phases of training: freezing the base model, full fine-tuning, and regularized partial fine-tuning. Our final model, which unfreezes the top 30 layers and incorporates dropout, L2 regularization, and early stopping, achieves a validation accuracy of 97.5%, performing competitively with existing models trained on significantly larger datasets. We further employ Vision Transformers on our dataset, attaining a perfect classification accuracy of 100%, as evidenced by the results of our confusion matrix analysis. These results demonstrate the viability of using transfer learning for AI-generated image detection even under constrained data settings.

## Introduction

In today's digital age, AI-generated face images have become increasingly realistic and accessible. Synthetic faces are now frequently exploited for malicious purposes, including online scams, fake identities, and the spread of misleading information on social media.

This project addresses the growing concerns about AI-generated faces by applying machine learning methods to systematically distinguish real human faces from AI-generated ones. Solving this problem is crucial for maintaining authenticity in applications such as social platforms, hiring processes, and biometric verification systems. By enhancing our ability to detect AI-generated images, we contribute to protecting individuals’ personal, financial, and even emotional security. Furthermore, this study provides an opportunity to compare the performance of different deep learning architectures, specifically ResNet50 and Vision Transformers. This comparative analysis could offer practical insights into building more robust and generalizable detection systems.

To solve this problem, we first considered three families of models: plain convolutional neural networks (CNNs), ResNet-50, and Vision Transformers (ViT). While plain CNNs are well known for their ability to extract local features, we soon realized that they are limited when it comes to modeling complex structures and capturing long-range dependencies. This made them less suitable for our task, which requires not only detecting subtle local details but also understanding the global structure of an image.

Then we decided to focus on comparing ResNet-50 and Vision Transformers. ResNet-50, as shown in previous studies like Keswani (2023), effectively captures fine-grained local features, which could help identify semantic differences between real and generated images. On the other hand, ViT is designed to capture global structure and long-range relationships. According to Malviya et al. (2025), ViT-based models have achieved state-of-the-art performance in detecting AI-generated images, especially from powerful generators such as Stable Diffusion and DALL·E 3.

In our experiments, both models worked reasonably well. However, Vision Transformers slightly outperformed ResNet-50, likely because they can better capture global context — an important factor in distinguishing AI-generated images that may look realistic locally but unnatural globally.

The key components of our approach include dataset preparation, model training, evaluation, and analysis. After running both models separately and comparing their performance, we found that while both have their strengths, ViT showed a marginal advantage. Nevertheless, a limitation of our study is the relatively small sample size, and future work with larger datasets may provide deeper insights.

## Vision Transformer Setup


**Dataset Selection and Preparation:**
We began by selecting the 130k Real vs. Fake Face dataset from Kaggle, which contains a large collection of labeled images, including both real photographs and fake images generated by artificial intelligence models. Due to the constraints of our computational resources, we selected a balanced subset of approximately 1,800 images—consisting of ~900 real and ~900 fake samples. This subset was chosen to maintain a diverse representation of image types while keeping the training process computationally manageable.
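The subset selection described above can be sketched as follows. This is a minimal illustration, not the exact procedure we ran: the function name `balanced_subset`, the seed, the label convention, and the file names are all hypothetical.

```python
import random

def balanced_subset(real_paths, fake_paths, per_class=900, seed=42):
    """Draw the same number of images from each class, reproducibly."""
    rng = random.Random(seed)
    n = min(per_class, len(real_paths), len(fake_paths))
    real = rng.sample(real_paths, n)
    fake = rng.sample(fake_paths, n)
    # Label convention for this sketch only: 0 = real, 1 = fake.
    return [(p, 0) for p in real] + [(p, 1) for p in fake]

# Hypothetical file names standing in for the Kaggle images.
real_pool = [f"real/{i:05d}.jpg" for i in range(5000)]
fake_pool = [f"fake/{i:05d}.jpg" for i in range(5000)]

subset = balanced_subset(real_pool, fake_pool)
print(len(subset))  # 1800
```

Sampling without replacement from each class separately keeps the subset balanced regardless of the class ratio in the full dataset.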

**Environment Setup:**
The project environment was configured in Google Colab, leveraging its GPU capabilities to accelerate training. We uploaded the cleaned dataset to Google Drive and ensured that the directory structure matched the requirements of Keras data pipelines. Specifically, separate folders were created for each class (“real” and “fake”), which is critical for Keras’ ImageDataGenerator to automatically label the images during data loading. This step streamlined the data ingestion process and helped avoid manual labeling errors.

**ViT Model Configuration and Parameters:**
For the Vision Transformer model, we set the following training parameters:

- Batch size: the number of samples processed before the model updates its weights.
- Learning rate: the step size taken at each iteration when minimizing the loss function.
- Number of epochs: the number of times the entire dataset is passed through the model.
- Loss function and optimization steps: how the model measures error and adjusts its weights.

These parameters play an essential role in optimizing model performance and achieving the best possible generalization on unseen data.

**Dataset Validation:**
Before initiating model training, we conducted a validation check on the dataset. This included verifying the folder structure, confirming class balance, and visually inspecting random samples from each category. As illustrated in our presentation slides, sample images from the “real” and “fake” classes were displayed to ensure data integrity and to provide a qualitative sense of the classification challenge.
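A check of this kind could look like the sketch below. It assumes one sub-folder per class, as described in the environment setup; the helper names, the accepted file suffixes, and the 5% balance tolerance are illustrative choices rather than values taken from our notebook. The demo runs on a throwaway directory so that it is self-contained.

```python
from pathlib import Path
import tempfile

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def count_images_per_class(root, classes=("real", "fake")):
    """Return {class_name: image_count}; fail loudly if a class folder is missing."""
    counts = {}
    for name in classes:
        folder = Path(root) / name
        if not folder.is_dir():
            raise FileNotFoundError(f"missing class folder: {folder}")
        counts[name] = sum(
            1 for f in folder.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts

def is_balanced(counts, tolerance=0.05):
    """True when the smallest class is within `tolerance` of the largest."""
    largest, smallest = max(counts.values()), min(counts.values())
    return largest > 0 and (largest - smallest) / largest <= tolerance

# Demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name, n in (("real", 4), ("fake", 4)):
        (Path(tmp) / name).mkdir()
        for i in range(n):
            (Path(tmp) / name / f"{i}.jpg").touch()
    demo_counts = count_images_per_class(tmp)
    print(demo_counts, is_balanced(demo_counts))
```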



**Model Architecture – Vision Transformer (ViT):**
The Vision Transformer architecture is adapted from the transformer models originally developed for natural language processing. Instead of analyzing word sequences, ViTs process images by:

1. Dividing input images (224 × 224 pixels) into smaller fixed-size patches (e.g., 16 × 16 pixels).

2. Flattening these patches and converting them into embeddings—numerical representations the model can interpret.

3. Adding positional embeddings to preserve spatial information.

4. Passing the embedded patches through a stack of transformer encoder layers that apply multi-head self-attention to learn relationships across all patches.

5. Using the output of a special classification token ([CLS] token) to make the final prediction.

This architecture allows the model to capture both local and global features, making it highly effective for tasks like fake image detection.
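To make steps 1 and 2 concrete, the sketch below works out the token shapes for a 224 × 224 input with 16 × 16 patches and shows the patch-splitting operation on a tiny single-channel grid. The three-channel (RGB) input is an assumption, and both function names are hypothetical; a real ViT implementation performs the same operation on tensors.

```python
def vit_token_shapes(image_size=224, patch_size=16, channels=3):
    """Return (number of patches, flattened patch length, sequence length with [CLS])."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim, num_patches + 1  # +1 for the [CLS] token

def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches

print(vit_token_shapes())  # (196, 768, 197)

tiny = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(split_into_patches(tiny, 2))
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Each 224 × 224 image therefore becomes a sequence of 196 patch embeddings plus one [CLS] token, and self-attention operates over all 197 positions at once.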



**ViT Implementation Pipeline:**

1. Splitting the dataset into training and test sets (80/20 split).

2. Using ImageDataGenerator with data augmentation (e.g., random flips, rotations, zoom) to increase model robustness and reduce overfitting.

3. Converting all images to 224 × 224 pixels and preparing batches compatible with ViT input specifications.

4. Loading a pretrained ViT model from Hugging Face, compiling it with an appropriate optimizer and loss function, and preparing it for fine-tuning on our dataset.
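Step 1 of the pipeline can be illustrated with a small stratified split. Whether our original split was stratified by class is not recorded here, so treat the stratification, the seed, and the function name `stratified_split` as assumptions of this sketch; the file names are placeholders.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (path, label) pairs into train/test, preserving per-class proportions."""
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Placeholder samples: 900 real (label 0) and 900 fake (label 1).
samples = [(f"real/{i}.jpg", 0) for i in range(900)]
samples += [(f"fake/{i}.jpg", 1) for i in range(900)]

train_set, test_set = stratified_split(samples)
print(len(train_set), len(test_set))  # 1440 360
```

Splitting before augmentation matters: augmented copies of a training image should never end up in the test set.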

## Vision Transformers Results

We conducted a thorough evaluation of Vision Transformers (ViT) applied to our dataset, following a two-phase approach: baseline testing and final performance assessment.

**Baseline Evaluation:**

To establish a baseline, we began with untuned Vision Transformer evaluations across multiple runs. Using a pretrained ViT model, we ran repeated tests to gauge the stability of performance. The key observations were:

- Multiple runs: The pretrained model was evaluated several times to assess its consistency.
- Stability metric: The average accuracy across runs was 0.5748 ± 0.0054.

This initial figure provided a benchmark for comparison once the model was fine-tuned and trained on our specific dataset.
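A summary of the form "mean ± spread" can be computed as in the sketch below. The per-run accuracies are made-up placeholders, not our measured values, and the use of the sample standard deviation is an assumption about how the spread was computed.

```python
from statistics import mean, stdev

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated evaluation runs."""
    return mean(accuracies), stdev(accuracies)

def format_summary(accuracies):
    avg, spread = summarize_runs(accuracies)
    return f"{avg:.4f} ± {spread:.4f}"

# Hypothetical accuracies from five evaluation runs (illustrative only).
runs = [0.570, 0.580, 0.575, 0.580, 0.570]
print(format_summary(runs))
```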



**Initial Training and Testing:**

We then carried out initial training using a carefully prepared subset of the dataset. Over a span of 3 epochs, both training and validation accuracy showed a marked improvement:

- **Epoch 1**
  - Train Accuracy: 96.74%
  - Validation Accuracy: 99.43%

- **Epoch 2**
  - Train Accuracy: 98.95%
  - Validation Accuracy: 99.72%

- **Epoch 3**
  - Train Accuracy: 99.79%
  - Validation Accuracy: 100%

The model performed unexpectedly well. The accuracy curve showed rapid convergence, with the validation set reaching perfect accuracy by the third epoch.

**Final Result:**

Confusion Matrix Analysis: For the final evaluation, we analyzed the model's performance using a confusion matrix and detailed classification metrics:

- Overall Accuracy: 100%
- Macro Average (Precision, Recall, F1): 1.00

The confusion matrix illustrates perfect classification with no misclassifications across the test set. Precision, recall, and F1-scores for both classes (real and fake) achieved the maximum possible value of 1.00, reflecting flawless performance.
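For reference, the metrics above follow directly from the confusion matrix. The sketch below derives per-class precision, recall, and F1, plus overall accuracy and macro F1, from a matrix whose rows are true classes and whose columns are predicted classes. The counts are hypothetical (a perfectly classified 360-image test set), and `classification_report` here is a hand-written helper, not the scikit-learn function of the same name.

```python
def classification_report(matrix, class_names):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    report = {}
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    for i, name in enumerate(class_names):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    report["accuracy"] = correct / total
    report["macro_f1"] = sum(report[n]["f1"] for n in class_names) / len(class_names)
    return report

# Hypothetical counts for a perfectly classified test set.
perfect = classification_report([[180, 0], [0, 180]], ["real", "fake"])
print(perfect["accuracy"], perfect["macro_f1"])  # 1.0 1.0
```

A matrix with zeros off the diagonal is the only way to obtain 1.00 on every metric at once, which is why the confusion matrix is a useful cross-check on a reported 100% accuracy.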


## ResNet 50 Setup

For the ResNet50 model, we used the same dataset as for ViT and built and trained the model in Python 3.

Parameters:

- L2 regularization: 1e-4
- Learning rate: 1e-5
- Early stopping patience: 3 epochs
- Fine-tuning epochs: maximum of 7

Structure:

- Backbone: ResNet-50 with `include_top=False`
- `GlobalAveragePooling2D` layer
- Dropout layer with a rate of 0.6
- Dense layer (128 units, ReLU activation)
- Dense output layer (1 unit, sigmoid activation)
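As a rough sense of scale, the classification head listed above adds only a small number of trainable weights on top of the backbone. The arithmetic below assumes the standard ResNet-50 feature map with 2048 channels feeding the pooling layer; the helper `dense_params` is hypothetical.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

BACKBONE_CHANNELS = 2048  # channels in the final ResNet-50 feature map

head = [
    ("GlobalAveragePooling2D", 0),  # pooling has no weights
    ("Dropout(0.6)", 0),            # dropout has no weights
    ("Dense(128, relu)", dense_params(BACKBONE_CHANNELS, 128)),
    ("Dense(1, sigmoid)", dense_params(128, 1)),
]

for name, count in head:
    print(f"{name:<24} {count:>8,}")

head_total = sum(count for _, count in head)
print(f"{'total':<24} {head_total:>8,}")  # total 262,401
```

With the backbone frozen, these roughly 262 thousand parameters are the only ones being trained, which helps explain why the first training stage plateaus at a modest accuracy.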

## ResNet 50 Results

#### Main Results

We conducted three stages of training using ResNet50, with performance improving at each step:

- **Initial Training (Frozen Base)**  
  - Training Accuracy: 63.9%
  - Validation Accuracy: 64.2%
  - The model began learning basic differences between real and fake faces but was limited by the frozen ResNet50 backbone.

- **Full Fine-Tuning (All Layers Unfrozen)**  
  - Training Accuracy: 99.6%
  - Validation Accuracy: 57.1%
  - The model overfit to the training data, showing poor generalization.

- **Final Fine-Tuning (Top 30 Layers Unfrozen + Regularization)**  
  - Training Accuracy: 95.8%  
  - Validation Accuracy: 97.5%
  - Validation Loss: 0.1086
  - Regularized fine-tuning successfully improved generalization and addressed the overfitting problem.

<br>

#### Supplementary Results

**Model Architecture**  
- Base model: `ResNet50(weights='imagenet', include_top=False)`  
- Classification head: `GlobalAveragePooling → Dropout → Dense(128 ReLU) → Sigmoid`

**Training Strategy**  
- **Step 1**: Frozen base, trained classification head only  
  - Optimizer: `Adam`, Learning rate: `1e-4`, Epochs: 7  
- **Step 2**: Unfroze all layers → severe overfitting  
  - Optimizer: `Adam`, Learning rate: `1e-5`, Epochs: 7  
- **Step 3**: Unfroze top 30 layers  
  - Applied `Dropout(0.6)` and `L2 regularization (1e-4)`  
  - Optimizer: `Adam`, Learning rate: `1e-5`  
  - Early stopping with `patience=3`
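The partial unfreezing in Step 3 amounts to toggling a per-layer `trainable` flag. The sketch below mimics that logic with plain Python objects so it runs without a deep-learning framework; the `Layer` class, the function name, and the layer count of 175 are stand-ins for illustration, not values read from our model.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    trainable: bool = True

def unfreeze_top(layers, n_top):
    """Freeze every layer except the last `n_top` (modifies the list in place)."""
    cutoff = max(len(layers) - n_top, 0)
    for index, layer in enumerate(layers):
        layer.trainable = index >= cutoff
    return layers

# Stand-in backbone with an illustrative number of layers.
backbone = [Layer(f"layer_{i}") for i in range(175)]
unfreeze_top(backbone, 30)

print(sum(layer.trainable for layer in backbone))  # 30
```

In Keras the equivalent loop would run over `base_model.layers`; keeping the lower layers frozen preserves the generic ImageNet features while letting the highest-level features adapt to the face data.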

<br>

#### Summarized Findings and Parameter Choice
In our experiments using ResNet50 to classify real versus AI-generated face images, we tested three training strategies. The most important result came from the final model, where we partially unfroze the top 30 layers of ResNet50, added dropout and L2 regularization, and applied early stopping. This model achieved a training accuracy of 95.8% and a validation accuracy of 97.5%, along with a low validation loss of 0.1086. These outcomes indicate strong generalization and confirm that this balanced approach was highly effective. In contrast, training the model with a frozen base yielded only around 64% accuracy, and full fine-tuning without regularization caused severe overfitting, with training accuracy reaching 99.6% but validation accuracy dropping to 57.1%.

To support these results, we made careful parameter choices at each stage. We started with a frozen ResNet50 base to avoid overfitting early on and used the Adam optimizer with a learning rate of 1e-4. During full fine-tuning, we lowered the learning rate to 1e-5 but observed overfitting due to the absence of regularization. In the final model, we corrected this by unfreezing only the top 30 layers and introducing Dropout (0.6) and L2 regularization (1e-4). Early stopping with patience=3 helped prevent further overfitting. All input images were resized, rescaled, and augmented with flips, small rotations, and zooms to increase robustness on our modest dataset (exactly the same data processing methodology as for the Vision Transformer, to facilitate comparison). These parameters enabled our model to perform effectively despite the dataset's limited size.
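The early-stopping rule with `patience=3` can be stated precisely in a few lines. The sketch below assumes validation loss is the monitored quantity and that any decrease counts as an improvement; the loss values are invented for illustration and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best loss), 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Invented validation losses over a maximum of 7 epochs.
losses = [0.60, 0.40, 0.30, 0.32, 0.31, 0.33, 0.29]
print(early_stopping_epoch(losses))  # (6, 3)
```

In this example training halts after epoch 6 because epochs 4 to 6 all fail to beat the loss from epoch 3, so the later improvement at epoch 7 is never reached. That trade-off is inherent to a small patience value.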

## Discussion

Our final ResNet50 model achieved a validation accuracy of 97.5%, which is highly competitive given the dataset size and the simple three-step preprocessing (resizing, rescaling, and augmentation). This result aligns well with existing literature, where ResNet-based models trained on large datasets such as CelebA or DeepFakeDetection typically report accuracy in the 90–98% range for binary face classification tasks. Despite using a much smaller dataset (~1,800 images) and simpler augmentation techniques, our model achieved performance at the high end of this benchmark, suggesting that our fine-tuning strategy was both effective and efficient.

## Conclusion

We were deeply surprised by the strong performance achieved by both our ResNet and Vision Transformer (ViT) models. Given that our dataset is relatively small, we did not expect such robust results. To investigate further, we also trained a simple convolutional neural network from scratch; to our amazement, it too reached a high degree of accuracy. This consistency across architectures led us to suspect that the underlying classification task may be inherently straightforward or that our cleaned dataset contains very clear, easily distinguishable features. In other words, the simplicity of the data itself may have driven much of the model performance.

In summary, we began by identifying and curating a large collection of images, meticulously cleaning and preprocessing them to ensure consistency. We then dove into foundational deep-learning research to understand the inner workings of ResNet and ViT, tuning their hyperparameters to align with our dataset's characteristics. Both models outperformed our initial expectations, which we believe shows how much both modern architectures and dataset complexity contribute to model success.

## References
1. https://arxiv.org/abs/2503.18812
2. https://medium.com/%40hridaykeswani/detection-of-ai-generated-images-using-rich-and-poor-texture-contrast-fc2024e3e716
3. https://www.kaggle.com/datasets/shreyanshpatel1/130k-real-vs-fake-face
4. Li, Y., Yang, X., Sun, P., Qi, H., & Lyu, S. (2020). Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr42600.2020.00327
5. Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., & Niessner, M. (2019). FaceForensics++: Learning to Detect Manipulated Facial Images. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 1–11. https://doi.org/10.1109/iccv.2019.00009