MesoNet - A Deepfake Detector Built Using Python and Deep Learning

The problem of misinformation has concerned me for a long time. Having witnessed the drastic effects of it in both my country and elsewhere, I think my concerns are rightly placed.

Here, I make my small attempt in doing something about it.

Table of Contents

  1. Introduction
  2. Approach
     2.1. The Code
     2.2. The Model
     2.3. The Data
     2.4. Requirements
  3. Results
     3.1. Best Model
     3.2. Next Best Model
     3.3. Notes on "best"
  4. Documentation
  5. References

1. Introduction

This project is part of the requirements to finish my Bachelor's degree in Computer Science (2017-2021).

It aims to demonstrate a solution to a small part of the misinformation problem. In particular, I describe here my approach to implementing MesoNet, a CNN-based DeepFake detector first presented in a 2018 paper by Darius Afchar (Github) et al. [1]. The official implementation (without any training code) is available here.

The overall project consists of three parts:

You're currently reading about Part 1.

2. Approach

2.1. The Code

The main focus in constructing and training the model was to make it modular and portable. A secondary focus was to make it easy for you to use MesoNet without tinkering with the code. With these objectives in mind, the code has been broken up into two packages:

  • mesonet - This is the main package, containing the modules used to build MesoNet variants. While it is not currently set up as a PyPI package, you can copy the directory into your project to get everything needed to build, train and obtain predictions from MesoNet.
  • cli - This package provides a command line interface (CLI) that can be used to both train MesoNet and obtain predictions from it, without touching the code. Currently, it only supports training the architecture detailed in the paper. The mesonet.py file provides an entrypoint to the CLI.

2.2. The Model

The model, as mentioned above, is based on a paper published by Darius Afchar et al. in 2018 [1]. It is a binary classifier built as a relatively shallow Convolutional Neural Network (CNN), trained to classify images into one of two classes. One class refers to "real" images (images of real people) and the other refers to "fake" images (images generated by DeepFake AI).

Note: The actual names of the classes are arbitrary and can be set according to your wishes.

By default, the CLI works with the architecture detailed in the paper, which is as follows:

  • 256 x 256 x 3 input layer, with pixel values rescaled by 1/255 and augmentations applied to the input.
  • Convolutional layer with 8 filters, 3 x 3 in size and stride of 1, followed by a max pooling layer of size 2 x 2.
  • Convolutional layer with 8 filters, 5 x 5 in size and stride of 1, followed by a max pooling layer of size 2 x 2.
  • Two convolutional layers with 16 filters, 5 x 5 in size and stride of 1, the first followed by a 2 x 2 max pooling layer and the second by a 4 x 4 max pooling layer.
  • Fully-connected layer with 16 units.
  • Fully-connected output layer with 1 unit and sigmoid activation.
[Figure: the Meso-4 architecture. Source: [1]]

This leads to a modest 27,977 trainable parameters for the model.

While this architecture is closely followed, experiments with various activation functions have been carried out, and the code is designed so that switching the activation function for the entire model is extremely convenient. Specifically, in addition to the standard ReLU activation, ELU [2] and LeakyReLU [3] have also been tried.

ReLU is the activation function of choice, since no apparent problem of dead neurons was observed in practice. Additionally, there is a LeakyReLU activation after the fully-connected 16-unit layer; there is no particular reason for this other than that it is what the paper uses.

Additionally, some modern-day conventional practices have been added to the model. Specifically, the following two practices have been adopted:

  • Batch Normalization - Batch Normalization is added after each convolutional layer to improve convergence speed and to combat overfitting.
  • Dropout [4] - Dropout is added after the fully-connected 16-unit layer to combat overfitting.
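
Putting the above together, the model can be sketched in Keras roughly as follows. This is a minimal illustration based on the description above, not the actual code in the mesonet package; the Dropout rate, the LeakyReLU slope and the use of "same" padding are assumptions:

```python
from tensorflow.keras import layers, models

def build_meso4(activation="relu", dropout_rate=0.5):
    """Sketch of the Meso-4 architecture described above.

    `activation` can be any Keras-recognised name ("relu", "elu", ...),
    mirroring how the code lets you swap the activation for the whole model.
    """
    inputs = layers.Input(shape=(256, 256, 3))  # channels-last 256 x 256 RGB input
    x = inputs

    # Four conv blocks: Conv -> BatchNorm -> MaxPool.
    for filters, kernel, pool in [(8, 3, 2), (8, 5, 2), (16, 5, 2), (16, 5, 4)]:
        x = layers.Conv2D(filters, kernel, padding="same", activation=activation)(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=pool, padding="same")(x)

    # Classifier head: FC(16) with LeakyReLU, Dropout, then a sigmoid output.
    x = layers.Flatten()(x)
    x = layers.Dense(16)(x)
    x = layers.LeakyReLU(alpha=0.1)(x)   # slope of 0.1 is an assumption
    x = layers.Dropout(dropout_rate)(x)  # rate of 0.5 is an assumption
    outputs = layers.Dense(1, activation="sigmoid")(x)

    return models.Model(inputs, outputs, name="meso4")

model = build_meso4()
model.summary()  # should report roughly the 27,977 trainable parameters quoted above
```

Compiled with binary cross-entropy, this trains as a standard binary classifier.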

2.3. The Data

The models provided in this repo are trained on the DeepFake dataset, collected by the authors of the paper. You can download it here.

Note: For some reason, the downloaded dataset's training samples are in a folder called train:test. You might face issues when unzipping this. Rename the folder to train.

It contains a training set and a test set. The overall directory structure is as follows:

└── data/
    ├── train/
    │   ├── real/
    │   │   ├── img1.png
    │   │   └── img2.png
    │   └── df/
    │       ├── img1.png
    │       └── img2.png
    └── validation/
        ├── real/
        │   ├── img1.png
        │   └── img2.png
        └── df/
            ├── img1.png
            └── img2.png

Note: df is short for deepfake.

Note: The paper refers to the test set as the validation set, following the terminology common in 2018; hence the validation directory name.

Note: The names of the classes can be changed by renaming the real and df directories.
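
For illustration, a directory laid out like this can be consumed directly with Keras' ImageDataGenerator. This is only a sketch of the idea; the actual loading code lives in the mesonet package and may differ:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)  # mirror the 1/255 rescaling mentioned above

train_data = datagen.flow_from_directory(
    "data/train",            # one sub-directory per class (real/, df/)
    target_size=(256, 256),  # matches the 256 x 256 input layer
    batch_size=32,
    class_mode="binary",     # two classes -> single sigmoid output
)

test_data = datagen.flow_from_directory(
    "data/validation",
    target_size=(256, 256),
    batch_size=32,
    class_mode="binary",
    shuffle=False,           # keep a stable order for evaluation
)
```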

The images of faces have been extracted from publicly available videos on the Internet. According to the paper, 175 videos were downloaded from different online platforms for the fake images. Their duration ranges between 2 seconds and 3 minutes, with a minimum resolution of 854 x 480 px. They have been compressed using the H.264 codec, but at different compression levels. More details on dataset collection are available in the paper.

The distribution of images is as follows (source):

Set        Forged images   Real images   Total
Training   5111            7250          12361
Test       2889            4259          7148
Total      8000            11509         19509

Note: It may be that TensorFlow doesn't detect all 19,509 images. In my case, it detected 19,457 images. The numbers below are with respect to that.

In practice, though, you may NOT achieve the accuracy claimed in the paper. This could be due to the dataset's split, which places a relatively large share of the images in the test set.

To combat this, the training set can be enlarged using this script. It takes all the images and creates a new dataset by holding back only a random 10% of the data for the test set (instead of the ~36.7% in the original split). This changes the class distribution somewhat arbitrarily, so you can keep re-running the script until you obtain a satisfactory split (a rough sketch of such a script is shown after the notes below). In my run, the distribution used was:

Set        Forged images   Real images   Total
Training   7175            10337         17512
Test       773             1172          1945
Total      7948            11509         19457

Note: It may be the case that the test set is slightly "easier" than the training set. That is, you may notice that your model performs better on the test set by a few points.

Note: The provided script works only on the original dataset, with the structure shown above.
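
For reference, the re-splitting idea can be approximated with a few lines of Python. This is only a sketch of the approach (pool all images per class, then hold back a random 10% for the test set), not the actual script from the repo:

```python
import random
import shutil
from pathlib import Path

def resplit(src="data", dst="data_resplit", test_fraction=0.10, classes=("real", "df")):
    """Pool train/ and validation/ per class, then hold back a random
    `test_fraction` of each class as the new test set."""
    for cls in classes:
        # Gather every image of this class from both original splits.
        images = [path
                  for split in ("train", "validation")
                  for path in sorted((Path(src) / split / cls).iterdir())]
        random.shuffle(images)

        n_test = int(len(images) * test_fraction)
        new_splits = {"validation": images[:n_test], "train": images[n_test:]}

        for split, files in new_splits.items():
            out_dir = Path(dst) / split / cls
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy(f, out_dir / f.name)

if __name__ == "__main__":
    resplit()
```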

Alternatively, you can use any dataset of your choice as long as the directory structure matches the one above.

Sample images are shown below:

[Image grid: sample forged and real images from the training and test sets]

2.4. Requirements

The project has been developed on Python 3.8.8. It is recommended that you use this version to ensure that things do not break.

Other requirements are as follows:

Package        Version
TensorFlow     2.4.1
Matplotlib     3.4.1
Scikit-Learn   0.24.1

Note: Worried about NumPy and the other dependencies? Don't be. They will be installed automatically by pip if you run the standard pip install -r requirements.txt.

3. Results

This section summarizes results from the two pre-trained models provided in trained_models. Here, "best" is in terms of accuracy.

The dataset used to train these models is available here. In both the cases, the default augmentations used in the paper have been applied on the dataset. These are listed here.

Moreover, 20% of the training data was reserved for the validation set. This led to the following distribution of training data:

Set          Forged images   Real images   Total
Training     5740            8270          14010
Validation   1435            2067          3502
Total        7175            10337         17512
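
Reserving 20% of the training data for validation can be done directly in Keras. A minimal sketch, again assuming ImageDataGenerator is used (the repo's training code may handle this differently):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Carve a 20% validation split out of data/train.
datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_data = datagen.flow_from_directory(
    "data/train", target_size=(256, 256), class_mode="binary", subset="training"
)
val_data = datagen.flow_from_directory(
    "data/train", target_size=(256, 256), class_mode="binary", subset="validation"
)
```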

3.1. Best Model

Training was meant to be carried out for 30 epochs with a batch size of 32. In fact, the model was trained for only 18 epochs since the results were already satisfactory.

For this particular model, the number of steps after which one step of decay is applied is calculated dynamically from a decay limit (the lowest allowed learning rate), the decay rate and the number of epochs, rather than being fixed. Using a fixed number made the decay either too slow or too fast; computing it dynamically makes the decay more gradual. This feature wasn't implemented when the second model was trained.

Thus, a learning rate schedule with an initial learning rate of 0.001, a decay rate of 0.10 and a decay limit of 0.000001 was used.
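
A sketch of how such a schedule could be expressed with Keras' ExponentialDecay is shown below. The idea: going from 0.001 to 0.000001 at a decay rate of 0.10 requires exactly three decays, so the decay interval is chosen so that those three decays are spread evenly over the planned training run. The exact formula in the repo may differ, and the optimizer choice here is an assumption:

```python
import math
import tensorflow as tf

# Values taken from the text: 30 planned epochs, batch size 32,
# and (assumed) 14,010 training images from the table above.
initial_lr, decay_rate, lr_floor = 1e-3, 0.10, 1e-6
epochs, batch_size, num_train_images = 30, 32, 14010

steps_per_epoch = math.ceil(num_train_images / batch_size)

# Decays needed to reach the floor: 1e-3 * 0.1**n = 1e-6  =>  n = 3.
num_decays = math.log(lr_floor / initial_lr, decay_rate)

# Spread those decays evenly over the whole run.
decay_steps = int(epochs * steps_per_epoch / num_decays)

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=initial_lr,
    decay_steps=decay_steps,
    decay_rate=decay_rate,
    staircase=True,  # decay in discrete steps rather than continuously
)

optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)  # optimizer choice assumed
```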

Final metrics after 18 epochs are as follows:

             Loss     Accuracy
Train        0.1583   93.53%
Validation   0.2027   92.52%

The loss curve is shown below (blue - validation; orange - train):

On the test set, the model reported an accuracy of 96.25%.

The classification report (generated using sklearn) is as follows:

                   Precision   Recall   F1-Score   Support
Forged Class       0.96        0.94     0.95       773
Real Class         0.96        0.97     0.97       1172
Accuracy                                0.96       1945
Macro Average      0.96        0.96     0.96       1945
Weighted Average   0.96        0.96     0.96       1945
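
For reference, a report like this can be produced with scikit-learn's classification_report. A minimal sketch, assuming model and test_data as in the earlier snippets (with shuffle=False so that the label order is preserved):

```python
from sklearn.metrics import classification_report

# Threshold the sigmoid outputs at 0.5 to get hard class predictions.
probs = model.predict(test_data)
preds = (probs >= 0.5).astype(int).ravel()

# With alphabetical class ordering, "df" (forged) is index 0 and "real" is index 1.
print(classification_report(test_data.classes, preds, target_names=["forged", "real"]))
```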

3.2. Next Best Model

Training was meant to be carried out for 18 epochs with a batch size of 64, using a learning rate schedule with an initial learning rate of 0.001 and a decay rate of 0.10, decayed every 5 epochs. The model was trained for 17 epochs.

While training loss and accuracy were not recorded for later reference (noob, I know), the validation accuracy was ~89%.

The loss curve is shown below:

On the test set, the model reported an accuracy of 90.79%.

The classification report:

                   Precision   Recall   F1-Score   Support
Forged Class       0.90        0.87     0.88       773
Real Class         0.91        0.93     0.92       1172
Accuracy                                0.91       1945
Macro Average      0.91        0.90     0.90       1945
Weighted Average   0.91        0.91     0.91       1945

3.3. Notes on "best"

Looking at the numbers, it does make sense to call the first model the "best" one. Personally, though, I prefer the second model: its numbers are more modest, and its test-set performance is closer to its validation-set performance. Your conclusions are on you. 😃

4. Documentation

The documentation for the mesonet module and details on using the CLI are available in the docs folder and here.

5. References