# Identification of Deepfaked Images (and Videos?)
## By Li Run & Rongyi

## Problem Statement

Within the past year, deepfaked media has risen to prominence all over the world. With real and fake content online now nearly impossible to tell apart, how can we help the average person tell what is real?

Thus the question arises: **Given an image, is it possible to tell if it is deepfaked or not?**

In this project, our aim is to develop an AI model that is capable of **identifying deepfaked images with ≥70% accuracy.**


## Data Collection

We utilized two datasets of images for training our model:
1. https://www.kaggle.com/datasets/manjilkarki/deepfake-and-real-images
2. https://www.kaggle.com/datasets/dagnelies/deepfake-faces

The first dataset contains approximately 70,000 training images, 5,400 test images and 20,000 validation images of faces for each of the Real and Fake classes.

The second dataset contains approximately 95,600 images of faces. The real/fake label of each image is recorded in `metadata.csv`.

## Data Preprocessing

Let us first inspect the contents of `deepfake_faces`.

In [None]:
# Uncomment line below to install tensorflow with cuda 
# !pip install tensorflow[and-cuda] --target=/kaggle/working/cuda-files

In [None]:
# Imports
import pandas as pd
import numpy as np
import tensorflow as tf
from PIL import Image
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import random
from sklearn.utils.class_weight import compute_class_weight
import os, os.path, shutil
from tqdm import tqdm
import cv2
import keras

In [None]:
df = pd.read_csv('/kaggle/input/deepfake-faces/metadata.csv')
df.head()

In [None]:
df[df.videoname == 'aaagqkcdis.mp4']

We took the name of the first image, `aaagqkcdis.jpg`, and looked it up in `metadata.csv`, confirming that the image names correspond to entries in the CSV file. This allows us to label the images ourselves.

In [None]:
df['label'].value_counts()

From here we can see that the Fake:Real ratio in `deepfake_faces` is about 5:1. We handle this class imbalance by taking only 16,000 images from each of the Fake and Real classes. We also need to sort the images into class folders, since they are not organised in the same format as `deepfake-and-real-images`.
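
As an aside, discarding images is not the only way to deal with a 5:1 imbalance. Class weights are a common alternative (the notebook imports `compute_class_weight` but does not use it). The sketch below reproduces the "balanced" heuristic that scikit-learn documents, `n_samples / (n_classes * class_count)`, on a hypothetical label list. The counts are made up for illustration and are not the real contents of `metadata.csv`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical label list with a 5:1 Fake:Real ratio (not the real metadata).
example_labels = ['FAKE'] * 5000 + ['REAL'] * 1000
weights = balanced_class_weights(example_labels)
print(weights)
```

The rarer class receives the larger weight, so each class contributes roughly equally to the loss. We did not take this route here; the notebook balances the classes by subsampling instead.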


In [None]:
# NOTE: Run if output working folder is still empty

FOLDER_PATH = '/kaggle/input/deepfake-faces/faces_224/'
FAKE_PATH = '/kaggle/working/deepfake-faces/Fake'
REAL_PATH = '/kaggle/working/deepfake-faces/Real'

os.makedirs(FAKE_PATH, exist_ok=True)
os.makedirs(REAL_PATH, exist_ok=True)   

realcount = 0
fakecount = 0
for index, row in tqdm(df.iterrows()):
    img_name = row['videoname'].split('.')[0] + '.jpg'
    old_path = FOLDER_PATH + img_name
    if row['label'] == 'REAL':
        if realcount < 16000:
            new_path = os.path.join(REAL_PATH, img_name)
            realcount += 1
        else:
            continue

    else:
        if fakecount < 16000:
            new_path = os.path.join(FAKE_PATH, img_name)
            fakecount += 1
        else:
            continue
    
    shutil.copy(old_path, new_path)

    
print("Categorisation complete")
        

Let us test if our categorisation worked.

In [None]:
plt.figure(figsize=(15,15))
file_names = os.listdir(FAKE_PATH + '/')
for i in range(25):
    idx = random.randrange(len(file_names))  # pick a random fake image; randrange excludes the upper bound
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    
    
    video_name = file_names[idx][:-4] + '.mp4'
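    # cv2 loads images as BGR; convert to RGB so matplotlib shows the true colours
    img = cv2.imread(os.path.join(FAKE_PATH, file_names[idx]))
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    # Redundant, but I want to check whether any real images made it in by some miracle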
    if(df[df.videoname == video_name].iloc[0]['label']=='FAKE'):
        plt.xlabel('FAKE Image')
    else:
        plt.xlabel('REAL Image')
        
plt.show()

In [None]:
plt.figure(figsize=(15,15))
file_names = os.listdir(REAL_PATH + '/')
for i in range(25):
    idx = random.randrange(len(file_names))  # pick a random real image; randrange excludes the upper bound
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    
    
    video_name = file_names[idx][:-4] + '.mp4'
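    # cv2 loads images as BGR; convert to RGB so matplotlib shows the true colours
    img = cv2.imread(os.path.join(REAL_PATH, file_names[idx]))
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    # Redundant, but I want to check whether any fake images made it in by some miracle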
    if(df[df.videoname == video_name].iloc[0]['label']=='FAKE'):
        plt.xlabel('FAKE Image')
    else:
        plt.xlabel('REAL Image')
        
plt.show()

As we can see from above, we have successfully separated the image files into 2 different subdirectories, `deepfake-faces/Real` and `deepfake-faces/Fake`.
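
The visual check above only samples 25 images per folder. A full check is cheap, so here is a minimal sketch of one. The helper name `find_mislabelled` and the demo file names are hypothetical; `label_by_stem` is assumed to map an image name without its extension to its `label` value from `metadata.csv`.

```python
import os
import tempfile

def find_mislabelled(folder, label_by_stem, expected_label):
    """Return file names in `folder` whose metadata label differs from `expected_label`."""
    wrong = []
    for file_name in sorted(os.listdir(folder)):
        stem = os.path.splitext(file_name)[0]
        if label_by_stem.get(stem) != expected_label:
            wrong.append(file_name)
    return wrong

# Self-contained demo on a temporary folder with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['aaa.jpg', 'bbb.jpg', 'ccc.jpg']:
        open(os.path.join(demo_dir, name), 'w').close()
    demo_labels = {'aaa': 'FAKE', 'bbb': 'REAL', 'ccc': 'FAKE'}
    demo_result = find_mislabelled(demo_dir, demo_labels, 'FAKE')

print(demo_result)  # → ['bbb.jpg']
```

In this notebook the mapping could be built with `{name.split('.')[0]: label for name, label in zip(df['videoname'], df['label'])}`, and an empty result for both `FAKE_PATH` and `REAL_PATH` would confirm the categorisation.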

Next, we will organize the data into their appropriate categories before splitting them into training and test/validation data.

This is achieved by first loading the images from `deepfake-and-real-images` as training/validation/test sets, since those have already been organised for us, then adding the images from `deepfake-faces`.

In [None]:
TRAIN_PATH = '/kaggle/input/deepfake-and-real-images/Dataset/Train'
VALIDATION_PATH = '/kaggle/input/deepfake-and-real-images/Dataset/Validation'
TEST_PATH = '/kaggle/input/deepfake-and-real-images/Dataset/Test'

train = tf.keras.utils.image_dataset_from_directory(TRAIN_PATH, labels = 'inferred', image_size=(224,224),)
val = tf.keras.utils.image_dataset_from_directory(VALIDATION_PATH, labels = 'inferred', image_size=(224,224),)
test =  tf.keras.utils.image_dataset_from_directory(TEST_PATH, labels = 'inferred', image_size=(224,224),)

print(train.class_names)

In [None]:
deepfake_faces = tf.keras.utils.image_dataset_from_directory('/kaggle/working/deepfake-faces', labels='inferred', image_size=(224,224),)

print(deepfake_faces.class_names)



In [None]:
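# Keep this shuffle order fixed across iterations; reshuffling every epoch
# lets batches drift between the take/skip splits below.
deepfake_faces = deepfake_faces.shuffle(10, seed=42, reshuffle_each_iteration=False)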

train_size = int(0.7 * len(deepfake_faces))
test_size = int(0.15 * len(deepfake_faces))
val_size = int(0.15 * len(deepfake_faces))

train2 = deepfake_faces.take(train_size)
test2 = deepfake_faces.skip(train_size)
val2 = test2.skip(val_size)
test2 = test2.take(test_size)

train_merged = train.concatenate(train2)
val_merged = val.concatenate(val2)
test_merged = test.concatenate(test2)

# print(train_merged.class_names)
print(len(train_merged), len(train), len(train2))
print(len(train_merged), len(val_merged), len(test_merged))

We are now done with merging our datasets, and can move on to training our model.
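
To make the split arithmetic explicit: `len()` of a batched dataset counts batches, not images, and `int()` truncates, so any leftover batches end up in the validation split under the `take`/`skip` order used above. The sketch below mimics that logic on a plain Python list. The helper `split_batches` and the batch count of 1001 are illustrative assumptions, not values from the notebook.

```python
def split_batches(batches, train_frac=0.7, held_out_frac=0.15):
    """Mimic the take/skip split on a list: train first, then test, the rest is validation."""
    n = len(batches)
    train_size = int(train_frac * n)
    held_out_size = int(held_out_frac * n)  # same size used for test and validation
    train = batches[:train_size]            # like .take(train_size)
    rest = batches[train_size:]             # like .skip(train_size)
    test = rest[:held_out_size]             # like .take(test_size)
    val = rest[held_out_size:]              # like .skip(val_size): gets any remainder
    return train, val, test

train_b, val_b, test_b = split_batches(list(range(1001)))
print(len(train_b), len(val_b), len(test_b))  # → 700 151 150
```

The three pieces are disjoint only if the dataset yields batches in the same order every time it is iterated.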

## Training of Model

### Implementation of Rescaling, Data Augmentation & Callbacks

In [None]:
#Rescaling and Resizing
rescale_and_resize = tf.keras.models.Sequential([
    tf.keras.layers.Resizing(224,224),
    tf.keras.layers.Rescaling(1./255)
])

In [None]:
data_augmentation = tf.keras.models.Sequential([
    tf.keras.layers.RandomRotation(0.2),
])
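
A note on the rotation factor: Keras documents the `RandomRotation` factor as a fraction of a full turn (2π), not as degrees or radians. The short sketch below converts the factor to degrees; the helper name is hypothetical.

```python
import math

def rotation_range_degrees(factor):
    """Convert a RandomRotation factor (fraction of 2*pi) to a maximum angle in degrees."""
    max_radians = factor * 2 * math.pi
    return math.degrees(max_radians)

max_angle = rotation_range_degrees(0.2)
print(round(max_angle, 2))
```

A factor of 0.2 therefore allows rotations of up to about ±72°, which may be more than upright face images need. A smaller factor could be worth trying.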

In [None]:
# class CustomCallback(tf.keras.callbacks.Callback):
#   def on_epoch_end(self, epoch, logs={}):
#     if(logs.get('val_accuracy') >= 0.75):
#       print("Accuracy>=75%. Cancelling training.")
# acc_limit_callback = CustomCallback()


CHECKPOINT_PATH = '/kaggle/working/output-models/best_model.keras'

checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=CHECKPOINT_PATH,
    # Track validation accuracy so the "best" model is not simply the most overfitted one
    monitor='val_sparse_categorical_accuracy',
    mode='max',
    save_best_only=True)


callback_list = [checkpoint_callback]

### Custom functions for testing model

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

def produce_cm(model, test_dataset, class_names=('Fake', 'Real')):
    # Collect labels and predictions in the same pass: the dataset may reshuffle
    # on every iteration, so iterating over it twice could misalign them.
    true_classes = []
    predicted_classes = []
    for images, labels in test_dataset:
        predictions = model.predict_on_batch(images)
        predicted_classes.extend(np.argmax(predictions, axis=1))
        true_classes.extend(labels.numpy())

    # class_names is assumed to follow the alphabetical folder order
    class_names = list(class_names)
    print(classification_report(true_classes, predicted_classes, target_names=class_names))

    cm = confusion_matrix(true_classes, predicted_classes)
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title('Confusion Matrix')
    plt.show()
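
For reference, the precision and recall figures in the classification report can be read directly off the confusion matrix. scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns. The sketch below shows the arithmetic on hypothetical counts, which are not results from this notebook.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall and accuracy from a confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    true_positives = np.diag(cm)
    precision = true_positives / cm.sum(axis=0)  # column sums: everything predicted as the class
    recall = true_positives / cm.sum(axis=1)     # row sums: everything truly in the class
    accuracy = true_positives.sum() / cm.sum()
    return precision, recall, accuracy

# Hypothetical counts for illustration only.
example_cm = [[80, 20],
              [10, 90]]
precision, recall, accuracy = per_class_metrics(example_cm)
print(precision, recall, accuracy)
```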


In [None]:
def test_accuracy(model, test_dataset):
    loss, accuracy = model.evaluate(test_dataset)
    print("Test accuracy:", accuracy)
    

In [None]:
# Helpers for running a model on a single image file
# (numpy and keras are already imported at the top of the notebook)

def load(filename):
    img = keras.preprocessing.image.load_img(filename, target_size = (224, 224))
    img = keras.preprocessing.image.img_to_array(img)
    img = np.expand_dims(img, axis = 0)
    return img

def test_image(model, filename):
    image = load(filename)
    logits = model.predict(image)
    probabilities = np.exp(logits) / np.sum(np.exp(logits))
    print(f'Logits: {logits}')
    print(f'Probabilities: {probabilities}')
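
A note on the softmax in `test_image`: converting outputs to probabilities this way only makes sense for a model that returns raw logits, such as the custom CNN below. A model whose last layer already uses `activation='softmax'` returns probabilities directly. The plain `np.exp(logits)` form can also overflow for large logits. A numerically stable variant, as a minimal NumPy sketch:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Large logits would overflow a naive np.exp, but work here.
probs = softmax([[1000.0, 1001.0]])
print(probs)
```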
    

### Model implementation

### Trying to implement our own CNN

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(224,224,3)),
    
    rescale_and_resize,
    data_augmentation,

    
    tf.keras.layers.Conv2D(32, kernel_size=5, activation='relu'),
    tf.keras.layers.Conv2D(32, kernel_size=5,activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=[2,2], strides=(2,2)),
    
    tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
    tf.keras.layers.Conv2D(64, kernel_size=3,activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=[2,2], strides=(2,2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2), #there are 2 different classes 
])


model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
             )

In [None]:
# history = model.fit(
#     train_merged,
#     validation_data=val_merged,
#     epochs=10,
#     callbacks=callback_list
# )

For our testing, we handpick a few images from either the test dataset or images that have never been introduced before (sourced online).

In [None]:
#idk why this doesnt work lol might be because loading stuff 

# def test_model(model_input):
#     #Screenshot (12) and (13) are both deepfaked images.
#     files = [
#         '/kaggle/input/testimage/Screenshot (12).png', 
#         '/kaggle/input/testimage2/Screenshot (13).png', 
#     ]
    
#     for i in range(10):
#         num = str(random.randint(1,4000))
#         if i%2:
#             file_name = '/kaggle/input/deepfake-and-real-images/Dataset/Test/Real/real_' + num + '.jpg'
#         else: 
#             file_name = '/kaggle/input/deepfake-and-real-images/Dataset/Test/Fake/fake_' + num + '.jpg'
#         files.append(file_name)
    
    
#     for path in files:
#         image = load(path)
#         logits = model_input.predict(image)
#         probabilities = np.exp(logits) / np.sum(np.exp(logits))
#         print(f'{path}: Logits = {logits}, Probabilities = {probabilities}')
        
        

In [None]:
# test_image('/kaggle/input/testimage/Screenshot (12).png')

In [None]:
# test_image('/kaggle/input/testimage2/Screenshot (13).png')

In [None]:
# test_image('/kaggle/input/deepfake-and-real-images/Dataset/Test/Real/real_412.jpg')

In [None]:
# test_image('/kaggle/input/deepfake-and-real-images/Dataset/Test/Fake/fake_1004.jpg')


### Trying InceptionV3 with `deepfake-and-real-images` dataset only

In [None]:
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.inception_v3 import preprocess_input

def preprocess(image, label):
    image = preprocess_input(image)
    return image, label

train_preprocessed = train.map(preprocess)
val_preprocessed = val.map(preprocess)
test_preprocessed = test.map(preprocess)
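
One thing that may be worth verifying: InceptionV3's `preprocess_input` already maps pixel values from [0, 255] to [-1, 1]. If the preprocessed datasets are then fed into a model that starts with `rescale_and_resize`, the `Rescaling(1./255)` layer would shrink them a second time. The NumPy sketch below reproduces only the arithmetic of the two steps (the helper names are hypothetical) to show the resulting range.

```python
import numpy as np

def inception_style(x):
    """Same arithmetic as InceptionV3's preprocess_input: [0, 255] -> [-1, 1]."""
    return x / 127.5 - 1.0

def rescale(x):
    """Same arithmetic as Rescaling(1./255)."""
    return x / 255.0

pixels = np.array([0.0, 127.5, 255.0])
once = inception_style(pixels)   # range [-1, 1]
twice = rescale(once)            # range roughly [-0.004, 0.004]
print(once)
print(twice)
```

If this does apply, using only one of the two scaling steps for the pretrained models would keep the inputs in the range the ImageNet weights expect.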




In [None]:
base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(224,224,3))
base_model.trainable = False

model_inception = tf.keras.models.Sequential([
    tf.keras.Input(shape=(224,224,3)),
    
    rescale_and_resize,
    data_augmentation,

    base_model,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'), #there are 2 different classes 
])


In [None]:
model_inception.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['sparse_categorical_accuracy']
             )

In [None]:
history_real = model_inception.fit(
    train_preprocessed,
    validation_data=val_preprocessed,
    epochs=10
)

In [None]:
test_accuracy(model_inception, test_preprocessed)

In [None]:
produce_cm(model_inception, test_preprocessed)

We observe that the current model is overfitting. Hence we perform hyperparameter tuning.
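
One simple way to quantify overfitting is the per-epoch gap between training and validation accuracy, read from the `history` dictionary that `fit` returns. The sketch below assumes the metric name used in `compile` above. The numbers are hypothetical and are not results from this notebook.

```python
def generalisation_gap(history_dict, metric='sparse_categorical_accuracy'):
    """Per-epoch difference between the training and validation values of `metric`."""
    train_values = history_dict[metric]
    val_values = history_dict['val_' + metric]
    return [t - v for t, v in zip(train_values, val_values)]

# Hypothetical numbers for illustration only.
example_history = {
    'sparse_categorical_accuracy': [0.80, 0.90, 0.97],
    'val_sparse_categorical_accuracy': [0.78, 0.80, 0.79],
}
gaps = generalisation_gap(example_history)
print([round(g, 2) for g in gaps])
```

With a real run this would be called as `generalisation_gap(history_real.history)`. A gap that keeps growing while validation accuracy stalls is the usual sign of overfitting.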

First we decrease the learning rate of our model.


In [None]:
#HYPERPARAMETER TUNING CODE HERE

### Trying InceptionV3 with merged dataset

In [None]:
train_merged_preprocessed = train_merged.map(preprocess)
val_merged_preprocessed = val_merged.map(preprocess)
test_merged_preprocessed = test_merged.map(preprocess)

In [None]:
base_model2 = InceptionV3(weights='imagenet', include_top=False, input_shape=(224,224,3))
base_model2.trainable = False

model_inception2 = tf.keras.models.Sequential([
    tf.keras.Input(shape=(224,224,3)),
    
    rescale_and_resize,
    data_augmentation,

    base_model2,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'), #there are 2 different classes 
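])

# The model must be compiled before calling fit(); same settings as model_inception
model_inception2.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['sparse_categorical_accuracy']
             )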


In [None]:
history_inception2 = model_inception2.fit(
    train_merged_preprocessed,
    validation_data=val_merged_preprocessed,
    epochs=10,
)

In [None]:
test_accuracy(model_inception2, test_merged_preprocessed)

### Trying ResNet model

In [None]:
from tensorflow.keras.applications import ResNet50

base_model_resnet = ResNet50(weights='imagenet', include_top=False, input_shape=(224,224,3))

In [None]:
base_model_resnet.trainable = False

model_resnet = tf.keras.models.Sequential([
    tf.keras.Input(shape=(224,224,3)),
    
    rescale_and_resize,
    data_augmentation,

    base_model_resnet,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])


model_resnet.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy']
             )

In [None]:
history_resnet = model_resnet.fit(
    train_merged,
    validation_data=val_merged,
    epochs=10,
)

In [None]:
loss_resnet, accuracy_resnet = model_resnet.evaluate(test_merged)
print("Test accuracy:", accuracy_resnet)

From the accuracy of our model, we can conclude that ResNet is significantly less accurate than InceptionV3 (pre-hyperparameter tuning), hence we will continue using InceptionV3 instead.