# Identification of Deepfaked Images (and Videos?)
## By Li Run & Rongyi

This project aims to train a model that distinguishes deepfaked images from real ones with an accuracy of >=XX%.

## Data Collection

We utilized two datasets of images for training our model:
1. https://www.kaggle.com/datasets/manjilkarki/deepfake-and-real-images
2. https://www.kaggle.com/datasets/dagnelies/deepfake-faces

The first dataset contains approximately 70,000 training images, 20,000 validation images and 5,400 test images of faces for each of the two classes (Real and Fake).

The second dataset contains approximately 95,600 images of faces. The real/fake label of each image is recorded in `metadata.csv`.

## Data Preprocessing

Let us first inspect the contents of `deepfake_faces`.

In [8]:
# Imports
import pandas as pd
import numpy as np
import tensorflow as tf
from PIL import Image
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import random
from sklearn.utils.class_weight import compute_class_weight
import os, os.path, shutil


In [9]:
df = pd.read_csv('/kaggle/input/deepfake-faces/metadata.csv')
df.head()

| Unnamed: 0 | videoname | original_width | original_height | label | original |
|---|---|---|---|---|---|
| 0 | aznyksihgl.mp4 | 129 | 129 | FAKE | xnojggkrxt.mp4 |
| 1 | gkwmalrvcj.mp4 | 129 | 129 | FAKE | hqqmtxvbjj.mp4 |
| 2 | lxnqzocgaq.mp4 | 223 | 217 | FAKE | xjzkfqddyk.mp4 |
| 3 | itsbtrrelv.mp4 | 186 | 186 | FAKE | kqvepwqxfe.mp4 |
| 4 | ddvgrczjno.mp4 | 155 | 155 | FAKE | pluadmqqta.mp4 |


In [10]:
df[df.videoname == 'aaagqkcdis.mp4']

| Unnamed: 0 | videoname | original_width | original_height | label | original |
|---|---|---|---|---|---|
| 18722 | aaagqkcdis.mp4 | 90 | 89 | FAKE | eklsrnkwog.mp4 |


We took the name of the first image, `aaagqkcdis.jpg`, and looked up the matching `aaagqkcdis.mp4` entry in `metadata.csv`. This confirms that the image names correspond to `videoname` entries in the CSV file, which allows us to label the images ourselves.
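
The same check can be run over the whole folder rather than a single file. The following is a minimal sketch, assuming every face image is named after its source video with only the extension swapped (`<id>.jpg` for `<id>.mp4`), which is what the single lookup above suggests. The helper names `video_name_for` and `unmatched_images` are hypothetical and not part of the notebook.

```python
import os


def video_name_for(image_name):
    """Map an image file name such as 'abc.jpg' to its video name 'abc.mp4'."""
    stem, _ext = os.path.splitext(image_name)
    return stem + ".mp4"


def unmatched_images(image_names, video_names):
    """Return the image names that have no corresponding metadata entry."""
    known = set(video_names)  # set membership keeps each lookup cheap
    return [name for name in image_names if video_name_for(name) not in known]


# Toy inputs (assumption); in the notebook these would come from
# os.listdir(...) and df['videoname'].
example_images = ["aaagqkcdis.jpg", "zrivvmjwai.jpg", "missing.jpg"]
example_videos = ["aaagqkcdis.mp4", "zrivvmjwai.mp4"]
print(unmatched_images(example_images, example_videos))  # ['missing.jpg']
```

An empty result on the real data would mean every image can be labelled from `metadata.csv`.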

In [11]:
df['label'].value_counts()

label
FAKE    79341
REAL    16293
Name: count, dtype: int64

The Fake:Real ratio in `deepfake_faces` is about 5:1 (79,341 to 16,293). We handle this class imbalance by undersampling: we take a random sample of 16,000 images from each of the Fake and Real classes instead of using the full dataset.
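
One possible way to draw such a balanced sample from the metadata is sketched below. It is not necessarily how the final notebook will do it: `balanced_sample` is a hypothetical helper, the seed is arbitrary, and the toy frame only stands in for `metadata.csv`.

```python
import pandas as pd


def balanced_sample(metadata, n_per_class, label_col="label", seed=42):
    """Draw n_per_class rows from every class, without replacement."""
    return (
        metadata.groupby(label_col, group_keys=False)
        .sample(n=n_per_class, random_state=seed)
        .reset_index(drop=True)
    )


# Synthetic stand-in for metadata.csv with a similar 5:1 imbalance (assumption).
toy = pd.DataFrame({
    "videoname": [f"video_{i}.mp4" for i in range(60)],
    "label": ["FAKE"] * 50 + ["REAL"] * 10,
})
balanced = balanced_sample(toy, n_per_class=8)
print(balanced["label"].value_counts().to_dict())
```

On the real metadata the call would be `balanced_sample(df, 16000)`. Sampling without replacement works here because the smaller class (REAL) has 16,293 rows.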


In [12]:
folder_path = '/kaggle/input/deepfake-faces/faces_224'
# /kaggle/input is read-only, so the sorted copies go under /kaggle/working
# (this output location is our own choice, not part of the dataset).
output_path = '/kaggle/working/faces_224'

images = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
print(images[0])

# One subfolder per class, named like the classes of the first dataset
for class_name in ('Fake', 'Real'):
    os.makedirs(os.path.join(output_path, class_name), exist_ok=True)

# Build the videoname -> label lookup once instead of filtering df per image
labels = dict(zip(df['videoname'], df['label']))

for image in images:
    img_name = os.path.splitext(image)[0]
    label = labels.get(img_name + '.mp4')
    if label is None:
        continue  # skip images without a metadata entry
    class_name = 'Fake' if label == 'FAKE' else 'Real'
    # Keep the file extension on both the source and the destination path
    shutil.copy(os.path.join(folder_path, image), os.path.join(output_path, class_name, image))

zrivvmjwai.jpg


Next, we will organize the data into their appropriate categories before splitting them into training and test/validation data.

We first load the images from `deepfake_and_real_images` using its existing training/validation/test folders, since those have already been organised for us, and then add the images from `deepfake_faces`.

In [13]:
train = tf.keras.utils.image_dataset_from_directory('/kaggle/input/deepfake-and-real-images/Dataset/Train', labels = 'inferred', image_size=(224,224),)
val = tf.keras.utils.image_dataset_from_directory('/kaggle/input/deepfake-and-real-images/Dataset/Validation', labels = 'inferred', image_size=(224,224),)
test =  tf.keras.utils.image_dataset_from_directory('/kaggle/input/deepfake-and-real-images/Dataset/Test', labels = 'inferred', image_size=(224,224),)

print(train.class_names)

Found 140002 files belonging to 2 classes.
Found 39428 files belonging to 2 classes.
Found 10905 files belonging to 2 classes.
['Fake', 'Real']


In [14]:
# TODO:
# - separate the contents of deepfake-faces/faces_224 into two subfolders named 'Fake' and 'Real'
# - take a sample of 16,000 images from each class
# - convert to an image dataset using the same command as above
# - merge the datasets, e.g. combined_dataset = dataset_1.concatenate(dataset_2) (drafted, not yet tested)

In [15]:
deepfake_faces = tf.keras.utils.image_dataset_from_directory('/kaggle/input/deepfake-faces/faces_224', image_size=(224,224),)

print(deepfake_faces.class_names)

Found 0 files belonging to 0 classes.


ValueError: No images found in directory /kaggle/input/deepfake-faces/faces_224. Allowed formats: ('.bmp', '.gif', '.jpeg', '.jpg', '.png')

The call fails because `image_dataset_from_directory` infers labels from class subdirectories and ignores image files that sit directly in the root directory. `/kaggle/input/deepfake-faces/faces_224` holds all of its images at the top level, so no classes are found. The images first have to be sorted into `Fake` and `Real` subfolders in a writable location (`/kaggle/input` is read-only), and the call has to point at that location.

In [None]:
# deepfake_faces = deepfake_faces.shuffle()

# 80% / 20% -> (train + test) / val
train_and_test2, val2 = tf.keras.utils.split_dataset(deepfake_faces, left_size=0.8, shuffle=True)
# 87.5% / 12.5% of that 80% -> 70% train / 10% test overall
train2, test2 = tf.keras.utils.split_dataset(train_and_test2, left_size=0.875, shuffle=True)

train_merged = train.concatenate(train2)
val_merged = val.concatenate(val2)
test_merged = test.concatenate(test2)
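
The two `left_size` values are chosen so that the nested split gives 70% / 10% / 20% overall (0.8 × 0.875 = 0.7 and 0.8 × 0.125 = 0.1). The sketch below checks that arithmetic on a plain list of indices. `two_stage_split` is a hypothetical helper, not part of Keras, and it assumes the left part is simply the first `int(left_size * n)` elements with no shuffling.

```python
def two_stage_split(items, first_left=0.8, second_left=0.875):
    """Split items into (train, test, val) with two successive left/right cuts."""
    cut1 = int(len(items) * first_left)  # (train + test) vs. val
    train_and_test, val = items[:cut1], items[cut1:]
    cut2 = int(len(train_and_test) * second_left)  # train vs. test
    train, test = train_and_test[:cut2], train_and_test[cut2:]
    return train, test, val


items = list(range(32000))  # e.g. 16,000 Fake + 16,000 Real images
train_part, test_part, val_part = two_stage_split(items)
print(len(train_part), len(test_part), len(val_part))  # 22400 3200 6400
```

With 32,000 sampled images this corresponds to 22,400 training, 3,200 test and 6,400 validation images from `deepfake_faces`.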


We are now done with processing our datasets.

## Training of Model

Open question: do we start with transfer learning, or try our own architecture first?