# Fake pictures detector 

## Topic

In this project, I work with pictures of real people and fake pictures made with Photoshop. I use a dimensionality reduction technique, namely PCA, to reduce the dimension of the pictures, and then apply classification algorithms to them. The raw pictures are downloaded and converted into arrays, and only one color channel is kept, with the value at each pixel stored in an array. PCA is then applied twice: first to reduce each picture matrix to its 3 principal components, and then to reduce those 3 components to a single principal component, which retains about 50% of the variance of those 3 components.
Different classification algorithms are then applied to the reduced data to try to predict whether a given picture is real or fake.

## Objectives

- Reduce the dimension of the photos
- Apply classification algorithms to determine whether a picture is real or fake

## Summary

- Importing Libraries
- Quick look at the dataset
- Data pre-processing
- Logistic Regression
- Naive Bayes Classifier
- Support Vector Machines
- XGBoost Classifier
- K Nearest Neighbours
- Decision Tree Classifier

## Importing libraries

In [None]:
import numpy as np
from matplotlib.pyplot import imread
import matplotlib.pyplot as plt
import os
import matplotlib.image as mplib 
import pandas as pd
from sklearn.decomposition import PCA
from PIL import Image
import cv2
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost 
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

## The Dataset
The dataset consists of two folders: one with real pictures of different people and one with fake pictures made with Photoshop.

In [None]:
# get two lists containing the names of the files
list_real = os.listdir(r"C:\Users\imane\OneDrive\Desktop\eigenfaces\training_real")
list_fake = os.listdir(r"C:\Users\imane\OneDrive\Desktop\eigenfaces\training_fake")

In [None]:
print(len(list_real))

In [None]:
print(len(list_fake))

I have 1081 real pictures and 960 fake, photoshopped pictures.

In [None]:
# get paths to the image folders
path_real = r"C:\Users\imane\OneDrive\Desktop\eigenfaces\training_real"
path_fake = r"C:\Users\imane\OneDrive\Desktop\eigenfaces\training_fake"

In [None]:
# get a list of the paths to real images
pics_real = [os.path.join(path_real, i) for i in list_real]

In [None]:
# get a list of the paths to fake images
pics_fake = [os.path.join(path_fake, i) for i in list_fake]

In [None]:
# get a list of the paths to the real photos
paths_real = [os.path.join(path_real,i) for i in list_real]

In [None]:
# get a list of the paths to the fake photos
paths_fake = [os.path.join(path_fake,i) for i in list_fake]

In [None]:
arrays_real = []
for p in paths_real:
    img = Image.open(p)
    array = np.asarray(img)
    arrays_real.append(array)

In this part, I opened every picture in the real pictures folder, converted it into an array and stored all the arrays in a list.
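The loop above assumes that every file is a readable three-channel image of the same size. A minimal sketch of a more defensive loader is shown here; the helper name `load_images` and the `size` parameter are my own assumptions for illustration, not part of the notebook.

```python
import os

import numpy as np
from PIL import Image


def load_images(folder, size=(600, 600)):
    """Load every image in `folder` as an RGB uint8 array of shape (H, W, 3).

    `load_images` and `size` are illustrative names, not from the notebook.
    """
    arrays = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        with Image.open(path) as img:
            # Force three channels and a common size so all arrays match.
            rgb = img.convert("RGB").resize(size)
            arrays.append(np.asarray(rgb))
    return arrays
```

A call such as `load_images(path_real)` would then stand in for the explicit loop. Non-image files in the folder would still raise an error and would need separate handling.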

In [None]:
arrays_real

In [None]:
arrays_fake = []
for p in paths_fake:
    img = Image.open(p)
    array = np.asarray(img)
    arrays_fake.append(array)

I did the same for fake images.

In [None]:
arrays_fake

In [None]:
img = arrays_real[1]

In [None]:
# get the shape of images
img.shape

The images have shape `(600, 600, 3)`, meaning a height and width of 600 pixels and three color channels.

In [None]:
# split the picture into its 3 channels
# (Pillow loads images in RGB order, so the first channel is red)
red, green, blue = cv2.split(img)

Here I split the picture into its three channels: red, green and blue.
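One detail worth checking is the channel order: arrays loaded through Pillow are in RGB order, whereas images read with `cv2.imread` are in BGR order. The sketch below uses a made-up array (`fake_img` is illustrative data only) and plain NumPy indexing to pull out each channel explicitly.

```python
import numpy as np

# Synthetic 2x2 RGB image that is pure red everywhere (illustrative data only).
fake_img = np.zeros((2, 2, 3), dtype=np.uint8)
fake_img[..., 0] = 255  # channel 0 is red for an array loaded through Pillow

# Index the last axis instead of relying on an unpacking order.
red, green, blue = fake_img[..., 0], fake_img[..., 1], fake_img[..., 2]
print(red.max(), green.max(), blue.max())
```

Only the red channel is non-zero here, which confirms that index 0 holds red under the RGB convention.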

## Data Pre-processing

In [None]:
b_real = []
g_real = []
r_real = []
for a in arrays_real:
    # arrays loaded with Pillow are in RGB order
    red, green, blue = cv2.split(a)
    b_real.append(blue)
    g_real.append(green)
    r_real.append(red)

In [None]:
b_fake = []
g_fake = []
r_fake = []
for a in arrays_fake:
    # arrays loaded with Pillow are in RGB order
    red, green, blue = cv2.split(a)
    b_fake.append(blue)
    g_fake.append(green)
    r_fake.append(red)

In the above, I iterated through every picture array, split it into its three channels and stored each channel's data in a list.

In [None]:
b_real

In [None]:
g_fake

I then displayed the blue channel of the real pictures and the green channel of the fake pictures. Each list contains the pixel data of the corresponding channel.

In [None]:
# work on a copy so that r_real[0] is not scaled a second time in the loop further down
sample = r_real[0] / 255
sample

While experimenting with one picture, I first normalised the pixel data by dividing the array by 255.

In [None]:
pca_3 = PCA(n_components=3)
pca_3.fit_transform(sample)

I then applied PCA with 3 components to the normalised array.
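It may help to see how scikit-learn interprets a single channel: each of the 600 rows is treated as a sample and each of the 600 columns as a feature. The sketch below uses a random matrix as a stand-in for a normalised channel (the data is synthetic) and prints the shapes involved.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
channel = rng.random((600, 600))  # stand-in for one normalised colour channel

pca = PCA(n_components=3)
scores = pca.fit_transform(channel)

print(scores.shape)           # one row of 3 values per image row
print(pca.components_.shape)  # three direction vectors over the 600 columns
```

The notebook keeps `components_` (3 rows of length 600) rather than the transformed scores, so each picture is summarised by direction vectors over its columns.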

In [None]:
# ratio of variance: how significant each principal component is
print(pca_3.explained_variance_ratio_)

Here I printed the explained variance ratio of the first 3 components. The first principal component holds only 0.24 of the variance in the data.
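The explained variance ratio is tied to the singular values of the centred data: each ratio is a squared singular value divided by the sum of all squared singular values. A small sketch on synthetic data (the matrix `data` is made up, and `svd_solver="full"` is my choice so that the comparison is exact) recomputes the ratios by hand.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
data = rng.random((50, 20))  # synthetic stand-in for a channel matrix

pca = PCA(n_components=3, svd_solver="full").fit(data)

# Singular values of the centred data give the variance along each direction.
centred = data - data.mean(axis=0)
s = np.linalg.svd(centred, compute_uv=False)
ratio_by_hand = s[:3] ** 2 / np.sum(s ** 2)

print(np.allclose(ratio_by_hand, pca.explained_variance_ratio_))
```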

In [None]:
print(pca_3.singular_values_)

In [None]:
# Get the principal components (eigenvectors)
pca_3.components_

In [None]:
pca_1 = PCA(n_components = 1)
pca_1.fit_transform(pca_3.components_)
pca_1.explained_variance_ratio_

Then I applied a second PCA to my 3 principal components to try to reduce the picture to a single array. The single remaining principal component holds 50% of the variance of those 3 components (not of the original picture), which is not bad considering how much dimensionality reduction has been done.
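The 50% figure can be checked on synthetic data. The sketch below repeats the two PCA passes on random matrices standing in for picture channels (the data and the use of `svd_solver="full"` are my assumptions, chosen for reproducibility). Because the three rows of `components_` are orthonormal, the second PCA should report a ratio of 0.5 whatever the input, which suggests the figure reflects the procedure rather than the picture.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ratios = []
for _ in range(3):
    channel = rng.random((200, 200))  # random stand-in for a picture channel
    # First pass: keep the 3 orthonormal principal directions.
    rows = PCA(n_components=3, svd_solver="full").fit(channel).components_
    # Second pass: PCA on those 3 rows, keeping a single component.
    second = PCA(n_components=1, svd_solver="full").fit(rows)
    ratios.append(float(second.explained_variance_ratio_[0]))

print(ratios)
```

Since the two leading singular values tie in this setting, the direction returned by the second PCA is not uniquely determined, which may be worth keeping in mind when interpreting the resulting features.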

In [None]:
pca_3 = PCA(n_components = 3)
pca_1 = PCA(n_components = 1)

Here I created two PCA objects, one with 3 principal components and the second with 1.

In [None]:
features_real = []
for a in r_real:
    a = a/255
    pca_3.fit_transform(a)    
    pca_1.fit_transform(pca_3.components_)    
    features_real.append(pca_1.components_)

Using only the red channel, I normalised every picture array, fitted a PCA on it and kept its three components, then applied PCA again to keep only one principal component.
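The per-picture steps can also be wrapped in a function. This is a sketch only: `channel_to_feature` is a hypothetical helper name, the input is synthetic, and `svd_solver="full"` is my choice to make the result deterministic.

```python
import numpy as np
from sklearn.decomposition import PCA


def channel_to_feature(channel):
    """Reduce one 2-D uint8 channel to a 1-D vector via two PCA passes.

    Hypothetical helper mirroring the loop above; a fresh PCA is created on
    each call so no fitted state is shared between pictures.
    """
    scaled = channel / 255.0
    first = PCA(n_components=3, svd_solver="full").fit(scaled)
    second = PCA(n_components=1, svd_solver="full").fit(first.components_)
    return second.components_.ravel()


rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(600, 600), dtype=np.uint8)  # synthetic channel
feature = channel_to_feature(demo)
print(feature.shape)
```

With such a helper, the feature table could be built with a list comprehension over `r_real`, and the result would already be one-dimensional per picture, so no `np.squeeze` would be needed.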

In [None]:
features_real

In [None]:
# Transform the features list into an array
features_real = np.array(features_real)

In [None]:
# get the shape of the array
features_real.shape

In [None]:
# keep only two dimensions in the array
features_real = np.squeeze(features_real)

In [None]:
features_real

In [None]:
# Transform the array into a pandas dataframe
df_real = pd.DataFrame(features_real)

In [None]:
# Adding the target column, 1 for real
df_real["target"] = 1

In [None]:
df_real

In the above, I turned every picture into a single row and stored the rows in a pandas dataframe, then added the target column, which is 1 for real pictures and 0 for fake pictures. Below I did the same for the fake pictures.

In [None]:
features_fake = []
for a in r_fake:
    a = a/255
    pca_3.fit_transform(a)    
    pca_1.fit_transform(pca_3.components_)    
    features_fake.append(pca_1.components_)

In [None]:
features_fake = np.array(features_fake)
features_fake = np.squeeze(features_fake)
df_fake = pd.DataFrame(features_fake)
df_fake["target"] = 0
df_fake

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df_real, df_fake])

I added the real pictures dataframe to the fake pictures dataframe.

In [None]:
df = df.sample(frac = 1)

Then I shuffled the rows so that those with target 1 are mixed with the rows with target 0.
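The combine-and-shuffle step can be tried on a toy example. The two small tables below are made up to stand in for `df_real` and `df_fake`; passing `random_state` to `sample` is an addition of mine that makes the shuffle reproducible.

```python
import numpy as np
import pandas as pd

# Tiny synthetic feature tables standing in for df_real and df_fake.
demo_real = pd.DataFrame(np.ones((3, 4)))
demo_real["target"] = 1
demo_fake = pd.DataFrame(np.zeros((2, 4)))
demo_fake["target"] = 0

# Stack the two tables, then shuffle the rows with a fixed seed.
combined = pd.concat([demo_real, demo_fake], ignore_index=True)
shuffled = combined.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled["target"].value_counts().to_dict())
```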

In [None]:
df

In [None]:
# Separate the target from the dataset
target = df['target'].copy()
target

In [None]:
df = df.drop(columns = ["target"])

In [None]:
# split the dataset for training and testing samples
X_train, X_test, y_train, y_test = train_test_split( df, target, test_size=0.2, random_state=42)

In [None]:
cat_train = pd.DataFrame(y_train)
cat_train["count"] = 1
cat_train = cat_train.groupby("target").sum().reset_index()
x = cat_train["target"]
y = cat_train["count"]
plt.bar(x, y)

Here I just wanted to make sure that the number of real photos in the training set is not too far from the number of fake photos.
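Another way to keep the classes balanced across the split is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels with the class counts reported earlier (1081 real, 960 fake) and placeholder features; it is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts of the dataset (1081 real, 960 fake).
y = np.array([1] * 1081 + [0] * 960)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())  # share of real pictures in each split
```

With stratification, the share of real pictures is nearly identical in the training and test sets.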

## Logistic Regression

In [None]:
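grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}# l1 lasso l2 ridge
# liblinear supports both penalties; the default lbfgs solver does not support l1
logreg = LogisticRegression(solver="liblinear")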
logreg_cv = GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(X_train,y_train)

Running the grid search to find the best parameters for the baseline model.
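Two details of this search are worth sketching. The `l1` penalty needs a solver that supports it, such as `liblinear` or `saga`, and the fitted search object already exposes the refitted best model through `best_estimator_`, so the best parameters do not have to be copied by hand. The example runs on synthetic data from `make_classification`, which is a stand-in for `X_train` and `y_train`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_demo, y_demo)

best_lr = search.best_estimator_  # already refit on all the data passed to fit
print(search.best_params_)
```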

In [None]:
# Get best parameters
logreg_cv.best_params_

In [None]:
# Fit the data
lr = LogisticRegression(C= 0.001, penalty='l2')
lr.fit(X_train, y_train)

In [None]:
# accuracy on training
lr.score(X_train, y_train)

In [None]:
lr_pred = lr.predict(X_test)

In [None]:
# Accuracy on test set
acc_lr = lr.score(X_test, y_test)
acc_lr

In [None]:
f1_lr = f1_score(y_test, lr_pred)
f1_lr

In [None]:
cr_lr = classification_report(y_test,lr_pred )
print(cr_lr)

The classification report shows that my baseline model does a poor job, especially when it comes to detecting fake photos. The accuracy of the model didn't change much between training and testing.

In [None]:
cm_lr = confusion_matrix(y_test,lr_pred)
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm_lr)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm_lr[i, j], ha='center', va='center', color='red')

Here the model predicted basically everything as real.
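A toy example shows why such a model can still look acceptable on some metrics. The labels below are made up (6 real, 4 fake) and the predictions are all "real". By default `f1_score` reports the F1 of the positive class only, while `average="macro"` averages over both classes.

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels: 6 real (1) and 4 fake (0); the model predicts all real.
y_true = [1] * 6 + [0] * 4
y_pred = [1] * 10

demo_acc = accuracy_score(y_true, y_pred)
demo_f1_real = f1_score(y_true, y_pred)  # default: F1 of the positive class (1)
demo_f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(demo_acc, demo_f1_real, demo_f1_macro)
```

The F1 for the real class comes out well above the accuracy, while the macro-averaged F1 is much lower because the fake class is never predicted.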

## Naive Bayes Classifier

In [None]:
nb_classifier = GaussianNB()

params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
gs_NB = GridSearchCV(estimator=nb_classifier, 
                 param_grid=params_NB, 
                 verbose=1, 
                 scoring='accuracy') 
gs_NB.fit(X_train, y_train)

gs_NB.best_params_

Performing a quick grid search to determine the best parameters.

In [None]:
nb = GaussianNB(var_smoothing= 1.0)
nb.fit(X_train, y_train)
nb.score(X_train, y_train)

In [None]:
nb_pred = nb.predict(X_test)

In [None]:
acc_nb = nb.score(X_test, y_test)
acc_nb

In [None]:
f1_nb = f1_score(y_test, nb_pred)
f1_nb

In [None]:
cr_nb = classification_report(y_test,nb_pred )
print(cr_nb)

The results of Naive Bayes are more balanced than those of logistic regression. Although the accuracy dropped a bit with NB, its F1 score, i.e. the harmonic mean of precision and recall, is more balanced between the two classes.

In [None]:
cm_nb = confusion_matrix(y_test,nb_pred)
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm_nb)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm_nb[i, j], ha='center', va='center', color='red')

The confusion matrix shows better performance at detecting real photos than fake ones.
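The same confusion-matrix plot is drawn for every model, so it could be factored into a small function. The sketch below is one way to do that; `plot_confusion` is a hypothetical helper name and the matrix passed to it is made up.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_confusion(cm, title=""):
    """Draw a 2x2 confusion matrix with its counts (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.imshow(cm)
    ax.grid(False)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["Predicted 0s", "Predicted 1s"])
    ax.set_yticks([0, 1])
    ax.set_yticklabels(["Actual 0s", "Actual 1s"])
    ax.set_title(title)
    # Write each count in the middle of its cell.
    for i in range(2):
        for j in range(2):
            ax.text(j, i, cm[i, j], ha="center", va="center", color="red")
    return ax


demo_ax = plot_confusion(np.array([[10, 5], [3, 20]]), title="Demo")
```

Each model section could then call something like `plot_confusion(cm_nb, "Naive Bayes")` instead of repeating the plotting code.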

## Support Vector Machines

In [None]:
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
 
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)
 
# fitting the model for grid search
grid.fit(X_train, y_train)

Here I ran a grid search for the parameters.

In [None]:
grid.best_params_

In [None]:
svc = SVC(C= 0.1, gamma = 1, kernel= 'rbf')

In [None]:
svc.fit(X_train, y_train)

In [None]:
svc.score(X_train, y_train)

In [None]:
svc_pred = svc.predict(X_test)

In [None]:
acc_svc = svc.score(X_test, y_test)
acc_svc

In [None]:
f1_svc = f1_score(y_test, svc_pred)
f1_svc

In [None]:
cr_svc = classification_report(y_test,svc_pred )
print(cr_svc)

The performance of SVC is poor and similar to that of logistic regression. This is not surprising, because I have 600 features, which can be tough for an SVC.
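Another possible factor, which is an assumption on my part and not something tested on this dataset, is feature scale. The features are entries of unit-length vectors, so they are small in magnitude, and an RBF kernel is sensitive to scale. A minimal sketch of a search that standardises the features inside a pipeline, run on synthetic data, could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 600-column feature table.
X_demo, y_demo = make_classification(n_samples=150, n_features=30, random_state=0)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The `svc__` prefix follows the step name that `make_pipeline` derives from the class name.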

In [None]:
cm_svc = confusion_matrix(y_test,svc_pred)
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm_svc)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm_svc[i, j], ha='center', va='center', color='red')

## XGBoost

In [None]:
# verbosity and n_jobs replace the deprecated silent and nthread arguments
xgb = XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic',
                    verbosity=0, n_jobs=1)

In [None]:
xgb.fit(X_train, y_train)

In [None]:
xgb.score(X_train, y_train)

In [None]:
xgb_pred = xgb.predict(X_test)

In [None]:
acc_xgb = xgb.score(X_test, y_test)

In [None]:
f1_xgb = f1_score(y_test, xgb_pred)
f1_xgb

In [None]:
cr_xgb = classification_report(y_test,xgb_pred )
print(cr_xgb)

The accuracy of XGB is 1 on the training data, which means that the algorithm has most probably overfitted the training data; its accuracy on the test set is much more reasonable. Its F1 score is average, and slightly better for detecting real pictures.

In [None]:
cm_xgb = confusion_matrix(y_test,xgb_pred)
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm_xgb)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm_xgb[i, j], ha='center', va='center', color='red')

## K Nearest Neighbours

In [None]:
knn = KNeighborsClassifier()

In [None]:
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

In [None]:
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False,verbose=1)

In [None]:
grid_search = grid.fit(X_train, y_train)

In [None]:
grid_search.best_params_

The grid search has determined 20 neighbours to be the best parameter.

In [None]:
knn = KNeighborsClassifier(n_neighbors = 20)

In [None]:
knn.fit(X_train, y_train)

In [None]:
knn.score(X_train, y_train)

In [None]:
knn_pred = knn.predict(X_test)

In [None]:
acc_knn = knn.score(X_test, y_test)

In [None]:
f1_knn = f1_score(y_test, knn_pred)
f1_knn

In [None]:
cr_knn = classification_report(y_test,knn_pred )
print(cr_knn)

The accuracy of KNN is average, as is its F1 score, although it did slightly better at detecting real pictures.

In [None]:
cm_knn = confusion_matrix(y_test,knn_pred)
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm_knn)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm_knn[i, j], ha='center', va='center', color='red')

## Decision Tree Classifier

In [None]:
dec_tree = DecisionTreeClassifier()

In [None]:
params = {"criterion": ["gini", "entropy"],
       "max_depth": range(5,10),
       "min_samples_split": range(40,50),
       "min_samples_leaf": range(80,90)}

In [None]:
grid = GridSearchCV(dec_tree, param_grid = params)

In [None]:
grid.fit(X_train, y_train)

In [None]:
grid.best_params_

In [None]:
tree = DecisionTreeClassifier(criterion = "entropy", max_depth = 5,min_samples_leaf = 80, min_samples_split = 41)

In [None]:
tree.fit(X_train, y_train)

In [None]:
tree.score(X_train, y_train)

In [None]:
tree_pred = tree.predict(X_test)

In [None]:
acc_tree = tree.score(X_test, y_test)
acc_tree

In [None]:
f1_tree = f1_score(y_test, tree_pred)
f1_tree

In [None]:
cr_tree = classification_report(y_test,tree_pred )
print(cr_tree)

The combined F1 score for both classes is higher than that of the other algorithms, although the Decision Tree also did better at detecting real pictures than fake ones. The model didn't overfit, and its accuracy on the test set is average.

In [None]:
cm_tree = confusion_matrix(y_test,tree_pred)
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm_tree)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm_tree[i, j], ha='center', va='center', color='red')

The Decision Tree also did better at detecting real images.

## Model comparison

In [None]:
acc = [acc_lr, acc_nb, acc_svc, acc_xgb, acc_knn, acc_tree]
f1 = [f1_lr, f1_nb, f1_svc, f1_xgb, f1_knn, f1_tree]
models = ["LR", "NB", "SVC", "XGB", "KNN", "DT"]

In [None]:
X_axis = np.arange(len(models))
plt.figure(figsize=(8, 8))  
plt.bar(X_axis - 0.2, acc , 0.4, label = 'Accuracy')
plt.bar(X_axis + 0.2, f1, 0.4, label = 'F1 score')
  
plt.xticks(X_axis, models)
plt.xlabel("Models")
plt.title("Performance of Models")
plt.legend()

The above plot shows the performance of the different algorithms used. At first glance, it seems that logistic regression and SVC are doing the best job, but that's only because these two models predict fairly well on real images, while their predictions on fake pictures are extremely weak. Since `f1_score` by default reports the F1 of the positive class (label 1, real pictures), a model that labels almost everything as real still gets a high F1 here. The Decision Tree looks like the next model to consider: its accuracy is about average, and its average F1 score is driven mostly by the F1 score for real pictures.
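To make the comparison less dependent on the real class, the metrics could be collected in one table that also includes a macro-averaged F1. The sketch below runs on synthetic data with two of the models; the names `demo_models` and `summary` are illustrative, and the numbers it prints say nothing about the picture dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_models = {
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
}
rows = []
for name, model in demo_models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        "f1_real": f1_score(y_te, pred),                    # class 1 only
        "f1_macro": f1_score(y_te, pred, average="macro"),  # both classes
    })
summary = pd.DataFrame(rows).set_index("model")
print(summary)
```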

## Conclusion

In this notebook I tried to build models capable of detecting fake pictures. First, I reduced the dimensionality of the images using PCA, then I tried different classification algorithms. The results of my models range from average to poor (LR and SVC), and I can suggest two possible reasons:
- The pre-processing method I used: the PCA, which kept only half of the information, may have discarded much of what the models need to differentiate between the two classes.
- Classical machine learning algorithms may not handle this particular problem well, and neural networks should be given a shot.
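For the first point, one alternative that could be tried is the standard eigenfaces-style approach: flatten each picture into one row and fit a single PCA across all pictures, rather than one PCA per picture. This is not what the notebook does. The sketch below uses small random arrays as stand-ins for the pictures, and the number of components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 40 synthetic 32x32 single-channel "pictures" standing in for the real data.
pictures = rng.random((40, 32, 32))

flat = pictures.reshape(len(pictures), -1)  # one row of 1024 pixels per picture
pca = PCA(n_components=10, svd_solver="full").fit(flat)
features = pca.transform(flat)              # 10 features per picture

print(features.shape, round(float(pca.explained_variance_ratio_.sum()), 3))
```

In a real experiment the PCA would be fitted on the training pictures only and then applied to the test pictures, so that no test information leaks into the features.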