# Lab Two: Exploring Image Data

#### *Harrison Noble, Henry Lambson*


## 1. Business Understanding

### 1.1 Overview of the Dataset

The dataset we selected contains photographs of faces that are either real or have been edited. The edited photos are categorized as easy, mid, or hard according to how difficult it is to tell that they are fake. The dataset description makes it clear that these groupings are subjective, so we will not use them as explicit categories. Instead, we focus only on whether an image is real or fake. The dataset contains 960 fake images and 1081 real images.

### 1.2 Purpose of the Data

This data was gathered in 2019. According to reference 1, its purpose is to train an algorithm to distinguish fake or edited images of faces from real, unedited ones.

### 1.3 Prediction Task for Dataset

The prediction task is to determine whether a photo on social media has been edited or doctored in some way. This could be useful for finding and removing bot or fake accounts.

### 1.4 Data Importance

This data is important because it can help stop the spread of misleading images, such as fake profile photos used by scammers or deepfake images that incriminate or slander someone. Social media has become a large part of society, so an algorithm that helps detect doctored photos could be useful for moderation, whether by flagging images detected as fake or by banning the accounts that use them. Because the dataset consists only of faces, face images are the only kind the algorithm could be expected to detect.

### 1.5 Prediction Algorithm Performance to be Considered Useful

We believe that for this algorithm to be useful, it must be accurate enough to prevent unjust bans or flags on social media images. To that end, we want to minimize the number of false positives (real images flagged as fake) so that regular users are not affected. Given the number of images posted to social media each day, we would want the algorithm to have a success rate (overall accuracy) somewhere in the range of 95-98%. Rather than banning users outright, the algorithm would flag an account for manual review if multiple infractions are detected.
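To make the difference between overall accuracy and the false-positive rate concrete, here is a minimal sketch. It assumes the label convention from the loading code in Section 2.1 (0 = fake, 1 = real) and treats a real image flagged as fake as a false positive. The function name `flag_metrics` and the label lists are hypothetical and made up for illustration; they are not results from this dataset.

```python
import numpy as np

def flag_metrics(y_true, y_pred):
    """Return (accuracy, false-positive rate, precision) for 'fake' flags.

    Labels: 0 = fake, 1 = real. A false positive is a real image
    that the classifier wrongly flags as fake.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 0) & (y_pred == 0))  # fake, correctly flagged
    fp = np.sum((y_true == 1) & (y_pred == 0))  # real, wrongly flagged
    tn = np.sum((y_true == 1) & (y_pred == 1))  # real, correctly passed
    accuracy = (tp + tn) / len(y_true)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, fpr, precision

# Made-up labels, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
accuracy, fpr, precision = flag_metrics(y_true, y_pred)
print(accuracy, fpr, precision)
```

A classifier can reach a high accuracy while still flagging a noticeable share of real users, so the false-positive rate is worth tracking alongside the 95-98% target.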

## 2. Data Preparation

### 2.1 Read in and Preprocess Data


In [1]:
import os
import zipfile
import glob
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt

#unzip data (only do once, uncomment to unzip data)
# with zipfile.ZipFile('dataset.zip', 'r') as zipf:
#     zipf.extractall('./data')

fake_face_path = 'data/real_and_fake_face/training_fake/*'
real_face_path = 'data/real_and_fake_face/training_real/*'

def load_data(path):
    """Load every image matching `path` as a flattened 200x200 grayscale row."""
    files = glob.glob(path)

    img_list = []
    f_name = []
    for file in files:
        #open image, resize to 200x200, convert to grayscale
        with Image.open(file) as img:
            img = img.resize((200, 200)).convert('L')
        #convert image to numpy array and flatten to 40,000 values
        img_list.append(np.asarray(img).flatten())
        #keep the file name (without its directory) alongside the pixel data
        f_name.append(os.path.basename(file))

    return np.asarray(img_list), f_name

#load fake images (label convention: 0 signifies fake)
fake_list, f_names_fake = load_data(fake_face_path)
#load real images (label convention: 1 signifies real)
real_list, f_names_real = load_data(real_face_path)

print('Shape of fake images array:', fake_list.shape)
print('Shape of real images array:', real_list.shape)

data = np.concatenate((fake_list, real_list), axis=0)
names = f_names_fake + f_names_real

print('Shape of combined images array:', data.shape)
print(len(names))
#TODO

Shape of fake images array: (0,)
Shape of real images array: (0,)
Shape of combined images array: (0,)
0
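The `(0,)` shapes above mean that `glob.glob` matched no files, so `np.asarray` received an empty list. With the images in place, each array would have one row per image and 40,000 columns (200 × 200). A minimal sketch for checking the paths before loading is shown below; the helper name `count_matches` is hypothetical, and the patterns are the ones used in the loading cell.

```python
import glob
import os

def count_matches(pattern):
    """Return how many files a glob pattern matches; print a hint if none."""
    matches = glob.glob(pattern)
    if not matches:
        parent = os.path.dirname(pattern)
        print(f'No files matched {pattern!r}; '
              f'directory exists: {os.path.isdir(parent)}; '
              f'working directory: {os.getcwd()}')
    return len(matches)

n_fake = count_matches('data/real_and_fake_face/training_fake/*')
n_real = count_matches('data/real_and_fake_face/training_real/*')
print('Fake files found:', n_fake)
print('Real files found:', n_real)
```

If the archive was extracted correctly, the counts should presumably match the 960 fake and 1081 real images described in Section 1.1.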


### 2.3 Visualize Images

[['data/real_and_fake_face/training_fake/mid_243_0011', 'jpg'], ['data/real_and_fake_face/training_fake/hard_23_1110', 'jpg'], ['data/real_and_fake_face/training_fake/hard_185_1100', 'jpg'], ['data/real_and_fake_face/training_fake/mid_326_1111', 'jpg'], ['data/real_and_fake_face/training_fake/mid_343_1111', 'jpg']]


## 3. Data Reduction

### 3.1 Linear Dimensionality Reduction Using Principal Components Analysis

In [None]:
#Helper function taken from your "04. Dimension Reduction and Images" notebook
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Scatter, Layout
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })

In [None]:
from sklearn.decomposition import PCA
#TODO
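As a placeholder illustration of this step, the sketch below fits scikit-learn's `PCA` with the exact (`'full'`) SVD solver. It runs on a synthetic random matrix rather than the face images, and the matrix size and `n_components = 20` are assumptions chosen only to keep the example fast, not values from this lab.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the flattened image matrix (rows = images).
# The real data would have 40,000 columns; 400 keeps this sketch fast.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 400))

n_components = 20  # assumed value, not taken from the notebook
pca = PCA(n_components=n_components, svd_solver='full')
X_reduced = pca.fit_transform(X)

# Cumulative share of variance captured by the kept components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print('Reduced shape:', X_reduced.shape)
print('Cumulative explained variance:', cum_var[-1])
```

The fitted `pca` object exposes `explained_variance_ratio_`, which is the attribute the `plot_explained_variance` helper reads.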

### 3.2 Linear Dimensionality Reduction Using Randomized Principal Components Analysis

In [None]:
#TODO
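One way this step could look is sketched below, assuming scikit-learn's `PCA` with `svd_solver='randomized'`, which approximates the leading components instead of computing a full decomposition. The synthetic matrix, the component count, and the `random_state` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the flattened image matrix (rows = images)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 400))

# Randomized solver; random_state fixes the random projections for repeatability
rpca = PCA(n_components=20, svd_solver='randomized', random_state=0)
X_rand = rpca.fit_transform(X)

print('Reduced shape:', X_rand.shape)
print('Explained variance ratio of first component:',
      rpca.explained_variance_ratio_[0])
```

Because the result is an approximation, its explained variance ratios may differ slightly from those of the exact solver on the same data.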

### 3.3 Comparing PCA and Randomized PCA

In [None]:
#TODO

### 3.4 Feature Extraction

In [None]:
#TODO

### 3.5 Analysis of Feature Extraction for Prediction Task

In [None]:
#TODO

## 4. Exceptional Work: ....

In [None]:
#TODO

## References

1. Real and Fake Face Dataset. https://www.kaggle.com/ciplab/real-and-fake-face-detection