# Lab 7 - Convolutional Neural Networks

### Eric Smith and Jake Carlson

## Introduction
In this lab, we will develop a multi-layer perceptron to perform classification on the CIFAR-10 data set. Similar to Lab 3, we will subset the data set to images of trucks and automobiles. The original data set has 60,000 images. 50,000 of these are training images and 10,000 are test images. The images are 32x32 pixels and contain objects from 10 classes. The classes are listed below.
- airplane
- automobile
- bird
- cat
- deer
- dog
- frog
- horse
- ship
- truck

This data set was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton for their paper <i>Learning Multiple Layers of Features from Tiny Images</i>. In this study, the authors use several filters to train their model to learn interesting regularities in the set of images, rather than focus on correlations between nearby pixels [1].

For this lab, we will use the images of automobiles and trucks. The trucks, in this case, are semi-trucks. The creators of the data set state that the classes are mutually exclusive. The automobile class contains images of sedans and SUVs. The truck class contains big trucks only. Neither class has images of pickup trucks.

## Business Understanding

### Motivations
The law treats cars and trucks differently on the road. Trucks often have to stop at weigh stations so their contents can be verified. It would be useful to have a tool that can distinguish between cars and trucks. Once a truck has been identified, a record of the truck and its location can be made so that Customs or local authorities can make sure the truck is checked at the next weigh station.

The classification system developed could be deployed in conjunction with CCTV cameras on the highway. This would give authorities real-time metrics on how many trucks are passing through an area. If a truck passes by two cameras, the system could combine the location of each camera with the time between sightings to estimate the truck's average speed. This would reduce the necessity of having police officers on the road to monitor the speed of semi-trucks.

If a truck is identified as speeding, a police officer could be dispatched to monitor the vehicle. Using a distributed network of cameras on the highway would mean officers could spend more time patrolling residential and commercial areas. The average annual income for a Texas state trooper is \$60,612 [2]. Positioning a trooper on the highway costs roughly \$31 an hour. Meanwhile, the cost of running a CCTV camera 24/7 is approximately 54 cents per month [3].

If a trooper is positioned on the highway, people alter their behavior because they recognize that they are being monitored. If a criminal organization is transporting illicit substances, it can have a lead car drive ahead of the transport truck so officers can be located before the truck passes through an area. However, people often don't recognize when they are being monitored by a CCTV camera.

### Objectives
Our main objective is to accurately pick out a semi-truck from a sea of automobiles. We assume a state trooper can distinguish between a semi-truck and an automobile 100% of the time. But troopers rotate in and out of an area, leaving gaps in the time a road is monitored. Take the following simplified case: one trooper is assigned to watch a highway for a shift that starts at 6am, ends at 5pm, and includes an hour for lunch. A second trooper rotates in to monitor the highway starting at 6pm and ending at 3am. The uncovered time is the hour between shifts, the lunch hour, and the three hours between 3am and 6am, so the percentage of time the road is covered is given by
<br><br>
$$t_{officer} = \frac{24 - ((6-5) + 1 + (6-3))}{24}\times100 = 79.2\%$$
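
The arithmetic can be checked with a short Python sketch. The list below simply mirrors the three terms in the numerator of the formula above; the variable names are ours and are not part of the lab code.

```python
# Hours in which no trooper is watching the road, one entry per term
# in the coverage formula above (illustrative names, not lab code).
uncovered_hours = [
    6 - 5,  # gap between the two shifts (5pm to 6pm)
    1,      # lunch break
    6 - 3,  # overnight gap after the second shift ends at 3am
]

hours_per_day = 24
coverage = (hours_per_day - sum(uncovered_hours)) / hours_per_day * 100
print(f"{coverage:.1f}%")  # 79.2%
```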


<br><br>
So 79% is our threshold to beat. In order for our algorithm to be useful to authorities, it must minimize the number of trucks that slip through undetected. We will do this by measuring the performance of our model with Recall such that
<br><br>

$$Recall = \frac{TP}{TP + FN}$$
<br>
Where TP is the number of true positives (trucks correctly identified as trucks), and FN is the number of false negatives (trucks incorrectly classified as automobiles).

Therefore, our objective is to minimize the number of false negatives produced and reach 79% recall to be a viable replacement for police officers.
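
As a minimal sketch of this metric, the helper below counts true positives and false negatives for a chosen positive class. The function name `recall` and the toy labels are illustrative assumptions; the positive class defaults to 0 because the label encoding used in this lab maps `truck` to 0.

```python
def recall(y_true, y_pred, positive=0):
    """Recall = TP / (TP + FN) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)

# Toy example (not lab results): 0 = truck (positive class), 1 = automobile
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

truck_recall = recall(y_true, y_pred)
print(truck_recall)          # 3 of the 4 trucks were found
print(truck_recall >= 0.79)  # compare against the 79% threshold
```

With scikit-learn, `recall_score(y_true, y_pred, pos_label=0)` should give the same value. The `pos_label` argument matters here because the library treats label 1 as the positive class by default.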

## Data Preparation

### Data Cleaning
We will start by loading the images and subsetting to 1,000 images. We will use a ratio of cars to trucks that most closely matches real-world driving conditions. A project at The George Washington University [2] puts the percentage of highway vehicles that are trucks anywhere between 5% and 25% depending on the stretch of road. We will use 25% because it balances the classes somewhat while still conforming to a real-world estimate of the ratio between trucks and cars. We will reduce the dimensionality of our images by transforming them to gray scale. This will reduce the number of features for each image from 3,072 (32 x 32 pixels x 3 color channels) to 1,024 (32 x 32 pixels x 1 channel).
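
To make the feature count concrete, the sketch below applies a weighted gray scale conversion to one synthetic 32x32 RGB image. The channel weights are the same ones used in the image-loading cell of this lab; the random image and the helper name `to_gray` are assumptions for illustration only.

```python
import random

def to_gray(r, g, b):
    """Weighted sum of the three color channels for one pixel."""
    return 0.2989 * r + 0.5870 * g + 0.1140 * b

# Synthetic 32x32 image: one (r, g, b) tuple per pixel (not CIFAR-10 data)
rng = random.Random(0)
pixels = [(rng.randrange(256), rng.randrange(256), rng.randrange(256))
          for _ in range(32 * 32)]

# Flattened color image: three values per pixel
rgb_features = [channel for pixel in pixels for channel in pixel]
# Flattened gray scale image: one value per pixel
gray_features = [to_gray(r, g, b) for r, g, b in pixels]

print(len(rgb_features))   # 3072
print(len(gray_features))  # 1024
```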

In [2]:
import numpy as np
import pandas as pd

df_labels = pd.read_csv('../Lab3/data/labels.csv')
df_labels = df_labels[ df_labels.label.isin(['automobile', 'truck']) ]
df_labels = pd.concat([df_labels[df_labels.label == "truck"].sample(n=250),
                      df_labels[df_labels.label == "automobile"].sample(n=750)])

df_labels.head()

| | id | label |
|---|---|---|
| 21093 | 21094 | truck |
| 47437 | 47438 | truck |
| 47504 | 47505 | truck |
| 6784 | 6785 | truck |
| 2215 | 2216 | truck |


In [19]:
from PIL import Image

# reads a png and returns its pixels as a flat list of gray scale values
# (despite its name, the function does not return separate r, g, b values)
def get_img_as_rgb_row(image_path):
    img = Image.open(image_path)
    if len(img.split()) == 4:
        # remove alpha if present
        r, g, b, a = img.split()
        img = Image.merge("RGB", (r, g, b))
    r, g, b = img.split()
    r = list(r.getdata())
    g = list(g.getdata())
    b = list(b.getdata())
    # convert to gray scale
    img_list = [(r[i] * 0.2989 + g[i] * 0.5870 + b[i] * 0.1140) for i in range(len(r))]
    return img_list

# generate column names
cols = ['label']
for i in range(1024):
    cols.append("{}".format(i))

# create df and extract gray scale values for all car and truck images
df = pd.DataFrame(columns=cols, index=range(len(df_labels)))
data_dir = "../Lab3/data/cifar-10/"
# fill one row per image: the label followed by its 1,024 pixel values
for idx, (_, row) in enumerate(df_labels.iterrows()):
    entry = [row.label]
    entry.extend(get_img_as_rgb_row("{}{}.png".format(data_dir, row.id)))
    df.loc[idx] = entry
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 1025 entries, label to 1023
dtypes: object(1025)
memory usage: 7.8+ MB


In [20]:
label_dict = {
    'truck': 0,
    'automobile': 1
}
df['label_int'] = [label_dict[x] for x in df.label]
df.head()

| | label | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 | label_int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | truck | 158.756 | 200.98 | 248.317 | 245.317 | 247.203 | 248.203 | 250.203 | 249.975 | 247.975 | ... | 236.824 | 235.824 | 235.824 | 235.824 | 234.824 | 233.938 | 233.938 | 232.938 | 233.938 | 0 |
| 1 | truck | 147.931 | 130.922 | 123.11 | 152.324 | 161.039 | 191.824 | 242.378 | 252.975 | 253.747 | ... | 49.3927 | 47.795 | 35.0459 | 42.5397 | 87.0962 | 59.3179 | 72.4244 | 76.7275 | 27.1284 | 0 |
| 2 | truck | 80.263 | 83.8066 | 89.8769 | 92.2465 | 87.247 | 84.2473 | 83.2474 | 87.247 | 88.3609 | ... | 173.905 | 172.308 | 172.297 | 170.596 | 168.596 | 164.597 | 160.597 | 156.597 | 156.483 | 0 |
| 3 | truck | 125.853 | 121.555 | 124.142 | 125.598 | 122.585 | 109.454 | 108.399 | 109.22 | 108.323 | ... | 1.4837 | 2.1847 | 4.0705 | 9.07 | 22.0687 | 40.0669 | 47.9521 | 49.066 | 50.0659 | 0 |
| 4 | truck | 167.966 | 148.458 | 122.005 | 87.7433 | 57.4814 | 49.2542 | 55.2876 | 59.7341 | 56.6635 | ... | 96.4897 | 94.6362 | 93.1355 | 93.6085 | 93.4945 | 87.3471 | 78.6129 | 71.1236 | 65.46 | 0 |


In [21]:
df.to_csv('./clean-data/vehicles.csv')

In [29]:
import numpy as np
import pandas as pd

df = pd.read_csv('./clean-data/vehicles.csv', index_col=0)

# split into pixel features (floats) and integer class labels
X = df.drop(['label', 'label_int'], axis=1).astype(np.float64)
y = df['label_int'].astype(np.int64)

print(X.shape)
print(y.shape)

(1000, 1024)
(1000,)


### Train Test Split
We will pull out 10% of our samples to serve as a test set for our classifier. This will allow us to gauge the generalization performance of our different models given their different architectures. We will train each model on the training set that contains 90% of our samples. Because of the large training time, we will only train and test once for each architecture.
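
The idea behind a stratified 90/10 split can be sketched in plain Python. This is not the scikit-learn implementation used in this lab; the helper `stratified_split` and the synthetic labels are hypothetical and only show that each class keeps its 25/75 proportion in both sets.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.1, seed=64):
    """Return (train_idx, test_idx) with test_size of each class held out."""
    rng = random.Random(seed)
    # group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels with the lab's class mix: 250 trucks (0), 750 automobiles (1)
labels = [0] * 250 + [1] * 750
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))  # class counts in the training set
print(Counter(labels[i] for i in test_idx))   # class counts in the test set
```

Holding out the same fraction of each class keeps the truck share at 25% in the test set, so recall is measured on a class mix that matches the training data.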

In [30]:
from sklearn.model_selection import StratifiedShuffleSplit
from keras.utils import to_categorical

col_names = X.columns.values

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=64)
for train_idx, test_idx in sss.split(X.values, y.values):
    # X_train - 90% training attribute set
    # X_test - 10% test attribute set
    # y_train - 90% training labels
    # y_test - 10% test labels
    X_train, X_test = pd.DataFrame(X.values[train_idx], columns=col_names), pd.DataFrame(X.values[test_idx], columns=col_names)
    y_train, y_test = pd.DataFrame(y.values[train_idx], columns=["label_int"]), pd.DataFrame(y.values[test_idx], columns=["label_int"])

y_train, y_test = y_train.values.flatten(), y_test.values.flatten()

## Modeling

## References
[1] Alex Krizhevsky, 2009: <a href="http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf">Learning Multiple Layers of Features from Tiny Images</a>

[2] Face the Facts USA, 2013: <a href="https://www.facethefactsusa.org/facts/get-numbers-truck">Get the numbers of that truck</a>