# Lab 3 - Exploring Image Data

### Eric Smith and Jake Carlson

## Introduction
For this lab we have choosen the CIFAR-10 image data set. The original data set has 60,000 images. 50,000 of these are training images and 10,000 are test images. The images are 32x32 pixels and contain objects from 10 classes. The classes are listed below.
- airplane
- automobile
- bird
- cat
- deer
- dog
- frog
- horse
- ship
- truck

For this lab, we will use the automobile and truck images.

## Business Understanding

### Motivations
The law treats cars and trucks differently on the road. Trucks often have to stop at weigh stations so their contents can be verified. It would be useful to have a tool that can distinguish between cars and trucks. Once a truck has been identified, a record of the truck and its location can be made so that Customs can make sure the truck is checked at the next weigh station.

### Objectives
We want to be able to accurately predict the object in the picture with 90% accuracy.

## Data Understanding

### Data Attributes
The following is a list of attributes in the data, their data types, and a brief description of the attribute.


## Data Quality

Because we only want to look at automobiles and trucks, we will read the image labels and select all of the images that have these labels assigned to them. We will then convert the images to table data for analysis.

In [1]:
import numpy as np
import pandas as pd

df_labels = pd.read_csv('./data/labels.csv')
df_labels.head()

Unnamed: 0,id,label
0,1,frog
1,2,truck
2,3,truck
3,4,deer
4,5,automobile


In [2]:
df_labels = df_labels[ df_labels.label.isin(['automobile', 'truck']) ]
df_labels.head()

Unnamed: 0,id,label
1,2,truck
2,3,truck
4,5,automobile
5,6,automobile
14,15,truck


In [3]:
df_labels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 49999
Data columns (total 2 columns):
id       10000 non-null int64
label    10000 non-null object
dtypes: int64(1), object(1)
memory usage: 234.4+ KB


We still have 10,000 images in this data set, so we will take a random sample of 500 images from each class. In the future we could adapt this ratio to more closely match the ratio of cars ot trucks on the road.

In [5]:
df_labels = pd.concat([df_labels[df_labels.label == "truck"].sample(n=500),
                      df_labels[df_labels.label == "automobile"].sample(n=500)])
df_labels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 35844 to 17008
Data columns (total 2 columns):
id       1000 non-null int64
label    1000 non-null object
dtypes: int64(1), object(1)
memory usage: 23.4+ KB


In [6]:
# reading pngs in python:
# https://www.daniweb.com/programming/software-development/threads/253957/converting-an-image-file-png-to-a-bitmap-file
from PIL import Image

# reads a png and returns a list of all pixel values in order r, g, b
def get_img_as_rgb_row(image_path):
    img = Image.open(image_path)
    if len(img.split()) == 4:
        # remove alpha if present
        r, g, b, a = img.split()
        img = Image.merge("RGB", (r, g, b))
    r, g, b = img.split()
    img_list = []
    img_list.extend(list(r.getdata()))
    img_list.extend(list(g.getdata()))
    img_list.extend(list(b.getdata()))
    return img_list

# generate column names
cols = ['label']
for i in ['r', 'g', 'b']:
    for j in range(1024):
        cols.append("{}{}".format(i,j))

# create df and extract color values for all car and truck images
df = pd.DataFrame(columns=cols, index=range(len(df_labels.id.tolist())))
data_dir = "./data/cifar-10/"
idx = 0
for r in df_labels.iterrows():
    entry = [r[1].label]
    entry.extend(get_img_as_rgb_row("{}{}.png".format(data_dir, r[1].id)))
    df.loc[i] = entry
    idx += 1
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1001 entries, 0 to b
Columns: 3073 entries, label to b1023
dtypes: object(3073)
memory usage: 23.5+ MB


In [7]:
df.head()

Unnamed: 0,label,r0,r1,r2,r3,r4,r5,r6,r7,r8,...,b1014,b1015,b1016,b1017,b1018,b1019,b1020,b1021,b1022,b1023
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


## References
Alex Krizhevsky, 2009: <a href="http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf">Learning Multiple Layers of Features from Tiny Images</a>