# State Farm Distracted Driver Classification (using Keras)
## Satchel Grant

The goal of this notebook is to classify the State Farm Distracted Drivers dataset using Keras. I also implement a generator for feeding data, which reduces memory consumption during training, and I practice using ThreadPool to speed up image processing during testing.

### Overview
I begin by reading in the file paths to each of the training images. The full dataset provides training images, testing images with no labels, and a sample submission csv.

The training images are split into ten folders, one per class. I read them in and convert each class to an integer from 0 to 9.

I then create a data augmentation function to make rotated and translated copies of each training image. This enlarges the dataset, which tends to help the classification model generalize. Note that I have not yet trained on the augmented data.

Next I create a generator to read in images in batches from their file names.

I then define the model in Keras. Each convolutional layer is composed of three filter sizes (1x1, 3x3, and 5x5). Each filter size is applied separately to the incoming activations with 'same' padding, and the resulting activations are concatenated along the channel axis. This removes the need to pick a single filter size. The concatenated activations are then max pooled (2x2 filter and stride). The model then runs 2 dense layers followed by an output layer. Every convolutional and dense layer uses the 'elu' activation function, chosen for its improvements over 'relu', except for the final output layer, which uses softmax. The model uses batch normalization after each layer and applies dropout most heavily at the transition from convolutional to dense layers. See the README.md for more details.

The model works well on the training and validation sets considering the model's relatively small size. After a total of 12 epochs using the Adam optimizer, it reached greater than 99% accuracy on both the training and validation sets without using the augmented data. Additionally, the whole training run finished within 6 hours on a CPU.

I used ThreadPool to read in the testing images concurrently with evaluating them on the model. This saved about 20 minutes on a CPU.

For the submission, I got the best results by setting the confidence of the predicted class to 95% and distributing the remaining 5% equally among the other classes. The final log loss on Kaggle was 1.8 on the private board and 2.0 on the public board. This translates to roughly 65% accuracy. Needless to say, I overfit the training and validation data. This is likely due in part to multiple pictures of the same driver appearing in the training data. To fix this, I plan on using the augmented data and making more use of dropout. I may also use pseudo-labeling on the test set, which has had empirical success.
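
The confidence adjustment described above can be sketched as follows. This is a minimal illustration rather than the code used for the submission: `smooth_predictions` and its `confidence` parameter are hypothetical names, and it assumes the predictions are an array of per-class probabilities with one row per image.

```python
import numpy as np

def smooth_predictions(predictions, confidence=0.95):
    """Give the top class `confidence`; split the rest evenly over the other classes."""
    predictions = np.asarray(predictions, dtype=np.float64)
    n_samples, n_classes = predictions.shape
    # Every class starts with an equal share of the leftover probability mass.
    smoothed = np.full_like(predictions, (1.0 - confidence) / (n_classes - 1))
    # The most probable class of each row receives the fixed confidence.
    smoothed[np.arange(n_samples), predictions.argmax(axis=1)] = confidence
    return smoothed

# Toy example with 3 classes (the notebook has 10).
example = np.array([[0.98, 0.01, 0.01],
                    [0.10, 0.20, 0.70]])
smoothed_example = smooth_predictions(example)
```

Capping the confidence bounds the log-loss penalty for a wrong prediction: with 10 classes the worst case per image is -ln(0.05/9), about 5.2, instead of growing without bound as the probability assigned to the true class approaches 0.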

### Initial Imports

In [13]:
import numpy as np
import cv2
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os
from sklearn.utils import shuffle
import scipy.misc as sci  # imrotate/imresize (deprecated in SciPy 1.0) are gone from recent SciPy releases
import time
import sys
from PIL import Image

%matplotlib inline

def show_img(img):
    plt.imshow(img)
    plt.show()

### Read in Data

First I read in the file paths of each of the image files. I create an array parallel to the image paths array to store the corresponding labels.


In [2]:
external_drive_path = '/Volumes/WhiteElephant/'
home_path = os.getcwd()
os.chdir(external_drive_path)

In [3]:
path = './statefarm_drivers/imgs/train'

def read_paths(path, no_labels=False):
    file_paths = []
    labels = []
    labels_to_nums = dict()
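    for dir_name, subdir_list, file_list in os.walk(path):
        if len(subdir_list) > 0:
            # os.walk does not guarantee directory order, so sort in place
            # to get a deterministic mapping (c0 -> 0, ..., c9 -> 9).
            subdir_list.sort()
            label_types = subdir_list
            for i,subdir in enumerate(subdir_list):
                labels_to_nums[subdir] = i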
        for img_file in file_list:
            if '.jpg' in img_file.lower():
                file_paths.append(os.path.join(dir_name,img_file))
                if no_labels: labels.append(img_file)
                else: labels.append(labels_to_nums[dir_name[-2:]])
    if no_labels: return file_paths, labels
    n_labels = len(label_types)
    return file_paths, labels, n_labels
    

file_paths, labels,n_labels = read_paths(path)
file_paths, labels = shuffle(file_paths, labels)

print("Number of data samples: " + str(len(file_paths)))
print("Number of Classes: " + str(n_labels))

Number of data samples: 22424


In [4]:
def one_hot_encode(labels, n_classes):
    one_hots = []
    for label in labels:
        one_hot = [0]*n_classes
        one_hot[label] = 1
        one_hots.append(one_hot)
    return np.array(one_hots,dtype=np.float32)

labels = one_hot_encode(labels,n_labels)
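
For comparison, the same encoding can be written without the Python loop by indexing an identity matrix. This is only an illustrative alternative: `one_hot_encode_vectorized` is a hypothetical name, and it assumes the labels are integers in the range 0 to `n_classes - 1`.

```python
import numpy as np

def one_hot_encode_vectorized(labels, n_classes):
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing with the label array selects one row per sample.
    return np.eye(n_classes, dtype=np.float32)[np.asarray(labels, dtype=np.int64)]

encoded = one_hot_encode_vectorized([0, 2, 1], 3)
```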

### Data Augmentation

The following cell contains data augmentation functions to increase the amount of usable data. The final augmenting function applies a rotation followed by a random translation of up to 20 pixels along both axes. The images are then saved as jpgs to be used later. The augmentation code is only meant to be run once.

I save the augmented images to disk instead of generating them inside the generator because I want each training batch to contain a random sampling of the drivers. Training performance is better on completely randomized sets.

In [79]:
import random

def rotate(img, angle, ones):
    rot_img = sci.imrotate(img, angle).astype(np.float32)
    color_range = 255
    rand_filler = np.random.random(rot_img.shape).astype(np.float32)*color_range
    rot_img[ones[:,:,:]!=1] = rand_filler[ones[:,:,:]!=1]
    return rot_img

def translate(img, row_amt, col_amt):
    color_range = 255
    translation = np.random.random(img.shape).astype(img.dtype)*color_range
    if row_amt > 0:
        if col_amt > 0:
            translation[row_amt:,col_amt:] = img[:-row_amt,:-col_amt]
        elif col_amt < 0:
            translation[row_amt:,:col_amt] = img[:-row_amt,-col_amt:]
        else:
            translation[row_amt:,:] = img[:-row_amt,:]
    elif row_amt < 0:
        if col_amt > 0:
            translation[:row_amt,col_amt:] = img[-row_amt:,:-col_amt]
        elif col_amt < 0:
            translation[:row_amt,:col_amt] = img[-row_amt:,-col_amt:]
        else:
            translation[:row_amt,:] = img[-row_amt:,:]
    else:
        if col_amt > 0:
            translation[:,col_amt:] = img[:,:-col_amt]
        elif col_amt < 0:
            translation[:,:col_amt] = img[:,-col_amt:]
        else:
            return img.copy()
    return translation


def add_augmentations(paths, rot_angles=[10,-10], row_shift=5, col_shift=3):
    img = mpimg.imread(paths[0])
    ones = [sci.imrotate(np.ones_like(img),rot_angles[i]) for i in range(len(rot_angles))]
    for path in paths:
        img = mpimg.imread(path)
        for i,angle in enumerate(rot_angles):
            add_augmentation(img,path,angle,row_shift,col_shift,ones[i])

def add_augmentation(img,path,angle,row_shift,col_shift,ones):
    new_img = rotate(img,angle,ones)
    new_img = translate(new_img,random.randint(-row_shift,row_shift),random.randint(-col_shift,col_shift))
    new_img = new_img.astype(np.uint8)
    split_path = path.split('/')
    i = 1
    if angle < 0: i = 2
    split_path[-1] = 'augmented_'+ str(i)+"_"+ split_path[-1]
    new_path = '/'.join(split_path)
    jpeg = Image.fromarray(new_img)
    jpeg.save(new_path)

        

In [80]:
add_augmentations(file_paths, row_shift=20, col_shift=20)
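
The nine-branch `translate` function above can also be expressed with computed slices. The following is a compact sketch, not a drop-in replacement: `translate_compact` and its `rng` parameter are hypothetical, it assumes the shifts are smaller than the image, and it scales the random filler before casting to the image dtype.

```python
import numpy as np

def translate_compact(img, row_amt, col_amt, rng=None):
    """Shift img by (row_amt, col_amt) pixels and fill the exposed border with noise."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    # Start from random filler, then paste the shifted image on top of it.
    out = (rng.random(img.shape) * 255).astype(img.dtype)
    dst_rows = slice(max(row_amt, 0), h + min(row_amt, 0))
    src_rows = slice(max(-row_amt, 0), h - max(row_amt, 0))
    dst_cols = slice(max(col_amt, 0), w + min(col_amt, 0))
    src_cols = slice(max(-col_amt, 0), w - max(col_amt, 0))
    out[dst_rows, dst_cols] = img[src_rows, src_cols]
    return out

# Toy 3x4 single-channel "image": shift down by 1 row and left by 1 column.
toy_img = np.arange(12, dtype=np.float32).reshape(3, 4)
shifted = translate_compact(toy_img, 1, -1)
```

A positive `row_amt` moves the content down and a negative `col_amt` moves it left, matching the sign convention of the original function.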

### Generator
Next I create a generator to read in the images in batches for training. This reduces the memory required to train on the whole dataset.

There are two functions: convert_images reads in the images from the paths array, and image_generator enforces the batch sizing and pairs each batch with its labels. convert_images is used again later for the testing set.

In [10]:
split_index = int(.75*len(file_paths))
X_train_paths, y_train = file_paths[:split_index], labels[:split_index]
X_valid_paths, y_valid = file_paths[split_index:], labels[split_index:]
batch_size = 128
train_steps_per_epoch = len(X_train_paths)//batch_size + 1
if len(X_train_paths) % batch_size == 0: train_steps_per_epoch = len(X_train_paths)//batch_size
valid_steps_per_epoch = len(X_valid_paths)//batch_size
resize_dims = (120,120)


def convert_images(paths, resize_dims, img_depth=3):
    images = []
    for i,path in enumerate(paths):
        img = mpimg.imread(path)
        images.append(sci.imresize(img, resize_dims))
    return np.array(images,dtype=np.float32)

def image_generator(file_paths, labels, batch_size, resize_dims=(120,120),testing=False,img_depth=3):
    while 1:
        for batch in range(0, len(file_paths), batch_size):
            images = convert_images(file_paths[batch:batch+batch_size],resize_dims,img_depth=img_depth)
            if testing: yield images
            else: 
                batch_labels = labels[batch:batch+batch_size]
                yield images, batch_labels


train_generator = image_generator(X_train_paths, y_train, batch_size)
valid_generator = image_generator(X_valid_paths, y_valid, batch_size)
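
The step counts above amount to a ceiling division for training and a floor division for validation. A small worked sketch using the sample count printed earlier (22424 images); `steps_per_epoch` is a hypothetical helper, not part of the notebook.

```python
import math

def steps_per_epoch(n_samples, batch_size, keep_partial_batch=True):
    """Number of generator batches needed to cover n_samples once."""
    if keep_partial_batch:
        return math.ceil(n_samples / batch_size)
    return n_samples // batch_size

n_total = 22424                   # sample count reported when reading the paths
n_train = int(0.75 * n_total)     # 16818 training images
n_valid = n_total - n_train       # 5606 validation images

train_steps = steps_per_epoch(n_train, 128)                            # 132
valid_steps = steps_per_epoch(n_valid, 128, keep_partial_batch=False)  # 43
```

With floor division, the final partial batch of 102 validation images is not evaluated in a given pass; because the generator loops forever, those images carry over into the next validation pass.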

### Keras Imports

In [7]:
from keras.models import Sequential, Model
from keras.layers import Conv2D, MaxPooling2D, Dense, Input, concatenate, \
        Flatten, Dropout, Lambda
from keras.layers.normalization import BatchNormalization


Using TensorFlow backend.


### Model Architecture

The model consists of 4 convolutional stacks followed by 2 dense layers and a final output layer. I had good success with this model when predicting, in real time, the steering angle a car needs to drive around a track from an image of that track. It is also a lightweight model, which makes it quick and easy to train.


In [8]:
stacks = []
conv_shapes = [(1,1),(3,3),(5,5)]
conv_depths = [8,10,10,10]
pooling_filter = (2,2)
pooling_stride = (2,2)
dense_shapes = [150,50,n_labels]

inputs = Input(shape=(resize_dims[0],resize_dims[1],3))
zen_layer = BatchNormalization()(inputs)  # NOTE: never used; the first conv stack below reads from `inputs` directly

for shape in conv_shapes:
    stacks.append(Conv2D(conv_depths[0], shape, padding='same', activation='elu')(inputs))
layer = concatenate(stacks,axis=-1)
layer = BatchNormalization()(layer)
layer = MaxPooling2D(pooling_filter,strides=pooling_stride,padding='same')(layer)
layer = Dropout(0.02)(layer)

for i in range(1,len(conv_depths)):
    stacks = []
    for shape in conv_shapes:
        stacks.append(Conv2D(conv_depths[i],shape,padding='same',activation='elu')(layer))
    layer = concatenate(stacks,axis=-1)
    layer = BatchNormalization()(layer)
    layer = Dropout(i*10**-2)(layer)
    layer = MaxPooling2D(pooling_filter,strides=pooling_stride, padding='same')(layer)

layer = Flatten()(layer)
layer = Dropout(0.5)(layer)

for i in range(len(dense_shapes)-1):
    layer = Dense(dense_shapes[i], activation='elu')(layer)
    layer = BatchNormalization()(layer)

outputs = Dense(dense_shapes[-1], activation='softmax')(layer)
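
To make the layer sizes concrete, the following sketch traces the activation shapes through the four convolutional stacks. It only does the arithmetic (no Keras involved) and assumes the settings defined above: 120x120 inputs, three filter shapes per stack, and 2x2 max pooling with stride 2 and 'same' padding, which rounds odd sizes up.

```python
import math

input_size = 120                 # resize_dims
n_filter_shapes = 3              # 1x1, 3x3 and 5x5 branches
conv_depths = [8, 10, 10, 10]

size = input_size
for depth in conv_depths:
    channels = n_filter_shapes * depth   # branches are concatenated on the channel axis
    size = math.ceil(size / 2)           # 2x2 max pool, stride 2, 'same' padding
    print(f"{size}x{size}x{channels}")

flattened = size * size * channels       # input width of the first dense layer
print(flattened)
```

The stacks produce 60x60x24, 30x30x30, 15x15x30 and 8x8x30 activations, so the first dense layer receives 1920 features.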

### Training and Validation
The next cell trains the model using the Adam optimizer and categorical_crossentropy. I chose Adam because it adapts the learning rate for each parameter in the net and it uses momentum. Both of these techniques improve the efficiency of the training process.

I use the categorical_crossentropy loss function because it is well suited to multi-class classification with softmax outputs.
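
As a reference for what this loss measures, here is a NumPy sketch of the categorical cross-entropy formula. It illustrates the math rather than Keras's internal implementation, and the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=np.float32)
confident_right = np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]])
confident_wrong = np.array([[0.05, 0.90, 0.05], [0.90, 0.05, 0.05]])

loss_right = categorical_crossentropy(y_true, confident_right)
loss_wrong = categorical_crossentropy(y_true, confident_wrong)
```

Only the probability assigned to the true class contributes, so the confident correct predictions cost -ln(0.9), about 0.105 per sample, while the confident wrong ones cost -ln(0.05), about 3.0.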

In [54]:
model = Model(inputs=inputs,outputs=outputs)
model.load_weights('model.h5')
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(train_generator, train_steps_per_epoch, epochs=2,
                    validation_data=valid_generator,validation_steps=valid_steps_per_epoch)
model.save('model.h5')


Epoch 1/2
Epoch 2/2


### Testing
The following cells are for testing on the unlabeled dataset. Similar to the training images, I first read in the image paths. I then use ThreadPool to create a separate worker thread for reading in the images. The next set of images is read in while the model evaluates the current set.
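
The prefetching pattern described above can be sketched in isolation. In this toy version `load_chunk` and `predict_chunk` are hypothetical stand-ins for reading images and calling the model; the sleeps only simulate work.

```python
import time
from multiprocessing.pool import ThreadPool

def load_chunk(chunk_id):
    time.sleep(0.05)                 # stand-in for reading and resizing images
    return [chunk_id] * 4

def predict_chunk(chunk):
    time.sleep(0.05)                 # stand-in for model.predict
    return [x * 10 for x in chunk]

n_chunks = 5
results = []
with ThreadPool(processes=1) as pool:
    # Start loading the first chunk before entering the loop.
    pending = pool.apply_async(load_chunk, (0,))
    for i in range(1, n_chunks + 1):
        chunk = pending.get()        # wait for the chunk loaded in the background
        if i < n_chunks:
            # Kick off the next load before predicting on the current chunk.
            pending = pool.apply_async(load_chunk, (i,))
        results.append(predict_chunk(chunk))
```

Because ThreadPool uses threads inside the same process, the overlap only helps when the loading work releases Python's global interpreter lock, as file I/O does.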

In [81]:
path = './statefarm_drivers/imgs/test'
test_paths, test_labels = read_paths(path,no_labels=True) # test_labels holds the image file names
print(str(len(test_paths))+' testing images')

79726 testing images


In [86]:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=1) # One worker thread (not a separate process)
test_divisions = 20 
portion = len(test_paths)//test_divisions+1 # Size of portion of images to read in
async_result = pool.apply_async(convert_images,(test_paths[0*portion:portion*(0+1)],resize_dims))
predictions = []
batch_size = 100 # Batch size used for keras predict function

In [87]:
total_base_time = time.time()

for i in range(1,test_divisions+1):
    base_time = time.time()
    
    print("Begin set")
    X_set = async_result.get()
    print("Finish reading images, time: " + str(time.time()-base_time)+'secs')
    
    if i < test_divisions:
        async_result = pool.apply_async(convert_images,(test_paths[i*portion:portion*(i+1)],resize_dims))
        
    predictions.append(model.predict(X_set,batch_size=batch_size,verbose=0))
    
    print("Execution Time: " + str((time.time()-base_time)/60)+'min\n')
print("Total Execution Time: " + str((time.time()-total_base_time)/60)+'mins')

Begin set
Finish reading images, time: 54.493602991104126

Execution Time: 3.8545568148295084min
Begin set
Finish reading images, time: 0.058259010314941406

Execution Time: 2.9133670806884764min
Begin set
Finish reading images, time: 0.06974577903747559

Execution Time: 2.934571580092112min
Begin set
Finish reading images, time: 0.07256913185119629

Execution Time: 3.0980912685394286min
Begin set
Finish reading images, time: 0.0618901252746582

Execution Time: 3.050136148929596min
Begin set
Finish reading images, time: 0.07086491584777832

Execution Time: 3.112758632500966min
Begin set
Finish reading images, time: 0.060233116149902344

Execution Time: 3.10655996799469min
Begin set
Finish reading images, time: 0.05544614791870117

Execution Time: 3.01642191807429min
Begin set
Finish reading images, time: 0.05763888359069824

Execution Time: 3.0244153141975403min
Begin set
Finish reading images, time: 0.06082797050476074

Execution Time: 3.0216683983802795min
Begin set
Finish reading im

The next cell verifies that the number of predictions is equal to the number of images in the test set.

In [92]:
print("Number of Images: " + str(len(test_paths)))
total = 0
for prediction in predictions:
    
    total += len(prediction)
print("Number of predictions: " + str(total))
if len(test_paths) == total: print("Success!")
else: print("Failure")

Number of Images: 79726
Number of predictions: 79726
Success!


The following cell writes the predictions to a csv file in the format specified by the sample_submission.csv from Kaggle.

In [90]:
counter = 0
with open('./statefarm_drivers/submission.csv', 'w') as f:
    f.write('img,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9\n')
    for i,logit_group in enumerate(predictions):
        for j,logit in enumerate(logit_group):
            id_ = test_labels[counter]
            counter+=1 # I use a counter here because the size of the logit_groups changes
            f.write(id_+',')
            for k,element in enumerate(logit):
                if k == logit.shape[0]-1: f.write(str(element)+'\n')
                else: f.write(str(element)+',')

The submission scored a log loss of 1.8 on the private leaderboard. That translates to roughly 65% accuracy. This means I definitely overfit the training and validation data.
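
Since the overview attributes part of the overfitting to the same driver appearing in many images, one way to get a more honest validation score would be to hold out whole drivers. The following is a sketch under the assumption that a driver ID is available for every image, which this notebook does not load; `split_by_driver` and the toy data are hypothetical.

```python
import random

def split_by_driver(paths, driver_ids, valid_fraction=0.25, seed=0):
    """Hold out whole drivers so that no driver appears in both splits."""
    drivers = sorted(set(driver_ids))
    random.Random(seed).shuffle(drivers)
    n_valid = max(1, int(round(valid_fraction * len(drivers))))
    valid_drivers = set(drivers[:n_valid])
    train = [p for p, d in zip(paths, driver_ids) if d not in valid_drivers]
    valid = [p for p, d in zip(paths, driver_ids) if d in valid_drivers]
    return train, valid, valid_drivers

# Toy data: 8 images from 4 made-up drivers.
toy_paths = [f"img_{i}.jpg" for i in range(8)]
toy_drivers = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]
train_paths, valid_paths, held_out = split_by_driver(toy_paths, toy_drivers)
```

With a split like this, validation accuracy reflects performance on drivers the model has never seen, which is closer to the situation on the test set.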