# State Farm Distracted Driver Detection: Take 3
_Can computer vision spot distracted drivers?_

---

## Lesson 3 Homework Assignment

Dataset: https://www.kaggle.com/c/state-farm-distracted-driver-detection

### Dealing with Overfitting
In [`lesson2-hmwk.ipynb`](https://github.com/iconix/fast.ai/blob/master/nbs/lesson2-hmwk.ipynb), my final results (after 5 epochs) were as follows:
> `loss: 0.6260 - acc: 0.7907 - val_loss: 1.6719 - val_acc: 0.4978`

When `acc >> val_acc` (training accuracy far above validation accuracy), this is a clear indicator of overfitting on the training data.

On the bright side, this means that my neural net architecture is complex enough to model the data. The next step is to get the model to generalize better.
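
The size of the gap can be read straight off the logged line. The snippet below is a minimal sketch, not part of the original notebook; `parse_keras_log` is a hypothetical helper that assumes the `name: value` format shown in the quoted output.

```python
import re

def parse_keras_log(line):
    """Turn 'loss: 0.6260 - acc: 0.7907 ...' into a dict of floats."""
    return {name: float(value)
            for name, value in re.findall(r"(\w+): ([0-9.]+)", line)}

# Final epoch from lesson 2, copied from the line quoted above.
metrics = parse_keras_log(
    "loss: 0.6260 - acc: 0.7907 - val_loss: 1.6719 - val_acc: 0.4978")

acc_gap = metrics["acc"] - metrics["val_acc"]      # positive: training beats validation
loss_gap = metrics["val_loss"] - metrics["loss"]   # positive: validation loss is worse

print("accuracy gap: %.4f, loss gap: %.4f" % (acc_gap, loss_gap))
```

A training accuracy roughly 29 points above validation accuracy, together with a validation loss more than double the training loss, is the pattern to watch for.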

Here is the prioritized list of approaches to reducing overfitting provided during class:
1. Add more data
2. Use data augmentation
3. Use architectures that generalize well
4. Add regularization (dropout, L2/L1 regularization)
5. Reduce architecture complexity

### Downloading and creating the datasets

This week, I am skipping the steps that download the data and create the training, validation, test, and sample sets, and reusing last week's split instead. See [`lesson2-hmwk.ipynb`](https://github.com/iconix/fast.ai/blob/master/nbs/lesson2-hmwk.ipynb) for those steps, if needed.

I will also start from last week's weights (`finetune2.h5`), which are loaded below.

### Basic Configuration

In [1]:
import os

current_dir = os.getcwd()
LESSON_HOME_DIR = current_dir
DATA_HOME_DIR = current_dir + '/data/statefarm/'

# point to your training images
train_dir = DATA_HOME_DIR + 'train'

# point to the 'driver_imgs_list.csv'
lookup = DATA_HOME_DIR + 'driver_imgs_list.csv'

# point to the validation directory (created last week in lesson2-hmwk.ipynb)
val_dir = DATA_HOME_DIR + 'valid'

sample_dir = DATA_HOME_DIR + 'sample'

test_dir = DATA_HOME_DIR + 'test'

#path = DATA_HOME_DIR + 'sample/'
path = DATA_HOME_DIR
model_path = path + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)

In [2]:
%matplotlib inline

In [3]:
import utils; reload(utils)
from utils import *

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
Using Theano backend.


In [4]:
batch_size=64

### Load pre-trained model layers

In [5]:
model = vgg_ft(10)

In [None]:
finetune2_path = model_path+'finetune2.h5'

model.load_weights(finetune2_path);

In [None]:
model.summary()

In [None]:
layers = model.layers
last_conv_idx = [index for index,layer in enumerate(layers) 
                     if type(layer) is Convolution2D][-1]

conv_layers = layers[:last_conv_idx+1]
# Dense layers - also known as fully connected or 'FC' layers
fc_layers = layers[last_conv_idx+1:]
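
To see what this split produces without loading VGG, here is a small sketch that runs the same indexing on a toy layer list. The classes are hypothetical stand-ins that only borrow the Keras names, and the toy stack is an assumption, not the real VGG16 layout.

```python
# Hypothetical stand-in classes; the real ones come from Keras.
class Convolution2D(object): pass
class MaxPooling2D(object): pass
class Flatten(object): pass
class Dense(object): pass
class Dropout(object): pass

# A toy layer stack shaped like the tail of a VGG-style network.
layers = [Convolution2D(), Convolution2D(), MaxPooling2D(),
          Flatten(), Dense(), Dropout(), Dense()]

# Index of the last convolutional layer in the stack
last_conv_idx = [i for i, layer in enumerate(layers)
                 if type(layer) is Convolution2D][-1]

conv_layers = layers[:last_conv_idx + 1]   # up to and including the last conv layer
fc_layers = layers[last_conv_idx + 1:]     # everything after it

print("last conv index: %d" % last_conv_idx)
print("fc part starts with: %s" % type(fc_layers[0]).__name__)
```

Note that the "FC" half begins with the final max-pooling layer rather than a dense layer, because the cut is made immediately after the last convolution.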

### Data Augmentation

Define a generator that applies data augmentation on the fly (a convenient built-in feature of Keras).

In [None]:
gen = image.ImageDataGenerator(rotation_range=10, width_shift_range=0.1, 
       height_shift_range=0.1, shear_range=0.15, zoom_range=0.1, 
       channel_shift_range=10., horizontal_flip=True)
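
To make two of these parameters concrete, here is a pure-Python sketch of a random width shift and a horizontal flip on a tiny 2-D "image". It illustrates the idea only and is not how Keras implements `ImageDataGenerator`; the helper names are hypothetical, and repeating the edge pixel is an assumption modelled on a "nearest" fill.

```python
import random

def horizontal_flip(img):
    """Mirror each row of a 2-D image (a list of rows)."""
    return [row[::-1] for row in img]

def width_shift(img, pixels):
    """Shift content right by `pixels` (left if negative), repeating the edge value."""
    width = len(img[0])
    return [[row[min(max(x - pixels, 0), width - 1)] for x in range(width)]
            for row in img]

def augment(img, width_shift_range=0.1, flip=True, rng=random):
    """Apply a random shift of up to width_shift_range * width, then maybe flip."""
    max_shift = int(round(width_shift_range * len(img[0])))
    out = width_shift(img, rng.randint(-max_shift, max_shift))
    if flip and rng.random() < 0.5:
        out = horizontal_flip(out)
    return out

img = [[c + 10 * r for c in range(10)] for r in range(2)]  # toy 2x10 "image"
random.seed(0)
print(augment(img))
```

Each call returns a slightly different version of the same image, so the network rarely sees exactly the same input twice.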

In [None]:
trn_batches = get_batches(path+'train', gen, batch_size=batch_size)
# NB: We don't want to augment or shuffle the validation set
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)

Jeremy's [explanation](http://forums.fast.ai/t/lesson-3-discussion/186/33) as to why we aren't training the convolutional layers here:
> The early layers are so general (e.g. remember Zeiler's visualizations - layer 1 just finds edges and gradients) that it's extremely unlikely that you'll need to change them, unless you're looking at very different kinds of images. e.g. if you're classifying line drawings, instead of photos, you'll probably need to retrain many conv layers too.

In [None]:
for layer in layers[:last_conv_idx+1]: layer.trainable=False

In [None]:
# Updating slowly because it is finely tuned
K.set_value(model.optimizer.lr, 0.00001)

In [None]:
model.fit_generator(trn_batches, samples_per_epoch=trn_batches.nb_sample, nb_epoch=8, 
                        validation_data=val_batches, nb_val_samples=val_batches.nb_sample)

After 8 epochs, the gap between training and validation accuracy has all but disappeared, which is great!
> `loss: 1.9278 - acc: 0.5002 - val_loss: 1.7463 - val_acc: 0.5099`

In fact, now we're underfitting very slightly. This seems like a good time to try decreasing dropout a smidge, just to see what happens.

Let's save our intermediate weights first, just in case.

In [6]:
finetune_hw3_1_path = model_path+'finetune_hw3_1.h5'
if not os.path.exists(finetune_hw3_1_path):
    model.save_weights(finetune_hw3_1_path)
model.load_weights(finetune_hw3_1_path)

#### Intermediate Kaggle submission

Interestingly, despite the significant decrease in overfitting, `val_acc` improved only slightly, from `0.4978` to `0.5099`. I'd like to try submitting these results to Kaggle, just to see how this result is reflected in the rankings.

In [22]:
from IPython.display import FileLink

subm_name = 'subm_hmwk3_1.gz'
subm_path = path + 'results/' + subm_name

#Get the classes from the validation batch
val_preprocess = get_batches(path+'valid', shuffle=False, batch_size=1)
classes = sorted(val_preprocess.class_indices, key=val_preprocess.class_indices.get)

Found 4126 images belonging to 10 classes.


Note on `.predict` vs `.predict_generator`: I kept running into kernel death with `.predict`, so I use `.predict_generator` below. The forums describe the same trade-off for training: "the precomputed data will be large. Then it is likely that you will encounter errors such as OOM or **kernel death** during training. In this case, you might want to use model.fit_generator() instead of model.fit()" ([source](http://forums.fast.ai/t/fine-tuning-vgg-taking-very-long/3825/11)).
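
A rough sketch of why the generator route is lighter on memory: the data is visited one batch at a time, so only `batch_size` images need to be held at once. This is not Keras's implementation, and `iter_batches` is a hypothetical helper; the sizes are the ones that appear in this notebook.

```python
def iter_batches(n_items, batch_size):
    """Yield (start, stop) index pairs covering n_items in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

n_test = 79726      # test-set size from the "Found 79726 images" output in this notebook
batch_size = 64     # same batch size as configured earlier

batches = list(iter_batches(n_test, batch_size))
last_size = batches[-1][1] - batches[-1][0]
print("%d batches, last one holds %d images" % (len(batches), last_size))
# prints "1246 batches, last one holds 46 images"
```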

In [23]:
test_batches = get_batches(path+'test', shuffle=False, batch_size = batch_size)
preds = model.predict_generator(test_batches, test_batches.nb_sample)

Found 79726 images belonging to 1 classes.


In [24]:
# (number of classes - 1) = 9 (http://forums.fast.ai/t/moving-up-the-ncfm-leaderboard-by-100-positions-do-clip/1035/6)
def do_clip(arr, mx): return np.clip(arr, (1-mx)/9, mx)
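
Why clipping helps: the competition is scored with multi-class log loss, which punishes a confident wrong answer very heavily. The sketch below works through one image in pure Python. `clip` and `log_loss_term` are hypothetical helpers written for this illustration, and any processing Kaggle applies on its side is ignored.

```python
import math

def clip(p, mx, n_classes=10):
    """Pure-Python equivalent of do_clip for a single probability."""
    lo = (1 - mx) / float(n_classes - 1)
    return min(max(p, lo), mx)

def log_loss_term(p_true):
    """Per-image log loss, given the probability assigned to the correct class."""
    return -math.log(p_true)

# A confident mistake: the correct class gets almost no probability.
p_wrong = 1e-7
print("miss, unclipped: %.3f" % log_loss_term(p_wrong))
print("miss, clipped:   %.3f" % log_loss_term(clip(p_wrong, 0.9)))

# A confident hit pays a small price for the same insurance.
print("hit, unclipped:  %.3f" % log_loss_term(0.999))
print("hit, clipped:    %.3f" % log_loss_term(clip(0.999, 0.9)))
```

With `mx = 0.9` the floor is `(1 - 0.9) / 9`, about `0.0111`, which is the value that shows up repeatedly in the submission preview.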

In [51]:
subm = do_clip(preds, 0.9)

In [52]:
submission = pd.DataFrame(subm, columns=classes)
submission.insert(0, 'img', [a[8:] for a in test_batches.filenames])
submission.head()

|   | img | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | img_81601.jpg | 0.011417 | 0.011111 | 0.011111 | 0.011111 | 0.011111 | 0.463585 | 0.020937 | 0.011111 | 0.045846 | 0.443723 |
| 1 | img_14887.jpg | 0.346917 | 0.121693 | 0.011111 | 0.098112 | 0.011111 | 0.061683 | 0.047707 | 0.017239 | 0.234232 | 0.060724 |
| 2 | img_62885.jpg | 0.011111 | 0.011111 | 0.423249 | 0.015064 | 0.147754 | 0.044589 | 0.280264 | 0.011111 | 0.08356 | 0.011111 |
| 3 | img_45125.jpg | 0.011111 | 0.011111 | 0.256628 | 0.011111 | 0.011111 | 0.011111 | 0.133316 | 0.011111 | 0.593049 | 0.011111 |
| 4 | img_22633.jpg | 0.031959 | 0.01132 | 0.084662 | 0.011111 | 0.011111 | 0.034428 | 0.477008 | 0.011111 | 0.323893 | 0.026945 |
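
A note on the `a[8:]` slice used to build the `img` column: it drops the first 8 characters of each path, which works when every test image sits under a folder whose name plus separator is exactly 8 characters long. The sketch below assumes that folder is called `unknown` (an assumption, since the folder name is not shown in this notebook) and compares the slice with `os.path.basename`, which does not depend on the prefix length.

```python
import os

# Hypothetical paths; 'unknown' is an assumed folder name.
filenames = ['unknown/img_81601.jpg', 'unknown/img_14887.jpg']

by_slice = [f[8:] for f in filenames]                    # drops exactly 8 characters
by_basename = [os.path.basename(f) for f in filenames]   # drops any directory prefix

print(by_slice == by_basename)
```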


In [53]:
submission.to_csv(subm_path, index=False, compression='gzip')
FileLink(subm_path)

Private Score: `1.41527` ... slight improvement!

Interestingly, weeks ago my Public Score tended to be the lower (better) of the two; this time it is the higher one, at `1.48239`. See [this Quora post](https://www.quora.com/What-is-the-difference-between-public-and-private-leaderboard-in-Kaggle/answer/Giuliano-Janson) for an explanation of the difference:
> The public LB is computed on a portion of the test set, the private is computed on the remainder of the test set (not the whole test set).
"Fitting the LB" is a Kaggle term used to describe when you're tuning your models to perform well on the public LB. There is an art and a science in doing so and experience Kagglers are able to make the most out of it without overfitting. If not done well, that usually lends itself to worse scores on the private LB, sometimes disasters. In general the key is to build a model that generalizes well.

### Decrease Dropout
We're going to experiment with decreasing the dropout rate from Vgg16's 50%, since the model is now underfitting slightly.
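
As a reminder of what the rate controls, here is a minimal pure-Python sketch of inverted dropout, the general technique. It is an illustration under that assumption, not the Keras `Dropout` layer itself, and `dropout` is a hypothetical helper.

```python
import random

def dropout(activations, rate, rng=random):
    """Inverted dropout: zero each value with probability `rate`, rescale the survivors."""
    if rate <= 0.0:
        return list(activations)          # a rate of 0 leaves the layer untouched
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(1)
acts = [1.0] * 8
print(dropout(acts, 0.5))   # each entry is either 0.0 or 2.0
print(dropout(acts, 0.0))   # identical to the input
```

A higher rate discards more of each layer's output during training, which regularizes more strongly; a rate of `0.` switches the effect off entirely.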

In [None]:
conv_model = Sequential(conv_layers)
fc_model = Sequential(fc_layers)

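def get_fc_model():
    # NB: `opt` (the optimizer) and `proc_wgts` (the weight-copying helper) are
    # not defined in this section; both must exist before this is called.
    model = Sequential([
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(4096, activation='relu'),
        Dropout(0.),
        # 10 outputs to match the 10 driver classes (and vgg_ft(10) above)
        Dense(10, activation='softmax')
        ])

    # Copy the fine-tuned weights across, layer by layer
    for l1,l2 in zip(model.layers, fc_layers): l1.set_weights(proc_wgts(l2))

    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model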