# Dog Breed Identification: DLNN Final Project
Ian Battin & Liam Schmid




In this project, we are tackling the Kaggle Competition: [Dog Breed Identification](https://www.kaggle.com/c/dog-breed-identification/).

In this competition, you are given a dataset containing images of 120 different breeds of dogs. It is our job to classify images in the test set into these 120 different breeds.

Since this is an image-based challenge, CNNs are the obvious choice. We will discuss the different strategies and optimizations we tried, and provide all the code used at each step.

## Preparing the data
The data is provided in two folders - train and test - where each file is an image. There is a labels.csv file which maps an image filename to a label (breed).
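
As a rough illustration of that mapping, here is a minimal sketch that reads an `id,breed` file with the standard `csv` module. The helper name `read_labels` and the two sample rows are made up for the example; only the `id` and `breed` column names come from the real file.

```python
import csv
import io


def read_labels(csv_file):
    """Return a dict mapping image id -> breed from an open labels file."""
    reader = csv.DictReader(csv_file)  # uses the "id,breed" header row
    return {row["id"]: row["breed"] for row in reader}


# Hypothetical two-row sample standing in for labels.csv
sample = io.StringIO("id,breed\nabc123,beagle\ndef456,pug\n")
id_to_breed = read_labels(sample)
print(id_to_breed)
```

With the real data, the same function could be called on `open('data/labels.csv', newline='')`.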

To prepare our data for use, we needed to move all of the images in the training set into separate folders grouped by breed:

In [None]:
import sys
import os
import shutil
import numpy as np
from pathlib import Path


if len(sys.argv) != 2:
    print("Usage: python PrepareData.py <labels_file>")

    
def generateIDMapping(labelsFile):
    idMap = {}

    with open(labelsFile) as f:
        for line in f:
            line = line.rstrip()
            # Skip blank lines and the CSV header row
            if line != "" and line != "id,breed":
                id, breed = line.split(',')[:2]
                idMap[id] = breed
    return idMap


def makeBreedDirectories(idMap, source, dest):
    if os.path.exists(dest):
        shutil.rmtree(dest)

    for filename in os.listdir(source):
        id = filename.split('.')[0]
        breed = idMap[id]

        srcPath = source+"/"+filename
        destPath = dest+"/"+breed+"/"+filename

        path = Path(dest+"/"+breed)
        path.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(srcPath, destPath)


def makeValidationDirectories():
    for dirname in os.listdir('train'):
        samples = np.asarray(os.listdir('train/'+dirname))
        valSamples = np.random.choice(samples, len(samples)//4, replace=False)

        if os.path.exists('validation/'+dirname):
            shutil.rmtree('validation/'+dirname)
        for sample in valSamples:
            path = Path('validation/'+dirname)
            path.mkdir(parents=True, exist_ok=True)

            shutil.move('train/'+dirname+'/'+sample,
                        'validation/'+dirname+'/'+sample)


def makeTestDirectory(source, dest):
    if os.path.exists(dest):
        shutil.rmtree(dest)

    dest = dest + "/images"
    path = Path(dest)
    path.mkdir(parents=True, exist_ok=True)

    for filename in os.listdir(source):
        srcPath = source+"/"+filename
        destPath = dest+"/"+filename
        shutil.copyfile(srcPath, destPath)


idMap = generateIDMapping('data/labels.csv')
makeBreedDirectories(idMap, 'data/train', 'train')
makeValidationDirectories()
makeTestDirectory('data/test', 'test')
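
The validation step above holds out a quarter of each breed's images, so every breed appears in both splits. The sketch below shows the same per-class idea on in-memory lists instead of folders. The function name, the `seed` parameter, and the file names are illustrative assumptions, not part of the script above.

```python
import random


def split_validation(files_by_breed, fraction=0.25, seed=0):
    """Hold out `fraction` of each breed's files, returning (train, val) dicts."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    train, val = {}, {}
    for breed, files in files_by_breed.items():
        n_val = int(len(files) * fraction)
        held_out = set(rng.sample(files, n_val))  # sampled without replacement
        val[breed] = sorted(held_out)
        train[breed] = sorted(f for f in files if f not in held_out)
    return train, val


# Hypothetical file lists for two breeds
files_by_breed = {
    "beagle": ["b{}.jpg".format(i) for i in range(8)],
    "pug": ["p{}.jpg".format(i) for i in range(5)],
}
train_split, val_split = split_validation(files_by_breed)
print({breed: len(files) for breed, files in val_split.items()})
```

Splitting per breed rather than over the whole pool keeps rare breeds from ending up entirely in one split.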

## Creating image generators
Now that our training data is set up, we can begin preparing for training our models.

First, we'll need to create image generators for our data:

In [None]:
from keras.preprocessing.image import ImageDataGenerator

def get_generators(train_dir_name, val_dir_name, test_dir_name):
    # All images will be rescaled by 1./255
    train_data_gen = ImageDataGenerator(rescale=1./255)
    val_data_gen = ImageDataGenerator(rescale=1./255)
    test_data_gen = ImageDataGenerator(rescale=1./255)

    train_generator = train_data_gen.flow_from_directory(
        train_dir_name,
        color_mode="rgb",
        target_size=(299, 299),
        batch_size=32,
        class_mode="categorical")
    val_generator = val_data_gen.flow_from_directory(
        val_dir_name,
        color_mode="rgb",
        target_size=(299, 299),
        batch_size=32,
        class_mode="categorical")
    test_generator = test_data_gen.flow_from_directory(
        test_dir_name,
        color_mode="rgb",
        target_size=(299, 299),
        batch_size=1,
        class_mode=None,
        shuffle=False)

    for data_batch, labels_batch in train_generator:
        print('data batch shape:', data_batch.shape)
        print('labels batch shape:', labels_batch.shape)
        break

    return train_generator, val_generator, test_generator

train_dir_name = 'train/'
val_dir_name = 'validation/'
test_dir_name = 'test/'

train_generator, val_generator, test_generator = get_generators(
        train_dir_name, val_dir_name, test_dir_name)
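
With `class_mode="categorical"`, the generators infer one class per subdirectory, and Keras documents that the class indices follow the alphanumeric order of the folder names. The sketch below imitates that rule with the standard library only. It does not call Keras, and `infer_class_indices` is a hypothetical helper written for illustration.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each subdirectory name to an index, in sorted (alphanumeric) order."""
    classes = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name)))
    return {name: index for index, name in enumerate(classes)}


# Build a throwaway folder layout with three made-up breed directories
with tempfile.TemporaryDirectory() as root:
    for breed in ["pug", "beagle", "collie"]:
        os.makedirs(os.path.join(root, breed))
    class_indices = infer_class_indices(root)

print(class_indices)
```

The real mapping is available as `train_generator.class_indices`, and it determines which output neuron corresponds to which breed.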

## Building our Model
We tried several different models. We'll go through them in order, describing each technique and the incremental improvement it brought.

### Unmodified InceptionV3
Our first test was to simply use the pretrained InceptionV3 model. This model was trained and tested on the ImageNet dataset, of which our dataset is a subset. This means it should already be able to classify our 120 dog breeds. We simply added a Dense layer of 120 neurons on top so that the output dimension matches the 120 dog breeds.

In [None]:
import numpy as np
from keras import layers
from keras import models
from keras import optimizers
from keras.applications import InceptionV3

def build_inception_model():
    convBase = InceptionV3(
        weights='imagenet', 
        include_top=True, 
        input_shape=(299, 299, 3))
    convBase.trainable = False

    model = models.Sequential()
    model.add(convBase)
    model.add(layers.Dense(120, activation='softmax'))

    model.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

    return model

model = build_inception_model()

Now let's train it and save it:

In [None]:
def train_model(model, train_gen, val_gen, epochs=30, verbose=False):
    history = model.fit_generator(
        train_gen,
        steps_per_epoch=100,
        epochs=epochs,
        validation_data=val_gen,
        validation_steps=50,
        verbose=verbose)

    return history

train_model(model, train_generator, val_generator, 
    epochs=50, verbose=True)
model.save("./Inception.h5")
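
With `steps_per_epoch=100` and a batch size of 32, each epoch draws 3,200 images rather than one full pass over the training folder. The arithmetic is sketched below. The helper names and the training-set size of 7,000 are hypothetical and chosen only for illustration.

```python
import math


def samples_seen(steps, batch_size):
    """Number of images drawn in one epoch of `steps` batches."""
    return steps * batch_size


def steps_for_full_pass(num_samples, batch_size):
    """Batches needed to visit every image once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


per_epoch = samples_seen(100, 32)
full_pass_steps = steps_for_full_pass(7000, 32)  # 7000 is a made-up size
print(per_epoch, full_pass_steps)
```

In the notebook, the real sample count should be available as `train_generator.n`, the same attribute used for the test generator later on.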

With this data, we can already see that it's doing decently, but not great. It's only reaching accuracies in the mid 70s and validation accuracies in the mid 80s. Let's go ahead and generate predictions for our test set and submit them to Kaggle to see our score:

In [None]:
import re

def classify_images(model, label_map, test_generator, verbose=False):
    steps = test_generator.n

    predictions = model.predict_generator(
        test_generator,
        steps=steps,
        verbose=verbose)

    labels = sorted(list(label_map.keys()))
    
    with open("predictions.csv", 'w') as pred_file:
        pred_file.write('id,{}\n'.format(",".join(labels)))
        for index, prediction in enumerate(predictions):
            id = re.split("[./]", test_generator.filenames[index])[-2]
            prediction_list = prediction.tolist()
            confidence_vals = ",".join(map(str, prediction_list))
            pred_file.write("{},{}\n".format(id, confidence_vals))

            if verbose:
                max_val = labels[prediction_list.index(max(prediction_list))]
                print("Image '{}' classified as a {}".format(id, max_val))
                
classify_images(
    model, 
    train_generator.class_indices, 
    test_generator, 
    verbose=True)

Submitting the generated 'predictions.csv' file on Kaggle gives us a log loss score of 2.86574, which isn't great.
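
To put that number in context, here is a minimal sketch of multi-class log loss: the average negative log of the probability assigned to the correct class. The function name and the clipping constant `eps` are assumptions for illustration, and Kaggle's exact preprocessing of submissions may differ.

```python
import math


def multiclass_log_loss(true_indices, probabilities, eps=1e-15):
    """Mean of -log(p_true) over all samples, with probabilities clipped."""
    total = 0.0
    for true_index, row in zip(true_indices, probabilities):
        p = min(max(row[true_index], eps), 1 - eps)  # avoid log(0)
        total += -math.log(p)
    return total / len(true_indices)


# A model that spreads its guess evenly over 120 breeds
uniform_row = [1.0 / 120] * 120
uniform_loss = multiclass_log_loss([0], [uniform_row])

# A confident, correct prediction on a toy 3-class problem
confident_loss = multiclass_log_loss([0], [[0.9, 0.05, 0.05]])
print(uniform_loss, confident_loss)
```

A uniform guess over 120 classes scores ln(120), roughly 4.79, so lower scores than that indicate the model is doing better than chance, and confident correct predictions push the score toward 0.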

### Custom Model using extracted features from InceptionV3
The default Inception model is trained to classify 1000 labels, of which our 120 are a subset. That's a lot of overkill, and we could benefit by simply extracting the features from it and adding our own custom dense layers on top. We also added some Dropout layers to prevent overfitting, and changed the optimizer, which seemed to improve learning.
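
To get a feel for the size of the custom head used in this section, the sketch below counts its trainable parameters by hand. It assumes the convolutional base emits an 8x8x2048 feature map for 299x299 inputs, which is what Keras reports for InceptionV3 without its top. Running `model.summary()` is the way to confirm the real figures.

```python
def dense_params(inputs, units):
    """Weights plus biases for a fully connected layer."""
    return inputs * units + units


# Assumed output shape of InceptionV3 with include_top=False at 299x299 input
feature_map = (8, 8, 2048)
flattened = feature_map[0] * feature_map[1] * feature_map[2]

hidden_params = dense_params(flattened, 768)  # Dense(768) after Flatten
output_params = dense_params(768, 120)        # Dense(120) softmax layer
head_params = hidden_params + output_params   # Dropout adds no parameters
print(flattened, hidden_params, output_params, head_params)
```

Almost all of the trainable weights sit in the first Dense layer, which is one reason Dropout is applied on either side of it.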

In [None]:
def build_custom_inception_model():
    convBase = InceptionV3(
        weights='imagenet', 
        include_top=False, 
        input_shape=(299, 299, 3))
    convBase.trainable = False

    model = models.Sequential()
    model.add(convBase)
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(768, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(120, activation='softmax'))

    optimizer = optimizers.SGD(lr=0.001, momentum=0.09)
    model.compile(optimizer=optimizer,
                loss='categorical_crossentropy',
                metrics=['accuracy'])

    return model

model = build_custom_inception_model()
train_model(model, train_generator, val_generator, 
    epochs=50, verbose=True)
model.save("./Inception.h5")

classify_images(
    model, 
    train_generator.class_indices, 
    test_generator, 
    verbose=True)

We can see our accuracy has jumped to around 97-98% by 50 epochs, with a validation accuracy around the same. This is a huge improvement. Submitting this to Kaggle gets us a score of 0.46835.