## Comp 9417 Project: Dog Breed Classification

This project is based on the Kaggle competition for dog breed classification:
https://www.kaggle.com/c/dog-breed-identification

# Looking At Our Data

Before doing anything with our data, or even thinking about creating our model, we should first look at what the data actually contains.

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

from tensorflow.keras.models import Model
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.preprocessing import image
# Use the preprocessing function that ships with VGG16, the model used below
from tensorflow.keras.applications.vgg16 import preprocess_input

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

The function below, which plots a confusion matrix, was taken from: https://www.kaggle.com/paultimothymooney/identify-dog-breed-from-image

In [None]:
# Plot confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure(figsize = (60,60))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
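
To illustrate what `normalize=True` does in the function above, here is a minimal sketch of row normalisation on a made-up 3x3 matrix. The helper name `normalize_rows` and the toy values are assumptions for illustration only. Unlike the one-line division above, this version leaves rows with no true samples at zero rather than producing NaN.

```python
import numpy as np

def normalize_rows(cm):
    """Divide each row of a confusion matrix by its row total."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1, keepdims=True)
    # Rows whose total is zero are left as zeros instead of dividing by zero.
    return np.divide(cm, row_totals, out=np.zeros_like(cm), where=row_totals != 0)

# Made-up counts: rows are true classes, columns are predicted classes.
toy_cm = np.array([[8, 2, 0],
                   [1, 3, 0],
                   [0, 0, 0]])
normalized = normalize_rows(toy_cm)
print(normalized)
```

After normalisation, each non-empty row sums to one, so the diagonal can be read as the fraction of each true class that was predicted correctly.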

In [None]:
#Load the data
dataPath = 'input/dog-breed-identification'
dataFrame = pd.read_csv(os.path.join(dataPath,'labels.csv'))

First, we can look at how much data we have and how it is distributed across the breeds.

In [None]:
# Count the images per breed (sorted from most to least common)
distribution = dataFrame["breed"].value_counts()

# Create a horizontal bar plot: counts on the x-axis, breed names on the y-axis
plt.figure(figsize = (50,80))
sns.set(style="darkgrid")
sns.set(font_scale = 4)
ax = sns.barplot(x=distribution.values, y=distribution.index)
plt.show()

Now that we have a graphical representation, we also want to explicitly extract some figures from it, so that we have a clear picture of our dataset.

In [None]:
def class_percentages(labels):
    class_map={}
    for i in labels:
        if str(i) not in class_map:
            class_map[str(i)]=1
        else:
            class_map[str(i)]+=1
    return class_map

p=class_percentages(dataFrame["breed"])

print("Class with maximum images is the " + str(max(p, key=p.get)) + "  " + str(p[max(p, key=p.get)]))
print("Class with minimum images is the " + str(min(p, key=p.get)) + "  " + str(p[min(p, key=p.get)]))
print("Total size of our dataset is " + str(len(dataFrame["breed"])))
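
Note that `class_percentages` above returns raw counts per class. If actual percentages are wanted, one possible approach is sketched below using `collections.Counter`; the helper name `class_shares` and the toy labels are made up for illustration.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the dataset as a percentage."""
    counts = Counter(str(label) for label in labels)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

# Made-up labels standing in for dataFrame["breed"]
toy_labels = ["beagle", "pug", "beagle", "boxer"]
shares = class_shares(toy_labels)
print(shares)
```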

# Extraction Of Features

Training a neural network from scratch for computer vision problems is extremely demanding on hardware, so we decided to use a pre-trained model to extract generalised features from our images.

### Pre-trained model
For our pre-trained model, we decided to go with the VGG16 model from Keras, trained on the ImageNet dataset.

In [None]:
base_model = VGG16(weights="imagenet")

In [None]:
model = Model(inputs=base_model.input, outputs=base_model.output)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

### Creating Our Generator

Loading every image at once would put immense strain on the computer's memory, so we will instead use a generator that yields each image only when it is required.

In [None]:
dataFrame['breed'] = LabelEncoder().fit_transform(dataFrame['breed'])
y = dataFrame['breed'] 
onehot = OneHotEncoder()
y = onehot.fit_transform(np.expand_dims(y, axis=1)).toarray()

#Generator
def generator(dataFrame):
    pathTrain = 'input/dog-breed-identification/train'
    while True:
        for i in range(int(dataFrame.shape[0])):
            imgPath = os.path.join(pathTrain, dataFrame.iloc[i]['id']+ '.jpg')
    
            img = image.load_img(imgPath, target_size=(224, 224))
            x = image.img_to_array(img)
            x = np.expand_dims(x, axis=0)
            x = preprocess_input(x)
            yield (x,y[i])
                    
gen = generator(dataFrame)
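
The generator above yields a single image per step. As a possible variation (not what this notebook uses), the same lazy-loading idea can be sketched with small batches. Here `batch_generator` and `fake_load` are hypothetical names, and `fake_load` returns a tiny dummy array in place of the real `load_img` / `img_to_array` / `preprocess_input` pipeline.

```python
import numpy as np

def batch_generator(ids, load_fn, batch_size=32):
    """Yield stacked arrays of at most batch_size items, in one pass over ids."""
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        # Only the images of the current chunk are held in memory.
        yield np.stack([load_fn(image_id) for image_id in chunk])

def fake_load(image_id):
    # Hypothetical stand-in for loading and preprocessing one image.
    return np.zeros((4, 4, 3), dtype=np.float32)

batches = list(batch_generator(list(range(70)), fake_load, batch_size=32))
batch_shapes = [b.shape for b in batches]
print(batch_shapes)
```

With a batched generator like this, the number of prediction steps would be the number of batches (the dataset size divided by the batch size, rounded up) rather than the number of images.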

### Extracting Features

We will now run all of our images through the model to extract their key features. The classifier we train afterwards works on these extracted features rather than on raw pixels, which we expect to make its predictions considerably more accurate.

In [None]:
X_pred = model.predict(gen,steps=10221, verbose=1)

## Making our Predictions

Now that we have our extracted features we can start making our predictions using RandomForestClassifier from sklearn.

First we will create a basic train_test_split of our data. Once the data is separated, we will fit our RandomForestClassifier using our new X_train and y_train.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pred, dataFrame.iloc[:10221]['breed'])
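
The call above relies on the defaults of `train_test_split`, so the split changes on every run. As a hedged sketch on made-up toy data, the `random_state` and `stratify` parameters can be used to make the split reproducible and to keep the class proportions equal in both parts. The toy arrays and the 20% test size are assumptions for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 30 samples with 2 features each, 3 classes of 10 samples.
X_toy = np.arange(60).reshape(30, 2)
y_toy = np.repeat([0, 1, 2], 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # hold out 20% of the samples
    stratify=y_toy,     # keep class proportions the same in both parts
    random_state=0,     # fixed seed so the split is reproducible
)
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```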

In [None]:
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, y_train)

Now that we have fit our RandomForestClassifier we can use it to make predictions on our testing data.

In [None]:
y_pred = clf.predict(X_test)

Finally we can get the overall accuracy score of our model.

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Dog identification accuracy: ", accuracy)
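
To put an accuracy figure in context, it can help to compare it against a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline on made-up integer labels; `majority_baseline_accuracy` is a hypothetical helper and assumes non-negative integer labels like the ones produced by `LabelEncoder`.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    y_true = np.asarray(y_true)
    counts = np.bincount(y_true)  # occurrences of each integer label
    return counts.max() / len(y_true)

# Made-up labels: class 0 appears three times out of six.
toy_y = np.array([0, 0, 0, 1, 2, 2])
baseline = majority_baseline_accuracy(toy_y)
print(baseline)
```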

### Extra data

To get a more in-depth understanding of our model's performance, we will also print the classification report and create a confusion matrix, which show how the model performed on each individual dog breed.

In [None]:
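# LabelEncoder assigns integer codes in alphabetical order, while value_counts
# orders breeds by frequency, so sort the names to match the encoded labels
target_names = sorted(distribution.index)
labels = np.arange(len(target_names))
confusion_mtx = confusion_matrix(y_test, y_pred, labels=labels)
print('Classification Report')
print(classification_report(y_test, y_pred, labels=labels, target_names=target_names, zero_division=0))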

In [None]:
plot_confusion_matrix(confusion_mtx, target_names)
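
With well over a hundred breeds, the plotted matrix can be hard to read. One possible complement, sketched here on a made-up 3x3 matrix, is to compute the per-class recall from the diagonal and sort the classes from worst to best. `per_class_recall` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def per_class_recall(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    cm = np.asarray(cm, dtype=float)
    support = cm.sum(axis=1)  # number of true samples per class
    # Classes with no true samples get a recall of zero instead of NaN.
    return np.divide(np.diag(cm), support,
                     out=np.zeros_like(support), where=support != 0)

# Made-up counts: rows are true classes, columns are predicted classes.
toy_cm = np.array([[5, 0, 0],
                   [2, 2, 0],
                   [1, 0, 3]])
recall = per_class_recall(toy_cm)
worst_first = np.argsort(recall)  # class indices, lowest recall first
print(recall, worst_first)
```

Applied to the real matrix, the indices in `worst_first` could be mapped back to breed names to list the breeds the model confuses most often.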