# Dog Breed Classification

**Overview:**
- What I've learned:
- Loading, Testing and Cleaning Data
- Preparing Data
- Visualizing Data
- Splitting Train Data: ~ Optional
- Building Model
- Training Model
- Evaluating Model
- Submission

### What I've learned:
- Concepts:
- Functions:
    - [pandas.DataFrame.describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)
    - [value_counts()](https://stackoverflow.com/questions/22391433/count-the-frequency-that-a-value-occurs-in-a-dataframe-column)
    - [pandas.DataFrame.head](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html)
    

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
%matplotlib inline

**Steps:**

1. Download and move the dataset to `../dataset` folder.
2. Unpack the zipped files (optionally delete the zip files after unpacking)
3. Visualize/Inspect the dataset.
4. Follow chapter 5 of 'Deep Learning with Python' by François Chollet
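
Step 2 can be scripted. Below is a minimal sketch using only the standard library. The function name `unzip_all` and its `delete_after` flag are hypothetical; this is not the `my_func_utils.unzip_dataset` helper called further down, whose implementation is not shown in this notebook.

```python
import zipfile
from pathlib import Path


def unzip_all(folder, delete_after=False):
    """Extract every .zip file found directly inside `folder` (hypothetical helper)."""
    folder = Path(folder)
    extracted = []
    for zip_path in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)   # unpack next to the archive
        extracted.append(zip_path.name)
        if delete_after:
            zip_path.unlink()            # optionally remove the archive afterwards
    return extracted
```

Calling `unzip_all('../dataset', delete_after=True)` would then cover the optional clean-up mentioned in step 2.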

In [1]:
from helper_scripts import my_func_utils

Using TensorFlow backend.


In [2]:
my_func_utils.unzip_dataset(r'..\datasets\dog_breed_identification')  # raw string keeps the backslashes literal

F:\projects\kaggle\competitions
labels.csv.zip
sample_submission.csv.zip
test.zip
train.zip


### Loading, Testing and Cleaning Data:

In [None]:
# loading data
data_train = pd.read_csv("../datasets/digit-recognizer/train.csv")
data_test = pd.read_csv("../datasets/digit-recognizer/test.csv")

**TLDR**: The data is clean ;) 

In [None]:
#data_train.head()

In [None]:
#data_test.head()

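A quick way to check for null values: `isnull().any()` yields one boolean per column, and `describe()` summarizes those booleans. If `unique` is 1 and `top` is `False`, no column contains nulls. If `unique` is 2, some columns contain nulls; if `unique` is 1 and `top` is `True`, every column does.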

In [None]:
#data_train.isnull().any().describe()

In [None]:
#data_test.isnull().any().describe()
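
To make the null check concrete, here is a small sketch on two toy frames (`clean` and `dirty` are made-up stand-ins, not the competition data). It also shows `isnull().any().any()`, a more direct alternative that returns a single boolean.

```python
import numpy as np
import pandas as pd

clean = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
dirty = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, 6]})

# One boolean per column: does the column contain any null?
clean_summary = clean.isnull().any().describe()
dirty_summary = dirty.isnull().any().describe()

print(clean_summary["unique"], clean_summary["top"])  # one distinct value, and it is False
print(dirty_summary["unique"])                        # two distinct values: True and False

# A more direct check that does not rely on reading the describe() output
print(dirty.isnull().any().any())
```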

Checking the frequency of occurrence of each class.

In [None]:
#data_train['label'].value_counts()
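
As a quick illustration on toy labels (not the real `label` column), `value_counts()` returns absolute counts, and passing `normalize=True` returns relative frequencies, which makes class imbalance easier to spot.

```python
import pandas as pd

labels = pd.Series([0, 1, 1, 2, 2, 2], name="label")  # toy labels

counts = labels.value_counts()                # absolute frequency per class
shares = labels.value_counts(normalize=True)  # relative frequency per class

print(counts.to_dict())
print(shares.round(2).to_dict())
```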

### Preparing Data:

In [None]:
# randomizing the train data
data_train = data_train.sample(frac=1)

In [None]:
# params
img_rows, img_cols, img_chnls = 28, 28, 1
input_shape = (img_rows, img_cols, img_chnls)
num_classes = 10

In [None]:
# train data
y_train = to_categorical(data_train['label'],num_classes) # one-hot encoding
X_train = data_train.drop(labels = ['label'],axis = 1)

# test data
X_test = data_test
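
For intuition, the one-hot encoding that `to_categorical` produces can be reproduced with NumPy alone. This is a sketch on toy labels and is not how Keras implements it internally.

```python
import numpy as np

labels = np.array([3, 0, 9])   # toy digit labels
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)           # one row per label, one column per class
print(one_hot[0])              # 1.0 at position 3, 0.0 elsewhere
print(one_hot.argmax(axis=1))  # argmax recovers the original labels
```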

**Note:** A `pandas.DataFrame` has no `reshape` method; reshaping only works on a `numpy.ndarray`, so use either `X_train.values` or `np.array(X_train)` first.

In [None]:
type(X_train), type(X_train.values), type(np.array(X_train)), type(y_train)
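
The note above can be checked on a toy frame. The `pixels` table here is a made-up stand-in with the same 784-column layout as the flattened 28x28 images.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pixel table: 3 images, 784 pixel columns each
pixels = pd.DataFrame(np.zeros((3, 28 * 28), dtype="int64"))

print(hasattr(pixels, "reshape"))              # the DataFrame offers no reshape method
images = pixels.values.reshape(-1, 28, 28, 1)  # -1 lets NumPy infer the number of images
print(type(pixels.values).__name__, images.shape)
```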

In [None]:
# reshaping the input data into images
X_train = X_train.values.reshape(-1, *input_shape) # -1 means infer the dimension
X_test = X_test.values.reshape(-1, *input_shape) # -1 means infer the dimension

# labels (not needed)

# normalizing the pixel values
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
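
A small sketch of the scaling step on toy values: the array is converted to `float32` first because the in-place `/=` needs a floating-point array to store its result, and dividing by 255 maps the pixel range [0, 255] onto [0, 1].

```python
import numpy as np

raw = np.array([[0, 128, 255]], dtype="uint8")  # toy pixel intensities

scaled = raw.astype("float32")  # convert first so the in-place division can store floats
scaled /= 255                   # map the range [0, 255] onto [0, 1]

print(scaled.dtype, scaled.min(), scaled.max())
```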

### Visualizing Data:

In [None]:
fig = plt.figure(figsize=(12,8))
for i in range(num_classes):
    plt.subplot(3,5,i+1)
    index = np.random.choice(np.where(data_train.iloc[:, 0][:]==i)[0])
    img = X_train[index].reshape(28,28)
    plt.imshow(img, interpolation='none')
    plt.xticks([]), plt.yticks([])
    plt.title(i)
    plt.tight_layout()

### Splitting Train Data: ~ Optional
This step is optional but important. Perform it if you want to control the train/validation split, especially when the dataset is unbalanced (i.e. the classes are unevenly distributed). If you skip it, or if it's not applicable, set `validation_split` in the `fit` method to an appropriate fraction of the `train` data instead.

**Reason:** In some unbalanced datasets a simple random split can leave some classes under-represented in the validation set, which makes the evaluation inaccurate. To avoid that, pass the labels to the `stratify` option of the `train_test_split` function (e.g. `stratify = y_train`) so that both splits keep the class proportions (**only for sklearn versions >=0.17**).

In [None]:
# split the train and the validation set for the fitting
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state=2018, stratify = y_train)
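
To show what stratification achieves, here is a NumPy-only sketch of a stratified split. The `stratified_split` function is a hypothetical illustration of the idea, not sklearn's implementation: it shuffles the indices of each class separately and sends the same fraction of every class to the validation set.

```python
import numpy as np


def stratified_split(labels, test_size=0.2, seed=2018):
    """Return (train_idx, val_idx) with every class split at the same ratio."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])  # shuffle within the class
        n_val = int(round(len(idx) * test_size))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)


# Unbalanced toy labels: 90 samples of class 0, 10 samples of class 1
labels = np.array([0] * 90 + [1] * 10)
train_idx, val_idx = stratified_split(labels)

print(np.bincount(labels[train_idx]))  # class counts in the training part
print(np.bincount(labels[val_idx]))    # class counts in the validation part
```

Both parts keep the 9:1 class ratio of the toy labels, which a plain random split does not guarantee.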

In [None]:
import keras
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Dropout, Flatten, Input, Conv2D, MaxPool2D
from keras.layers.normalization import BatchNormalization
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator

### Building Model:
**Custom model:** Load a pre-trained model or create your own here!

In [None]:
from helper_scripts import my_func_utils  # same helper module as imported at the top of the notebook

In [None]:
def custom_model(input_shape=None):

    if input("If you want to load a model, enter 'yes'.\n") == 'yes': 
        return my_func_utils.load_model()
    
    assert input_shape is not None, "input_shape is required to build a new model"
    
    model = Sequential()

    model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same', activation ='relu', input_shape = input_shape))
    model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same', activation ='relu'))
    model.add(MaxPool2D(pool_size=(2,2)))
    model.add(Dropout(0.25))


    model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu')) 
    model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu'))
    model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
    model.add(Dropout(0.25))


    model.add(Flatten())
    model.add(Dense(256, activation = "relu"))
    model.add(Dropout(0.5))
    
    model.add(Dense(10, activation = "softmax"))

    
    optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-8, decay=0.0)
    
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model
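
As a sanity check on the architecture, the feature-map sizes and the number of trainable parameters can be worked out by hand. The sketch below assumes a 28x28x1 input and the Keras default of one bias per filter or unit; the helper names are made up for this example. The total should agree with what `model.summary()` reports.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (inputs + 1) * units


side = 28                  # 'Same' padding keeps 28x28 through the conv layers
params = conv_params(5, 1, 32) + conv_params(5, 32, 32)
side //= 2                 # first 2x2 max pooling: 28 -> 14
params += conv_params(3, 32, 64) + conv_params(3, 64, 64)
side //= 2                 # second 2x2 max pooling: 14 -> 7
flat = side * side * 64    # Flatten: 7 * 7 * 64 features
params += dense_params(flat, 256) + dense_params(256, 10)

print(flat, params)        # 3136 887530
```

Most of the parameters sit in the first `Dense` layer, which is one reason the dropout rate before the output layer is the highest.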

### Training:
**Set Training Parameters:**

In [None]:
epochs = 2 # Turn epochs to 30 to get 0.9967 accuracy
batch_size = 32

# callbacks
anneal_lr = ReduceLROnPlateau(monitor='val_acc', patience=3, verbose=1, factor=0.5, min_lr=1e-5)
early_stopping = EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=5, verbose=0, mode='auto')
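
To illustrate what the `anneal_lr` callback does, here is a simplified pure-Python simulation of the reduce-on-plateau rule. It is a sketch, not Keras's implementation (it ignores the cooldown and minimum-delta options), and `simulate_reduce_lr` is a hypothetical name: the learning rate is multiplied by `factor` once the monitored metric has failed to improve for `patience` epochs in a row, and it never drops below `min_lr`.

```python
def simulate_reduce_lr(metric_history, lr=1e-3, factor=0.5, patience=3, min_lr=1e-5):
    """Return the learning rate used in each epoch (simplified plateau rule)."""
    best = float("-inf")
    wait = 0
    schedule = []
    for value in metric_history:
        schedule.append(lr)          # learning rate in effect during this epoch
        if value > best:             # the monitored accuracy improved
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:     # plateau detected: shrink the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule


val_acc = [0.90, 0.95, 0.95, 0.94, 0.95, 0.96, 0.96]  # made-up validation accuracies
schedule = simulate_reduce_lr(val_acc)
print(schedule)
```

With these made-up accuracies, the rate stays at 0.001 for the first five epochs and is halved for the last two.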

**Without data augmentation:**

In [None]:
model = custom_model(input_shape)
#model.summary()

In [None]:
history_no_aug = model.fit(X_train, y_train, 
                    batch_size = batch_size, 
                    epochs = epochs, 
                    callbacks = [anneal_lr, early_stopping],
                    validation_data = (X_val, y_val),
                    #verbose  = 0,
                   )

In [None]:
score = model.evaluate(X_val, y_val)

In [None]:
print('Validation Loss: {:8.4f}, Validation Accuracy: {:4.2f}'.format(score[0],score[1]*100))

**With data augmentation:**

In [None]:
model = custom_model(input_shape)
#model.summary()

In [None]:
datagen = ImageDataGenerator(rotation_range=10,
                             zoom_range = 0.1, 
                             width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
                             height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
                            )
datagen.fit(X_train)

**Note**: The above data augmentations were chosen because they closely resemble the variation that can occur in real-world data.

Augmentations such as `vertical_flip` and `horizontal_flip` have been skipped, as they don't lend themselves well to the current dataset, i.e. using `vertical_flip` and `horizontal_flip` together could lead to misclassification of 6 and 9, 2 and 5, etc.
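
The reason both flips together are risky can be seen with NumPy on a toy array (not a real digit): a horizontal flip followed by a vertical flip equals a 180-degree rotation, which is the transformation that makes a 6 look like a 9.

```python
import numpy as np

img = np.arange(12).reshape(3, 4)       # toy "image" with distinct pixel values

both_flips = np.flipud(np.fliplr(img))  # horizontal flip followed by vertical flip
rotated = np.rot90(img, 2)              # rotation by 180 degrees

print(np.array_equal(both_flips, rotated))
```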

In [None]:
history_aug = model.fit_generator(datagen.flow(X_train,y_train, 
                                               batch_size=batch_size),
                                               epochs = epochs, 
                                               validation_data = (X_val,y_val),
                                               verbose = 2, 
                                               steps_per_epoch=X_train.shape[0] // batch_size,
                                               callbacks=[anneal_lr, early_stopping],
                                               #verbose  = 0,
                                             )

In [None]:
score = model.evaluate(X_val, y_val)
print('Validation Loss: {:.4f}, Validation Accuracy: {:.2f}'.format(score[0],score[1]*100))

In [None]:
if input("If you want to save the model, enter 'yes'.\n") == 'yes': 
    my_func_utils.save_model()  # no `return` here: this code runs outside a function

### Evaluating Model:
**Train vs Validation Plot - Accuracy and Loss**:

In [None]:
# Plot the loss and accuracy curves for training and validation 
fig, ax = plt.subplots(2,1)
ax[0].plot(history_no_aug.history['loss'], color='b', label="Training loss")
ax[0].plot(history_no_aug.history['val_loss'], color='r', label="Validation loss")
legend = ax[0].legend(loc='best', shadow=True)

ax[1].plot(history_no_aug.history['acc'], color='b', label="Training accuracy")
ax[1].plot(history_no_aug.history['val_acc'], color='r',label="Validation accuracy")
legend = ax[1].legend(loc='best', shadow=True)

**Confusion Matrix:** Custom Confusion Matrix Visualization

In [None]:
# Look at confusion matrix 
import itertools  # needed for itertools.product below

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Here we'll plot the confusion matrix for our predictions on the validation dataset.

In [None]:
from sklearn.metrics import confusion_matrix

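# plot the confusion matrix
y_pred = model.predict(X_val)  # class probabilities, shape (n_samples, 10)

# confusion_matrix expects class indices, so convert probabilities and one-hot labels with argmax
y_pred_classes = np.argmax(y_pred, axis=1)
y_true_classes = np.argmax(y_val, axis=1)
plot_confusion_matrix(confusion_matrix(y_true_classes, y_pred_classes), classes = range(10)) 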
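
For reference, this is what a confusion matrix and its row normalization contain, sketched with NumPy on made-up labels. It follows the same convention as sklearn's `confusion_matrix` (rows are true classes, columns are predicted classes) and the same normalization as the `normalize=True` branch of `plot_confusion_matrix`.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 2, 2])  # toy true labels
y_pred = np.array([0, 1, 1, 1, 2, 0])  # toy predicted labels
n_classes = 3

cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)     # rows: true class, columns: predicted class

# Row normalization: the share of each true class assigned to each prediction
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

print(cm)
print(cm_normalized)
```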

**Investigating for Errors:**

I want to see the most important errors. For that purpose I need the difference between the probability the model assigned to the predicted label and the probability it assigned to the true label.

In [None]:
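# Display some error results 

# Convert probabilities and one-hot labels to class indices
Y_pred = y_pred                             # predicted probabilities from the cell above
Y_pred_classes = np.argmax(Y_pred, axis=1)
Y_true = np.argmax(y_val, axis=1)

# Errors are the samples where the predicted and true labels differ
errors = (Y_pred_classes - Y_true != 0)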

Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]

def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1

# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors 
most_important_errors = sorted_delta_errors[-6:]

# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)
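
The `np.take` plus `np.diagonal` step above picks, for each misclassified sample, the probability assigned to its true class. A sketch on toy values shows that plain integer-array indexing gives the same result without building the intermediate n x n matrix.

```python
import numpy as np

probs = np.array([[0.1, 0.7, 0.2],  # toy predicted probabilities, one row per sample
                  [0.6, 0.3, 0.1]])
true_labels = np.array([2, 1])      # toy true classes

via_take = np.diagonal(np.take(probs, true_labels, axis=1))  # builds an n x n matrix first
via_index = probs[np.arange(len(true_labels)), true_labels]  # picks one entry per row directly

print(via_take, via_index)
```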

The most important errors are also the most intriguing. 

For these six cases, the model's mistakes are not unreasonable. Some of these errors could also be made by humans, especially the 9 that looks very close to a 4. The last 9 is also very misleading; to me it looks like a 0.

### Submission:

In [None]:
# predict results
results = model.predict(X_test)

# select the index with the maximum probability
results = np.argmax(results,axis = 1)

results = pd.Series(results,name="Label")

In [None]:
submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)

submission.to_csv("cnn_mnist_datagen.csv",index=False)
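
As an alternative to hard-coding `range(1,28001)`, the ids can be derived from the number of predictions. The sketch below uses made-up predictions and keeps the same two column names as above.

```python
import numpy as np
import pandas as pd

predicted = np.array([2, 0, 9])  # toy class predictions, not real results

submission = pd.DataFrame({
    "ImageId": np.arange(1, len(predicted) + 1),  # ids start at 1
    "Label": predicted,
})

print(submission.to_csv(index=False))  # with no path given, to_csv returns the CSV text
```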