# Final Model

Simon Schellaert

This notebook implements the final model with tuned hyperparameters. The tuning of these hyperparameters was performed in `Experiments.ipynb`.

## 0. Required dependencies

We start by including some packages that will be used in the remainder of the notebook. This prevents us from cluttering the other cells with imports.

In [None]:
# standard packages used to handle files
import sys
import os 
import glob
import time

# commonly used library for data manipulation
import pandas as pd

# numerical
import numpy as np

# handle images - opencv
import cv2

# machine learning library
import sklearn
import sklearn.preprocessing

# used to serialize python objects to disk and load them back to memory
import pickle

# plotting
import matplotlib.pyplot as plt

# helper functions kindly provided for you by Matthias 
import helpers
# specific helper functions for feature extraction
import features

# tell matplotlib that we plot in a notebook
%matplotlib notebook

# filepath constants
DATA_BASE_PATH = './'
OUTPUT_PATH='./'

DATA_TRAIN_PATH = os.path.join(DATA_BASE_PATH,'train')
DATA_TEST_PATH = os.path.join(DATA_BASE_PATH,'test')

FEATURE_BASE_PATH = os.path.join(OUTPUT_PATH,'features')
FEATURE_TRAIN_PATH = os.path.join(FEATURE_BASE_PATH,'train')
FEATURE_TEST_PATH = os.path.join(FEATURE_BASE_PATH,'test')

PREDICTION_PATH = os.path.join(OUTPUT_PATH,'predictions')

# filepatterns to write out features
FILEPATTERN_DESCRIPTOR_TRAIN = os.path.join(FEATURE_TRAIN_PATH,'train_features_{}.pkl')
FILEPATTERN_DESCRIPTOR_TEST = os.path.join(FEATURE_TEST_PATH,'test_features_{}.pkl')

## 1. Augmenting the data set 

The model employs data augmentation in the form of horizontally flipped versions of the images. To reduce computation time during training, we flip all training images beforehand. The flipped version of `bobcat_0001.jpg` is saved as `bobcat_0001_flip.jpg`. To create these extra images, we use the ImageMagick `convert` utility: running the command below in each class folder generates a flipped version of every image.

```sh
for f in *.jpg; do convert "$f" -flop "$(basename "$f" .jpg)_flip.jpg"; done
```
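
For readers who prefer to stay in Python, the naming scheme and the flip itself can be sketched as follows. This is only an illustration, not the method used to create the data set: `flipped_name` and `flip_horizontally` are hypothetical helper names, and the flip is applied to a small in-memory array instead of files on disk.

```python
import os

import numpy as np

def flipped_name(filename):
    """Return the name used for the flipped copy, e.g. a.jpg -> a_flip.jpg."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}_flip{ext}"

def flip_horizontally(image):
    """Mirror an image array of shape (height, width[, channels]) left to right."""
    return image[:, ::-1]

# Toy 2x3 single-channel "image" to show the effect of the flip.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])
flipped = flip_horizontally(image)
```

With OpenCV, `cv2.flip(img, 1)` should produce the same horizontal mirror for an image loaded with `cv2.imread`.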

## 2. Loading the train labels
First, we get the training labels. The training data is organized so that all images of a class are stored in their own folder, so we can obtain a string representation of the labels directly from the folder names.

In [2]:
folder_paths = glob.glob(os.path.join(DATA_TRAIN_PATH,'*'))
label_strings = np.sort(np.array([os.path.basename(path) for path in folder_paths]))
num_classes = label_strings.shape[0]
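
The folder-to-label convention can be tried out on a throwaway directory. The sketch below assumes nothing about the real data set: the class folders are hypothetical (only `bobcat` appears in this notebook's text), and plain `sorted` stands in for `np.sort`.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as train_dir:
    # Hypothetical class folders, created in arbitrary order.
    for class_name in ['chimp', 'bobcat', 'owl']:
        os.makedirs(os.path.join(train_dir, class_name))

    # Same idea as above: one label per folder, sorted alphabetically.
    folder_paths = glob.glob(os.path.join(train_dir, '*'))
    labels = sorted(os.path.basename(path) for path in folder_paths)
    num_labels = len(labels)
```

Sorting matters because the order of the label strings later determines the order of the probability columns in the submission.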

In [3]:
train_paths = dict((label_string, helpers.getImgPaths(os.path.join(DATA_TRAIN_PATH,label_string))) for label_string in label_strings)
test_paths = helpers.getImgPaths(DATA_TEST_PATH)

## 3. Loading the BoVW image features

To extract the features from the images (and their flipped versions), run the code in `ExtractFeatures.ipynb`. The extraction is analogous to the one provided in the example notebook. This time, however, we extract 3000 features from each image (see `features.py`). Once the extraction is done, we load the features here.

In [4]:
with open(FILEPATTERN_DESCRIPTOR_TRAIN.format('sift'), 'rb') as pkl_file_train:
    train_features_from_pkl_sift = pickle.load(pkl_file_train)

In [5]:
with open(FILEPATTERN_DESCRIPTOR_TRAIN.format('daisy'),'rb') as pkl_file_train:
    train_features_from_pkl_daisy = pickle.load(pkl_file_train)

Next, we create the codebooks for both SIFT and DAISY based on the extracted features. Note that the hyperparameters chosen here are already optimized. This optimization was done in `Experiments.ipynb`.

In [14]:
clustered_codebook_sift = helpers.createCodebook(train_features_from_pkl_sift, codebook_size = 2000)
clustered_codebook_daisy = helpers.createCodebook(train_features_from_pkl_daisy, codebook_size = 1000)

training took 1473.2987508773804 seconds
training took 215.16722512245178 seconds
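
The implementation of `helpers.createCodebook` is not shown in this notebook. As a rough intuition, a codebook is commonly built by clustering all descriptors and keeping the cluster centres as "visual words". The sketch below assumes plain k-means; `build_codebook` is a hypothetical name and the toy 2-D descriptors stand in for real SIFT or DAISY descriptors.

```python
import numpy as np

def build_codebook(descriptors, codebook_size, iterations=10, seed=0):
    """Cluster descriptors with plain k-means and return the cluster centres."""
    rng = np.random.default_rng(seed)
    # Initialise the centres with randomly chosen descriptors.
    idx = rng.choice(len(descriptors), size=codebook_size, replace=False)
    centres = descriptors[idx].astype(float)
    for _ in range(iterations):
        # Assign each descriptor to its nearest centre (squared Euclidean distance).
        distances = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assignment = distances.argmin(axis=1)
        # Move each centre to the mean of the descriptors assigned to it.
        for k in range(codebook_size):
            members = descriptors[assignment == k]
            if len(members) > 0:
                centres[k] = members.mean(axis=0)
    return centres

# Two well-separated toy blobs of 2-D "descriptors".
rng = np.random.default_rng(0)
descriptors = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                         rng.normal(5.0, 0.1, size=(50, 2))])
codebook = build_codebook(descriptors, codebook_size=2)
```

The pairwise distance matrix makes this sketch impractical for thousands of codewords and millions of descriptors, which is consistent with the long training times reported above.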


Next, we construct a feature vector for every image from both the SIFT and DAISY features. To avoid duplicating code, we define two helper functions that are used to preprocess both the training and the test data.

In [15]:
def create_histogram_features(paths, number_of_bins = 10):
    """ Returns a NumPy array containing the histogram feature given a list of image paths """
    features = []
    
    for path in paths:
        img = cv2.imread(path)
        hist = cv2.calcHist([img], [0, 1, 2], None, [number_of_bins, number_of_bins, number_of_bins], 3 * [0, 256]).flatten()
        features.append(hist / np.sum(hist))
        
    return np.array(features)

def convert_features_to_bow(features, codebook):
    """ Converts an array of features to a BoVW representation using the provided codebook """
    bow_vectors = []
    
    for feature in features:
        bow_vector = helpers.encodeImage(feature.data, codebook)
        bow_vectors.append(bow_vector)

    return bow_vectors    
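
Since `helpers.encodeImage` is not shown here, the following sketch illustrates what a BoVW encoding typically does: each descriptor votes for its nearest codeword, and the votes are collected in a histogram. The function name `encode_bovw` is hypothetical, and the real helper may differ, for instance by normalizing the counts.

```python
import numpy as np

def encode_bovw(descriptors, codebook):
    """Count how many descriptors fall closest to each visual word."""
    # Pairwise squared distances, shape (num_descriptors, num_words).
    distances = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest_word = distances.argmin(axis=1)
    # One bin per codeword, even for words that received no votes.
    return np.bincount(nearest_word, minlength=len(codebook))

# Toy codebook with three 2-D visual words and four descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, 9.0], [0.5, -0.5], [1.0, 9.0]])
bow_vector = encode_bovw(descriptors, codebook)  # array([2, 1, 1])
```

The resulting vector has a fixed length equal to the codebook size, regardless of how many descriptors an image yields.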

Using these helper functions, we construct the input data for the training set.

In [16]:
train_data_sift = convert_features_to_bow(train_features_from_pkl_sift, clustered_codebook_sift)
train_data_daisy = convert_features_to_bow(train_features_from_pkl_daisy, clustered_codebook_daisy)
train_data_hist = create_histogram_features([feature.path for feature in train_features_from_pkl_sift])

train_data = np.concatenate([train_data_sift, train_data_daisy, train_data_hist], axis=1)
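
To make the size of the final feature vector concrete, the sketch below rebuilds one row of `train_data` from stand-in values. It assumes that `helpers.encodeImage` returns one entry per codeword, and it uses `np.histogramdd` on a random image as an analogue of the `cv2.calcHist` call rather than OpenCV itself.

```python
import numpy as np

number_of_bins = 10
rng = np.random.default_rng(0)
# Random stand-in for a BGR image as returned by cv2.imread: (height, width, 3), uint8.
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Joint colour histogram over the three channels, then normalized to sum to 1.
hist, _ = np.histogramdd(img.reshape(-1, 3).astype(float),
                         bins=(number_of_bins,) * 3,
                         range=[(0, 256)] * 3)
hist = hist.flatten()
hist = hist / hist.sum()

# Placeholder BoVW vectors with the codebook sizes used in this notebook.
bow_sift = np.zeros(2000)
bow_daisy = np.zeros(1000)
feature_vector = np.concatenate([bow_sift, bow_daisy, hist])
```

Under these assumptions each image is described by 2000 + 1000 + 10³ = 4000 values.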

Next, we repeat this procedure for the test data set.

In [17]:
with open(FILEPATTERN_DESCRIPTOR_TEST.format('sift'),'rb') as pkl_file_test:
    test_features_from_pkl_sift = pickle.load(pkl_file_test)

with open(FILEPATTERN_DESCRIPTOR_TEST.format('daisy'),'rb') as pkl_file_test:
    test_features_from_pkl_daisy = pickle.load(pkl_file_test)

test_data_sift = convert_features_to_bow(test_features_from_pkl_sift, clustered_codebook_sift)
test_data_daisy = convert_features_to_bow(test_features_from_pkl_daisy, clustered_codebook_daisy)
test_data_hist = create_histogram_features([feature.path for feature in test_features_from_pkl_sift])

test_data = np.concatenate([test_data_sift, test_data_daisy, test_data_hist], axis=1)

Finally, we convert the string labels to numerical labels before feeding them to our model.

In [18]:
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.fit(label_strings)

train_labels_raw = [image.label for image in train_features_from_pkl_sift]
train_labels = label_encoder.transform(train_labels_raw)
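
`LabelEncoder` maps each distinct label to its index in the sorted list of classes. A minimal pure-Python equivalent, using hypothetical class names apart from `bobcat`, looks like this:

```python
# Hypothetical label set; the real one comes from the training folder names.
label_strings = ['owl', 'bobcat', 'chimp']

# fit: remember the sorted, distinct classes.
classes = sorted(set(label_strings))
class_to_index = {name: index for index, name in enumerate(classes)}

# transform: look up the index of every raw label.
train_labels_raw = ['chimp', 'bobcat', 'owl', 'bobcat']
train_labels = [class_to_index[label] for label in train_labels_raw]

# inverse_transform: map the indices back to strings.
decoded = [classes[index] for index in train_labels]
```

Because the classes are sorted, the numerical labels follow the same order as the sorted `label_strings` array.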

## 4. Training the model
We train a support vector machine (SVM) with a linear kernel and regularization parameter `C=0.9` on the concatenated feature vectors. Setting `probability=True` enables the probability estimates that `predict_proba` needs when we generate the predictions later on.

In [None]:
from sklearn.svm import SVC

classifier = SVC(random_state=0, probability=True, kernel='linear', C=0.9)
classifier.fit(train_data, train_labels)

## 5. Generating predictions for the test set

We now have a trained model that we can use to generate predictions. Generating a 2-dimensional array of probabilities (one row per test image, one column per class) is easy using the `predict_proba` method. Afterwards, we save the predictions in a CSV file.

In [20]:
predictions = classifier.predict_proba(test_data)

pred_file_path = os.path.join(PREDICTION_PATH, helpers.generateUniqueFilename('predictions','csv'))
helpers.writePredictionsToCsv(predictions, pred_file_path, label_strings)
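
Before submitting, it can help to sanity-check the probability matrix. The columns returned by `predict_proba` follow the classifier's sorted class order, which here lines up with the sorted `label_strings`. The sketch below uses made-up probabilities and hypothetical class names to show two simple checks.

```python
import numpy as np

# Hypothetical class names and probabilities for two test images.
label_strings = np.array(['bobcat', 'chimp', 'owl'])
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])

# Every row should be a probability distribution over the classes.
rows_sum_to_one = np.allclose(predictions.sum(axis=1), 1.0)

# The most likely class per image, mapped back to its label string.
top_labels = label_strings[predictions.argmax(axis=1)]
```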