# Animal classification AI

# Final model

## TOC
- Student info
- Important note
- Special credits
- Required imports and basic setup
- Step 1: Loading in the precomputed rootSIFT K-Means clusters
   - Step 1a: Loading the clusters for the SVC model
   - Step 1b: Loading the clusters for the LBM
- Step 2: Encoding the images
- Step 3: Creating a train and validation set
- Step 4: Making the final model class
- Step 5: Training the final model
- Step 6: Making CSV for Kaggle
- Model Analysis
   

## Student info
- **Name**: Bontinck Lennert
- **StudentID**: 568702
- **Affiliation**: VUB - Master Computer Science: AI

## Important note
This notebook is used for testing purposes and might not contain perfectly representative information.
Please take a look at svc_large.ipynb for an exploration of using more clusters, clustering.ipynb and more_helpers.py for an exploration of different clustering approaches, final_model.ipynb for the final combined model, and the report for more details.

## Special credits
Some of the code used in this notebook is adapted or copied from the notebooks supplied in the Kaggle competition. Special thanks go to Andries Rosseau for supplying us with this helpful code.

## Required imports and basic setup
All required imports for this file are handled once in the following code blocks. Installing the required libraries is discussed in the README of this GitHub repository. Some basic setup for the libraries used, as well as the tuned hyperparameters ("optima") for both models, is also taken care of here.

In [5]:
# Set to True for final submission
use_full_training_set = True

# General optima
optima_test_fraction = 0.15
optima_test_fraction_balanced = True
optima_descriptor = "sift"


# SVC optima
svc_optima_clusters = 1250
svc_optima_cache_size = 5000
svc_optima_probability = True
svc_optima_shrinking = True
svc_optima_C = 5
svc_optima_gamma = 'scale'
svc_optima_kernel = 'rbf'
svc_optima_tol = 0.001
svc_optima_class_weight = 'balanced'

# LBM optima
lbm_optima_clusters = 100
lbm_class_weight = 'balanced'
lbm_C = 3
lbm_max_iter = 500
lbm_fit_intercept = True

In [6]:
# standard packages used to handle files
import sys
import os 
import glob
import time

# commonly used library for data manipulation
import pandas as pd

# numerical
import numpy as np
from collections import Counter 

# handle images - opencv
import cv2

# model
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

# machine learning library
import sklearn
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline

# scoring
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from scipy import stats as sstats

# used to serialize python objects to disk and load them back to memory
import pickle

# plotting
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import learning_curve


# helper functions
import helpers
import more_helpers
import pretty_cm

# specific helper functions for feature extraction
import features

# tell matplotlib that we plot in a notebook and make images high(er) resolution
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'svg'}

# used for counting files
import fnmatch

In [7]:
# load and save vars
def save_var_to_file(filename, var):
    with open('savefiles/' + filename +'.pkl','wb') as f:
        pickle.dump(var, f)

def get_var_from_file(filename):
    with open('savefiles/' + filename +'.pkl','rb') as f:
        return pickle.load(f)
    
    
#-------------------------

#save to file example
#save_var_to_file("folder/name", var_to_save)
    
#open from file example
#loaded_var = get_var_from_file("folder/name")

In [8]:
# datasets  location
dataset_path = '../images/'
# output location:
output_path = './'

# other path settings
dataset_path_train = os.path.join(dataset_path, 'train')
dataset_path_test = os.path.join(dataset_path, 'test')

features_path = os.path.join(output_path, 'features')
features_path_train = os.path.join(features_path, 'train')
features_path_test = os.path.join(features_path, 'test')

prediction_path = os.path.join(output_path, 'predictions')

# filepatterns to write out features
filepattern_descriptor_train = os.path.join(features_path_train, 'train_features_{}.pkl')
filepattern_descriptor_test = os.path.join(features_path_test, 'test_features_{}.pkl')

# create paths in case they don't exist:
helpers.createPath(features_path)
helpers.createPath(features_path_train)
helpers.createPath(features_path_test)
helpers.createPath(prediction_path)

train_images_folder_paths = glob.glob(os.path.join(dataset_path_train,'*'))
label_strings = np.sort(np.array([os.path.basename(path) for path in train_images_folder_paths]))

## Step 1: Loading in the precomputed rootSIFT K-Means clusters
### Step 1a: Loading the clusters for the SVC model

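In the clustering.ipynb notebook, regular K-Means clustering was run with preprocessing (PowerTransformer) and 1250 clusters. The resulting codebook and its fitted transformer are loaded here and are used for the SVC models. The cell also loads the precomputed SIFT descriptors of the train and test images and converts them to rootSIFT.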

In [9]:
svc_clustered_codebook = get_var_from_file("kmeans_notebook_1250_preproc_rootsift")
svc_clustered_codebook_transformer = get_var_from_file("kmeans_notebook_1250_preproc_rootsift_scaler")

# ----------------- open pickle files ----------------- 
with open(filepattern_descriptor_train.format(optima_descriptor), 'rb') as pkl_file_train:
    train_features_from_pkl = pickle.load(pkl_file_train)
    
print('Number of encoded train images: {}'.format(len(train_features_from_pkl)))

with open(filepattern_descriptor_test.format(optima_descriptor), 'rb') as pkl_file_test:
    test_features_from_pkl = pickle.load(pkl_file_test)
        
print('Number of encoded test images: {}'.format(len(test_features_from_pkl)))

# ----------------- convert to rootSIFT ----------------- 
def rootsift(imgs, eps=1e-7):
    # apply the Hellinger kernel by first L1-normalizing and taking the
    # square-root
    for idx in range(len(imgs)):
        imgs[idx] = imgs[idx]._replace(data = imgs[idx].data / (imgs[idx].data.sum(axis=1, keepdims=True) + eps))
        imgs[idx] = imgs[idx]._replace(data = np.sqrt(imgs[idx].data))
    return imgs

# convert sift to rootsift
train_features_from_pkl = rootsift(train_features_from_pkl)
test_features_from_pkl = rootsift(test_features_from_pkl)

Number of encoded train images: 4042
Number of encoded test images: 4035
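The rootSIFT conversion used above can be illustrated on a plain descriptor matrix. This is a minimal sketch, assuming the descriptors are non-negative rows (as SIFT descriptors are); the function name `rootsift_descriptors` and the toy matrix are illustrative and not part of the notebook's helpers.

```python
import numpy as np

def rootsift_descriptors(descriptors, eps=1e-7):
    """L1-normalize each descriptor row, then take the element-wise square root."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    l1_norms = descriptors.sum(axis=1, keepdims=True) + eps
    return np.sqrt(descriptors / l1_norms)

# Toy stand-in for SIFT output: 3 descriptors of dimension 4 (real SIFT uses 128)
toy = np.array([[1.0, 3.0, 0.0, 0.0],
                [2.0, 2.0, 2.0, 2.0],
                [0.0, 0.0, 0.0, 5.0]])
rooted = rootsift_descriptors(toy)

# After the transform every row has (approximately) unit L2 norm
print(np.linalg.norm(rooted, axis=1))
```

Because each row ends up with unit L2 norm, Euclidean distances between rootSIFT vectors correspond to the Hellinger kernel on the original descriptors, which is why K-Means can be applied unchanged afterwards.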


### Step 1b: Loading the clusters for the LBM

The LBM performed best with 100 clusters, so a codebook of that size is loaded here.
This codebook had not yet been computed in the clustering notebook, so it was computed here once; the code that did so is now commented out.

In [10]:
#lbm_clustered_codebook = more_helpers.createCodebookFullK(train_features_from_pkl, codebook_size=lbm_optima_clusters)
#save_var_to_file("kmeans_notebook_100_rootsift", lbm_clustered_codebook)

Using n_init = 4 and max_iter = 500
Initialization complete
Iteration 0, inertia 551301.25
[... per-iteration inertia log of the K-Means runs omitted ...]
Converged at iteration 488: center shift 2.5935170810953423e-07 within tolerance 2.645446686074138e-07.
training took 1932.006531238556 seconds
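The implementation of `more_helpers.createCodebookFullK` is not shown in this notebook. As a rough sketch, assuming it wraps scikit-learn's `KMeans` with the `n_init` and `max_iter` values reported in the log, building a 100-word codebook could look as follows. The random descriptor matrix is a hypothetical stand-in for the stacked rootSIFT descriptors of all train images.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for the stacked rootSIFT descriptors of all train images
all_descriptors = rng.random((2000, 128)).astype(np.float32)

# n_init and max_iter mirror the values printed in the log; 100 clusters as for the LBM
codebook = KMeans(n_clusters=100, n_init=4, max_iter=500, random_state=0)
codebook.fit(all_descriptors)

# One 128-dimensional visual word per cluster
print(codebook.cluster_centers_.shape)
```

Passing `verbose=1` to `KMeans` would print a per-iteration inertia log similar to the one captured above.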


In [15]:
lbm_clustered_codebook = get_var_from_file("kmeans_notebook_100_rootsift")

## Step 2: Encoding the images

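The images now need to be encoded using the clustered codebooks: each image's descriptors are turned into one bag-of-words feature vector per codebook (with preprocessing for the SVC codebook, without for the LBM codebook).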

In [16]:
# ----------------- encode all train images ----------------- 
svc_train_data = []
lbm_train_data = []
train_labels = []

for image_features in train_features_from_pkl:
    bow_feature_vector = more_helpers.encodeImageWithPreProc(image_features.data, svc_clustered_codebook, svc_clustered_codebook_transformer)
    svc_train_data.append(bow_feature_vector)
    
    bow_feature_vector = helpers.encodeImage(image_features.data, lbm_clustered_codebook)
    lbm_train_data.append(bow_feature_vector)
    
    train_labels.append(image_features.label)  
    
# ----------------- make labels numerical ----------------- 

label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.fit(label_strings)
train_labels = label_encoder.transform(train_labels)

# ensure label strings correspond with correct label representation
label_strings = label_encoder.inverse_transform(np.arange(len(label_encoder.classes_)))
    
# ----------------- encode all test images ----------------- 
svc_test_data_by_kaggle = []
lbm_test_data_by_kaggle = []
for image_features in test_features_from_pkl:
    bow_feature_vector = more_helpers.encodeImageWithPreProc(image_features.data, svc_clustered_codebook, svc_clustered_codebook_transformer)
    svc_test_data_by_kaggle.append(bow_feature_vector)
    
    bow_feature_vector = helpers.encodeImage(image_features.data, lbm_clustered_codebook)
    lbm_test_data_by_kaggle.append(bow_feature_vector)
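The encoding helpers themselves live in helpers.py and more_helpers.py and are not shown here. A minimal sketch of a bag-of-words encoding, assuming each descriptor is assigned to its nearest cluster centre and the counts are normalized into a histogram, might look like this. The function `encode_image_bow` and the toy data are illustrative, and whether the real helpers normalize the histogram is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def encode_image_bow(descriptors, codebook):
    """Return a normalized histogram of visual-word counts for one image."""
    words = codebook.predict(descriptors)
    histogram = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return histogram / max(histogram.sum(), 1.0)

rng = np.random.default_rng(0)
# Toy codebook with 8 visual words over 16-dimensional descriptors
codebook = KMeans(n_clusters=8, n_init=4, random_state=0).fit(rng.random((500, 16)))

# Toy image with 40 descriptors
image_descriptors = rng.random((40, 16))
bow_vector = encode_image_bow(image_descriptors, codebook)
print(bow_vector.shape, bow_vector.sum())
```

The resulting vector has a fixed length equal to the codebook size, regardless of how many descriptors the image produced, which is what makes it usable as input for the SVC and LBM.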
    
   

## Step 3: Creating a train and validation set

Whilst the final model will make use of the complete training data, a train and validation split is made for development and evaluation purposes. When `use_full_training_set` is True, the validation sets are left empty.

In [17]:
# fine-tuning of these parameters has been done!
test_fraction = optima_test_fraction
balanced = optima_test_fraction_balanced

if use_full_training_set:
    svc_train_data_split, svc_test_data_split, svc_train_labels_split, svc_test_labels_split = svc_train_data, [], train_labels, []
    lbm_train_data_split, lbm_test_data_split, lbm_train_labels_split, lbm_test_labels_split = lbm_train_data, [], train_labels, []
else:
    if balanced:
        svc_train_data_split, svc_test_data_split, svc_train_labels_split, svc_test_labels_split = train_test_split(svc_train_data, train_labels, test_size = test_fraction, stratify=train_labels, random_state=98)
        save_var_to_file("final_model_test_set_svc", svc_test_data_split)
        save_var_to_file("final_model_test_set_labels_svc", svc_test_labels_split)
        
        lbm_train_data_split, lbm_test_data_split, lbm_train_labels_split, lbm_test_labels_split = train_test_split(lbm_train_data, train_labels, test_size = test_fraction, stratify=train_labels, random_state=98)
        save_var_to_file("final_model_test_set_lbm", lbm_test_data_split)
        save_var_to_file("final_model_test_set_labels_lbm", lbm_test_labels_split)
        
    else:
        raise NotImplementedError("An unbalanced (non-stratified) split is not implemented!")
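The effect of `stratify` in the split above can be checked on toy data. This sketch uses made-up imbalanced labels and reuses the notebook's `test_size` of 0.15 and `random_state` of 98; class proportions should be preserved in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 samples of class 0, 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
data = np.arange(len(labels)).reshape(-1, 1)

train_x, val_x, train_y, val_y = train_test_split(
    data, labels, test_size=0.15, stratify=labels, random_state=98)

# Class proportions in the train part and in the validation part
print(np.bincount(train_y) / len(train_y))
print(np.bincount(val_y) / len(val_y))
```

Because the SVC and LBM splits use the same labels, `test_size`, and `random_state`, both calls select the same images, so the two feature sets stay aligned.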

In [18]:
# MAKING MERGED TRAIN AND TEST SET
label_strings_merged = ["chicken", "big", "catish", "dog", "flying"]

# new label encoder
label_encoder_merged = sklearn.preprocessing.LabelEncoder()
label_encoder_merged.fit(label_strings_merged)


# merged classes
flying = ["owl", "parrot", "swan"]
flying_ori = label_encoder.transform(flying)
catish = ["fox", "lion", "tiger", "jaguar"]
catish_ori = label_encoder.transform(catish)
big = ["horse", "elephant"]
big_ori = label_encoder.transform(big)
dog = ["german_shepherd", "golden_retriever"]
dog_ori = label_encoder.transform(dog)


# making the merged sets
svc_train_data_split_merged = []
lbm_train_data_split_merged = []
train_labels_split_merged = []

for idx in range(len(svc_train_data_split)):
    if svc_train_labels_split[idx] in flying_ori:
        svc_train_data_split_merged.append(svc_train_data_split[idx])
        lbm_train_data_split_merged.append(lbm_train_data_split[idx])
        train_labels_split_merged.append(label_encoder_merged.transform(["flying"])[0])
    elif svc_train_labels_split[idx] in catish_ori:
        svc_train_data_split_merged.append(svc_train_data_split[idx])
        lbm_train_data_split_merged.append(lbm_train_data_split[idx])
        train_labels_split_merged.append(label_encoder_merged.transform(["catish"])[0])
    elif svc_train_labels_split[idx] in big_ori:
        svc_train_data_split_merged.append(svc_train_data_split[idx])
        lbm_train_data_split_merged.append(lbm_train_data_split[idx])
        train_labels_split_merged.append(label_encoder_merged.transform(["big"])[0])
    elif svc_train_labels_split[idx] in dog_ori:
        svc_train_data_split_merged.append(svc_train_data_split[idx])
        lbm_train_data_split_merged.append(lbm_train_data_split[idx])
        train_labels_split_merged.append(label_encoder_merged.transform(["dog"])[0])
    else:
        svc_train_data_split_merged.append(svc_train_data_split[idx])
        lbm_train_data_split_merged.append(lbm_train_data_split[idx])
        ori_label = label_encoder.inverse_transform([svc_train_labels_split[idx]])[0]
        train_labels_split_merged.append(label_encoder_merged.transform([ori_label])[0])
        
svc_test_data_split_merged = []
lbm_test_data_split_merged = []
test_labels_split_merged = []

for idx in range(len(svc_test_data_split)):
    if svc_test_labels_split[idx] in flying_ori:
        svc_test_data_split_merged.append(svc_test_data_split[idx])
        lbm_test_data_split_merged.append(lbm_test_data_split[idx])
        test_labels_split_merged.append(label_encoder_merged.transform(["flying"])[0])
    elif svc_test_labels_split[idx] in catish_ori:
        svc_test_data_split_merged.append(svc_test_data_split[idx])
        lbm_test_data_split_merged.append(lbm_test_data_split[idx])
        test_labels_split_merged.append(label_encoder_merged.transform(["catish"])[0])
    elif svc_test_labels_split[idx] in big_ori:
        svc_test_data_split_merged.append(svc_test_data_split[idx])
        lbm_test_data_split_merged.append(lbm_test_data_split[idx])
        test_labels_split_merged.append(label_encoder_merged.transform(["big"])[0])
    elif svc_test_labels_split[idx] in dog_ori:
        svc_test_data_split_merged.append(svc_test_data_split[idx])
        lbm_test_data_split_merged.append(lbm_test_data_split[idx])
        test_labels_split_merged.append(label_encoder_merged.transform(["dog"])[0])
    else:
        svc_test_data_split_merged.append(svc_test_data_split[idx])
        lbm_test_data_split_merged.append(lbm_test_data_split[idx])
        ori_label = label_encoder.inverse_transform([svc_test_labels_split[idx]])[0]
        test_labels_split_merged.append(label_encoder_merged.transform([ori_label])[0])
        
if not use_full_training_set:
    save_var_to_file("final_model_test_set_merged_svc", svc_test_data_split_merged)
    save_var_to_file("final_model_test_set_merged_lbm", lbm_test_data_split_merged)
    save_var_to_file("final_model_test_set_labels_merged", test_labels_split_merged)
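The merging logic above maps each of the 12 fine-grained classes to one of 5 groups. The same mapping can be expressed more compactly with a dictionary lookup; the sketch below is an alternative formulation under that assumption, not the notebook's own code, and the short `fine_labels` list is illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Fine-grained class -> merged group; classes not listed keep their own name
group_of = {
    "owl": "flying", "parrot": "flying", "swan": "flying",
    "fox": "catish", "lion": "catish", "tiger": "catish", "jaguar": "catish",
    "horse": "big", "elephant": "big",
    "german_shepherd": "dog", "golden_retriever": "dog",
}

fine_labels = ["chicken", "owl", "tiger", "horse", "golden_retriever", "swan"]
merged_names = [group_of.get(name, name) for name in fine_labels]

# LabelEncoder sorts its classes alphabetically: big, catish, chicken, dog, flying
merged_encoder = LabelEncoder().fit(["chicken", "big", "catish", "dog", "flying"])
merged_ids = merged_encoder.transform(merged_names)
print(merged_names)
print(merged_ids)
```

Since the feature vectors are appended unchanged in every branch of the loops above, only the labels actually need to be remapped.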

In [19]:
# MAKING DOG-VS-OTHERS TRAIN AND TEST SET
label_strings_dog = ["dog", "others"]

# new label encoder
label_encoder_dog = sklearn.preprocessing.LabelEncoder()
label_encoder_dog.fit(label_strings_dog)


# merged classes
dog = ["german_shepherd", "golden_retriever"]
dog_ori = label_encoder.transform(dog)


# making the merged sets
svc_train_data_split_dog = []
lbm_train_data_split_dog = []
train_labels_split_dog = []

for idx in range(len(svc_train_data_split)):
    if svc_train_labels_split[idx] in dog_ori:
        svc_train_data_split_dog.append(svc_train_data_split[idx])
        lbm_train_data_split_dog.append(lbm_train_data_split[idx])
        train_labels_split_dog.append(label_encoder_dog.transform(["dog"])[0])
    else:
        svc_train_data_split_dog.append(svc_train_data_split[idx])
        lbm_train_data_split_dog.append(lbm_train_data_split[idx])
        train_labels_split_dog.append(label_encoder_dog.transform(["others"])[0])

        
svc_test_data_split_dog = []
lbm_test_data_split_dog = []
test_labels_split_dog = []

for idx in range(len(svc_test_data_split)):
    if svc_test_labels_split[idx] in dog_ori:
        svc_test_data_split_dog.append(svc_test_data_split[idx])
        lbm_test_data_split_dog.append(lbm_test_data_split[idx])
        test_labels_split_dog.append(label_encoder_dog.transform(["dog"])[0])
    else:
        svc_test_data_split_dog.append(svc_test_data_split[idx])
        lbm_test_data_split_dog.append(lbm_test_data_split[idx])
        test_labels_split_dog.append(label_encoder_dog.transform(["others"])[0])
        

if not use_full_training_set:
    save_var_to_file("final_model_test_set_dog_svc", svc_test_data_split_dog)
    save_var_to_file("final_model_test_set_dog_lbm", lbm_test_data_split_dog)
    save_var_to_file("final_model_test_set_labels_dog", test_labels_split_dog)

## Step 4: Making the final model class

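The final model class wraps six pipelines: an LBM and an SVC for each of the three label sets (all 12 classes, the merged classes, and dog versus others). All six are fitted or loaded from disk, but `predict_proba` and `predict` currently return only the output of the 12-class SVC.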

In [20]:
class FinalModel:
    def __init__(self):
        self.lbm_model = make_pipeline(PolynomialFeatures(),
                                       LogisticRegression(class_weight = lbm_class_weight,
                                                          C = lbm_C,
                                                          max_iter = lbm_max_iter,
                                                          fit_intercept = lbm_fit_intercept))
        
        self.svc_model = make_pipeline(StandardScaler(),
                                       SVC(C=svc_optima_C,
                                           gamma = svc_optima_gamma,
                                           kernel = svc_optima_kernel,
                                           shrinking = svc_optima_shrinking,
                                           tol = svc_optima_tol,
                                           class_weight = svc_optima_class_weight,
                                           probability = svc_optima_probability,
                                           cache_size = svc_optima_cache_size))
        
        
        self.lbm_model_merged = make_pipeline(PolynomialFeatures(),
                                              LogisticRegression(class_weight = lbm_class_weight,
                                                                 C = lbm_C,
                                                                 max_iter = lbm_max_iter,
                                                                 fit_intercept = lbm_fit_intercept))
        
        self.svc_model_merged = make_pipeline(StandardScaler(),
                                              SVC(C=svc_optima_C,
                                                  gamma = svc_optima_gamma,
                                                  kernel = svc_optima_kernel,
                                                  shrinking = svc_optima_shrinking,
                                                  tol = svc_optima_tol,
                                                  class_weight = svc_optima_class_weight,
                                                  probability = svc_optima_probability,
                                                  cache_size = svc_optima_cache_size))
        
        
        self.lbm_model_dog = make_pipeline(PolynomialFeatures(),
                                           LogisticRegression(class_weight = lbm_class_weight,
                                                              C = lbm_C,
                                                              max_iter = lbm_max_iter,
                                                              fit_intercept = lbm_fit_intercept))
        
        self.svc_model_dog = make_pipeline(StandardScaler(),
                                           SVC(C=svc_optima_C,
                                               gamma = svc_optima_gamma,
                                               kernel = svc_optima_kernel,
                                               shrinking = svc_optima_shrinking,
                                               tol = svc_optima_tol,
                                               class_weight = svc_optima_class_weight,
                                               probability = svc_optima_probability,
                                               cache_size = svc_optima_cache_size))
        
        self.use_existing = False
        
    
        

    def fit(self, svc_data_to_fit, svc_labels_to_fit, svc_data_to_fit_merged, svc_labels_to_fit_merged, svc_data_to_fit_dog, svc_labels_to_fit_dog,
            lbm_data_to_fit, lbm_labels_to_fit, lbm_data_to_fit_merged, lbm_labels_to_fit_merged, lbm_data_to_fit_dog, lbm_labels_to_fit_dog
           ):
        if self.use_existing:
            if use_full_training_set:
                self.lbm_model = get_var_from_file("final_model_LBM_full")
                self.svc_model = get_var_from_file("final_model_SVC_full")
                
                self.lbm_model_merged = get_var_from_file("final_model_LBM_merged_full")
                self.svc_model_merged = get_var_from_file("final_model_SVC_merged_full")
                
                self.lbm_model_dog = get_var_from_file("final_model_LBM_dog_full")
                self.svc_model_dog = get_var_from_file("final_model_SVC_dog_full")
            else:
                self.lbm_model = get_var_from_file("final_model_LBM")
                self.svc_model = get_var_from_file("final_model_SVC")
                
                self.lbm_model_merged = get_var_from_file("final_model_LBM_merged")
                self.svc_model_merged = get_var_from_file("final_model_SVC_merged")
                
                self.lbm_model_dog = get_var_from_file("final_model_LBM_dog")
                self.svc_model_dog = get_var_from_file("final_model_SVC_dog")
        else:
            if use_full_training_set:
                self.lbm_model.fit(lbm_data_to_fit, lbm_labels_to_fit)
                save_var_to_file("final_model_LBM_full", self.lbm_model)
                
                self.svc_model.fit(svc_data_to_fit, svc_labels_to_fit)
                save_var_to_file("final_model_SVC_full", self.svc_model)
                
                self.lbm_model_merged.fit(lbm_data_to_fit_merged, lbm_labels_to_fit_merged)
                save_var_to_file("final_model_LBM_merged_full", self.lbm_model_merged)
                
                self.svc_model_merged.fit(svc_data_to_fit_merged, svc_labels_to_fit_merged)
                save_var_to_file("final_model_SVC_merged_full", self.svc_model_merged)
                
                self.lbm_model_dog.fit(lbm_data_to_fit_dog, lbm_labels_to_fit_dog)
                save_var_to_file("final_model_LBM_dog_full", self.lbm_model_dog)
                
                self.svc_model_dog.fit(svc_data_to_fit_dog, svc_labels_to_fit_dog)
                save_var_to_file("final_model_SVC_dog_full", self.svc_model_dog)
            else:
                self.lbm_model.fit(lbm_data_to_fit, lbm_labels_to_fit)
                save_var_to_file("final_model_LBM", self.lbm_model)
                
                self.svc_model.fit(svc_data_to_fit, svc_labels_to_fit)
                save_var_to_file("final_model_SVC", self.svc_model)
                
                self.lbm_model_merged.fit(lbm_data_to_fit_merged, lbm_labels_to_fit_merged)
                save_var_to_file("final_model_LBM_merged", self.lbm_model_merged)
                
                self.svc_model_merged.fit(svc_data_to_fit_merged, svc_labels_to_fit_merged)
                save_var_to_file("final_model_SVC_merged", self.svc_model_merged)
                
                self.lbm_model_dog.fit(lbm_data_to_fit_dog, lbm_labels_to_fit_dog)
                save_var_to_file("final_model_LBM_dog", self.lbm_model_dog)
                
                self.svc_model_dog.fit(svc_data_to_fit_dog, svc_labels_to_fit_dog)
                save_var_to_file("final_model_SVC_dog", self.svc_model_dog)
                
    def predict_proba(self, svc_data_to_predict, lbm_data_to_predict):
        return self.svc_model.predict_proba(svc_data_to_predict)
        
    def predict(self, svc_data_to_predict, lbm_data_to_predict):
        return self.svc_model.predict(svc_data_to_predict)
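To make the two pipeline types in the class more concrete, the sketch below fits both on small synthetic data. The hyperparameter values mirror the optima defined at the top of this notebook; the toy data and the variable names `lbm` and `svc` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy data: 3 well-separated classes, 30 samples each, 4 features
X = np.vstack([rng.normal(loc=center, scale=0.5, size=(30, 4))
               for center in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 30)

# Logistic-regression pipeline on polynomial features (the LBM)
lbm = make_pipeline(PolynomialFeatures(),
                    LogisticRegression(class_weight="balanced", C=3, max_iter=500))
# RBF-kernel SVC on standardized features, with probability estimates enabled
svc = make_pipeline(StandardScaler(),
                    SVC(C=5, gamma="scale", kernel="rbf",
                        class_weight="balanced", probability=True))

lbm.fit(X, y)
svc.fit(X, y)

# Both pipelines expose predict_proba with one column per class
print(lbm.predict_proba(X[:2]).shape, svc.predict_proba(X[:2]).shape)
```

Setting `probability=True` is what makes `predict_proba` available on the SVC, at the cost of an internal cross-validation during fitting.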

## Step 5: Training the final model

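The final model can now be trained. The score printed below is the multi-class log loss (lower is better), not an accuracy, despite the wording of the printed message.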

In [21]:
# Create a model instance 
model = FinalModel()

# Train the model with the right train sets
model.fit(svc_train_data_split, svc_train_labels_split, 
          svc_train_data_split_merged, train_labels_split_merged,
          svc_train_data_split_dog, train_labels_split_dog,
          
          lbm_train_data_split, lbm_train_labels_split, 
          lbm_train_data_split_merged, train_labels_split_merged,
          lbm_train_data_split_dog, train_labels_split_dog)


# Show prediction for train set
predictions_probability_train = model.predict_proba(svc_train_data_split, lbm_train_data_split)
train_score = log_loss(svc_train_labels_split, predictions_probability_train)
print("Accuracy of model (single log_loss): ",train_score," (train)")

# Show prediction of test set if applicable
if not use_full_training_set:
    predictions_probability_test = model.predict_proba(svc_test_data_split, lbm_test_data_split)
    test_score = log_loss(svc_test_labels_split, predictions_probability_test)
    print("Accuracy of model (single log_loss): ", test_score," (test)")

Accuracy of model (single log_loss):  0.07994764595518414  (train)
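For reference, the log loss is the mean negative log-probability that the model assigns to the true class of each sample. The small example below, with made-up probabilities, computes it by hand and compares it with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 2, 1])
proba = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.3, 0.4, 0.3]])

# Mean negative log-probability assigned to the true class of each sample
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))
print(manual, log_loss(y_true, proba))
```

A low value on the training set alone, as reported above, says little about generalization; the validation split is needed for that.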


## Step 6: Making CSV for Kaggle

The trained model now predicts class probabilities for the Kaggle test images, which are written to a submission CSV.

In [None]:
# Do predictions on the actual test data
predictions_probability_test_data_by_kaggle = model.predict_proba(svc_test_data_by_kaggle, lbm_test_data_by_kaggle)

# Build a submission
pred_file_path = os.path.join(prediction_path, helpers.generateUniqueFilename('final/final_model', 'csv'))
helpers.writePredictionsToCsv(predictions_probability_test_data_by_kaggle, pred_file_path, label_strings)
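The exact file layout produced by `helpers.writePredictionsToCsv` is not shown in this notebook. As a sketch only, a probability submission with one row per test image and one column per class could be assembled with pandas as follows; the `Id` column name, the shortened label list, and the probabilities are all hypothetical.

```python
import io
import numpy as np
import pandas as pd

label_names = ["chicken", "elephant", "fox"]  # shortened, illustrative list
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6]])

# One column per class, one row per test image
submission = pd.DataFrame(probabilities, columns=label_names)
submission.insert(0, "Id", np.arange(1, len(submission) + 1))  # hypothetical id column

# Write to an in-memory buffer instead of disk for this illustration
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

The column order must match the order of `label_strings`, which is why the notebook re-derives `label_strings` from the label encoder before encoding the test images.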

<hr>
<hr>
<hr>

## Model Analysis

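This can only be done if an evaluation split is made, i.e. with `use_full_training_set = False`; with the full training set the split is empty and the cell below fails with the `ValueError` shown.
Since we use our own class as the model, we can't use scikit-learn's `plot_confusion_matrix` function, since it expects a classifier object to be passed.
Instead, we tweaked Wagner Cipriano's pretty print: [available here](https://github.com/wcipriano/pretty-print-confusion-matrix).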

In [23]:
test_predictions = model.predict(svc_test_data_split, lbm_test_data_split)
pretty_cm.plot_confusion_matrix_from_data(svc_test_labels_split, test_predictions,
                                          columns=label_strings.tolist(), pred_val_axis="x")

ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
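Independently of the plotting helper, the raw confusion matrix can be computed directly from label arrays with `sklearn.metrics.confusion_matrix`, which does not need a classifier object. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows correspond to the true class, columns to the predicted class
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

With an actual validation split, `y_true` would be `svc_test_labels_split` and `y_pred` the output of `model.predict`.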

In [22]:
# put PC to sleep (handy if long calculation during the night)
import os
os.system("rundll32.exe powrprof.dll,SetSuspendState 0,1,0")

0