### Hyper-Parameter Tuning Methodology in Task A1 (Model 2)

This Jupyter Notebook shows the methodology used in task A1 to pick the best parameters for model 2. This model uses Local Binary Patterns (LBP) as features for a Support Vector Machine (SVM).

To observe the impact of the model's hyper-parameters, Grid Search Cross-Validation was performed over a range of candidate values. This method exhaustively evaluates every combination in the given parameter grid, scoring each one with cross-validation, in order to find the combination that performs best.
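
As a rough illustration of what "exhaustive search" means, the sketch below enumerates every combination in a small grid and keeps the best-scoring one. The `toy_score` function and `toy_grid` are hypothetical stand-ins (in the real search the score is the cross-validated accuracy of the SVM); they are not part of this notebook.

```python
from itertools import product

def exhaustive_search(grid, score_fn):
    """Evaluate score_fn on every combination in grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    n_candidates = 0
    # product() yields the Cartesian product of all parameter value lists
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_candidates += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_candidates

# Hypothetical scoring function (assumption): highest at C=1, gamma=0.1
def toy_score(params):
    return -abs(params["C"] - 1) - abs(params["gamma"] - 0.1)

toy_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
best_params, best_score, n_candidates = exhaustive_search(toy_grid, toy_score)
print(best_params, n_candidates)
```

The cost grows with the product of the list lengths, which is why the grid is usually kept coarse.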

In [1]:
# Import statements
import glob, os, time
import numpy as np
import pandas as pd
from PIL import Image

from matplotlib import image
import matplotlib.pyplot as plt 

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split

from skimage.feature import local_binary_pattern

### Importing & pre-processing data

The steps taken when importing & pre-processing the data are the same as the ones performed in the final model in A1.py, and described in the report.

In [2]:
def mainA1LBP():
    imgs, lbs = extract_lbp()
    data_train, data_test, lbs_tr, lbs_te = train_test_split(imgs, lbs, test_size=0.2)
    pca_train, pca_test = dimensionality_reductionLBP(data_train, data_test)
    return pca_train, pca_test, lbs_tr, lbs_te

def extract_lbp():
    imgs, lbs = grayscale()

    numImgs = len(imgs)
    radius = 8
    numPoints = 24
    hist_lbp = np.ones((numImgs, numPoints+2))
    
    for i, img in enumerate(imgs):
        img = local_binary_pattern(img, numPoints, radius, "uniform")
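        # Bin edges 0..numPoints+2 give numPoints+2 bins, one per uniform LBP code
        # (the 'range' argument is ignored when explicit bin edges are passed, so it is dropped)
        hist, _ = np.histogram(img.ravel(), bins=np.arange(0, numPoints + 3))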
        hist = hist.astype("float")
        hist /= hist.sum()
        hist_lbp[i,:] = hist

    return hist_lbp, lbs

def dimensionality_reductionLBP(train_dataset, test_dataset):
    '''
    Scales the data and performs Principal Component 
    Analysis (PCA) on a given dataset
    '''

    print("Dimensionality reduction started!")
    time0 = time.time()
    print("PRE-PCA TRAIN SHAPE: ", train_dataset.shape)
    print("PRE-PCA TEST SHAPE: ", test_dataset.shape)
    scaler = StandardScaler()
    scaler.fit(train_dataset)
    
    train_dataset = scaler.transform(train_dataset)
    test_dataset = scaler.transform(test_dataset)

    pca = PCA(n_components = 'mle', svd_solver = 'full')

    pca.fit(train_dataset)
    train_dataset = pca.transform(train_dataset)
    test_dataset = pca.transform(test_dataset)

    time1 = time.time()
    print("PCA finished, it took: ", (time1-time0)/60, " min")
    
    print("Post-PCA TRAIN SHAPE: ", train_dataset.shape)
    print("Post-PCA TEST SHAPE: ", test_dataset.shape)
    
    return train_dataset, test_dataset


def grayscale():
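    '''
    Loads all images as grayscale arrays, together
    with their gender labels
    '''

    basedir = '../Datasets/dataset/Original Datasets/celeba/'
    # Read the labels file and close it once its lines are loaded
    with open(os.path.join(basedir,'labels.csv'), 'r') as labels_file:
        lines = labels_file.readlines()
    # Map image name (column 0) to gender label (column 2), skipping the header row
    gender_labels = {line.split(',')[0] : int(line.split(',')[2]) for line in lines[1:]}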

    imgs = []
    all_labels = []

    dirA1 = os.path.join(basedir,'img/')

    # Iterating over images in a sorted order
    for filename in sorted(os.listdir(dirA1), key = lambda x : int(x[:-4])):

        img = np.array(Image.open(os.path.join(dirA1,filename)).convert('L'))
        imgs.append(img)
        all_labels.append(gender_labels[filename[:-4]])
    
    labels = np.array(all_labels)
    return imgs, labels
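
The histogram in `extract_lbp` has `numPoints + 2` bins because the rotation-invariant "uniform" LBP operator produces one code per number of set bits (0 to P) plus a single shared code for all non-uniform patterns. The sketch below follows that textbook definition for a small, assumed P = 8 so that every pattern can be enumerated; it is an illustration only, not scikit-image's implementation, and `uniform_lbp_code` is a hypothetical helper.

```python
def uniform_lbp_code(bits):
    """Rotation-invariant uniform LBP code for a circular bit pattern."""
    p = len(bits)
    # Count 0/1 transitions around the circle (last bit wraps to the first)
    transitions = sum(bits[i] != bits[(i + 1) % p] for i in range(p))
    if transitions <= 2:
        return sum(bits)  # uniform pattern: code is the number of set bits, 0..P
    return p + 1          # every non-uniform pattern shares one extra code

P = 8  # assumed small value; the notebook uses numPoints = 24
codes = set()
for n in range(2 ** P):
    bits = [(n >> k) & 1 for k in range(P)]
    codes.add(uniform_lbp_code(bits))

print(sorted(codes))
```

With `numPoints = 24` the same reasoning gives 26 codes, which matches the 26 feature columns reported before PCA.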

In [3]:
data_train, data_test, lbs_train, lbs_test = mainA1LBP()

Dimensionality reduction started!
PRE-PCA TRAIN SHAPE:  (4000, 26)
PRE-PCA TEST SHAPE:  (1000, 26)
PCA finished, it took:  0.0005142450332641602  min
Post-PCA TRAIN SHAPE:  (4000, 25)
Post-PCA TEST SHAPE:  (1000, 25)
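
PCA keeps 25 of the 26 features here. One plausible contributing factor (an assumption, not something verified in this notebook) is that each histogram is normalized to sum to 1, so the 26 bins are linearly dependent and the centered data has rank at most 25. The sketch below checks this on synthetic histograms built from random counts, not on the CelebA features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the LBP histograms (assumption): random positive counts
counts = rng.integers(1, 100, size=(200, 26)).astype(float)
hist = counts / counts.sum(axis=1, keepdims=True)  # each row now sums to 1

# Standardize each column to zero mean and unit variance, as StandardScaler does
standardized = (hist - hist.mean(axis=0)) / hist.std(axis=0)

# The sum-to-one constraint removes one degree of freedom
rank = np.linalg.matrix_rank(standardized)
print(rank)
```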


### Grid Search Cross-Validation with PCA

In [4]:
# Parameter grid to perform the search on
param_dist = { 
    # Kernel type to be used in the algorithm
    'kernel': ('linear', 'rbf'),   

    # Regularization parameter
    'C': [0.1,0.3,1,3,10,30],

    # Kernel coefficient (only used by the 'rbf' kernel; ignored by 'linear')
    'gamma': ['scale',0.001,0.01,0.1,0.3,1],

    # Seed for SVC's pseudo-random number generator (only affects probability estimates)
    'random_state': [42]
}
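
This grid contains 2 × 6 × 6 × 1 = 72 combinations. Because `gamma` has no effect on the linear kernel, the 36 linear candidates cover only 6 distinct models. `GridSearchCV` also accepts a list of dictionaries, so one possible alternative is to split the grid by kernel. The sketch below only counts the candidates in each layout; `split_grid` is a suggestion and was not used for the results in this notebook.

```python
from itertools import product

def count_candidates(grid):
    """Count parameter combinations in a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(len(list(product(*g.values()))) for g in grids)

C_values = [0.1, 0.3, 1, 3, 10, 30]
gamma_values = ['scale', 0.001, 0.01, 0.1, 0.3, 1]

# Same layout as param_dist in the cell above
full_grid = {
    'kernel': ['linear', 'rbf'],
    'C': C_values,
    'gamma': gamma_values,
    'random_state': [42],
}

# Hypothetical alternative: gamma is only searched for the rbf kernel
split_grid = [
    {'kernel': ['linear'], 'C': C_values, 'random_state': [42]},
    {'kernel': ['rbf'], 'C': C_values, 'gamma': gamma_values, 'random_state': [42]},
]

print(count_candidates(full_grid), count_candidates(split_grid))
```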

In [5]:
def report(results, n_top=3):
    '''
    Helper function to report best scores for model
    '''
    
    for i in range(1, n_top + 1): 
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                results['mean_test_score'][candidate],
                results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
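
The `mean_test_score`, `std_test_score` and `rank_test_score` entries read by `report` summarize the per-fold validation scores of each candidate. The sketch below reproduces that summary on made-up fold accuracies, assuming the standard deviation is the population standard deviation over the folds and that rank 1 goes to the highest mean. The candidate names and numbers are hypothetical.

```python
from statistics import fmean, pstdev

# Hypothetical per-fold accuracies for three candidates (made-up numbers, cv=5)
fold_scores = {
    "candidate_a": [0.60, 0.70, 0.65, 0.62, 0.68],
    "candidate_b": [0.63, 0.64, 0.64, 0.65, 0.64],
    "candidate_c": [0.50, 0.55, 0.60, 0.58, 0.52],
}

means = {name: fmean(scores) for name, scores in fold_scores.items()}
stds = {name: pstdev(scores) for name, scores in fold_scores.items()}

# Rank 1 is the highest mean: one plus the number of strictly better candidates
ranks = {
    name: 1 + sum(other > mean for other in means.values())
    for name, mean in means.items()
}

for name in fold_scores:
    print(f"{name}: rank {ranks[name]}, mean {means[name]:.3f} (std: {stds[name]:.3f})")
```

Candidates with identical means would share a rank under this scheme, which is why `report` loops over all candidates found for each rank.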

In [6]:
# Running Grid Search

clf = SVC()
grid_search = GridSearchCV(clf, param_grid=param_dist, cv=5)
start = time.time()
grid_search.fit(data_train, lbs_train)

print("GridSearchCV took %.2f minutes for %d candidate parameter settings."
    % ((time.time() - start)/60, len(grid_search.cv_results_['params'])))
print("")

report(grid_search.cv_results_)

GridSearchCV took 6.56 minutes for 72 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.643 (std: 0.012)
Parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 42}

Model with rank: 2
Mean validation score: 0.642 (std: 0.013)
Parameters: {'C': 3, 'gamma': 0.01, 'kernel': 'rbf', 'random_state': 42}

Model with rank: 3
Mean validation score: 0.641 (std: 0.009)
Parameters: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf', 'random_state': 42}
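
In the cells above, the scaler and PCA are fitted on the whole training split before cross-validation, so every validation fold has already influenced the preprocessing. The effect is often small, but a common way to avoid it is to wrap the steps in a scikit-learn `Pipeline` so they are refitted inside each fold. The sketch below shows that layout on synthetic data; `X`, `y`, `pipe` and `pipe_grid` are assumptions for illustration, and it was not run on the CelebA features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class data (assumption): 120 samples, 6 features
X = rng.normal(size=(120, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Scaling and PCA are refitted on the training part of every fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components="mle", svd_solver="full")),
    ("svc", SVC()),
])

# Step parameters are addressed as <step name>__<parameter>
pipe_grid = {
    "svc__kernel": ["rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1],
}

search = GridSearchCV(pipe, param_grid=pipe_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```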



### Grid Search Cross-Validation without PCA

In [8]:
def mainA1LBPSansPCA():
    imgs, lbs = extract_lbp()
    data_train, data_test, lbs_tr, lbs_te = train_test_split(imgs, lbs, test_size=0.2)
    return data_train, data_test, lbs_tr, lbs_te

data_train, data_test, lbs_train, lbs_test = mainA1LBPSansPCA()

In [9]:
# Running Grid Search

clf = SVC()
grid_search = GridSearchCV(clf, param_grid=param_dist, cv=5)
start = time.time()
grid_search.fit(data_train, lbs_train)

print("GridSearchCV took %.2f minutes for %d candidate parameter settings."
    % ((time.time() - start)/60, len(grid_search.cv_results_['params'])))
print("")

report(grid_search.cv_results_)

GridSearchCV took 2.21 minutes for 72 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.629 (std: 0.028)
Parameters: {'C': 30, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 42}

Model with rank: 2
Mean validation score: 0.614 (std: 0.020)
Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 42}

Model with rank: 3
Mean validation score: 0.606 (std: 0.017)
Parameters: {'C': 30, 'gamma': 1, 'kernel': 'rbf', 'random_state': 42}



### Conclusions

Comparing the Grid Search Cross-Validation results with and without PCA, the SVM model performs slightly better, and more consistently across folds, when PCA is applied: the best mean validation score is 64.3 ± 1.2 % with PCA, against 62.9 ± 2.8 % without. Note that the non-PCA run also skips the standardization step and uses a different random train/test split, so the comparison is indicative rather than exact.
As such, the model with PCA will be used in the main code.

Furthermore, the hyper-parameters of the top-ranked model from the PCA search (`C = 1`, `gamma = 'scale'`, `kernel = 'rbf'`) will be used, to get the best performance possible.
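
The scores reported by `GridSearchCV` are fractions between 0 and 1, so both the mean and the standard deviation need to be multiplied by 100 when quoted as percentages. A minimal sketch of that conversion, using the rank-1 values from the two logs above (`as_percent` is a hypothetical helper):

```python
def as_percent(mean, std):
    """Format a fractional score and its standard deviation as percentages."""
    return f"{100 * mean:.1f} ± {100 * std:.1f} %"

# Rank-1 results copied from the two grid search logs
with_pca = (0.643, 0.012)
without_pca = (0.629, 0.028)

# Difference between the two best mean validation scores
gap = with_pca[0] - without_pca[0]

print("With PCA:   ", as_percent(*with_pca))
print("Without PCA:", as_percent(*without_pca))
print(f"Gap: {100 * gap:.1f} percentage points")
```

The gap between the two means is smaller than the fold-to-fold spread of the non-PCA model, so the difference should be read as modest.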