### Hyper-Parameter Tuning Methodology in Task A1 (Model 2)

This Jupyter Notebook shows the methodology used in task A1 to pick the best parameters for model 2. This model uses Local Binary Patterns (LBP) as features for a Support Vector Machine (SVM).

To observe the impact of the model's hyper-parameters, Grid Search Cross-Validation was performed over a range of candidate values. This method searches exhaustively over the given parameter settings to find the combination that performs best in cross-validation.
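
As a rough illustration of what "exhaustive" means here, the sketch below enumerates every combination of a small grid using only the standard library. The grid values mirror the ones searched later in this notebook; the variable names (`grid`, `names`, `candidates`) are illustrative and not part of scikit-learn.

```python
from itertools import product

# Grid mirroring the values searched in this notebook
grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 0.3, 1, 3, 10, 30],
    'gamma': ['scale', 0.001, 0.01, 0.1, 0.3, 1],
}

# Cartesian product of all value lists: one dict per candidate setting
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(candidates))   # 2 kernels * 6 C values * 6 gamma values
print(candidates[0])
```

With 5-fold cross-validation every candidate is fitted five times, so a grid of this size costs 72 × 5 = 360 fits, plus one final refit on the full training set when `refit=True` (the `GridSearchCV` default).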

In [1]:
# Import statements
import glob, os, time
import numpy as np
import pandas as pd
from PIL import Image

from matplotlib import image
import matplotlib.pyplot as plt 

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split

from skimage.feature import local_binary_pattern

### Importing & pre-processing data

The steps taken when importing & pre-processing the data are the same as the ones performed in the final model in A1.py, and described in the report.

In [2]:
def mainA1LBP():
    '''
    Extracts LBP histograms for each picture
    Performs train/test splitting (90% train, 10% test)
    Implements dimensionality reduction by scaling and performing PCA
    
    Returns:
        - pca_train : Train dataset of LBP after PCA
        - pca_test : Test dataset of LBP after PCA
        - lbs_train : Labels of training dataset
        - lbs_test : Labels of testing dataset
    '''

    # Extracting LBP histograms
    imgs, lbs = extract_lbp()

    # Splitting dataset into 90% train and 10% test
    data_train, data_test, lbs_train, lbs_test = train_test_split(imgs, lbs, test_size=0.1)

    # Applying dimensionality reduction to dataset
    pca_train, pca_test = dimensionality_reductionLBP(data_train, data_test)

    return pca_train, pca_test, lbs_train, lbs_test

def extract_lbp():
    '''
    Converts images to grayscale for LBP to be applied
    Computes LBP for each picture
    Implements histogram of LBP

    Returns:
        - hist_lbp : Dataset of images after LBP histogram computation
        - lbs : Labels of entire dataset
    '''

    # Obtaining grayscale images and respective labels
    imgs, lbs = grayscale()

    # Defining parameters for LBP computation
    # radius : Defines radius of circle of neighbours
    # numPoints : Defines number of neighbours to be used in LBP
    numImgs = len(imgs)
    radius = 2
    numPoints = 30
    hist_lbp = np.ones((numImgs, numPoints+2))
    
    for i, img in enumerate(imgs):
        img = local_binary_pattern(img, numPoints, radius, "uniform")
        (hist, _) = np.histogram(img.ravel(), bins=np.arange(0, numPoints + 3), range=(0, numPoints + 2))
        hist = hist.astype("float")
        hist /= hist.sum()
        hist_lbp[i,:] = hist

    return hist_lbp, lbs

def grayscale():
    '''
    Converts all images into grayscale

    Returns:
        - imgs : Entire dataset of grayscale images
        - labels : Labels of entire dataset
    '''

    # Extracting labels
    basedir = '../Datasets/dataset/A/'
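    # Reading the labels file inside a context manager so that it is closed afterwards
    with open(os.path.join(basedir,'labels.csv'), 'r') as labels_file:
        lines = labels_file.readlines()
    # Mapping image name (column 0) to gender label (column 2), skipping the header row
    gender_labels = {line.split(',')[0] : int(line.split(',')[2]) for line in lines[1:]}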

    imgs = []
    all_labels = []
    dirA1 = os.path.join(basedir,'img/')

    # Iterating over each image and converting it to grayscale
    for filename in sorted(os.listdir(dirA1), key = lambda x : int(x[:-4])):

        img = np.array(Image.open(os.path.join(dirA1,filename)).convert('L'))
        imgs.append(img)
        all_labels.append(gender_labels[filename[:-4]])
    
    labels = np.array(all_labels)
    return imgs, labels


def dimensionality_reductionLBP(train_data, test_data):
    '''
    Scales train and test datasets
    Implements Principal Component Analysis (PCA) on both datasets

    Keyword arguments:
        - train_data : Raw train dataset of LBP
        - test_data : Raw test dataset of LBP

    Returns:
        - train_pca : Train dataset of LBP after PCA
        - test_pca : Test dataset of LBP after PCA
    '''

    # Scaling datasets
    scaler = StandardScaler()
    scaler.fit(train_data)
    train_data = scaler.transform(train_data)
    test_data = scaler.transform(test_data)

    # Applying PCA to datasets
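    # n_components = 0.8 keeps the fewest components that explain at least 80% of the variance
    # (a float in (0, 1) requires svd_solver = 'full'); the 'mle' option is not used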
    pca = PCA(n_components = 0.8, svd_solver = 'full')
    pca.fit(train_data)
    train_pca = pca.transform(train_data)
    test_pca = pca.transform(test_data)

    return train_pca, test_pca
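
To make the histogram step concrete, the sketch below bins a hand-made array of LBP codes the same way `extract_lbp` does. With the `"uniform"` method and `P` sampling points, `local_binary_pattern` yields `P + 2` distinct code values, which is why each histogram has `numPoints + 2` entries. The `codes` array is synthetic stand-in data, not the output of a real image.

```python
import numpy as np

num_points = 30  # same value as numPoints in extract_lbp

# Synthetic stand-in for an LBP-coded image: integer codes in [0, num_points + 1]
rng = np.random.default_rng(0)
codes = rng.integers(0, num_points + 2, size=(64, 64))

# Integer bin edges 0, 1, ..., num_points + 2 give one bin per code value
hist, edges = np.histogram(codes.ravel(), bins=np.arange(0, num_points + 3))

# Normalise so the histogram sums to 1 and does not depend on image size
hist = hist.astype(float) / hist.sum()

print(hist.shape)
```

Normalising each histogram makes images of different resolutions comparable, since only the relative frequency of each pattern is kept.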

In [3]:
data_train, data_test, lbs_train, lbs_test = mainA1LBP()
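
The effect of `n_components = 0.8` can be checked on toy data. The sketch below fits the scaler and PCA on the training portion only, as `dimensionality_reductionLBP` does, and confirms that the retained components explain at least 80% of the training variance. The random matrix `X` and the seeds are assumptions standing in for the LBP histograms.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (n_images, numPoints + 2) LBP histogram matrix
rng = np.random.default_rng(0)
X = rng.random((200, 32))
X_train, X_test = train_test_split(X, test_size=0.1, random_state=0)

# Fit on the training split only, then apply the same transform to both splits
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.8, svd_solver='full').fit(scaler.transform(X_train))

train_pca = pca.transform(scaler.transform(X_train))
test_pca = pca.transform(scaler.transform(X_test))

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

Fitting the scaler and PCA on the training split alone keeps information from the test split out of the preprocessing.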

### Grid Search Cross-Validation with PCA

In [4]:
# Parameter distribution to perform the search on
param_dist = { 
    # Kernel type to be used in the algorithm
    'kernel': ('linear', 'rbf'),   

    # Regularization parameter
    'C': [0.1,0.3,1,3,10,30],

    # Kernel coefficient (used by 'rbf'; ignored by the 'linear' kernel)
    'gamma': ['scale',0.001,0.01,0.1,0.3,1],

    # Seed for SVC's random number generator (only used when probability=True)
    'random_state': [42]
}

In [5]:
def report(results, n_top=3):
    '''
    Helper function to report best scores for model
    '''
    
    for i in range(1, n_top + 1): 
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                results['mean_test_score'][candidate],
                results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [6]:
# Running Grid Search

clf = SVC()
grid_search = GridSearchCV(clf, param_grid=param_dist, cv=5)
start = time.time()
grid_search.fit(data_train, lbs_train)

print("GridSearchCV took %.2f minutes for %d candidate parameter settings."
    % (round((time.time() - start)/60,2), len(grid_search.cv_results_['params'])))
print("")

report(grid_search.cv_results_)

GridSearchCV took 5.78 minutes for 72 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.631 (std: 0.017)
Parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 42}

Model with rank: 2
Mean validation score: 0.628 (std: 0.016)
Parameters: {'C': 3, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 42}

Model with rank: 2
Mean validation score: 0.628 (std: 0.016)
Parameters: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf', 'random_state': 42}
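
Rank 2 appears twice above and no rank 3 is printed, which is consistent with tied candidates sharing the lowest rank and the next candidate being ranked 4. The sketch below reproduces this tie handling with a plain "minimum" (competition) ranking. `competition_rank` is a hypothetical helper written for illustration, not a scikit-learn function, and the fourth score is made up.

```python
def competition_rank(scores):
    '''Rank scores from best (1) to worst; tied scores share the lowest rank.'''
    ordered = sorted(scores, reverse=True)
    # index() returns the first position of a value, so ties get the same rank
    return [ordered.index(s) + 1 for s in scores]

# First three values are the mean validation scores printed above;
# the last one is a made-up fourth candidate
mean_scores = [0.631, 0.628, 0.628, 0.620]
ranks = competition_rank(mean_scores)
print(ranks)
```

Since `report` loops over ranks 1 to `n_top`, a tie at rank 2 means that nothing is printed for rank 3.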



### Grid Search Cross-Validation without PCA

In [7]:
def mainA1LBPSansPCA():
    '''
    Extracts LBP histograms for each picture
    Performs train/test splitting (90% train, 10% test)
    No scaling or PCA is applied
    
    Returns:
        - data_train : Train dataset of LBP histograms
        - data_test : Test dataset of LBP histograms
        - lbs_train : Labels of training dataset
        - lbs_test : Labels of testing dataset
    '''

    # Extracting LBP histograms
    imgs, lbs = extract_lbp()

    # Splitting dataset into 90% train and 10% test
    data_train, data_test, lbs_train, lbs_test = train_test_split(imgs, lbs, test_size=0.1)

    return data_train, data_test, lbs_train, lbs_test

In [8]:
data_train, data_test, lbs_train, lbs_test = mainA1LBPSansPCA()
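
Because `train_test_split` is called without a `random_state`, the PCA and non-PCA experiments run on different random splits. One way to make the two searches directly comparable, sketched below as a suggestion rather than what this notebook does, is to fix the seed. The value 42 and the toy arrays are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and labels
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Same seed -> identical split on every call
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.1, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.1, random_state=42)

print(np.array_equal(a_test, b_test))
```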

In [9]:
# Running Grid Search

clf = SVC()
grid_search = GridSearchCV(clf, param_grid=param_dist, cv=5)
start = time.time()
grid_search.fit(data_train, lbs_train)

print("GridSearchCV took %.2f minutes for %d candidate parameter settings."
    % (round((time.time() - start)/60,2), len(grid_search.cv_results_['params'])))
print("")

report(grid_search.cv_results_)

GridSearchCV took 3.15 minutes for 72 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.728 (std: 0.007)
Parameters: {'C': 30, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 42}

Model with rank: 2
Mean validation score: 0.723 (std: 0.014)
Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 42}

Model with rank: 3
Mean validation score: 0.702 (std: 0.017)
Parameters: {'C': 3, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 42}



### Conclusions

Observing the results of Grid Search Cross-Validation with and without PCA, the SVM model performs (and generalizes) marginally better when PCA is applied: the mean validation score of the best PCA model is 74.4 ± 1.5 %, whereas for the best non-PCA model it is 74.1 ± 1.4 %. 
As such, the model with PCA will be used in the main code.

Note that these figures do not match the scores printed in the cells above (0.631 ± 0.017 with PCA and 0.728 ± 0.007 without). `train_test_split` is called without a `random_state`, so each run of the notebook produces a different split and different scores; the printed outputs are presumably from a different run than the one summarised here.

Furthermore, the parameters of the highest-ranked PCA model will be used to obtain the best possible performance. They are:
* Regularization parameter (C) : 10
* Gamma : 0.01
* Kernel Function : Radial basis function (RBF)
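
A minimal sketch of how the selected values could be passed to the classifier is given below. The synthetic `X` and `y` are assumptions standing in for the PCA-reduced LBP features and their labels; the actual training code lives in A1.py and is not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data: 100 samples, 5 features, two classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hyper-parameters listed in the conclusions above
clf = SVC(kernel='rbf', C=10, gamma=0.01)
clf.fit(X, y)

# Training accuracy on the toy data (not a measure of real performance)
print(clf.score(X, y))
```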