# 3.4 Object-Scene Recognition - BoW vs. SPM
This notebook is adapted from the Spatial Pyramid Matching - Scene Recognition notebook written by TrungTVo.
- Reference Project GitHub Repo Link: https://github.com/TrungTVo/spatial-pyramid-matching-scene-recognition
- Functions such as `computeSIFT()` and `load_dataset()` are taken as-is from the reference project

Required 3rd party packages:
- numpy > conda install numpy
- split-folders > pip install split-folders
- opencv via menpo channel on Anaconda Cloud > conda install -c menpo opencv
    - OpenCV from the menpo channel is required, for the SIFT functions to be available
- pillow via pip > pip install pillow - *NOTE: do not install via conda to avoid issues with imports*

This notebook compares Bag of Words/Features (BoW/BoF) with Spatial Pyramid Matching (SPM) for image classification.

In [1]:
# import packages here
import cv2
import numpy as np
import glob
import split_folders
import copy

### Preparing Datasets

We first read in the dataset to be used: Caltech101 Dataset.

The dataset can be found here: http://www.vision.caltech.edu/Image_Datasets/Caltech101/
- Extract the dataset into a folder in the same directory as this notebook, e.g. *101_ObjectCategories*

After that, we read in the files and take a look at the image categories.

In [2]:
# Read in the dataset folder and check out category names
# Change path name to dataset folder name - '<folder-name>/*'
class_names = [name[21:] for name in glob.glob('101_ObjectCategories/*')]
class_names = dict(zip(range(0,len(class_names)), class_names))
print(class_names)

{0: 'accordion', 1: 'airplanes', 2: 'anchor', 3: 'ant', 4: 'BACKGROUND_Google', 5: 'barrel', 6: 'bass', 7: 'beaver', 8: 'binocular', 9: 'bonsai', 10: 'brain', 11: 'brontosaurus', 12: 'buddha', 13: 'butterfly', 14: 'camera', 15: 'cannon', 16: 'car_side', 17: 'ceiling_fan', 18: 'cellphone', 19: 'chair', 20: 'chandelier', 21: 'cougar_body', 22: 'cougar_face', 23: 'crab', 24: 'crayfish', 25: 'crocodile', 26: 'crocodile_head', 27: 'cup', 28: 'dalmatian', 29: 'dollar_bill', 30: 'dolphin', 31: 'dragonfly', 32: 'electric_guitar', 33: 'elephant', 34: 'emu', 35: 'euphonium', 36: 'ewer', 37: 'Faces', 38: 'Faces_easy', 39: 'ferry', 40: 'flamingo', 41: 'flamingo_head', 42: 'garfield', 43: 'gerenuk', 44: 'gramophone', 45: 'grand_piano', 46: 'hawksbill', 47: 'headphone', 48: 'hedgehog', 49: 'helicopter', 50: 'ibis', 51: 'inline_skate', 52: 'joshua_tree', 53: 'kangaroo', 54: 'ketch', 55: 'lamp', 56: 'laptop', 57: 'Leopards', 58: 'llama', 59: 'lobster', 60: 'lotus', 61: 'mandolin', 62: 'mayfly', 63: 'm
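
The slice `name[21:]` above works only because `'101_ObjectCategories/'` happens to be 21 characters long. A minimal sketch of a length-independent alternative, assuming the same one-folder-per-class layout; `list_class_names` is a hypothetical helper and not part of the reference project:

```python
import os
import tempfile


def list_class_names(dataset_dir):
    """Map integer ids to the sub-folder names found in dataset_dir (hypothetical helper)."""
    names = sorted(
        entry for entry in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, entry))
    )
    return dict(enumerate(names))


# Toy check on a temporary directory rather than the real dataset.
with tempfile.TemporaryDirectory() as root:
    for folder in ("anchor", "accordion", "ant"):
        os.mkdir(os.path.join(root, folder))
    demo_names = list_class_names(root)
    print(demo_names)
```

Sorting makes the id assignment reproducible across machines, but the resulting order may differ from the order `glob` returned in the output above.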

With the 'split-folders' module, we then split the images into a training and test set. For this example, we will have 0.7 of the images for training.

In [2]:
# Split into training and validation sets. Run this only once; to change the ratio,
# delete the output folder and re-run.
split_folders.ratio('101_ObjectCategories/', output="101_split/", seed=1337, ratio=(.7, .3))
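
To make the effect of `ratio` and `seed` concrete, the sketch below performs a seeded shuffle-and-cut split using only the standard library. It illustrates the idea and is not the internals of `split-folders`; `split_by_ratio` is a hypothetical name.

```python
import random


def split_by_ratio(items, train_ratio=0.7, seed=1337):
    """Shuffle a copy of items with a fixed seed, then cut it at train_ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


files = [f"image_{i:04d}.jpg" for i in range(10)]
train_files, val_files = split_by_ratio(files)
print(len(train_files), len(val_files))  # 7 3
```

Because the seed is fixed, calling the function again on the same list returns the same partition, which is why the split only needs to be run once.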

In [3]:
# Function to load images into datasets
def load_dataset(path, num_per_class=-1):
    data = []
    labels = []
    for id, class_name in class_names.items():
        img_path_class = glob.glob(path + class_name + '/*.jpg')
        if num_per_class > 0:
            img_path_class = img_path_class[:num_per_class]
        labels.extend([id]*len(img_path_class))
        for filename in img_path_class:
            data.append(cv2.imread(filename, 0))
    return data, labels

To emulate the setting as required in the Lab 2 manual, we perform training on 30 images and test on 50 images, per category.

In [4]:
# load training dataset
train_data, train_label = load_dataset('101_split/train/', 30)
train_num = len(train_label)

# load testing dataset
test_data, test_label = load_dataset('101_split/val/', 50)
test_num = len(test_label)

### Computing SIFT features

Functions are defined to compute SIFT features of the images.

We want to represent our training and testing images as Bag of Features histograms. To do so, we first build a vocabulary of visual words: we sample local SIFT descriptors from the training images and cluster them with k-means.
- This partitions the continuous 128-dimensional SIFT feature space into k regions, one per visual word.

In [6]:
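# compute dense SIFT descriptors for every image in data
def computeSIFT(data):
    sift = cv2.xfeatures2d.SIFT_create()  # create the extractor once and reuse it
    step_size = 15
    descriptors = []
    for img in data:
        # cv2.KeyPoint takes (x, y, size): x runs along the width (columns), y along the height (rows)
        kp = [cv2.KeyPoint(x, y, step_size)
              for y in range(0, img.shape[0], step_size)
              for x in range(0, img.shape[1], step_size)]
        dense_feat = sift.compute(img, kp)
        descriptors.append(dense_feat[1])
    return descriptors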

Rather than detecting keypoints, we place them on a regular grid: `step_size` sets both the spacing between keypoints (the sampling density) and the keypoint size (scale).
- `sift.compute()` computes SIFT descriptors
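
As a quick illustration of what dense sampling means here, this sketch enumerates grid centres for a given image size and step. OpenCV keypoints take `(x, y)` with `x` along the width (columns) and `y` along the height (rows). `dense_grid` is a hypothetical helper, and the 200x300 image size is only an example.

```python
def dense_grid(height, width, step):
    """Return (x, y) keypoint centres on a regular grid; x indexes columns, y rows."""
    return [(x, y)
            for y in range(0, height, step)
            for x in range(0, width, step)]


points = dense_grid(height=200, width=300, step=15)
print(len(points))            # 14 rows * 20 columns = 280
print(points[0], points[-1])  # (0, 0) (285, 195)
```

Each grid point yields one 128-dimensional descriptor, so the step size directly controls how many descriptors per image are passed on to clustering.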

We then compute descriptors for the training and test images, and stack the training descriptors into a single array for clustering.

In [6]:
# extract dense SIFT features from training and test images
x_train = computeSIFT(train_data)
x_test = computeSIFT(test_data)

# stack all training descriptors into one (num_descriptors, 128) array for clustering
all_train_desc = np.vstack(x_train)

We then build the Bag of Words features via KMeans on the SIFT descriptors, and form Histograms to train our model on.

In [5]:
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.svm import LinearSVC

In [7]:
# build the BoW codebook by clustering the SIFT descriptors of the training images
def clusterFeatures(all_train_desc, k):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(all_train_desc)
    return kmeans


# form training set histograms for each training image using BoW representations
def formHistogram(x_train, kmeans, k):
    train_hist = []
    for i in range(len(x_train)):
        data = copy.deepcopy(x_train[i])
        predict = kmeans.predict(data)
        train_hist.append(np.bincount(predict, minlength=k).reshape(1,-1).ravel())
        
    return np.array(train_hist)
    

# fraction of predictions that match the ground-truth labels
def accuracy(predict_label, test_label):
    return np.mean(np.array(predict_label) == np.array(test_label))
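
The quantisation step that `formHistogram` performs can be illustrated without scikit-learn. The sketch below assigns each descriptor to its nearest centroid and counts the assignments; it uses toy 2-D vectors in place of 128-D SIFT descriptors, and `bow_histogram` is a hypothetical name.

```python
import numpy as np


def bow_histogram(descriptors, centroids):
    """Count how many descriptors fall nearest to each centroid (visual word)."""
    # Squared Euclidean distance from every descriptor to every centroid.
    diffs = descriptors[:, None, :] - centroids[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    words = np.argmin(dists, axis=1)
    return np.bincount(words, minlength=len(centroids))


# Toy 2-D "descriptors" stand in for 128-D SIFT vectors.
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
toy_desc = np.array([[1.0, 1.0], [9.0, 9.0], [0.5, 0.2], [1.0, 9.0], [11.0, 10.0]])
toy_hist = bow_histogram(toy_desc, toy_centroids)
print(toy_hist)  # [2 2 1]
```

The histogram always has one bin per codeword, regardless of how many descriptors the image produced, which is what lets images of different sizes share one feature space.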

Here, we set k = 30, and form our training and test histogram sets.

In [9]:
# Takes ~40 mins...
k = 30
kmeans = clusterFeatures(all_train_desc, k)

# form training and testing histograms
train_hist = formHistogram(x_train, kmeans, k)
test_hist = formHistogram(x_test, kmeans, k)

# standardize each histogram bin (zero mean, unit variance) using training-set statistics
scaler = preprocessing.StandardScaler().fit(train_hist)
train_hist = scaler.transform(train_hist)
test_hist = scaler.transform(test_hist)
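
As a hedged illustration of what the scaling step does, the sketch below standardizes a toy matrix by hand. The key point is that the mean and standard deviation come from the training set only and are then reused for the test set; the toy arrays are made up for the example.

```python
import numpy as np

train = np.array([[2.0, 0.0], [4.0, 10.0], [6.0, 20.0]])
test = np.array([[4.0, 5.0]])

# Statistics come from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, every training column has zero mean and unit variance, while the test columns generally do not, since they are transformed with the training statistics.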

We use a Support Vector Machine (SVM) to train our model; in particular, we use `LinearSVC`, which handles the multi-class problem with a one-vs-rest scheme, i.e. one binary classifier per category.

We test a range of values for the parameter `C`, which is the penalty parameter of the error term.
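
For reference, with scikit-learn's default settings (`penalty='l2'`, `loss='squared_hinge'`), each one-vs-rest binary problem solved by `LinearSVC` can be written roughly as follows, where $y_i \in \{-1, +1\}$ marks whether image $i$ belongs to the class and $x_i$ is its histogram. This is a sketch of the standard formulation, not a quote from the library documentation:

```latex
\min_{w,\,b}\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr)^2
```

A small C lets the regularisation term dominate, so more training errors are tolerated; a large C penalises training errors more heavily.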

In [11]:
for c in [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    clf = LinearSVC(random_state=0, C=c)
    clf.fit(train_hist, train_label)
    predict = clf.predict(test_hist)
    print("C =", c, ",\t Accuracy:", np.mean(predict == test_label)*100, "%")

C = 0.01 ,	 Accuracy: 25.811001410437235 %
C = 0.1 ,	 Accuracy: 30.982604607428303 %
C = 0.2 ,	 Accuracy: 31.170662905500706 %
C = 0.3 ,	 Accuracy: 31.358721203573108 %
C = 0.4 ,	 Accuracy: 31.40573577809121 %
C = 0.5 ,	 Accuracy: 31.781852374236014 %
C = 0.6 ,	 Accuracy: 31.781852374236014 %
C = 0.7 ,	 Accuracy: 31.781852374236014 %
C = 0.8 ,	 Accuracy: 31.54677950164551 %
C = 0.9 ,	 Accuracy: 31.499764927127412 %
C = 1.0 ,	 Accuracy: 31.499764927127412 %

Accuracy rises from about 25.8% at C = 0.01 to a plateau of about 31.8% for C between 0.5 and 0.7, then dips slightly. Overall accuracy remains low.

### Try again with training and test set sizes of (100, 100), and k = 60

We explore a little further by increasing the training and test set sizes and trying a different value of k.

In [12]:
# load training and test datasets, this time with (100, 100)
train_data, train_label = load_dataset('101_split/train/', 100)
train_num = len(train_label)

test_data, test_label = load_dataset('101_split/val/', 100)
test_num = len(test_label)

In [13]:
# extract dense SIFT features from training and test images
x_train = computeSIFT(train_data)
x_test = computeSIFT(test_data)

# stack all training descriptors into one (num_descriptors, 128) array for clustering
all_train_desc = np.vstack(x_train)

In [14]:
# Takes > 1 hr...
k = 60
kmeans = clusterFeatures(all_train_desc, k)

# form training and testing histograms
train_hist = formHistogram(x_train, kmeans, k)
test_hist = formHistogram(x_test, kmeans, k)

# normalize histograms
scaler = preprocessing.StandardScaler().fit(train_hist)
train_hist = scaler.transform(train_hist)
test_hist = scaler.transform(test_hist)

In [15]:
for c in [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    clf = LinearSVC(random_state=0, C=c)
    clf.fit(train_hist, train_label)
    predict = clf.predict(test_hist)
    print ("C =", c, ",\t Accuracy:", np.mean(predict == test_label)*100, "%")

C = 0.01 ,	 Accuracy: 39.76753839767538 %
C = 0.1 ,	 Accuracy: 44.831880448318806 %
C = 0.2 ,	 Accuracy: 44.707347447073474 %
C = 0.3 ,	 Accuracy: 44.79036944790369 %
C = 0.4 ,	 Accuracy: 44.87339144873391 %
C = 0.5 ,	 Accuracy: 44.74885844748858 %
C = 0.6 ,	 Accuracy: 44.624325446243255 %
C = 0.7 ,	 Accuracy: 44.58281444582815 %
C = 0.8 ,	 Accuracy: 44.3752594437526 %
C = 0.9 ,	 Accuracy: 44.12619344126193 %
C = 1.0 ,	 Accuracy: 44.167704441677046 %

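Overall accuracy has improved to roughly 44 to 45%. As before, accuracy jumps between C = 0.01 and C = 0.1 and then stays nearly flat, peaking at about 44.9% at C = 0.4. The model performs better with k = 60 and the larger training and test sets, although this run changes both at once and so does not separate their effects.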

### Try SPM

Now, we attempt to improve performance with Spatial Pyramid Matching.

In [16]:
import math

def extract_denseSIFT(img):
    DSIFT_STEP_SIZE = 2
    sift = cv2.xfeatures2d.SIFT_create()
    # keypoints on a regular grid; x runs along the width, y along the height
    keypoints = [cv2.KeyPoint(x, y, DSIFT_STEP_SIZE)
                 for y in range(0, img.shape[0], DSIFT_STEP_SIZE)
                 for x in range(0, img.shape[1], DSIFT_STEP_SIZE)]

    descriptors = sift.compute(img, keypoints)[1]
    return descriptors


# form histogram with Spatial Pyramid Matching up to level L with codebook kmeans and k codewords
def getImageFeaturesSPM(L, img, kmeans, k):
    W = img.shape[1]
    H = img.shape[0]   
    h = []
    for l in range(L+1):
        w_step = math.floor(W/(2**l))
        h_step = math.floor(H/(2**l))
        x, y = 0, 0
        for i in range(1,2**l + 1):
            x = 0
            for j in range(1, 2**l + 1):                
                desc = extract_denseSIFT(img[y:y+h_step, x:x+w_step])                

                predict = kmeans.predict(desc)
                histo = np.bincount(predict, minlength=k).reshape(1,-1).ravel()
                weight = 2**(l-L)
                h.append(weight*histo)
                x = x + w_step
            y = y + h_step
            
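    # cast to float so the in-place normalisation below also works when L = 0 (integer weights)
    hist = np.array(h, dtype=float).ravel()

    # standardize hist to zero mean and unit variance
    dev = np.std(hist)
    hist -= np.mean(hist)
    hist /= dev
    return hist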


# get histogram representation for training/testing data
def getHistogramSPM(L, data, kmeans, k):    
    x = []
    for i in range(len(data)):        
        hist = getImageFeaturesSPM(L, data[i], kmeans, k)        
        x.append(hist)
    return np.array(x)
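
To see which sub-regions the nested loops in `getImageFeaturesSPM` visit, the sketch below lists the cell bounds and weights per level without touching any image data. `pyramid_cells` is a hypothetical helper that mirrors the loop arithmetic and the `2**(l-L)` weighting used above; the 200x300 image size is only an example.

```python
def pyramid_cells(height, width, max_level):
    """List (level, y0, y1, x0, x1, weight) for every cell of the pyramid."""
    cells = []
    for level in range(max_level + 1):
        n = 2 ** level                       # n x n cells at this level
        h_step, w_step = height // n, width // n
        weight = 2.0 ** (level - max_level)  # finer levels weigh more
        for i in range(n):
            for j in range(n):
                y0, x0 = i * h_step, j * w_step
                cells.append((level, y0, y0 + h_step, x0, x0 + w_step, weight))
    return cells


cells = pyramid_cells(height=200, width=300, max_level=2)
print(len(cells))   # 1 + 4 + 16 = 21 cells
print(cells[0])     # (0, 0, 200, 0, 300, 0.25)
print(cells[-1])    # (2, 150, 200, 225, 300, 1.0)
```

Each cell produces its own k-bin histogram, so dense SIFT is recomputed once per cell, which helps explain why this step is slow.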

In [17]:
train_histo = getHistogramSPM(2, train_data, kmeans, k)
test_histo = getHistogramSPM(2, test_data, kmeans, k)

In [19]:
# train SVM
for c in [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    clf = LinearSVC(random_state=0, C=c)
    clf.fit(train_histo, train_label)
    predict = clf.predict(test_histo)
    print("C =", c, ",\t\t Accuracy:", np.mean(predict == test_label)*100, "%")

C = 0.01 ,		 Accuracy: 58.779576587795766 %
C = 0.1 ,		 Accuracy: 56.08136156081361 %
C = 0.2 ,		 Accuracy: 55.50020755500208 %
C = 0.3 ,		 Accuracy: 55.25114155251142 %
C = 0.4 ,		 Accuracy: 55.2096305520963 %
C = 0.5 ,		 Accuracy: 55.168119551681194 %
C = 0.6 ,		 Accuracy: 55.2096305520963 %
C = 0.7 ,		 Accuracy: 55.168119551681194 %
C = 0.8 ,		 Accuracy: 55.33416355334163 %
C = 0.9 ,		 Accuracy: 55.12660855126609 %
C = 1.0 ,		 Accuracy: 55.002075550020756 %

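We observe that the best accuracy, approximately 58.78%, is obtained at the smallest value tested, C = 0.01; for larger C, accuracy drops to about 55 to 56%. This is roughly 14 percentage points above the best BoW result (44.87%) on the same data.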

### Summary

We have implemented the Bag of Words/Features (BoW/BoF) model to represent images in terms of visual words. The visual words are built from dense SIFT descriptors: each descriptor is a 128-dimensional vector, so an image with N keypoints yields an N x 128 matrix.

KMeans clustering groups the descriptors into K clusters. The cluster centroids are the codewords, and together they make up a dictionary (codebook) of K codewords.
- K initial centroids are chosen (scikit-learn's KMeans defaults to k-means++ seeding), and each SIFT descriptor is assigned to its nearest centroid under Euclidean distance
- Centroids are recomputed as the mean of the points assigned to them, and the two steps repeat until convergence
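
The two steps listed above can be sketched in a few lines of NumPy. This is a plain Lloyd iteration on toy 2-D points with hand-picked initial centroids, not scikit-learn's implementation; `lloyd_kmeans` is a hypothetical name.

```python
import numpy as np


def lloyd_kmeans(points, init_centroids, max_iter=100):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for c in range(len(centroids)):
            members = points[labels == c]
            if len(members) > 0:          # leave empty clusters where they are
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                         # converged
        centroids = new_centroids
    return centroids, labels


toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
toy_centroids, toy_labels = lloyd_kmeans(toy_points, init_centroids=toy_points[[0, 2]])
print(toy_centroids)  # rows (0, 1) and (10, 11)
print(toy_labels)     # [0 0 1 1]
```

On real SIFT data the distance matrix in the assignment step dominates the cost, which is consistent with the long clustering times noted in the cells above.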

Each image is then represented as a histogram of codeword counts (its Bag of Features), and the histograms are classified with an SVM.

### Improvements via Spatial Pyramid Matching (SPM)

The BoW model encodes all local features of an image into a single histogram, and information on the position of the feature descriptors is dropped. Hence, the spatial layout of the visual words is not preserved.

Spatial Pyramid Matching divides the image into regions and successively finer subregions, computes SIFT descriptors in each subregion, forms a histogram of visual words per subregion, and concatenates the weighted histograms into a single 1D vector.
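
As a worked check of the resulting vector length, under the weighting used in `getImageFeaturesSPM` above (the original SPM formulation weights the coarsest level slightly differently), level $l$ contributes $4^l$ cells of $k$ bins each, scaled by $w_l$:

```latex
\dim = k \sum_{l=0}^{L} 4^{l} = k\,\frac{4^{L+1} - 1}{3},
\qquad
w_l = 2^{\,l - L}
```

For the run above, $k = 60$ and $L = 2$ give $60 \times (1 + 4 + 16) = 1260$ dimensions, with weights $1/4$, $1/2$ and $1$ for levels 0, 1 and 2.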

### Improvements of model performance

Via hyperparameter tuning, we can search for better values of hyperparameters such as C for the SVM or k for the codebook, and increase the accuracy of the model.

Other classifiers, such as k-nearest neighbours, can also be applied, and their performance compared or combined in an ensemble.
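
As one hedged illustration of such an alternative classifier, the sketch below labels a histogram by majority vote among its nearest training histograms. It uses toy 2-D vectors and a hypothetical `knn_predict` function; it was not run on the Caltech101 features, so no accuracy is claimed for it.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, n_neighbors=3):
    """Label a query vector by majority vote among its nearest training vectors."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:n_neighbors]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


toy_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
toy_y = [0, 0, 0, 1, 1]
pred_a = knn_predict(toy_x, toy_y, np.array([0.2, 0.3]))
pred_b = knn_predict(toy_x, toy_y, np.array([5.5, 5.0]))
print(pred_a, pred_b)  # 0 1
```

The same `train_hist` / `test_hist` arrays used for the SVM could be passed to such a classifier, which would make a like-for-like comparison straightforward.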