# Model specification and approach

Now that I have a feel for the dataset and its structure, let's recap the problem and consider the options for model choice. Here I brainstorm a few possibilities.

### Problem recap

We want to be able to identify the brands of any logos present in a test image. These brands should be car and/or clothing brands available in the BelgaLogos dataset. There may be more than one brand present in a test image, and we want to identify all of them (as best we can).

The problem can be separated into two responsibilities: the identification of logo locations in an image, and the subsequent classification of each logo. Separating the process into two procedures allows us to specialise each stage to its task. Depending on the model, a 'one-shot' classifier combining the two tasks is also possible.

### Identification of logo locations

The simplest (and as far as I can tell most commonly used) approach to finding potential object locations in an image is the **sliding-window** method. In this method, a smaller patch or window of the test image is evaluated for the presence of the desired object. This window is scanned across the whole test image ([example gif](https://www.pyimagesearch.com/wp-content/uploads/2015/03/sliding-window-animated-adrian.gif)) until the entire image is covered. If the object-classification stage returns a match for a window position, that window is then considered a potential object location.
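
A minimal sketch of the scan described above, assuming a single fixed window size and stride. The function name `sliding_windows` and the example sizes are illustrative choices, not taken from any library or from the dataset:

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_windows(image_w: int, image_h: int,
                    win_w: int, win_h: int, stride: int) -> Iterator[Box]:
    """Yield window boxes scanning left-to-right, top-to-bottom.

    Windows that would extend past the image border are skipped, so a strip
    narrower than the stride may stay uncovered at the right/bottom edges.
    """
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)


# Hypothetical sizes: a 640x480 image scanned with a 64x64 window.
boxes = list(sliding_windows(640, 480, 64, 64, stride=32))
print(len(boxes), boxes[0], boxes[-1])
```

Each yielded box would be cropped from the image and passed to the classifier. In practice the scan is usually repeated over several window sizes (or over rescaled copies of the image) so that logos of different sizes can be found, which multiplies the number of classifier evaluations.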

The sliding-window method is a rather brute-force way of finding logo locations. However, it means that the object-classification stage only needs to be trained on a small window of an image. This considerably simplifies the training of the classification model, at the expense of making evaluation of test images slow (the classifier must be evaluated many times across each test image).

There are [many methods available](https://github.com/caocuong0306/awesome-object-proposals) for the more efficient identification of candidate object locations. However, as my time and available computational power are limited, if using a model that requires a separate location-identification step, I will use the sliding-window method to minimise training time.

### Classification of logos

Here I discuss the results of a brief (far from exhaustive) survey into different options for classifying logos in an image.

###### Template matching (sliding window)

Template matching is a relatively simple method for finding objects in an image. For every sliding-window position, it tests for a match between a template image (for example, a high quality 'official' version of the logo) and that location in the test image. The match can be judged by a number of local image-similarity measures.

While the method is simple to implement, it has to be treated with care to achieve matches for images at different scales to the template. Furthermore, rotations of the template, noise, and poor focus in the test image can very rapidly spoil the match. Template matching appears to be most commonly used when a close-to-exact match can be assumed, which is not the case in the natural-image environment of the BelgaLogos dataset.
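
As an illustration of one such similarity measure, here is a sketch of template matching with zero-mean normalised cross-correlation, written in plain NumPy on greyscale arrays. The function names `ncc` and `best_match` and the synthetic data are assumptions made for this example. Real implementations (for instance in OpenCV) are much faster than this double loop.

```python
import numpy as np


def ncc(patch: np.ndarray, template: np.ndarray) -> float:
    """Zero-mean normalised cross-correlation, in [-1, 1]."""
    p = patch.astype(float) - patch.mean()
    t = template.astype(float) - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    # A flat patch (or template) has no structure to correlate with.
    return float((p * t).sum() / denom) if denom > 0 else 0.0


def best_match(image: np.ndarray, template: np.ndarray):
    """Slide the template over every position; return (score, x, y) of the best."""
    th, tw = template.shape
    ih, iw = image.shape
    best = (-1.0, 0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            score = ncc(image[y:y + th, x:x + tw], template)
            if score > best[0]:
                best = (score, x, y)
    return best


# Synthetic check: paste the template into a random image and recover its position.
rng = np.random.default_rng(0)
template = rng.random((8, 8))
image = rng.random((40, 60))
image[12:20, 30:38] = template
score, x, y = best_match(image, template)
print(score, x, y)
```

An exact copy of the template scores essentially 1. Any change of scale or rotation between template and image would typically lower the score, which is the fragility noted above.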

###### Keypoint matching (one-shot/sliding window)

There are a few papers available discussing keypoint matching in the BelgaLogos dataset.
 - [Chu, Lin 2012](https://www.cs.ccu.edu.tw/~wtchu/papers/2012ICASSP2-chu2.pdf) 
 - [Roy, Garain 2012](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.9134&rep=rep1&type=pdf)
 
In these methods, *key-points* that describe the local geometry of a logo are extracted from a training set (via the SIFT, SURF, or ORB algorithms). The associated descriptors are designed to be largely invariant to scaling and rotation, and reasonably robust to moderate affine distortion (e.g. shearing), so they can be matched against the equivalent key-points extracted from a test image. This could be applied either as a one-shot algorithm or within a sliding window.
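
The matching step can be sketched as a brute-force nearest-neighbour search with a ratio test to reject ambiguous matches. This is a generic illustration, not the procedure of either paper above. It assumes real-valued descriptors compared by Euclidean distance (as with SIFT or SURF; ORB descriptors are binary and are normally compared by Hamming distance), and the function name, the `ratio=0.75` default, and the synthetic descriptors are all illustrative assumptions.

```python
import numpy as np


def ratio_test_matches(desc_logo: np.ndarray, desc_image: np.ndarray,
                       ratio: float = 0.75):
    """Match each logo descriptor to its nearest image descriptor.

    A match is kept only if the nearest neighbour is clearly closer than the
    second nearest (the ratio test), which discards ambiguous matches.
    Requires at least two image descriptors.
    Returns a list of (logo_index, image_index) pairs.
    """
    # Pairwise Euclidean distances, shape (n_logo, n_image).
    diff = desc_logo[:, None, :] - desc_image[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    matches = []
    for i, row in enumerate(dist):
        nearest, second = np.argsort(row)[:2]
        if row[nearest] < ratio * row[second]:
            matches.append((i, int(nearest)))
    return matches


# Synthetic descriptors: 20 unrelated "clutter" vectors followed by
# slightly noisy copies of the 5 logo descriptors.
rng = np.random.default_rng(0)
desc_logo = rng.random((5, 32))
clutter = rng.random((20, 32))
noisy_copies = desc_logo + rng.normal(0.0, 0.01, desc_logo.shape)
desc_image = np.vstack([clutter, noisy_copies])
matches = ratio_test_matches(desc_logo, desc_image)
print(matches)
```

A logo would then be declared present if enough matches survive, usually after an additional geometric-consistency check on the matched positions.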

###### Haar classifiers (sliding window)

A Haar classifier works by building a set of hand-constructed convolutional kernels ([Haar features](https://en.wikipedia.org/wiki/Haar-like_feature)) that reasonably describe common features of the target object. Each kernel is applied to the test image, and a pass/fail threshold is applied depending on the output of the kernel-image product. The required thresholds/weights are found by training the classifier on a provided dataset. The final classification probability is given as a weighted sum of the results of every individual feature classifier. Before deep convolutional networks, these kinds of classifiers often had the best performance in image recognition problems. However, they do need some care in the selection of the Haar feature basis.
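
To make the kernel-image product concrete, the sketch below evaluates a single two-rectangle Haar feature using an integral image (summed-area table), which lets any rectangle sum be computed with four lookups. It assumes an integer greyscale image. The function names and the threshold of 100 are hypothetical, and in a trained classifier the threshold would be learned rather than chosen by hand.

```python
import numpy as np


def integral_image(img: np.ndarray) -> np.ndarray:
    """Summed-area table with a zero row/column prepended for easy indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii: np.ndarray, x: int, y: int, w: int, h: int) -> int:
    """Sum of pixels in the rectangle with top-left (x, y), using four lookups."""
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])


def haar_two_rect_horizontal(ii: np.ndarray, x: int, y: int, w: int, h: int) -> int:
    """Left half minus right half: responds to vertical edges."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


def weak_classifier(feature_value: int, threshold: int) -> bool:
    """Pass/fail decision for one feature (threshold is hypothetical here)."""
    return feature_value > threshold


# Toy image: bright left half, dark right half (a vertical edge).
img = np.zeros((6, 6), dtype=np.int64)
img[:, :3] = 10
ii = integral_image(img)
value = haar_two_rect_horizontal(ii, 0, 0, 6, 6)
print(value, weak_classifier(value, threshold=100))
```

A full classifier would combine many such weak decisions, each with its own learned weight, into the weighted sum described above.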

###### Convolutional neural networks (one-shot/sliding window)

(Deep) convolutional neural networks are a very general tool for image classification. A basic ConvNet classifier is formed of two stages: a convolutional stage and a classification stage. The convolutional stage applies a set of convolution kernels to the input image, for the purposes of feature extraction and/or dimensionality reduction. In this stage various features of the input data can be exaggerated or minimised by the choice of kernels, which are learned during training. The classification stage proceeds as a normal neural network classifier, taking the output of the convolution layers as input.
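
The two stages can be sketched as a single forward pass in plain NumPy. This is a toy with random, untrained weights, so the output probabilities are meaningless. It only shows how data flows from the convolutional stage into the classification stage. The layer sizes, the ReLU and max-pooling choices, and all names are assumptions made for the example.

```python
import numpy as np


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 2D cross-correlation (what deep-learning libraries call convolution)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = (image[y:y + kh, x:x + kw] * kernel).sum()
    return out


def max_pool2x2(fmap: np.ndarray) -> np.ndarray:
    """Downsample by taking the maximum over non-overlapping 2x2 blocks."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def forward(image, kernels, dense_w, dense_b):
    # Convolutional stage: one feature map per kernel, ReLU, then pooling.
    maps = [max_pool2x2(np.maximum(conv2d_valid(image, k), 0.0)) for k in kernels]
    features = np.concatenate([m.ravel() for m in maps])
    # Classification stage: a single dense layer followed by softmax.
    return softmax(dense_w @ features + dense_b)


rng = np.random.default_rng(0)
image = rng.random((12, 12))                  # stand-in for one image window
kernels = rng.standard_normal((4, 3, 3))      # 4 kernels of size 3x3
n_features = 4 * 5 * 5                        # each map: 12x12 -> 10x10 -> 5x5
dense_w = rng.standard_normal((3, n_features)) * 0.1   # 3 hypothetical classes
dense_b = np.zeros(3)
probs = forward(image, kernels, dense_w, dense_b)
print(probs, probs.sum())
```

Training would adjust both the kernels and the dense weights, and this is where the large parameter count and the appetite for data come from.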

Variants of these classifiers are the current state-of-the-art in object recognition. They are able to achieve excellent generalisation, at the cost of an exceptionally large parameter space and, consequently, an enormous thirst for data.

## Summary

There are many possible approaches to the problem. Two jump out at me as being the most promising:

1. Convolutional neural networks
    - Pros: Best in class performance, robust.
    - Cons: Require a very large dataset, long training times.


2. Keypoint matching
    - Pros: Fast to set up, doesn't require as large a dataset as the CNN.
    - Cons: Can't handle non-affine transformations (e.g. logos on warped t-shirts), involves a potentially error-prone matching step, and may be quite sensitive to noise.

In terms of training a convolutional network, the small number of images available in the BelgaLogos dataset appears to be a serious problem. In the literature this is typically handled by generating new 'simulated' training data samples. For example:

- [Su, Zhu, Gong 2016](https://arxiv.org/pdf/1612.09322.pdf) take a 'canonical' form of the logo, apply random affine transformations to it, and superimpose the transformed logo on a random image. This is then used to train a deep ConvNet.
- The [FlickrBelgaLogos](http://www-sop.inria.fr/members/Alexis.Joly/BelgaLogos/FlickrBelgaLogos.html) dataset takes logos from the original BelgaLogos dataset (the images within annotated bounding boxes) and pastes them on top of random images from Flickr. This procedure for generating more test samples has also been used in the literature.
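
A rough sketch of the general idea behind both approaches: paste a logo onto a background at a random position and record its bounding box. This is not the pipeline of either reference. It only applies random scaling and translation (a subset of the affine transformations mentioned above), uses nearest-neighbour resizing, and all names, the scale range, and the stand-in arrays are assumptions made for the example.

```python
import numpy as np


def resize_nearest(img: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Nearest-neighbour resize using integer index maps."""
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows][:, cols]


def synthesise(background: np.ndarray, logo: np.ndarray,
               rng: np.random.Generator, scale_range=(0.5, 1.5)):
    """Paste a randomly scaled logo at a random position.

    Returns the composite image and the logo bounding box (x, y, w, h).
    """
    scale = rng.uniform(*scale_range)
    # Clamp the scaled size so the logo always fits inside the background.
    h = max(1, min(background.shape[0], int(round(logo.shape[0] * scale))))
    w = max(1, min(background.shape[1], int(round(logo.shape[1] * scale))))
    patch = resize_nearest(logo, h, w)
    y = int(rng.integers(0, background.shape[0] - h + 1))
    x = int(rng.integers(0, background.shape[1] - w + 1))
    out = background.copy()
    out[y:y + h, x:x + w] = patch
    return out, (x, y, w, h)


rng = np.random.default_rng(0)
background = np.zeros((64, 64), dtype=np.uint8)   # stand-in for a random photo
logo = np.full((16, 24), 255, dtype=np.uint8)     # stand-in for a cropped logo
sample, box = synthesise(background, logo, rng)
print(sample.shape, box)
```

The returned bounding box gives the location label for free, which is what makes this kind of synthetic data attractive for training a detector.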

Both of these methods have been used to good effect in the literature, but they do add considerable complication to the problem. In order to maximise the likelihood of getting an acceptable solution in the provided timeframe, I'll begin by running a small feasibility analysis for keypoint matching.