#Machine learning for image classification on CIFAR10 dataset
##Problem description
The task of image classification is to determine, given an image and a set of predefined classes, which class the image belongs to. As is the case with CIFAR10, we are given a labeled training set and strive to achieve the best performance on a test set, but we are only allowed to use the former to tune our algorithm. The score is computed as the percentage of correctly labeled test examples (their true classification is available for grading). This problem has immense practical applicability, yet it is hard to tackle with hand-crafted algorithms. On the other hand, the machine learning approach has proved to be a resounding success, with state-of-the-art solutions surpassing human performance in some cases ([recent results](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#43494641522d3130), [human accuracy quoted as 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/)).
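
As a minimal sketch of the scoring rule described above, the accuracy can be computed as follows. The function name and the toy labels are made up for illustration and do not come from the project's code.

```python
def accuracy_percent(predicted_labels, true_labels):
    """Return the percentage of examples whose predicted label is correct."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("both label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one example is required")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * correct / len(true_labels)


# Toy example: four test images with made-up class indices (0..9).
toy_score = accuracy_percent([3, 8, 8, 0], [3, 8, 1, 0])
print(toy_score)  # 75.0
```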

##CIFAR10 dataset
The dataset consists of 60000 images belonging to 10 classes, evenly distributed. They are split into training and test sets of 50000 and 10000 images respectively. The images themselves are 32x32 pixels with 3 RGB color channels.

![sample CIFAR10 images](http://i.imgur.com/al4cmJF.png)

##My results
Using the Convolutional Neural Network described below and some basic data augmentation I've been able to achieve **87.37%** accuracy on the test set. The network was trained for approx. 6 hours on a GPU.

* The source code is available here: https://github.com/piotder/nn_project
* The test results listing: https://github.com/piotder/nn_project/blob/master/results-8-23-28.txt

##The solution overview
###Convolutional Neural Networks (CNNs)
A CNN is a type of neural network containing *convolutional layers*. A neuron in such a layer is only connected to a small, spatially contiguous area of the previous layer, as opposed to the full connectivity of dense layers (also called fully-connected layers). A convolutional layer is a collection of *feature maps*, each being the result of applying a certain filter to the previous layer. This means that a feature map is characterised by the set of weights describing its filter, which is usually small. This is advantageous both in terms of the network's memory footprint and its resilience to overfitting.
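
To make the memory argument concrete, here is a rough parameter count for one convolutional layer compared with a hypothetical dense layer producing the same number of outputs. It is a sketch that assumes one bias per feature map and one bias per dense neuron; the helper names are illustrative.

```python
def conv_params(in_maps, out_maps, filter_width, filter_height):
    """Weights of a convolutional layer: every output map owns one small
    filter per input map, plus (assumed) a single shared bias."""
    return out_maps * (in_maps * filter_width * filter_height + 1)


def dense_params(n_inputs, n_neurons):
    """Weights of a dense layer: every neuron sees every input, plus a bias."""
    return n_neurons * (n_inputs + 1)


# 3x3 convolution turning a 32x32 RGB image into 128 feature maps of 32x32.
conv_count = conv_params(3, 128, 3, 3)

# Hypothetical dense layer mapping the same input to the same number of outputs.
dense_count = dense_params(3 * 32 * 32, 128 * 32 * 32)

print(conv_count)   # 3584
print(dense_count)  # 402784256
```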

CNNs are especially useful for image classification, because they utilise a natural property of images: that patterns are local and translation invariant.

One downside of convolutional layers is that they are more computationally demanding than dense layers of the same size ([link](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.137.482&rep=rep1&type=pdf)). However, their computation can be greatly sped up by the use of a GPU, which allows for efficient training of sizeable networks.

###Types of layers used

Layer type  | Description | Parameters | Remarks
------------|-------------|------------|--------
Convolution | (described above) | (*number of feature maps*) x (*filter width*) x (*filter height*) | Other parameters could include padding and stride, but in my solutions these two always had the same setting (padding = (filter size - 1) / 2; stride = 1), so I'm omitting them.
Dense | Simply applies an affine transformation on outputs from all neurons in the previous layer | (*number of neurons*) | 
Leaky rectify | An activation layer, very similar to ReLU, but with a small non-zero gradient on negative input. [link](https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29) | (*leakiness* = 0.01) |
Softmax | Classification layer - transforms input scores into probabilities of classes | |
Pool | Used to downsample a feature map. This is performed in conjunction with convolution layers. This layer looks at a couple of local results in a feature map and selects the maximum. Intuitively it finds the most dominant feature in a given area. | (*pool width*) x (*pool height*) x (*stride*) | Instead of the maximum, other functions can be used, for example the average.
Dropout | [Dropout](http://cs231n.github.io/neural-networks-2/%23reg) is a simple and effective regularization technique. During training it randomly zeroes each of its inputs with the given probability, which approximates training a large ensemble of sub-networks with shared weights and averaging their predictions. | (*probability = 0.5*) | 
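
The parameter-free layers from the table can be sketched in plain Python as below. This is only an illustration of the operations on a single feature map or score vector, not the Lasagne implementation, and the function names are hypothetical.

```python
import math


def leaky_rectify(x, leakiness=0.01):
    """Identity for positive input, a small slope for negative input."""
    return x if x > 0 else leakiness * x


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool(feature_map, size=2, stride=2):
    """Downsample a 2D feature map by taking the maximum of each window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + dr][c + dc]
                for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


toy_map = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
pooled = max_pool(toy_map)
print(pooled)  # [[6, 8], [14, 16]]
```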

###Network architecture
* **Convolution: 128 maps, size=3x3** --> Leaky rectify
* **Convolution: 128 maps, size=3x3** --> Leaky rectify
* **Pool: size=2x2, stride=2**
* **Convolution: 256 maps, size=3x3** --> Leaky rectify
* **Convolution: 256 maps, size=3x3** --> Leaky rectify
* **Pool: size=2x2, stride=2**
* **Convolution: 512 maps, size=3x3** --> Leaky rectify
* **Convolution: 512 maps, size=3x3** --> Leaky rectify 
* **Pool: size=2x2, stride=2** --> Dropout: p=0.5
* **Dense: size=2048** --> Leaky rectify --> Dropout: p=0.5
* **Dense: size=2048** --> Leaky rectify --> Dropout: p=0.5
* **Dense: size=10** --> Softmax
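
The following sketch traces how the feature map shape changes through the three convolution groups listed above, assuming the padding and stride settings given in the layer table (padding = (filter size - 1) / 2, stride = 1 for convolutions). The helper names are illustrative.

```python
def conv_output_size(size, filter_size=3, stride=1):
    """Spatial size after a convolution padded with (filter_size - 1) / 2."""
    padding = (filter_size - 1) // 2
    return (size + 2 * padding - filter_size) // stride + 1


def pool_output_size(size, pool=2, stride=2):
    """Spatial size after an unpadded pooling layer."""
    return (size - pool) // stride + 1


size, maps = 32, 3  # CIFAR10 input: 32x32 pixels, 3 color channels
trace = []
for group_maps in (128, 256, 512):
    for _ in range(2):  # two convolutions per group keep the spatial size
        size = conv_output_size(size)
        maps = group_maps
    size = pool_output_size(size)  # the pool layer halves width and height
    trace.append((maps, size, size))

flattened = maps * size * size  # number of inputs seen by the first dense layer

print(trace)      # [(128, 16, 16), (256, 8, 8), (512, 4, 4)]
print(flattened)  # 8192
```

Under these assumptions the first dense layer receives 8192 inputs, which is 4 times its own size of 2048.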

##Rationale behind some of the design choices
**DISCLAIMER:** *I had neither the time nor the resources to perform a systematic parameter selection. Many of my decisions were based on the performance of a single run on a small subset of the data. Take this with a grain of salt.*

In general, my approach followed the design of the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) as presented [here](http://cs231n.github.io/convolutional-networks/%23case). One of the core features of this architecture is a stack of convolution groups (each consisting of multiple conv layers followed by a pool layer). Intuitively, it allows the network to recognize patterns of increasing size and complexity, starting with very small ones (3x3). Pooling is then performed to decrease the layer size, hopefully keeping the useful information and discarding the noise, which in turn allows the next layer to look for more complex patterns.

Some important choices are to be made here:
* The number of Conv-Pool groups. Each pool layer halves the width and height of subsequent layers, so there is a bound on the number of these. Also, performing a convolution with a filter roughly the size of its input is not much different from using a fully-connected layer. On the other hand, a network that is too shallow might miss out on some bigger patterns. In my tests I found that 3 groups worked better than 2.
* Convolution filter sizes. The authors of the [original network](http://arxiv.org/pdf/1409.1556.pdf) recommend stacking smaller filters as opposed to using a single, bigger one. Also, using bigger filters might lead to the topmost convolutional layers having a receptive field bigger than the image, which intuitively doesn't offer any advantages (there is a discussion of this [here](https://kaggle2.blob.core.windows.net/forum-message-attachments/69182/2287/A%20practical%20theory%20for%20designing%20very%20deep%20convolutional%20neural%20networks.pdf?sv=2012-02-12&se=2016-02-23T18%3A07%3A34Z&sr=b&sp=r&sig=3Ke128OwAXSzVQPXR9EQhonRrKT%2FeTgJj%2FAbGS5w4zI%3D)). I tested 5x5 layers during a couple of runs and got no improvement and a much longer computation time.
* Number of feature maps. In this regard I noticed a clear relation: more = better. Of course, this comes with an increased computational burden and I would expect the benefits to become progressively smaller. As in the original work, each subsequent group of convolution layers has twice the depth of the previous one. This distributes the computational load more evenly (early layers are spatially bigger, so their convolutions are more costly). My intuition for why it makes sense to assign more depth to the upper layers is that the number of possible patterns grows exponentially with the size of the receptive field.
* Pool layers. I didn't really experiment with those. [Here](http://cs231n.github.io/convolutional-networks/%23pool) it is claimed that there are only 2 settings used in practice.
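
The receptive field argument from the list above can be checked with a short calculation. This sketch uses the standard recurrence for receptive field width (each layer adds (filter size - 1) times the accumulated stride) and compares the 3x3 architecture with a hypothetical 5x5 variant of the same layout.

```python
def receptive_fields(layers):
    """Return the receptive field width (in input pixels) after each layer.

    `layers` is a list of (filter_size, stride) pairs, input side first.
    """
    field, jump = 1, 1  # jump = distance in input pixels between neighbours
    widths = []
    for filter_size, stride in layers:
        field += (filter_size - 1) * jump
        jump *= stride
        widths.append(field)
    return widths


def conv_pool_groups(filter_size, groups=3):
    """Two convolutions followed by a 2x2 pool with stride 2, repeated."""
    conv, pool = (filter_size, 1), (2, 2)
    return [conv, conv, pool] * groups


fields_3x3 = receptive_fields(conv_pool_groups(3))
fields_5x5 = receptive_fields(conv_pool_groups(5))

print(fields_3x3)  # [3, 5, 6, 10, 14, 16, 24, 32, 36]
print(fields_5x5)  # [5, 9, 10, 18, 26, 28, 44, 60, 64]
```

With 3x3 filters the last convolution layer sees a 32 pixel wide window, which matches the image size, while the 5x5 variant already exceeds it in the third group.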

Things to consider in regard to other layers:
* Dense layers. I didn't spend much time tuning them. My general feeling was that two layers worked better than one and that bigger layers led to faster training (and earlier stalling). A size of 4 times the depth of the last convolution layer gave better results than 2 times, but even larger layers caused overfitting.
* Leaky rectify. ReLU is viewed as a better activation function than sigmoid for deep networks ([link](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)). I didn't run a comparison between ReLU and Leaky ReLU for my network, but I don't expect it would make a difference.
* Dropout. A couple of tests suggested that it definitely helps, although I only tried removing all dropout at once. It might actually be beneficial to skip the dropout after the last convolution group. The intuition is that the convolutions perform pattern recognition, while the dense layers do the actual classifying, so it might be unnecessary to deprive the first dense layer of some of its inputs.

Another important question is the proper weight initialization in layers that require it:
* Convolution layers. I followed the advice of [this paper](http://arxiv.org/pdf/1502.01852v1.pdf), adjusting it to take into account the leakiness of the rectifiers used, as suggested by the [Lasagne documentation](http://lasagne.readthedocs.org/en/latest/modules/init.html%23lasagne.init.He).
* Dense layers. I tried the classic recommendation of [Glorot and Bengio](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf), but got better results with [orthogonal initialization](http://lasagne.readthedocs.org/en/latest/modules/init.html%23lasagne.init.Orthogonal).
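
As a rough sketch of the convolution layer initialization, the standard deviation from the cited paper with the leakiness correction (gain = sqrt(2 / (1 + leakiness^2))) can be computed as below. Whether the weights were drawn from a normal or a uniform distribution is not stated here, so the normal distribution, the seed and the helper name are assumptions.

```python
import math
import random


def he_std(fan_in, leakiness=0.0):
    """Standard deviation of He initialization for (leaky) rectifier units."""
    gain = math.sqrt(2.0 / (1.0 + leakiness ** 2))
    return gain * math.sqrt(1.0 / fan_in)


# Fan-in of one filter in a 3x3 convolution over 128 input feature maps.
fan_in = 128 * 3 * 3
std = he_std(fan_in, leakiness=0.01)

# Assumption: weights are sampled from a zero-mean normal distribution.
rng = random.Random(0)
filter_weights = [rng.gauss(0.0, std) for _ in range(fan_in)]

print(len(filter_weights), round(std, 4))
```

For a leakiness of 0.01 the correction is tiny: the gain differs from the plain ReLU value of sqrt(2) by well under 0.01%.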

##Learning algorithm
| | |
|-|-|
|**Algorithm**|Stochastic Gradient Descent|
|**Loss function**|Categorical cross entropy with regularization|
|**Regularization**|L2 norm with a factor of *1e-4*|
|**Training batch size**|*100*|
|**Number of epochs**|Determined with a patience expansion mechanism, at least *10*.|
|**Result**|Trained network uses the parameters from the epoch that achieved the best accuracy on the validation set.|
|**Learning rate**|Annealed after every epoch. Normally it decreases by a small value dependent on the number of examples seen by the network. When the network doesn't beat its best result in *5* consecutive epochs, the learning rate is halved.|
|**Parameter updates**|The algorithm uses Nesterov momentum with a coefficient *0.9*.|
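
The update rule and the halving step from the table can be sketched on a single scalar parameter. The Nesterov formulation below is the one commonly used in Lasagne-style code. The function names, the toy objective and the starting learning rate of 0.1 are assumptions, and the small per-epoch decay is left out because its exact form is not given.

```python
def nesterov_step(param, velocity, grad, learning_rate, momentum=0.9):
    """One SGD update with Nesterov momentum for a single parameter."""
    velocity = momentum * velocity - learning_rate * grad
    param = param + momentum * velocity - learning_rate * grad
    return param, velocity


def halve_on_plateau(learning_rate, epochs_since_best, patience=5):
    """Halve the learning rate after `patience` epochs without a new best."""
    if epochs_since_best >= patience:
        return learning_rate / 2.0
    return learning_rate


# Toy objective f(w) = 0.5 * (w - 3)^2 with gradient (w - 3), minimum at w = 3.
w, v = 0.0, 0.0
for _ in range(200):
    grad = w - 3.0
    w, v = nesterov_step(w, v, grad, learning_rate=0.1)

print(round(w, 4))  # 3.0
```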

###Comments
I tried regularization factors of 1e-4, 1e-5 and 1e-6 but didn't register a significant difference. I also considered several ways of adapting the learning rate. My annealing and early stopping strategies may be a little too aggressive and impatient, but the speed of convergence was also very important. An improvement that I considered implementing was a momentum schedule: some sources suggest that starting with a low value and increasing it over time is beneficial. I fiddled with RMSProp for a while, but dropped the idea after discouraging early results.

##Data augmentation
One transformation that my solution applies to the images is affinely scaling the [0..255] RGB pixel values to the [-1..1] range. Since I was using it from the get-go, I didn't measure its impact.
Beyond transforming the data itself, I extended the dataset with horizontal flips of the training images. This yielded a great improvement, which suggests that adding more such operations could improve the result further. Other ideas for enriching the dataset include translations and rotations of existing images.
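
Both steps can be sketched in plain Python as below. For simplicity the sketch assumes an image is stored as a list of rows, each row a list of (R, G, B) tuples, which is not necessarily the layout used by Fuel. All function names are illustrative.

```python
def scale_pixel(value):
    """Affinely map a channel value from [0..255] to [-1..1]."""
    return value / 127.5 - 1.0


def scale_image(image):
    """Apply the scaling to every channel of every pixel."""
    return [[tuple(scale_pixel(c) for c in pixel) for pixel in row]
            for row in image]


def horizontal_flip(image):
    """Mirror the image left to right by reversing every row."""
    return [list(reversed(row)) for row in image]


def extend_with_flips(images, labels):
    """Append a flipped copy of every training image, keeping its label."""
    return images + [horizontal_flip(img) for img in images], labels + labels


# Toy 1x2 "image": a black pixel next to a white one, with a made-up label.
toy_image = [[(0, 0, 0), (255, 255, 255)]]
images, labels = extend_with_flips([toy_image], [7])

print(len(images), labels)     # 2 [7, 7]
print(scale_image(images[1]))  # [[(1.0, 1.0, 1.0), (-1.0, -1.0, -1.0)]]
```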

##Frameworks used
* Fuel: accessing, iterating and transforming the dataset
* Lasagne + Theano: constructing and training the network

My solution is also heavily based on [this Lasagne example](https://github.com/Lasagne/Lasagne%23example).