# Data
Exclude the labels "horses" and "mountains" because they have too few examples, following the publisher's suggestion [MSRCv2].

## Preprocess
* Resize to 256x256 (required by the model), following the [AlexNet] procedure:
    * resize square images to 256x256. If an image is rectangular, rescale it so that the shorter edge is 256px, then **crop** the
    central 256x256 patch
        * *Cropping might eliminate valuable information*
    * subtract from the pixel intensities on each R, G, B channel that channel's mean intensity
        * *should we scale to [-1,1] interval?*

* Store data in Python pickle format. It makes data transfer to the GPU much faster [[?]](#Ref).
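
A minimal sketch of the preprocessing steps above, assuming NumPy and images held as an `(n, height, width, 3)` array. The helper names `resize_then_crop_box` and `subtract_channel_means` are hypothetical, and the actual pixel resampling is left to whatever image library ends up being used; the first helper only works out the geometry.

```python
import pickle

import numpy as np


def resize_then_crop_box(width, height, target=256):
    """Scale so the shorter edge equals `target`, then locate the central patch.

    Returns the resized (width, height) and the crop box (left, top, right, bottom).
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


def subtract_channel_means(images):
    """Centre each of the R, G, B channels on its mean over the whole dataset."""
    data = images.astype(np.float32)
    means = data.mean(axis=(0, 1, 2))  # one mean per colour channel
    return data - means, means


# Toy usage: random data standing in for the resized, cropped images.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 256, 256, 3), dtype=np.uint8)
centred, means = subtract_channel_means(images)

# Serialise with pickle; the means are kept so test images can be centred identically.
payload = pickle.dumps({"images": centred, "means": means},
                       protocol=pickle.HIGHEST_PROTOCOL)
```

For a landscape image of, say, 320x213 pixels the helper rescales to 385x256 and keeps columns 64 to 320, so roughly a third of the width is discarded. That is the information loss flagged in the cropping note.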

Resized raw data size: 256 x 256 x 3 x 591 bytes $\approx$ 110MB. So we have only 591 examples, each with 256 x 256 = 65,536 pixels (196,608 values over the three channels), which is a clear indication of possible over-fitting with deep nets.


## Augmentation
When training, we also perform data augmentation to avoid over-fitting. There are 2 forms of augmentation:
* label-preserving transformation [e.g. AlexNet, 25, 4, 5](#Ref)
    * translation: randomly extract 224x224 patches from the original images
    * horizontal flip: applied to both the original and the generated images
* altering intensities of the RGB channels [AlexNet](#Ref)
    * *"This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination."* [AlexNet](#Ref)
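
A rough sketch of the two augmentation forms above, assuming NumPy. The function names and the `sigma=0.1` default are my assumptions; the colour shift follows the PCA-based scheme as I read it in the AlexNet paper (offsets along the principal components of the RGB values, scaled by the eigenvalues and a Gaussian draw), so treat it as an approximation rather than a faithful reimplementation.

```python
import numpy as np


def random_crop_and_flip(image, rng, patch=224):
    """Random patch x patch crop of an (H, W, 3) image, mirrored half of the time."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - patch + 1)
    left = rng.integers(0, width - patch + 1)
    crop = image[top:top + patch, left:left + patch]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]  # horizontal flip
    return crop


def pca_colour_shift(images, rng, sigma=0.1):
    """Add one random RGB offset per image along the principal colour components.

    images: (n, H, W, 3) array. Returns a float64 array of the same shape.
    """
    pixels = images.reshape(-1, 3).astype(np.float64)
    cov = np.cov(pixels, rowvar=False)           # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvectors are the columns
    alphas = rng.normal(0.0, sigma, size=(images.shape[0], 3))
    shifts = (alphas * eigvals) @ eigvecs.T      # one RGB offset per image
    return images.astype(np.float64) + shifts[:, None, None, :]
```

Both functions take a `numpy.random.Generator` (for example `np.random.default_rng(0)`) so that the augmentation can be reproduced from a seed.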
    
Prospective augmented data size: on the order of $\#$ images x $\#$ random patches x training split ratio $\approx$ 591 x 2000 x 0.7, i.e. ~830,000 images. The number of random patches is taken from [AlexNet](#Ref) as a rough guideline. I might generate only 200 random patches per image because of hardware limitations.
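
The estimate above can be spelled out as a quick calculation. This assumes the count is images x patches per image x training fraction (with the 591 images from the raw-size estimate), and that each 224x224x3 patch is stored as 8-bit values; the storage format and the function name are my assumptions.

```python
def augmented_size(n_images, patches_per_image, train_ratio, patch_side=224, channels=3):
    """Return (number of augmented training images, bytes needed at 1 byte per value)."""
    count = round(n_images * patches_per_image * train_ratio)
    return count, count * patch_side * patch_side * channels


for patches in (2000, 200):
    count, n_bytes = augmented_size(591, patches, 0.7)
    print(f"{patches} patches/image: {count:,} images, {n_bytes / 2**30:.1f} GiB")
```

With 2000 patches per image the set would run to over a hundred GiB if materialised up front, which is the hardware concern; generating patches on the fly during training, or dropping to 200 patches, keeps it manageable.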


# Assessment Measures
* Bipartition measure: on classification results (recall, precision, F1)
    * Micro-averaged
    * Macro-averaged
* Ranking measures [[1]](#Ref): on the ranked list of labels. One of:
    * Rank loss: Section 3.1 of [1]
    * One-Error: evaluates whether the top-ranked label (the one with the highest score) is a positive label or not
    * Coverage: measures, on average, how far one needs to go down the ranked list of labels to achieve a recall of 100%
    * Average Precision (AP): measures the average fraction of relevant labels among the labels ranked at or above each relevant label in the ranked list
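
A plain-Python sketch of the measures listed above, under my reading of the usual multi-label definitions (they should be checked against Section 3.1 of [1] before use). Assumptions: macro F1 is the mean of per-label F1 scores, coverage is counted from zero, ties count as ranking errors, and every example has at least one positive and one negative label. All function names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts; an empty denominator scores 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def bipartition_scores(targets, predictions):
    """targets, predictions: n rows of k binary labels. Returns (micro, macro) triples."""
    k = len(targets[0])
    counts = []
    for j in range(k):
        pairs = [(t[j], y[j]) for t, y in zip(targets, predictions)]
        # (true positives, false positives, false negatives) for label j
        counts.append((pairs.count((1, 1)), pairs.count((0, 1)), pairs.count((1, 0))))
    micro = precision_recall_f1(*[sum(c[i] for c in counts) for i in range(3)])
    per_label = [precision_recall_f1(*c) for c in counts]
    macro = tuple(sum(s[i] for s in per_label) / k for i in range(3))
    return micro, macro


def ranking_scores(targets, scores):
    """One-error, coverage, rank loss and average precision, averaged over examples."""
    totals = {"one_error": 0.0, "coverage": 0.0, "rank_loss": 0.0, "avg_precision": 0.0}
    for t, s in zip(targets, scores):
        order = sorted(range(len(s)), key=lambda j: -s[j])  # best-scored label first
        rank = {label: position + 1 for position, label in enumerate(order)}
        pos = [j for j, v in enumerate(t) if v == 1]
        neg = [j for j, v in enumerate(t) if v == 0]
        totals["one_error"] += t[order[0]] == 0
        totals["coverage"] += max(rank[j] for j in pos) - 1
        totals["rank_loss"] += (
            sum(s[p] <= s[q] for p in pos for q in neg) / (len(pos) * len(neg))
        )
        totals["avg_precision"] += sum(
            sum(rank[i] <= rank[j] for i in pos) / rank[j] for j in pos
        ) / len(pos)
    return {name: value / len(targets) for name, value in totals.items()}
```

Micro-averaging pools the counts over all labels before computing the scores, so frequent labels dominate; macro-averaging scores each label separately and then averages, so rare labels weigh as much as common ones.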

# Modelling
**Output (y) and target (t) format**
* t = [0 0 1 1 ... ] in $\{0,1\}^{k}$
* y $\in \{0,1\}^{k}$, whose elements are binary predictions obtained by thresholding the softmax probabilities of the k classes [1]
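
To make the output format concrete, here is a small NumPy sketch that turns raw network outputs into a binary label vector. The fixed `threshold` argument is a placeholder of mine; how the threshold is actually chosen (or learned, as in [1]) is still open.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def predict_labels(logits, threshold):
    """Binary label matrix y: 1 where the class probability reaches the threshold."""
    return (softmax(logits) >= threshold).astype(int)


# Two classes share almost all the probability mass, the third gets almost none.
y = predict_labels(np.array([[2.0, 2.0, -5.0]]), threshold=0.3)
```

Because softmax probabilities sum to 1, the threshold has to sit well below 0.5 for more than one label to be switched on, which is one reason a learned or per-example threshold may work better than a fixed one.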

**Cost function** 

Cross-entropy (CE): CE is reported to perform better on textual data and to be faster to learn [[1]](#Ref). Note that no similar claim has been found for image data yet, and **rank-loss** is also a common choice of cost function.

<img src="http://ufldl.stanford.edu/wiki/images/math/7/6/3/7634eb3b08dc003aa4591a95824d4fbd.png">[[2]](http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression#Cost_Function)
* k: # classes
* m: # examples

An ***L2-regulariser*** is also needed.
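
A minimal NumPy sketch of the cost with the L2 term added, for a single linear layer followed by softmax. The names are hypothetical, the `weight_decay / 2` scaling of the penalty is one common convention rather than something fixed in these notes, and allowing several 1s per target row is my extension for the multi-label case.

```python
import numpy as np


def cross_entropy_with_l2(weights, inputs, targets, weight_decay):
    """Mean softmax cross-entropy over m examples plus an L2 penalty on the weights.

    weights: (d, k), inputs: (m, d), targets: (m, k) array of 0/1 indicators.
    """
    logits = inputs @ weights
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    m = inputs.shape[0]
    data_term = -(targets * log_probs).sum() / m
    penalty = 0.5 * weight_decay * (weights ** 2).sum()
    return data_term + penalty
```

With all-zero weights every class gets probability 1/k, so for one-hot targets the cost is log(k), which is a handy sanity check before training.
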
## Model Sketch 
* Architecture: as a starting point, we will consider architecture of existing models which worked on datasets of similar size (CIFAR?)
    * Baseline model: 1 input, 2 hidden, 1 output (softmax) layer
        * 800 neurons / hidden layer
        * 0 or 20% drop-out at input layer, 50% drop-out on the hidden layers
        * implemented with `nolearn` and `Lasagne`: out-of-the-box, not optimised for performance
    * ConvNet model: 1 input, (?) Conv, (?) pooling, 1 output (softmax) layer
        * allows deeper architecture with efficient learning time
        * number of neurons and layers: **to be experimented**
        * (most likely) 20% drop-out at input layer, 50% drop-out on the hidden layers
        * implemented with `pylearn2`: optimised for performance
* Threshold label predictor on softmax layer [[1]](#Ref)
* ReLU units (neurons) on all layers; networks with ReLUs can be trained faster than those with sigmoid or tanh non-linearities [[?]](#Ref)
* AdaGrad method for adaptive learning rate [[?]](#Ref)

Dropout [[?]](#Ref) + data augmentation for regularisation as mentioned above.
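
For orientation, a framework-free NumPy sketch of the baseline model's forward pass and one AdaGrad update. This is not the `nolearn`/`Lasagne` implementation: the function names, the weight initialisation and the learning rate are placeholders of mine, and the dropout shown is the "inverted" variant (rescaling at training time), which is my choice for simplicity.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def dropout(x, rate, rng, train):
    """Inverted dropout: zero a fraction `rate` of the units, rescale the rest."""
    if not train or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


def init_params(n_in, n_hidden, n_out, rng):
    """Weights and biases for input -> hidden -> hidden -> output."""
    sizes = [n_in, n_hidden, n_hidden, n_out]
    return [(rng.normal(0.0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def forward(params, x, rng, train=True, input_drop=0.2, hidden_drop=0.5):
    """Class probabilities from two ReLU hidden layers and a softmax output."""
    h = dropout(x, input_drop, rng, train)
    for w, b in params[:-1]:
        h = dropout(relu(h @ w + b), hidden_drop, rng, train)
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def adagrad_step(param, grad, cache, lr=0.01, eps=1e-8):
    """AdaGrad: divide the step by the root of the accumulated squared gradients."""
    cache = cache + grad ** 2
    return param - lr * grad / (np.sqrt(cache) + eps), cache
```

A call such as `init_params(n_in, 800, k, rng)` gives the 800-unit hidden layers from the list above. Passing `train=False` to `forward` switches dropout off for evaluation.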

# Fine tuning 
 fine-tuning hyper-parameters and/or network architecture ***(ARGHH!!)***

# Results
compare to state-of-the-art performance?

# Extension
* consider rescaling images to 256x256 regardless of aspect ratio. This assumes that object aspect ratios have little effect on labelling.
* experiment on other datasets (MIML, Corel 5k/10k)

# Ref
[1] Nam2014

[AlexNet] Convolutional Neural Net model that broke state-of-the-art records, by a significant margin, in ImageNet's image recognition (1 label per image) competition in 2012.

[MSRCv2] MSRCv2 dataset

[?] refs to be added later