AlexNet

Introduction

  1. The network was trained on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions, achieving results considerably better than the previous state of the art (top-5 test error of 15.3% in ILSVRC-2012).
  2. The network has 60 million parameters and 650,000 neurons.
  3. Consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully-connected layers with a final 1000-way softmax.
  4. "Dropout" was used to prevent overfitting.
  5. Removing any of the convolutional layers gave inferior performance; greater depth was limited only by the available GPU memory and training time.

Dataset

  1. ImageNet contains images of varying resolution, so each image was rescaled so that its shorter side was 256 pixels and the central 256 by 256 patch was cropped out; see the sketch below.
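
A minimal sketch of that resize-and-crop step using Pillow; the function name and the bilinear resampling choice are illustrative, not from the paper:

```python
from PIL import Image

def to_256(path):
    """Rescale so the shorter side is 256 px, then take the central 256 x 256 crop."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 256 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2
    return img.crop((left, top, left + 256, top + 256))
```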

Architecture

  1. ReLU non-linearity was used in place of tanh; in the paper's CIFAR-10 comparison, a network with ReLUs reached the same training error about six times faster than an equivalent network with tanh units. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

  2. The architecture (diagram omitted here): five convolutional layers followed by three fully-connected layers and a 1000-way softmax, as sketched in code after this list.

  3. Trained across two GTX 580 GPUs with the layers split between them; this was mainly a workaround for limited GPU memory rather than a central idea.

  4. Local Response Normalization was applied after the ReLU in the first and second convolutional layers (k = 2, n = 5, alpha = 1e-4, beta = 0.75); it is rarely used in later architectures.

  5. Overlapping Pooling: max-pooling with a 3 by 3 window and stride 2, so that neighbouring pooling windows overlap. This reduced the top-1 and top-5 error rates by 0.4% and 0.3% compared to non-overlapping 2 by 2 pooling, and made the network slightly harder to overfit.

  6. The input size in the architecture chart should be 227 by 227 instead of 224 by 224, as pointed out by Andrej Karpathy in the CS231n course: with an 11 by 11 kernel and stride 4, a 224 by 224 input does not yield the stated 55 by 55 first-layer output, while a 227 by 227 input does ((227 - 11) / 4 + 1 = 55).
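
A single-GPU sketch of the layer stack in PyTorch, assuming the 227 by 227 input from point 6 and ignoring the original two-GPU split (so the paper's restricted cross-GPU connectivity is not reproduced exactly):

```python
import torch.nn as nn

# Five convolutional layers (ReLU after each), LRN after conv1 and conv2,
# overlapping 3x3 / stride-2 max-pooling after conv1, conv2 and conv5,
# then three fully-connected layers with dropout on the first two.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),                 # 55x55 -> 27x27

    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),                 # 27x27 -> 13x13

    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                 # 13x13 -> 6x6

    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),
)
```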

Reducing Overfitting

Data Augmentation

  1. Generating image translations and horizontal reflections: random patches (224 by 224 in the paper, 227 by 227 per the correction above) and their horizontal flips are extracted from the 256 by 256 training images.
  2. Altering the intensities of the RGB channels in training images: PCA is performed on the RGB pixel values of the training set, and random multiples of the principal components (scaled by Gaussians with standard deviation 0.1) are added to each image; see the sketch after this list.
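
A minimal NumPy sketch of the RGB-intensity augmentation, with the simplifying assumption that PCA is computed per image rather than over the whole training set as in the paper; the function name is illustrative:

```python
import numpy as np

def pca_color_augment(image, std=0.1, rng=np.random):
    """Add random multiples of the RGB principal components to every pixel.

    `image` is an H x W x 3 float array in [0, 1]; the eigen-decomposition is
    computed from this single image for simplicity.
    """
    flat = image.reshape(-1, 3)
    cov = np.cov(flat, rowvar=False)            # 3 x 3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # columns of eigvecs are p_i
    alphas = rng.normal(0.0, std, size=3)       # alpha_i ~ N(0, 0.1)
    delta = eigvecs @ (alphas * eigvals)        # sum_i alpha_i * lambda_i * p_i
    return np.clip(image + delta, 0.0, 1.0)
```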

Dropout

  1. Dropout is used in the first two fully-connected layers with probability 0.5. Without dropout, the network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
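
A toy NumPy sketch of this scheme as described in the paper (drop units at training time, scale outputs by 1 − p at test time), as opposed to the inverted dropout used in modern libraries:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=np.random):
    """Zero each activation with probability p during training;
    at test time keep all units but scale their outputs by (1 - p)."""
    if train:
        mask = rng.random(x.shape) > p
        return x * mask
    return x * (1.0 - p)
```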

Learning Details

  1. Models were trained using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005.

  2. Update rule (SGD with momentum 0.9, weight decay 0.0005, learning rate ε; see the sketch after this list):

    v_{i+1} = 0.9 · v_i − 0.0005 · ε · w_i − ε · ⟨∂L/∂w | w_i⟩_{D_i}
    w_{i+1} = w_i + v_{i+1}

    where ⟨∂L/∂w | w_i⟩_{D_i} is the gradient of the loss with respect to w, evaluated at w_i and averaged over the i-th batch D_i.

  3. An equal learning rate was used for all layers, which was adjusted manually throughout training. The heuristic was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination.
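
A short sketch of one such update step, with the manual learning-rate schedule noted as a comment; variable names are illustrative:

```python
import numpy as np

def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One step of the update rule above:
    v <- 0.9 * v - 0.0005 * lr * w - lr * grad,  then  w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

# Toy usage on a single weight vector.
w, v, lr = np.zeros(3), np.zeros(3), 0.01
w, v = sgd_step(w, v, grad=np.array([0.1, -0.2, 0.3]), lr=lr)

# Manual schedule: start at lr = 0.01 and divide lr by 10 whenever the
# validation error stops improving (done by hand in the paper, three times).
```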