# Backbone networks

## Datasets

### ImageNet

The ImageNet dataset (see [the Web site](https://www.image-net.org/)) was the first dataset large enough to train large deep learning networks. Before ImageNet there were other, smaller datasets of a similar kind that inspired its collection:

 * Caltech-101 (101 categories, 10k images, by Caltech)
 * Pascal VOC (20 categories, 10k images, by EU project led by Oxford)

ImageNet (by Stanford University) included 1M images in 1k categories, annotated according to the [WordNet](https://en.wikipedia.org/wiki/WordNet) nouns.


## AlexNet

The main driving force that boosted research on deep learning was the AlexNet architecture, published in a NeurIPS 2012 paper. Many ideas presented in that paper are still valid, and networks of this kind, trained for semantic classification tasks, have become multipurpose tools for computer vision and audio analysis. AlexNet won the ILSVRC 2012 competition, and its groundbreaking results are still available on the ImageNet server:

 * https://www.image-net.org/challenges/LSVRC/2012/results.html

See also the original paper:

 * Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012): "ImageNet Classification with Deep Convolutional Neural Networks". In Proc. of the NeurIPS. URL: https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html

<div>
<img src="pictures/alexnet.png" width="800"/>
</div>

<div>
<img src="pictures/alexnet_parameters_table.png" width="400"/>
</div>

## VGGNet

In their paper, the Oxford group investigated various strategies for building a better backbone network for image classification and proposed VGGNet, which is in many ways a simplified version of AlexNet.

 * Karen Simonyan, Andrew Zisserman (2015): "Very Deep Convolutional Networks for Large-Scale Image Recognition". In Proc. of the ICLR. URL: https://arxiv.org/abs/1409.1556

<div>
<img src="pictures/vggnet_table.png" width="600"/>
</div>
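
One design choice argued for in the paper cited above is to build the network from small 3×3 convolution filters only. The short sketch below works through that arithmetic: three stacked 3×3 layers cover the same 7×7 receptive field as one 7×7 layer but need fewer weights. The helper names and the channel count `C = 256` are assumptions made for illustration, and biases are ignored.

```python
def receptive_field(num_layers: int, kernel: int) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(num_layers: int, kernel: int, channels: int) -> int:
    """Weight count (biases ignored) when every layer maps `channels` to `channels`."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # example channel count, chosen only for illustration

stacked = conv_weights(num_layers=3, kernel=3, channels=C)  # 27 * C^2
single = conv_weights(num_layers=1, kernel=7, channels=C)   # 49 * C^2

print(receptive_field(3, 3), receptive_field(1, 7))  # 7 7
print(stacked, single)                               # 1769472 3211264
print(round(single / stacked, 2))                    # 1.81
```

The stacked variant also applies a non-linearity after each of its three layers, whereas the single 7×7 layer applies only one.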

## ResNet

 * Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2016): "Deep Residual Learning for Image Recognition". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [PDF](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)

<div>
    <img src="pictures/resnet_34_layers.png" height="400"/>
</div>
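
The central idea of the paper cited above is the residual (skip) connection: a block outputs `F(x) + x` instead of `F(x)` alone. The following is a minimal sketch of that idea in plain Python, not the paper's implementation. It assumes small fully connected layers in place of convolutions and leaves out batch normalization, and all function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; stands in for a convolution layer."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """y = relu(F(x) + x) with F(x) = w2 @ relu(w1 @ x)."""
    residual = linear(w2, relu(linear(w1, x)))
    return relu([r + xi for r, xi in zip(residual, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) = 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
```

The example shows why such blocks are easy to stack: a block can represent the identity mapping simply by driving its weights towards zero.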


## Other backbones

The success of these three backbones boosted research on new backbones with distinct features:

 * MobileNet
 * SqueezeNet

and also on backbones for other modalities, such as audio:

 * OpenL3
 * PANNs
 * PaSST

or specifically for speech:

 * wav2vec 2.0

Moreover, new datasets have been collected and annotated:

 * MS COCO by Microsoft (similar to ImageNet)
 * DCASE by Tampere University (for audio recognition)
 * AudioSet by Google (similar to DCASE)

Many more datasets exist for audio and video, along with other new large-scale datasets.

What do the filters at different depths of the hierarchy represent? See:

 * YouTube: [Journey on the Deep Dreams](https://www.youtube.com/watch?v=SCE-QeDfXtA)


## Using backbones

You can use the backbones directly as they are, or as the feature extraction body of your own application.

**Example:** AlexNet image classification

 * [Colab notebook](https://colab.research.google.com/drive/1PXpSA9qy9Kr-4Jh-Pe44FUDaHXr4MtJO?usp=sharing)


## References

See the original articles of the backbone networks to learn more.