# Highlights from ICML 2020
> Transformers :robot:, time series :chart_with_upwards_trend:, and a little bit of physics :apple:.

- toc: true 
- badges: false
- comments: false
- categories: [research,conference]
- image: images/icml.png

This year I had the opportunity to attend the [International Conference on Machine Learning](https://icml.cc/) (ICML) and decided to summarise some of the talks and papers I found especially interesting.

**tl;dr:** Transformer-based architectures are rapidly outgrowing their natural language roots and are finding new applications in computer vision and reinforcement learning; encoding sparse time series as sets beats RNNs and transformers; neural networks are slowly conquering physical systems, through either Lie groups or high-fidelity simulations.

## Transformers

### [Generative Pretraining from Pixels](https://proceedings.icml.cc/static/paper_files/icml/2020/6022-Paper.pdf)

![](my_icons/igpt.png)

_Predicting the next pixel with a GPT-2 scale model yields high quality representations. The best representations lie in the middle of the network._

This paper shows that with enough compute, it is possible to adapt transformer architectures to images and achieve strong results on self-supervised learning benchmarks. To make this idea work, the authors use a three-step process:

1. Downsize the images, cluster the RGB pixel values to create a 9-bit colour map, and reshape to 1D.{% fn 1 %}
2. Pre-train on either an autoregressive next pixel or masked pixel prediction task.
3. Evaluate the quality of the learned representations on downstream tasks.
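
To make step 1 concrete, here is a minimal NumPy sketch of the preprocessing. It assumes the palette is built with k-means using $2^9 = 512$ centroids and uses random arrays as stand-ins for real images; the helper names (`downsize`, `fit_palette`, `to_tokens`) are my own and not taken from the iGPT codebase.

```python
import numpy as np

def downsize(image, factor):
    """Shrink an (H, W, 3) image by averaging factor x factor blocks."""
    h, w, c = image.shape
    return image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def nearest(points, centroids):
    """Index of the closest centroid for every RGB point."""
    dists = (
        (points ** 2).sum(axis=1)[:, None]
        - 2.0 * points @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )
    return dists.argmin(axis=1)

def fit_palette(pixels, k=512, iters=5, seed=0):
    """Tiny k-means over RGB values; returns a (k, 3) colour palette."""
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest(pixels, centroids)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def to_tokens(image, centroids):
    """Replace each pixel by its palette index and flatten in raster order."""
    return nearest(image.reshape(-1, 3), centroids)

rng = np.random.default_rng(0)
# Stand-in data: four random 64x64 RGB "images" with values in [0, 255].
images = rng.uniform(0, 255, size=(4, 64, 64, 3))
small = np.stack([downsize(img, 2) for img in images])  # (4, 32, 32, 3)

palette = fit_palette(small.reshape(-1, 3))  # 512 colours = 9 bits
tokens = to_tokens(small[0], palette)        # 1D sequence of length 32 * 32
print(tokens.shape, tokens.min(), tokens.max())
```

Each image becomes a sequence of 1,024 integer tokens drawn from a vocabulary of 512 colours, which is the kind of input a language-model-style transformer expects.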

One surprising result of the linear probe{% fn 2 %} experiments is that representation quality tends to be highest in the _middle_ of the network.
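
For readers unfamiliar with linear probes, the mechanics can be sketched in a few lines of NumPy. Everything here is a toy assumption: the data is synthetic, and `extract_features` is a hypothetical frozen random projection standing in for the activations of one pre-trained layer. The point is only that the feature extractor stays fixed while a logistic regression is trained on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 400 inputs with 5 raw dimensions and a binary label.
inputs = rng.normal(size=(400, 5))
labels = (inputs[:, 0] + inputs[:, 1] > 0).astype(float)

# Hypothetical frozen "layer": a fixed random projection that plays the
# role of one block of a pre-trained network.
frozen_weights = rng.normal(size=(5, 16))

def extract_features(x):
    """Frozen feature extractor; its weights are never updated."""
    return np.tanh(x @ frozen_weights)

def fit_logistic(feats, targets, lr=0.5, steps=500):
    """Logistic regression trained with plain gradient descent."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        grad = probs - targets
        w -= lr * feats.T @ grad / len(targets)
        b -= lr * grad.mean()
    return w, b

feats = extract_features(inputs)
train, test = slice(0, 300), slice(300, 400)

# Only the linear model on top of the frozen features is trained.
w, b = fit_logistic(feats[train], labels[train])
preds = (feats[test] @ w + b > 0).astype(float)
accuracy = (preds == labels[test]).mean()
print(f"linear probe accuracy: {accuracy:.2f}")
```

Repeating this procedure with features taken from different depths of the network gives one probe accuracy per layer, which is how a "best layer" can be identified.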

I think this work provides a compelling example of Sutton's ["bitter lesson"](http://incompleteideas.net/IncIdeas/BitterLesson.html)

> Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

but takes it one step further by discarding knowledge of the 2D structure in images entirely! 

Although the iGPT models are 2 to 30 times larger than ResNet-152, I expect it is only a matter of time before people find ways to make this approach more efficient. In the meantime, it's nice to see that the pre-trained models have been [open-sourced](https://github.com/openai/image-gpt) and a [port](https://github.com/huggingface/transformers/issues/5088) to HuggingFace's transformers is already underway.

### [Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://linear-transformers.com/)

![](my_icons/transformers-are-rnns.png)

_A clever choice of kernel reduces the computational complexity of attention from $O(N^2)$ to $O(N)$, making autoregressive image generation up to 4000x faster than with vanilla transformers :fire:._
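
The idea behind the title can be sketched with NumPy. The snippet assumes the similarity between a query and a key is $\phi(q)^\top \phi(k)$ with $\phi(x) = \mathrm{elu}(x) + 1$, and compares a causal attention that builds the full $N \times N$ matrix with an RNN-style version that carries a running state instead. The function names are my own, and this is an illustration rather than the authors' implementation.

```python
import numpy as np

def feature_map(x):
    """phi(x) = elu(x) + 1, which keeps every similarity positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_causal_attention(Q, K, V):
    """Builds the full N x N similarity matrix: O(N^2) time and memory."""
    scores = np.tril(feature_map(Q) @ feature_map(K).T)  # mask out the future
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def recurrent_causal_attention(Q, K, V):
    """Same output, computed step by step with a fixed-size state: O(N)."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))  # running sum of phi(k_j) v_j^T
    z = np.zeros(d)                # running sum of phi(k_j)
    out = np.empty_like(V)
    for i in range(n):
        q, k = feature_map(Q[i]), feature_map(K[i])
        S += np.outer(k, V[i])
        z += k
        out[i] = (q @ S) / (q @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))

slow = quadratic_causal_attention(Q, K, V)
fast = recurrent_causal_attention(Q, K, V)
print(np.allclose(slow, fast))
```

Because the state `(S, z)` has a fixed size, generating each new element costs the same no matter how long the sequence already is, which is what makes autoregressive sampling fast.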

### [XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation](https://proceedings.icml.cc/static/paper_files/icml/2020/4220-Paper.pdf)

![](my_icons/xtreme.png "Image credit: https://ai.googleblog.com/2020/04/xtreme-massively-multilingual-multi.html")

_A new benchmark to test zero-shot cross-lingual transfer from English to 39 typologically diverse languages._

In this paper, the authors introduce the [XTREME benchmark](https://sites.research.google/xtreme) to evaluate the ability of multilingual representations to generalise across 40 languages and 9 tasks. Evaluating a model on XTREME follows a three-stage recipe:

1. Pre-train on a large corpus of multilingual text.
2. Fine-tune on English data for each task.
3. Evaluate the model on _zero-shot transfer_ performance, e.g. evaluate the accuracy on a German text classification task.
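
Step 3 boils down to scoring a single English-tuned model on test sets in other languages without any further training. A toy sketch of that bookkeeping is below; the `predict` function and the tiny test sets are made up for illustration and have nothing to do with the actual XTREME data or models.

```python
def predict(text):
    """Hypothetical stand-in for a classifier fine-tuned on English only."""
    return int("!" in text)

# Made-up target-language test sets: (text, label) pairs per language code.
test_sets = {
    "de": [("Toller Film!", 1), ("Langweilig.", 0), ("Sehr gut", 1)],
    "fr": [("Super film!", 1), ("Ennuyeux.", 0)],
}

def zero_shot_accuracy(predict_fn, examples):
    """Accuracy on a language the model was never fine-tuned on."""
    correct = sum(predict_fn(text) == label for text, label in examples)
    return correct / len(examples)

scores = {
    lang: zero_shot_accuracy(predict, examples)
    for lang, examples in test_sets.items()
}
average = sum(scores.values()) / len(scores)

for lang, score in scores.items():
    print(f"{lang}: {score:.2f}")
print(f"average: {average:.2f}")
```

Reporting the per-language scores next to the average makes it easy to see which languages a model transfers to poorly.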

English is chosen for fine-tuning because it's the language with the most labelled data, and the authors employ a neat trick using Google Translate to generate proxy test sets for the tasks where a pre-existing translation does not exist.

Although not strictly about Transformers, the baseline models for this benchmark are all variants of the Transformer architecture, and the authors find that [XLM-R](https://arxiv.org/abs/1911.02116) achieves the best zero-shot transfer performance across all languages in each task. What I especially like about XTREME is that the tasks are designed to be trainable on a single GPU in less than a day. This should make it possible for research labs with tight budgets to create competitive models, where the gains in performance are likely to come from architectural design rather than simply scaling up the compute.

I'm excited about this benchmark because I expect it will produce models that have a direct impact on my professional work in Switzerland. With [four national languages](https://en.wikipedia.org/wiki/Languages_of_Switzerland) and a smattering of English, building natural language applications that serve the whole population is challenging.

## Time series

### Set Functions for Time Series

![](my_icons/seft.png)

### Interpretable, Multidimensional, Multimodal Anomaly Detection with Negative Sampling for Detection of Device Failure

![](my_icons/anomaly-detection.png)

## Physics

### Learning to Simulate Complex Physics with Graph Networks
> youtube: https://youtu.be/h7h9zF8OO7E

### [Lorentz Group Equivariant Neural Network for Particle Physics](https://proceedings.icml.cc/static/paper_files/icml/2020/561-Paper.pdf)

## Machine learning

### AutoML-Zero: Evolving Machine Learning Algorithms From Scratch

![](my_icons/automl-zero.png)

### The Tree Ensemble Layer: Differentiability meets Conditional Computation

![](my_icons/tel.png)

{{ "Downscaling is needed because naively training on a $224^2 \times 3$ sequence length would blow up the memory of the largest TPU!" | fndetail: 1 }}

{{ "A linear probe refers to using the model as a feature extractor and passing those features through a linear model like logistic regression." | fndetail: 2 }}