# Highlights from ICML 2020
> Transformers :robot:, time series :chart_with_upwards_trend:, and a little bit of physics :apple:.

- toc: true 
- badges: false
- comments: false
- categories: [research,conference]
- image: images/icml.png

This year I had the opportunity to attend the [International Conference on Machine Learning](https://icml.cc/) (ICML) and decided to highlight some of the talks I found especially interesting. Although the conference was hosted entirely online, this provided two key benefits over attending in person:

* **Clash resolution:** with [1,088 papers accepted](https://syncedreview.com/2020/06/01/icml-2020-announces-accepted-papers/#:~:text=Conference%20Industry-,ICML%202020%20Announces%20Accepted%20Papers,the%20prestigious%20machine%20learning%20conference.), it is inevitable that multiple talks of interest would clash in the timetable. Watching the pre-recorded presentations on my own time provided a simple solution, not to mention the ability to quickly switch to a new talk if desired.
* **Better Q&A sessions:** at large conferences it is not easy to get your questions answered directly after a talk, usually because the whole session is running overtime and the moderator wants to move on to the next speaker. By having two (!) dedicated Q&A sessions for each talk, I found the discussions to be extremely insightful and much more personalised.

Highlights below!


## Transformers

### [Generative Pretraining from Pixels](https://icml.cc/virtual/2020/poster/6739)

![](my_icons/igpt.png)

_Predicting the next pixel with a GPT-2 scale model yields high quality representations. The best representations lie in the middle of the network._

This talk shows that with enough compute, it is possible to adapt transformer architectures to images and achieve strong results in self-supervised learning benchmarks. Dubbed iGPT, this approach relies on a three-step process:

1. Downsize the images, cluster the RGB pixel values to create a 9-bit colour map, and reshape to 1D.{% fn 1 %}
2. Pre-train on either an autoregressive next pixel or masked pixel prediction task.
3. Evaluate the quality of the learned representations on downstream tasks.
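
To make step 1 concrete, here is a minimal sketch in plain Python. It is only an illustration: the four-colour `palette` is a toy stand-in for the $2^9 = 512$ cluster centres that a 9-bit colour map implies, and the function names are my own rather than anything from the iGPT code.

```python
def nearest_colour(pixel, palette):
    """Return the index of the palette colour closest to an (R, G, B) pixel."""
    def sq_dist(colour):
        return sum((p - c) ** 2 for p, c in zip(pixel, colour))
    return min(range(len(palette)), key=lambda i: sq_dist(palette[i]))


def image_to_sequence(image, palette):
    """Quantise each pixel to a palette index and flatten in raster order."""
    return [nearest_colour(pixel, palette) for row in image for pixel in row]


# Toy palette (assumption); a 9-bit colour map would have 512 entries.
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]

# A 2x2 "downsized" image of RGB pixels.
image = [
    [(10, 5, 0), (250, 10, 20)],
    [(30, 240, 10), (245, 250, 255)],
]

tokens = image_to_sequence(image, palette)
print(tokens)  # [0, 1, 2, 3]
```

The resulting 1D list of integer tokens is what a GPT-style model can consume, exactly as it would a sequence of word pieces.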

One surprising result of the linear probe{% fn 2 %} experiments is that representation quality tends to be highest in the _middle_ of the network.

I think this work provides a compelling example of Sutton's ["bitter lesson"](http://incompleteideas.net/IncIdeas/BitterLesson.html)

> Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

but takes it one step further by discarding knowledge of the 2D structure in images entirely! 

Although the iGPT models are 2-30 times larger than ResNet-152, I expect it is only a matter of time before people find ways to make this approach more efficient. In the meantime, it's nice to see that the pre-trained models have been [open-sourced](https://github.com/openai/image-gpt) and a [port](https://github.com/huggingface/transformers/issues/5088) to HuggingFace's transformers library is already underway.

### [Retrieval Augmented Language Model Pre-Training](https://icml.cc/virtual/2020/poster/6294)

![](my_icons/realm.png)

_Augmenting language models with knowledge retrieval sets a new benchmark for open-domain question answering._

I liked this talk a lot because it takes a non-trivial step towards integrating world knowledge into language models and addresses Gary Marcus' [common complaint](https://thegradient.pub/gpt2-and-the-nature-of-intelligence/) that data and compute aren't enough to produce Real Intelligence&trade;.

To integrate knowledge into language model pretraining, this talk proposes adding a text retriever that is _learned_ during the training process. Unsurprisingly, this introduces a major computational challenge because the conditional probability now involves a sum over _all_ documents in a corpus $\mathcal{Z}$:

$$ p(y|x) = \sum_{z\in \mathcal{Z}} p(y|x,z)\,p(z|x)\,.$$

To deal with this, the authors compute an embedding for every document in the corpus and then use [Maximum Inner Product Search](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) algorithms to find the approximate top $k$ documents.
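
The retrieve-then-predict approximation can be sketched in a few lines of plain Python. I assume here that each document is scored by the inner product between a query embedding and a document embedding, and that the retrieval weights are a softmax restricted to the top $k$ documents. The numbers in `reader_probs` are made up, and the brute-force sort stands in for a real Maximum Inner Product Search index.

```python
import math


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k_marginal(query, doc_embeddings, reader_probs, k):
    """Approximate the sum over all documents using only the k best-scoring ones."""
    scores = [inner(query, d) for d in doc_embeddings]
    # Brute-force top-k; a MIPS index would find these approximately but much faster.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the retrieved documents only (shifted by the max for stability).
    best = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - best) for i in top]
    total = sum(weights)
    # Weighted average of the per-document answer probabilities p(y|x,z).
    return sum(w / total * reader_probs[i] for w, i in zip(weights, top))


query = [1.0, 0.0]
doc_embeddings = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]]
reader_probs = [0.9, 0.1, 0.5, 0.2]  # hypothetical p(y|x,z) for each document

p_y = top_k_marginal(query, doc_embeddings, reader_probs, k=2)
print(p_y)
```

Only the two highest-scoring documents contribute to `p_y`, which is the whole point: the sum over the corpus is replaced by a sum over a handful of retrieved candidates.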

### [Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://icml.cc/virtual/2020/poster/6257)

![](my_icons/transformers-are-rnns.png)

_A clever choice of kernel reduces the computational complexity of attention from $O(N^2)$ to $O(N)$. Generate images 4000x faster than vanilla transformers :fire:._

It's refreshing to see a transformer talk that isn't about using a "bonfire worth of GPU-TPU-neuromorphic wafer scale silicon"{% fn 4 %} to break NLP benchmarks. This talk observes that the main bottleneck in vanilla transformer models is the softmax attention computation

$$ V' = \mathrm{softmax} \left(\frac{QK^T}{\sqrt{D}} \right) V $$

whose time and space complexity is $O(N^2)$ for sequence length $N$. To get around this, the authors first use a similarity function to obtain a _generalised_ form of self-attention

$$ V_i' = \frac{\sum_j \mathrm{sim}(Q_i, K_j)V_j}{\sum_j \mathrm{sim}(Q_i, K_j)} $$

which can be simplified via a choice of kernel and matrix associativity:

$$V_i' = \frac{\phi(Q_i)^T\sum_j\phi(K_j)V_j^T}{\phi(Q_i)^T\sum_j\phi(K_j)}\,. $$

The result is a self-attention step that is $O(N)$ because the sums in the above expression can be computed once and reused for every query. In practice, this turns out to be especially powerful for inference, with speed-ups of 4000x reported in the talk! The authors go on to show that their formulation can also be used to express transformers as RNNs, which might be an interesting way to explore the [shortcomings](https://mostafadehghani.com/2019/05/05/universal-transformers/) of these large language models.
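
To convince myself that the reordering really gives the same answer, I find it helpful to compare the two forms directly. The sketch below is plain Python rather than an optimised implementation, it ignores causal masking, and it assumes the feature map $\phi(x) = \mathrm{elu}(x) + 1$. The inputs are arbitrary toy values.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, applied elementwise; keeps similarities positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quadratic_attention(Q, K, V):
    """Direct form: every query is compared with every key, so O(N^2)."""
    out = []
    for q in Q:
        sims = [dot(phi(q), phi(k)) for k in K]
        norm = sum(sims)
        out.append([sum(s * v[c] for s, v in zip(sims, V)) / norm
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result in O(N): the two sums over keys are computed only once."""
    d, m = len(K[0]), len(V[0])
    S = [[0.0] * m for _ in range(d)]  # sum_j phi(K_j) V_j^T
    z = [0.0] * d                      # sum_j phi(K_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(m):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm
                    for b in range(m)])
    return out


Q = [[0.1, -0.2], [1.0, 0.5], [-0.3, 0.8]]
K = [[0.4, 0.0], [-1.2, 0.3], [0.7, -0.6]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
print(slow)
print(fast)
```

Both functions return the same values up to floating-point error, but in `linear_attention` the work per query no longer depends on the sequence length.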

### [XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation](https://proceedings.icml.cc/static/paper_files/icml/2020/4220-Paper.pdf)

![](my_icons/xtreme.png "Image credit: https://ai.googleblog.com/2020/04/xtreme-massively-multilingual-multi.html")

_A new benchmark to test zero-shot cross-lingual transfer from English to 39 typologically diverse languages._

In this talk, the authors introduce the [XTREME benchmark](https://sites.research.google/xtreme) to evaluate the ability of multilingual representations to generalise across 40 languages and 9 tasks. To evaluate a model in XTREME, the main idea is to follow a three-stage recipe:

1. Pre-train on a large corpus of multilingual text.
2. Fine-tune on English data for each task.
3. Evaluate the model on _zero-shot transfer_ performance, e.g. evaluate the accuracy on a German text classification task.

English is chosen for fine-tuning because it's the language with the most labelled data, and the authors employ a neat trick using Google Translate to generate proxy test sets for the tasks where a pre-existing translation does not exist.

Although not strictly about Transformers, the baseline models for this benchmark are all variants of the Transformer architecture, and the authors find that [XLM-R](https://arxiv.org/abs/1911.02116) achieves the best zero-shot transfer performance across all languages in each task. What I especially like about XTREME is that the tasks are designed to be trainable on a single GPU for less than a day. This should make it possible for research labs with tight budgets to create competitive models, where the gains in performance are likely to come from architectural design rather than simply scaling up the compute.

I'm excited about this benchmark because I expect it will produce models that have a direct impact on my professional work in Switzerland. With [four national languages](https://en.wikipedia.org/wiki/Languages_of_Switzerland) and a smattering of English, building natural language applications that serve the whole population is a constant challenge.


## Time series

### [Set Functions for Time Series](https://icml.cc/virtual/2020/poster/6545)

![](my_icons/seft.png)

_High-performance classification for irregularly sampled time series_

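The main idea of this talk is to learn classification models directly from irregularly sampled time series, without any imputation or interpolation. To do this, the authors introduce Set Functions for Time Series (SeFT), which offers:

* A new approach to classifying irregularly sampled time series
* Competitive performance with low runtime
* Per-observation contributions to the prediction

The key idea is to treat a time series as a _set_ of observations, where each observation is a tuple $(t_i, z_i, m_i)$ of a time, a value, and a modality indicator.

The set is then encoded with DeepSets. The authors also report finding leakage for IP-Nets and Transformers on the benchmark.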
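
Here is a rough sketch of what a DeepSets-style encoder for such tuples could look like. It is not the SeFT architecture: the embedding `h` is a fixed, hand-written function where the real model would use a learned network, and I use plain mean pooling. The toy series mixes two hypothetical modalities (say, heart rate and temperature) measured at irregular times.

```python
import math


def h(obs, n_modalities):
    """Hypothetical per-observation embedding of a (time, value, modality) tuple."""
    t, z, m = obs
    one_hot = [1.0 if i == m else 0.0 for i in range(n_modalities)]
    return [math.sin(t), math.cos(t), z] + one_hot


def encode_set(observations, n_modalities):
    """Mean-pool the embeddings, so the result does not depend on observation order."""
    embedded = [h(o, n_modalities) for o in observations]
    return [sum(column) / len(embedded) for column in zip(*embedded)]


# Irregularly sampled observations: no grid, no imputation of missing values.
series = [(0.0, 72.0, 0), (0.5, 36.6, 1), (2.3, 75.0, 0)]

encoding = encode_set(series, n_modalities=2)
shuffled = encode_set(list(reversed(series)), n_modalities=2)
print(encoding)
```

Because the pooling is a mean over the set, `encoding` and `shuffled` agree up to floating-point error, and a series with more or fewer observations still maps to a vector of the same length that a classifier can consume.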

### [Interpretable, Multidimensional, Multimodal Anomaly Detection with Negative Sampling for Detection of Device Failure](https://icml.cc/virtual/2020/poster/6171)

![](my_icons/anomaly-detection.png)

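_A new unsupervised anomaly detection method for IoT devices._

The fixed rules of supervised approaches are a poor fit for this problem. Instead, the goal is a model for which $p(x \in \mathrm{normal}) \approx 0$ whenever $x$ is anomalous, and the authors get there with negative sampling: the positive region is given by the observed data (most of which are normal), while the negative region is sampled from its complement (anomalous). Integrated gradients are then used to interpret each anomaly. The method has been open-sourced as madi.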
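
The sampling step is simple enough to sketch in plain Python. To be clear, this is not the madi API: the function name, the uniform sampling over a padded bounding box, and the toy sensor readings are all my own assumptions for illustration.

```python
import random


def sample_negatives(observed, n_samples, margin=0.1, seed=0):
    """Draw points uniformly from a box slightly larger than the observed data."""
    rng = random.Random(seed)
    bounds = []
    for values in zip(*observed):  # iterate over feature columns
        lo, hi = min(values), max(values)
        pad = margin * (hi - lo)
        bounds.append((lo - pad, hi + pad))
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_samples)]


# Hypothetical readings from a device: (temperature, fan duty cycle).
observed = [[20.1, 0.50], [20.4, 0.52], [19.8, 0.49], [20.0, 0.51]]
negatives = sample_negatives(observed, n_samples=8)

# Label 1 = observed (assumed mostly normal), label 0 = sampled negative.
X = observed + negatives
y = [1] * len(observed) + [0] * len(negatives)
print(len(X), sum(y))
```

Any binary classifier can then be trained on `X` and `y`, and its predicted probability for the positive class serves as the estimate of $p(x \in \mathrm{normal})$.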

## Physics

### [Learning to Simulate Complex Physics with Graph Networks](https://icml.cc/virtual/2020/poster/6849)
> youtube: https://youtu.be/h7h9zF8OO7E

_A single architecture creates high-fidelity particle simulations of various interacting materials._

I'm a sucker for flashy demos and this talk from DeepMind didn't disappoint. They propose an "encode-process-decode" architecture to calculate the dynamics of physical systems, where particle states are represented as graphs and a graph neural network learns the particle interactions.
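
To give a feel for the graph representation, here is a small sketch of how particle positions could be turned into edges. I assume that two particles are connected whenever they lie within a fixed "connectivity radius" of each other. The function name and the brute-force $O(N^2)$ search are my own simplifications (it needs Python 3.8+ for `math.dist`).

```python
import math


def build_edges(positions, radius):
    """Connect every ordered pair of distinct particles within `radius` of each other."""
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) <= radius:
                edges.append((i, j))
    return edges


# Three nearby particles and one far away.
positions = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.15), (1.0, 1.0)]
edges = build_edges(positions, radius=0.2)
print(edges)  # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```

The isolated particle gets no edges, so in a message-passing step it would only be influenced by its own state. A real simulator would replace the double loop with a spatial index to handle many particles.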

![](my_icons/gns.png)

During training, the model predicts each particle's position and velocity one timestep into the future, and these predictions are compared against the ground-truth values of a simulator. Remarkably, this approach generalises to _thousands of timesteps_ at test time, even under different initial conditions and an order of magnitude more particles!{% fn 3 %}

I think this work is a great example of how machine learning can help physicists build better simulations of complex phenomena. It will be interesting to see whether this approach can scale to systems with _billions_ of particles, like those found in [dark matter simulations](https://wwwmpa.mpa-garching.mpg.de/galform/virgo/millennium/) or [high-energy collisions](https://www.youtube.com/watch?v=NhXMXiXOWAA) at the Large Hadron Collider.

{{ "Downscaling is needed because naively training on a $224^2 \times 3$ sequence length would blow up the memory of the largest TPU!" | fndetail: 1 }}

{{ "A _linear probe_ refers to using the model as a feature extractor and passing those features through a linear model like logistic regression." | fndetail: 2 }}

{{ "The authors ascribe this generalisation power to the fact that each particle is only aware of local interactions in some &#39;connectivity radius&#39;, so the model is flexible enough to generalise to out-of-distribution inputs." | fndetail: 3 }}

{{ "Quote from Stephen Merity's brilliant _[Single Headed Attention RNN: Stop Thinking With Your Head](https://arxiv.org/abs/1911.11423)_." | fndetail: 4 }}