ICLR 2020

Lalibela, Ethiopia by Trevor Cole on Unsplash.

Selections from ICLR 2020 (mostly NLP-related)

See also: https://www.theexclusive.org/2020/05/virtual-iclr.html



Some workshop talks

Workshops: see later.



Lots of "social" events, both topic and demographic based:

  • Topics in Language Research
  • Learning Representation for Cybersecurity
  • Research with πŸ€— Transformers
  • ICLR Town:


  1. Aisha Walcott-Bryant: AI + Africa = Global Innovation

  2. Leslie Kaelbling: Doing for Our Robots What Nature Did For Us

  3. Ruha Benjamin: 2020 Vision: Reimagining the Default Settings of Technology & Society

    • A discussion on how even apparently neutral technology can perpetuate discrimination. Technologists and researchers should be aware of the societal consequences of their work.
  4. Laurent Dinh: Invertible Models and Normalizing Flows

  5. Mihaela van der Schaar: Machine Learning: Changing the future of healthcare

  6. Devi Parikh: AI Systems That Can See And Talk

  7. Yann LeCun and Yoshua Bengio: Reflections from the Turing Award Winners

    • Yann LeCun: "The future is self-supervised". Challenges for Deep Learning: (1) learning with less labeled data (self-supervised learning!), (2) how to make reasoning compatible with gradient-based learning, i.e., beyond 'system 1', and (3) learning complex (hierarchical) action sequences (nothing to say here). Mostly a discussion of energy-based models (not too different from previous talks). "Could energy-based SSL be a basis for common sense?"

    • Yoshua Bengio: "Deep learning priors associated with conscious processing". Similar to this other recent talk.

      • ML and Consciousness ("Consciousness Prior")
      • The need for systematic generalization by dynamically recombining existing concepts, while avoiding the pitfalls of classical AI (e.g., we need uncertainty handling, distributed representations, efficient search, grounding in 'system 1', and large-scale training).
  8. Michael I. Jordan: The Decision-Making Side of Machine Learning: Dynamical, Statistical and Economic Perspectives. (Note: see also Artificial Intelligence – The Revolution Hasn’t Happened Yet.)



Observations: there was less of a distinction between posters and orals than in an IRL conference, as posters were just short talks. I thought the "poster" format worked very well, but I was much less likely to interact with the authors than in a non-virtual conference.

Some popular topics: reinforcement learning, adversarial ML, graph neural networks.

See also:

🔍 Adversarial ML and Robustness

💥 Robustness Verification for Transformers (paper) (reviews) (code)

TL;DR: "We propose the first algorithm for verifying the robustness of Transformers."

➖ Distributionally Robust Neural Networks for Group Shifts (paper) (reviews) (code)

Problem: models often "latch onto" spurious correlations: features that work on most training examples but don't solve the problem in the way we would expect. E.g., in image classification, waterbirds and water backgrounds often (but not always) co-occur. Overall accuracy may be high, but worst-group accuracy (e.g., waterbirds on land) can be very low.

Goal: achieve models that are more robust to spurious correlations with lower worst-group error.

Solution: Group distributionally robust optimization (DRO): minimize the worst-group's average loss, rather than the (overall) average loss. This requires knowing groups (attributes and labels) for each training example (but not at test time). A stochastic optimization algorithm is proposed and convergence guarantees are derived.

But: the worst-group error of Group DRO (at test time) is still high, i.e., poor generalization! Previous work on small convex or generative models says this shouldn't happen. This happens because the models are SOTA overparametrized neural networks. To solve this, use stronger regularization than usual (L2 penalty).
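A minimal sketch of the group DRO objective (worst-group average loss) in PyTorch-style code; the function and its arguments are illustrative, not the authors' implementation, and the paper optimizes this objective with a stochastic algorithm that maintains per-group weights rather than a hard max:

```python
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids, n_groups):
    """Group DRO objective: average the loss within each group,
    then take the maximum over groups instead of the overall mean."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_means = []
    for g in range(n_groups):
        mask = group_ids == g
        if mask.any():  # skip groups absent from this batch
            group_means.append(per_example[mask].mean())
    return torch.stack(group_means).max()

# Training minimizes worst_group_loss(...) together with strong L2 regularization
# (weight decay), which the paper finds necessary for overparametrized models.
```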

Evaluation: Two image classification datasets (CelebA and Waterbirds) and one NLI dataset (MultiNLI).

The remaining papers deal with adversarial ML in computer vision:

💥 Unrestricted Adversarial Examples via Semantic Manipulation (paper) (reviews) (code)

Problem: adversarial examples (for images) are often created using perturbations within a small ball; it is easy to defend against them using JPEG compression or randomized smoothing.

Contributions: introduce "semantically motivated" adversarial perturbations (manipulating color and texture) with no l_p bounds (unlike most perturbations in the literature, these are large, structured, explainable). It is shown these fool some common defenses (JPEG 75, Feature Squeezing, and adversarially-trained models).

  • Colorization attack: use a pre-trained colorization model and adversarially chosen "color hints" to re-colorize the image so that it fools the classifier, while keeping the colors close to the original (see the sketch after this list).
  • Texture attack: style transfer (transfer texture from another image). This works best with an image from the target adversarial class, but with similar features to the original image.
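Here is the sketch referenced above: a rough, hypothetical version of the recipe shared by both attacks, i.e., optimize the inputs of a pretrained image-manipulation network (a colorization model here; a style-transfer network for the texture attack) so the manipulated image fools the classifier. All names are placeholders, and the color-proximity term the authors use to keep the hints realistic is omitted:

```python
import torch

def color_hint_attack(colorizer, classifier, gray_image, hints, target_class,
                      steps=200, lr=0.01):
    """Optimize the "color hints" fed to a pretrained colorization network so the
    re-colorized image is classified as target_class, instead of adding small
    pixel perturbations directly."""
    hints = hints.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([hints], lr=lr)
    for _ in range(steps):
        colored = colorizer(gray_image, hints)       # large, structured, "semantic" change
        log_probs = torch.log_softmax(classifier(colored), dim=-1)
        loss = -log_probs[:, target_class].mean()    # push the prediction toward the target class
        opt.zero_grad()
        loss.backward()
        opt.step()
    return colorizer(gray_image, hints.detach())
```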

Evaluation:

  • Misclassification rate under various defenses. Also, attacks transfer.
  • User study: humans have difficulty in detecting the attack.
  • Caption attack: these adversarial images also fool image captioning systems! E.g., "A man is holding an apple" -> "A dog is holding an apple".

➖ Adversarial Training and Provable Defenses: Bridging the Gap (paper) (reviews) (code)

TL;DR: "We propose a novel combination of adversarial training and provable defenses which produces a model with state-of-the-art accuracy and certified robustness on CIFAR-10."

➖ Fast is better than free: Revisiting adversarial training (paper) (reviews) (code)

TL;DR: "FGSM-based adversarial training, with randomization, works just as well as PGD-based adversarial training: we can use this to train a robust classifier in 6 minutes on CIFAR10, and 12 hours on ImageNet, on a single machine." (cheaper than PGD).

➖ A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning (paper) (reviews) (code)


🔍 Neural Network Architectures

NOTE: Transformers and Graph Neural Networks get their own categories.

💥 Neural Stored-program Memory (paper) (reviews) (code)

Presents a new architecture which simulates a Universal Turing Machine.

➖ Mogrifier LSTM (paper) (reviews) (code)

Initial motivation -- input embeddings for language models are based on the average context; it might be better (particularly for verbs and function words) to use the actual context. But forget this! "Mogrify" the LSTM by adding more than one round of gating. This achieves lower perplexity than LSTMs and Transformer XL (on Penn Treebank and Wikitext-2).
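A small sketch of the gating rounds ("mogrification") as I understand them: before the usual LSTM update, the current input and the previous hidden state repeatedly scale each other. The number of rounds and the single list of projections below are simplifications of the paper's setup:

```python
import torch
import torch.nn as nn

class Mogrifier(nn.Module):
    """Alternately gate x with h and h with x for a few rounds, then hand the
    modulated (x, h) pair to an ordinary LSTM cell."""
    def __init__(self, dim, rounds=5):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(rounds)]
        )

    def forward(self, x, h):
        for i, proj in enumerate(self.projections):
            if i % 2 == 0:
                x = 2 * torch.sigmoid(proj(h)) * x
            else:
                h = 2 * torch.sigmoid(proj(x)) * h
        return x, h  # use these in place of the raw inputs of the LSTM cell
```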

Why does the Mogrifier work? There are many plausible reasons, none of them fully convincing. On a synthetic dataset, the Mogrifier LSTM also outperforms the LSTM (with larger gains for larger vocabulary size). "Sadly, we could not escape the deep learning pit and a convincing explanation remained elusive".

➖ On the Relationship between Self-Attention and Convolutional Layers (paper) (reviews) (code)

TL;DR: "A self-attention layer can perform convolution and often learns to do so in practice."

Transformers are great at NLP tasks. They can also reach SOTA accuracy on vision tasks (Bello et al. 2019; Ramachandran et al., 2019). Why does self-attention work so well for images? This paper shows that multi-head self-attention can express convolutions.

Blog: http://jbcordonnier.com/posts/attention-cnn/

Demo: https://epfml.github.io/attention-cnn/
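To make the claim concrete, here is a toy construction (mine, not the paper's): if each attention head attends with a one-hot, content-independent pattern to one fixed relative offset, and its value projection plays the role of that offset's filter, the layer computes an ordinary convolution. A 1D check with kernel size 3:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_in, d_out = 10, 4, 6
x = torch.randn(1, T, d_in)

offsets = [-1, 0, 1]                                  # one "head" per relative offset
W = [torch.randn(d_in, d_out) for _ in offsets]       # each head's value projection

# Attention with hard, position-only weights: the head for offset `off` attends from i to i + off.
out_attn = torch.zeros(1, T, d_out)
for off, Wh in zip(offsets, W):
    attn = torch.zeros(T, T)
    for i in range(T):
        if 0 <= i + off < T:
            attn[i, i + off] = 1.0                    # one-hot attention on a fixed offset
    out_attn += attn @ (x[0] @ Wh)

# The same computation as a convolution: kernel tap k holds the head for offsets[k].
kernel = torch.stack([Wh.t() for Wh in W], dim=-1)    # shape (d_out, d_in, 3)
out_conv = F.conv1d(x.transpose(1, 2), kernel, padding=1).transpose(1, 2)

print(torch.allclose(out_attn, out_conv, atol=1e-5))  # True
```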


🔍 Compositionality

💥 Permutation Equivariant Models for Compositional Generalization in Language (paper) (reviews) (code)

TL;DR: "We propose a link between permutation equivariance and compositional generalization, and provide equivariant language models."

Compositionality example: if one understands "Today I will run twice" and "I walk to school every day", one should also understand "I will have to walk twice around the store". The SCAN benchmark: machine translation between simple natural language commands (e.g., "jump", "walk left", "turn right twice") and 'machine actions' (e.g., JUMP, LTURN WALK, RTURN RTURN RTURN).

💥 Measuring Compositional Generalization: A Comprehensive Method on Realistic Data (paper) (reviews) (code)

TL;DR: "Benchmark and method to measure compositional generalization by maximizing divergence of compound frequency at small divergence of atom frequency."

Compositional Generalization: ability to generalize to unseen combinations of known components (atoms).

Goal: want to measure how much compositional generalization is required for a given train/test split.

"Compound divergence": a more comprehensive measure than previous approaches, assuming that (1) all test atoms occur in training, (2) the distribution of atoms is similar in train and test and (3) distribution of compounds is different between train and test. Compound divergence correlates well with previous ad-hoc methods.

Evaluation: Compositional Freebase Questions (CFQ) and SCAN. An LSTM+attention, Transformer and Universal Transformer are compared. Compound Divergence is a great predictor of accuracy! Current systems fail to generalize compositionally, even with large training data, while random split is easy. (But it appears Transformers outperform LSTM+attention by a wide margin for almost every value of compound divergence -- see also results on syntactic generalization in https://arxiv.org/pdf/2005.03692.pdf )

➖ Environmental drivers of systematicity and generalization in a situated agent (paper) (reviews)

TL;DR: "We isolate the environmental and training factors that contribute to emergent systematic generalization in a situated language-learning agent."

From the discussion: see An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution.

➖ Compositional Language Continual Learning (paper) (reviews) (code)

Goal: continually learn new words in seq2seq tasks (e.g., instruction learning using the SCAN dataset and machine translation).


🔍 Emergent Language

➖ Compositional languages emerge in a neural iterated learning model (paper) (reviews) (code)

TL;DR: "Use iterated learning framework to facilitate the dominance of high compositional language in multi-agent games."

➖ On the interaction between supervision and self-play in emergent communication (paper) (reviews) (code)


🔍 Explainability

💥 Learning The Difference That Makes A Difference With Counterfactually-Augmented Data (reviews) (paper) (code)

TL;DR: "Humans in the loop revise documents to accord with counterfactual labels, resulting resource helps to reduce reliance on spurious associations."

See also: Evaluating NLP Models via Contrast Sets

💥 Explanation by Progressive Exaggeration (paper) (reviews) (code)

TL;DR: "A method to explain a classifier, by generating visual perturbation of an image by exaggerating or diminishing the semantic features that the classifier associates with a target label."

Creating image counterfactuals with GANs.

➖ N-BEATS: Neural basis expansion analysis for interpretable time series forecasting (paper) (reviews)

TL;DR: "A novel deep interpretable architecture that achieves state of the art on three large scale univariate time series forecasting datasets."

➖ Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models (paper) (reviews) (code) (project)

TL;DR: "We propose measurement of phrase importance and algorithms for hierarchical explanation of neural sequence model predictions."


🔍 Graphs

💥 Spectral Embedding of Regularized Block Models (paper) (reviews) (code)

➖ GraphZoom: A Multi-level Spectral Approach for Accurate and Scalable Graph Embedding (paper) (reviews) (code)

➖ Low-dimensional statistical manifold embedding of directed graphs (paper) (reviews)


🔍 Graph Neural Networks

There were a lot of papers about GNNs. Here are a few I found interesting:

➖ A Fair Comparison of Graph Neural Networks for Graph Classification (paper) (reviews) (code)

➖ On the Equivalence between Positional Node Embeddings and Structural Graph Representations (paper) (reviews) (code)

➖ LambdaNet: Probabilistic Type Inference using Graph Neural Networks (paper) (reviews) (code)

➖ Strategies for Pre-training Graph Neural Networks (paper) (reviews) (code)

➖ What graph neural networks cannot learn: depth vs width (paper) (reviews)

➖ The Logical Expressiveness of Graph Neural Networks (paper) (reviews) (code)


🔍 Knowledge Graphs

➖ Probability Calibration for Knowledge Graph Embedding Models (paper) (reviews) (code)

➖ Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings (paper) (reviews) (code) (project)

➖ You CAN Teach an Old Dog New Tricks! On Training Knowledge Graph Embeddings (paper) (reviews) (code)


🔍 Learning with Less Labels

💥 Few-shot Text Classification with Distributional Signatures (paper) (reviews) (code)

💥 Learning from Rules Generalizing Labeled Exemplars (paper) (reviews) (code)

➖ Locality and Compositionality in Zero-Shot Learning (paper) (reviews)

➖ Graph inference learning for semi-supervised classification (paper) (reviews)

➖ Automatically Discovering and Learning New Visual Categories with Ranking Statistics (paper) (reviews) (code)


🔍 Language Models and Transformers

NOTE: here is another summary of some of the papers on Transformers.

💥 Generalization through Memorization: Nearest Neighbor Language Models (paper) (reviews) (code)

From the discussion: see also "Forgetting Exceptions is Harmful in Language Learning" (1998) https://arxiv.org/abs/cs/9812021, on the same theme of generalization vs memorization.

Some interesting related work: BERT-kNN and RAG

See also: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation

Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization
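Back to the kNN-LM itself: a minimal sketch of the interpolation idea as I understand it (the datastore layout, the distance-based weighting, and the k and λ values below are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def knn_lm_probs(context_vec, lm_logits, keys, values, vocab_size, k=8, lam=0.25):
    """Mix the base LM distribution with a distribution built from the k nearest
    stored (context representation -> next token) pairs in a datastore.
    `keys` is a float tensor of stored context vectors; `values` is a LongTensor
    of the next-token ids observed after each stored context."""
    dists = torch.cdist(context_vec.unsqueeze(0), keys).squeeze(0)  # distance to every stored key
    knn_dist, knn_idx = dists.topk(k, largest=False)
    weights = F.softmax(-knn_dist, dim=-1)                          # closer neighbors count more
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, values[knn_idx], weights)                   # aggregate weight per token id
    p_lm = F.softmax(lm_logits, dim=-1)
    return lam * p_knn + (1.0 - lam) * p_lm
```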

💥 ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (paper) (reviews) (code)

💥 ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (paper) (reviews) (code)

💥 Reformer: The Efficient Transformer (paper) (reviews) (code)

➖ Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention (paper) (reviews) (code)

➖ StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding (paper) (reviews)

From the discussion:

➖ Lite Transformer with Long-Short Range Attention (paper) (reviews) (code)

➖ Compressive Transformers for Long-Range Sequence Modelling (paper) (reviews)

➖ Depth-Adaptive Transformer (paper) (reviews)

➖ LAMOL: LAnguage MOdeling for Lifelong Language Learning (paper) (reviews) (code)

➖ Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models (paper) (reviews) (code)

➖ On Identifiability in Transformers (paper) (reviews)

From the discussion:

➖ Are Transformers universal approximators of sequence-to-sequence functions? (paper) (reviews)

➖ Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction (paper) (reviews) (code)


🔍 Reasoning

➖ Neural Module Networks for Reasoning over Text (paper) (reviews) (code)

➖ Neural Symbolic Reader: Scalable Integration of Distributed and Symbolic Representations for Reading Comprehension (paper) (reviews)

➖ ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning (paper) (reviews) (code)

➖ (paper) (reviews) (code)


🔍 Reinforcement Learning

➖ Making Efficient Use of Demonstrations to Solve Hard Exploration Problems (paper) (reviews) [(code "coming soon")]

➖ SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards (paper) (reviews)

➖ On the Weaknesses of Reinforcement Learning for Neural Machine Translation (paper) (reviews)


🔍 Style Transfer and Generative Models

💥 A Probabilistic Formulation of Unsupervised Text Style Transfer (paper) (reviews) (code)

TL;DR: "We formulate a probabilistic latent sequence model to tackle unsupervised text style transfer, and show its effectiveness across a suite of unsupervised text style transfer tasks."


➖ Adjustable Real-time Style Transfer (paper) (reviews) (code)

TL;DR: "Stochastic style transfer with adjustable features."

➖ Controlling generative models with continuous factors of variations (paper) (reviews)

TL;DR: "A model to control the generation of images with GAN and beta-VAE with regard to scale and position of the objects."


➖ Understanding the Limitations of Conditional Generative Models (paper) (reviews)


🔍 Text Generation

💥 Plug and Play Language Models: A Simple Approach to Controlled Text Generation (paper) (reviews) (code)

Blog: https://eng.uber.com/pplm/

➖ Decoding As Dynamic Programming For Recurrent Autoregressive Models (paper) (reviews) (code)

Evaluation: Text infilling task (on SWAG and Daily Dialogue datasets); this method outperforms unidirectional decoding baselines.

➖ BERTScore: Evaluating Text Generation with BERT (paper) (reviews) (code)

From the discussion:

➖ The Curious Case of Neural Text Degeneration (paper) (reviews) (code)

The "nucleus sampling" (top-p sampling) paper.

➖ Neural Text Generation With Unlikelihood Training (paper) (reviews) (code)

➖ Data-dependent Gaussian Prior Objective for Language Generation (paper) (reviews) (code)

➖ Self-Adversarial Learning with Comparative Discrimination for Text Generation (paper) (reviews)

➖ Residual Energy-Based Models for Text Generation (paper) (reviews)

➖ Language GANs Falling Short (paper) (reviews) (code)


🔍 Miscellaneous

💥 Learning to Represent Programs with Property Signatures (paper) (reviews) (code)

TL;DR: "We represent a computer program using a set of simpler programs and use this representation to improve program synthesis techniques."

➖ Your classifier is secretly an energy based model and you should treat it like one (paper) (reviews) (code)

TL;DR: "We show that there is a hidden generative model inside of every classifier. We demonstrate how to train this model and show the many benefits of doing so."


➖ A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms (paper) (reviews) (code)

TL;DR: "This paper proposes a meta-learning objective based on speed of adaptation to transfer distributions to discover a modular decomposition and causal variables."


➖ CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning (paper) (reviews) (code)

Have we "almost solved video understanding"? 3D convolutional models (which take time into account) only perform slightly better than their 2D counterparts, yet the temporal aspect of frames is essential: real-world video understanding requires reasoning about object permanence, estimating intentions, and causal reasoning.

This paper presents a new dataset, CATER (Compositional Actions and Temporal Reasoning), and a series of benchmark tasks on the dataset which require temporal reasoning to solve. E.g., predict "rotate(cube) after slide(cone)" from the video clip. SOTA models struggle with this kind of temporal reasoning.
