ICLR 2020

Lalibela, Ethiopia by Trevor Cole on Unsplash.

Selections from ICLR 2020 (mostly NLP-related)

See also: https://www.theexclusive.org/2020/05/virtual-iclr.html



Some workshop talks

Workshops: see later.



Lots of "social" events, both topic and demographic based:

  • Topics in Language Research
  • Learning Representation for Cybersecurity
  • Research with πŸ€— Transformers
  • ICLR Town:


  1. Aisha Walcott-Bryant: AI + Africa = Global Innovation

  2. Leslie Kaelbling: Doing for Our Robots What Nature Did For Us

  3. Ruha Benjamin: 2020 Vision: Reimagining the Default Settings of Technology & Society

    • A discussion on how even apparently neutral technology can perpetuate discrimination. Technologists and researchers should be aware of the societal consequences of their work.
  4. Laurent Dinh: Invertible Models and Normalizing Flows

  5. Mihaela van der Schaar: Machine Learning: Changing the future of healthcare

  6. Devi Parikh: AI Systems That Can See And Talk

  7. Yann LeCun and Yoshua Bengio: Reflections from the Turing Award Winners

    • Yann LeCun: "The future is self-supervised". Challenges for Deep Learning: (1) learning with less labeled data (self-supervised learning!), (2) how to make reasoning compatible with gradient-based learning, i.e., beyond 'system 1', and (3) learning complex (hierarchical) action sequences (nothing to say here). Mostly a discussion of energy-based models (not too different from previous talks). "Could energy-based SSL be a basis for common sense?"

    • Yoshua Bengio: "Deep learning priors associated with conscious processing". Similar to this other recent talk.

      • ML and Consciousness ("Consciousness Prior")
      • The need for systematic generalization by dynamically recombining existing concepts, while avoiding the pitfalls of classical AI (e.g., we need uncertainty handling, distributed representations, efficient search, grounding in 'system 1', and large-scale training).
  8. Michael I. Jordan: The Decision-Making Side of Machine Learning: Dynamical, Statistical and Economic Perspectives. (Note: see also Artificial Intelligence – The Revolution Hasn’t Happened Yet.)



Observations: there was less of a distinction between posters and orals than in an IRL conference, as posters were just short talks. I thought the "poster" format worked very well, but I was much less likely to interact with the authors than in a non-virtual conference.

Some popular topics: reinforcement learning, adversarial ML, graph neural networks.

See also:

🔍 Adversarial ML and Robustness

💥 Robustness Verification for Transformers (paper) (reviews) (code)

TL;DR: "We propose the first algorithm for verifying the robustness of Transformers."

➖ Distributionally Robust Neural Networks for Group Shifts (paper) (reviews) (code)

Problem: models often "latch onto" spurious correlations: features that work on most training examples but don't solve the problem in the way we would expect. E.g., in image classification, waterbirds and water backgrounds often (but not always) co-occur. Overall accuracy may be high, but worst-group accuracy (e.g., waterbirds on land) can be very low.

Goal: achieve models that are more robust to spurious correlations with lower worst-group error.

Solution: Group distributionally robust optimization (DRO): minimize the worst-group's average loss, rather than the (overall) average loss. This requires knowing groups (attributes and labels) for each training example (but not at test time). A stochastic optimization algorithm is proposed and convergence guarantees are derived.

But: the worst-group error of Group DRO (at test time) is still high, i.e., poor generalization! Previous work on small convex or generative models says this shouldn't happen. This happens because the models are SOTA overparametrized neural networks. To solve this, use stronger regularization than usual (L2 penalty).
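A minimal sketch of the group DRO objective (worst-group average loss) in PyTorch-style code; the function and its arguments are illustrative, not the authors' implementation, and the paper optimizes this objective with a stochastic algorithm that maintains per-group weights rather than a hard max:

```python
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids, n_groups):
    """Group DRO objective: average the loss within each group,
    then take the maximum over groups instead of the overall mean."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_means = []
    for g in range(n_groups):
        mask = group_ids == g
        if mask.any():  # skip groups absent from this batch
            group_means.append(per_example[mask].mean())
    return torch.stack(group_means).max()

# Training minimizes worst_group_loss(...) together with strong L2 regularization
# (weight decay), which the paper finds necessary for overparametrized models.
```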

Evaluation: Two image classification datasets (CelebA and Waterbirds) and one NLI dataset (MultiNLI).

The remaining papers deal with adversarial ML in computer vision:

💥 Unrestricted Adversarial Examples via Semantic Manipulation (paper) (reviews) (code)

Problem: adversarial examples (for images) are often created using perturbations within a small ball; it is easy to defend against them using JPEG compression or randomized smoothing.

Contributions: introduce "semantically motivated" adversarial perturbations (manipulating color and texture) with no l_p bounds (unlike most perturbations in the literature, these are large, structured, explainable). It is shown these fool some common defenses (JPEG 75, Feature Squeezing, and adversarially-trained models).

  • Colorization attack: use a pre-trained colorization model and adversarially chosen "color hints" to re-colorize the image so that it fools the classifier, while keeping the colors close to the original (see the sketch after this list).
  • Texture attack: style transfer (transfer texture from another image). This works best with an image from the target adversarial class, but with similar features to the original image.
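Here is the sketch referenced above: a rough, hypothetical version of the recipe shared by both attacks, i.e., optimize the inputs of a pretrained image-manipulation network (a colorization model here; a style-transfer network for the texture attack) so the manipulated image fools the classifier. All names are placeholders, and the color-proximity term the authors use to keep the hints realistic is omitted:

```python
import torch

def color_hint_attack(colorizer, classifier, gray_image, hints, target_class,
                      steps=200, lr=0.01):
    """Optimize the "color hints" fed to a pretrained colorization network so the
    re-colorized image is classified as target_class, instead of adding small
    pixel perturbations directly."""
    hints = hints.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([hints], lr=lr)
    for _ in range(steps):
        colored = colorizer(gray_image, hints)       # large, structured, "semantic" change
        log_probs = torch.log_softmax(classifier(colored), dim=-1)
        loss = -log_probs[:, target_class].mean()    # push the prediction toward the target class
        opt.zero_grad()
        loss.backward()
        opt.step()
    return colorizer(gray_image, hints.detach())
```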

Evaluation:

  • Misclassification rate under various defenses. Also, attacks transfer.
  • User study: humans have difficulty in detecting the attack.
  • Caption attack: these adversarial images also fool image captioning systems! E.g., "A man is holding an apple" -> "A dog is holding an apple".

➖ Adversarial Training and Provable Defenses: Bridging the Gap (paper) (reviews) (code)

TL;DR: "We propose a novel combination of adversarial training and provable defenses which produces a model with state-of-the-art accuracy and certified robustness on CIFAR-10."

➖ Fast is better than free: Revisiting adversarial training (paper) (reviews) (code)

TL;DR: "FGSM-based adversarial training, with randomization, works just as well as PGD-based adversarial training: we can use this to train a robust classifier in 6 minutes on CIFAR10, and 12 hours on ImageNet, on a single machine." (cheaper than PGD).

➖ A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning (paper) (reviews) (code)


🔍 Neural Network Architectures

NOTE: Transformers and Graph Neural Networks get their own categories.

💥 Neural Stored-program Memory (paper) (reviews) (code)

Presents a new architecture which simulates a Universal Turing Machine.

➖ Mogrifier LSTM (paper) (reviews) (code)

Initial motivation -- input embeddings for language models are based on the average context; it might be better (particularly for verbs and function words) to use the actual context. But forget this! "Mogrify" the LSTM by adding more than one round of gating. This achieves lower perplexity than LSTMs and Transformer XL (on Penn Treebank and Wikitext-2).
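A small sketch of the gating rounds ("mogrification") as I understand them: before the usual LSTM update, the current input and the previous hidden state repeatedly scale each other. The number of rounds and the single list of projections below are simplifications of the paper's setup:

```python
import torch
import torch.nn as nn

class Mogrifier(nn.Module):
    """Alternately gate x with h and h with x for a few rounds, then hand the
    modulated (x, h) pair to an ordinary LSTM cell."""
    def __init__(self, dim, rounds=5):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(rounds)]
        )

    def forward(self, x, h):
        for i, proj in enumerate(self.projections):
            if i % 2 == 0:
                x = 2 * torch.sigmoid(proj(h)) * x
            else:
                h = 2 * torch.sigmoid(proj(x)) * h
        return x, h  # use these in place of the raw inputs of the LSTM cell
```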

Why does the Mogrifier work? There are many plausible reasons, none of them fully convincing. On a synthetic dataset, the Mogrifier LSTM also outperforms the LSTM (with larger gains for larger vocabulary size). "Sadly, we could not escape the deep learning pit and a convincing explanation remained elusive".

➖ On the Relationship between Self-Attention and Convolutional Layers (paper) (reviews) (code)

TL;DR: "A self-attention layer can perform convolution and often learns to do so in practice."

Transformers are great at NLP tasks. They can also reach SOTA accuracy on vision tasks (Bello et al. 2019; Ramachandran et al., 2019). Why does self-attention work so well for images? This paper shows that multi-head self-attention can express convolutions.

Blog: http://jbcordonnier.com/posts/attention-cnn/

Demo: https://epfml.github.io/attention-cnn/
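To make the claim concrete, here is a toy construction (mine, not the paper's): if each attention head attends with a one-hot, content-independent pattern to one fixed relative offset, and its value projection plays the role of that offset's filter, the layer computes an ordinary convolution. A 1D check with kernel size 3:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_in, d_out = 10, 4, 6
x = torch.randn(1, T, d_in)

offsets = [-1, 0, 1]                                  # one "head" per relative offset
W = [torch.randn(d_in, d_out) for _ in offsets]       # each head's value projection

# Attention with hard, position-only weights: the head for offset `off` attends from i to i + off.
out_attn = torch.zeros(1, T, d_out)
for off, Wh in zip(offsets, W):
    attn = torch.zeros(T, T)
    for i in range(T):
        if 0 <= i + off < T:
            attn[i, i + off] = 1.0                    # one-hot attention on a fixed offset
    out_attn += attn @ (x[0] @ Wh)

# The same computation as a convolution: kernel tap k holds the head for offsets[k].
kernel = torch.stack([Wh.t() for Wh in W], dim=-1)    # shape (d_out, d_in, 3)
out_conv = F.conv1d(x.transpose(1, 2), kernel, padding=1).transpose(1, 2)

print(torch.allclose(out_attn, out_conv, atol=1e-5))  # True
```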


🔍 Compositionality

💥 Permutation Equivariant Models for Compositional Generalization in Language (paper) (reviews) (code)

TL;DR: "We propose a link between permutation equivariance and compositional generalization, and provide equivariant language models."

Compositionality example: if one understands "Today I will run twice" and "I walk to school every day", one should also understand "I will have to walk twice around the store". The SCAN benchmark: machine translation between simple natural language commands (e.g., "jump", "walk left", "turn right twice") and 'machine actions' (e.g., JUMP, LTURN WALK, RTURN RTURN RTURN).

💥 Measuring Compositional Generalization: A Comprehensive Method on Realistic Data (paper) (reviews) (code)

TL;DR: "Benchmark and method to measure compositional generalization by maximizing divergence of compound frequency at small divergence of atom frequency."

Compositional Generalization: ability to generalize to unseen combinations of known components (atoms).

Goal: want to measure how much compositional generalization is required for a given train/test split.

"Compound divergence": a more comprehensive measure than previous approaches, assuming that (1) all test atoms occur in training, (2) the distribution of atoms is similar in train and test and (3) distribution of compounds is different between train and test. Compound divergence correlates well with previous ad-hoc methods.

Evaluation: Compositional Freebase Questions (CFQ) and SCAN. An LSTM+attention, Transformer and Universal Transformer are compared. Compound Divergence is a great predictor of accuracy! Current systems fail to generalize compositionally, even with large training data, while random split is easy. (But it appears Transformers outperform LSTM+attention by a wide margin for almost every value of compound divergence -- see also results on syntactic generalization in https://arxiv.org/pdf/2005.03692.pdf )

➖ Environmental drivers of systematicity and generalization in a situated agent (paper) (reviews)

TL;DR: "We isolate the environmental and training factors that contribute to emergent systematic generalization in a situated language-learning agent."

From the discussion: see An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution.

➖ Compositional Language Continual Learning (paper) (reviews) (code)

Goal: continually learn new words in seq2seq tasks (e.g., instruction learning using the SCAN dataset and machine translation).


🔍 Emergent Language

➖ Compositional languages emerge in a neural iterated learning model (paper) (reviews) (code)

TL;DR: "Use iterated learning framework to facilitate the dominance of high compositional language in multi-agent games."

➖ On the interaction between supervision and self-play in emergent communication (paper) (reviews) (code)


🔍 Explainability

💥 Learning The Difference That Makes A Difference With Counterfactually-Augmented Data (reviews) (paper) (code)

TL;DR: "Humans in the loop revise documents to accord with counterfactual labels, resulting resource helps to reduce reliance on spurious associations."

See also: Evaluating NLP Models via Contrast Sets

💥 Explanation by Progressive Exaggeration (paper) (reviews) (code)

TL;DR: "A method to explain a classifier, by generating visual perturbation of an image by exaggerating or diminishing the semantic features that the classifier associates with a target label."

Creating image counterfactuals with GANs.

➖ N-BEATS: Neural basis expansion analysis for interpretable time series forecasting (paper) (reviews)

TL;DR: "A novel deep interpretable architecture that achieves state of the art on three large scale univariate time series forecasting datasets."

➖ Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models (paper) (reviews) (code) (project)

TL;DR: "We propose measurement of phrase importance and algorithms for hierarchical explanation of neural sequence model predictions."


🔍 Graphs

💥 Spectral Embedding of Regularized Block Models (paper) (reviews) (code)

➖ GraphZoom: A Multi-level Spectral Approach for Accurate and Scalable Graph Embedding (paper) (reviews) (code)

➖ Low-dimensional statistical manifold embedding of directed graphs (paper) (reviews)


🔍 Graph Neural Networks

There were a lot of papers about GNNs. Here are a few I found interesting:

➖ A Fair Comparison of Graph Neural Networks for Graph Classification (paper) (reviews) (code)

➖ On the Equivalence between Positional Node Embeddings and Structural Graph Representations (paper) (reviews) (code)

➖ LambdaNet: Probabilistic Type Inference using Graph Neural Networks (paper) (reviews) (code)

➖ Strategies for Pre-training Graph Neural Networks (paper) (reviews) (code)

➖ What graph neural networks cannot learn: depth vs width (paper) (reviews)

➖ The Logical Expressiveness of Graph Neural Networks (paper) (reviews) (code)


🔍 Knowledge Graphs

➖ Probability Calibration for Knowledge Graph Embedding Models (paper) (reviews) (code)

➖ Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings (paper) (reviews) (code) (project)

➖ You CAN Teach an Old Dog New Tricks! On Training Knowledge Graph Embeddings (paper) (reviews) (code)


🔍 Learning with Less Labels

💥 Few-shot Text Classification with Distributional Signatures (paper) (reviews) (code)

💥 Learning from Rules Generalizing Labeled Exemplars (paper) (reviews) (code)

➖ Locality and Compositionality in Zero-Shot Learning (paper) (reviews)

➖ Graph inference learning for semi-supervised classification (paper) (reviews)

➖ Automatically Discovering and Learning New Visual Categories with Ranking Statistics (paper) (reviews) (code)


🔍 Language Models and Transformers

NOTE: here is another summary of some of the papers on Transformers.

💥 Generalization through Memorization: Nearest Neighbor Language Models (paper) (reviews) (code)

From the discussion: see also "Forgetting Exceptions is Harmful in Language Learning" (1998) https://arxiv.org/abs/cs/9812021, on the same theme of generalization vs memorization.

Some interesting related work: BERT-kNN and RAG

See also: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation

Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization
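Back to the kNN-LM itself: a minimal sketch of the interpolation idea as I understand it (the datastore layout, the distance-based weighting, and the k and λ values below are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def knn_lm_probs(context_vec, lm_logits, keys, values, vocab_size, k=8, lam=0.25):
    """Mix the base LM distribution with a distribution built from the k nearest
    stored (context representation -> next token) pairs in a datastore.
    `keys` is a float tensor of stored context vectors; `values` is a LongTensor
    of the next-token ids observed after each stored context."""
    dists = torch.cdist(context_vec.unsqueeze(0), keys).squeeze(0)  # distance to every stored key
    knn_dist, knn_idx = dists.topk(k, largest=False)
    weights = F.softmax(-knn_dist, dim=-1)                          # closer neighbors count more
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, values[knn_idx], weights)                   # aggregate weight per token id
    p_lm = F.softmax(lm_logits, dim=-1)
    return lam * p_knn + (1.0 - lam) * p_lm
```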

💥 ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (paper) (reviews) (code)

💥 ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (paper) (reviews) (code)

💥 Reformer: The Efficient Transformer (paper) (reviews) (code)

➖ Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention (paper) (reviews) (code)

➖ StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding (paper) (reviews)

From the discussion:

➖ Lite Transformer with Long-Short Range Attention (paper) (reviews) (code)

➖ Compressive Transformers for Long-Range Sequence Modelling (paper) (reviews)

➖ Depth-Adaptive Transformer (paper) (reviews)

➖ LAMOL: LAnguage MOdeling for Lifelong Language Learning (paper) (reviews) (code)

➖ Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models (paper) (reviews) (code)

➖ On Identifiability in Transformers (paper) (reviews)

From the discussion:

➖ Are Transformers universal approximators of sequence-to-sequence functions? (paper) (reviews)

➖ Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction (paper) (reviews) (code)


🔍 Reasoning

➖ Neural Module Networks for Reasoning over Text (paper) (reviews) (code)

➖ Neural Symbolic Reader: Scalable Integration of Distributed and Symbolic Representations for Reading Comprehension (paper) (reviews)

➖ ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning (paper) (reviews) (code)

➖ (paper) (reviews) (code)


🔍 Reinforcement Learning

➖ Making Efficient Use of Demonstrations to Solve Hard Exploration Problems (paper) (reviews) [(code "coming soon")]

➖ SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards (paper) (reviews)

➖ On the Weaknesses of Reinforcement Learning for Neural Machine Translation (paper) (reviews)


🔍 Style Transfer and Generative Models

💥 A Probabilistic Formulation of Unsupervised Text Style Transfer (paper) (reviews) (code)

TL;DR: "We formulate a probabilistic latent sequence model to tackle unsupervised text style transfer, and show its effectiveness across a suite of unsupervised text style transfer tasks."


➖ Adjustable Real-time Style Transfer (paper) (reviews) (code)

TL;DR: "Stochastic style transfer with adjustable features."

➖ Controlling generative models with continuous factors of variations (paper) (reviews)

TL;DR: "A model to control the generation of images with GAN and beta-VAE with regard to scale and position of the objects."


➖ Understanding the Limitations of Conditional Generative Models (paper) (reviews)


🔍 Text Generation

💥 Plug and Play Language Models: A Simple Approach to Controlled Text Generation (paper) (reviews) (code)

Blog: https://eng.uber.com/pplm/

➖ Decoding As Dynamic Programming For Recurrent Autoregressive Models (paper) (reviews) (code)

Evaluation: Text infilling task (on SWAG and Daily Dialogue datasets); this method outperforms unidirectional decoding baselines.

➖ BERTScore: Evaluating Text Generation with BERT (paper) (reviews) (code)

From the discussion:

➖ The Curious Case of Neural Text Degeneration (paper) (reviews) (code)

The "nucleus sampling" (top-p sampling) paper.

➖ Neural Text Generation With Unlikelihood Training (paper) (reviews) (code)

➖ Data-dependent Gaussian Prior Objective for Language Generation (paper) (reviews) (code)

➖ Self-Adversarial Learning with Comparative Discrimination for Text Generation (paper) (reviews)

➖ Residual Energy-Based Models for Text Generation (paper) (reviews)

➖ Language GANs Falling Short (paper) (reviews) (code)


🔍 Miscellaneous

💥 Learning to Represent Programs with Property Signatures (paper) (reviews) (code)

TL;DR: "We represent a computer program using a set of simpler programs and use this representation to improve program synthesis techniques."

➖ Your classifier is secretly an energy based model and you should treat it like one (paper) (reviews) (code)

TL;DR: "We show that there is a hidden generative model inside of every classifier. We demonstrate how to train this model and show the many benefits of doing so."


➖ A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms (paper) (reviews) (code)

TL;DR: "This paper proposes a meta-learning objective based on speed of adaptation to transfer distributions to discover a modular decomposition and causal variables."


➖ CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning (paper) (reviews) (code)

Have we "almost solved video understanding"? 3D convolutional models (which take time into account) only perform slightly better than their 2D counterparts, yet the temporal aspect of frames is essential: real-world video understanding requires reasoning about object permanence, estimating intentions, and causal reasoning.

This paper presents a new dataset, CATER (Compositional Actions and Temporal Reasoning), and a series of benchmark tasks on the dataset which require temporal reasoning to solve. E.g., predict "rotate(cube) after slide(cone)" from the video clip. SOTA models struggle with this kind of temporal reasoning.
