# ICLR 2019 notes

## TLDR
* Lots of reinforcement learning, didn't focus so much on this.
* Some other keywords (that felt thematic to this conference)
    * Robustness
    * Transfer learning
    * Unsupervised learning
    * Meta-learning
* Some of the posters can be found here https://postersession.ai/


## Day 1

### Invited talk: Recent developments in fairness (Cynthia Dwork)
* Different definitions of fairness.
* Learning fair representations where the sensitive feature has been removed (adversarial fairness).

### Invited talk: Learning representations using causal invariance (Leon Bottou)
* Learning algorithms often capture correlations we are not actually interested in.
* Suppose we have datasets each showcasing the same concept but under different biases (environments).
* We want to learn what is common between them and ignore the spurious correlations.
* We can do this by projecting to a representation that has a causal invariance criterion.
* Shuffling data could be a loss of information.
    * Lose information about what has changed and what remains under different circumstances (environments).

### BigGAN
* Conditional image generation.
* Self-attention GAN is the base model they improve on.
* Tricks from other papers
    * Spectral normalization in G and D.
    * Imbalanced learning rates and number of train steps for G and D.
    * Hinge GAN loss.
    * Class conditional batch norm.
    * Shared batch norm statistics across devices.
    * Orthogonal initialization.
    * Exponential moving average of weights to produce final model.
* Innovations
    * More parameters are beneficial (scaling up).
    * Larger batch size is beneficial.
    * Orthogonal regularization.
    * Shared embedding space with linear projection at each resblock.
    * Hierarchical latent space.
    * Truncation trick: resampling latent values whose magnitude is too high gives a way to trade off quality against variation.
* Insights
    * Guaranteeing training stability comes with a quality cost (with this approach at least).
    * Instead it seems better (at least for now) to let it train with unstable hparams and then just take the last checkpoint before training goes bad.
* Good paper for overview of many modern GAN tricks.
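
The truncation trick from the innovations list can be sketched roughly like this. It is a minimal NumPy sketch assuming a standard normal latent; the function name `truncated_noise` and the threshold value are made up for illustration and are not taken from the paper.

```python
import numpy as np

def truncated_noise(shape, threshold, rng):
    """Sample z ~ N(0, I), resampling every entry whose magnitude exceeds threshold."""
    z = rng.standard_normal(shape)
    mask = np.abs(z) > threshold
    while mask.any():
        # Redraw only the offending entries and check them again.
        z[mask] = rng.standard_normal(int(mask.sum()))
        mask = np.abs(z) > threshold
    return z

rng = np.random.default_rng(0)
z = truncated_noise((4, 128), threshold=0.5, rng=rng)  # hypothetical batch of latents
```

A lower threshold keeps samples closer to the mode (higher quality, less variation), a higher threshold does the opposite.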

### The Lottery Ticket Hypothesis
* It's often possible to prune trained networks to make them smaller in size and faster to do inference with.
* However, it's also difficult to train the pruned network architecture from scratch.
* The lottery ticket hypothesis states that the big unpruned nets contain sub-networks that got the "winning ticket", i.e. got a good initialization that would allow it to be trained in isolation and still reach comparable performance to the full network.
* This paper presents an algorithm to find winning tickets.
    * But only after training has converged?
    * Does the winning ticket work well if we change to another (similar) dataset? E.g. another marketplace.
* Question: when training the full network, the weights of a sub-network would still be touched by gradients from other weights. How does this influence the whole thing?
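
As far as I understand it, finding a winning ticket boils down to: train, prune the smallest-magnitude weights, and reset the survivors to their original initialization. Below is a minimal sketch of one such round. The "training" is a random perturbation standing in for real training, and `magnitude_mask` is a name made up for this example.

```python
import numpy as np

def magnitude_mask(trained, prune_fraction, mask=None):
    """Return a 0/1 mask that drops the smallest-magnitude surviving weights."""
    if mask is None:
        mask = np.ones_like(trained)
    alive = np.abs(trained[mask == 1])
    k = int(prune_fraction * alive.size)
    if k == 0:
        return mask
    cutoff = np.sort(alive)[k - 1]
    new_mask = mask.copy()
    # Ties at the cutoff would prune slightly more than k weights.
    new_mask[np.abs(trained) <= cutoff] = 0
    return new_mask

rng = np.random.default_rng(0)
w_init = rng.standard_normal((8, 8))
w_trained = w_init + 0.1 * rng.standard_normal((8, 8))  # stand-in for real training
mask = magnitude_mask(w_trained, prune_fraction=0.5)
ticket = mask * w_init  # surviving weights reset to their original initialization
```

Repeating the round with the returned mask would give an iterative version of the same idea.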

### Workshop: Deep generative models for Structured Data

#### Continuous-Output Language Generation (Yulia Tsvetkov)
* A softmax layer is often used as output layer for language generation models.
* It's very computationally heavy for large vocabulary sizes (and high memory usage).
* This often forces a limit on the vocabulary size.
* GANs for text have not worked well because the output is non-continuous, which makes it hard to get gradients back.
* The suggestion here is to instead output a continuous embedding which alleviates the computational complexity and is an approach for text GANs.
* Questions: Only greedy decoding right now, but they were working on something else.
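
To make the idea concrete: if the model outputs an embedding instead of a softmax over the vocabulary, decoding becomes a nearest-neighbour lookup in a pretrained embedding table. The sketch below assumes cosine similarity as the lookup metric; the tiny vocabulary and the function name `nearest_token` are invented for illustration.

```python
import numpy as np

def nearest_token(predicted, embedding_table):
    """Index of the vocabulary embedding with the highest cosine similarity."""
    table = embedding_table / np.linalg.norm(embedding_table, axis=1, keepdims=True)
    query = predicted / np.linalg.norm(predicted)
    return int(np.argmax(table @ query))

vocab = ["cat", "dog", "car"]                       # toy vocabulary
E = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])  # toy 2-d embeddings
token = vocab[nearest_token(np.array([0.1, 0.9]), E)]
```

The output layer then only has to produce a vector of the embedding dimension, independent of vocabulary size.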


## Day 2

### Invited talk: Adversarial Machine Learning (Ian Goodfellow)
* Overview talk on adversarial machine learning.
* He went through different areas within machine learning where adversarial machine learning is used.
    * Generative modeling.
    * Security (adversarial examples).
    * Fairness.
    * Domain adaptation.
    * Label efficiency.
    * etc.

### Learning Robust Representations by Projecting out Superficial Statistics
TODO

### Poster session 1

#### Switchable Normalization
* They introduce a new type of normalization layer, Switch Norm (SN), that learns importance weights of BN, IN, and LN.
* At each layer, separate statistics are collected according to BN, IN, LN. These statistics are then weight-averaged and used in the normalization, i.e. subtract the weighted mean and divide by the weighted stddevs.
* They see that SN prefers BN for backbone networks and LN in layers close to the head of object detection models.
* They see that it picks IN for style transfer models (which is what they use I think), so it seems to make good choices.
* Seems nice to not have to experiment with different normalization layers.
* Robust to varying batch sizes and works well even for very small batch sizes.
* Doesn't have sensitive hyper params like GN.
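
A rough sketch of the weighted-statistics idea, assuming NCHW inputs and separate softmax importance weights for the means and the variances. The learnable scale and shift are left out, and `switch_norm` is a name chosen for this example, not the authors' code.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def switch_norm(x, mean_logits, var_logits, eps=1e-5):
    """x: (N, C, H, W). Blend IN, LN and BN statistics, then normalize.
    Logits are ordered [IN, LN, BN]."""
    axes = [(2, 3), (1, 2, 3), (0, 2, 3)]  # reduction axes for IN, LN, BN
    means = [x.mean(axis=a, keepdims=True) for a in axes]
    variances = [x.var(axis=a, keepdims=True) for a in axes]
    wm, wv = softmax(mean_logits), softmax(var_logits)
    mean = sum(w * m for w, m in zip(wm, means))
    var = sum(w * v for w, v in zip(wv, variances))
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 5, 5))
bn_only = np.array([-100.0, -100.0, 0.0])  # weights that effectively select BN
y = switch_norm(x, bn_only, bn_only)
```

With logits that put all the weight on one normalizer, the layer falls back to that normalizer, which is how it can "pick" BN, IN, or LN per layer.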

#### The Singular Values of Convolutional Layers
* Singular values that are very large or very small are bad because they lead to exploding or vanishing gradients.
* The operator norm is the maximum singular value, so regularizing it means constraining that maximum.
* The effect of this is that linear transformations (like conv layers) can't make too big changes (Lipschitz constant).
* Regularizing the operator norm can lead to better generalization and robustness.
* Previous work has also identified this as a problem but only used approximations to compute the singular values for conv layers (spectral norm?).
* Operator norm is complementary to batch norm.
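
As I recall, the paper computes the singular values exactly by taking a 2-d FFT of the kernel and then an SVD per frequency. The sketch below follows that recipe under the assumption of circular (wrap-around) padding and a kernel laid out as `(k, k, c_in, c_out)`; treat it as an illustration rather than the authors' exact code.

```python
import numpy as np

def conv_singular_values(kernel, input_shape):
    """kernel: (k, k, c_in, c_out); input_shape: (n, n) spatial size.
    Assumes circular padding, so the conv is diagonalized by the FFT."""
    transforms = np.fft.fft2(kernel, s=input_shape, axes=(0, 1))
    # One small c_in x c_out matrix per spatial frequency.
    return np.linalg.svd(transforms, compute_uv=False)

def operator_norm(kernel, input_shape):
    return conv_singular_values(kernel, input_shape).max()

kernel = np.zeros((3, 3, 1, 1))
kernel[0, 0, 0, 0] = 2.0  # this convolution just scales its input by 2
norm = operator_norm(kernel, (8, 8))
```

Clipping the singular values and transforming back would be one way to enforce a bound on the operator norm.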

#### Approximating CNNs with bag of local features
* Aka BagNet
* A CNN is applied to small patches of the full image and outputs logits for each patch.
* Many patches are evaluated like this, which yields a heatmap per class.
* The heatmaps are summed to produce the "votes" per class which is then fed into a softmax for predictions.
    * Maybe the dog class was activated in many local patches, which would give it a lot of "votes".
* Size of patches?
* This had a connection to the texture bias that CNNs were shown to have.
    * Would probably fail for the style-transferred dataset (where texture was altered) in that paper.
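
The voting step described above is tiny when written out. A minimal sketch, assuming the per-patch logits are already computed and stored as a `(rows, cols, classes)` array; the function name is made up.

```python
import numpy as np

def bag_of_patches_predict(patch_logits):
    """patch_logits: (n_patches_y, n_patches_x, n_classes) heatmaps of logits."""
    votes = patch_logits.sum(axis=(0, 1))  # total evidence per class
    exp = np.exp(votes - votes.max())      # numerically stable softmax
    return exp / exp.sum()

heatmaps = np.zeros((6, 6, 3))
heatmaps[1:4, 1:4, 1] = 2.0  # class 1 (say "dog") fires in many local patches
probs = bag_of_patches_predict(heatmaps)
```

Because the votes are a plain sum, the spatial arrangement of the patches has no influence on the prediction.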

### Poster session 2
Lots of papers about GANs and adversarial examples. Are adversarial examples something we need to consider at Schibsted?

#### Fixup Initialization
* Normalization in deep learning is often credited with
    * stabilizing training
    * enabling higher learning rates
    * accelerating convergence
    * improving generalization
* The reasons for these effects have not been proven yet and authors show that they are not unique to normalization and suggest this alternative method.
* An alternative to normalization for residual networks called *fixup* or *fixed-update* initialization.
* Cool because saving on memory usage.
* The steps to convert a resnet with normalization to fixup initialization instead:
    * Remove the normalization layers.
    * For each residual branch:
        * Initialize one weight layer (conv) to zeros.
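        * Initialize the other ones with some standard initialization and then rescale the weights by $L^{-\frac{1}{2m - 2}}$, where $L$ is the number of residual branches and $m$ is the number of weight layers per branch.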
        * Add a scalar multiplier initialized to 1.
        * Add a scalar bias initialized to 0 before each weight and before each activation layer.
* Experiments in image classification (resnet variants) and machine translation (transformers).
* Should be useful for small batch sizes? How does it compare to normalization techniques designed for smaller batch sizes (e.g. layer norm)?
* How does it relate to self-normalizing networks (SELU)?
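
A small sketch of the weight part of the conversion steps, using dense layers instead of convs to keep it short. It assumes He initialization as the "standard" scheme and zeroes the last layer of the branch; scalar biases and multipliers are omitted, and `fixup_branch_init` is a hypothetical helper name.

```python
import numpy as np

def fixup_branch_init(shapes, num_branches, rng):
    """Initialize the weight layers of one residual branch.
    shapes: list of (fan_out, fan_in) for the m >= 2 layers in the branch.
    num_branches: L, the number of residual branches in the network."""
    m = len(shapes)
    scale = num_branches ** (-1.0 / (2 * m - 2))
    weights = []
    for fan_out, fan_in in shapes[:-1]:
        he = rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)
        weights.append(scale * he)         # rescaled standard init
    weights.append(np.zeros(shapes[-1]))   # one layer of the branch starts at zero
    return weights, scale

rng = np.random.default_rng(0)
weights, scale = fixup_branch_init([(16, 16), (16, 16)], num_branches=16, rng=rng)
```

With the zeroed layer, every residual branch outputs zero at initialization, so the network starts out as the identity along the skip connections.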


## Day 3

### Invited talk: Learning Deep Representations by Mutual Information Estimation and Maximization (Devon Hjelm)
* Unsupervised learning of image representations.
* Key is to maximize the mutual information (MI) between an image and the encoder's computed representation of it.
* MI is hard to compute, but recent advances in estimating it with a neural network are leveraged.
* MI with full input doesn't always work so well, rather MI with local regions of the input works better.
* MI is combined with prior matching (like adversarial autoencoder) to get desired statistical properties of representation.
* Sort of a triplet learning thing, maximize mutual information between $X_{image}$ and representation $y_{image}$ while minimizing mutual information between $X_{other}$ and $y_{image}$.
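
One way to write down this "pull matching pairs together, push mismatched pairs apart" objective is a Jensen-Shannon style MI estimator on critic scores. Whether this is exactly the estimator presented in the talk is an assumption on my part, and the scores here are plain arrays standing in for the output of a critic network.

```python
import numpy as np

def softplus(t):
    return np.logaddexp(0.0, t)

def jsd_mi_objective(pos_scores, neg_scores):
    """Objective to maximize.
    pos_scores: critic scores for (image, its own representation) pairs.
    neg_scores: critic scores for (other image, this representation) pairs."""
    return np.mean(-softplus(-pos_scores)) - np.mean(softplus(neg_scores))

good = jsd_mi_objective(np.array([5.0]), np.array([-5.0]))        # critic separates pairs
uninformed = jsd_mi_objective(np.array([0.0]), np.array([0.0]))   # critic knows nothing
```

The objective goes up as the critic scores matching pairs high and mismatched pairs low, which only works if the representation keeps information about its image.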

### Poster session 1

### Poster session 2

#### Towards Understanding Regularization in Batch Normalization
* Batchnorm improves both convergence speed and generalization.
* TODO

#### Decoupled weight decay
* Paper from previous paper reading session.
* L2 regularization plus Adam often gives bad results.
* Don't let Adam make its updates based on an L2-regularized loss.
* Instead, let Adam update based on the normal loss only and then do the weight decay as a separate step.
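
The difference between the two variants fits in a single update function. This is a minimal NumPy sketch of one Adam step; the parameter names `l2` and `decoupled_decay` are made up here, and learning-rate schedules are ignored.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_decay=0.0):
    g = grad + l2 * w                    # L2 variant: penalty goes through Adam
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    adam_direction = m_hat / (np.sqrt(v_hat) + eps)
    # Decoupled variant: the decay term bypasses the adaptive scaling.
    w = w - lr * (adam_direction + decoupled_decay * w)
    return w, m, v

w0 = np.array([1.0])
zero = np.zeros(1)
w_l2, _, _ = adam_step(w0, zero, zero, zero, t=1, lr=0.1, l2=0.1)
w_dec, _, _ = adam_step(w0, zero, zero, zero, t=1, lr=0.1, decoupled_decay=0.1)
```

In the L2 case the penalty gets rescaled by Adam's normalization (here the first step has size `lr` no matter how small the penalty is), while the decoupled case shrinks the weight by exactly `lr * decay * w`.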

#### Deep Anomaly Detection with Outlier Exposure
TODO

## Day 4

### Pay Less Attention with Lightweight and Dynamic Convolutions
* The authors ask themselves whether self-attention (like in transformers) is required to get good performance and whether a more limited context is actually enough for many NLP tasks.
* More limited context (as with CNNs) is interesting because it would be faster.
* Standard CNNs have fixed weights over time which is less than ideal compared to self-attention and RNNs.
* They introduce dynamic convolutions to address this.
* Dynamic convolutions take the current word embedding and compute a kernel to apply to the neighborhood in order to compute the next representation layer.
* A challenge lies in the fact that a lot of parameters must be predicted (~100M per layer). This is handled first by using depthwise convolutions (which need a lot fewer parameters, but still not few enough? they do experiments with both) and second by *lightweight convolutions* where some weights (heads) are shared.
* They modify a transformer to use dynamic convolutions (n x lightweight convolutions) and get slightly better accuracy and also faster in some translation tasks.
* Also really useful in text summarization in their experiments.
* They also do experiments with only lightweight convolutions, I guess same kernel for all time steps but still computed based on something?
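
My reading of the two building blocks, as a sketch: a lightweight convolution is a depthwise convolution whose kernel is softmax-normalized over its width and shared by groups of channels (heads), and a dynamic convolution predicts that kernel from the current time step's embedding with a linear layer. The causal padding, shapes, and function names below are assumptions for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lightweight_conv(x, kernels):
    """x: (T, d). kernels: (T, H, k), one kernel per time step and head.
    The d channels are split into H groups that share a kernel.
    Causal: step t only sees steps t-k+1 .. t."""
    T, d = x.shape
    _, H, k = kernels.shape
    w = softmax(kernels, axis=-1)                      # normalize over kernel width
    padded = np.vstack([np.zeros((k - 1, d)), x])
    out = np.zeros((T, d))
    for t in range(T):
        window = padded[t:t + k]                       # (k, d)
        per_channel = np.repeat(w[t], d // H, axis=0)  # (d, k)
        out[t] = (window * per_channel.T).sum(axis=0)
    return out

def dynamic_conv(x, W_kernel, H, k):
    """Predict the kernel at each time step from that step's embedding alone.
    W_kernel: (d, H * k) weights of the hypothetical kernel predictor."""
    kernels = (x @ W_kernel).reshape(x.shape[0], H, k)
    return lightweight_conv(x, kernels)
```

A static lightweight convolution is the special case where the same learned `(H, k)` kernel is used at every time step, for example via `np.broadcast_to(kernel, (T, H, k))`.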

### Smoothing the Geometry of Probabilistic Box Embeddings
* Vector embeddings have some problems.
    * Can't capture regions (mammal region should cover all mammal species).
    * Can't capture asymmetry (rabbit is a mammal, but mammal is not a rabbit). At least not with similarity measures like dot product.
* One attempt to address this is Gaussian representations, which also have disjointness (can avoid region overlap of two concepts), but they are not closed under intersection (the intersection of two Gaussians is not necessarily a Gaussian).
* Another representation is via cones (what does this mean? a box that extends to infinity?) which has the closed under intersection property but not the disjointness property.
* Box representation solves both of these and has all four properties.
* The representations are initialized randomly.
* When training this there are difficulties with lack of gradients when a box is outside another box that it should be inside because the probability is zero.
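
A toy example of the box idea and of the zero-gradient problem. Boxes are `(lower, upper)` corner pairs in the unit cube and probabilities are volume ratios; the concrete boxes are invented for illustration.

```python
import numpy as np

def box_volume(lo, hi):
    # Side lengths are clipped at zero so that empty intersections get volume 0.
    return np.prod(np.clip(hi - lo, 0.0, None))

def conditional_prob(a, b):
    """P(a | b) = vol(intersection of a and b) / vol(b)."""
    lo = np.maximum(a[0], b[0])
    hi = np.minimum(a[1], b[1])
    return box_volume(lo, hi) / box_volume(*b)

mammal = (np.array([0.0, 0.0]), np.array([0.6, 0.6]))
rabbit = (np.array([0.1, 0.1]), np.array([0.3, 0.3]))
stray = (np.array([0.8, 0.8]), np.array([0.9, 0.9]))   # should be inside mammal
nudged = (np.array([0.79, 0.79]), np.array([0.89, 0.89]))

p_mammal_given_rabbit = conditional_prob(mammal, rabbit)  # rabbit lies inside mammal
p_rabbit_given_mammal = conditional_prob(rabbit, mammal)  # asymmetric
p_stray = conditional_prob(mammal, stray)
p_nudged = conditional_prob(mammal, nudged)
```

Nudging the disjoint box slightly towards `mammal` leaves the probability at exactly zero, so there is no gradient signal telling it which way to move. That is the training difficulty the smoothing is meant to address.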

### Ordered Neurons: Integrating Tree Structures into RNNs (best paper award)
TODO

### Poster session 1

#### Poincaré GloVe: Hyperbolic Word Embeddings
TODO

#### Hyperbolic Attention Networks
TODO

#### Universal Transformers
* RNNs are slow to train, can't benefit from parallelization. 
* RNNs are difficult to train due to long range dependencies.
* Vanilla transformers were a solution to this in some tasks.
* But vanilla transformers are bad in other tasks, e.g. would fail to generalize in simple tasks like copying strings.
* Universal Transformers (UT) are introduced to solve these weaknesses of transformers.
* UTs are still parallel in time, i.e. each time step can access/attend to representations from all other time steps.
* UTs instead have recurrent depth coupled with dynamic halting of recurrence per time step.
* They got sota on both some machine translation task and the type of tasks in which vanilla transformers would fail.
* Author said it should work well as a drop in replacement for vanilla transformers.
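
A heavily simplified sketch of "recurrent depth with per-position halting": the same step function is applied repeatedly, and each position stops being updated once its accumulated halting probability crosses a threshold. The real model uses self-attention inside the step and a more careful halting scheme; `step_fn` and `halt_fn` are hypothetical callables standing in for those parts.

```python
import numpy as np

def universal_encode(x, step_fn, halt_fn, max_steps=8, threshold=0.99):
    """x: (T, d).
    step_fn(state) -> new state (T, d), same weights at every depth.
    halt_fn(state) -> per-position halting probability, shape (T,)."""
    state = x.copy()
    cumulative = np.zeros(x.shape[0])
    steps_taken = np.zeros(x.shape[0], dtype=int)
    for _ in range(max_steps):
        running = cumulative < threshold
        if not running.any():
            break
        new_state = step_fn(state)           # all positions computed in parallel
        state[running] = new_state[running]  # halted positions keep their state
        cumulative[running] += halt_fn(state)[running]
        steps_taken[running] += 1
    return state, steps_taken

toy_step = lambda s: s + 1.0                     # stand-in for the shared block
toy_halt = lambda s: np.array([1.0, 0.5, 0.1])   # fixed halting probabilities
state, steps = universal_encode(np.zeros((3, 2)), toy_step, toy_halt)
```

Positions that are "easy" stop after few steps while "hard" ones keep refining, up to `max_steps`.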

#### Structured Neural Summarization
* Most standard summarization models so far have been based on some sequence-to-sequence model.
* Here they add a way to exploit known relationships between elements in the input by combining sequence encoders with graph neural networks (GNN).
* Input is a sequence of tokens which is fed through a sequence encoder to get representations for each token.
* These representations are fed into the GNN as initial node representations which computes updated states that can then be fed to a decoder.
* The GNN also takes the binary relationships between nodes as input.
* These relationships can be
    * Inferred sentence structure relationships (these statistical models are accurate enough).
    * Relationships to describe which node follows which node in the sequence.
    * Relationships describing if a token represents a person, or similar knowledge graph relations.
    * Relationships describing that all words in a sentence belong to the sentence.
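
To illustrate what the binary relationships might look like as GNN input, here is a small sketch that builds two of the relation types listed above ("which token follows which" and "token belongs to sentence") as typed edges. The relation names and the `<sentence>` node label are made up, and the parser-based and entity-based relations are left out.

```python
def build_edges(sentences):
    """sentences: list of token lists.
    Returns (nodes, edges) where edges are (source_index, relation, target_index)."""
    nodes, edges = [], []
    for sentence in sentences:
        sent_node = len(nodes)
        nodes.append("<sentence>")            # one extra node per sentence
        for token in sentence:
            tok = len(nodes)
            nodes.append(token)
            edges.append((tok, "in_sentence", sent_node))
    token_ids = [i for i, n in enumerate(nodes) if n != "<sentence>"]
    for a, b in zip(token_ids, token_ids[1:]):
        edges.append((a, "next", b))          # sequence order across the whole input
    return nodes, edges

nodes, edges = build_edges([["a", "b"], ["c"]])
```

The sequence encoder's output for each token would then serve as the initial state of the matching node.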

### Poster session 2