# ICML 2018 Notes
Some notes taken during ICML 2018.

## Temporal Point Processes
* Models for discrete events in continuous time, i.e. events that arrive at irregular intervals?
* E.g. financial data, social media activity, user/article interaction data
* Intensity function $\lambda(t) = \frac{P(\text{event happening between } t \text{ and } t + dt)}{dt}$
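
To make the intensity function concrete, here is a minimal sketch that samples event times by thinning. It assumes the simplest case, where $\lambda(t)$ does not depend on earlier events (an inhomogeneous Poisson process); `example_intensity`, `lam_max` and `t_end` are illustrative choices, not from the talk.

```python
import math
import random

def simulate_by_thinning(intensity, lam_max, t_end, rng):
    """Sample event times on [0, t_end) from an inhomogeneous Poisson process.

    Assumes intensity(t) <= lam_max for all t in [0, t_end).
    """
    events = []
    t = 0.0
    while True:
        # Candidate arrival from a homogeneous process with rate lam_max.
        t += rng.expovariate(lam_max)
        if t >= t_end:
            return events
        # Keep the candidate with probability intensity(t) / lam_max.
        if rng.random() < intensity(t) / lam_max:
            events.append(t)

def example_intensity(t):
    # Hypothetical intensity: oscillates between 0 and 2 events per unit time.
    return 1.0 + math.sin(t)

rng = random.Random(0)
events = simulate_by_thinning(example_intensity, lam_max=2.0, t_end=100.0, rng=rng)
```

Events cluster where `example_intensity` is high and thin out where it is close to zero, which is the "irregular events" behaviour these models target.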

## Toward Theoretical Understanding of Deep Learning
* Optimization
    * We don't really know that much yet, many studies are done on small networks or networks without nonlinearities.
* Overparametrization
    * e.g. why doesn't a big network with 20M params overfit on CIFAR-10 with 50K samples? We don't really know why.
    * Folklore(?) experiment
        * Generate labeled data by feeding random data into a depth 2 net.
        * Then try to train a new network to learn this. Much easier with a big network.
    * Longtime belief: SGD + regularization eliminates "excess capacity"
    * Noise stability of trained networks, i.e. if noise is injected at some layer, the difference in output further down should be small. We don't really know why though.
        * Related to generalization.
        * This could be used to prove that the network is compressible.
* Role of depth
    * Not only about expressiveness but also about making optimization easier, which is a bit counterintuitive.
* Mentioned during questions after: *The lottery ticket hypothesis* paper - with a big network we get more "chances" at finding a solution that can then be pruned leading to better performance than starting with the pruned architecture.
    1. Train big network, prune it.
    2. Take pruned network, reset weights to their initial value (from big). "The winning ticket."
    3. Train this network. Better performance.
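
The three steps above can be sketched on a toy problem. This is only an illustration under loud assumptions: a masked linear model stands in for the network, pruning is by weight magnitude, and `magnitude_mask`, `train`, the data and all hyperparameters are hypothetical names and values, not taken from the paper.

```python
import random

def magnitude_mask(weights, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude weights."""
    n_keep = max(1, int(round(keep_fraction * len(weights))))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [1.0 if i in keep else 0.0 for i in range(len(weights))]

def train(init, mask, data, lr=0.05, epochs=200):
    """SGD on squared error for a masked linear model (toy stand-in for a net)."""
    w = [wi * mi for wi, mi in zip(init, mask)]
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Masked entries get zero gradient, so pruned weights stay at zero.
            w = [wi - lr * err * xi * mi for wi, xi, mi in zip(w, x, mask)]
    return w

rng = random.Random(0)
true_w = [2.0, 0.0, 0.0, -3.0, 0.0, 0.0, 0.0, 0.0]  # only two useful weights
data = []
for _ in range(50):
    x = [rng.uniform(-1, 1) for _ in true_w]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

w_init = [rng.uniform(-0.1, 0.1) for _ in true_w]
dense_mask = [1.0] * len(true_w)

# 1. Train the big model, then prune it by weight magnitude.
w_dense = train(w_init, dense_mask, data)
mask = magnitude_mask(w_dense, keep_fraction=0.25)
# 2. Reset the surviving weights to their initial values: the "winning ticket".
w_ticket = [wi * mi for wi, mi in zip(w_init, mask)]
# 3. Train the ticket with the pruned weights held at zero.
w_final = train(w_ticket, mask, data)
```

A linear model cannot show the performance gap the hypothesis is about; the sketch only pins down the bookkeeping (which weights are kept, and that they restart from the original initialization rather than a fresh one).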

## Variational Bayes and Beyond: Bayesian Inference for Big Data
* Good introduction to variational inference in general and how it fits in among different techniques in approximate Bayesian inference. The following describes a subset of approximate inference, leading down to how to optimize the variational parameters for mean field VB. Note that other choices can be made at each step, leading to other inference methods (e.g. MCMC).
    1. Use $q^*(z)$ to approximate $p(z|y)$
    2. Optimization $q^* = \arg\min_{q \in Q} f(q(z), p(z|y))$ ($f$ is some distance/divergence measure)
    3. Variational Bayes $q^* = \arg\min_{q \in Q} \mathrm{KL}(q(z) \| p(z|y))$
    4. Mean field VB $q^* = \arg\min_{q \in Q_{MFVB}} \mathrm{KL}(q(z) \| p(z|y))$
    5. Coordinate ascent / gradient ascent (stochastic variational inference)
* People mean the same thing by *variational Bayes* and *variational inference*, right?
* Problems with MFVB:
    * Problems with choice of KL divergence direction, missing modes or overcompensating.
    * Problems with underestimation of variance.
        * Many papers try to improve on this.
* One way to evaluate seems to be to just compare to the results of a long-running MCMC chain: if the VI approximation closely resembles the MCMC posterior approximation, it's good.
* Bayesian coresets to scale to big data.
* Bayesian coresets are basically a smarter way of subsampling a dataset to make inference more scalable. In short, I understood it as a way of picking representative datapoints and reweighting them to reflect their importance based on what they represent?
    * Not just importance sampling.
    * Points don't necessarily have to be inside some cluster of data points, as might seem intuitive? Rather, they are picked to mimic the loss computed using all data points.
* Uniform subsampling might miss important points.
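
For steps 3 to 5 of the numbered list above, the standard identities are worth writing out. These are textbook results rather than anything specific to the tutorial, and the block index $j$ is notation introduced here. The KL objective cannot be evaluated directly because it contains $p(z|y)$, but it differs from the evidence lower bound only by a constant:

$$
\log p(y) = \underbrace{\mathbb{E}_{q(z)}\left[\log p(z, y) - \log q(z)\right]}_{\mathrm{ELBO}(q)} + \mathrm{KL}(q(z) \| p(z|y))
$$

Since $\log p(y)$ does not depend on $q$, minimizing the KL term is the same as maximizing $\mathrm{ELBO}(q)$. Mean field VB restricts $q$ to factorized distributions, and coordinate ascent then updates one factor at a time while holding the others fixed:

$$
Q_{MFVB} = \Big\{ q : q(z) = \prod_j q_j(z_j) \Big\}, \qquad q_j^*(z_j) \propto \exp\left( \mathbb{E}_{q_{-j}}\left[\log p(z, y)\right] \right)
$$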

## Obfuscated Gradients, False Sense of Security
* An adversarial example is an input that is misclassified (with high confidence?) but still looks very much like a correctly classified example, at least to a human.
* Security threat: e.g. misclassified traffic signs, which is obviously a big problem for autonomous driving.
* It's relatively easy to generate these. Speaker mentions different scenarios where the attacker has access to different parts of the network (output logits, loss, just the top-k labels etc.) and still manages to perform attacks.
* Some try to obfuscate gradients but speaker argues this is a weak defense.
* Speaker thinks more evaluation is needed and that many defense papers actually fail to evaluate their approaches properly.
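
To illustrate how little it can take to generate an adversarial example when gradients are available, here is a minimal sketch of a single signed-gradient step (the fast gradient sign method, which is a swapped-in textbook attack and not necessarily the one used in the talk). The logistic-regression model, its weights and the input are all made up for illustration.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(w, b, x, y, eps):
    """One signed-gradient step that increases the cross-entropy loss on (x, y)."""
    p = predict(w, b, x)
    # For logistic regression with cross-entropy, d(loss)/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    # Move every coordinate by eps in the direction of the gradient's sign.
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

# Hypothetical fixed model and input, chosen only for illustration.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.2, 0.1, 0.1], 1
x_adv = fgsm(w, b, x, y, eps=0.15)
```

Here `predict(w, b, x)` is above 0.5 while `predict(w, b, x_adv)` falls below it, even though no coordinate moved by more than `eps`. Defenses that merely hide or break this gradient signal are the ones the speaker argues are weak.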

## Conditional Neural Processes
* Gaussian processes are nice because we can exploit prior knowledge and we get an uncertainty estimate, but they are also expensive at test time. (Naive case is $\mathcal{O}((n+m)^3)$ because of matrix inversion)
* This paper tries to combine this with the fast inference of neural networks by trading the mathematical guarantees of GPs for the scalability of NNs.
* Essentially they have two networks, $h$ and $g$
    * $h(x_i, y_i) = r_i$ for all training data input-output pairs
    * Combine the $r_i$ into an embedding $r$, which functions as the knowledge of the behaviour of the function being regressed.
    * Trained by maximizing the log likelihood of the predicted distributions for training data input-output pairs, but conditioned on subsets of the training data (i.e. computing $r$ from a subset?)
        * Monte Carlo estimates by sampling from the predicted distributions.
    * $\phi_i = g(x_i, r)$ is the predicted distribution for test input $x_i$
* One advantage: don't have to pick a prior (kernel for GPs?) but can learn this from empirical data samples.
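
A forward pass through the two networks can be sketched as below. This is a rough sketch with untrained random weights; the single-layer `h` and `g`, the mean aggregation, the softplus on the standard deviation and the size `R_DIM` are assumptions made here for illustration, not details confirmed by these notes.

```python
import math
import random

rng = random.Random(0)
R_DIM = 4

# Hypothetical untrained parameters; a real model would learn these.
W_h = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(R_DIM)]
W_g = [[rng.gauss(0, 1) for _ in range(R_DIM + 1)] for _ in range(2)]

def h(x, y):
    """Encoder: one context pair (x, y) -> representation r_i."""
    return [math.tanh(row[0] * x + row[1] * y) for row in W_h]

def aggregate(rs):
    """Combine the r_i into one r by averaging (independent of order)."""
    return [sum(col) / len(rs) for col in zip(*rs)]

def g(x, r):
    """Decoder: target input x and embedding r -> Gaussian (mean, std)."""
    features = [x] + r
    mu, raw = (sum(w * f for w, f in zip(row, features)) for row in W_g)
    sigma = math.log1p(math.exp(raw)) + 1e-3  # softplus keeps the std positive
    return mu, sigma

context = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.9)]
r = aggregate([h(x, y) for x, y in context])
mu, sigma = g(0.75, r)
```

Prediction cost is linear in the number of context and target points, since each point goes through `h` or `g` once, compared with the cubic cost of the naive GP mentioned above.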

## Fixing a Broken ELBO
* This is concerning latent variable models trained by optimizing ELBO, like VAEs.
* They argue that it doesn't necessarily give a good representation.
* They show a curve of models with identical ELBO but with different tradeoffs between compression and reconstruction ability (rate-distortion tradeoff).
    * E.g. we could have a model that simply discards the latent code and just draws samples from the prior. (KL collapse) Usually happens for very powerful decoders.
* They say that mutual information between latent code Z and observed X is a better way to measure representation learning performance.
* Takeaway: a more principled method for making a powerful decoder not ignore its latent code.
* Also see *TherML* related to this by same author.
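
The rate-distortion view can be written out as follows. The symbols are assumptions introduced here and not in these notes: encoder $e(z|x)$, decoder $d(x|z)$, latent marginal $m(z)$, data distribution $p^*(x)$ with entropy $H$.

$$
D = -\mathbb{E}_{p^*(x)} \mathbb{E}_{e(z|x)}\left[\log d(x|z)\right], \qquad
R = \mathbb{E}_{p^*(x)}\left[\mathrm{KL}(e(z|x) \| m(z))\right], \qquad
\mathrm{ELBO} = -(D + R)
$$

Every model on a line $D + R = c$ has the same ELBO, from "high rate, low distortion" (the code is used heavily) to "zero rate, high distortion" (the code is ignored). The mutual information between $X$ and $Z$ under the encoder is sandwiched as

$$
H - D \le I_e(X; Z) \le R
$$

so a model with $R \approx 0$ necessarily has a latent code that carries almost no information about $X$, regardless of how good its ELBO is.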

## Tighter Variational Bounds are Not Necessarily Better
* A lot of research in approximate inference tries to make the ELBO tighter.
* This paper argues that this doesn't always give a better inference model.

## Yes, But Did it Work? Evaluating Variational Inference
???

## Geometry Score
* Generative models (and GANs in particular, because their distribution is implicit) are very difficult to evaluate. A few methods exist but they are not perfect (inception score is sort of standard, but a critique of it is that it doesn't always correlate well with human judgement). Geometry score is another method.
* Geometry score is not limited to visual data!
* Geometry score works by comparing the underlying data manifolds for the data distribution and the generated distribution.
    * No access to actual manifolds, just samples from them.
    * Idea is based on Topological Data Analysis. Something like:
        * If two points are within $\epsilon$ by some distance measure then they are connected.
        * This builds up the *simplicial complex* $\mathcal{R}_\epsilon$
    * They build up simplicial complexes for different values of $\epsilon$
    * They then summarize these by counting the connected components?
    * Which is then used for comparisons between manifolds.
    * ??? lost some details here
* Need to read paper.
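
The construction as noted above (connect points within $\epsilon$, then count connected components for several values of $\epsilon$) can be sketched like this. It follows only the simplified description in these notes, which may not match the summary statistic the paper actually uses; `count_components` and the sample points are hypothetical.

```python
import math

def count_components(points, eps):
    """Number of connected components when points within eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Two well-separated clusters of hypothetical 2-D samples.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
summary = {eps: count_components(points, eps) for eps in (0.05, 0.2, 10.0)}
```

For these points the count goes from 5 (nothing connected) to 2 (the two clusters) to 1 as $\epsilon$ grows. Computing the same kind of summary for real and generated samples gives something that can be compared without access to the manifolds themselves.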

## Is Generator Conditioning Causally Related to GAN Performance?
* Based on some insights from a previous paper stating that controlling the distribution of singular values in the Jacobian is important in deep learning, they try this on the generator in a GAN.
* They test the hypothesis that ill-conditioned singular value distribution of the generator in a GAN leads to bad results.
* They add a soft constraint to penalize very large and very small singular values.
    * Sample $z$
    * $z' = \text{perturb}(z)$
    * $Q = \|G(z) - G(z')\| / \|z - z'\|$
    * $L_{penalty} = (\max(Q, \lambda_{max}) - \lambda_{max})^2 + (\min(Q, \lambda_{min}) - \lambda_{min})^2$
* This added loss term gives more stable training and better inception and FID scores.
* Relation to Spectral Normalization GAN? In that paper it's applied to the discriminator and aims to normalize the singular values of the *weight matrix*, which leads to a similar thing? But this paper also penalizes from below.
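
The penalty steps listed above translate almost line by line into code. In this sketch the Gaussian perturbation, its `scale`, and `toy_generator` are assumptions for illustration; a real implementation would compute this on batches inside an autodiff framework.

```python
import math
import random

def conditioning_penalty(G, z, lam_min, lam_max, rng, scale=1e-2):
    """Soft penalty on the stretch ratio Q of generator G around latent z."""
    # perturb(z): small Gaussian noise; the noise scale is an assumed value.
    z_pert = [zi + scale * rng.gauss(0, 1) for zi in z]
    q = math.dist(G(z), G(z_pert)) / math.dist(z, z_pert)
    upper = (max(q, lam_max) - lam_max) ** 2  # nonzero only if q > lam_max
    lower = (min(q, lam_min) - lam_min) ** 2  # nonzero only if q < lam_min
    return upper + lower

def toy_generator(z):
    # Hypothetical stand-in for a GAN generator: scales every coordinate by 5.
    return [5.0 * zi for zi in z]

rng = random.Random(0)
penalty = conditioning_penalty(
    toy_generator, [0.3, -0.2], lam_min=1.0, lam_max=3.0, rng=rng
)
```

`toy_generator` stretches every direction by exactly 5, so $Q = 5$ exceeds `lam_max` and the penalty is $(5 - 3)^2 = 4$. A generator whose stretch lies inside $[\lambda_{min}, \lambda_{max}]$ gets zero penalty.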

## Towards Binary-Valued Gates for Robust LSTM Training
* LSTMs have "gates" to control information flow over time.
* Authors want to "binarize" these, i.e. push the gate values towards 0 or 1.
* They state that this gives better generalization because of the flatter minima produced by the "binarized" gates.
* They use the Gumbel-softmax reparametrization trick to accomplish this.
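
A minimal sketch of how a Gumbel-softmax sample can act as a nearly binary gate, assuming a two-class relaxation over {open, closed} with a temperature parameter. The function names, the temperatures and the way this would be wired into an LSTM are assumptions, not details from the paper.

```python
import math
import random

def sample_gumbel(rng):
    """Draw Gumbel(0, 1) noise via the inverse CDF."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def relaxed_binary_gate(logit, temperature, rng):
    """Two-class Gumbel-softmax sample, returned as the weight on 'open'."""
    a = (logit + sample_gumbel(rng)) / temperature  # score for gate = 1
    b = sample_gumbel(rng) / temperature            # score for gate = 0
    return sigmoid(a - b)  # softmax([a, b])[0] written as a sigmoid

def mean_distance_to_binary(temperature, n=2000, seed=0):
    """Average distance of sampled gate values from the nearest of 0 and 1."""
    rng = random.Random(seed)
    gates = [relaxed_binary_gate(0.0, temperature, rng) for _ in range(n)]
    return sum(min(gate, 1.0 - gate) for gate in gates) / n

soft = mean_distance_to_binary(temperature=5.0)
sharp = mean_distance_to_binary(temperature=0.1)
```

Lowering the temperature pushes the samples towards 0 or 1 (`sharp` is much smaller than `soft`), while the gate stays a differentiable function of `logit`, which is what makes the reparametrization usable during training.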

## Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
* Speech synthesis
* TODO: Read this paper

## Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
* Speech synthesis with speaker style in latent variables.
* TODO: Read this paper. Related to above.

## Assessing Generative Models via Precision and Recall
* They used precision and recall to evaluate GANs.

## MMD GANs