1. What are the main tasks that autoencoders are used for?

- dimensionality reduction: an autoencoder can be trained on reconstruction loss, and the encoder's codings then serve as a lower-dimensional representation of the inputs. For visualization, the codings can be reduced further with t-SNE.
- unsupervised pretraining: after training, the encoder can be reused as a feature extractor for classification tasks.
- noise reduction: denoising autoencoders can filter noise from the inputs.
- generative modeling: variational autoencoders can learn the distribution of the inputs and then generate new examples that resemble, but do not duplicate, the training data.

The author also mentions anomaly detection.

2. Suppose you want to train a classifier, and you have plenty of unlabeled training data but only a few thousand labeled instances. How can autoencoders help? How would you proceed?

One approach is to pretrain a deep autoencoder on the unlabeled training data and then reuse the encoder as a feature extractor. A fully connected layer can be added on top of the encoder for classification and trained on the labeled instances.

If labeled data is very scarce, it may be beneficial to freeze the encoder layers while training the added dense layer.
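The procedure can be sketched end to end in NumPy. This is a minimal illustration under loud assumptions, not the book's Keras code: the data is synthetic, the deep autoencoder is swapped for a linear autoencoder solved in closed form with an SVD, and the classification head is a plain logistic regression trained by gradient descent. All names (`make_data`, `encode`, `W_enc`) are hypothetical. In Keras, the freezing step would correspond to setting `trainable = False` on the reused encoder layers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): 20-D inputs driven by 2 latent factors.
mixing = rng.standard_normal((2, 20))

def make_data(n):
    z = rng.standard_normal((n, 2))
    X = z @ mixing + 0.05 * rng.standard_normal((n, 20))
    y = (z[:, 0] > 0).astype(float)
    return X, y

X_unlabeled, _ = make_data(5000)        # plenty of data, labels discarded
X_labeled, y_labeled = make_data(200)   # only a few labeled instances

# Step 1: "pretrain" on the unlabeled data.
# Stand-in for a deep autoencoder: the best linear autoencoder with 2 codings
# projects onto the top-2 right singular vectors of the centered data.
mean = X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlabeled - mean, full_matrices=False)
W_enc = Vt[:2].T                         # frozen encoder weights, shape (20, 2)

def encode(X):
    return (X - mean) @ W_enc

# Step 2: train only a new classification head on the frozen codings.
codings = encode(X_labeled)
features = codings / codings.std(axis=0)  # rescale so one learning rate fits
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    grad = p - y_labeled                  # gradient of the mean log-loss
    w -= 0.5 * features.T @ grad / len(y_labeled)
    b -= 0.5 * grad.mean()

accuracy = np.mean((features @ w + b > 0) == (y_labeled == 1))
print(f"accuracy on the labeled set: {accuracy:.2f}")
```

Only `w` and `b` are updated in step 2; `W_enc` is never touched, which is what freezing the encoder means here.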

3. If an autoencoder perfectly reconstructs the inputs, is it necessarily a good autoencoder? How can you evaluate the performance of an autoencoder?

Not necessarily: an autoencoder that overfits the training set is not a good autoencoder. The question does not say whether the autoencoder is undercomplete or overcomplete. An undercomplete autoencoder is forced to compress its inputs, so it can only approximate them and will not reconstruct every example perfectly.

On the other hand, if the autoencoder is overcomplete and reconstructs examples perfectly, it may simply have overfit the training set, in which case it is unlikely to reconstruct unseen examples well, or it may have learned to copy its inputs through the codings layer without learning any useful features.

The examples in this chapter evaluate an autoencoder via its loss on a validation set. For undercomplete autoencoders, a reconstruction loss such as mean squared error or cross-entropy can be sufficient. Overcomplete autoencoders such as sparse autoencoders need an additional term, for example the KL divergence between the actual average activation of the codings and a target sparsity (e.g. a target average activation of 0.1).

It is also wise to evaluate an autoencoder on its intended use. If the purpose is dimensionality reduction for visualization, how informative are the resulting plots? If the purpose is pretraining for classification, the evaluation is less subjective, because standard classification metrics on a validation set apply.
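A small NumPy sketch shows why reconstruction loss alone can mislead. It builds a hypothetical overcomplete linear autoencoder (4 inputs, 6 codings) whose weights are set by hand to copy the inputs; the sizes and the names `W_enc`, `W_dec`, and `reconstruction_mse` are assumptions chosen for illustration, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_codings = 4, 6   # overcomplete: more codings than inputs

# Hand-set "copy" weights: the first 4 codings are the inputs, the rest are zero.
W_enc = np.hstack([np.eye(n_inputs), np.zeros((n_inputs, n_codings - n_inputs))])
W_dec = W_enc.T              # decoder reads the copied values back out

def reconstruct(X):
    return (X @ W_enc) @ W_dec

def reconstruction_mse(X):
    return float(np.mean((X - reconstruct(X)) ** 2))

X_train = rng.standard_normal((100, n_inputs))
X_valid = rng.standard_normal((50, n_inputs))

train_mse = reconstruction_mse(X_train)
valid_mse = reconstruction_mse(X_valid)
print(train_mse, valid_mse)
```

Both errors are zero, yet the codings are just the inputs padded with zeros, so nothing useful has been learned. A validation reconstruction loss cannot flag this case, which is the argument for also scoring the autoencoder on its downstream task.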

4. What are undercomplete and overcomplete autoencoders? What is the main risk of an excessively undercomplete autoencoder? What about the main risk of an overcomplete autoencoder?

undercomplete: The codings layer has lower dimensionality than the input/output dimensions.

overcomplete: The codings layer has higher dimensionality than the input/output dimensions.

In terms of systems of linear equations, undercomplete means that some choices of $b$ in $Ax = b$ have no solution $x$. This happens when $A$ maps $\mathbb{R}^m \to \mathbb{R}^n$ with $n > m$ (the matrix is taller than it is wide), or when $n \le m$ but fewer than $n$ columns of $A$ are linearly independent.

Overcomplete means that every $b$ has infinitely many solutions $x$. This happens when $A$ is wider than it is tall ($n < m$) and has full row rank, i.e. it contains $n$ linearly independent columns (the maximum possible for vectors in $\mathbb{R}^n$).

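This translates to autoencoders by considering $A$ to be the transformation implemented by the decoder, mapping codings to the output. It isn't a perfect analogy, because the decoder is non-linear (an activation function is applied at each hidden layer), but the intuition is useful. In terms of risk, an excessively undercomplete autoencoder may fail to reconstruct its inputs, while an overcomplete autoencoder may simply copy its inputs to its outputs without learning any useful features.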
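The linear analogy can be checked numerically. The sketch below uses random matrices as stand-ins for a linear decoder; the shapes (5×2 and 2×5) and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Undercomplete analogue: A is tall (n=5 outputs, m=2 codings).
A_tall = rng.standard_normal((5, 2))
b_tall = rng.standard_normal(5)
x_ls, *_ = np.linalg.lstsq(A_tall, b_tall, rcond=None)
# Even the best least-squares x leaves a gap: a generic b is unreachable.
residual_tall = np.linalg.norm(A_tall @ x_ls - b_tall)

# Overcomplete analogue: A is wide (n=2 outputs, m=5 codings).
A_wide = rng.standard_normal((2, 5))
b_wide = rng.standard_normal(2)
x_one = np.linalg.pinv(A_wide) @ b_wide      # one exact solution
_, _, Vt = np.linalg.svd(A_wide)             # Vt is 5x5; its last rows span the null space
null_vec = Vt[-1]                            # A_wide @ null_vec is (numerically) zero
x_two = x_one + 3.0 * null_vec               # a different exact solution

print("tall residual:", residual_tall)
print("wide residuals:",
      np.linalg.norm(A_wide @ x_one - b_wide),
      np.linalg.norm(A_wide @ x_two - b_wide))
```

The tall system leaves a nonzero residual, while the wide system is solved exactly by two different vectors, and by any other multiple of a null-space vector added to `x_one`.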

5. How do you tie weights in a stacked autoencoder? What is the point of doing so?

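To tie weights, set $W_{N-L+1} = W_{L}^{T}$ for $L = 1, 2, \ldots, \frac{N}{2}$, where $N$ is the number of layers (not counting the input layer) and $W_L$ is the weight matrix of layer $L$.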

This can be implemented by subclassing `keras.layers.Layer` for the `DenseTranspose` layers in the decoder, which are passed the corresponding `Dense` layer of the encoder and reuse its weights, as exemplified in the book.

The benefit of this method is that it roughly halves the number of weights in the model (each decoder layer keeps only its own bias vector), which speeds up training. It also reduces the risk of overfitting by reducing capacity.
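The saving can be made concrete with a short parameter count. This is a sketch under assumptions: the layer sizes 784-100-30-100-784 are an illustrative symmetric architecture, biases are assumed to stay untied, and `count_params` is a hypothetical helper, not a Keras function.

```python
def count_params(sizes, tied=False):
    """Count weights and biases of a symmetric stacked autoencoder.

    sizes lists the layer widths from input to output, e.g. [784, 100, 30, 100, 784].
    With tied=True, decoder weight matrices are transposes of the encoder's,
    so only the encoder's weights are counted. Biases are never tied.
    """
    pairs = list(zip(sizes[:-1], sizes[1:]))       # (n_in, n_out) for each layer
    biases = sum(n_out for _, n_out in pairs)
    if tied:
        encoder_pairs = pairs[: len(pairs) // 2]   # first N/2 layers
        weights = sum(n_in * n_out for n_in, n_out in encoder_pairs)
    else:
        weights = sum(n_in * n_out for n_in, n_out in pairs)
    return weights + biases

sizes = [784, 100, 30, 100, 784]
untied = count_params(sizes)
tied = count_params(sizes, tied=True)
print(untied, tied, round(tied / untied, 3))
```

For these sizes the count drops from 163,814 to 82,414 parameters. The ratio is slightly above one half because the 1,014 bias terms are kept in both cases.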

6. What is a generative model? Can you name a type of generative autoencoder?

In general, a generative model is something that models the probability distribution of the model inputs, which allows it to then sample new instances from this distribution.

A type of generative autoencoder is a Variational Autoencoder. It models $p(X) = \int p(X|z) p(z) \,dz$, learned from an unlabeled empirical distribution of datapoints $X$. New datapoints can be obtained by sampling from the latent distribution $p(z)$ and passing the sample through the learned decoder, a function $f(z; \theta)$ that approximates $p(X|z)$.

During training, drawing $z$ directly from $p(z)$ is sample inefficient: learning a $p(X|z)$ that maps brute-force samples from $p(z)$ to plausible datapoints requires a very large number of samples. The variational autoencoder therefore learns an encoder function $q(z|X; \phi)$ which, given enough capacity, approximates the true posterior $p(z|X)$ and (hopefully) assigns high probability to a much smaller region of latent space than the uninformative prior $p(z)$. This focuses training on the regions of latent space that are likely to map to datapoints with high $p(X)$, greatly improving sample efficiency.
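This trade-off is usually formalized as the evidence lower bound (ELBO), which the VAE maximizes in place of the intractable $\log p(X)$. The notation reuses $q(z|X; \phi)$ and $p(z)$ from above; writing the decoder's likelihood as $p(X|z; \theta)$ is an assumption made here to match $f(z; \theta)$.

$$
\log p(X) \;\ge\; \mathbb{E}_{z \sim q(z|X; \phi)}\big[\log p(X|z; \theta)\big] \;-\; D_{KL}\big(q(z|X; \phi) \,\|\, p(z)\big)
$$

The first term rewards codings that the decoder can turn back into $X$, and the second keeps the encoder's distribution close to the prior, so that samples drawn from $p(z)$ at generation time land in regions the decoder has seen.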

The model chooses a prior belief that $p(z) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, so generating new samples from the model simply requires sampling $z$ from the standard normal and feeding the sample through the decoder.
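The sampling steps can be sketched in NumPy. The assumptions here go beyond the text: the encoder is taken to output a mean and log-variance of a diagonal Gaussian (the usual VAE setup), the closed-form KL divergence to $\mathcal{N}(\mathbf{0}, \mathbf{I})$ is the standard result for that case, and `decode` is a hypothetical toy decoder with random weights standing in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5

def sample_codings(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=-1)

# Hypothetical toy decoder: random linear map followed by a sigmoid.
W_dec = rng.standard_normal((latent_dim, data_dim))

def decode(z):
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))

# Generation: sample z from the prior N(0, I) and feed it through the decoder.
z_prior = rng.standard_normal((4, latent_dim))
new_samples = decode(z_prior)
print(new_samples.shape)
```

When the encoder outputs `mu = 0` and `log_var = 0`, its distribution equals the prior and `kl_to_standard_normal` returns zero; any other output gives a positive value.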

7. What is a GAN? Can you name a few tasks where GANs can shine?

A GAN is a Generative Adversarial Network. It consists of two components: a generator network, and a discriminator network.

The discriminator's task is to discern whether an input $X$ is drawn from the true data generating distribution, or whether it is synthetic and drawn from a different distribution.

The generator's task is to generate data inputs $X$ that appear to be drawn from the true data distribution. It attempts to fool the discriminator network into thinking the data points it generates are drawn from the data generating distribution.

The competition between the two networks can result in a generator that is able to generate highly realistic samples.
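In the original GAN formulation, this competition is written as a minimax game. The symbols $D$ (discriminator, outputting the probability that its input is real), $G$ (generator), $p_{\text{data}}$, and the noise prior $p(z)$ are notation introduced here, not taken from the text above.

$$
\min_G \max_D \; \mathbb{E}_{X \sim p_{\text{data}}}\big[\log D(X)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
$$

The discriminator pushes the value up by scoring real inputs near 1 and generated inputs near 0, while the generator pushes it down by producing samples that the discriminator scores as real.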

GANs are known to shine in the areas of:
- image generation
- generating fake chemical compound structures
- dataset augmentation
- finding weaknesses of models and strengthening them

8. What are the main difficulties when training GANs?

The good news is that, in theory, a GAN has only a single Nash equilibrium: the generator produces samples indistinguishable from real data, so the best policy the discriminator can choose is to guess 50/50.

The problem is that training GANs can be very difficult, and nothing guarantees that the Nash equilibrium will be reached. Training can diverge or run into one of several problems.
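The 50/50 claim can be checked on a toy example. Under the original GAN objective, the best discriminator for a fixed generator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$; that formula, the three-outcome distributions, and the name `optimal_discriminator` are assumptions brought in for this sketch.

```python
import numpy as np

def optimal_discriminator(p_data, p_gen):
    """Best response D*(x) = p_data(x) / (p_data(x) + p_gen(x)) for a fixed generator."""
    p_data = np.asarray(p_data, dtype=float)
    p_gen = np.asarray(p_gen, dtype=float)
    return p_data / (p_data + p_gen)

# Toy data distribution over three kinds of rooms.
p_data = np.array([0.5, 0.3, 0.2])

# At equilibrium the generator matches the data distribution exactly.
d_equilibrium = optimal_discriminator(p_data, p_data)

# A generator that has collapsed onto the first kind of room.
p_collapsed = np.array([0.98, 0.01, 0.01])
d_collapsed = optimal_discriminator(p_data, p_collapsed)

print(d_equilibrium)
print(d_collapsed)
```

When the generator matches the data, the discriminator outputs 0.5 everywhere. With the collapsed generator it becomes suspicious of the over-produced outcome (score below 0.5) and confident about the rare ones (score above 0.5), which is the signal that pushes the generator away from its current mode.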

Mode collapse occurs when the generator learns a very narrow distribution, concentrating on the few kinds of output that currently fool the discriminator best. For example, it could get very good at generating pictures of kitchens when the actual task is to generate pictures of rooms in general. The discriminator then gets better at spotting fake kitchen images, so the generator eventually moves on to a different type of room, and so on, cycling between modes without ever covering them all.

Training can be going well and then suddenly diverge without an obvious cause.

GANs are also very sensitive to hyperparameters, so finding settings that train stably can take considerable tuning.