# Unsupervised Representation Learning with Deep Convolutional GANs
The paper's main goal is to learn useful feature representations without labels. It does so within the GAN framework, using CNNs for both the generator and the discriminator, a design that has since become standard in most GAN variants.

Contributions
* A more stable architecture using convolutions
* An unsupervised way to learn image representations that also support some vector arithmetic in latent space

## Model architecture guidelines
* Replace deterministic spatial pooling functions (like maxpool) with strided convolutions in the discriminator and fractional-strided convolutions (transposed convolutions, or "deconvolutions") in the generator.
* Eliminate fully connected layers.
    * Global average pooling is one option.
    * In the discriminator, though, the last conv layer is usually just flattened and fed into a single sigmoid output.
* Batch normalization in both discriminator and generator gives more stability in training.
    * No batch norm on generator output layer.
    * No batch norm on discriminator input layer.
* ReLU activations in all layers of the generator except the output, which uses tanh.
* Leaky ReLU activations in all layers of the discriminator (the output itself is the single sigmoid unit).
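
Since the guidelines drop pooling, all resizing happens inside the convolutions, which can be checked with plain shape arithmetic. The sketch below uses the standard output-size formulas for strided and transposed convolutions (no dilation, no output padding). The kernel size 4, stride 2, padding 1, and the 4 to 64 pixel range are illustrative assumptions, not values taken from these notes.

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of a strided convolution (downsamples)."""
    return (size + 2 * pad - kernel) // stride + 1


def conv_transpose_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of a fractional-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * pad + kernel


def trace(size, layer_fn, n_layers):
    """Apply the same layer n_layers times and record every spatial size."""
    sizes = [size]
    for _ in range(n_layers):
        size = layer_fn(size)
        sizes.append(size)
    return sizes


generator_sizes = trace(4, conv_transpose_out, 4)   # upsampling path
discriminator_sizes = trace(64, conv_out, 4)        # downsampling path
print(generator_sizes)      # [4, 8, 16, 32, 64]
print(discriminator_sizes)  # [64, 32, 16, 8, 4]
```

With these assumed hyperparameters each layer exactly doubles or halves the resolution, so the two networks mirror each other.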

## Experiments
They train DCGANs on
* LSUN
* ImageNet-1k
* A new face dataset

With the following settings
* Minibatch size 128
* Weight initialization with $\mathcal{N}(0, 0.02^2)$
* Leaky ReLU leak slope = 0.2
* Adam optimizer with learning rate 0.0002, $\beta_1=0.5$
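
A minimal sketch of these settings in plain Python, mainly to make the initialization and the leaky ReLU concrete. The names `DCGANSettings`, `init_weights`, and `leaky_relu` are hypothetical helpers for illustration, and the fixed seed is only there to make the run reproducible.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class DCGANSettings:
    batch_size: int = 128
    init_std: float = 0.02         # weights drawn from N(0, 0.02^2)
    leaky_slope: float = 0.2
    learning_rate: float = 0.0002  # Adam
    adam_beta1: float = 0.5


def init_weights(n, std, rng):
    """Draw n weights from a zero-centered normal distribution."""
    return [rng.gauss(0.0, std) for _ in range(n)]


def leaky_relu(x, slope):
    """Identity for positive inputs, a small linear slope for negative ones."""
    return x if x > 0 else slope * x


settings = DCGANSettings()
rng = random.Random(0)  # fixed seed, an assumption of this sketch
weights = init_weights(10_000, settings.init_std, rng)
print(statistics.pstdev(weights))                  # close to 0.02
print(leaky_relu(-1.0, settings.leaky_slope))      # -0.2
```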

### Classification using learnt representations
They evaluate the quality of the representations learnt by a DCGAN by using it as a feature extractor on images from labeled datasets and then training linear models on the extracted features.

They train a DCGAN on ImageNet-1k. The representation of each image is computed by
1. Taking the output of each convolutional layer of the discriminator.
2. Maxpooling each feature map down to a 4x4 spatial grid.
3. Flattening and concatenating the results into a 28672-dimensional vector.
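
The three steps above can be sketched in plain Python. The layer shapes below (256, 512, and 1024 channels at 16x16, 8x8, and 4x4) are an assumption: they are one layout that reproduces the 28672 total, since 16 * (256 + 512 + 1024) = 28672, but the notes do not state the actual shapes. Random numbers stand in for real discriminator activations.

```python
import random


def max_pool_to_grid(fmap, grid=4):
    """Max-pool one square feature map (list of rows) down to grid x grid."""
    size = len(fmap)
    step = size // grid  # assumes size is a multiple of grid
    return [
        [
            max(fmap[r][c]
                for r in range(i * step, (i + 1) * step)
                for c in range(j * step, (j + 1) * step))
            for j in range(grid)
        ]
        for i in range(grid)
    ]


def extract_features(layers, grid=4):
    """Pool every channel of every layer to grid x grid, flatten, concatenate."""
    features = []
    for channels in layers:
        for fmap in channels:
            for row in max_pool_to_grid(fmap, grid):
                features.extend(row)
    return features


# Hypothetical (channels, spatial size) per conv layer; not from the notes.
layer_shapes = [(256, 16), (512, 8), (1024, 4)]
rng = random.Random(0)
layers = [
    [[[rng.random() for _ in range(size)] for _ in range(size)]
     for _ in range(channels)]
    for channels, size in layer_shapes
]

features = extract_features(layers)
print(len(features))  # 28672
```

The resulting vector is what a linear classifier would then be trained on.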

They fit a regularized linear $l_2$-SVM classifier on these representations, which gives competitive results on CIFAR-10 (82.8% accuracy) even though the DCGAN was never trained on CIFAR-10.

### Investigating the internals of the networks
They state that
* Nearest neighbor search on the training set, in pixel or feature space, is a poor check because it is easily fooled by small image transforms.
* They do not report log-likelihood metrics (I guess Parzen window estimates?), since log-likelihood is a poor measure of visual quality.

#### Walking the latent space
* Sharp transitions when walking the latent space usually mean that the model has just memorized data.
* If walking the latent space results in semantic changes, like objects being added or removed, the model has probably learnt relevant representations.
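
A latent walk is just an interpolation between two $Z$ vectors, with each intermediate point decoded by the generator. The sketch below shows a linear interpolation in plain Python; the 100-dimensional uniform latent and the helper names `lerp` and `latent_walk` are assumptions for illustration, and the trained generator itself is not shown.

```python
import random


def lerp(z_start, z_end, t):
    """Linear interpolation between two latent vectors, t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]


def latent_walk(z_start, z_end, steps):
    """Evenly spaced points on the straight line from z_start to z_end."""
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


rng = random.Random(0)
dim = 100  # latent size; an assumption for this sketch
z_a = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
z_b = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

path = latent_walk(z_a, z_b, steps=9)
# Each vector in `path` would be passed to a trained generator (not shown);
# smooth changes between consecutive images suggest the model did not memorize.
print(len(path), len(path[0]))  # 9 100
```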

#### Visualizing discriminator features
Using guided backpropagation they show that the discriminator learns features that activate for semantically interesting parts of a bedroom (LSUN dataset) like windows and beds.

#### Manipulating generator representation
* They argue that the quality of the samples suggests that the generator learns representations of specific objects (beds, windows, lamps, doors, etc.).
* In one experiment they try to stop the generator from drawing windows.
    * Manually draw bounding boxes around windows in 150 samples.
    * Fit a logistic regression on the second-highest conv layer features to predict whether a feature activation lies on a window, which identifies the features responsible for drawing windows.
    * Drop those features at all spatial locations and generate new samples without them. This works reasonably well, though not perfectly.
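
The dropping step can be sketched as follows. It assumes that a logistic regression has already been fitted and yields one weight per feature map, and that maps with a positive weight are the ones treated as window features; the weights and activations below are toy values, not results from the paper.

```python
def select_window_features(weights):
    """Indices of feature maps whose logistic-regression weight is positive."""
    return {i for i, w in enumerate(weights) if w > 0}


def drop_features(activations, dropped):
    """Zero the selected feature maps at every spatial location."""
    return [
        [[0.0] * len(row) for row in fmap] if i in dropped else fmap
        for i, fmap in enumerate(activations)
    ]


# Hypothetical per-feature-map weights from a fitted logistic regression.
weights = [0.8, -0.3, 0.0, 1.2]
# Four toy 2x2 feature maps standing in for one layer's activations.
activations = [[[float(i + 1)] * 2 for _ in range(2)] for i in range(4)]

dropped = select_window_features(weights)
edited = drop_features(activations, dropped)
print(sorted(dropped))       # [0, 3]
print(edited[0], edited[1])  # [[0.0, 0.0], [0.0, 0.0]] [[2.0, 2.0], [2.0, 2.0]]
```

In the real experiment the edited activations would be passed through the remaining generator layers to produce the final image.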
    
#### Vector arithmetic
They test whether vector arithmetic on the $Z$ representations produces semantically meaningful results.

* Arithmetic on single samples does not work well.
* Averaging the $Z$ vectors of several examples (three per concept) works better.
* Example: *smiling woman* - *neutral woman* + *neutral man* = *smiling man*
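
The arithmetic itself is simple: average the latent vectors for each concept, then add and subtract the averages. Below is a toy sketch with made-up two-dimensional vectors, where the first axis loosely stands for "smile" and the second for "gender". Real $Z$ vectors are higher-dimensional and have no such labeled axes, so this only illustrates the mechanics.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def concept_arithmetic(base, minus, plus):
    """Compute mean(base) - mean(minus) + mean(plus), component-wise."""
    b, m, p = mean_vector(base), mean_vector(minus), mean_vector(plus)
    return [bi - mi + pi for bi, mi, pi in zip(b, m, p)]


# Toy latent vectors, three exemplars per concept (illustrative values only).
smiling_woman = [[1.0, 1.0], [0.8, 1.2], [1.2, 0.8]]
neutral_woman = [[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]]
neutral_man = [[0.0, -1.0], [0.2, -1.1], [-0.2, -0.9]]

smiling_man = concept_arithmetic(smiling_woman, neutral_woman, neutral_man)
print([round(x, 3) for x in smiling_man])
```

The resulting vector would then be decoded by the generator; averaging first reduces the influence of any single noisy exemplar.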
