# Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
* This paper presents an approach to image translation, i.e. translating an image in one "domain" to its corresponding image in another domain.

* Note that this approach does not require the training data to consist of image pairs; it just trains on two unpaired bags of images that we want to be able to translate between. The main idea is to use an adversarial loss, constrained by a cycle consistency loss, to accomplish this.

* Image domains are basically just different distributions over the image space, e.g. nature pictures with and without snow. In the paper they are denoted by $X$ and $Y$. Not sure how to define the "translation" mathematically; conceptually it's pretty straightforward though. I think the assumption is that two images that are each other's translation share some common latent factor, or as the authors say, are "two different renderings of the same underlying scene".

## The Parts of CycleGAN
* One generator $G: X \rightarrow Y$
* One generator $F: Y \rightarrow X$
* One discriminator $D_X: X \rightarrow \text{determine whether real or fake/translated X}$
* One discriminator $D_Y: Y \rightarrow \text{determine whether real or fake/translated Y}$

## Training
### Generators
$G$ and $F$ are trained by optimizing a loss defined in multiple parts.
* First, the adversarial loss: the output $\hat{y} = G(x)$ is trained to look like a sample $y \sim Y$ (and likewise $F(y)$ to look like a sample from $X$).
    * On its own this gives no guarantee of a meaningful translation for an individual $x$ or $y$; it only matches the distributions at a macro level.
* Second, the cycle consistency loss. Translations should be cycle consistent, i.e. be translatable back and forth. In other words, $G$ and $F$ should be each other's inverses, and hence bijections (each element of one set is mapped to exactly one element of the other set, and vice versa).
    * Cycle consistency loss: $G$ and $F$ are also trained so that $F(G(x)) \approx x$ and $G(F(y)) \approx y$, measured by $l_1$ distance.
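* The total loss is $\mathcal{L}_{GAN} + \lambda \mathcal{L}_{cyclic}$, where $\mathcal{L}_{GAN}$ sums the adversarial terms for both directions ($G$ against $D_Y$ and $F$ against $D_X$) and $\lambda$ weights the cycle consistency term.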
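
To make the two loss terms above concrete for myself, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the toy maps `G`, `F_exact` and `F_biased`, the helper names, and the use of short lists of floats instead of images. The default `lam=10.0` is the value I recall the paper using for $\lambda$, so treat it as an assumption. The adversarial term is just passed in as a number here.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Mapping = Callable[[Vector], Vector]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized vectors."""
    assert len(a) == len(b)
    return sum(abs(ai - bi) for ai, bi in zip(a, b)) / len(a)


def cycle_consistency_loss(G: Mapping, F: Mapping,
                           xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    """l1 penalty on F(G(x)) vs. x plus l1 penalty on G(F(y)) vs. y."""
    forward = sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def total_generator_loss(adv_loss: float, cyc_loss: float, lam: float = 10.0) -> float:
    """L_GAN + lambda * L_cyclic (lam=10.0 is an assumed default)."""
    return adv_loss + lam * cyc_loss


# Hypothetical toy "generators" acting on vectors instead of images.
def G(x: Vector) -> Vector:
    return [2.0 * v + 1.0 for v in x]


def F_exact(y: Vector) -> Vector:
    # Exact inverse of G, so both cycles reconstruct their input.
    return [(v - 1.0) / 2.0 for v in y]


def F_biased(y: Vector) -> Vector:
    # Slightly wrong inverse of G, so both cycles drift.
    return [(v - 1.0) / 2.0 + 0.5 for v in y]


xs = [[0.0, 1.0], [2.0, 3.0]]
ys = [[1.0, 3.0], [5.0, 7.0]]

perfect = cycle_consistency_loss(G, F_exact, xs, ys)     # 0.0
imperfect = cycle_consistency_loss(G, F_biased, xs, ys)  # 1.5
```

With an exact inverse the cycle loss vanishes, while the biased inverse is penalized in both directions (0.5 for $F(G(x))$ and 1.0 for $G(F(y))$).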

### Discriminators
Discriminators $D_X$ and $D_Y$ are trained only with the adversarial loss, that is, they try to classify real samples as real and translated samples as fake.
* In practice they use a least squares loss instead of the usual negative log likelihood loss in the adversarial loss terms.
* They also update the discriminators using a history of generated/translated images instead of just the ones from the latest version of the generative networks. They say that they use a buffer of the 50 previously generated images, but I suppose this should depend on batch size? TODO: Read implementation
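
Until I have read the implementation, here is a rough sketch of how the least squares loss and the image history could fit together. This is my guess at a reasonable design, not the authors' code: the class name `ImageHistoryBuffer`, the `query` method, and the 50/50 rule for swapping a new image against a stored one are all assumptions. Images are stand-in Python objects (integers in the usage example).

```python
import random
from typing import Any, List, Optional, Sequence


def lsgan_discriminator_loss(real_scores: Sequence[float],
                             fake_scores: Sequence[float]) -> float:
    """Least squares loss: push scores on real images to 1 and on fakes to 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


class ImageHistoryBuffer:
    """Keeps up to `capacity` previously generated images (hypothetical helper)."""

    def __init__(self, capacity: int = 50, rng: Optional[random.Random] = None):
        self.capacity = capacity
        self.images: List[Any] = []
        self.rng = rng or random.Random()

    def query(self, image: Any) -> Any:
        """Store `image` and return the image to show the discriminator."""
        if len(self.images) < self.capacity:
            # Buffer not full yet: remember the new image and use it directly.
            self.images.append(image)
            return image
        if self.rng.random() < 0.5:
            # Assumed rule: half the time, swap the new image for an old one.
            idx = self.rng.randrange(self.capacity)
            old = self.images[idx]
            self.images[idx] = image
            return old
        # Otherwise use the new image and leave the buffer unchanged.
        return image


# Usage with integers standing in for generated images.
buffer = ImageHistoryBuffer(capacity=3, rng=random.Random(0))
returned = [buffer.query(i) for i in range(10)]
```

In this sketch the buffer is queried one image at a time, so its capacity does not depend on the batch size; whether the real implementation works that way is still on the TODO list.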

## Architectures
### Generators
* Adapted from the fast neural style transfer architecture of Johnson et al.
* According to the paper:
    * 2 stride-2 convolution layers.
    * 6 or 9 (depending on image size) residual blocks.
    * 2 fractionally strided convolutions with stride 1/2
    * They use instance normalization.
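
The layer list above determines how the spatial resolution changes through the generator, which the following sketch traces. The strides and block counts come from the notes; the kernel size 3, padding 1 and output padding 1 are my assumptions, chosen so that each stride-2 layer halves and each fractionally strided layer doubles the resolution. The pairing of 6 blocks with 128x128 and 9 blocks with 256x256 inputs is what I recall from the paper. The function names are made up.

```python
from typing import List


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a standard convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size: int, kernel: int, stride: int,
                       padding: int, output_padding: int) -> int:
    """Output size of a fractionally strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def generator_spatial_sizes(size: int, n_res_blocks: int) -> List[int]:
    """Spatial size after each downsampling, residual, and upsampling stage."""
    sizes = [size]
    for _ in range(2):
        # Stride-2 convolution: halves the resolution (assumed k=3, p=1).
        size = conv_out(size, kernel=3, stride=2, padding=1)
        sizes.append(size)
    # Residual blocks are assumed to preserve the resolution.
    sizes.extend([size] * n_res_blocks)
    for _ in range(2):
        # "Stride 1/2" convolution: doubles the resolution (assumed k=3, p=1).
        size = conv_transpose_out(size, kernel=3, stride=2,
                                  padding=1, output_padding=1)
        sizes.append(size)
    return sizes


sizes_256 = generator_spatial_sizes(256, n_res_blocks=9)
sizes_128 = generator_spatial_sizes(128, n_res_blocks=6)
```

Under these assumptions a 256x256 input is reduced to 64x64 for the residual blocks and restored to 256x256 at the output, so the translated image has the same size as the input.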

### Discriminators
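* Taken from the PatchGAN discriminator of the pix2pix paper (Isola et al.)
    * Not a single scalar output but instead one output per image patch, where the patch is determined by the receptive field of the convolutions.
    * This means fewer parameters than a full-image discriminator.
    * The patch outputs are then combined for the final discriminator output.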
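
To see what "one output per image patch" means in numbers, here is a small sketch that computes the patch size (receptive field) and the output grid for a stack of convolutions. The layer configuration in `PATCHGAN_LAYERS` is an assumption on my part (a commonly used setup that gives 70x70 patches), not something taken from these notes, and averaging the patch scores is just one plausible way to combine them.

```python
from typing import List, Sequence, Tuple

# Assumed (kernel, stride, padding) per convolution layer.
PATCHGAN_LAYERS: List[Tuple[int, int, int]] = [
    (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1),
]


def receptive_field(layers: Sequence[Tuple[int, int, int]]) -> int:
    """Side length of the input patch that one output value depends on."""
    rf = 1
    for kernel, stride, _ in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_grid_size(size: int, layers: Sequence[Tuple[int, int, int]]) -> int:
    """Side length of the grid of patch scores for a square input."""
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


def combine_patch_scores(scores: Sequence[Sequence[float]]) -> float:
    """Average all patch scores into one discriminator output (assumed rule)."""
    flat = [s for row in scores for s in row]
    return sum(flat) / len(flat)


patch = receptive_field(PATCHGAN_LAYERS)        # 70
grid = output_grid_size(256, PATCHGAN_LAYERS)   # 30
score = combine_patch_scores([[1.0, 0.0], [1.0, 0.0]])  # 0.5
```

With this assumed configuration, a 256x256 image yields a 30x30 grid of scores, each judging a 70x70 patch of the input.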

## TODO:
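* They point out that this setup is similar to that of adversarial autoencoders: the two cycles $F \circ G: X \rightarrow X$ and $G \circ F: Y \rightarrow Y$ can be seen as two autoencoders trained jointly, where the intermediate representation of each is the translation into the other domain and its target distribution is that of the other domain.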