This notebook is a report of the project done for the Deep Learning course in 2019-2020 at TU Delft. The course instructor is J.v.Gemert and our assistant during the project was P. Rajput.

# Reproduction and research into Image-to-Image translation using cGANs
In this notebook we adopt the code from [Jun-Yan Zhu, Taesung Park and Tongzhou Wang](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) to reproduce the results from Table 1 in [Image-to-Image Translation with Conditional Adversarial Networks](https://arxiv.org/abs/1611.07004). This paper was published at CVPR 2017 and has amassed an impressive 2311 citations to this day. The PyTorch implementation was modified for reproduction purposes and can be found [here](https://). Since this is a relatively new direction in image-to-image translation, it is under intensive research. To this end, we also explored the hyperparameters in search of improved results. Finally, we ran an experiment of our own using this [dataset](http://mmlab.ie.cuhk.edu.hk/archive/facesketch.html) to translate human faces to sketches and vice versa, in order to demonstrate the power of conditional GANs.

This notebook provides the lines required to run our experiments step-by-step and is compatible with the Google Colab platform to be used with a GPU.

# Reproduction of Cityscapes label2photo experiment

## What are Generative Adversarial Networks (GANs)?
Generative Adversarial Networks answer the question "Having a (large) dataset of examples coming from one distribution, could we construct more examples that belong to the same distribution?" with an emphatic "yes". Of course, the results to date are not perfect and are subject to limitations, but they are certainly promising enough to justify continuous investigation.

How do GANs work though? They have their roots in game theory: we train two adversarial networks, the Generator and the Discriminator, to compete with each other. Say, for instance, that we have a dataset of cat pictures and we want to create more pictures of cats. The purpose of the Generator is to generate these brand new cat pictures in such a way that they look real (how to define real is another complex matter that will be discussed further below). How do we ensure, though, that the Generator is indeed trained to generate pictures that look real and not like sketches of cats, for example? This is precisely the job of the Discriminator: to distinguish the real images from the fake (new) images. As the game (training) plays on, both networks learn to do a better job. Consequently, at some point the Generator should be able to create images so realistic that the Discriminator has a hard time distinguishing them from real ones (ideally, a 50% chance).

<img src=https://pathmind.com/images/wiki/GANs.png width="500">

Let us add some mathematical formulation to the above. The Generator takes as input a random noise vector $z$ and learns a mapping to an output image $y$, $G: z \rightarrow y$. All that is left is to combine the objectives of the Generator and the Discriminator in a single loss function, so that the two networks compete with each other. The formula is $$L_{GAN} = E_{x\sim p_{data}}[\log D_{\theta_D}(x)] + E_{z\sim p(z)}[\log(1-D_{\theta_D}(G_{\theta_G}(z)))].$$ Can you guess what each network is trying to achieve? The Discriminator $D_{\theta_D}$ aims to maximize the loss, so that $D_{\theta_D}(x)$ is close to 1 for real data and $D_{\theta_D}(G_{\theta_G}(z))$ is close to 0 for fake data. On the other hand, the Generator $G_{\theta_G}$ aims to minimize the loss, so that $D_{\theta_D}(G_{\theta_G}(z))$ is close to 1 (essentially fooling the Discriminator). This can be summarized as $$\arg\,\underset{\theta_G}{\min}\,\underset{\theta_D}{\max}\,L_{GAN}.$$ This last equation is the essence of GANs.
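
To make the min-max game concrete, here is a minimal sketch that estimates $L_{GAN}$ from a handful of Discriminator outputs. The function name `gan_value` and the toy probabilities are assumptions made for illustration; they are not part of the pix2pix code.

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of L_GAN from Discriminator outputs in (0, 1).

    d_real: D(x) for real samples, d_fake: D(G(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct Discriminator pushes the value towards its maximum of 0 ...
confident = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
# ... while a completely fooled one (D = 0.5 everywhere) gives -log(4).
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, fooled)
```

The Discriminator tries to move the value towards 0, while the Generator tries to pull it down; $-\log 4$ is the value at which the Discriminator can do no better than guessing.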

## From GANs to Conditional GANs
Now, let us slightly change the setting of the initial problem. Let's say instead that we want to translate the images of one distribution to images of a different distribution, _conditioned_ on the first distribution. For example, we may want to translate a satellite image into a map image (similar to Google Maps). This is more of a traditional supervised learning problem, as we can have pairs of images _x_ and _y_, for which we want our Generator network to take as input the image _x_ (and noise _z_) and output _y_, $G: (x, z) \rightarrow y$.

This problem is addressed in the paper under study, Image-to-Image Translation with Conditional Adversarial Networks, where the authors (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A. Efros) put together different pieces of the latest research and arrive at a conditional GAN (cGAN) that provides impressive results on the image-to-image translation problem.

The objective function for cGANs differs from that of GANs due to the conditioning on the input. The Generator does not just take noise and output an image, but also takes as input the image to be translated. In addition, the authors found that conditioning the Discriminator on the input (in the sense that the Discriminator takes as input both the image _x_ and the output image _y_) yields better results. As a result, the loss function of the cGAN can be expressed as: $$L_{cGAN} = E_{x, y}[\log D_{\theta_D}(x, y)] + E_{x, z}[\log(1-D_{\theta_D}(x, G_{\theta_G}(x, z)))].$$ Adding an L1 loss term, $L_{L1}(G) = E_{x, y, z}[\lVert y - G_{\theta_G}(x, z)\rVert_1]$, to the cGAN loss was also found to be beneficial. The final objective is then $$G^* = \arg\,\underset{G}{\min}\,\underset{D}{\max}\,L_{cGAN}(G, D) + \lambda L_{L1}(G).$$
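
As a rough illustration of how the two terms are combined on the Generator side, the sketch below adds an adversarial term to a weighted L1 term for a toy image. The helper names are hypothetical. The adversarial term uses the non-saturating form $-\log D(x, G(x, z))$ (the Generator is rewarded when its output is classified as real), which is an assumption of this sketch rather than a literal transcription of the formula above. The default weight of 100 is the default value of `lambda_L1` in the code used in this notebook.

```python
import math

def l1_loss(fake, real):
    """Mean absolute difference between two equally sized, flattened images."""
    return sum(abs(f - r) for f, r in zip(fake, real)) / len(real)

def generator_loss(d_fake, fake, real, lambda_l1=100.0):
    """Adversarial term plus lambda_l1 times the L1 term.

    d_fake holds the Discriminator outputs D(x, G(x, z)) as probabilities.
    """
    adversarial = -sum(math.log(p) for p in d_fake) / len(d_fake)
    return adversarial + lambda_l1 * l1_loss(fake, real)

real = [0.0, 0.5, 1.0, -1.0]   # toy target image y
fake = [0.1, 0.5, 0.8, -1.0]   # toy generated image G(x, z)

both_terms = generator_loss([0.5], fake, real)
adversarial_only = generator_loss([0.5], fake, real, lambda_l1=0.0)
print(both_terms, adversarial_only)
```

Setting `lambda_l1` to 0 removes the L1 term, which is how the pure (c)GAN rows of the reproduction are obtained later on.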

Below is an image translation from a pokemon sketch to a... pokemon.

<img src=https://phillipi.github.io/pix2pix/images/iPokemon.jpg width="500">

## Evaluation using Semantic Segmentation
How does one evaluate the quality of images generated by (c)GANs? That remains an open research question to which multiple answers exist, but none seems perfect. The obvious solution is visual inspection by humans. However, this is not only subjective but also requires human resources, similar to annotation in supervised learning. It would be preferable to avoid such inefficient evaluation and opt for a more automated solution.

A good idea would perhaps be to find out whether the GAN successfully reproduces the objects in the image, with localized accuracy. Think of an object detection algorithm. The algorithm recognizes objects in an image and finds their location by putting boxes around them, as in the image below.


<img src=https://azati.ai/wp-content/uploads/2019/02/object-detection-800x400.jpg width="500">

An even harder classification task would be to mark every pixel to correspond to a particular class. This is precisely what Semantic Segmentation does. An example is given in the image below, where each colour represents a class.

<img src=https://theaisummer.com/assets/img/posts/semseg.jpg width="500">

To evaluate results quantitatively, the authors use [FCN-8s](https://arxiv.org/abs/1411.4038) (Fully Convolutional Networks) for Semantic Segmentation. This way, they can check whether the generated images represent the correct objects on a per-pixel basis.

## Experimental Setup
The authors train and test 5 different objective functions on the [Cityscapes](https://www.cityscapes-dataset.com/) dataset to evaluate the significance of the cGAN loss in combination with the L1 loss. This is Table 1 of the paper, which is shown below. To test the importance of conditioning the Discriminator on the input, they also used a pure GAN loss (note though that the Generator still takes as input image $x$, not just noise). The input images to the network are 256x256, resized from the Cityscapes dataset.

We use 2975 images as our training set and test on 500 images from the validation set of Cityscapes. There is no validation involved, as we are not tuning any hyperparameters but attempting to reproduce the table's results. The table contains three numbers for each configuration, all computed using FCN-8s: the per-pixel accuracy, the per-class accuracy and the class IoU (Intersection over Union). As one can see, the combined L1 and cGAN loss scores highest in all three categories.

| Loss | Per-pixel acc. | Per-class acc. | Class IoU |
| :---: | :---: | :---: | :---: |
| L1 | 0.42 | 0.15 | 0.11 |
| GAN | 0.22 | 0.05 | 0.01 |
| cGAN | 0.57 | 0.22 | 0.16 |
| L1 + GAN | 0.64 | 0.20 | 0.15 |
| **L1 + cGAN** | **0.66** | **0.23** | **0.17** |
| Ground truth | 0.80 | 0.26 | 0.21 |
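
The three scores in the table can all be derived from a confusion matrix between the true label map and the segmentation predicted by FCN-8s. The following is a minimal sketch under the usual definitions; the function name `fcn_scores` and the toy two-class matrix are made up for illustration, and the evaluation script used later in this notebook may treat absent classes differently.

```python
def fcn_scores(confusion):
    """Return (per-pixel acc., per-class acc., class IoU) from a confusion matrix.

    confusion[i][j] counts pixels whose true class is i and predicted class is j.
    Classes that never occur are skipped when averaging.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = [confusion[i][i] for i in range(n)]
    true_count = [sum(confusion[i]) for i in range(n)]
    pred_count = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    pixel_acc = sum(correct) / total
    class_accs = [correct[i] / true_count[i] for i in range(n) if true_count[i] > 0]
    unions = [true_count[i] + pred_count[i] - correct[i] for i in range(n)]
    ious = [correct[i] / unions[i] for i in range(n) if unions[i] > 0]
    return pixel_acc, sum(class_accs) / len(class_accs), sum(ious) / len(ious)

toy = [[8, 2],
       [1, 9]]
pixel_acc, class_acc, class_iou = fcn_scores(toy)
print(pixel_acc, class_acc, class_iou)
```

Per-pixel accuracy is dominated by large classes such as road and sky, whereas per-class accuracy and class IoU weigh every class equally, which is why they are much lower in the table.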

## Generator Architecture
The architecture of the Generator is given concisely in the appendix of the pix2pix paper. In their words:

The Generator is composed of a U-Net architecture, which is an encoder-decoder with skip connections between layers i of the encoder and layers n-i of the decoder (where n is the total number of layers). The skip connections concatenate activations from layer i to layer n-i. Let Ck denote a Convolution-BatchNorm-ReLU layer with k filters and CDk a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%. All convolutions are 4x4 spatial filters applied with stride 2. Convolutions in the encoder downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2. Having introduced the notation, the architecture is:

**Encoder:**
C64 - C128 - C256 - C512 - C512 - C512 - C512 - C512

**Decoder:**
CD512 - CD1024 - CD1024 - C1024 - C1024 - C512 - C256 - C128

After the last layer in the decoder, a convolution is applied to map to the number of output channels, followed by a Tanh function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, whereas ReLUs in the decoder are not leaky.
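
To see why eight encoder layers are used for 256x256 inputs, the spatial size after each stride-2 convolution can be traced with the standard output-size formula. This is a small sketch; the padding of 1 is an assumption that is not stated in the quoted description.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial size after a strided convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

encoder_filters = [64, 128, 256, 512, 512, 512, 512, 512]  # from the encoder list above

size = 256
sizes = []
for filters in encoder_filters:
    size = conv_output_size(size)
    sizes.append(size)
    print(f"C{filters}: {size}x{size}")
```

Under this assumption the bottleneck is a 1x1 feature map, and the decoder mirrors the same steps back up to 256x256.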

## Discriminator Architecture
The authors claim that an L1 (or L2) loss is sufficient for capturing the low frequencies in an image, but not the crispness that is encoded in the high frequencies. That is why only using an L1/L2 loss produces blurry results in image generation problems.

To tackle this problem, they employ a PatchGAN as the Discriminator, which penalizes structure at the scale of patches. In other words, the Discriminator attempts to classify whether each NxN patch in an image is real or fake. N can be much smaller than the full image size and the Discriminator still does a good job, while being efficient and applicable to arbitrarily large images.

For $N=70$, which is used for this experiment, the Discriminator's architecture is:

C64 - C128 - C256 - C512

After the last layer, a convolution is applied to produce a 1-dimensional output. Originally, it was followed by a Sigmoid function, which the authors removed in the PyTorch implementation of the code. In an attempt to fully mimic the original architecture, we added the Sigmoid back, but the results turned out to be really poor. For that reason, we decided to leave out the Sigmoid, as the results of the PyTorch implementation were reported to be "comparable or better" than those of the original implementation. Again, BatchNorm is not applied to the first C64 layer and all ReLUs are leaky with slope 0.2.
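
One possible explanation for the poor results with the added Sigmoid (an assumption, not something verified in this report) is that the GAN loss already applies a Sigmoid internally, as `torch.nn.BCEWithLogitsLoss` does. An extra Sigmoid in the network would then be applied twice. The pure-Python sketch below, with hypothetical helper names, shows the effect on a single patch score.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(prob, target):
    """Binary cross-entropy on a probability."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def bce_with_logits(logit, target):
    """Binary cross-entropy that applies the Sigmoid internally."""
    return bce(sigmoid(logit), target)

logit = 3.0  # raw Discriminator output for one patch
single = bce_with_logits(logit, 1)           # Sigmoid applied once, inside the loss
double = bce_with_logits(sigmoid(logit), 1)  # extra Sigmoid at the end of the network

# With two Sigmoids the effective probability is squeezed into a narrow band.
squashed = [sigmoid(sigmoid(x)) for x in (-10.0, 0.0, 10.0)]
print(single, double, squashed)
```

With the double Sigmoid, the effective probability can never leave the interval between 0.5 and roughly 0.73, so the Discriminator cannot express a confident "fake", which would weaken the training signal.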

## Running the Experiment on Google Colab
To reproduce the experiment, you have to download our version of the code [here](https://) and upload it to your Google Drive. Then, to run it remotely on Google Colab, you have to run the following snippet of code (in Google Colab) to mount your Google Drive. Your workspace is located in the folder you previously uploaded and you have to specify its path. The last line installs several packages required by the cGANs.

In [0]:
from google.colab import drive
drive.mount('/content/drive/')

#!git clone https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

import os
# specify your own path to pytorch-CycleGAN-and-pix2pix folder
path = '/content/drive/My Drive/cGans/pytorch-CycleGAN-and-pix2pix'
os.chdir(path)

# install several requirements to run the scripts below
!pip install -r requirements.txt

## Hyperparameter Setting
The hyperparameters can be selected before running the train/test scripts. 

When referring to BatchNorm above, the authors actually refer to instance normalization rather than Batch Normalization (either can be specified as an option). Both were attempted, and indeed instance normalization yielded better results, closer to the ones reported in Table 1.

The batch size is 1 and the learning rate is set to 0.0002. Weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. The network is trained for 200 epochs for each configuration.

The code was modified so that the loss function could be altered to be able to reproduce the results of the table. Specifically, options _use_GAN_ and _condition_GAN_ were added, while setting _lambda_L1_ to 0 was sufficient to turn off the L1 term. The Pix2PixModel class with the modifications is shown below.

In [0]:
import torch
from models import BaseModel
from models import networks


class Pix2PixModel(BaseModel):
    """ This class implements the pix2pix model, for learning a mapping from input images to output images given paired data.

    The model training requires '--dataset_mode aligned' dataset.
    By default, it uses a '--netG unet256' U-Net generator,
    a '--netD basic' discriminator (PatchGAN),
    and a '--gan_mode' vanilla GAN loss (the cross-entropy objective used in the original GAN paper).

    pix2pix paper: https://arxiv.org/pdf/1611.07004.pdf
    """
    @staticmethod
    def modify_commandline_options(parser, is_train=True):
        """Add new dataset-specific options, and rewrite default values for existing options.

        Parameters:
            parser          -- original option parser
            is_train (bool) -- whether training phase or test phase. You can use this flag to add training-specific or test-specific options.

        Returns:
            the modified parser.

        For pix2pix, we do not use image buffer
        The training objective is: GAN Loss + lambda_L1 * ||G(A)-B||_1
        By default, we use vanilla GAN loss, UNet with batchnorm, and aligned datasets.
        """
        # changing the default values to match the pix2pix paper (https://phillipi.github.io/pix2pix/)
        parser.set_defaults(norm='batch', netG='unet_256', dataset_mode='aligned')
        if is_train:
            parser.set_defaults(pool_size=0, gan_mode='vanilla')
            parser.add_argument('--lambda_L1', type=float, default=100.0, help='weight for L1 loss')
            parser.add_argument('--condition_GAN', type=int, default=1, help='set to 0 to use unconditional discriminator')
            parser.add_argument('--use_GAN', type=int, default=1, help='set to 0 to turn off GAN term')

        return parser

    def __init__(self, opt):
        """Initialize the pix2pix class.

        Parameters:
            opt (Option class)-- stores all the experiment flags; needs to be a subclass of BaseOptions
        """
        BaseModel.__init__(self, opt)
        # specify the training losses you want to print out. The training/test scripts will call <BaseModel.get_current_losses>
        if self.opt.use_GAN == 1:
            self.loss_names = ['G_GAN', 'G_L1', 'D_real', 'D_fake']
        else:
            self.loss_names = ['G_GAN', 'G_L1']
        # specify the images you want to save/display. The training/test scripts will call <BaseModel.get_current_visuals>
        self.visual_names = ['real_A', 'fake_B', 'real_B']
        # specify the models you want to save to the disk. The training/test scripts will call <BaseModel.save_networks> and <BaseModel.load_networks>
        if self.isTrain:
            self.model_names = ['G', 'D']
        else:  # during test time, only load G
            self.model_names = ['G']
        # define networks (both generator and discriminator)
        self.netG = networks.define_G(opt.input_nc, opt.output_nc, opt.ngf, opt.netG, opt.norm,
                                      not opt.no_dropout, opt.init_type, opt.init_gain, self.gpu_ids)

        if self.isTrain:  # define a discriminator; conditional GANs need to take both input and output images; Therefore, #channels for D is input_nc + output_nc
            # use condition_GAN
            self.netD = networks.define_D(opt.input_nc*opt.condition_GAN + opt.output_nc, opt.ndf, opt.netD,
                                          opt.n_layers_D, opt.norm, opt.init_type, opt.init_gain, self.gpu_ids)

        if self.isTrain:
            # define loss functions
            self.criterionGAN = networks.GANLoss(opt.gan_mode).to(self.device)
            self.criterionL1 = torch.nn.L1Loss()
            # initialize optimizers; schedulers will be automatically created by function <BaseModel.setup>.
            self.optimizer_G = torch.optim.Adam(self.netG.parameters(), lr=opt.lr, betas=(opt.beta1, 0.999))
            self.optimizer_D = torch.optim.Adam(self.netD.parameters(), lr=opt.lr, betas=(opt.beta1, 0.999))
            self.optimizers.append(self.optimizer_G)
            self.optimizers.append(self.optimizer_D)

    def set_input(self, input):
        """Unpack input data from the dataloader and perform necessary pre-processing steps.

        Parameters:
            input (dict): include the data itself and its metadata information.

        The option 'direction' can be used to swap images in domain A and domain B.
        """
        AtoB = self.opt.direction == 'AtoB'
        self.real_A = input['A' if AtoB else 'B'].to(self.device)
        self.real_B = input['B' if AtoB else 'A'].to(self.device)
        self.image_paths = input['A_paths' if AtoB else 'B_paths']

    def forward(self):
        """Run forward pass; called by both functions <optimize_parameters> and <test>."""
        self.fake_B = self.netG(self.real_A)  # G(A)

    def backward_D(self):
        """Calculate GAN loss for the discriminator"""
        # Fake; stop backprop to the generator by detaching fake_B
        if self.opt.condition_GAN == 1:
            fake_AB = torch.cat((self.real_A, self.fake_B), 1)  # we use conditional GANs; we need to feed both input and output to the discriminator
        else:
            fake_AB = self.fake_B # unconditional GAN, only penalizes structure in B

        pred_fake = self.netD(fake_AB.detach())
        self.loss_D_fake = self.criterionGAN(pred_fake, False)
        # Real
        if self.opt.condition_GAN == 1:
            real_AB = torch.cat((self.real_A, self.real_B), 1)
        else:
            real_AB = self.real_B # unconditional GAN, only penalizes structure in B

        pred_real = self.netD(real_AB)
        self.loss_D_real = self.criterionGAN(pred_real, True)
        # combine loss and calculate gradients
        self.loss_D = (self.loss_D_fake + self.loss_D_real) * 0.5
        self.loss_D.backward()

    def backward_G(self):
        """Calculate GAN and L1 loss for the generator"""
        if self.opt.use_GAN == 1:
            # First, G(A) should fake the discriminator
            if self.opt.condition_GAN == 1:
                fake_AB = torch.cat((self.real_A, self.fake_B), 1)
            else:
                fake_AB = self.fake_B
            pred_fake = self.netD(fake_AB)
            self.loss_G_GAN = self.criterionGAN(pred_fake, True)
        else:
            self.loss_G_GAN = 0 
        # Second, G(A) = B
        self.loss_G_L1 = self.criterionL1(self.fake_B, self.real_B) * self.opt.lambda_L1
        # combine loss and calculate gradients
        self.loss_G = self.loss_G_GAN + self.loss_G_L1
        self.loss_G.backward()

    def optimize_parameters(self):
        self.forward()                   # compute fake images: G(A)
        if self.opt.use_GAN == 1:
            # update D
            self.set_requires_grad(self.netD, True)  # enable backprop for D
            self.optimizer_D.zero_grad()     # set D's gradients to zero
            self.backward_D()                # calculate gradients for D
            self.optimizer_D.step()          # update D's weights
        # update G
        self.set_requires_grad(self.netD, False)  # D requires no gradients when optimizing G
        self.optimizer_G.zero_grad()        # set G's gradients to zero
        self.backward_G()                   # calculate gradients for G
        self.optimizer_G.step()             # update G's weights


## Dataset
The Cityscapes dataset contains one set of city photos with many different classes, such as roads, sidewalks and buildings, and another set of images with their labels (similar to Semantic Segmentation). The experiment performed concerns the direction $labels \rightarrow images$, even though the other direction is also possible. To run the experiment, you have to download the _gtFine_trainvaltest.zip_ and _leftImg8bit_trainvaltest.zip_ files from the Cityscapes dataset [here](https://www.cityscapes-dataset.com/downloads/). The original images are 2048x1024. Using the provided script in _dataset/prepare_cityscapes_dataset.py_, the images are resized to 256x256 and placed into a folder named _cityscapes_, which you should place into the datasets folder in order to continue.

## Running the Experiment
The following snippet of code runs the training with some arguments specified. The direction is from B (labels) to A (photos), and training runs for 200 epochs without any learning rate decay. Each of the 5 lines specifies a different configuration, corresponding to the respective row of the table. Uncomment the one you want to run and comment out the others. Training on a Tesla P100 GPU took about 8 hours (~140 seconds per epoch).

In [0]:
# !python train.py --dataroot ./datasets/cityscapes --name cityscapes_pix2pix_L1 --model pix2pix --direction BtoA --n_epochs 200 --n_epochs_decay 0 --use_GAN 0 
# !python train.py --dataroot ./datasets/cityscapes --name cityscapes_pix2pix_GAN --model pix2pix --direction BtoA --n_epochs 200 --n_epochs_decay 0 --lambda_L1 0 --condition_GAN 0
# !python train.py --dataroot ./datasets/cityscapes --name cityscapes_pix2pix_cGAN --model pix2pix --direction BtoA --n_epochs 200 --n_epochs_decay 0 --lambda_L1 0
# !python train.py --dataroot ./datasets/cityscapes --name cityscapes_pix2pix_L1_GAN --model pix2pix --direction BtoA --n_epochs 200 --n_epochs_decay 0 --condition_GAN 0
!python train.py --dataroot ./datasets/cityscapes --name cityscapes_pix2pix_L1_cGAN --model pix2pix --direction BtoA --n_epochs 200 --n_epochs_decay 0
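
Instead of commenting lines in and out, the mapping from table rows to flags can also be written down once. The sketch below only rebuilds the five commands above as strings; `LOSS_FLAGS` and `train_command` are hypothetical helpers and not part of the repository.

```python
# Flags that differ from the defaults for each row of Table 1 (taken from the commands above).
LOSS_FLAGS = {
    "L1": ["--use_GAN", "0"],
    "GAN": ["--lambda_L1", "0", "--condition_GAN", "0"],
    "cGAN": ["--lambda_L1", "0"],
    "L1_GAN": ["--condition_GAN", "0"],
    "L1_cGAN": [],
}

def train_command(loss, dataroot="./datasets/cityscapes", epochs=200):
    """Build the train.py argument list for one loss configuration."""
    base = ["python", "train.py", "--dataroot", dataroot,
            "--name", f"cityscapes_pix2pix_{loss}", "--model", "pix2pix",
            "--direction", "BtoA", "--n_epochs", str(epochs), "--n_epochs_decay", "0"]
    return base + LOSS_FLAGS[loss]

for loss in LOSS_FLAGS:
    print(" ".join(train_command(loss)))
```

Each printed line can be pasted into a cell with a leading `!`, or the argument list can be handed to `subprocess.run` to queue several configurations in a row.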

And testing:

In [0]:
# specify the name of your experiment after --name
# this should match the name specified during training
# for example, for L1+cGAN loss the name should be "cityscapes_pix2pix_L1_cGAN"
!python test.py --dataroot ./datasets/cityscapes --direction BtoA --model pix2pix --name name_of_your_model --num_test 500 --results_dir results

## Evaluation
To evaluate the generated images using the FCN-8s network, you first need to download the network. Then, Caffe needs to be installed and the resulting photos need to be renamed to the original names of the dataset, using our script. Finally, they can be processed by the evaluation script. Run the following snippets of code and replace any arguments as indicated in the comments.

In [0]:
# download the FCN-8s model
# this only needs to be downloaded once
!bash ./scripts/eval_cityscapes/download_fcn8s.sh

In [0]:
# install caffe
!apt install -y caffe-cpu

In [0]:
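# add the evaluation scripts to PYTHONPATH; "!export" only affects a throwaway subshell,
# so the variable is set from Python to make it visible to the commands below
import os
os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + ':./scripts/eval_cityscapes/'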
# rename the resulting photos to the original names of the dataset
# again, specify the name of your model that matches the one of the test
!python ./scripts/eval_cityscapes/rename.py --photos_dir ./results/name_of_your_model/test_latest/images

In [0]:
# Finally, run the evaluation scripts
# This required ~22 GB of RAM and took 4-5 hours in Google Colab using CPU (could not fit in GPU)
# To allow for enough RAM storage in Google Colab change this setting: Runtime -> Change Runtime type -> Runtime shape -> High RAM
# If it can fit in your GPU, replace evaluate_cpu.py by evaluate_gpu.py
# replace name_of_your_model in the --result_dir argument with your name of the model
# evaluation results will be placed in folder evaluation_results or any other you specify
!python ./scripts/eval_cityscapes/evaluate_cpu.py --cityscapes_dir ./scripts/eval_cityscapes/original_dataset --result_dir ./results/name_of_your_model/test_latest/images/original_names --output_dir ./evaluation_results


## Evaluation Results

The results of the evaluation are shown in the table below, next to the paper's results. We obtained (much) better results for several configurations, which we attribute to differences between the PyTorch implementation and the one used in the original paper. Most notably, the best configuration (L1+cGAN loss) has a significantly higher per-pixel accuracy. This is to be expected, as it matches more closely the numbers reported in the [CycleGAN paper](https://arxiv.org/abs/1703.10593), which used the same implementation. To our astonishment, L1 alone scores just a little lower than the best configuration, seemingly defying the assumptions made so far. Or does it?

| Loss | Per-pixel acc. (ours) | Per-pixel acc. | Per-class acc. (ours) | Per-class acc. | Class IoU (ours) | Class IoU | 
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| L1 | 0.70 | 0.42 | 0.24 | 0.15 | 0.18 | 0.11 |
| GAN | 0.25 | 0.22 | 0.08 | 0.05 | 0.03 | 0.01 |
| cGAN | 0.59 | 0.57 | 0.20 | 0.22 | 0.14 | 0.16 |
| L1 + GAN | 0.65 | 0.64 | 0.20 | 0.20 | 0.15 | 0.15 |
| **L1 + cGAN** | **0.75** | **0.66** | **0.24** | **0.23** | **0.19** | **0.17** |


## Discussion
Let us take a closer look at the images generated for each different loss.
![results](https://drive.google.com/uc?id=1F3vznhyhQ917BqW1j0Gscug8z8HCVnW5)

The label is the image which is translated to the different images on its right depending on the loss. Now, how can the L1 picture score so high? Any human observer would classify it as fake without a second look due to the amount of blur in it. Then, how does the FCN decide to award it so many points? 

The label on the left could easily be the Semantic Segmentation of the real image. Based on that, did the L1 do a good job in reproducing the real image? Yes! Because the different objects can be distinguished in it with high pixel accuracy. However, _does it seem real_? Of course not.

This casts serious doubt on the quality of the evaluation with the use of FCN. Even though this method can successfully offer an insight into whether the generated image contains the right objects at the right locations, it cannot guarantee that their texture can deceive the eye into thinking that they are real. Any human would say that the L1+cGAN generated image is far more realistic than the L1 image. However, the FCN scores show that the difference in quality is almost insignificant. Therefore, we conclude that Semantic Segmentation is insufficient on its own to evaluate the quality of generated images. Further research on evaluation methods needs to be conducted.

# Hyperparameter Search
Due to time restrictions, we were only able to attempt a few runs with different configurations, focusing on the following hyperparameters:
1. Learning rate
2. patchGAN receptive field

## Learning Rate
The learning rate is arguably the most important hyperparameter in an ML/DL project, which is why it was our primary target for tuning. The initial value was $0.0002$. We tried two other settings, increasing and decreasing the learning rate by a factor of about 3. For $lr=0.0006$ the final quality is not affected, whereas for $lr=0.00007$ the performance is slightly worse. A likely reason is that, with smaller steps, the network has not fully converged within the 200 epochs.

| LR | Per-pixel acc. (ours) | Per-class acc. (ours) | Class IoU (ours) |
| :---: | :---: | :---: | :---: |
| 0.00007 | 0.72 |  0.22 |  0.17 |
| **0.0006** | **0.75** |  **0.24** |  **0.19** |
| **0.0002** | **0.75** |  **0.24** |  **0.19** | 

In the case of the highest learning rate (0.0006) the effect of the number of epochs was also tested. Since the network is learning faster, the results were examined at 100 epochs of training. The generated images were only slightly worse (FCN scores similar to $lr=0.00007$). This implies that the model has almost converged and only slight improvement was achieved for the latter half of the training, saving up to ~4 hours on a Tesla P100 GPU for a small trade-off in performance.

## PatchGAN receptive field
As mentioned above, the Discriminator determines if patches of size NxN in an image are real or fake. In that sense, N characterizes the receptive field. The default value is 70. We increased it to 142 by increasing the Discriminator's number of convolutional layers to 4. This resulted in the following FCN score:

| N | Per-pixel acc. (ours) | Per-class acc. (ours) | Class IoU (ours) |
| :---: | :---: | :---: | :---: |
| 70 | 0.75 |  0.24 |  0.19 |
| 142 | 0.70 |  0.22 |  0.18 |

It seems that increasing the receptive field not only increases the number of parameters of the network but also worsens performance. We expect that different datasets are better served by different values of N. For this particular dataset (Cityscapes), $N=70$ appears to be the best choice, as indicated by the authors of pix2pix.
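
The values 70 and 142 follow from the usual receptive-field recursion for stacked convolutions. The sketch below assumes the Discriminator consists of a number of 4x4 stride-2 convolutions followed by two 4x4 stride-1 convolutions (the last one being the output convolution). This layout is an assumption about the implementation, and the function names are illustrative.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of (kernel, stride) convolutions."""
    field = 1
    for kernel, stride in reversed(layers):
        field = (field - 1) * stride + kernel
    return field

def patchgan_layers(n_strided):
    """n_strided stride-2 convolutions followed by two stride-1 convolutions, all 4x4."""
    return [(4, 2)] * n_strided + [(4, 1)] * 2

default_field = receptive_field(patchgan_layers(3))
larger_field = receptive_field(patchgan_layers(4))
print(default_field, larger_field)  # 70 142
```

Each additional stride-2 layer roughly doubles the receptive field, so under this layout 70 and 142 are neighbouring choices and intermediate values are not reachable by changing the number of layers alone.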

# Experiment Sketch2Face

## New Datasets

A new dataset was downloaded from http://mmlab.ie.cuhk.edu.hk/archive/facesketch.html. This dataset includes 188 faces from the Chinese University of Hong Kong (CUHK) student database. For each face, there is a sketch drawn by an artist based on a photo taken in a frontal pose, under normal lighting conditions, and with a neutral expression.

The goal is to use this dataset to train a cGAN that can create sketches from face images and the other way around. Two different types of cGANs were used and compared to each other, pix2pix and cycleGAN. In pix2pix the images have to come in pairs, so that the network can learn a mapping from one to the other (one-way mapping). Sometimes we only have two sets of images that are not paired (think of faces and sketches where the sketches belong to different faces). In that case, cycleGAN can be used with great success. Its inputs are unpaired images from two different domains, between which the network learns to translate (in both directions). CycleGANs do not need tightly correlated images combined in one picture. This is a more general way to translate images, which is why they sometimes give better results. Another difference between these two types of cGANs is that cycleGAN's objective function contains an extra criterion, the cycle consistency loss, which helps to achieve the translation. Moreover, cycleGAN consists of four networks (two discriminators and two generators), while pix2pix consists of only two (one discriminator, one generator).
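
The cycle consistency idea can be sketched in a few lines: translating an image to the other domain and back should return the original image. The toy "generators" below are plain functions on flattened images and are purely illustrative, as are all names in the block.

```python
def l1_distance(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_consistency_loss(G, F, x, y):
    """L1 error after a round trip: x -> G(x) -> F(G(x)) and y -> F(y) -> G(F(y))."""
    forward_cycle = l1_distance(F(G(x)), x)
    backward_cycle = l1_distance(G(F(y)), y)
    return forward_cycle + backward_cycle

def G(image):          # toy mapping from faces to sketches
    return [2.0 * p for p in image]

def F_exact(image):    # exact inverse of G
    return [p / 2.0 for p in image]

def F_biased(image):   # imperfect inverse of G
    return [p / 2.0 + 0.1 for p in image]

face, sketch = [0.2, 0.4, 0.6], [0.4, 0.8, 1.2]
exact_loss = cycle_consistency_loss(G, F_exact, face, sketch)
biased_loss = cycle_consistency_loss(G, F_biased, face, sketch)
print(exact_loss, biased_loss)
```

The loss is zero only when the two mappings invert each other, which is what allows training without paired examples.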

In order for both of these types to work properly, the images of both faces and sketches had to be resized to 414x512. This was implemented with a script ('massive_resizer' below). Moreover, for pix2pix the face and its corresponding sketch needed to be in the same image, as described above. This was done by combining them horizontally, once again with a script ('pix2pix_combine'), which has a similar format to an existing one ('combine_A_and_B') in the original repository. For cycleGAN, the training set contained 178 images and the test set 10, whereas for pix2pix we ran the experiment with a training set of 170 images, a validation set of 8 images and a test set of 10 images.
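
The two splits described above can be reproduced with a small helper. How the images were actually assigned to each set is not stated, so the seeded random shuffle below is an assumption, and the file names are placeholders.

```python
import random

def split_dataset(names, n_val, n_test, seed=0):
    """Shuffle names reproducibly and split them into train/val/test lists."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    test = names[:n_test]
    val = names[n_test:n_test + n_val]
    train = names[n_test + n_val:]
    return train, val, test

# 188 hypothetical file names standing in for the CUHK face/sketch pairs.
pairs = [f"pair_{i:03d}.jpg" for i in range(188)]

train, val, test = split_dataset(pairs, n_val=8, n_test=10)        # pix2pix split
cyc_train, _, cyc_test = split_dataset(pairs, n_val=0, n_test=10)  # cycleGAN split
print(len(train), len(val), len(test))   # 170 8 10
print(len(cyc_train), len(cyc_test))     # 178 10
```

Using the same seed for both calls keeps the 10 test images identical across the two models, which makes their outputs directly comparable.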

Even though cycleGAN was trained only on faces of people from Asia, it generalizes well to other face images (it gives even better results if the background is constant). Below, some pictures of celebrities and their generated sketches are presented. <br>
![](https://drive.google.com/uc?export=view&id=1Vtyn_jWVDsgyW95yWIf98UWWRAVKCCyb)
![](https://drive.google.com/uc?export=view&id=1EfW1Y56viULqiWJahNxq2BaJdOwJ1SiP)
<br>
![](https://drive.google.com/uc?export=view&id=1-UcDh9v_R0ZepUhWo4VQ6MP2Nk7o_rrm)
![](https://drive.google.com/uc?export=view&id=1675ug9zIE-c6Q-q5J4p6UlfJ-CtTSQyZ)
<br>
![](https://drive.google.com/uc?export=view&id=1zTRLGWtUQGWWaSsHDppOo_NdAeoVJXvy)
![](https://drive.google.com/uc?export=view&id=11sZk2AvNO5d4h3zdKhZPmMSuvIazF40i)

There are also some difficult cases (strange lighting conditions), but even when the background is not constant, proper results can be obtained.
<br>
![](https://drive.google.com/uc?export=view&id=10QybGYktzCSogK6Wd0XtcQT4CMZRCeAK)
![](https://drive.google.com/uc?export=view&id=1RwkvzbgJMVLIyn48joFd7wvCU4N0H7e0)
<br>
![](https://drive.google.com/uc?export=view&id=1mlDV3uF3yWchFiN1ajIsKIxVEQvPfcoP)
![](https://drive.google.com/uc?export=view&id=1NTXyUFyRNmWHYx3jR3OrRV8_C1QxoRhg)
<br>

A significant problem appears when the reconstruction of face images from sketches is attempted. It is very difficult for the network to learn that mapping. This is expected, since these images have a large entropy: faces can adopt many different colours and still look human (a chaotic structure, similar to the paper).
<br>
![](https://drive.google.com/uc?export=view&id=1dyEP5m4l0S0J0hwykfZ8ifH4Umw-qNse)
![](https://drive.google.com/uc?export=view&id=1L6f-JavEGpixe3Tv3O5rjNdGpF1_U2jH)
<br>
![](https://drive.google.com/uc?export=view&id=1q2AjJrjDheM0EmiukJOzF8iPa61Nxcmn)
![](https://drive.google.com/uc?export=view&id=1yTFjBY7NX-6qsBfIyl2LAIoYdEKZsTC9)
<br>

Another important observation is the following: when the network is fed face images, it outputs sketches of them. If these generated sketches are then given as input to the network, the reconstructed face images are significantly better than the ones obtained when the ground truth sketches are given as input. Nevertheless, in both cases the results do not look very realistic.
<br>
![](https://drive.google.com/uc?export=view&id=1UCddG3LPdk7lOfqCJlj2r3Bb4FV_seCs)
![](https://drive.google.com/uc?export=view&id=1oGD3b5pZFrZrq5IqleNAk4AkgOQA7j2r)
<br>
![](https://drive.google.com/uc?export=view&id=1rkvTytPkIl5xCk8ZtEh2fhbuROrrs41T)
![](https://drive.google.com/uc?export=view&id=1ai6is7xenOQ774vZjr0api7IkrlU9sq3)
<br>

This happens because of cycle consistency, an inherent property of cycleGANs. It should also be mentioned that this holds true only for images from the original dataset of sketches. If a similar procedure is performed with sketches created from images that are not included in the original dataset, the reconstruction fails. It is assumed that the reason for this failure is that the $sketch \rightarrow photo$ translation involves more entropy and is thus more difficult for the network to learn, since the network has developed a bias towards certain face characteristics.

In addition to the above, it was attempted to transform a tiger into a lion and a guitar into a violin. Both cases failed, as expected, since the model cannot learn geometric transformations of the images (this is also described in the paper). An interesting fact about these results is that the model learned to distinguish the texture and colour differences between the objects. For example, it learned that tigers and lions have different colours. The same holds true for violins and guitars, as demonstrated in the pictures below.
<br>
![](https://drive.google.com/uc?export=view&id=1BiD_PPZs2TPZ5uLfzEWA-PPIOfl8k0WP)
![](https://drive.google.com/uc?export=view&id=1X700WPtLjsKdogxAW0WlwjuIA7AvzseH)
<br>
![](https://drive.google.com/uc?export=view&id=1GXJX5uApswHsxnZUek_ABzj2vR4nxEje)
![](https://drive.google.com/uc?export=view&id=1aQNaKmvuJ6dMRSSnOvGxxEscsB4hd6-x)
<br>

In addition, it seems that it figured out that tigers have stripes while lions do not. Failure cases of both of these can also be seen in the pictures below.
<br>
![](https://drive.google.com/uc?export=view&id=1PaLTtjiMuWlwMfEmEwTVmz4M8yAPB5pU)
![](https://drive.google.com/uc?export=view&id=1r1u4OYtHzgO-0giZOXgjE1TEEc9M7wYS)
<br>
![](https://drive.google.com/uc?export=view&id=1CL-DRp6p2cQapp3y4gc3EsPHwvDw-3tZ)
![](https://drive.google.com/uc?export=view&id=1uV_9LiEN9YvKqguUAbsw-OhWGRB0fdQf)
<br>

For pix2pix, only the face dataset was used. It should be mentioned that pix2pix works only in one direction, so two separate models had to be trained, one for $photo \rightarrow sketch$ and one for $sketch \rightarrow photo$. The results are comparable, but a little worse for the $photo \rightarrow sketch$ direction.
<br>
![](https://drive.google.com/uc?export=view&id=17XzEwt8fg-vfQDkLgljhBiIBRlDFhF6y)
![](https://drive.google.com/uc?export=view&id=1TGoC8uc6PXJvyACjVVd5v-P-9n07aPjT)
<br>
![](https://drive.google.com/uc?export=view&id=1qWxPp3ci247rpK7F5hdE82tMq4eJPVeK)
![](https://drive.google.com/uc?export=view&id=16IV3B4xEYtUZRuHJqKWWX_4PkR3EJD3R)
<br>

The reason for this may be that, after resizing, it is not possible to perfectly align the sketch and the image when putting them next to each other, as the pix2pix model requires. Another hypothesis is that the pix2pix hyperparameters are less fine-tuned than the ones of cycleGAN. In contrast to the above, the $sketch \rightarrow image$ pix2pix model seems to give better results, although they are still not quite realistic since, among other things, there are some failures in the contours of the faces.
<br>
![](https://drive.google.com/uc?export=view&id=1uvUQllUyynDfF933oOtv83mfuULNKpM8)
![](https://drive.google.com/uc?export=view&id=1H1uNLYKDDyHK3oiQNEm4JSxO-Kl6ad89)
<br>
![](https://drive.google.com/uc?export=view&id=1PS93N4I2RzlejkgTZ-xC4ruffX3B5C5J)
![](https://drive.google.com/uc?export=view&id=17MKFoRTy4-s71zltj0idzrD6rA6Xb9Xk)
<br>

With the default options (a patchGAN receptive field of 70, which corresponds to 3 convolutional layers in the discriminator) the resulting sketches were blurred. This was an indication that more convolutional layers are needed for the discriminator to learn to distinguish fake (blurred) sketches from real ones. So, in order to obtain the above results, the number of convolutional layers in the discriminator was set to 5.
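
The relation between the number of layers and the receptive field can be checked with a short calculation. The sketch below assumes that the discriminator consists of `n_layers` 4x4 convolutions with stride 2 followed by two 4x4 convolutions with stride 1. This is our reading of the `n_layers` discriminator in the repository and should be treated as an assumption; the function name is ours.

```python
def patchgan_receptive_field(n_layers, kernel=4):
    """Receptive field (in input pixels) of one output unit of the discriminator."""
    # Strides listed from input to output: n_layers stride-2 convolutions
    # followed by two stride-1 convolutions (assumed layout).
    strides = [2] * n_layers + [1, 1]
    field = 1
    # Walk backwards from the output: each layer widens the field.
    for stride in reversed(strides):
        field = (field - 1) * stride + kernel
    return field

for n in (3, 5):
    print(n, patchgan_receptive_field(n))
# 3 70
# 5 286
```

Under this assumption, 3 layers give the default 70x70 patches, while 5 layers give 286x286 patches, so each discriminator output judges a much larger part of the sketch.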

## Dataset Preparation
In order to resize and combine the images with the scripts described below, some preprocessing is required, because the file names in the original dataset are not consistent.

One of the problems is that some images have the extension ".JPG" instead of ".jpg", which our scripts cannot handle. Moreover, sketches have an additional '-sz1' in their names, while a sketch and its photo are supposed to have the same name in order for the combine_A_and_B script to work. Finally, the prefix of the sketches and the images is not the same for all files, so it had to be changed wherever it differed. All these changes were made by us using the command 'rename.ul' from the terminal in Linux, and the processed images can be found [here](https://drive.google.com/drive/folders/1VsoKx26xCywodaeQVOh4SVE5ZP3YvBsz?usp=sharing).
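
For readers without 'rename.ul', the first two fixes could be sketched in Python as follows. This is an illustrative alternative and not the procedure that was actually used. The function names and the example file name are hypothetical, and the prefix fix is left out because it depends on the exact names in the dataset.

```python
import os

def normalized_name(filename):
    """Return the file name with a lower-case extension and without the '-sz1' marker."""
    stem, extension = os.path.splitext(filename)
    stem = stem.replace("-sz1", "")
    return stem + extension.lower()

def normalize_folder(folder):
    """Rename every file in `folder` in place; returns the (old, new) pairs that changed."""
    renamed = []
    for old in sorted(os.listdir(folder)):
        new = normalized_name(old)
        if new != old:
            os.rename(os.path.join(folder, old), os.path.join(folder, new))
            renamed.append((old, new))
    return renamed

# Hypothetical sketch file name, used only to show the effect
print(normalized_name("f-005-01-sz1.JPG"))   # f-005-01.jpg
```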

## Code for resizing images - massive_resizer

In [0]:
#Usage
#python massive_resizer.py --image /path/to/trainsetA --size /path/to/image/with/size/to/output --destination /path/to/export/images

from PIL import Image
from resizeimage import resizeimage
import os
import argparse
import cv2

ap = argparse.ArgumentParser()
ap.add_argument("-p", "--image", required=True,
	help="path to images folder that we want to change size")
ap.add_argument("-i", "--size", required=True,
	help="path to image that we want to get its size")
ap.add_argument("-d", "--destination", required=True,
	help="path to output folder")
args = vars(ap.parse_args())

#original 1024*768 to 414*582

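# Read the reference image once to get the target size (OpenCV shape is height, width, channels)
reference = cv2.imread(args["size"])
if reference is None:
    raise FileNotFoundError("cannot read reference image: " + args["size"])
height, width = reference.shape[:2]

# Build full paths instead of changing directory, so relative paths keep working
for filename in sorted(os.listdir(args["image"])):
    source_path = os.path.join(args["image"], filename)
    with open(source_path, 'rb') as f:
        with Image.open(f) as image:
            # resize_cover expects [width, height]; it scales the image and crops it to that size
            cover = resizeimage.resize_cover(image, [width, height])
            cover.save(os.path.join(args["destination"], filename), image.format)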

## Code for combining images

In [0]:
#Usage
#python pix2pix_combine.py --image /path/to/images/to/put/on/the/right --sketch /path/to/images/to/put/on/the/left --destination /path/to/export/images

import cv2
import os
import numpy as np
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-p", "--image", required=True,
	help="path to real images folder")
ap.add_argument("-i", "--sketch", required=True,
	help="path to sketch images folder")
ap.add_argument("-d", "--destination", required=True,
	help="path to output folder")
args = vars(ap.parse_args())

#Real images and their sketches already have the same dimensions (see code above) and the same file names in their folders
#Below are three path examples

#path = '/home/user/Desktop/train/photos' #Assuming that real images are in this folder
#path2 = '/home/user/Desktop/train/sketches' #Assuming sketches are in this folder
#path3 = '/home/user/Desktop/train/new' #Combined photos+sketches images will be created here

# Build full paths instead of changing directory, so relative paths keep working
for filename in sorted(os.listdir(args["image"])):
    # The photo and its sketch share the same file name in their respective folders
    photo = cv2.imread(os.path.join(args["image"], filename))
    sketch = cv2.imread(os.path.join(args["sketch"], filename))
    if photo is None or sketch is None:
        print("skipping " + filename + ": photo or sketch could not be read")
        continue
    print(filename, photo.shape, sketch.shape)
    # Sketch goes on the left half, photo on the right half (both have the same size)
    comb = np.hstack((sketch, photo))
    #cv2.imshow("Output", comb)
    name, extension = os.path.splitext(filename)
    cv2.imwrite(os.path.join(args["destination"], name + "_AB" + extension), comb)

## Code for training a pix2pix model

In [0]:
from google.colab import drive
drive.mount('/content/drive/')

In [0]:
# specify your own path to where pytorch-CycleGAN-and-pix2pix folder is
import os
os.chdir('/content/drive/My Drive/pytorch-CycleGAN-and-pix2pix/')

In [0]:
!pip install -r requirements.txt

In [0]:
#fold_A is the path of the folder with the images on the left
#fold_B is the path of the folder with the images on the right
#fold_AB is the path of the folder in which the results will appear
!python datasets/combine_A_and_B.py --fold_A ./datasets/human2sketch/B --fold_B ./datasets/human2sketch/A --fold_AB ./datasets/human2sketch/AB

In [0]:
#dataroot is the path to the combined images. Note that there should be a folder named 'AB' inside of which there will be folders named 'train', 'test' and 'val'
#n_layers_D specifies the number of convolutional layers in the discriminator; increasing it increases the receptive field
#if we want to restart training from a given checkpoint we use the option --continue_train and then specify the epoch to restart from with --epoch_count 50
!python train.py --dataroot ./datasets/human2sketch/AB  --name human2sketchnew  --netD n_layers --n_layers_D 5 --model pix2pix 

In [0]:
#name and dataroot as those used in training
#direction is here just to clarify that we can only go from the image on the left to the image on the right and not the other way around
!python test.py --dataroot datasets/human2sketch/AB  --name human2sketchnew --model pix2pix --no_dropout --direction AtoB

## Code for training a cycleGAN model

In [0]:
from google.colab import drive
drive.mount('/content/drive/')

In [0]:
# specify your own path to where pytorch-CycleGAN-and-pix2pix folder is
import os
os.chdir('/content/drive/My Drive/pytorch-CycleGAN-and-pix2pix/')

In [0]:
!pip install -r requirements.txt

In [0]:
#dataroot is the path to the images. Note that inside that folder there should be 4 folders named 'trainA', 'trainB', 'testA', 'testB'
#if we want to restart training from a given checkpoint we use the option --continue_train and then specify the epoch to restart from with --epoch_count 50
!python train.py --dataroot ./datasets/human2sketch --name human2sketchnew  --model cycle_gan
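
Before starting a long training run, it can be useful to check that the data folder has the layout described in the comment above. The helper below is a small sketch of such a check. The function and constant names are ours and not part of the repository.

```python
import os

CYCLEGAN_FOLDERS = ("trainA", "trainB", "testA", "testB")

def missing_subfolders(dataroot, expected=CYCLEGAN_FOLDERS):
    """Return the expected subfolders that are absent or empty under `dataroot`."""
    missing = []
    for name in expected:
        path = os.path.join(dataroot, name)
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(name)
    return missing

# An empty list means that all four folders exist and contain files
print(missing_subfolders("./datasets/human2sketch"))
```

The same helper could be used for pix2pix by passing `expected=("train", "val", "test")` together with the path of the 'AB' folder.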

In [0]:
#name the same as the one used in training
#dataroot is the path to the folder that contains the images that we want to convert
!python test.py --dataroot datasets/human2sketch/testB  --name human2sketchnew --model test --no_dropout

## References
X. Wang and X. Tang, “Face Photo-Sketch Synthesis and Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 31, 2009.

P. Isola, J.-Y. Zhu, T. Zhou and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” https://arxiv.org/abs/1611.07004 (GitHub: https://github.com/phillipi/pix2pix)

J.-Y. Zhu, T. Park, P. Isola and A. A. Efros, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks,” https://arxiv.org/abs/1703.10593 (GitHub: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix)

J. Long, E. Shelhamer and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” https://arxiv.org/abs/1411.4038

https://blog.paperspace.com/unpaired-image-to-image-translation-with-cyclegan/

https://towardsdatascience.com/cyclegans-and-pix2pix-5e6a5f0159c4

https://machinelearningmastery.com/a-gentle-introduction-to-pix2pix-generative-adversarial-network/

https://pathmind.com/wiki/generative-adversarial-network-gan

https://theaisummer.com/Semantic_Segmentation/

https://azati.ai/image-detection-recognition-and-classification-with-machine-learning/

https://github.com/seyrankhademi/ResNet_CIFAR10

