# GAN for Anime Faces
Project for CSE 512 Machine Learning   
Tianchi Mo 112281322  
Stony Brook University, Department of Computer Science   
Fall 2019  

The goal of this project is to use Generative Adversarial Networks (GANs) to draw anime faces that do not exist. To achieve this, I try three techniques:
- DCGAN (on my desktop)
- W-GAN (on my desktop)
- Transfer learning based on Progressive Growing of GAN (PGGAN) (on AWS)

## Dataset 
I use https://github.com/Mckinsey666/Anime-Face-Dataset. After cleaning, it contains 57238 faces. I resize all of them to 64x64.


In [1]:
import argparse
import os
from collections import OrderedDict
from datetime import datetime
from random import randint

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.backends.cudnn as cudnn
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
import torchvision
import torchvision.utils as vutils
from IPython.display import clear_output
from torch.autograd import Variable
from torch.nn import init
from torch.utils.data import DataLoader
from torchvision import transforms, datasets
from tqdm import tqdm, trange

# DCGAN

For the DCGAN, I define the generator network as:
- ConvTranspose2d (take in 100-channel 1x1 noise, out channel = 64x8, kernel_size=4, stride=1, padding=0, bias=False)
- Batch norm 
- ReLU
- ConvTranspose2d (in channel = 64x8, out channel = 64x4, kernel_size=4, stride=2, padding=1, bias=False)
- Batch norm 
- ReLU
- ConvTranspose2d (in channel = 64x4, out channel = 64x2, kernel_size=4, stride=2, padding=1, bias=False)
- Batch norm 
- ReLU
- ConvTranspose2d (in channel = 64x2, out channel = 64, kernel_size=4, stride=2, padding=1, bias=False)
- Batch norm 
- ReLU
- Output layer: ConvTranspose2d (in channel = 64, out channel = 3, kernel_size=5, stride=3, padding=1, bias=False) + an nn.Tanh()


In [2]:
class NetG(nn.Module):
    def __init__(self, ngf, nz):
        super(NetG, self).__init__()
        
        self.layer1 = nn.Sequential(
            nn.ConvTranspose2d(nz, ngf * 8, kernel_size=4, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(inplace=True)
        )
        
        self.layer2 = nn.Sequential(
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(inplace=True)
        )
        
        self.layer3 = nn.Sequential(
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(inplace=True)
        )
     
        self.layer4 = nn.Sequential(
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(inplace=True)
        )
        
        self.layer5 = nn.Sequential(
            nn.ConvTranspose2d(ngf, 3, 5, 3, 1, bias=False),
            nn.Tanh()
        )


    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.layer5(out)
        return out
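
As a sanity check on the generator layers listed above, the spatial size after each transposed convolution can be worked out by hand. The sketch below uses the output-size formula documented for `nn.ConvTranspose2d` with dilation 1 and no output padding, `out = (in - 1) * stride - 2 * padding + kernel_size`. The helper name `deconv_out` is only for illustration and is not part of the project code.

```python
def deconv_out(size, kernel_size, stride, padding):
    """Output side length of a transposed convolution (dilation=1, output_padding=0)."""
    return (size - 1) * stride - 2 * padding + kernel_size

# (kernel_size, stride, padding) for layer1..layer5 of NetG
netg_layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (5, 3, 1)]

size = 1  # the noise input is a 100-channel 1x1 map
sizes = [size]
for k, s, p in netg_layers:
    size = deconv_out(size, k, s, p)
    sizes.append(size)

print(sizes)  # [1, 4, 8, 16, 32, 96]
```

The last layer (kernel 5, stride 3) triples the 32x32 map, so this generator produces 96x96 images.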

And I define the discriminator as:
- Conv2d (in channel = 3, out channel = 64, kernel_size=5, stride=3, padding=1, bias=False),
- Batch norm 
- Leaky ReLU
- Conv2d (in channel = 64, out channel = 64x2, kernel_size=4, stride=2, padding=1, bias=False),
- Batch norm 
- Leaky ReLU
- Conv2d (in channel = 64x2, out channel = 64x4, kernel_size=4, stride=2, padding=1, bias=False),
- Batch norm 
- Leaky ReLU
- Conv2d (in channel = 64x4, out channel = 64x8, kernel_size=4, stride=2, padding=1, bias=False),
- Batch norm 
- Leaky ReLU
- Classifier: Conv2d (in channel = 64 x 8, out channel=1, kernel_size=4, stride=1, padding=0, bias=False) + an nn.Sigmoid()

In [3]:

class NetD(nn.Module):
    def __init__(self, ndf):
        super(NetD, self).__init__()
        
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, ndf, kernel_size=5, stride=3, padding=1, bias=False),
            nn.BatchNorm2d(ndf),
            nn.LeakyReLU(0.2, inplace=True)
        )
       
        self.layer2 = nn.Sequential(
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True)
        )
       
        self.layer3 = nn.Sequential(
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True)
        )
       
        self.layer4 = nn.Sequential(
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True)
        )
       
        self.layer5 = nn.Sequential(
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

  
    def forward(self,x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.layer5(out)
        return out
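
The discriminator mirrors the generator, which can be checked with the usual convolution output-size formula (dilation 1), `out = floor((in + 2 * padding - kernel_size) / stride) + 1`. This is a small sketch with hypothetical helper names (`conv_out`, `trace_sizes`); it only does the arithmetic and does not build the network.

```python
def conv_out(size, kernel_size, stride, padding):
    """Output side length of a convolution (dilation=1), using floor division."""
    return (size + 2 * padding - kernel_size) // stride + 1

# (kernel_size, stride, padding) for layer1..layer5 of NetD
netd_layers = [(5, 3, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]

def trace_sizes(input_size):
    """Side length of the feature map after each NetD layer."""
    sizes = [input_size]
    for k, s, p in netd_layers:
        sizes.append(conv_out(sizes[-1], k, s, p))
    return sizes

print(trace_sizes(96))  # [96, 32, 16, 8, 4, 1]
print(trace_sizes(64))  # [64, 21, 10, 5, 2, -1]
```

With a 96x96 input the final 4x4 convolution yields one score per image. With a 64x64 input the last feature map is only 2x2, smaller than the 4x4 kernel, so the formula gives a non-positive size. That is presumably why the DCGAN experiment works on 96x96 images.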

Other settings: images are resized to 96x96 and the batch size is 128. I use BCELoss and the Adam optimizer with learning rate 0.0002 and betas=(0.5, 0.999). After 200 epochs, we get:
![](fig-a.png)
We can see that the generator learns hair and eyes well, but the faces lack detail and their diversity is poor.
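
To make the role of BCELoss concrete, here is a framework-free sketch of the two objectives for a single sample. The discriminator outputs `d_real` and `d_fake` are made-up values chosen for illustration, not measurements from this project, and `bce` is a hypothetical helper that mirrors the per-sample binary cross-entropy formula.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical discriminator outputs, for illustration only.
d_real = 0.9  # D(x) on a real image
d_fake = 0.2  # D(G(z)) on a generated image

loss_d = bce(d_real, 1) + bce(d_fake, 0)  # discriminator: real -> 1, fake -> 0
loss_g = bce(d_fake, 1)                   # generator: wants D(G(z)) -> 1

print(round(loss_d, 3), round(loss_g, 3))  # 0.329 1.609
```

When the discriminator is confident that fakes are fake, its own loss is small while the generator loss is large, which is the pattern seen in the training log of this section.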

In [4]:
batchSize=128
imageSize=96
nz=100
ngf=64
ndf=64
epoch=200
lr=0.0002
beta1=0.5
data_path='train/'
outf='imgs/'


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

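# transforms.Scale has been replaced by transforms.Resize in torchvision.
# The pipeline is named `transform` so it does not shadow the imported `transforms` module.
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize(imageSize),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

dataset = torchvision.datasets.ImageFolder(data_path, transform=transform)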

dataloader = torch.utils.data.DataLoader(
    dataset=dataset,
    batch_size=batchSize,
    shuffle=True,
    drop_last=True,
)

netG = NetG(ngf, nz).to(device)
netD = NetD(ndf).to(device)

criterion = nn.BCELoss()
optimizerG = torch.optim.Adam(netG.parameters(), lr=lr, betas=(beta1, 0.999))
optimizerD = torch.optim.Adam(netD.parameters(), lr=lr, betas=(beta1, 0.999))

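real_label = 1.0
fake_label = 0.0

# Make sure the output directory for samples and checkpoints exists.
os.makedirs(outf, exist_ok=True)

for ep in range(1, epoch + 1):
    for i, (imgs, _) in enumerate(dataloader):
        # ---- Update D on real images (target label 1) ----
        optimizerD.zero_grad()

        imgs = imgs.to(device)
        # Flatten the (N, 1, 1, 1) output so it matches the (N,) label shape.
        output = netD(imgs).view(-1)
        label = torch.full((batchSize,), real_label, device=device)
        errD_real = criterion(output, label)
        errD_real.backward()

        # ---- Update D on generated images (target label 0) ----
        label.fill_(fake_label)
        noise = torch.randn(batchSize, nz, 1, 1, device=device)
        fake = netG(noise)
        # detach() keeps this backward pass from reaching the generator.
        output = netD(fake.detach()).view(-1)
        errD_fake = criterion(output, label)
        errD_fake.backward()
        errD = errD_fake + errD_real
        optimizerD.step()

        # ---- Update G: make D classify the fakes as real ----
        optimizerG.zero_grad()
        label.fill_(real_label)
        output = netD(fake).view(-1)
        errG = criterion(output, label)
        errG.backward()
        optimizerG.step()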

        clear_output(True)
        print('[%d/%d][%d/%d] Loss_D: %.3f Loss_G %.3f'
              % (ep, epoch, i, len(dataloader), errD.item(), errG.item()))

    vutils.save_image(fake.data,
                      '%s/fake_samples_epoch_%03d.png' % (outf, ep),
                      normalize=True)
    torch.save(netG, '%s/netG_%03d.pt' % (outf, ep))
    torch.save(netD, '%s/netD_%03d.pt' % (outf, ep))

[5/200][140/447] Loss_D: 0.293 Loss_G 5.185
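
A note on the `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` transform used above: `ToTensor` produces pixel values in [0, 1], and subtracting 0.5 then dividing by 0.5 maps them to [-1, 1], the same range as the generator's Tanh output. The following sketch shows the arithmetic on a few scalar values; the helper names are hypothetical.

```python
def normalize(x, mean=0.5, std=0.5):
    """Per-channel normalization as applied by Normalize: (x - mean) / std."""
    return (x - mean) / std

def denormalize(y, mean=0.5, std=0.5):
    """Inverse mapping, from [-1, 1] back to [0, 1]."""
    return y * std + mean

pixels = [0.0, 0.25, 0.5, 1.0]                   # values as produced by ToTensor
normalized = [normalize(v) for v in pixels]      # [-1.0, -0.5, 0.0, 1.0]
restored = [denormalize(v) for v in normalized]  # back to the original values

print(normalized, restored)
```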


## -----------------------------------------------------------------------------------------------

# W-GAN

The main differences between DCGAN and W-GAN are:
- For the discriminator, the sigmoid in the classifier is removed. 
- The loss functions are changed to an estimate of the Wasserstein distance, and the discriminator (critic) weights are clipped after each update.
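
A minimal, framework-free sketch of the W-GAN objective with weight clipping, as used in the training loop of this section: the critic tries to score real images higher than generated ones, the generator tries to raise the score of its images, and the critic weights are clamped to a small interval. The scores and weights below are made-up numbers, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def critic_loss(scores_real, scores_fake):
    """The critic minimises -E[D(x)] + E[D(G(z))]."""
    return -mean(scores_real) + mean(scores_fake)

def generator_loss(scores_fake):
    """The generator minimises -E[D(G(z))]."""
    return -mean(scores_fake)

def clip_weights(weights, clip_value):
    """Clamp every weight to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, w)) for w in weights]

# Hypothetical critic scores (unbounded, since there is no sigmoid).
scores_real = [1.5, 2.0, 2.5]
scores_fake = [-1.0, 0.0, 1.0]

loss_d = critic_loss(scores_real, scores_fake)     # -2.0 + 0.0 = -2.0
loss_g = generator_loss(scores_fake)               # zero for these scores
clipped = clip_weights([-0.5, 0.003, 0.02], 0.01)  # [-0.01, 0.003, 0.01]

print(loss_d, loss_g, clipped)
```

Because the critic outputs raw scores rather than probabilities, its loss can be negative, as in the final log line of this section.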

The generator is defined as:
- Linear (100, 128)
- Leaky ReLU
- Linear (128, 256)
- Batch norm 
- Leaky ReLU
- Linear (256, 512)
- Batch norm 
- Leaky ReLU
- Linear (512, 1024)
- Batch norm 
- Leaky ReLU
- Output layer: Linear (1024, 3x64x64) + an nn.Tanh()

The discriminator is defined as:
- Linear (3x64x64, 512)
- Leaky ReLU
- Linear(512, 256)
- Leaky ReLU
- Linear(256, 1)


In [2]:
import argparse
import os
import numpy as np
import math
import sys

import torchvision.transforms as transforms
from torchvision.utils import save_image

from torch.utils.data import DataLoader
from torchvision import datasets
from torch.autograd import Variable

import torch.nn as nn
import torch.nn.functional as F
import torch

In [4]:
img_shape = (channels, img_size, img_size)

cuda = True if torch.cuda.is_available() else False


class WGANGenerator(nn.Module):
    def __init__(self):
        super(WGANGenerator, self).__init__()

        def block(in_feat, out_feat, normalize=True):
            layers = [nn.Linear(in_feat, out_feat)]
            if normalize:
                # Note: the second positional argument of BatchNorm1d is eps, so this sets eps=0.8.
                layers.append(nn.BatchNorm1d(out_feat, 0.8))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.model = nn.Sequential(
            *block(latent_dim, 128, normalize=False),  # latent_dim comes from the hyperparameter cell
            *block(128, 256),
            *block(256, 512),
            *block(512, 1024),
            nn.Linear(1024, int(np.prod(img_shape))),
            nn.Tanh()
        )

    def forward(self, z):
        img = self.model(z)
        img = img.view(img.shape[0], *img_shape)
        return img


class WGANDiscriminator(nn.Module):
    def __init__(self):
        super(WGANDiscriminator, self).__init__()

        self.model = nn.Sequential(
            nn.Linear(int(np.prod(img_shape)), 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, img):
        img_flat = img.view(img.shape[0], -1)
        validity = self.model(img_flat)
        return validity



Other settings: the batch size is 64. I use a hand-written Wasserstein loss and the RMSprop optimizer with learning rate 0.00005. After 200 epochs, we get:
![](fig-b.png)
Compared with DCGAN, the W-GAN results are less distorted but still lack detail, perhaps because the discriminator is too simple.
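
The training loop in this section updates the critic on every batch and the generator only when `i % n_critic == 0`. The sketch below counts what that means per epoch, assuming the 57238-image dataset from the Dataset section and `drop_last=True`; it is plain arithmetic and does not touch the models.

```python
n_samples = 57238  # dataset size stated in the Dataset section
batch_size = 64
n_critic = 5

# drop_last=True discards the final partial batch.
batches_per_epoch = n_samples // batch_size
# Batch indices at which the generator is updated.
generator_steps = [i for i in range(batches_per_epoch) if i % n_critic == 0]

print(batches_per_epoch, len(generator_steps), generator_steps[-1])  # 894 179 890
```

So each epoch has 894 critic updates and 179 generator updates, and the last generator update of an epoch happens at batch 890, which matches the final log line `[Batch 890/894]`.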

In [2]:
n_epochs=200
batch_size=64
lr=0.00005
latent_dim=100
img_size=64
channels=3
n_critic=5
clip_value=0.01
sample_interval=400

In [5]:
generator = WGANGenerator()
discriminator = WGANDiscriminator()

if cuda:
    generator.cuda()
    discriminator.cuda()
    
# Named `transform` so it does not shadow the imported `transforms` module.
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

dataset = torchvision.datasets.ImageFolder('train/', transform=transform)

dataloader = torch.utils.data.DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
)


optimizer_G = torch.optim.RMSprop(generator.parameters(), lr=lr)
optimizer_D = torch.optim.RMSprop(discriminator.parameters(), lr=lr)

Tensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor

In [6]:
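# Make sure the output directories used below exist.
os.makedirs("wgan-images", exist_ok=True)
os.makedirs("wgan-model", exist_ok=True)

batches_done = 0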
for epoch in range(n_epochs):
    for i, (imgs, _) in enumerate(dataloader):
        real_imgs = Variable(imgs.type(Tensor))
        optimizer_D.zero_grad()
        z = Variable(Tensor(np.random.normal(0, 1, (imgs.shape[0], latent_dim))))

        fake_imgs = generator(z).detach()
        loss_D = -torch.mean(discriminator(real_imgs)) + torch.mean(discriminator(fake_imgs))
        loss_D.backward()
        optimizer_D.step()

        for p in discriminator.parameters():
            p.data.clamp_(-clip_value, clip_value)

        if i % n_critic == 0:
            optimizer_G.zero_grad()
            gen_imgs = generator(z)
            loss_G = -torch.mean(discriminator(gen_imgs))

            loss_G.backward()
            optimizer_G.step()

            clear_output(True)
            print(
                "[Epoch %d/%d] [Batch %d/%d] [D loss: %f] [G loss: %f]"
                % (epoch, n_epochs, batches_done % len(dataloader), len(dataloader), loss_D.item(), loss_G.item())
            )

        if batches_done % sample_interval == 0:
            save_image(gen_imgs.data[:25], "wgan-images/%09d.png" % batches_done, nrow=5, normalize=True)
            torch.save(generator, 'wgan-model/netG_%09d.pt' % (batches_done))
            torch.save(discriminator, 'wgan-model/netD_%09d.pt' % (batches_done))
            
        batches_done += 1

[Epoch 199/200] [Batch 890/894] [D loss: -1.656635] [G loss: 0.977705]


In [13]:
use_cuda = torch.cuda.is_available()
print('use cuda: %s'%(use_cuda))

use cuda: True


## -----------------------------------------------------------------------------------------------

# Transfer learning based on Progressive Growing of GAN (PGGAN)

As we can see, the simple hand-written models do not perform very well, so I decided to turn to transfer learning. At first I wanted to try SinGAN, recommended by Professor Nguyen. But after reading the paper (https://arxiv.org/abs/1905.01164), **I noticed that SinGAN works well when the pictures contain repetitive patterns (groups of mountains, sheep, or tree branches), so it may not be good at drawing anime faces. After searching on the Internet, I selected Progressive Growing of GAN (PGGAN), published by Nvidia in 2017 (https://arxiv.org/abs/1710.10196). It is also a fairly new technique.**

The basic idea of PGGAN is to learn to generate 4x4 pictures first, then 8x8, then 16x16, and so on. PGGAN was originally used to generate high-resolution 1024x1024 human faces. I transfer it to generate 64x64 anime faces.
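
The progressive schedule described above can be written down in a few lines. This is only an illustration of the resolution doubling, with a hypothetical helper name; it is not the training code of the PGGAN implementation used here.

```python
def resolution_schedule(target, start=4):
    """Resolutions visited when doubling from `start` up to `target`."""
    resolutions = [start]
    while resolutions[-1] < target:
        resolutions.append(resolutions[-1] * 2)
    return resolutions

print(resolution_schedule(64))         # [4, 8, 16, 32, 64]
print(len(resolution_schedule(1024)))  # 9
```

Stopping at 64x64 means five stages instead of the nine needed to reach 1024x1024.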

I use the official PyTorch API and tutorial provided by Facebook (https://pytorch.org/hub/facebookresearch_pytorch-gan-zoo_pgan/ and https://github.com/facebookresearch/pytorch_GAN_zoo, published in 2019). I deployed it on an AWS server.
![](fig-c.png)

**I will keep it running until you grade the project; you can watch it at http://18.221.62.53:8097/. Please change “main” to “mtcpgan_training”.**
![](fig-d.png)
The model type I chose is PGAN (progressive growing of GAN).  
![](fig-e.png)  


Let’s look at some results. I began the training on 2019-12-14. It takes over 3 days to get good 64x64 pictures.   

Dec 15, 1:50 pm. This is 8x8 training (the first two rows are generated pictures and the third row is the real data). Eyes are visible: 
![](fig-f.png)  

Dec 15, 3:56 pm. 16x16 training (the first two rows are generated pictures and the third row is the real data):
![](fig-g.png)  

Dec 16, 7:25 pm. 32x32 training (the first two rows are generated pictures and the third row is the real data):
![](fig-h.png)  
The 32x32 training ran for 96000 iterations.
![](fig-i.png)  


Dec 17, 9:10 pm. 64x64 training (the first two rows are generated pictures and the third row is the real data). The result is gorgeous. Some generated faces cannot be told apart from the real data. The details are clear and rich. In the top two rows, the generated faces have different hairstyles, hair colors, and eye colors. Some girls have visible mouths (in Japanese cartoons, a closed mouth is sometimes not drawn at all, so this feature is hard to learn), while there are no open mouths in the faces generated by DCGAN and W-GAN. The first two girls have blushing cheeks, which means they are shy. The 6th, 7th, and 8th girls have highlights in their eyes, a sign of liveliness in cartoons. The faces also point in slightly different directions.    
![](fig-j.png)  

**You may think that the faces are still blurry. That is because the figure above shows the faces at a size much larger than 64x64. If we shrink them to around 64x64, the effect is marvelous (the first two rows are generated pictures and the third row is the real data).**     
![](fig-k.png)  

**The 64x64 training began around 7 pm on Dec 17, and it is still running on the AWS server. As I write this report, it has run only 2900 iterations. The quality of the generated faces may improve further later.**