Image colorization with GAN
Image and video colorization by hand is a very labour-intensive process. In recent years, with advances in machine learning techniques, automatic colorization has gained significant traction in the field. In this project, I attempt to bring color to a clip from the original Twilight Zone using the GAN (generative adversarial network) architecture. This repository contains scripts that gather and process data and build and train the network, as well as a subset of the training data, trained model weights, and sample outputs.
- Learning Representations for Automatic Colorization (code)
- DeOldify (code)
- Image Colorization using Transfer Learning
The training set consists of 15,000 images from Ghost Story, a color TV show from the 70s, and the test images are from the Twilight Zone. I used pytube to download YouTube videos and OpenCV to capture images from the videos, convert them from RGB to LAB colorspace, and resize them for the sake of training efficiency. The advantage of the LAB colorspace is that the lightness (L) channel is the grayscale version of the image, separated from the color channels.
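Below is a minimal sketch of this capture pipeline, assuming pytube and OpenCV as described above; the sampling interval and helper names are illustrative.

```python
# Sketch of the frame-capture pipeline: download, sample frames,
# resize, and convert to LAB. Names and intervals are illustrative.
import cv2
from pytube import YouTube

def download_video(url, path="."):
    # Grab the highest-resolution progressive stream from YouTube.
    return YouTube(url).streams.filter(progressive=True).first().download(output_path=path)

def capture_frames(video_path, every_n=24, size=(120, 90)):
    """Yield resized LAB frames, sampling one frame every `every_n`."""
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frame = cv2.resize(frame, size)  # size is (width, height)
            # OpenCV decodes frames as BGR, so convert BGR -> LAB directly.
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
        i += 1
    cap.release()
```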
All images are stored in a single HDF5 file as arrays of unsigned integers in two tables (Train and Test). Due to size limitations, only a subset (200 images from the training set, 50 from the test set) is provided in the repo. When loaded, they are cast to floats and rescaled to the range [-1, 1] before being passed into the network.
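A sketch of the loading step, assuming the file is read with h5py (the repo may use a different HDF5 library); the file path is a placeholder, while the table names follow the description above.

```python
# Load a table of uint8 LAB images and rescale to [-1, 1] for the network.
import h5py
import numpy as np

def load_images(h5_path="data/images.h5", table="Train"):
    with h5py.File(h5_path, "r") as f:
        images = f[table][:]  # uint8 arrays in LAB colorspace
    # Cast to float and rescale from [0, 255] to [-1, 1].
    return images.astype(np.float32) / 127.5 - 1.0
```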
GAN is a popular model for image generation (or synthetic data generation more generally). The core idea is that a generative network and a discriminative network are trained simultaneously: the generator attempts to create data so similar to the real data that the discriminator cannot tell them apart, while the discriminator learns to separate fake data from real data.
Since the goal is to generate believable colors for grayscale images, GAN seems to be a suitable choice for this task. The main architecture is inspired by Unsupervised Diverse Colorization via Generative Adversarial Networks (code) with slight modifications. Here, the generator tries to produce colors as close to the ground truth as possible while the discriminator distinguishes the generator outputs from the real color version of the inputs.
Both the generator and the discriminator use convolutional layers with strides of 1 to avoid resizing and to retain spatial information. I tried other models that use strides greater than 1 or pooling layers, and artifacts such as color blocks and abrupt transitions became apparent in the results. Because the image size is invariant through the layers, the generator can concatenate the lightness channel before each convolution operation, a convenient way to let the network leverage the conditional information throughout. I also use a larger learning rate for the generator optimizer than for the discriminator optimizer, since the discriminator became too good too quickly when the learning rates were equal.
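Below is a minimal sketch of this concatenation pattern in tf.keras; the number of blocks and filter sizes are illustrative, not the exact architecture used here.

```python
# Generator sketch: project noise onto the image plane, then re-inject the
# lightness channel before every stride-1 convolution.
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(height=90, width=120, noise_dim=100, num_blocks=4):
    lightness = layers.Input(shape=(height, width, 1))  # grayscale L channel
    noise = layers.Input(shape=(noise_dim,))

    # Project the noise vector onto the image plane so it can be concatenated.
    x = layers.Dense(height * width)(noise)
    x = layers.Reshape((height, width, 1))(x)

    for _ in range(num_blocks):
        # Strides of 1 keep the spatial dimensions fixed, so the L channel
        # can be concatenated before each convolution without resizing.
        x = layers.Concatenate()([x, lightness])
        x = layers.Conv2D(64, kernel_size=3, strides=1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

    # Output the two color (a, b) channels in [-1, 1].
    ab = layers.Conv2D(2, kernel_size=3, strides=1, padding="same",
                       activation="tanh")(x)
    return tf.keras.Model([lightness, noise], ab)
```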
Instead of Wasserstein loss, I opt for mean squared error in the generator and binary cross-entropy in the discriminator. Wasserstein loss made training extremely unstable, and the authors of the original paper also noted that it did not improve performance for them.
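The sketch below shows one possible training step under this setup; the specific learning rates are assumptions chosen only to illustrate the generator's larger rate, and the discriminator is assumed to also see the lightness channel alongside the color channels.

```python
# One GAN training step: MSE for the generator, BCE for the discriminator,
# with a larger learning rate on the generator side.
import tensorflow as tf

gen_opt = tf.keras.optimizers.Adam(learning_rate=2e-4)   # generator: larger
disc_opt = tf.keras.optimizers.Adam(learning_rate=5e-5)  # discriminator: smaller
mse = tf.keras.losses.MeanSquaredError()
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(generator, discriminator, lightness, true_ab, noise):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_ab = generator([lightness, noise], training=True)
        real_logits = discriminator([lightness, true_ab], training=True)
        fake_logits = discriminator([lightness, fake_ab], training=True)

        # Generator: mean squared error against the ground-truth colors.
        g_loss = mse(true_ab, fake_ab)
        # Discriminator: binary cross-entropy on real vs. generated colors.
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))

    gen_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
    disc_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    return g_loss, d_loss
```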
Perhaps because most screenshots contain human faces, the network is quite good at detecting and coloring faces. However, for objects without a distinct color, the network often opts for a safe brownish color to minimize errors.
This is a short clip from the original (black and white) Twilight Zone. Click to play on YouTube.
And here is the network output.
A gallery of images from the training set.
A more diverse training dataset might help with more general object colorization. A generator that leverages a pre-trained image recognition and segmentation model could reduce training time and potentially improve performance; however, pre-trained models expect inputs with three channels (since they were trained on color images), so some adjustments are needed. In addition, a loss function that compares outputs in a reduced-dimensionality space in the generator might alleviate the constant brown color predictions. One such example is a VGG loss, sketched below.
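As a rough sketch, a VGG-based perceptual loss might look like the following; the choice of feature layer is arbitrary, and the generator output would first need to be converted to a three-channel image, as noted above.

```python
# Perceptual loss: compare images in a frozen VGG16 feature space instead of
# pixel space. "block3_conv3" is an arbitrary mid-level layer choice.
import tensorflow as tf

vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
features = tf.keras.Model(vgg.input, vgg.get_layer("block3_conv3").output)
features.trainable = False

def vgg_loss(y_true, y_pred):
    # Expects 3-channel images (e.g. LAB outputs converted back to RGB),
    # since VGG16 was trained on color inputs.
    return tf.reduce_mean(tf.square(features(y_true) - features(y_pred)))
```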
Currently, the model expects the input image to be 120 by 90 pixels. This size is entirely arbitrary and only restricted by the first layer of the generator, which projects the noise vector onto the input image plane. A more dynamic approach could remove this image size restriction.
A Dockerfile is included in this repo if you wish to run the model end-to-end in a container (CPU only).
The command below builds the full database and trains the network from scratch.
python train.py --build_db
You can omit the --build_db flag if you've already built the database. Use the --load_weights flag if you wish to use the pre-trained weights, and --epoch [some number] if you wish to override the default number of epochs (100).
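For example, to continue from the pre-trained weights and train for 200 epochs (an arbitrary count), you could combine the flags:
python train.py --load_weights --epoch 200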
To evaluate the trained network on a single image (say, image #8) from the training set:
python evaluate.py --train 8
To evaluate on multiple images from the training set:
python evaluate.py --train 0 99 175
To evaluate on the test video (note that the codec is missing from the OpenCV Python library; see Caveat):
python evaluate.py --test
The output images and video will be stored in the data directory.
DeOldify is a state-of-the-art deep learning model developed by Jason Antic for image colorization. See below for predicted images on the test set from both this model and the DeOldify API.
The model is written in Python using TensorFlow. The network was trained on a single NVIDIA P5000 GPU over 500 epochs, with each epoch taking about 5 minutes. Training on a GPU provides a significant speed-up (20 times or more), as you can see from the plot below.
The CPU and GPU results above are from the same machine, with GPU disabled and enabled, respectively.
Hardware specs:
- CPU: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz, 4 cores (8 threads), 30GB RAM
- GPU: NVIDIA Quadro P5000, 16GB RAM
If you install OpenCV through pip, it doesn't come with the codec needed to generate the MP4 file from the test data (see this GitHub issue). I installed OpenCV through APT on Ubuntu instead.
sudo apt-get install python3-opencv