
Neural Style Transfer

Implementation of Neural Style Transfer from the paper A Neural Algorithm of Artistic Style (Gatys et al.) in TensorFlow 2.0.

Examples

These examples were generated using the default options.

Images used can be found in the data/demo directory.

Example 1

Example 2

Demo

A demo is available as a Google Colab notebook.

Code for the notebook can be found in google-colab-demo.ipynb, which also includes a link to the notebook itself.

Make sure to install TensorFlow 2.0 by running

!pip install tensorflow-gpu==2.0.0-beta1

and that a GPU runtime is enabled for the notebook.

Usage

Requirements

  • Pillow==6.0.0
  • tensorflow-gpu==2.0.0-beta1

Required packages can be installed with pip using requirements.txt:

pip install -r requirements.txt

To use the CPU version, replace tensorflow-gpu==2.0.0-beta1 with tensorflow==2.0.0-beta1 in requirements.txt. However, running on CPU is very slow and is generally advised against.

Running

python neural_transfer.py --content-path <path of content image> --style-path <path of style image>

Options

  • -h, --help : Display help message
  • -c, --content-path : Path of content image. Default: data/demo/chicago.jpg
  • -s, --style-path : Path of style image. Default: data/demo/candy.jpg
  • -cw, --content-weight : Content weight. Default: 1e-3
  • -sw, --style-weight : Style weight. Default: 1.0
  • -vw, --variation-weight : Variation weight. Default: 1e-4
  • -slw, --style-layer-weights : Weights for the individual style layers. These are normalized before the style loss is calculated. Default: 1 1 1 1 1
  • -wn, --white-noise-input : Flag to use a white noise image as the initial image. If not set, the content image is used as the initial image. Default: False
  • -lr, --learning-rate : Learning rate for Adam optimizer. Default: 10.0
  • -e, --epochs : Number of epochs. Default: 10
  • -steps, --steps : Number of steps per epoch. Default: 100
  • -o, --output-file : File name for the generated image. The path can include an extension, for example example.png; if no extension is given, png is used. If no file name is provided, the generated image is saved as result.png. All output files are saved in the data/results directory. Default: result.png
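For example, the following invocation stylizes the default demo images with an explicit style weight and a custom output file name (styled.png is only an illustrative name):

python neural_transfer.py --content-path data/demo/chicago.jpg --style-path data/demo/candy.jpg --style-weight 1.0 --output-file styled.png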

Implementation Details

Convolutional Neural Networks (CNNs) consist of multiple layers of computational units that process visual information hierarchically in a feed-forward manner. Each layer of units can be understood as a collection of image filters, each of which extracts a certain feature from the input image. Thus, the output of a given layer consists of feature maps: differently filtered versions of the input image.

By separating how content and style are represented in CNNs, images can be generated that simultaneously match the content representation of one image and the style representation of another. While the global arrangement of the original content image is preserved, the colours and local structures that compose the global scenery are provided by the style image. Effectively, this renders the content in the style of the style image.

Implementation

A given input image $x$ is encoded in each layer of the CNN by the output of its filters. A layer $l$ with $N_l$ distinct filters has $N_l$ feature maps, each of size $M_l$, where $M_l$ is the height times the width of the feature map.

Thus, the responses in a layer $l$ can be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$.
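As a rough sketch of this bookkeeping (assuming a Keras-style activation tensor of shape (1, height, width, N_l); the helper name is illustrative), the activations of a layer can be flattened into the $N_l \times M_l$ matrix $F^l$:

import tensorflow as tf

def to_feature_matrix(layer_output):
    # Flatten a (1, height, width, n_filters) activation tensor into F^l,
    # a matrix of shape (N_l, M_l) with M_l = height * width.
    _, h, w, n_filters = layer_output.shape
    flat = tf.reshape(layer_output, (h * w, n_filters))
    return tf.transpose(flat)  # shape: (N_l, M_l)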

Content Representation

Content representation is matched on layer 'block4_conv2' of the VGG19 network.

Let $p$ and $x$ be the original image and the image that is generated, and $P^l$ and $F^l$ their respective feature representations (activations) in layer $l$.

Content loss can be defined as the squared-error loss between the two feature representations:

$$\mathcal{L}_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

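A minimal TensorFlow sketch of this loss (the function name is illustrative; content_features and generated_features are the 'block4_conv2' activations of the content and generated images):

import tensorflow as tf

def content_loss(content_features, generated_features):
    # 1/2 * sum_ij (F^l_ij - P^l_ij)^2
    return 0.5 * tf.reduce_sum(tf.square(generated_features - content_features))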
Style Representation

Style representation is matched on layers 'block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1' and 'block5_conv1' of the VGG19 network.
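One way to obtain these activations is a small Keras model built on a pre-trained VGG19 that returns the outputs of the chosen content and style layers; a sketch (function and variable names are illustrative):

import tensorflow as tf

CONTENT_LAYERS = ['block4_conv2']
STYLE_LAYERS = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']

def build_feature_extractor(layer_names):
    # Frozen VGG19 that maps an input image to the requested layer activations.
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
    vgg.trainable = False
    outputs = [vgg.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)

extractor = build_feature_extractor(STYLE_LAYERS + CONTENT_LAYERS)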

Style representation is defined by computing the correlations between the different filter responses in each layer of the network. These feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

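A sketch of the Gram matrix computation for a (1, height, width, N_l) activation tensor (the einsum form follows TensorFlow's neural style transfer tutorial; the function name is illustrative):

import tensorflow as tf

def gram_matrix(layer_output):
    # G^l_ij = sum_k F^l_ik F^l_jk, i.e. inner products between all pairs
    # of vectorised feature maps. Result shape: (1, N_l, N_l).
    return tf.linalg.einsum('bijc,bijd->bcd', layer_output, layer_output)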

Let $a$ and $x$ be the original image and the generated image, and $A^l$ and $G^l$ their respective style representations (Gram matrices) in layer $l$. The contribution of layer $l$ to the total style loss is defined as the mean-squared distance between the entries of the two Gram matrices:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

Total style loss is then defined as the weighted sum of these contributions across all layers on which style representation is matched:

$$\mathcal{L}_{style}(a, x) = \sum_{l} w_l E_l$$

where $w_l$ are weighting factors for the contribution of each layer to the total style loss.

$w_l$ is set to 1 for all layers by default. These weights are normalized before the style loss is calculated.
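A sketch of the per-layer contribution $E_l$ and the weighted sum over layers (Gram matrices are assumed to be precomputed for the style and generated images; all names are illustrative):

import tensorflow as tf

def layer_style_loss(gram_style, gram_generated, n_filters, map_size):
    # E_l = 1 / (4 * N_l^2 * M_l^2) * sum_ij (G^l_ij - A^l_ij)^2
    factor = 1.0 / (4.0 * (n_filters ** 2) * (map_size ** 2))
    return factor * tf.reduce_sum(tf.square(gram_generated - gram_style))

def style_loss(grams_style, grams_generated, layer_sizes, layer_weights):
    # Weighted sum of layer contributions, with the weights w_l normalized
    # so that they sum to 1 (the default is w_l = 1 for every layer).
    total_weight = sum(layer_weights)
    loss = 0.0
    for w, a, g, (n_filters, map_size) in zip(layer_weights, grams_style,
                                              grams_generated, layer_sizes):
        loss += (w / total_weight) * layer_style_loss(a, g, n_filters, map_size)
    return loss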

Total Variation Loss

Total variation loss has also been included as a component of the total loss function. It is not covered in the paper by Gatys et al., but was inspired by TensorFlow's implementation of neural style transfer.

Generating images that match the content and style representations produces a lot of high-frequency artifacts in the generated image. These can be reduced with an explicit regularization term on the high-frequency components of the image.

Let $x$ be the generated image and $h$, $w$ and $c$ its height, width and number of channels.

The horizontal variation is then defined as:

$$V_h(x) = \sum_{i=1}^{h} \sum_{j=1}^{w-1} \sum_{k=1}^{c} \left( x_{i, j+1, k} - x_{i, j, k} \right)^2$$

while the vertical variation is defined as:

$$V_v(x) = \sum_{i=1}^{h-1} \sum_{j=1}^{w} \sum_{k=1}^{c} \left( x_{i+1, j, k} - x_{i, j, k} \right)^2$$

The total variation loss is the sum of the two:

$$\mathcal{L}_{variation}(x) = V_h(x) + V_v(x)$$

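A sketch of these terms for an image tensor of shape (1, h, w, c) (squared neighbour differences, matching the formulation above; tf.image.total_variation is a similar built-in that uses absolute differences instead):

import tensorflow as tf

def total_variation_loss(image):
    # Horizontal variation: differences between neighbouring columns.
    horizontal = image[:, :, 1:, :] - image[:, :, :-1, :]
    # Vertical variation: differences between neighbouring rows.
    vertical = image[:, 1:, :, :] - image[:, :-1, :, :]
    return tf.reduce_sum(tf.square(horizontal)) + tf.reduce_sum(tf.square(vertical))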

Overall Loss

To generate images that mix the content of the content image with the style of the style image, the distance of an initial image from the content representation of the content image and the style representation of the style image in multiple layers is minimized through gradient descent. Variation loss is also taken into consideration for a smoother generated image.

Let $p$ be the content image, $a$ the style image and $x$ the generated image. The loss function can be defined as:

$$\mathcal{L}_{total}(p, a, x) = \alpha \, \mathcal{L}_{content}(p, x) + \beta \, \mathcal{L}_{style}(a, x) + w_{variation} \, \mathcal{L}_{variation}(x)$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction respectively, and $w_{variation}$ is the weighting factor for the variation loss.

The paper uses an $\alpha/\beta$ ratio of $1 \times 10^{-3}$ or $1 \times 10^{-4}$ for the relative weighting of the content and style reconstruction losses. This implementation uses an $\alpha/\beta$ ratio of $1 \times 10^{-3}$ by default, although both $\alpha$ and $\beta$ are tunable options.

For $w_{variation}$, a default weight of $1 \times 10^{-4}$ is used.

Another implementation detail that differs from the paper is the initial image on which gradient descent is performed. The paper uses a white noise image, while this implementation uses the content image as the initial image by default, since this is a faster way to apply the style of the style image to the content image.

To use a white noise image as the initial image, pass the -wn or --white-noise-input option. A larger learning rate and more epochs/steps are also advised.
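Putting the pieces together, a rough sketch of the optimization loop (run_style_transfer and loss_fn are illustrative names; loss_fn is assumed to return the weighted total loss defined above):

import tensorflow as tf

def run_style_transfer(content_image, loss_fn, learning_rate=10.0,
                       epochs=10, steps_per_epoch=100):
    # content_image: float32 tensor of shape (1, h, w, 3) with values in [0, 1].
    # By default the generated image starts from the content image; a white
    # noise start corresponds to the --white-noise-input option.
    image = tf.Variable(content_image)
    optimizer = tf.optimizers.Adam(learning_rate=learning_rate)
    for _ in range(epochs * steps_per_epoch):
        with tf.GradientTape() as tape:
            loss = loss_fn(image)
        grad = tape.gradient(loss, image)
        optimizer.apply_gradients([(grad, image)])
        # Keep pixel values in a valid range after each update.
        image.assign(tf.clip_by_value(image, 0.0, 1.0))
    return image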

License

This project is licensed under the MIT License.