Interested in understanding neural style transfer, I set out to implement it myself. This is adapted from the TensorFlow implementation. Obviously, I am going to use a photo of my beautiful dog, Bacchus.
The results are pretty great...
- Metzinger's Two Nudes
- Gleizes's The Bridges of Paris
- Delaunay's Window on the City
- Kandinsky's Composition VII
- Monet's The Water-Lily Pond
- Bruegel's Tower of Babel
- Bruegel's The Triumph of Death
- Bosch Follower's Christ in Limbo
- Bosch Follower's Tondal's Vision
The basic idea behind neural style transfer is to take an image and transfer the artistic style of another image onto it. This can be done by the following basic process:
- Take a CNN (convolutional neural network) trained for general-purpose image recognition (such as VGG19, used here)
- Extract features from several layers of the network for both a "content" image and a "style" image
- Create a new image, "combo," that minimizes both the loss from its deviation from the content image and the loss from its stylistic deviation from the style image
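The content side of this process can be sketched in a few lines of numpy. This is a minimal, illustrative sketch (the arrays here are random stand-ins for CNN activations, and the function name is mine, not the repo's):

```python
import numpy as np

# Content loss: squared-error distance between the combo and content
# feature maps at a single deep layer (shapes are h x w x n_f).
def content_loss(content_feats, combo_feats):
    return np.sum((combo_feats - content_feats) ** 2)

rng = np.random.default_rng(0)
content = rng.standard_normal((32, 32, 64))            # stand-in activations
combo = content + 0.1 * rng.standard_normal((32, 32, 64))

print(content_loss(content, content) == 0.0)  # True -- identical, no loss
print(content_loss(content, combo) > 0)       # True -- deviating costs loss
```

During optimization, the gradient of this loss with respect to the combo image nudges it toward the content's deep-layer features.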
In their paper (https://arxiv.org/pdf/1508.06576.pdf), Gatys et al. construct this method using two loss functions:
- Content loss: The distance between the "combo" and "content" activations at a deep layer of the network.
- Style loss: The difference between the Gram matrices of the "combo" and "style" images across several layers of the network. In essence, the Gram matrix takes a layer's feature map (h x w x n_f, where h and w are the height and width and n_f is the number of filters, i.e. features) and converts it into an n_f x n_f matrix measuring how strongly each pair of the layer's features is represented in that image. The difference between the Gram matrices of the style image and the combo image tracks how well those features have been captured. Importantly, this doesn't care where the features are located, only that they are present. This is pooled over multiple layers of the CNN.
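A numpy sketch of the Gram matrix and the resulting style loss (names are illustrative; the normalization constant follows Gatys et al.). The last two lines demonstrate the location-invariance point: shifting the feature map spatially leaves the Gram matrix unchanged.

```python
import numpy as np

def gram_matrix(features):
    """Turn an (h, w, n_f) feature map into an (n_f, n_f) Gram matrix."""
    h, w, n_f = features.shape
    flat = features.reshape(h * w, n_f)   # one row per spatial location
    return flat.T @ flat                  # feature-feature co-occurrence

def style_loss(style_feats, combo_feats):
    h, w, n_f = style_feats.shape
    diff = gram_matrix(style_feats) - gram_matrix(combo_feats)
    # Normalization from Gatys et al.
    return np.sum(diff ** 2) / (4.0 * n_f ** 2 * (h * w) ** 2)

# The Gram matrix ignores *where* features occur: cyclically shifting
# the feature map only permutes the rows of `flat`, so flat.T @ flat
# is unchanged.
rng = np.random.default_rng(1)
feats = rng.standard_normal((16, 16, 8))
shifted = np.roll(feats, shift=5, axis=0)
print(np.allclose(gram_matrix(feats), gram_matrix(shifted)))  # True
```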
- Adam optimizer: Many implementations use L-BFGS-B to minimize the loss, which is included in scipy as a wrapper around a Fortran routine. Here, I use Adam because it is easier to implement and seems to work just as well.
- Variational loss: This demands that adjacent pixels in the combo image do not differ too much, i.e. that the image is somewhat smooth.
- High tunability: The three sources of loss are weighted by scalable parameters and raised to variable powers, allowing for arbitrary customization of the transfer. (Up until now, I have mostly been following in the footsteps of others, but here I start to deviate.)
- Common weighting: I normalize the style losses so that every included layer contributes equally for the style image. This means that features that are present are rewarded, but, more significantly, features that appear in the combo image but are absent in the style are penalized. The effect is sizable, and I found that it tends to create a more interesting combined image.
- Removing content loss: Surprisingly, I found that starting with the content image but giving the content loss no weight produces the most interesting images (i.e., I typically set the content weight to 0).
- Option to start with "combo" = "content" or "combo" = noise: The code allows either option, which was heavily used in the Exploration/ folder. For the noise, I simply construct an image from uniformly distributed random values for each pixel channel.
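The variational term above is a standard total-variation penalty; a minimal numpy sketch, assuming the combo image is an (h, w, c) array (the function name is illustrative):

```python
import numpy as np

def variation_loss(image):
    """Penalize differences between adjacent pixels of an (h, w, c) image."""
    dx = image[1:, :-1, :] - image[:-1, :-1, :]   # vertical neighbor diffs
    dy = image[:-1, 1:, :] - image[:-1, :-1, :]   # horizontal neighbor diffs
    return np.sum(dx ** 2 + dy ** 2)

flat = np.ones((8, 8, 3))                         # perfectly smooth image
noisy = np.random.default_rng(2).random((8, 8, 3))
print(variation_loss(flat))       # 0.0 -- no neighbor differences at all
print(variation_loss(noisy) > 0)  # True -- noise is penalized
```

Turning up the weight on this term trades texture detail for smoothness in the final image.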
Instead of transferring one style, we could try to transfer two: a fine style, which captures colors and textures from the early layers, and a coarse style, which captures more global stylistic features of the image. This is implemented in DualNST.ipynb. Below is a grid of images. The top left is the content source (Bacchus), while the rest of the top row are the coarse style sources (Composition VII, Tondal's Vision, The Bridges of Paris). The left column holds the fine style sources (Tower of Babel, The Water-Lily Pond, The Triumph of Death). The remaining images are the fine + coarse styles applied as indicated by their position.
Note how the colors and textures are mostly preserved from the images on the left (fine style), while larger features (music symbols, creepy faces, sharp angles) are largely preserved from the top row. All look very "Bacchus," even though the content (his image) is only the initial condition. We could also use noise as the initial condition and apply fine and coarse features to create chaotic and beautiful art (Fine: The Water-Lily Pond; Coarse: Composition VII):
As can be observed from the Exploration folder, there is a lot of difference between the 1st, 3rd, and 5th blocks of the network. We could instead try to transfer three styles to the image - roughly as colors, small features, and large features (A, B, C respectively). This is in TriNST.ipynb. These style transfers are definitely more temperamental - if adjacent styles are too discordant, there tend to be a lot of artifacts produced (e.g. dots of red or green on a black surface). Turning up the smoothing (v_w) can help a lot, but it isn't always enough without completely blurring out the image.
In order, these are:
# | Colors | Fine Style | Coarse Style |
---|---|---|---|
1 | Bridges of Paris | Window on the City | Christ in Limbo |
2 | Two Nudes | Bridges of Paris | Window on the City |
3 | Coast of Northumberland[1] | Christ in Limbo | Composition VII |
4 | Christ in Limbo | Triumph of Death | Composition VII |
5 | Window on the City | Triumph of Death | Composition VII |
6 | Bridges of Paris | Tower of Babel | Christ in Limbo |
7 | Triumph of Death | Tondal's Vision | Christ in Limbo |
8 | Bridges of Paris | Window on the City | Two Nudes |
[1] J.M.W. Turner's Wreckers - Coast of Northumberland
In addition to enabling three-style transfers, this method gives finer control over a dual transfer: for instance, one image can supply A & B while the second supplies C, or one supplies A and the other B & C (where A = colors; B = small features; C = large features). Below, from left to right, we have The Water-Lily Pond + Composition VII using 1) the dual method above; 2) A & B - Water-Lily, C - Composition; 3) A - Water-Lily, B & C - Composition. It is clear how much of a difference these middle layers can make in the resulting image.
Similarly, we don't have to transfer all levels. By setting some weights to 0, we can transfer only some of the feature scales. Here we transfer only (left to right) 1) A & B; 2) A & C; 3) B & C from Albert Gleizes's The Bridges of Paris.
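A sketch of how such per-scale weights might be wired up. The VGG19 layer names below are the standard ones, but the grouping and the helper are illustrative, not the repo's actual code:

```python
# Hypothetical grouping of VGG19 style layers into the three feature scales.
STYLE_A = ["block1_conv1", "block1_conv2"]   # A: colors
STYLE_B = ["block3_conv1", "block3_conv2"]   # B: small features
STYLE_C = ["block5_conv1", "block5_conv2"]   # C: large features

def layer_weights(w_a, w_b, w_c):
    """Map each style layer to the weight of its feature scale."""
    weights = {}
    for layers, w in ((STYLE_A, w_a), (STYLE_B, w_b), (STYLE_C, w_c)):
        for name in layers:
            weights[name] = w
    return weights

# Transfer only A & C by zeroing out B's weight:
w = layer_weights(1.0, 0.0, 0.5)
print(w["block3_conv1"])   # 0.0 -- the B scale is dropped entirely
```

Pointing different scales at different style images (A at one painting, B & C at another) is what makes the multi-style variations below possible.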
We can also use the content image as one of the style images in order to preserve more of its features. For example, here is Tower of Babel as content, with Monet's Blue Water Lilies applied at levels A & B, while C again takes the Tower of Babel style:
There are multiple ways to apply this. As an example, here is Tower of Babel as content with The Bridges of Paris / Babel as styles, in order: 1) ABC - Babel (no Paris applied, only smoothing); 2) AB - Babel, C - Paris; 3) AC - Babel, B - Paris; 4) BC - Babel, A - Paris; 5) A - Babel, BC - Paris; 6) B - Babel, AC - Paris; 7) C - Babel, AB - Paris; 8) ABC - Paris.
Notice how A really governs the color scheme (images 1, 2, 3, 5 vs. 4, 6, 7, 8); B captures small details, e.g., tower windows and the foreground (1, 2, 4, 6 vs. 3, 5, 7, 8); and C captures the largest features, e.g., the shape of the tower and the surrounding hills (1, 3, 4, 7 vs. 2, 5, 6, 8).
Neural style transfer can be customized quite a bit: transferring styles from different depths of the network can produce very different images. This technique can generate some fascinating visual art, and it also helps us better understand the layers of a convnet.
The Jupyter notebooks should work out of the box once the requisite files are properly linked. With a GPU, each image takes about a minute to run (a bit less for a single transfer). Without a GPU, you are in for a long haul (e.g. half an hour). If you want to experiment and don't have your own, I'd encourage you to use the GPUs on Google Colab or Kaggle.