Implementation of an Inversion-based Style Transfer model based on this paper. The deep network we use as a feature extractor is SqueezeNet, a small model trained on ImageNet. You could use any network, but we chose SqueezeNet for its small size and efficiency.
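A minimal sketch of how such a feature extractor might be set up with torchvision (the exact weights string, preprocessing, and helper name are illustrative assumptions, not this repo's actual code):

```python
import torch
import torchvision

# Load a SqueezeNet pretrained on ImageNet and keep only its convolutional trunk.
cnn = torchvision.models.squeezenet1_1(weights="DEFAULT").features
cnn.eval()
for p in cnn.parameters():
    p.requires_grad = False  # we optimize pixels, never the network weights

def extract_features(x, model):
    """Return the activations of every module in `model` applied to image batch x."""
    feats = []
    for layer in model:
        x = layer(x)
        feats.append(x)
    return feats
```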
Here are examples of input source images (style, then content) followed by progressive stages of the output images produced by this algorithm:
The goal is to take two images and produce a new image that reflects the content of one but the artistic "style" of the other.
We did this by first formulating a loss function that matches the content and style of each respective image in the feature space of a deep network, and then performing gradient descent on the pixels of the image itself.
- Style Transfer. This refers to the process of applying the artistic style of one image (the style image) to the content of another image (the content image) to generate a stylized output image that retains the content of the original image but has the artistic style of the style image.
- Content and Style Representation. The paper proposes separating content and style representations in a CNN. The content representation captures the arrangement of objects and their features in the image, while the style representation captures texture and color information by analyzing correlations between features across different layers of the network.
- Optimization Framework. An optimization framework is used to minimize a loss function that measures the difference between the content, style, and generated images in terms of their representations in the CNN. This optimization alters the pixels of the generated image to minimize the loss, thereby achieving style transfer.
Python 3.10, a recent version of PyTorch, and the numpy and scipy modules. All of these can be installed with pip. To install all dependencies at once, run `pip install -r requirements.txt`.
I only tested this code on Ubuntu 20.04, but I tried to make it as generic as possible (e.g. using the os module for file system interactions), so it should work on Windows and Mac with relatively little effort.
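A plausible `requirements.txt` along these lines would cover the dependencies above; torchvision is assumed here because the pretrained SqueezeNet ships with it, and no version pins are given by this README:

```text
torch
torchvision
numpy
scipy
```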
- Get the code. `$ git clone` the repo and install the Python dependencies.
- Train and evaluate the trained model. Run the training script `$ train_test.py` and wait. You'll see that the learning code writes checkpoints into `cv/` and periodically prints its status.
We can generate an image that reflects the content of one image and the style of another by incorporating both in our loss function. We want to penalize deviations from the content of the content image and deviations from the style of the style image. We can then use this hybrid loss function to perform gradient descent not on the parameters of the model, but instead on the pixel values of our original image.
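A minimal sketch of that pixel-space optimization loop, assuming the `extract_features` helper above and `content_loss`, `gram_matrix`, `style_loss`, and `tv_loss` functions like the ones sketched further below (the optimizer, learning rate, and function names are illustrative, not the repo's exact settings):

```python
import torch

def style_transfer(content_img, style_img, cnn, content_layer, style_layers,
                   content_weight, style_weights, tv_weight, num_iters=200):
    # Precompute the fixed targets from the two source images.
    content_feats = extract_features(content_img, cnn)
    content_target = content_feats[content_layer].clone()
    style_feats = extract_features(style_img, cnn)
    style_targets = [gram_matrix(style_feats[l]).clone() for l in style_layers]

    # Start from the content image and optimize its pixels directly.
    img = content_img.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([img], lr=0.1)

    for _ in range(num_iters):
        optimizer.zero_grad()
        feats = extract_features(img, cnn)
        loss = (content_loss(content_weight, feats[content_layer], content_target)
                + style_loss(feats, style_layers, style_targets, style_weights)
                + tv_loss(img, tv_weight))
        loss.backward()
        optimizer.step()
    return img.detach()
```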
First, we wrote the content loss function. Content loss measures how much the feature map of the generated image differs from the feature map of the source image. We only care about the content representation of one layer of the network (say, layer $\ell$), with feature map $F^\ell \in \mathbb{R}^{C_\ell \times M_\ell}$ for the current image and $P^\ell \in \mathbb{R}^{C_\ell \times M_\ell}$ for the content source image, where $C_\ell$ is the number of channels in layer $\ell$ and $M_\ell = H_\ell \times W_\ell$ is the number of spatial positions in each feature map. Let $w_c$ be the weight of the content loss term.

Then the content loss is given by:

$$L_c = w_c \sum_{i,j} \left(F^\ell_{ij} - P^\ell_{ij}\right)^2$$
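A sketch of that content loss in PyTorch, assuming the feature maps come straight out of the network as tensors of shape $(1, C_\ell, H_\ell, W_\ell)$:

```python
import torch

def content_loss(content_weight, current_feats, content_target):
    # Sum of squared differences between the two feature maps, scaled by w_c.
    return content_weight * torch.sum((current_feats - content_target) ** 2)
```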
Now we can tackle the style loss. For a given layer $\ell$, the style loss is defined as follows.
First, compute the Gram matrix G which represents the correlations between the responses of each filter, where F is as above. The Gram matrix is an approximation to the covariance matrix -- we want the activation statistics of our generated image to match the activation statistics of our style image, and matching the (approximate) covariance is one way to do that. There are a variety of ways you could do this, but the Gram matrix is nice because it's easy to compute and in practice shows good results.
Given a feature map $F^\ell$ of shape $(C_\ell, M_\ell)$, the Gram matrix has shape $(C_\ell, C_\ell)$ and its elements are given by:

$$G^\ell_{ij} = \sum_k F^\ell_{ik} F^\ell_{jk}$$

Assuming $G^\ell$ is the Gram matrix from the feature map of the current image, $A^\ell$ is the Gram matrix from the feature map of the style source image, and $w_\ell$ is a scalar weight term, the style loss for layer $\ell$ is simply the weighted Euclidean distance between the two Gram matrices:

$$L_s^\ell = w_\ell \sum_{i,j} \left(G^\ell_{ij} - A^\ell_{ij}\right)^2$$

In practice we usually compute the style loss at a set of layers $\mathcal{L}$ rather than just a single layer $\ell$; the total style loss is then the sum of style losses at each layer:

$$L_s = \sum_{\ell \in \mathcal{L}} L_s^\ell$$
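A sketch of the Gram matrix and style loss following the formulas above; the optional normalization by the number of activations and the exact argument names are assumptions, not necessarily what this repo does:

```python
import torch

def gram_matrix(feats, normalize=True):
    # feats: tensor of shape (1, C, H, W); flatten spatial dims to get F of shape (C, H*W).
    _, C, H, W = feats.shape
    F = feats.view(C, H * W)
    G = F @ F.t()               # G_ij = sum_k F_ik F_jk
    if normalize:
        G = G / (C * H * W)     # keep the scale comparable across layers
    return G

def style_loss(feats, style_layers, style_targets, style_weights):
    # Weighted sum over layers of squared distances between Gram matrices.
    loss = 0.0
    for i, l in enumerate(style_layers):
        G = gram_matrix(feats[l])
        loss = loss + style_weights[i] * torch.sum((G - style_targets[i]) ** 2)
    return loss
```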
It turns out that it's helpful to also encourage smoothness in the image. We can do this by adding another term to our loss that penalizes wiggles or "total variation" in the pixel values.
You can compute the "total variation" as the sum of the squares of differences in the pixel values for all pairs of pixels that are next to each other (horizontally or vertically). Here we sum the total-variation regularization for each of the 3 input channels (RGB), and weight the total summed loss by the total variation weight, $w_t$:

$$L_{tv} = w_t \sum_{c=1}^{3} \sum_{i=1}^{H-1} \sum_{j=1}^{W-1} \left( \left(x_{i,j+1,c} - x_{i,j,c}\right)^2 + \left(x_{i+1,j,c} - x_{i,j,c}\right)^2 \right)$$
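A sketch of that total variation term, matching the formula above and assuming the image is a $(1, 3, H, W)$ tensor:

```python
import torch

def tv_loss(img, tv_weight):
    # Squared differences between vertically and horizontally adjacent pixels,
    # summed over all three channels and weighted by w_t.
    h_var = torch.sum((img[:, :, 1:, :] - img[:, :, :-1, :]) ** 2)
    w_var = torch.sum((img[:, :, :, 1:] - img[:, :, :, :-1]) ** 2)
    return tv_weight * (h_var + w_var)
```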

