Creating a layer heat map to better understand the layers? #2

Open
ProGamerGov opened this issue Jan 24, 2017 · 54 comments


ProGamerGov commented Jan 24, 2017

When using convis on a model's higher-level layers, a large number of individual images is produced. This makes it very impractical to try to view the image as the model sees it.

So I was wondering about the practicality of having a heat map that combines all of the images into a single false-color image?

@ProGamerGov ProGamerGov changed the title Creating a layer heat map to better understand the layers themselves? Creating a layer heat map to better understand the layers? Jan 24, 2017

htoyryla commented Jan 24, 2017

Each feature map of the model sees the image differently. One sees horizontal lines, another vertical lines, others see diagonal lines, circles, boxes, windows, eyes and so on. A layer may consist of as many as 512 feature maps, which respond to different features. Combining them does not sound like a good idea to me, just like I wouldn't put 512 photos of London on top of each other to show what London is like.

My main idea in making convis was to be able to check, when training a model, how the training is succeeding. One can also use it to gain some understanding of what a model sees. But the feature maps respond to thousands of different features, and I don't see how one could compress that into a heat map in a meaningful way. To understand the layers, one would have to feed the model different kinds of images and then examine all the feature maps to determine exactly what features each feature map is seeing. But convis is probably too simplistic for that kind of work.


ProGamerGov commented Feb 10, 2018

I recently noticed that MIT's Places 365 models were used to generate saliency maps: http://cnnlocalization.csail.mit.edu/

That is exactly what I was trying to do here with convis. I wonder if we can apply class activation mapping (CAM) to other models or if it's specific to the Places 365 project?


ProGamerGov commented Feb 10, 2018

I found an implementation of CAM that works on the regular caffemodels that Neural-Style uses: https://github.com/ramprs/grad-cam

Though that implementation only supports a single layer at a time. It would be interesting to see how the heatmap changes between iterations in Neural-Style.


ProGamerGov commented Feb 11, 2018

So classification.lua contains the code, along with utils.lua.

Specifically these two functions are used to create the heatmap:

https://github.com/ramprs/grad-cam/blob/master/misc/utils.lua#L84-L128

https://github.com/ramprs/grad-cam/blob/master/misc/utils.lua#L154-L176

Edit:

@htoyryla I can't seem to figure out how to get the code working in Neural-Style. I've been trying to place it all in the feval(x) function. Maybe it needs to be implemented like a loss function to work correctly?


This might work?


      self.gradInput2 = self.crit:forward(input, self.target)
      self.activations = self.gradInput2:squeeze()
    
      self.gradInput = self.crit:backward(input, self.target)
      self.gradInput:zeroGradParameters()
      self.gradients = self.gradInput:squeeze()
    
      self.weights = torch.sum(self.gradients:view(self.activations:size(1), -1), 2)
      self.map = torch.sum(torch.cmul(self.activations, self.weights:view(self.activations:size(1), 1, 1):expandAs(self.activations)), 1)
      self.map = self.map:cmul(torch.gt(self.map, 0):typeAs(self.map))
    

Maybe something like the TV Loss function:

local HeatmapLoss, parent = torch.class('nn.HeatmapLoss', 'nn.Module')

-- Heatmap CAM
function HeatmapLoss:__init()
  parent.__init(self)
  self.loss = 0
  self.gradients = nil
  self.activations = nil
  self.weights = nil
  self.map = nil
  self.crit = nn.MSECriterion()
  self.target = torch.Tensor()
end

function HeatmapLoss:updateOutput(input)
  self.output = input
  return self.output
end

function HeatmapLoss:updateGradInput(input, gradOutput)
  self.gradInput:resizeAs(gradOutput):copy(gradOutput)
  if input:nElement() == self.target:nElement() then
      self.gradInput2 = self.crit:forward(input, self.target)
      self.activations = self.gradInput2:squeeze()
    
      self.gradInput = self.crit:backward(input, self.target)
      self.gradInput:zeroGradParameters()
      self.gradients = self.gradInput:squeeze()
    
      self.weights = torch.sum(self.gradients:view(self.activations:size(1), -1), 2)
      self.map = torch.sum(torch.cmul(self.activations, self.weights:view(self.activations:size(1), 1, 1):expandAs(self.activations)), 1)
      self.map = self.map:cmul(torch.gt(self.map, 0):typeAs(self.map))
  end
  return self.map
end


htoyryla commented Feb 11, 2018

It seems to me that you are trying to achieve two things:
a) a heatmap that somehow combines all the feature map activations from a given layer, and
b) a way to monitor how this heatmap changes during the iterations.

I don't think there is any major difficulty doing this in neural-style; one simply needs to find a good way to combine, say, 128 feature maps into a single heatmap, like taking the average or maximum of all feature maps from a layer.
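
For what it's worth, a minimal sketch of that combining step (assuming fmaps holds the CxHxW output of one layer, as in convis):

-- combine the C feature maps of one layer into a single HxW map
local avg_map = torch.mean(fmaps, 1)   -- average over the channel dimension
local max_map = torch.max(fmaps, 1)    -- or take the channel-wise maximum
-- scale to 0..1 before saving, e.g. with image.minmax{tensor = avg_map}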

I did something related to this in one of our earlier threads. It does not display feature maps, though, but the gradients at each iteration, as if to indicate which part of the image is currently changing and by how much.

This gist contains a modified neural-style that displays the total gradient, converted to an image by normalizing it to the range 0..255. It just gives a rough visual indication of how and where the image is changing during the iteration process.
https://gist.github.com/htoyryla/445e2649293f702a940c58a8a3cef472

Convis was made simple, to just map the activations. The more sophisticated visualization methods attempt to follow the gradients to indicate exactly which areas in the image caused those activations. I guess your difficulties arise from the need to make neural-style both do the usual iterations and trace the gradients for visualization. Good luck.


htoyryla commented Feb 11, 2018

I made a quick test modifying convis to save a single combined activation map from a layer. It is easy to do; the open question is how meaningful such a map is, since the different channels respond to different features, so it is quite natural that the combined activations from a layer cover most of the image.

c9out

Another thought: you cannot do this inside feval, because the output of each layer is not available there. However, one could calculate the activation map inside each style (and content) loss module and then collect the results inside feval. So when using the simple activations as in convis, no additional loss modules are needed. And I don't have the time or interest to start looking into the gradient-following approach.
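
A rough sketch of that idea (the actmap field and the saving loop are just illustrative, not existing neural-style code; style_losses and num_calls are the variables neural-style already uses):

-- in StyleLoss:updateOutput(input), after the existing loss computation
-- (input is the CxHxW output of the layer this module is attached to):
self.actmap = torch.sum(input:double(), 1)    -- combined HxW activation map

-- in feval(x), after net:forward(x):
for i, mod in ipairs(style_losses) do
  local disp = image.minmax{tensor = mod.actmap}
  image.save(string.format('actmap_%d_%d.png', i, num_calls), disp)
end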

@htoyryla

OK, I can see how the simple combined activation maps could be useful.

Relu1_1
c9out

Relu2_1
c9out

Relu3_1
c9out

Relu5_3
c9out

@htoyryla

Just as a sidetrack... so my convis shows me the activations from each individual filter in a VGG network. Now I noticed that the second filter of relu1_1 of the usual VGG-19 reacts mainly to the sky in the default Tübingen image. So taking the output from that feature map and adding some postprocessing, I can get masks like these. The point here is that the filters in relu1_1 act directly on the image and can therefore also be used as ordinary image filters (if they happen to produce useful output, that is).

mask
mask2
27655530_10156322936538729_5612581989359965427_n


ProGamerGov commented Feb 11, 2018

Could you please share the convis modifications for creating simple combined activation maps?

I am also wondering how torch.max can be used to get a predicted class value, when no classification list is provided. The code lines here seem to do this and I can't seem to recreate it in convis or Neural-Style.

Like for example:

local y = net:forward(img)
local score, pred_label = torch.max(y,1)
label = pred_label[1]
print("Predicted label: ", pred_label)

This seems like it might be interesting to use for models that don't have readily available category lists.

Another idea I just had: what if, instead of arbitrarily restricting Neural-Style's layer channels to specific values, we instead restricted them to whatever matches the most likely label? Is it possible to get a list of layers and their filters using the above code?

Edit:

I figured it out; it was really simple:

local width, height = 224, 224
content_image = image.scale(content_image, width, height)

local cnn = loadcaffe.load(params.proto, params.model, "nn"):float()
local y = cnn:forward(img)
local score, pred_label = torch.max(y,1)
print("Predicted label: ", pred_label)

I also see that your neural_mirage5.lua does not resize the image before making the predictions? Is the above method basically the same as yours, except that it uses torch.max to get the label with the highest predicted score, whereas your code checks every label and creates a top-5 set of labels?

@ProGamerGov

@htoyryla For the mask image you created with relu1_1, I guess that particular filter was looking for a "sky texture"?

And in the context of our conversations here, "filter" and "layer channel", are the same thing, right?

@htoyryla

For the mask image you created with relu1_1, I guess that particular filter was looking for a "sky texture"?

Not really, the lowest levels cannot detect complex entities like "sky", they simply act as basic convolutional filters. It could be that it detects a certain color.

And in the context of our conversations here, "filter" and "layer channel", are the same thing, right?

Yes. Functionally they are filters. In neural-style/torch terms, a channel in a layer.


htoyryla commented Feb 12, 2018

Is the above method basically the same as yours, except that it uses torch.max to get the label with the highest predicted score, whereas your code checks every label and creates a top-5 set of labels?

Yes. It shows the top 5 labels. In addition, when neural-mirage creates a new image, the target is the complete set of classification probabilities, not only the single class with the highest probability. It tries to create an image that gives the same mix of label probabilities.

Note also that neural-mirage modifies the model (adding an adaptive pooling layer between the conv and FC layers) so that the FC layers can be used with images of varying size. Therefore no resize is needed before prediction.
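
If it helps to picture the idea, a sketch along these lines (not the actual neural_mirage code; it assumes the loadcaffe-loaded VGG flattens with an nn.View before its first nn.Linear) forces the conv output to a fixed 7x7 grid regardless of input size:

require 'nn'

-- sketch: make the FC part of a VGG-style net accept variable input sizes
local function makeSizeAgnostic(cnn)
  local net = nn.Sequential()
  for i = 1, #cnn do
    local layer = cnn:get(i)
    local typ = torch.type(layer)
    if typ == 'nn.View' or typ == 'nn.Linear' then
      -- pool the conv output to a fixed 7x7 grid before the classifier
      net:add(nn.SpatialAdaptiveMaxPooling(7, 7))
      for j = i, #cnn do net:add(cnn:get(j)) end
      return net
    end
    net:add(layer)
  end
  return net
end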

Another idea I just had: what if, instead of arbitrarily restricting Neural-Style's layer channels to specific values, we instead restricted them to whatever matches the most likely label? Is it possible to get a list of layers and their filters using the above code?

No. One has to look at each filter in each layer to see which activations are essential. Perhaps one could follow the gradients from the classification downward to see which filters contribute more and which less; that's not something I am familiar with. Anyway, even the channels with lower activations may be significant, and dropping those filters may change the results.


htoyryla commented Feb 12, 2018

This is the (quick & dirty) code I used to make an average activation map from a given layer. I am simply taking the CxHxW output from the layer and summing over the channels, which gives an HxW tensor, then normalizing it to the 0...255 value range for display.

If I remember correctly, the modified part starts at the line local fmaps = net:forward(img)

require 'torch'
require 'nn'
require 'image'
require 'loadcaffe' 

function preprocess(img)
   local mean_pixel = torch.DoubleTensor({103.939, 116.779, 123.68})
   local perm = torch.LongTensor{3, 2, 1}
   img = img:index(1, perm):mul(256.0)
   mean_pixel = mean_pixel:view(3, 1, 1):expandAs(img)
   img:add(-1, mean_pixel)
   return img
end

function deprocess(img)
  local mean_pixel = torch.DoubleTensor({103.939, 116.779, 123.68})
  mean_pixel = mean_pixel:view(3, 1, 1):expandAs(img)
  img = img + mean_pixel
  local perm = torch.LongTensor{3, 2, 1}
  img = img:index(1, perm):div(256.0)
  return img
end

local cmd = torch.CmdLine()

cmd:option('-image', 'examples/inputs/tubingen.jpg')
cmd:option('-output_dir', 'convis', 'directory where to place images')
cmd:option('-image_size', 800, 'output image size')
cmd:option('-proto', 'models/VGG_ILSVRC_19_layers_deploy.prototxt')
cmd:option('-model', 'models/VGG_ILSVRC_19_layers.caffemodel')
cmd:option('-layer', 'relu4_2', 'layer to examine')

local params = cmd:parse(arg)

local content_image = image.load(params.image, 3)
content_image = image.scale(content_image, params.image_size, 'bilinear')
local content_image_caffe = preprocess(content_image):float()
local img = content_image_caffe:clone():float()


local cnn = loadcaffe.load(params.proto, params.model, "nn"):float()

local net = nn.Sequential()


for i = 1, #cnn do
      local layer = cnn:get(i)
      local typ = torch.type(layer)
      local name = layer.name
      print(name, typ)
      net:add(layer)
      if (name == params.layer) then break end
      if (i == #cnn) then 
        print("No such layer: "..params.layer)
        return 
      end   
end

local fmaps = net:forward(img)

local n = fmaps:size(1)
local filename = "c9out.png" --params.output_dir .. "/" .. string.sub(params.image:match("[^/]+$"), 1, -5) .. "-" .. params.layer 

local y = torch.sum(fmaps, 1)
local m = y:max()
y = y:mul(255):div(m)

local y3 = torch.Tensor(3,y:size(2),y:size(3))
local y1 = y[1]
y3[1] = y1
y3[2] = y1
y3[3] = y1
local disp = deprocess(y3:double())
disp = image.minmax{tensor=disp, min=0, max=1}
disp = image.scale(disp, content_image:size(3), content_image:size(2))
image.save(filename, disp)
print("saving image ",filename)


ProGamerGov commented Feb 13, 2018

In this comment here, I noted that the FCN-32s PASCAL model creates grey rectangle artifacts.

Image size 512:

Image size 1536:

I used a modified version of your convis.lua: https://gist.github.com/ProGamerGov/8f0560d8aea77c8c39c4d694b711e123

Then I just averaged all the layer outputs together with:

convert <layer1.png layer2.png layer3.png layer4.png> -average average_layers.png

Do you think that this has something to do with the artifacts? None of the other models I tested have anything like this, and the angles match the artifact's angles.


htoyryla commented Feb 13, 2018

You mean the added frame around the image. I think that comes from the 100 pixel padding used in the model, see https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/voc-fcn32s/val.prototxt#L27

I think there are ways to modify the model to remove the padding, though I haven't done exactly this kind of operation. It is probably easier to try modifying the style loss modules to remove the padding before calculating the Gram matrix. I almost started trying this, but it was not so straightforward either: one has to account for how the size of the feature maps changes in the different layers.

@htoyryla

In fact it is quite easy to remove the padding. Load the model into th, take the first layer and set padH and padW to zero (for instance). But one cannot save into a caffemodel from torch. I guess there are tools in caffe to do this though, but I haven't used them.

But one can do this at runtime like this:

   if next_content_idx <= #content_layers or next_style_idx <= #style_layers then
      local layer = cnn:get(i)
      local name = layer.name
      local layer_type = torch.type(layer)
      local is_pooling = (layer_type == 'cudnn.SpatialMaxPooling' or layer_type == 'nn.SpatialMaxPooling')
      --remove extra padding in fcn32s model
      if i==0 then
	    layer.padH = 0
	    layer.padW = 0
	  end
      if is_pooling and params.pooling == 'avg' then


ProGamerGov commented Feb 13, 2018

@htoyryla I'm just curious if the padding is somehow the cause of the artifacts I experience. If it is, then I wonder what other parts of a model may cause artifacts. If parts of other models do cause artifacts, then maybe they can be removed by editing Neural-Style, or the model itself.

Also, do you have any idea where I should start if I want to record information about the individual filters and their activations so that I can generate a list of usable layer channels?

Could I use your convis tool to generate all the images for each filter/channel, and repeat that on multiple images of a specific category? Then I could run some sort of analysis on those channel/filter images for light and dark pixels. Would this be a viable idea? I imagine that more bright pixels mean better/stronger activations for each filter?

@htoyryla

Just modify neural-style to remove the padding by adding these lines

  if i==0 then
	    layer.padH = 0
	    layer.padW = 0
	  end

and see if it makes a difference.


htoyryla commented Feb 13, 2018

Also, do you have any idea where I should start if I want to record information about the individual filters and their activations so that I can generate a list of usable layer channels?

You seem to be asking for a simple way to do something which is quite complex.

Yet, as a second thought, the most relevant channels are probably those with the strongest activations for the relevant images (both content and style). We could feed in an image and calculate some statistics on each channel, and then list the channels with the strongest activations.

A mere average would be too crude: it dismisses channels with strong activations within a smaller area. But it could be a way to start; or one could take the maximum. One can then try to find a better formula to measure the activations, perhaps something like the number of pixels with activation above a threshold?

Let's see if I can try this approach; it seems interesting. In fact, if one can define a criterion for dropping a channel, based on low activations from the style image, one can do it automatically: just give a threshold and the style loss calculation will ignore channels which do not respond well enough to the style image.
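
A rough sketch of that measurement (the function name and the optional threshold are just illustrative; fmaps is the CxHxW output of a layer for the style image, as in convis):

-- rank the channels of one layer by activation strength
local function rankChannels(fmaps, nc, threshold)
  local C = fmaps:size(1)
  local norms = torch.Tensor(C)
  for c = 1, C do
    norms[c] = torch.norm(fmaps[c])              -- one number per channel
  end
  local sorted, idx = torch.sort(norms, 1, true) -- descending order
  local best = {}
  for k = 1, math.min(nc, C) do
    if threshold == nil or sorted[k] >= threshold then
      table.insert(best, idx[k])
    end
  end
  return best, norms
end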


htoyryla commented Feb 13, 2018

Try this https://gist.github.com/htoyryla/49cb3ab0864d2a12f558631c7b3d87a3

Give it a layer and an image (for use with neural-channels this should probably be the style image) and you get a list of the channels which might be the most suitable. The nc param specifies how many are listed.

My neural-channels.lua is not the best way to make use of this anymore, as in practice it works only with a single style layer (because you cannot make channel selections per layer). It would probably be best to include this "channel pruning" in neural-style, so that when the style target is captured, each style loss module evaluates which are the best channels and then uses only those when calculating the loss. Seems quite straightforward.

@htoyryla

Here's hopefully a working version that tests the model during style capture and selects, per style layer, the nc channels with the strongest activations (as measured by torch.norm of the channel output). These channels are then favored during the iterations, similar to how the earlier neural_channels worked. The rest of the channels are not ignored totally (as that would stop the iterations from working) but are given a lower weight.

Remember that when decreasing nc, you need to increase style_weight yourself to keep the same content-style balance.

https://gist.github.com/htoyryla/b7940d31d329ee6ffb67b3185f414b8e


ProGamerGov commented Feb 13, 2018

I'm noticing that the loss values with the 10 best channels for each layer in neural_bestchannels.lua resemble the loss values you see in the later stages of multiscale resolution.


I had a theory that channels/filters with strong activations result in a high degree of stylization while channels/filters with weak activations result in a low degree of stylization.

I first noticed this clearly (I had suspicions about it from neural-channels.lua) in my Protobuf-Dreamer project. For example, you can see that different channels of the mixed5a_1x1 layer have different intensities of activation: https://i.imgur.com/icJjqm9.png

The difference is especially apparent on channel 106 (left), and channel 184 (right), where this was the input image:

While the inception5h model used in Protobuf-Dreamer uses the Inception architecture and not the VGG architecture that Neural-Style uses, I suspect that the two are similar in regard to these high- and low-activation channels. Playing around with neural-channels.lua, it looked like I could influence the degree of stylization by only changing the channel values. While testing my fine-tuned models, I noticed what appeared to be a similar effect:

  • The output from the Places 365 Hybrid model (left), and the fine-tuned version of the model (right).

  • The output from the Places 365 Hybrid model (left), and the fine-tuned version of the model (right).

What's interesting here is that the degree of stylization is less with one style image and more with another. The parameters never changed, but the channels/filters in the model did. I think this also backs up my theory.

Because different channels have different activation strengths, I wonder what would happen if instead of giving the strongest channels a higher weighting, we instead tried to make every channel equal to every other channel regardless of activation intensity. Like for example, we gave the weakest channels higher weights relative to the strongest channels.


ProGamerGov commented Feb 13, 2018

For convis, I noticed that the Illustration2vec model's activations resemble the model's "style". Compared to other models, the Illustration2vec model transfers styles with a very distinct anime style of its own.

It seems to "see" every input image in an anime style. This is most apparent on input images with faces (especially the eyes).


ProGamerGov commented Feb 14, 2018

I wonder how well placing an emphasis on the best content layer channels in addition to the best style layer channels would work? How would just placing an emphasis on the content layers compare to just placing an emphasis on the style layers?

I think I got neural_bestchannels.lua to do the same thing with the content layer(s): https://gist.github.com/ProGamerGov/ef79cc3d47f6647f8f5a1582a657ce3d

@ProGamerGov

I tried using:

  if i==0 then
	    layer.padH = 0
	    layer.padW = 0
  end

And it did not stop the artifacts from the FCN-32s PASCAL.


ProGamerGov commented Feb 14, 2018

These are the results from my experiments with bestchannels.lua, style channels only, and the default channel weighting: https://imgur.com/a/yxnZm

This is the result from using style and content channels, in addition to the default channel weighting:

These are the results from using style and content channels, and custom channel weighting values: https://imgur.com/a/LVekL

And this was the control test: https://i.imgur.com/kFUEZK0.png

  • Of course, the same parameters were used for each test, and only the -nc parameter along with the channel weighting value, was changed.

  • The style image was examples/inputs/starry_night_google.jpg and the content image was examples/inputs/hoovertowernight.jpg, for all of the above experiments.

  • Changing the channel weighting was done by changing the value in this line of code: local m = torch.Tensor(C,H,W):fill(0.2), from the inputMask function.

These results are certainly interesting, but I am having a hard time quantifying the differences in a meaningful way that makes sense based on the chosen parameters. Things will probably become more clear as I experiment with other style images, content images, and models.


htoyryla commented Feb 14, 2018

Changing the channel weighting was done by changing the value in this line of code: local m = torch.Tensor(C,H,W):fill(0.2), from the inputMask function.

This line defines the default weight of the channels. The weight of the selected channels is set here:

     m[sch] = 5 

Ideally, I think, one would set the default weight to zero. When I was testing the original neural_channels, however, the iterations failed if the default weight was zero. The matrix became too sparse, I guess. But then I was testing with a single channel. With nc=10 I guess the default weight could be much smaller, like your experiment shows.

Remember also that tampering with channel weights changes the effective style weight, and so does changing nc, too. Which makes testing a bit uncertain.

@htoyryla

Because different channels have different activation strengths, I wonder what would happen if instead of giving the strongest channels a higher weighting, we instead tried to make every channel equal to every other channel regardless of activation intensity. Like for example, we gave the weakest channels higher weights relative to the strongest channels.

This could be an interesting experiment, but the results could be quite erratic: we would be emphasising features NOT found in the images!

@htoyryla

I think I got neural_bestchannels.lua to do the same thing with the content layer(s): https://gist.github.com/ProGamerGov/ef79cc3d47f6647f8f5a1582a657ce3d

I had to add the mode captureS to make sure that the styleLoss module captures the best channels from the style image, not from the content image. I think in the contentLoss module this danger does not exist.

But interesting idea... ignoring all but the strongest content features.


htoyryla commented Feb 14, 2018

Ouch... there is a bug in neural_bestchannels.lua so that no channels actually get emphasis. So the only thing that happens is that the style weight is decreased.

https://gist.github.com/htoyryla/b7940d31d329ee6ffb67b3185f414b8e#file-neural_bestchannels-lua-L530

This line should be

 if channels ~= nil then

I noticed this when I tested decreasing the default channel weight to 1e-2 and then increasing the emphasis channel weight, with no effect on the losses. After the correction, changing nc from 4 to 10 has a dramatic effect on the losses (the same effect as increasing the style weight).


ProGamerGov commented Feb 14, 2018

Yea, I was looking over that part of the code earlier and wondering if it was indeed a bug. It looks like I did fix it myself, but that fix wasn't actually in the script I used to create the above examples... I couldn't find the style_channels variable anywhere else in the code, but for whatever reason I assumed that there was something else going on that I was missing as I tried to follow the code (I guess I was more tired than I realized).


htoyryla commented Feb 14, 2018

One observation: this method of using mainly nc channels per layer now appears to favor the relu1_x layers, which now have the highest loss values, whereas previously I think relu3_x was the strongest.

This is probably because relu1_x has the fewest channels, so dropping most channels has a smaller effect than on the higher layers (with nc=10, relu1_1 still keeps 10 of its 64 channels, while relu5_1 keeps only 10 of 512). But it might be good to also test without the relu1 layers.

@htoyryla

I couldn't find the style_channels variable anywhere else in the code,

In neural_channels, style_channels contained the channels given in the parameter style_channels, which has now been replaced by the automatically detected best channels. I had simply overlooked this if statement.

@htoyryla

Because different channels have different activation strengths, I wonder what would happen if instead of giving the strongest channels a higher weighting, we instead tried to make every channel equal to every other channel regardless of activation intensity. Like for example, we gave the weakest channels higher weights relative to the strongest channels.

One might calculate the average norm of the channels, and then populate the channel mask with multipliers: (average norm / channel norm). This would in effect make all channels equally strong. It should be easy to implement, although the effect could be strange: we would be favoring features not present in the style image.
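
A sketch of that multiplier mask (assuming norms holds one torch.norm value per channel of the layer, as in the channel-listing gist above):

-- weight channel c by (average norm / norm of channel c): weak channels are
-- amplified and strong ones attenuated, so all end up roughly equal
local function equalizeMask(C, H, W, norms)
  local m = torch.Tensor(C, H, W)
  local avg = norms:mean()
  for c = 1, C do
    m[c]:fill(avg / math.max(norms[c], 1e-8))  -- guard against zero norms
  end
  return m
end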

Meanwhile I made a simpler test, by just adjusting the code for inputMask as follows:

function inputMask(C, H, W, channels)
  local t = torch.Tensor(C,H,W):fill(1)
  local m = torch.Tensor(C,H,W):fill(2)
  if channels ~= nil then
    for i=1,#channels do
      local sch = channels[i]
      --print(i, sch)
      if sch > C then
        print("skipping non-existent channel ",sch)
      else
        m[sch] = 0.2
      end
    end
  end
  return t:cmul(m):cuda()
end

we can suppress a few of the strongest channels, while still keeping close to the original style. For instance I like this result using the defaults but suppressing 8 strongest channels: simple, not too much detail.

nnobc-def-nc8-sw1e3_1950


htoyryla commented Feb 14, 2018

This https://gist.github.com/htoyryla/072e1f0475eebc9a4dfc0c011498da9c
implements weighting each channel by (average norm of channels / norm of this channel).

nec-def-sw1e3

Nothing dramatic, as far as I can see. It does not (as I may have thought) bring out features which are not in the style. I was thinking wrongly: the process is still moving towards the style target. But what this may do is make finding the target more difficult, as those channels which contribute most to this style are attenuated. The weights affect how the steering wheel works, and we are modifying the weights to favor turns away from the target?

Which makes me think: could we make the search faster by doing the reverse, amplifying the already strong channels? I guess not... just as when you increase the learning rate, you are likely to miss the target. Which again can be compared to turning the steering wheel too much each time.


ProGamerGov commented Feb 15, 2018

I'm getting NANs from neural-equalchannels.lua.

One observation: this method of using mainly nc channels per layer now appears to favor relu1_x layers, which now have the highest loss values, while previously I think relu3_x was the strongest.

This is probably because relu1_x has fewest channels, so dropping most channels off has a smaller effect than on higher levels. But it might be good to test also without relu1 layers.

We can only use the maximum number of channels in the lowest layer right now. But maybe we could counteract some of this favoritism by treating the layer normally when the number of channels is larger than what the layer has.


ProGamerGov commented Feb 15, 2018

Using the modified bestchannels.lua that supports both content and style layer channels, the results seem to make more sense now:

The weighting works correctly now as well it seems:

The control test:

Using different numbers of channels with the -nc parameter results in a focus on different aspects of the style image and content image. Adding or subtracting even a single value from the -nc parameter can result in a dramatic change in the resulting output image. For example, the control image has vivid spirals, while -nc 50 created some really nice wave-like patterns instead. I also think that the style "flows" better in relation to the content image with -nc 50, compared to the control image and the other -nc values.

@htoyryla

I have meanwhile come to prefer the approach of suppressing the strongest channels. It produces a simpler, less detailed image (which kind of follows the large forms of the style better).

Suppressing strong channels:

nec-def-sharmaa001
nec-def-sharmaa001
nec-def-sharmaa001-sw50

Suppressing weaker channels

nb-def-sharmaa001-sw5e2
nb-def-sharmaa001


htoyryla commented Feb 15, 2018

We can only use the maximum number of channels in the lowest layer right now. But maybe we could counteract some of this favoritism by treating the layer normally when the number of channels is larger than what the layer has.

I don't immediately see how that would work, but never mind; you are free to try it. I was rather thinking that, so that the effect on the effective style weight would be the same in all layers, one would suppress a given proportion of channels. E.g. nc would be given as 1..64, and then it would be multiplied by C/64.
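
In code that would only be a line or two (a sketch; params.nc is assumed to be the command-line value and C the channel count of the layer at hand):

-- interpret nc as "channels out of 64" and scale it per layer, so every
-- layer keeps (or suppresses) the same proportion of its channels
local nc_layer = math.max(1, math.floor(params.nc * C / 64))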

I'll have a look at neural-equalchannels in a moment.


htoyryla commented Feb 15, 2018

I download neural-equalchannels from the gist and give the command (downloaded under a different name)

th neural-equalchannel-gist.lua -print_iter 1 -backend cudnn

and it iterates nicely. Using adam works too. But it can well be that it will not work with all models or in all cases. After all, equalizing the activations from all channels is a quite extreme idea.


ProGamerGov commented Feb 15, 2018

Suppressing the stronger channels creates a result that looks a bit more like fast style transfer, especially in that last example you posted. I've only been messing around with giving emphasis to the strongest channels. How well do the values work in your suppression code that was shared in an earlier comment above?


htoyryla commented Feb 15, 2018

Personally, I feel I never got fast-neural-style to give anything this close to my styles. I am often after styles that are not too detailed, even towards the abstract; neural-style is not so good at that, and fast-neural-style was much worse. Now suppressing the stronger channels looks promising.

Here's my inputMask() for suppressing the strongest channels. I usually set nc = 1...10, at times 24 or 32. These values were intended for low values of nc; that's why I changed 5 to 2 when suppressing nc channels... so as not to upset the style-content balance too much.

function inputMask(C, H, W, channels)
  local t = torch.Tensor(C,H,W):fill(1)
  local m = torch.Tensor(C,H,W):fill(2)
  if channels ~= nil then
    for i=1,#channels do
      local sch = channels[i]
      --print(i, sch)
      if sch > C then
        print("skipping non-existent channel ",sch)
      else
        m[sch] = 0.2
      end
    end
  end
  return t:cmul(m):cuda()
end

@htoyryla

Suppressing the weaker channels creates a result that looks a bit more like fast style transfer

Just to make sure... "suppressing weaker channels" is neural-bestchannels.lua as it is now in gist. The reverse approach would be suppressing stronger channels, with inputMask as in the comment above.

@ProGamerGov

Just to make sure... "suppressing weaker channels" is neural-bestchannels.lua as it is now in gist. The reverse approach would be suppressing stronger channels, with inputMask as in the comment above.

I meant suppressing stronger channels.

@htoyryla

I meant suppressing stronger channels.

The last example with suppressing stronger channels is with the lowest style weight. I think that makes it similar to fast-neural-style (with which one gets mainly color and texture effects while the shapes are not much affected... at least my impression of it).

I guess I did not try suppressing weak channels with as low a style weight at all. So for a comparison, ignore that example.


ProGamerGov commented Feb 15, 2018

Here are the equalized content and style layer channel results:

I'm not sure what to say about the equalization results, but they are certainly different than all the previous tests.

And here's what happened when I suppressed the top 50 strongest channels on each layer for both the content and style layers:

I find it interesting that suppressing the top 50 strongest channels helped the moon transfer from the style image in a more complete form than in the previous experiments.

(I hope that posting the images in this way, where you can click on them in order to get the full size, is better than creating really long/large comments filled with image)


ProGamerGov commented Feb 16, 2018

Some experiments with different amount of channels for style and content layers, and experiments with equalizing either the content or style layer channels:


ProGamerGov commented Feb 16, 2018

So all of the "equalized" results that I have created seem to have been produced with a flaw in the code, which is what allowed the loss values to not be NANs. I don't know what it is, but the equalization does not seem to work for me.

I'll have to play around with the parameters and see if that's the cause.

Edit:

I think one or both of these parameters are the cause:

-backend cudnn -cudnn_autotune

Removing both of them results in:

Capturing style target 1
relu1_1
relu2_1
relu3_1
relu4_1
relu5_1
Running optimization with L-BFGS
<optim.lbfgs>   creating recyclable direction/step/history buffers
Iteration 50 / 1500
  Content 1 loss: 4518842.968750
  Style 1 loss: 1396907.775879
  Style 2 loss: 734496046.875000
  Style 3 loss: 916056093.750000
  Style 4 loss: nan
  Style 5 loss: 1462419.982910
  Total loss: nan
Iteration 100 / 1500
  Content 1 loss: 4518842.968750
  Style 1 loss: 1396907.775879
  Style 2 loss: 734496046.875000
  Style 3 loss: 916056093.750000
  Style 4 loss: nan
  Style 5 loss: 1462419.982910
  Total loss: nan

These repeating loss values are like the other issue I had earlier, but when I use -backend cudnn -cudnn_autotune, I get NANs for everything instead of just a few.


htoyryla commented Feb 17, 2018

I am not surprised that equalizing channels produces NaNs, because even channels that do not respond to the image at all, or only very weakly, are pushed up to the same level as the strongest. I never thought that equalization made much sense, but tried it anyway.

What might work better is first suppressing sufficiently weak channels and then equalizing.
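
That combination could look roughly like this (a sketch only; the 0.2 default and the threshold ratio are illustrative values, not tested):

-- suppress channels whose norm falls below a fraction of the average norm,
-- and equalize the remaining channels towards the average level
local function suppressThenEqualize(C, H, W, norms, ratio)
  ratio = ratio or 0.25
  local m = torch.Tensor(C, H, W)
  local avg = norms:mean()
  for c = 1, C do
    if norms[c] < ratio * avg then
      m[c]:fill(0.2)                  -- weak channel: keep only a small weight
    else
      m[c]:fill(avg / norms[c])       -- equalize the rest
    end
  end
  return m
end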

PS. I noticed that you actually wrote "to not be NANs", which I do not understand, but anyway, pushing up even the channels that see nothing of interest is not very good for optimization. Maybe there is indeed a bug that allows it to work at all.

@ProGamerGov

A while back I was using equal content and style weights in order to see what artifacts a particular content or style layer would produce. I found that for the VGG-16 SOD Finetune model, two style layers in particular, relu2_1 and relu3_1, were responsible for almost all of the "artifacts". The relu3_2 content layer produced the fewest artifacts compared to the other content layers.

Some examples:

https://i.imgur.com/wQlvFml.jpg

https://i.imgur.com/YbQrwXj.png

For the NIN model, I found that using -content_layers relu1,relu2,relu7 -style_layers relu1,relu2,relu3,relu5,relu7 produced the fewest artifacts. This worked really well in my tiling experiments, together with a -tv_weight of 0.000001 (just high enough to destroy the remaining artifacts, but not high enough to affect the output much).

For your goal of having less detail and a "simpler" look to your outputs, it might be useful to try and eliminate the "high noise" layers from the -content_layers and -style_layers input values. That is, if my idea of using equal content and style weight values works for this sort of task. If it does, then I wonder what the effects of using "low noise" layers together with channel manipulation would be?


ProGamerGov commented Feb 22, 2018

PS. I noticed that you actually wrote "to not be NANs", which I do not understand, but anyway, pushing up even the channels that see nothing of interest is not very good for optimization. Maybe there is indeed a bug that allows it to work at all.

I wrote "not to be NANs" because when I fixed the issue, my parameters resulted in NANs.

This was the flaw in my code:

    if self.mode == 'captureS' then
		  print(self.name)
	      self.channelNorms = channelNorms(input)
    end
	self.mask = inputMaskStyle(input:size(1), input:size(2), input:size(3), self.bestChannels):cuda()	
  end	

And it seems that the following examples had this flaw:

  • Equalized Style -Content 50

  • Equalized Style - Suppressed Content 50

  • Equal Content & Style Channels

I think what happened was that I copied the original flawed code for style channel equalization into the subsequent experiments that used style channel equalization. All of the tests that did not try to equalize the style channels don't have the flaw in their code.

It is curious that style channel equalization is resulting in NANs for me, while content channel equalization is not.


ProGamerGov commented Feb 22, 2018

Also, using multiple style images with channel prominence (top 50), and equalized content channels results in a familiar gray haze:

So the optimal parameters have changed with respect to Adam (I was using the default parameters, not the better ones I discovered). I'm not sure if this was just the result of the input images, as it has been in the past, or if using multiple style images had something to do with it. The loss values went down a lot more slowly than they should have with my highly optimized set of parameters.

Edit:

This occurred when I was trying to suppress the top 50 channels using your best-channels.lua, so I think it's because I am using multiple style images.


htoyryla commented Feb 22, 2018

For your goal of having less detail and a "simpler" look to your outputs, it might be useful to try and eliminate the "high noise" layers from the -content_layers and -style_layers input values.

When I say "simple", I am actually thinking of the Finnish word "pelkistää", which the dictionary translates as "simplify" but which actually means "reduce to the bare essentials". I don't believe that can be done through layer selection. The essentials can include both high-level and low-level features.

Channel selection, maybe; I'm not sure even about that.

@htoyryla

It is curious that style channel equalization is resulting in NANs for me, while content channel equalization is not.

Just something that comes to my mind. I have not tried content channel equalization, so I do not really know, but there was something tricky about capturing the channel norms in style modules, as during the capture phases, the module gets both content and style images, one by one, as input. That's why I had to introduce the mode captureS, in order to capture the channel norms using the style image. Otherwise one also would run into size mismatches.

For content loss modules, captureS mode should be irrelevant.
