Creating a layer heat map to better understand the layers? #2
Each feature map of the model sees the image differently. One sees horizontal lines, another vertical lines, others see diagonal lines, circles, boxes, windows, eyes and so on. A layer may consist of as many as 512 feature maps, which respond to different features. Combining them does not sound like a good idea to me, just like I wouldn't put 512 photos from London on top of each other to show what London is like. My main idea in making convis was to be able to check, when training a model, how the training is succeeding. One can also use it to gain some understanding of what a model sees. But the feature maps respond to thousands of different features, and I don't see how one could compress that into a heat map in a meaningful way. To understand the layers, one would have to feed the model different kinds of images and then examine all the feature maps to determine exactly what features each feature map is seeing. But convis is probably too simplistic for that kind of work. |
I recently noticed that MIT's Places 365 models were used to generate saliency maps: http://cnnlocalization.csail.mit.edu/ That is exactly what I was trying to do here with convis. I wonder if we can apply class activation mapping (CAM) to other models or if it's specific to the Places 365 project? |
I found an implementation of CAM that works on the regular caffemodels that Neural-Style uses: https://github.com/ramprs/grad-cam Though that implementation only supports a single layer at a time. It would be interesting to see how the heatmap changes between iterations in Neural-Style. |
So classification.lua contains the code, along with: utils.lua Specifically these two functions are used to create the heatmap: https://github.com/ramprs/grad-cam/blob/master/misc/utils.lua#L84-L128 https://github.com/ramprs/grad-cam/blob/master/misc/utils.lua#L154-L176 Edit: @htoyryla I can't seem to figure out how to get the code working in Neural-Style. I've been trying to place it all in the feval(x) function. Maybe it needs to be implemented like a loss function to work correctly? This might work?
Maybe something like the TV Loss function:
|
It seems to me that you are trying to achieve both I don't think there is any major difficulty doing this in neural-style, one simply needs to find a good way to combine, say, 128 feature maps into a single heatmap. Like taking the average or maximum of all feature maps from a layer. I did something related to this in one of our earlier threads here. This does not display feature maps though, but the gradients at each iteration, as if to indicate which part of the image is now changing and how much.
Convis was made simple to just map the activations. The more sophisticated visualization methods attempt to follow gradients to indicate exactly which areas in the image caused those activations. I guess your difficulties arise from the need to make neural-style do both the usual iterations and to trace the gradients for visualization. Good luck. |
I made a quick test modifying convis to save a single combined activation map from a layer. Easy to do, it is only an open question as to how meaningful such a map is as the different channels respond to different features, so it is quite natural that the combined activations from a layer cover most of the image. Another thought: you cannot do this inside feval, because the output of each layer is not available there. However, one could calculate the activation map inside each style (and content) loss module and store the results inside feval. So when using the simple activations like in convis, no additional loss modules are needed. And I don't have time or interest to start looking into this following the gradients thing. |
Just as a sidetrack... so my convis shows me the activations from each individual filter in a VGG network. Now I noticed that the second filter of relu1_1 of the usual VGG19 reacts mainly on the sky in the default Thuringen image. So taking the output from that feature map and adding some postprocessing I can get masks like these. The point here is that the filters in relu1_1 act directly on the image and therefore can also be used as ordinary image filters (if they happen to produce useful output, that is). |
Could you please share the convis modifications for creating simple combined activation maps? I am also wondering how torch.max can be used to get a predicted class value, when no classification list is provided. The code lines here seem to do this and I can't seem to recreate it in convis or Neural-Style. Like for example:
This seems like it might be interesting to use for models that don't have readily available category lists. Another idea I just had: what if, instead of arbitrarily restricting Neural-Style's layer channels to specific values, we could restrict them to what matches the most likely label? Is it possible to get a list of layers and their filters using the above code? Edit: I figured it out, it was really simple:
I also see your neural_mirage5.lua does not resize the image before making the predictions? Is the above method basically the same as yours, only it uses torch.max to get the label with the highest predicted probability, whereas your code checks every label and creates a top 5 set of labels? |
@htoyryla For the mask image you created with relu1_1, I guess that particular filter was looking for a "sky texture"? And in the context of our conversations here, "filter" and "layer channel", are the same thing, right? |
Not really, the lowest levels cannot detect complex entities like "sky", they simply act as basic convolutional filters. It could be that it detects a certain color.
Yes. Functionally they are filters. In neural-style/torch terms, a channel in a layer. |
Yes. It shows the top 5 labels. In addition, when neural-mirage creates a new image, the target is the complete set classification probabilities, not only a single class with the highest probability. It tries to create an image that gives the same mix of label probabilities. Note also that neural-mirage modifies the model (add an adaptive pooling layer between the conv and FC layers) so that the FC layers can be used with images of varying size. Therefore no resize is used before prediction.
No. One has to look at each filter at each layer to see which activations are essential. Perhaps one could follow the gradients from the classification downward to see which filters contribute more and which less. That's not something I am familiar with. Anyway, also the lower activations may be significant and dropping those filters may change the results. |
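The difference between taking only the single most probable class and listing the top 5 can be sketched as follows. This is a minimal illustration in plain Python, not the Lua/Torch code from either script; the function names and the probability values are made up. It also shows why no label list is needed for the first case: the result is just an index into the output vector.

```python
def best_class(probs):
    """Return the index of the single most probable class (what an argmax gives)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k(probs, k=5):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical classifier output for six classes (values are invented).
probs = [0.05, 0.40, 0.10, 0.25, 0.02, 0.18]

print(best_class(probs))   # index only; a label list is optional
print(top_k(probs, k=3))   # several candidates with their probabilities
```

A label list only becomes necessary when the index has to be turned into a human-readable name.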
This is the (quick & dirty) code I used to make an average activation map from a given layer. I am simply taking the CxHxW output from the layer and summing the different channels, which gives an HxW tensor, then normalizing it to the 0...255 value range for display. If I remember correctly, the modified part starts at the line local fmaps = net:forward(img)
|
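The channel-combining step described above (CxHxW in, HxW out, scaled to 0...255) can be sketched like this. It is a plain-Python illustration on nested lists, not the convis modification itself; `combine_channels` and its `mode` argument are hypothetical names, and the "max" mode is included only because taking the maximum was suggested earlier in the thread.

```python
def combine_channels(fmaps, mode="sum"):
    """Collapse a C x H x W nested list into an H x W map scaled to 0..255."""
    height, width = len(fmaps[0]), len(fmaps[0][0])
    reduce_fn = max if mode == "max" else sum
    combined = [
        [reduce_fn(channel[y][x] for channel in fmaps) for x in range(width)]
        for y in range(height)
    ]
    lo = min(min(row) for row in combined)
    hi = max(max(row) for row in combined)
    span = (hi - lo) or 1.0  # avoid division by zero when the map is flat
    return [[round(255 * (v - lo) / span) for v in row] for row in combined]


# Two tiny 2x2 "feature maps" with invented activations.
fmaps = [
    [[0.0, 1.0], [2.0, 0.0]],
    [[0.0, 3.0], [0.0, 1.0]],
]
print(combine_channels(fmaps))          # summed channels
print(combine_channels(fmaps, "max"))   # per-pixel maximum instead
```

Because of the min-max normalization, summing and averaging the channels give the same displayed map.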
In this comment here, I noted that the FCN-32s PASCAL model creates grey rectangle artifacts. Image size 512: Image size 1536: I used a modified version of your convis.lua: https://gist.github.com/ProGamerGov/8f0560d8aea77c8c39c4d694b711e123 Then I just averaged all the layer output together with:
Do you think that this has something to do with the artifacts? None of the other models I tested have anything like this, and the angles match the artifact's angles. |
You mean the added frame around the image. I think that comes from the 100 pixel padding used in the model, see https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/voc-fcn32s/val.prototxt#L27 I think there are ways to modify the model to remove the padding, I haven't done exactly this kind of operation though. It is probably easier to try modifying style loss modules to remove the padding before calculating the Gram matrix. Almost started trying this but it was not as straightforward either: one has to adjust to how the size of the feature maps changes in different layers. |
In fact it is quite easy to remove the padding. Load the model into th, take the first layer and set padH and padW to zero (for instance). But one cannot save into a caffemodel from torch. I guess there are tools in caffe to do this though, but I haven't used them. But one can do this at runtime like this:
|
@htoyryla I'm just curious if the padding is somehow the cause of the artifacts I experience. If it is, then I wonder what other parts of a model may cause artifacts. If parts of other models do cause artifacts, then maybe they can be removed by editing Neural-Style, or the model itself. Also, do you have any idea where I should start if I want to record information about the individual filters and their activations so that I can generate a list of usable layer channels? Could I use your convis tool to generate all the images for each filter/channel, and repeat that on multiple images of a specific category. Then I could run some sort of analysis on those channel/filter images for light and dark pixels. Would this be a viable idea? I imagine that more bright pixels equals better/more activations for each filter? |
Just modify neural-style to remove the padding by adding these lines
and see if it makes a difference. |
You seem to be asking for a simple way to do something which is quite complex. Yet, on second thought, the most relevant channels are probably those with the strongest activations for the relevant images (both content and style). We could feed in an image and calculate some statistics on each channel, and then list the channels with the strongest activations. Mere average would be too crude: it dismisses channels with strong activations within a smaller area. But it could be a way to start. Or taking the maximum. One can then try to find a better formula to measure the activations. Perhaps something like the number of pixels with activation above a threshold? Let's see if I can try this approach, seems interesting to try. In fact, if one can define a criterion for dropping a channel, based on low activations from the style image, one can do it automatically. Just give a threshold and the style loss calculation will ignore channels which do not respond well enough to the style image. |
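The three candidate measures mentioned above (average, maximum, number of pixels above a threshold) can be compared with a small sketch. This is plain Python on nested lists with invented activations; the function names, the threshold value and the example data are assumptions for illustration only.

```python
def channel_stats(channel, threshold):
    """Return (mean, max, count above threshold) for one H x W activation map."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    return mean, max(values), sum(1 for v in values if v > threshold)


def rank_channels(fmaps, stat_index, threshold=0.5):
    """Return channel indices sorted from strongest to weakest by one statistic."""
    stats = [channel_stats(ch, threshold) for ch in fmaps]
    return sorted(range(len(fmaps)), key=lambda c: stats[c][stat_index], reverse=True)


MEAN, MAX, COUNT = 0, 1, 2

fmaps = [
    [[0.6, 0.6], [0.6, 0.6]],  # moderate response everywhere
    [[2.0, 0.0], [0.0, 0.0]],  # strong response in a small area
    [[0.1, 0.1], [0.1, 0.1]],  # weak response everywhere
]

print(rank_channels(fmaps, MEAN))   # the localized channel 1 falls behind channel 0
print(rank_channels(fmaps, MAX))    # the localized channel 1 comes first
print(rank_channels(fmaps, COUNT))  # counts pixels above the threshold
```

In this toy case the average ranks the localized channel below the evenly moderate one, which is the crudeness noted above, while the maximum ranks it first.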
Try this https://gist.github.com/htoyryla/49cb3ab0864d2a12f558631c7b3d87a3 Give a layer and an image (for use with neural-channels it should probably be the style image) and you get a list of the channels which might be the most suitable. Param nc specifies how many are listed. My neural-channels.lua is not the best way to make use of this anymore, as in practice it works only with a single style layer (because you cannot make channel selections per layer). It would probably be best to include this "channel pruning" into neural style, so that when the style target is captured, each style loss module evaluates which are the best channels and then uses only those when calculating loss. Seems quite straightforward. |
Here's hopefully a working version, that tests the model during style capture and selects, per style layer, nc channels with the strongest activations (as measured by torch.norm of the channel output). These channels are then favored during iterations, similar to how the earlier neural_channel worked. The rest of the channels are not ignored totally (as this would stop the iterations from working) but given lower weight. Remember that when decreasing nc, you need to increase style_weight yourself to keep the same content-style balance. https://gist.github.com/htoyryla/b7940d31d329ee6ffb67b3185f414b8e |
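The selection step described above (measure each channel by its norm, favor the nc strongest, keep the rest at a low weight) could look roughly like this. It is a plain-Python sketch, not the code in the gist: the function names and the two weight values are assumptions, and how the weights are applied inside the style loss module is left out.

```python
import math


def channel_norms(fmaps):
    """L2 norm of each H x W channel in a C x H x W nested list."""
    return [math.sqrt(sum(v * v for row in ch for v in row)) for ch in fmaps]


def emphasis_mask(norms, nc, strong_weight=1.0, default_weight=0.01):
    """Per-channel weights: strong_weight for the nc largest norms, default_weight otherwise.

    Both weight values are illustrative assumptions, not the values used in the gist.
    """
    strongest = sorted(range(len(norms)), key=lambda c: norms[c], reverse=True)[:nc]
    mask = [default_weight] * len(norms)
    for c in strongest:
        mask[c] = strong_weight
    return mask


norms = [3.0, 0.5, 5.0, 1.0]          # invented per-channel norms
print(emphasis_mask(norms, nc=2))     # channels 0 and 2 are favored
print(sum(emphasis_mask(norms, 2)))   # total weight shrinks as nc shrinks
```

The sum of the mask gives a rough feel for why a smaller nc calls for a larger style_weight: less of the layer contributes at full strength.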
I'm noticing that the loss values with 10 of the best channels for each layer with I had a theory that channels/filters with strong activations result in a high degree of stylization while channels/filters with weak activations result in a low degree of stylization. I first noticed this clearly (I had suspicions about it from The difference is especially apparent on channel 106 (left), and channel 184 (right), where this was the input image: While the inception5h model used in Protobuf Dreamer uses the inception architecture and not the VGG architecture that Neural-Style uses, I suspect that the two are similar with regard to these high and low activation channels. Playing around with neural-channels.lua, it looked like I could influence the degree of stylization by only changing the channel values. While testing my fine-tuned models, I noticed what appeared to be a similar effect:
What's interesting here, is that the degree of stylization is less with one style image, and more with another style image. The parameters never changed, but the channels/filters in the model did. I think this also backs up my theory. Because different channels have different activation strengths, I wonder what would happen if instead of giving the strongest channels a higher weighting, we instead tried to make every channel equal to every other channel regardless of activation intensity. Like for example, we gave the weakest channels higher weights relative to the strongest channels. |
For convis, I noticed that the Illustration2vec model's activations resemble the model's "style". Compared to other models, the Illustration2vec model transfers styles with a very distinct anime style of its own. It seems to "see" every input image in an anime style. This is most apparent on input images with faces (especially the eyes). |
I wonder how well placing an emphasis on the best content layer channels in addition to the best style layer channels would work? How would just placing an emphasis on the content layers compare to just placing an emphasis on the style layers? I think I got |
I tried using:
And it did not stop the artifacts from the FCN-32s PASCAL. |
These are the results from my experiments with This is the result from using style and content channels, in addition to the default channel weighting:
These are the results from using style and content channels, and custom channel weighting values: https://imgur.com/a/LVekL And this was the control test: https://i.imgur.com/kFUEZK0.png
These results are certainly interesting, but I am having a hard time quantifying the differences in a meaningful way that makes sense based on the chosen parameters. Things will probably become more clear as I experiment with other style images, content images, and models. |
This line defines the default weight of the channels. The weight of the selected channels is set here:
Ideally, I think, one would set the default weight to zero. When I was testing the original neural_channels, however, the iterations failed if the default weight was zero. The matrix became too sparse, I guess. But then I was testing with a single channel. With nc=10 I guess the default weight could be much smaller, like your experiment shows. Remember also that tampering with channel weights changes the effective style weight, and so does changing nc, too. Which makes testing a bit uncertain. |
This could be an interesting experiment, but the results could be quite erratic: we would be emphasising features NOT found in the images! |
I had to add the mode captureS to make sure that the styleLoss module captures the best channels from the style image, not from the content image. I think in the contentLoss module this danger does not exist. But interesting idea... ignoring all but the strongest content features. |
Ouch... there is a bug in neural_bestchannels.lua so that no channels actually get emphasis. So the only thing that happens is that the style weight is decreased. https://gist.github.com/htoyryla/b7940d31d329ee6ffb67b3185f414b8e#file-neural_bestchannels-lua-L530 This line should be
I noticed this when I tested decreasing default channel weight to 1e-2 and then increasing the emphasis channel weight, to no effect on the losses. After the correction, when changing nc from 4 to 10, the effect on the losses is dramatic (having the same effect as increasing style weight). |
Yea, I was looking over that part of the code earlier and wondering if it was indeed a bug. It looks like I did fix it myself, but that fix wasn't actually in the script I used to create the above examples... I couldn't find the |
One observation: this method of using mainly nc channels per layer now appears to favor relu1_x layers, which now have the highest loss values, while previously I think relu3_x was the strongest. This is probably because relu1_x has fewest channels, so dropping most channels off has a smaller effect than on higher levels. But it might be good to test also without relu1 layers. |
In neural_channels style_channels contained the layers given in the parameter style_channels, which was now replaced by the automatically detected best channels. I had only overlooked this if statement. |
One might calculate the average norm for the channels, and then populate the channel mask with multipliers: (average norm / channel norm). This would in effect make all channels equally strong. Should be easy to implement, although the effect could be strange: we would be favoring features not present in the style model. Meanwhile I made a simpler test: by just adjusting the code for inputMask as follows:
we can suppress a few of the strongest channels, while still keeping close to the original style. For instance I like this result using the defaults but suppressing 8 strongest channels: simple, not too much detail. |
This https://gist.github.com/htoyryla/072e1f0475eebc9a4dfc0c011498da9c does nothing dramatic, as far as I can see. It does not (as I may have thought) bring out features which are not in the style. I was thinking wrong: the process is still moving towards the style target. But what this may do is make finding the target more difficult, as those channels which contribute more to this style are attenuated. The weights affect how the steering wheel works, and we modify the weights to favor turns away from the target? Which makes me think: could we make the search faster by doing the reverse, amplifying the already strong channels? I guess not... like when you increase the learning rate, you are likely to miss the target. Which again can be compared to turning the steering wheel too much each time. |
I'm getting NANs from
We can only use the maximum number of channels in the lowest layer right now. But maybe we could counteract some of this favoritism by treating the layer normally when the number of channels is larger than what the layer has. |
Using the modified The weighting works correctly now as well it seems: The control test: Using different amounts of channels with the |
Don't immediately see how that would work, but never mind. You are free to try it. I was thinking rather that, so that the effect on effective style weight would be the same in all layers, one would suppress a given proportion of channels. E.g. nc would be given 1..64, and then it would be multiplied by C/64. I'll have a look at neural-equalchannels in a moment. |
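The proportional scheme suggested above (nc given in the range 1..64, then multiplied by C/64 for a layer with C channels) is simple arithmetic. Here is a plain-Python sketch; `scaled_nc` is a hypothetical helper, the integer rounding is an assumption, and the channel counts are those of the standard VGG-19 relu layers.

```python
def scaled_nc(nc, num_channels, reference=64):
    """Scale an nc chosen for a 64-channel layer to a layer with num_channels channels."""
    if not 1 <= nc <= reference:
        raise ValueError("nc must be in 1..reference")
    return nc * num_channels // reference


# Channel counts of the usual VGG-19 style layers.
layers = [("relu1_1", 64), ("relu2_1", 128), ("relu3_1", 256), ("relu4_1", 512)]
for name, channels in layers:
    print(name, scaled_nc(8, channels))
```

With nc = 8, the same proportion (one eighth) of the channels is affected in every layer, so the effect on the effective style weight should be more even across layers.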
I downloaded neural-equalchannels from the gist (under a different name) and gave the command
and it iterates nicely. Using adam works too. But it can well be that it will not work with all models or in all cases. After all, equalizing the activations from all channels is a quite extreme idea. |
Suppressing the stronger channels creates a result that looks a bit more like fast style transfer, especially in that last example you posted. I've only been messing around giving emphasis to the strongest layers. How well do the values work in your suppression code that was shared earlier in an above comment? |
Personally, I feel I never got fast-neural-style to give anything this close to my styles. I am often after styles that are not too detailed, even towards abstract, and neural-style is not so good at it, and fast-neural-style was much worse. Now suppressing stronger channels looks promising. Here's my inputMask() for suppressing strongest channels. I usually set nc = 1 ... 10, at times 24 or 32. These values were intended for low values of nc. That's why I changed 5 to 2 for suppressing nc channels... not to upset the style-content balance too much.
|
Just to make sure... "suppressing weaker channels" is neural-bestchannels.lua as it is now in gist. The reverse approach would be suppressing stronger channels, with inputMask as in the comment above. |
I meant suppressing stronger channels. |
The last example with suppressing stronger channels is with the lowest style weight. I think that makes it similar to fast-neural-style (with which one gets mainly color and texture effects while the shapes are not much affected... at least my impression of it). I guess I did not try suppress weak channels with as low style weight at all. So to make a comparison ignore that example. |
Here are the equalized content and style layer channel results: I'm not sure what to say about the equalization results, but they are certainly different than all the previous tests. And here's what happened when I suppressed the top 50 strongest channels on each layer for both the content and style layers: I find it interesting that suppressing the top 50 strongest channels helped the moon be transferred from the style image in a more complete form than in the previous experiments. (I hope that posting the images in this way, where you can click on them in order to get the full size, is better than creating really long/large comments filled with images) |
So all the "equalized" results that I have created seem to have a flaw in the code that allowed for the loss values to not be NANs. I don't know what it is, but the equalization does not seem to work for me. I'll have to play around with the parameters and see if that's the cause. Edit: I think one or both of these parameters are the cause:
Removing both of them results in:
These repeating loss values are like the other issue I had earlier, but when I use |
I am not surprised that equalizing channels produces NaNs, because channels that do not respond to the image at all, or only very weakly, are also pushed up to the same level as the strongest. I never thought that equalization makes sense, but tried it anyway. What might work better is first suppressing sufficiently weak channels and then equalizing. PS. noticed that you actually wrote "to not be NANs", which I do not understand, but anyway, pushing up even the channels that see nothing of interest is not very good for optimization. Maybe there is indeed a bug that allows it to work at all. |
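The idea of suppressing sufficiently weak channels first and then equalizing the rest with (average norm / channel norm) can be sketched as follows. This is a plain-Python illustration, not the code of neural-equalchannels: the name `equalizing_mask`, the `min_norm` cutoff, and the choice to average only over the kept channels are all assumptions.

```python
def equalizing_mask(norms, min_norm=0.0):
    """Multipliers (average norm / channel norm); channels at or below min_norm get 0.

    The average is taken over the kept channels only, which is an assumption.
    """
    kept = [n for n in norms if n > min_norm]
    if not kept:
        return [0.0] * len(norms)
    average = sum(kept) / len(kept)
    return [average / n if n > min_norm else 0.0 for n in norms]


# Invented norms: two responsive channels, one nearly silent, one silent.
norms = [4.0, 2.0, 1e-6, 0.0]

print(equalizing_mask(norms))                # the nearly silent channel gets a huge multiplier
print(equalizing_mask(norms, min_norm=0.5))  # weak channels are dropped before equalizing
```

Without a cutoff, a nearly silent channel is amplified by a factor in the millions and a completely silent one would require dividing by zero, which is one plausible way for an equalized run to end up with NaN losses.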
A while back I was using equal content and style weights in order to see what artifacts a particular content or style layer would produce. I found that for the VGG-16 SOD Finetune model, two style layers in particular: Some examples: https://i.imgur.com/wQlvFml.jpg https://i.imgur.com/YbQrwXj.png For the NIN model, I found that using For your goal of having less detail and a "simpler" look to your outputs, it might be useful to try and eliminate the "high noise" layers from the |
I wrote "not to be NANs" because when I fixed the issue, my parameters resulted in NANs. This was the flaw in my code:
And it seems that the following examples had this flaw:
I think what happened was I copied the original flawed code for style channel equalization, for subsequent experiments using style channel equalization. All of the tests that did not try to equalize the style channels, don't have the flaw in their code. It is curious that style channel equalization is resulting in NANs for me, while content channel equalization is not. |
Also, using multiple style images with channel prominence (top 50), and equalized content channels results in a familiar gray haze: So the optimal parameters have changed with respect to Adam (I was using the default parameters, not the better ones I discovered). I'm not sure if this was just the result of the input images like it has been in the past, or if using multiple style images had something to do with it. The loss values went down a lot slower than they should have with my highly optimized set of parameters. Edit: This occurred when I was trying to suppress the top 50 channels using your |
When I say "simple", I am actually thinking of the Finnish word "pelkistää", which the dictionary translates as "simplify" but which actually means "reduce to the bare essentials". I don't believe that can be done through layer selection. The essentials can include both high-level and low-level features. Channel selection, maybe, not sure even about that. |
Just something that comes to my mind. I have not tried content channel equalization, so I do not really know, but there was something tricky about capturing the channel norms in style modules, as during the capture phases, the module gets both content and style images, one by one, as input. That's why I had to introduce the mode captureS, in order to capture the channel norms using the style image. Otherwise one also would run into size mismatches. For content loss modules, captureS mode should be irrelevant. |
When using convis on a model's higher level layers, a large number of individual images are produced. This makes trying to view the image as the model sees it very impractical.
So I was wondering about the practicality of having a heat map that combines all of the images into a single false color image?