# ImageNet-Trained CNNs Are Biased Towards Texture

## TLDR
- A common intuition/hypothesis has been that CNNs recognize objects by learning increasingly complex representations of object shape.
- Recent studies show that texture actually plays a larger role, at least in the cases examined.
- They test this by constructing a dataset in which texture is a less reliable signal.
- This is not an inherent "flaw" of CNNs: a ResNet-50 trained on a specialized dataset where texture is not a reliable signal learns to rely much more on shape.
- Training with this in mind can improve robustness.


## Introduction
- Comparison with human visual perception:
    - Humans rely more on shape information.
- Findings in other works:
    - ImageNet can be solved very well using texture information alone (a linear classifier on a Gram-matrix texture representation).
    - CNNs do badly when texture information is removed.
    - CNNs do well (on ImageNet) even when the receptive field size is constrained (BagNet paper), i.e. larger-scale structure cannot be modeled.
    - CNNs do well when texture remains but global structure is destroyed / altered to a large degree.
    - "A cat with an elephant texture is an elephant to CNNs, and still a cat to humans"
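The Gram-matrix texture representation mentioned above (from Gatys et al.'s texture synthesis work) summarizes a CNN feature map by its channel-by-channel correlations, summed over all spatial positions. Below is a minimal NumPy sketch, assuming the (C, H, W) feature maps have already been extracted from some CNN. The network, the choice of layers, the 1/(H*W) normalization and the helper `texture_descriptor` are illustrative assumptions, not the exact setup of the cited work; a linear classifier would then be trained on these descriptors.

```python
import numpy as np


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlations of a (C, H, W) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)  # (C, N), one column per spatial position
    # Summing over positions discards *where* features occur and keeps only
    # *which* features co-occur, i.e. texture statistics.
    # Normalizing by N = H * W is an assumption of this sketch.
    return flat @ flat.T / (h * w)


def texture_descriptor(feature_maps: list) -> np.ndarray:
    """Hypothetical helper: concatenate Gram matrices from several layers.

    Only the upper triangle is kept because each Gram matrix is symmetric.
    """
    parts = []
    for features in feature_maps:
        g = gram_matrix(features)
        parts.append(g[np.triu_indices(g.shape[0])])
    return np.concatenate(parts)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((8, 5, 5))  # stand-in for real CNN activations
    # Randomly shuffle spatial positions, destroying all global structure.
    perm = rng.permutation(5 * 5)
    shuffled = feats.reshape(8, 25)[:, perm].reshape(8, 5, 5)
    # The Gram matrix is unchanged: it cannot "see" the spatial arrangement.
    print(np.allclose(gram_matrix(feats), gram_matrix(shuffled)))  # True
    print(texture_descriptor([feats, rng.standard_normal((16, 3, 3))]).shape)  # (172,)
```

The shuffling demo illustrates why such a representation is texture-only: any rearrangement of spatial positions, including one that destroys object shape entirely, yields the same descriptor.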


## Method
- Experiments comparing humans and CNNs on 6 different classification tasks based on a 16-class version of ImageNet.
    - Care was taken to make the comparison as fair as possible (time limits etc.). I assume they know what they're doing.
- Experiments were the following:
    - Original
    - Grayscale
    - Silhouette: only the silhouette of the ground-truth object.
    - Edges: only the edges from an edge detector.
    - Texture: images containing only texture, e.g. elephant skin.
    - Cue conflict: images generated using iterative style transfer, with texture images as the style images.
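For the cue-conflict images, results are summarized as a shape bias: among the decisions that match either the shape class or the texture class, the fraction that match the shape class. A minimal sketch of that computation follows. The `(predicted, shape_class, texture_class)` tuple format is an assumption, and so is skipping stimuli whose shape and texture classes coincide (such images carry no actual conflict).

```python
from typing import Iterable, Tuple


def shape_bias(decisions: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of shape decisions among shape-or-texture decisions.

    Each decision is (predicted_class, shape_class, texture_class) for one
    cue-conflict image.
    """
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_cls, texture_cls in decisions:
        if shape_cls == texture_cls:
            continue  # assumed: no actual cue conflict, so not informative
        if predicted == shape_cls:
            shape_hits += 1
        elif predicted == texture_cls:
            texture_hits += 1
        # Predictions matching neither cue count towards neither side.
    total = shape_hits + texture_hits
    if total == 0:
        raise ValueError("no decisions matched either the shape or the texture class")
    return shape_hits / total


if __name__ == "__main__":
    decisions = [
        ("cat", "cat", "elephant"),       # shape decision
        ("bear", "bear", "clock"),        # shape decision
        ("knife", "knife", "bottle"),     # shape decision
        ("elephant", "cat", "elephant"),  # texture decision
        ("dog", "cat", "elephant"),       # neither; ignored
    ]
    print(shape_bias(decisions))  # 0.75
```

A value near 1 means decisions follow shape (as for humans), near 0 means they follow texture.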

## Results
![fig2](figs/cnn-texture-bias-fig2.png)

![fig4](figs/cnn-texture-bias-fig4.png)

### Overcoming Texture Bias
- The idea is to alter the training task: if the classification task can be solved using texture information alone (as was shown for ImageNet), the CNN will become biased towards texture.
- Stylized-ImageNet (SIN): a new dataset created by applying AdaIN style transfer to the original images, with different paintings as style images.
    - Note that they used different style-transfer algorithms for SIN and for the cue-conflict stimuli, so the results do not rely on one particular method.
    - Training and evaluation on SIN gives worse performance than training and evaluation on ImageNet (IN).
    - Training on SIN and evaluating on IN works much better than the reverse (training on IN, evaluating on SIN).
    - The shape bias is stronger after training on SIN.
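AdaIN (adaptive instance normalization, Huang & Belongie) transfers style by re-normalizing the content features so that each channel takes on the mean and standard deviation of the corresponding style-feature channel: `AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y)`. Only that normalization step is sketched below in NumPy; in the full method it sits between a pretrained VGG encoder and a learned decoder, which are omitted here. The `eps` value is an assumption of this sketch.

```python
import numpy as np


def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive instance normalization on (C, H, W) feature maps.

    Content and style may have different spatial sizes; statistics are
    computed per channel over the spatial axes.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = np.sqrt(content.var(axis=(1, 2), keepdims=True) + eps)  # eps avoids division by zero
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = np.sqrt(style.var(axis=(1, 2), keepdims=True) + eps)
    # Whiten each content channel, then give it the style's per-channel statistics.
    return s_std * (content - c_mean) / c_std + s_mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.standard_normal((4, 8, 8))           # stand-in for encoder features
    style = 3.0 * rng.standard_normal((4, 6, 6)) + 2.0
    out = adain(content, style)
    print(out.shape)                                                    # (4, 8, 8)
    print(np.allclose(out.mean(axis=(1, 2)), style.mean(axis=(1, 2))))  # True
```

Because only first- and second-order channel statistics are swapped, the spatial layout of the content features is kept, while the per-channel statistics (the "style") come from the painting.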
    
![fig3](figs/cnn-texture-bias-fig3.png)
    
![table1](figs/cnn-texture-bias-table1.png)
    
### Robustness 
- Classification performance
    - Improvements are possible, suggesting that SIN training might be a good data augmentation technique.
- Transfer learning
    - Improved results when used as the backbone for an object detection task (in line with intuition).
- Robustness against distortions
    - Improved robustness against different noise types compared to IN-trained CNNs.
    - Higher shape bias seems to mean more robustness to different types of noise.
    - Close to human level.

![table2](figs/cnn-texture-bias-table2.png)


## Notes / takeaways
- Useful if domain knowledge says shape should be more important.
- Perhaps a good data augmentation technique in general.
    - For both standard classification and transfer learning (object detection).
- Useful for the NSFW task? Leather-sofa false positives might indicate that texture bias is present.
