
#CNN's

http://cs231n.github.io/convolutional-networks/

CNNs are similar to regular neural networks (NNs). The whole net still expresses a single differentiable score function, from raw image pixels to class scores.

Regular nets don't scale well to full images. In CIFAR-10, images are 32x32x3 (32 high, 32 wide, 3 color channels), so a single FC neuron in a first hidden layer would have 32\*32\*3 = 3072 weights plus one bias, i.e. 3073 parameters. Larger images increase this quickly: the weight count grows with the number of pixels (quadratically in the side length), e.g. a 200x200x3 image gives 120,000 weights per neuron. This full connectivity with the input is wasteful, and the huge number of parameters would quickly lead to overfitting.
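
To make the scaling concrete, here is a small sketch that counts the parameters of one fully connected neuron for a few image sizes. The helper name `fc_neuron_params` and the sizes other than 32 are illustrative choices, not part of the course notes.

```python
def fc_neuron_params(height, width, channels):
    """Parameters of ONE fully connected neuron: one weight per input value plus one bias."""
    return height * width * channels + 1

# Square RGB images of increasing side length (64 and 224 are illustrative sizes).
for side in (32, 64, 224):
    print(side, fc_neuron_params(side, side, 3))
# 32 3073
# 64 12289
# 224 150529
```

Doubling the side length roughly quadruples the count, and this is for a single neuron only.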

3D volumes of neurons: Unlike regular NNs, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Depth here is the third dimension of an activation volume, not the depth of a full NN, which refers to the total number of layers in a network.) E.g. the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth).

The neurons in a layer are only connected to a small region in the layer before. The final output layer for CIFAR-10 will have dimensions 1x1x10, because by the end of the ConvNet we will reduce the full image to a single vector of class scores, arranged along the depth dimension. 

![alt text](http://cs231n.github.io/assets/nn1/neural_net2.jpeg)

##Layers used to build ConvNets

A simple convnet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three types of layers to build a Convnet: Convolutional Layer, Pooling Layer and FC Layer, stacked together. 

Example architecture: Input - Conv - Relu - Pool - FC. In detail:

- Input (32x32x3) holds the raw pixel values: height x width x color channels.
- Conv layer computes the output of neurons that are connected to local regions of the input, each computing a dot product between its weights (kernel/filter) and the small region of the input it is connected to. This may result in a volume such as 32x32x12 if we use 12 filters.
- Relu layer applies an elementwise activation function such as max(0, x), thresholding at 0. This leaves the size of the volume unchanged.
- Pool layer performs a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as 16x16x12.
- FC layer computes the class scores, resulting in a volume of size 1x1x10, where each of the 10 numbers corresponds to a class score, e.g. for the 10 classes of CIFAR-10.

ConvNets transform the original image to class scores. CONV/FC layers perform transformations that are a function of not only the activations in the input volume but also the parameters (weights & biases of neurons). RELU/POOL layers implement a fixed function. The parameters in CONV/FC will be trained with gradient descent.

A ConvNet is a list of layers that transform the image volume into an output volume. Takes a 3D volume and outputs a 3D volume. CONV/FC have params and RELU/POOL don't. 

####Convolutional Layer

The conv layer's params consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).

During the forward pass, we slide (convolve) each filter across the width and height of the input and compute dot products between the entries of the filter and the input at every position. As we slide the filter over the width and height of the input volume we produce a 2-D activation map that gives the responses of that filter at every spatial position. The network will learn filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers. We have a separate set of filters in each CONV layer, and each filter produces a separate 2-D activation map. We stack these activation maps along the depth dimension to produce the output volume.

**Local Connectivity**: as mentioned, it's not practical to connect neurons to all neurons in the input. Instead, we connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (its filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. Connections are local in space (height, width) but full along the depth of the input.

Example 1: suppose the input is 32x32x3. If the filter size is 5x5, then each neuron in the Conv Layer will have weights to a 5x5x3 region in the input volume, for a total of 5x5x3 = 75 weights + 1 bias parameter.

Example 2: suppose the input is 16x16x20 and the filter size is 3x3. Every neuron will then have 3x3x20 = 180 connections to the input volume. The connectivity is local in space (3x3) but full along the input depth (20).

![alt text](http://cs231n.github.io/assets/cnn/depthcol.jpeg)

Left: the input image is 32x32x3, shown with an example volume of neurons in the first conv layer. Notice there are 5 neurons along the depth, all looking at the same region in the input.

![alt text](http://cs231n.github.io/assets/nn1/neuron_model.jpeg)

The neurons from the NN remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.

**Spatial arrangement**: We haven't discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the depth, stride and zero-padding. 

1. First, the depth of the output is a hyperparameter: it corresponds to the number of filters we would like to use, each learning to look for something different in the input. For instance, if the first conv layer takes the raw image as input, then different neurons along the depth may activate in the presence of various oriented edges, or blobs of color. We refer to a set of neurons that are all looking at the same region of the input as a depth column.

2. Second, the stride with which we slide the filter. With stride 1 we move the filters one pixel at a time. With stride 2 the filters jump 2 pixels at a time as we slide them, producing spatially smaller output volumes.

3. Sometimes it's convenient to pad the input volume with zeros around the border. The size of this zero-padding is a hyperparameter. It allows us to control the spatial size of the output volumes (most commonly to preserve the spatial size of the input volume, so the input and output width and height are the same).

We can compute the spatial size of the output volume as a function of the input volume size ($W$), filter size ($F$), stride ($S$) and zero-padding ($P$). 

The formula for how many neurons fit along one spatial dimension is $(W - F + 2P)/S + 1$. E.g. a 7x7 input and a 3x3 filter with stride 1 and pad 0 would give a 5x5 output. With stride 2 we get a 3x3 output.

![alt text](http://cs231n.github.io/assets/cnn/stride.jpeg)

(only one spatial dimension in these examples)
The left shows stride of 1, giving an output of size (5 - 3 + 2)/1+1=5. The right uses stride 2 so the output (5 - 3 + 2)/2 + 1 = 3. 

Use of zero-padding: In the left example, note the input dimension was 5 and the output dimension was also 5. This worked out because our filter size was 3 and we used zero padding of 1. If there was no zero padding then the output would have had a spatial dimension of only 3. In general, setting zero padding to be $P = (F - 1)/2$ when stride is $S=1$ ensures that the input volume and output volume will have the same size spatially.

Constraints on strides: the spatial arrangement hyperparameters have mutual constraints. E.g. when the input has size $W = 10$, the filter size is $F = 3$ and no zero-padding is used ($P = 0$), it is **impossible** to use stride $S=2$, since $(W - F + 2P)/S + 1 = (10 - 3 + 0)/2+1=4.5$, i.e. not an integer, indicating that the neurons don't fit neatly and symmetrically across the input. This setting of the hyperparameters is invalid, and a ConvNet library could throw an exception, zero-pad the rest to make it fit, or crop the input.
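
The output-size formula and the stride constraint can be checked with a few lines of Python. This is a minimal sketch: the function name `conv_output_size` and the choice to raise a `ValueError` on a non-fitting stride are assumptions made for illustration (a real library might pad or crop instead). The failing case uses a filter size of 3, which is what the 4.5 result above implies.

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size (W - F + 2P)/S + 1 along one dimension.

    Raises ValueError when the filter does not tile the padded input evenly.
    """
    span = W - F + 2 * P
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % S != 0:
        raise ValueError(f"stride {S} does not fit: ({W} - {F} + 2*{P}) / {S} is not an integer")
    return span // S + 1

print(conv_output_size(7, 3, S=1))        # 5
print(conv_output_size(7, 3, S=2))        # 3
print(conv_output_size(5, 3, S=1, P=1))   # 5, padding P = (F - 1)/2 preserves the size
try:
    conv_output_size(10, 3, S=2, P=0)     # (10 - 3)/2 + 1 = 4.5, invalid
except ValueError as err:
    print("invalid:", err)
```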

**Parameter Sharing**: Parameter sharing scheme is used in Conv Layers to control the number of parameters. E.g. say we have 55x55x96=290,400 neurons in the first Conv Layer, each has 11x11x3 = 363 weights and 1 bias. This adds up to 290,400 * 364 = 105,705,600 parameters on the first layer alone. Very high! (Depth $K$ is set to 96)

We can drastically reduce the number of parameters by making one reasonable assumption: if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-D slice of depth as a depth slice (e.g. a volume of size 55x55x96 has 96 depth slices, each of size 55x55), we are going to constrain the **neurons in each depth slice to use the same weights and bias.** With this parameter sharing scheme, **the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice)**, for a total of 96\*11\*11\*3 = 34,848 unique weights, or 34,944 parameters once the 96 biases are added.

Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backprop, every neuron in the volume will compute the gradient for its weights, but these gradients are added up across each depth slice and only update a single set of weights per slice.
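
The arithmetic of this example can be reproduced directly. The variable names below are illustrative; the numbers are the ones used in the text (a 55x55x96 output volume and 11x11x3 filters).

```python
out_w, out_h, depth = 55, 55, 96   # output volume of the first Conv Layer
F, in_depth = 11, 3                # filter size and input depth

weights_per_neuron = F * F * in_depth                     # 363
neurons = out_w * out_h * depth                           # 290,400

# Without sharing: every neuron owns its own weights and bias.
params_no_sharing = neurons * (weights_per_neuron + 1)    # 105,705,600

# With sharing: one filter (and one bias) per depth slice.
shared_weights = depth * weights_per_neuron               # 34,848
params_shared = shared_weights + depth                    # 34,944

print(params_no_sharing, params_shared)
print(f"reduction factor: {params_no_sharing / params_shared:.0f}x")
```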

If all the neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can, in each depth slice, be computed as a convolution of the neuron's weights with the input volume. This is why it's common to refer to the set of weights as a filter or a kernel that is convolved with the input.

![alt text](http://cs231n.github.io/assets/cnn/weights.jpeg)

We set the depth to 96, so we get 96 depth slices, each with its own 11x11x3 filter. Sharing the parameters of each filter across positions means it learns to detect the same pattern everywhere in the image.
Each of the 96 filters here is of size 11x11x3, and each one is shared by the 55\*55 neurons in one depth slice. Note the parameter sharing assumption: if detecting a horizontal edge is important at some location in the image, then it should be useful at some other location as well, due to the translationally-invariant structure of images. Thus, we don't need to relearn to detect a horizontal edge at every one of the 55\*55 distinct locations in the Conv layer output volume.

The parameter sharing assumption may not always make sense. This is especially the case when the input images have some specific centered structure, where we expect that completely different features should be learned on one side of the image than on another. One example is when the inputs are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme and simply call the layer a locally-connected layer.

**Numpy examples:** Suppose the input volume is a numpy array X. Then:

- a depth column at position (x,y) would be the activations ```X[x,y,:]```
- a depth slice (commonly called an activation map) at depth d would be the activations ```X[:,:,d]```

Conv Layer Example: suppose the input volume X has shape X.shape: (11,11,4). Suppose further that we use no zero padding (P = 0), filter size F = 5 and stride S = 2. The output volume would then have spatial size (11-5)/2+1=4, giving a volume with width and height of 4. The first activation map in the output volume (call it V) would look as follows (only some of the elements are computed in this example):

```
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
```
In numpy, the * operation denotes elementwise multiplication between arrays. W0 is the weight vector of that neuron and b0 is its bias. W0 has shape W0.shape: (5,5,4), since the filter size is 5 and the depth of the input volume is 4. At each point we compute the dot product as in an ordinary neuron. We use the same weights and bias at every position (parameter sharing), and the indices along the width increase in steps of 2 (i.e. the stride). To construct the second activation map in the output volume:

```
V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1
V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1
V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1
V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1

V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1  # example of going along y
V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1  # example of going along both
```
We index into the depth dimension of V at index 1 because we are computing the second activation map, and a different set of parameters (W1) is now used.
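
The slicing pattern above generalizes to a loop over all output positions and filters. The following is a naive, unoptimized sketch: the function name `conv_forward_naive` and the `(K, F, F, D1)` filter layout are assumptions for illustration, and the random data stands in for real inputs and trained weights.

```python
import numpy as np

def conv_forward_naive(X, W, b, stride=1, pad=0):
    """Naive conv forward pass.

    X: input volume (W1, H1, D1); W: K filters (K, F, F, D1); b: biases (K,).
    Returns the output volume V with shape (W2, H2, K).
    """
    K, F, _, D1 = W.shape
    assert X.shape[2] == D1, "filters must span the full input depth"
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))   # zero-pad width and height only
    W2 = (X.shape[0] - F + 2 * pad) // stride + 1
    H2 = (X.shape[1] - F + 2 * pad) // stride + 1
    V = np.zeros((W2, H2, K))
    for k in range(K):                                  # one activation map per filter
        for i in range(W2):
            for j in range(H2):
                x0, y0 = i * stride, j * stride
                V[i, j, k] = np.sum(Xp[x0:x0 + F, y0:y0 + F, :] * W[k]) + b[k]
    return V

rng = np.random.default_rng(0)
X = rng.standard_normal((11, 11, 4))
W = rng.standard_normal((2, 5, 5, 4))   # two filters, W[0] and W[1] play the role of W0 and W1
b = rng.standard_normal(2)
V = conv_forward_naive(X, W, b, stride=2)
print(V.shape)  # (4, 4, 2)
```

With these shapes, `V[2, 3, 1]` equals `np.sum(X[4:9, 6:11, :] * W[1]) + b[1]`, matching the hand-written slices.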

**Summary**:

The conv layer: 

- Accepts a volume of size $W_1 \times H_1 \times D_1$
- Requires four hyperparameters: 
    - Number of filters $K$
    - their spatial extent $F$
    - stride $S$
    - amount of zero-padding $P$
- Produces a volume size $W_2 \times H_2 \times D_2$ where:
    - $W_2 = (W_1 - F + 2P)/S + 1$
    - $H_2 = (H_1 - F + 2P)/S + 1$ i.e. width and height are computed equally by symmetry
    - $D_2 = K$
- With param sharing, it introduces $F \cdot F \cdot D_1$ weights per filter for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases.
- In the output volume, the d-th depth slice of size $W_2 \times H_2$ is the result of performing a valid convolution of the d-th filter over the input volume with a stride of $S$, and then offset by d-th bias.

You can take advantage of the fact that the conv operation is just a dot product between filter and local region of the input. We can do the forward pass as one big matrix multiply:

1. Local regions in the input are stretched out into columns in an operation called im2col. E.g. if the input is 227x227x3 and it is convolved with 11x11x3 filters at stride 4, we take the 11x11x3 blocks of pixels in the input and stretch each block into a column vector of size 11\*11\*3 = 363. Iterating this over the input at stride 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix X_col of size 363x3025, where every column is a stretched-out receptive field and there are 55\*55 = 3025 of them in total. Since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.

2. Weights of conv layer stretched into rows. If 96 filters of 11x11x3 then this gives a matrix W_row of 96x363.

3. The result of a convolution is equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location. The output would be 96 x 3025, giving the output of the dot product of each filter at each location.

4. The result must be reshaped back to its proper output dims 55x55x96.

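This is memory expensive, since values in the input are replicated multiple times in X_col. The benefit is that efficient matrix-multiply implementations (e.g. BLAS) can be used. The same im2col idea can be reused for the pooling operation.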
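
The four steps above can be sketched in NumPy as follows. This is a loop-based illustration, not an optimized im2col: the helper name `im2col`, the column ordering, and the random input and filters are assumptions, and biases are omitted to keep the shapes in focus.

```python
import numpy as np

def im2col(X, F, stride):
    """Stretch every F x F x D receptive field of X (W1, H1, D) into one column."""
    W1, H1, D = X.shape
    W2 = (W1 - F) // stride + 1
    H2 = (H1 - F) // stride + 1
    cols = np.empty((F * F * D, W2 * H2))
    for i in range(W2):
        for j in range(H2):
            patch = X[i * stride:i * stride + F, j * stride:j * stride + F, :]
            cols[:, i * H2 + j] = patch.reshape(-1)
    return cols, (W2, H2)

rng = np.random.default_rng(0)
X = rng.standard_normal((227, 227, 3))
filters = rng.standard_normal((96, 11, 11, 3))

X_col, (W2, H2) = im2col(X, F=11, stride=4)   # step 1: (363, 3025)
W_row = filters.reshape(96, -1)               # step 2: (96, 363)
out = np.dot(W_row, X_col)                    # step 3: (96, 3025)
out = out.reshape(96, W2, H2).transpose(1, 2, 0)   # step 4: (55, 55, 96)
print(X_col.shape, W_row.shape, out.shape)
```

Each entry `out[i, j, k]` is the dot product of filter `k` with the receptive field whose top-left corner is at `(4*i, 4*j)`.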

Backprop: the backward pass for a conv op for both data and weights is also a conv but with spatially flipped filters. It is easy to derive in the 1-D case.

1x1 Convolution: If the input is 32x32x3, then 1x1 convolutions are effectively doing 3-dimensional dot products, since the input depth is 3.

Dilated Convolutions: One more hyperparameter, called dilation. So far we have seen filters that are contiguous, but it's possible to have spaces between each cell of the filter, called dilation. E.g. a 1-D filter w of size 3 would compute over input x the following: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]. This is a dilation of 0. For dilation 1 the filter would instead compute w[0]*x[0] + w[1]*x[2] + w[2]*x[4], i.e. there is a gap of 1 between the applications. This can be useful because it allows you to merge spatial information across the input more aggressively with fewer layers. E.g. if you stack two 3x3 conv layers on top of each other, then the neurons in the second layer are a function of a 5x5 patch of the input (the effective receptive field of these neurons is 5x5). If we use dilated convolutions then this effective receptive field grows much quicker.
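
A small 1-D sketch of the dilation idea, using the convention of these notes (dilation 0 means contiguous taps, dilation 1 means one skipped cell between taps). The function name `dilated_conv1d` and the toy input are illustrative. Some frameworks number dilation differently (e.g. treating 1 as the contiguous case), so the offset by one here is an assumption tied to the text above.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=0):
    """Valid 1-D filtering with `dilation` skipped cells between filter taps."""
    step = dilation + 1
    span = (len(w) - 1) * step + 1          # input cells covered by one placement
    n_out = len(x) - span + 1
    return np.array([
        sum(w[k] * x[i + k * step] for k in range(len(w)))
        for i in range(n_out)
    ])

x = np.arange(8, dtype=float)               # x[i] = i
w = np.array([1.0, 10.0, 100.0])

print(dilated_conv1d(x, w, dilation=0)[0])  # w[0]*x[0] + w[1]*x[1] + w[2]*x[2] = 210.0
print(dilated_conv1d(x, w, dilation=1)[0])  # w[0]*x[0] + w[1]*x[2] + w[2]*x[4] = 420.0
```

The same 3-tap filter covers 3 input cells at dilation 0 and 5 cells at dilation 1, which is why the receptive field grows faster.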

####Pooling Layer

It is common to insert a pooling layer between successive Conv layers. Its function is to progressively reduce the spatial size of the representation, which reduces the amount of parameters and computation in the network and helps control overfitting. The pooling layer operates independently on every depth slice of the input and resizes it spatially using the MAX operation. The most common form is a pooling layer with 2x2 filters applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX op then takes a max over 4 numbers (a 2x2 region). The depth remains unchanged. The pooling layer:
- Accepts a volume of size $W_1 \times H_1 \times D_1$
- Has 2 hyperparams: spatial extent $F$, the stride $S$
- Produces a volume of size $W_2 \times H_2 \times D_2$ where:
    - $W_2 = (W_1 - F)/S + 1$
    - $H_2 = (H_1 - F)/S + 1$
    - $D_2 = D_1$
- Introduces zero parameters since it computes a fixed function of the input
- It is not common to use zero-padding for pooling layers
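
The pooling summary above translates into a short forward pass. This is a loop-based sketch with an assumed function name `max_pool_forward` and a toy input; it is meant to show the shape arithmetic rather than an efficient implementation.

```python
import numpy as np

def max_pool_forward(X, F=2, S=2):
    """Max pooling over each depth slice of X (W1, H1, D1) -> (W2, H2, D1)."""
    W1, H1, D = X.shape
    W2 = (W1 - F) // S + 1
    H2 = (H1 - F) // S + 1
    out = np.empty((W2, H2, D))
    for i in range(W2):
        for j in range(H2):
            window = X[i * S:i * S + F, j * S:j * S + F, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max per depth slice
    return out

X = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
out = max_pool_forward(X)
print(out.shape)        # (2, 2, 2): width and height halved, depth unchanged
print(out[0, 0, :])     # [10. 11.]
```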

General Pooling: You can also perform average pooling or even L2-norm pooling. Average pooling has recently fallen out of favour.

![alt text](http://cs231n.github.io/assets/cnn/pool.jpeg)

The pooling layer downsamples the volume spatially. Above, the input volume of size 224x224x64 is pooled with filter size 2, stride 2 into an output volume of size 112x112x64. The volume depth is preserved.

![alt text](http://cs231n.github.io/assets/cnn/maxpool.jpeg)

Above: max pooling with a stride of 2, where each max is taken over 4 numbers (a 2x2 square).

Backprop: Recall that the backward pass for a max(x,y) op has a simple interpretation: it only routes the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it's common to keep track of the index of the max activation so that gradient routing is efficient during backprop.
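
Gradient routing for max pooling can be sketched as below. For brevity this version recomputes the argmax inside the backward pass instead of caching indices from the forward pass, which is a simplification of what the text describes; the function name `max_pool_backward` and the toy values are illustrative.

```python
import numpy as np

def max_pool_backward(X, dout, F=2, S=2):
    """Route each upstream gradient in dout (W2, H2, D) to the max of its window in X."""
    dX = np.zeros_like(X)
    W2, H2, D = dout.shape
    for i in range(W2):
        for j in range(H2):
            for d in range(D):
                window = X[i * S:i * S + F, j * S:j * S + F, d]
                a, c = np.unravel_index(np.argmax(window), window.shape)
                dX[i * S + a, j * S + c, d] += dout[i, j, d]   # only the max gets gradient
    return dX

X = np.array([[1.0, 5.0],
              [3.0, 2.0]]).reshape(2, 2, 1)   # a single 2x2 window, max is 5.0 at (0, 1)
dout = np.array([[[7.0]]])                     # upstream gradient for the one output
dX = max_pool_backward(X, dout)
print(dX[:, :, 0])
# [[0. 7.]
#  [0. 0.]]
```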

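Getting rid of pooling: Some propose discarding the pooling layer in favour of an architecture that only consists of repeated conv layers, using a larger stride in some conv layers to reduce the size of the representation. Discarding pooling has also been found to be important for training good generative models such as VAEs or GANs.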

####Normalization Layer
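Many types of normalization layers have been proposed for ConvNets, but they have since fallen out of favour because in practice their contribution has been shown to be minimal, if any.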

####FC Layer
Neurons in an FC layer have full connections to all activations in the previous layer. Their activations can be computed with a matrix multiplication followed by a bias offset.

####Converting FC layers to CONV layers
The only difference between FC and CONV layers is that the neurons in a CONV layer are connected only to a local region of the input, and that many of the neurons in a CONV volume share parameters. Neurons in both layers still compute dot products, so their functional form is the same, and you can convert between the two.

FC->CONV conversion: the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image and then uses a series of CONV and POOL layers to reduce the image to an activation volume of size 7x7x512. AlexNet then uses 2 FC layers of size 4096 and a last FC layer with 1000 neurons that computes the class scores. We can convert each of these 3 FC layers to CONV layers:
- Replace the first FC layer, which looks at the 7x7x512 volume, with a CONV layer that uses filter size 7, giving an output volume of 1x1x4096.
- Replace the second FC layer with a CONV layer that uses filter size 1, giving 1x1x4096.
- Replace the last FC layer similarly, with filter size 1, giving 1x1x1000.

E.g. if a 224x224 image gives a volume of size 7x7x512 (a spatial reduction by a factor of 32), then forwarding an image of size 384x384 through the converted architecture gives a volume of size 12x12x512, since 384/32 = 12. Following through with the 3 CONV layers that we just converted from FC layers would now give a final volume of size 6x6x1000, since (12 - 7)/1 + 1 = 6. Instead of a single vector of class scores of size 1x1x1000, we now get an entire 6x6 array of class scores across the 384x384 image.
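
The equivalence can be checked numerically on a scaled-down example. Everything here is an illustrative assumption: a 3x3x4 volume and 5 outputs stand in for 7x7x512 and 4096 so the arrays stay tiny, and `conv_valid` is a hypothetical helper implementing a stride-1, no-padding convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
F, D, K = 3, 4, 5                            # scaled-down stand-ins for 7, 512, 4096
W_fc = rng.standard_normal((K, F * F * D))   # FC weights acting on a flattened F x F x D volume
b_fc = rng.standard_normal(K)
filters = W_fc.reshape(K, F, F, D)           # the same numbers viewed as K conv filters

def conv_valid(X, filters, b):
    """Stride-1 convolution without padding; X is (W, H, D), output is (W-F+1, H-F+1, K)."""
    K, F, _, _ = filters.shape
    n_w, n_h = X.shape[0] - F + 1, X.shape[1] - F + 1
    out = np.empty((n_w, n_h, K))
    for i in range(n_w):
        for j in range(n_h):
            patch = X[i:i + F, j:j + F, :]
            out[i, j] = np.tensordot(filters, patch, axes=3) + b
    return out

# On an input of exactly the FC size, the CONV layer reproduces the FC output.
small = rng.standard_normal((F, F, D))
fc_scores = W_fc @ small.reshape(-1) + b_fc
conv_scores = conv_valid(small, filters, b_fc)          # shape (1, 1, K)
print(np.allclose(fc_scores, conv_scores[0, 0]))        # True

# On a larger input, the CONV layer yields a whole map of FC outputs, one per crop.
large = rng.standard_normal((5, 5, D))
score_map = conv_valid(large, filters, b_fc)            # shape (3, 3, K)
crop_scores = W_fc @ large[1:4, 2:5, :].reshape(-1) + b_fc
print(score_map.shape, np.allclose(score_map[1, 2], crop_scores))

# Size arithmetic from the text: 384/32 = 12, then (12 - 7)/1 + 1 = 6.
print((384 // 32 - 7) // 1 + 1)                         # 6
```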

####ConvNet Architectures

#####Layer Patterns
Most common pattern:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where * indicates repetition and POOL? indicates an optional pooling layer. N >= 0 (usually N <= 3), M >= 0, K >= 0 (usually K < 3).

- Input -> FC implements a linear classifier. Here N = M = K = 0.
- Input -> CONV -> RELU -> FC
- Input -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC.
- INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC

Prefer a stack of small-filter CONV layers to one CONV layer with a large receptive field.

"*Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field size of the input volume that is identical in spatial extent (7x7), but with several disadvantages. First, the neurons would be computing a linear function over the input, while the three stacks of CONV layers contain non-linearities that make their features more expressive. Second, if we suppose that all the volumes have C channels, then it can be seen that the single 7x7 CONV layer would contain C×(7×7×C)=49C² parameters, while the three 3x3 CONV layers would only contain 3×(C×(3×3×C))=27C² parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.*"

More recent architectures, such as Google's Inception, depart from this simple linear stacking of layers.
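
The receptive-field and parameter counts in the stacking argument quoted above can be verified with a few lines. The helper names and the channel count `C = 64` are illustrative assumptions; biases are ignored, as in the quote.

```python
def stacked_receptive_field(num_layers, F=3):
    """Receptive field on the input of `num_layers` stacked F x F, stride-1 CONV layers."""
    return 1 + num_layers * (F - 1)

def conv_weights(F, C):
    """Weights of one F x F CONV layer with C input channels and C filters (biases ignored)."""
    return C * (F * F * C)

C = 64  # hypothetical channel count
print([stacked_receptive_field(n) for n in (1, 2, 3)])   # [3, 5, 7]
print(conv_weights(7, C))                                # 49 * C**2 = 200704
print(3 * conv_weights(3, C))                            # 27 * C**2 = 110592
```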

####Layer Sizing Patterns

Rules of thumb:

The input layer (the image) should be divisible by 2 many times. Conv layers should use small filters (3x3, or at most 5x5), a stride of 1, and zero-padding chosen so that the conv layer does not alter the spatial dimensions of the input. Pool layers most commonly use 2x2 max-pooling.

Why stride 1? Smaller strides work better in practice. Stride 1 also leaves all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.

Why use padding? Besides keeping the spatial sizes constant after CONV, padding improves performance. Without it, the size of the volumes would shrink by a small amount after each CONV, and the information at the borders would be washed away too quickly.

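For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 creates three activation volumes of size 224x224x64. This amounts to about 10 million activations, or 72MB of memory (per image, for both activations and gradients).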
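
A rough back-of-the-envelope version of this memory estimate is sketched below. It rests on stated assumptions that are not spelled out in the line above: three such CONV layers, 4-byte (float32) values, and a factor of 2 to hold gradients alongside activations.

```python
layers = 3                       # assumption: three stacked 3x3 CONV layers
per_layer = 224 * 224 * 64       # activations in one 224x224x64 volume
total_activations = layers * per_layer

bytes_per_value = 4              # assumption: float32
copies = 2                       # assumption: activations plus their gradients
total_bytes = total_activations * bytes_per_value * copies

print(per_layer)                          # 3211264
print(total_activations)                  # 9633792, i.e. about 10 million
print(f"{total_bytes / 1e6:.1f} MB")      # 77.1 MB
```

Under these assumptions the count lands in the same range as the figure quoted in the notes, and it is per image, so it scales with the batch size.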

Case studies:

VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 and no padding.

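Largest consideration: the memory bottleneck of GPUs. Many GPUs have a limit of 3/4/6GB of memory. Three major culprits:
- the intermediate volume sizes (the number of activations, plus their gradients)
- the parameter sizes (the parameters, their gradients, and any optimizer cache)
- miscellaneous memory, such as image data batches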





#Understanding and Visualizing CNN's

http://cs231n.github.io/understanding-cnn/

##Visualizing what ConvNets learn



###Visualizing the activations and first-layer weights

**Layer Activations**. The most straightforward visualization technique is to show the activations of the network during the forward pass. For ReLU nets, activations usually start out looking relatively blobby and dense, but as training progresses the activations become more sparse and localized. One pitfall that is easy to notice with this visualization is that some activation maps may be all zero for many different inputs, which can indicate dead filters and can be a symptom of high learning rates.
![alt text](http://cs231n.github.io/assets/cnnvis/act1.jpeg)
Activations on the first CONV layer
![alt text](http://cs231n.github.io/assets/cnnvis/act2.jpeg) 
Activations on 5th CONV layer. Every box shows an activation map corresponding to some filter. Notice the activations are sparse (mostly 0) and mostly local.

**Conv/FC filters.** The second common strategy is to visualize the weights. These are most interpretable on the first CONV layer, which looks directly at the raw pixel data, but it is possible to show the filter weights deeper in the net too. The weights are useful to visualize because well-trained nets usually display nice and smooth filters without any noisy patterns. Noisy patterns can indicate a net that hasn't been trained for long enough, or a very low regularization strength that may have led to overfitting.

![alt text](http://cs231n.github.io/assets/cnnvis/filt1.jpeg) Typical filters on the first CONV layer

![alt text](http://cs231n.github.io/assets/cnnvis/filt2.jpeg)

2nd CONV layer. First layer weights are nice and smooth indicating nicely converged net. Color/grayscale features are clustered because AlexNet contains 2 streams of processing. The 2nd CONV layer weights are not as interpretable, but they are still smooth, well-formed and absent of noisy patterns.






###Retrieving images that maximally activate a neuron

Another technique is to take a large dataset of images, feed them through a net and track which images maximally activate some neuron. ![alt text](http://cs231n.github.io/assets/cnnvis/pool5max.jpeg)

Maximally activating images for some POOL5 (5th pool layer) neurons. The activation values and the receptive field of the particular neuron are shown in white. Note that POOL5 neurons are a function of a relatively large portion of the input image. One problem with this approach is that ReLU neurons do not necessarily have any semantic meaning by themselves. Rather, it is more appropriate to think of multiple ReLU neurons as the basis vectors of some space that represents image patches. In other words, the visualization is showing the patches at the edge of the cloud of representations, along the (arbitrary) axes that correspond to the filter weights. This can also be seen from the fact that neurons in a ConvNet operate linearly over the input space, so any arbitrary rotation of that space is a no-op.

###Embedding the codes with t-SNE

ConvNets can be interpreted as gradually transforming the images into a representation in which the classes are separable by a linear classifier. We can get an idea of the topology of this space by embedding images into two dimensions so that their low-dimensional representation has approximately the same distances as their high-dimensional representation. t-SNE is one of the best-known methods for embedding high-dimensional vectors in a low-dimensional space.

To produce an embedding, take a set of images and use the ConvNet to extract their CNN codes (the 4096-D vector right before the classifier, including the ReLU non-linearity). Plug these into t-SNE to get a 2-D vector for each image.

![alt text](http://cs231n.github.io/assets/cnnvis/tsne.jpeg)

t-SNE embedding of a set of images based on their CNN codes. Images nearby each other are close in CNN representation space. Similarities are more class-based and semantic rather than pixel or color-based. 



##Occluding parts of the image

Suppose we want to classify a dog. How can we be certain the ConvNet is actually picking up on the dog in the image, as opposed to some contextual cues from the background or some other miscellaneous object? One way is to plot the probability of the class of interest as a function of the position of an occluder object. That is, iterate over regions of the image, set a patch of the image to be all zero, and look at the probability of the class. We can visualize the probability as a 2-D heat map.

![alt text](http://cs231n.github.io/assets/cnnvis/occlude.jpeg)

Notice the occluder region in grey. As we slide the occluder over the image we record the probability of the correct class then visualize it as a heatmap. 
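
The occlusion procedure can be sketched as a sliding-window loop. Here `occlusion_heatmap` is a hypothetical helper, and `toy_predict` is a deliberately trivial stand-in for a real classifier (its "probability" is just the mean brightness of the image centre), so that the example runs without a trained ConvNet. Patch size and stride are illustrative choices.

```python
import numpy as np

def occlusion_heatmap(image, predict_proba, class_idx, patch=8, stride=8):
    """Probability of `class_idx` as a zeroed-out patch slides over `image` (H, W, C)."""
    H, W = image.shape[:2]
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heat = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = image.copy()
            occluded[r * stride:r * stride + patch, c * stride:c * stride + patch] = 0
            heat[r, c] = predict_proba(occluded)[class_idx]
    return heat

def toy_predict(img):
    """Stand-in for a classifier: class 0 'probability' is the mean of the 8x8 centre."""
    p = img[12:20, 12:20].mean()
    return np.array([p, 1.0 - p])

image = np.ones((32, 32, 3))
heat = occlusion_heatmap(image, toy_predict, class_idx=0)
print(heat.shape)    # (4, 4)
print(heat)          # low values mark positions where the occluder hurts the class score
```

With a real model, `predict_proba` would run a forward pass and return the softmax probabilities; low regions of the heat map then indicate the image areas the prediction depends on.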

##EXPLAINING AND HARNESSING ADVERSARIAL EXAMPLES

https://arxiv.org/pdf/1412.6572.pdf

- Several models misclassify adversarial examples: inputs formed by applying a small, intentionally worst-case perturbation to an example. The paper argues that the primary cause of this vulnerability is the linear nature of NNs.
- Szegedy et al. (2014b) observed the vulnerability of models to adversarial examples.
- Some adversarial examples were so close to the original that they were indistinguishable to the human eye, yet they were misclassified with high confidence.
- This is attributed to models being too linear (a property of high-dimensional dot products), rather than too non-linear.
- Adversarial training (training on adversarial examples) acts as a regularizer.
- RBF nets are resistant to adversarial examples https://mccormickml.com/2013/08/15/radial-basis-function-network-rbfn-tutorial/ 
  - RBFN classify by measuring input's similarity to examples in the training set. Each neuron stores a prototype (example from training set) then computes Euclidean distance between its input and the prototype.
- Gradient based optimization is the engine for current successful DL but the ease of this optimization comes at the cost of models that are easily misled. This motivates the development of optimizers that can train models whose behavior is more locally stable. 
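
The linearity argument can be illustrated with the fast gradient sign perturbation from the linked paper, applied here to a plain logistic-regression model rather than a CNN. This is a toy sketch under stated assumptions: the weights are random instead of trained, the input is random "pixel" data, the bias is chosen so that the clean example is classified correctly, and `eps = 0.03` is an arbitrary small step. No clipping to the valid pixel range is applied.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """x + eps * sign(dL/dx) for logistic regression with cross-entropy loss L."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w              # gradient of the cross-entropy loss w.r.t. the input
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
n = 3072                              # same dimensionality as a 32x32x3 image
w = 0.05 * rng.standard_normal(n)     # stand-in for trained weights (assumption)
x = rng.uniform(0.0, 1.0, n)          # stand-in for an image with pixels in [0, 1]
b = 2.0 - w @ x                       # chosen so the clean logit is 2 (assumption)
y = 1.0                               # true label

eps = 0.03
x_adv = fgsm(x, y, w, b, eps)
p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
print(f"clean p(y=1) = {p_clean:.3f}, adversarial p(y=1) = {p_adv:.3f}")
print(f"max per-pixel change = {np.max(np.abs(x_adv - x)):.3f}")
```

Each coordinate moves by at most `eps`, but the logit shifts by `eps` times the sum of `|w|`, which grows with the input dimension. That accumulation across many dimensions is the "too linear" effect described in the bullets above.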

##Understanding Deep Image Representations by Inverting Them


https://arxiv.org/abs/1412.0035

- An analysis of the visual information contained in representations, by asking: given an encoding of an image, to what extent is it possible to reconstruct the image itself?
- To do so, they invert representations such as HOG, SIFT and CNNs. They show that several layers in CNNs retain photographically accurate information about the image, with different degrees of geometric and photometric invariance.

##STRIVING FOR SIMPLICITY: THE ALL CONVOLUTIONAL NET

https://arxiv.org/pdf/1412.6806.pdf

- The work analyzes current object recognition pipelines and finds that max-pooling can be replaced by a Conv layer with increased stride without loss in accuracy.

##Visualizing and Understanding Convolutional Networks

https://arxiv.org/pdf/1311.2901.pdf

- 


#Transfer Learning and Fine-Tuning CNN's

http://cs231n.github.io/transfer-learning/

##Transfer Learning

In practice we don't train an entire CNN from scratch with random initialization. It is common to pretrain a ConvNet and then use the ConvNet as an initialization or a fixed feature extractor for the task of interest. Three scenarios:

- **ConvNet as fixed feature extractor**. Take a ConvNet pretrained on ImageNet, remove last fc layer, then treat entire CNN as fixed feature extractor for the new dataset. In AlexNet, this computes a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. We call these features **CNN codes**. These codes are ReLUd. Once extracted, train a linear classifier for the new dataset.
- **Fine-tuning the ConvNet**. Second strategy is to not only replace and retrain the classifier on top of the ConvNet, but also fine-tune the weights of the pretrained net by continuing backprop. It is possible to keep some of the earlier layers fixed and only fine-tune some higher-layers. This is because earlier features of a CNN are more generic features (e.g. edge detectors/blob detectors) that should be useful to many tasks but later layers become more specific to the details of the classes contained in the original dataset. 
- **Pretrained models.** Since modern CNNs take weeks to train, it is common to see people release their checkpoints to be used for fine-tuning. 

**When and how to fine-tune?** How do you decide what type of transfer learning you should perform on a new dataset? The two most important factors: size of new dataset, and similarity to the original dataset. Rules of thumb:

1. New dataset is small and similar to the original dataset. Since the dataset is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect the higher-level features in the CNN to be relevant to this dataset as well, so it is best to train a linear classifier on the CNN codes.
2. New dataset is large and similar to the original dataset. Since we have more data, we can be more confident that we won't overfit if we try to fine-tune through the full network.
3. New dataset is small but very different from the original dataset. Since the data is small, it is best to train only a linear classifier. Since the dataset is very different, it might not be best to train the classifier on top of the network, where the features are more dataset-specific. It might work better to train the SVM classifier on activations from somewhere earlier in the network.
4. New dataset is large and very different from the original dataset. Since the dataset is large, we may expect that we can afford to train a ConvNet from scratch.

**Practical advice**. A few things to keep in mind when doing transfer learning:

- Constraints from pretrained models. You may be constrained in terms of the architecture you can use for your dataset. Some changes are easy: due to parameter sharing, you can easily run a pretrained network on images of different spatial size. This is evident in the case of Conv/Pool layers, because their forward function is independent of the input volume spatial size (as long as the strides fit). In the case of FC layers, this still holds because FC layers can be converted to a Conv Layer: for example, in AlexNet the final pooling volume before the first FC layer is of size 6x6x512. Thus, the FC layer looking at this volume is equivalent to a Conv Layer that has receptive field size 6x6 and is applied with padding 0.

- Learning rates. It's common to use a smaller lr for ConvNet weights that are being fine-tuned, in comparison to the randomly-initialized weights for the new linear classifier that computes the class scores of your new dataset. This is because we expect that the ConvNet weights are relatively good, so we don't want to distort them too quickly and too much.