## Data Preparation

We want every feature to be as easy as possible for the neural network to pick up.

E.g. if the only feature we have is a date, it is hard for the neural network to learn that a pattern happens on weekends.

So we can pre-process the data and create a new feature called is_weekend.
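
A minimal sketch of that pre-processing step in plain pandas. The toy dates and the `Date` column name are assumptions for illustration, not the lesson's actual data.

```python
import pandas as pd

# Hypothetical toy data: Friday, Saturday, Sunday, Monday.
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2015-07-31", "2015-08-01", "2015-08-02", "2015-08-03"])}
)

# dayofweek is Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
df["is_weekend"] = df["Date"].dt.dayofweek >= 5
print(df)
```

The network now gets the weekend signal as a ready-made boolean instead of having to infer it from the raw date.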

![](lesson6/lesson6-1.png)

## Fastai preprocessor - Categorify
![](lesson6/lesson6-2.png)

Some columns in the pandas DataFrame hold strings. This preprocessor converts them to pandas categorical columns, which store each distinct string internally as an integer code.

E.g. "Jan,Apr,Jul,Oct" is a single string value, so it is treated as one category and stored internally as one integer code.

This lists all the categories that appear in this column:
```
small_train_df.PromoInterval.cat.categories
```

This shows the internal codes:
```
small_train_df['PromoInterval'].cat.codes
```
A code of -1 means NaN.
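
To see the same behaviour outside fastai, here is a small sketch using plain pandas `astype("category")` on a made-up column. It is only meant to show what categories and codes look like, not to reproduce the preprocessor exactly.

```python
import pandas as pd

# Hypothetical toy column with one missing value.
s = pd.Series(["Jan,Apr,Jul,Oct", None, "Feb,May,Aug,Nov", "Jan,Apr,Jul,Oct"])
s = s.astype("category")

categories = list(s.cat.categories)   # distinct strings, sorted
codes = s.cat.codes.tolist()          # one integer per row, -1 for the missing value
print(categories)
print(codes)
```

Each whole string is one category, and the missing row gets the code -1.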

## Fastai preprocessor - fill_missing
![](lesson6/lesson6-3.png)
It creates a new boolean feature FEATURE_na, which is True where the original value was missing and False otherwise.

The missing values in the original column are then replaced with the median.
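
The two steps can be sketched by hand in pandas. This is an illustration of the idea, not fastai's own code, and the column name is a hypothetical example.

```python
import pandas as pd

# Hypothetical toy column with one missing value.
df = pd.DataFrame({"CompetitionDistance": [100.0, None, 300.0, 500.0]})

col = "CompetitionDistance"
df[col + "_na"] = df[col].isna()             # True where the value was missing
df[col] = df[col].fillna(df[col].median())   # median of the observed values (300.0)
print(df)
```

Keeping the `_na` flag lets the model learn whether "missing" itself carries information.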

## Fastai - treat label as continuous variable explicitly

We can pass label_cls=FloatList to explicitly treat the label as a continuous variable (regression). Otherwise, fastai treats integer labels as a categorical variable (classification).
```
data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs,)
                .split_by_idx(valid_idx)
                .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars, cont_names=cont_vars))
                .databunch())
```

## Root Mean Square Percentage Error
![](lesson6/lesson6-4.png)

## Usage of Log

The example above passes log=True, so the label is immediately transformed to log(y). We use a log when the label has a long-tailed distribution, such as population or sales amounts, and when we care about percentage differences rather than absolute differences.

That is why the example above is evaluated with RMSPE instead of RMSE.

The claim is that after taking the log of the label (so each yi becomes log(yi)), RMSE on the logs behaves like RMSPE. The reason is that log(y) - log(y_hat) = log(y / y_hat), which depends only on the ratio, and for small errors log(y / y_hat) ≈ (y - y_hat) / y, which is the percentage error.
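
A quick numeric check of the log claim, using made-up labels and predictions that are off by a few percent. The two numbers should come out close to each other, but not identical.

```python
import numpy as np

# Made-up values spanning several orders of magnitude.
y_true = np.array([100.0, 1000.0, 10000.0])
y_pred = np.array([105.0, 950.0, 10300.0])

# Root mean square percentage error on the raw values.
rmspe = np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

# Plain RMSE, but computed on log(y).
rmse_log = np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

print(rmspe, rmse_log)
```

The approximation gets worse as the percentage errors grow, so this only holds for reasonably accurate predictions.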

## Dropout
![](lesson6/lesson6-5.png)

Dropout throws away some activations in each layer at random, with a given probability, for each mini-batch. On the next batch those activations are put back and a different random set is thrown away.

This means no single activation can memorize some part of the input, which is what happens when we overfit: some part of the model learns to recognize a particular image rather than a general feature.
```
ps=[0.001,0.01]   # one dropout probability per hidden layer
ps=0.01           # or a single probability applied to every layer
```
![](lesson6/lesson6-6.png)
At test time we always use all the nodes (dropout is turned off).

The paper also says that at test time you should multiply the weights by p, where p is the probability of keeping a node: now that all the nodes are present, the activations would otherwise be larger than they were during training.

PyTorch already handles this for us (it rescales the kept activations during training instead), so we do not need to care about it.
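
A NumPy sketch of the "inverted dropout" variant, where the rescaling happens during training so that nothing needs to change at test time. Note that here p is the probability of dropping an activation (the convention of PyTorch's `nn.Dropout`), not the keep probability used in the paper. The function itself is a hypothetical illustration, not PyTorch's implementation.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: p is the probability of DROPPING an activation."""
    if not training or p == 0.0:
        return x  # test time: use every activation, no rescaling needed
    mask = rng.random(x.shape) >= p   # keep each activation with probability 1 - p
    return x * mask / (1.0 - p)       # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones(10_000)
out = dropout(x, p=0.5, training=True, rng=rng)
print(out.mean())  # stays near the original mean of 1.0
```

About half of the outputs are 0 and the rest are 2, so the average activation is roughly what it was before dropout.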

## Check lesson6 on Dropout
[Jump to Lesson6 Dropout](https://youtu.be/hkBa9pU-H48?t=2328)

## Batch Normalization

[Jump to Lesson6 on Batch Norm](https://youtu.be/hkBa9pU-H48?t=2670)

![](lesson6/lesson6-7.png)
The algorithm takes a mini-batch. Since BatchNorm is a layer, what comes into it is a set of activations { x1, x2, ... xm }.

First we find the mean μ of the mini-batch.

Then we find the variance σ² of the mini-batch.

Then we normalize to get x_hat.

But it turns out those are not the most important steps.

Finally we take those values and add a bias called β. We also have another parameter γ, which is a lot like a bias except that we multiply x_hat by it. β and γ are learnable parameters, so they are learned with gradient descent.
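
The steps can be sketched in NumPy as a training-mode forward pass. This is a simplified illustration: it leaves out the running statistics that a real BatchNorm layer keeps for inference, and `eps` is a small assumed constant to avoid dividing by zero.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x has shape (batch, features); statistics are taken over the batch."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # learnable scale and shift

# Two features on very different scales.
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0), out.std(axis=0))
```

With γ = 1 and β = 0 both features end up with roughly zero mean and unit standard deviation, regardless of their original scale.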

![](lesson6/lesson6-8.png)
There are 2 papers showing that batch norm does not actually help by reducing internal covariate shift.

In the picture above, the red line (Standard) is very bumpy, while the blue line (Standard + BatchNorm) is not bumpy at all. This means you can increase the learning rate when using BatchNorm, because those big bumps are the times when you are really at risk of your weights jumping off into some awful part of the weight space that they can never get out of again. So if BatchNorm makes training less bumpy, you can train at a higher learning rate (which is why it trains faster).

### Then why does BatchNorm give better and faster results?
![](lesson6/lesson6-9.png)
In the picture above, y_hat is the result of some function f of our various weights (there can be millions of them) and of our input x. This function is the neural net.

L is the loss function.

Say our current model (up to this y_hat activation) outputs values in the range -1 to 1, but for movie reviews we want the output to be in the range 1 to 5. The outputs are way off from where they need to be, in both scale and position. We can train the network and eventually it will learn to increase the weights so that y_hat gets closer to what is expected, but that is very hard to do (it takes a long time to train) because all these weights interact in very intricate ways (the neural net is a complex network of connected nodes).

But batchnorm provides two new parameters, β and γ (b and g in the drawing above). Now we can increase the scale very easily (increase γ) and shift the position to where it needs to be (increase β). β and γ have a direct gradient for changing the scale and the mean through this BatchNorm layer, with no need to go through the complex layers of the neural network.

So basically, batchnorm makes it easy to shift the output up and down, and in and out.
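
A tiny worked example of the movie review case, with γ and β picked by hand rather than learned, just to show that two numbers are enough to move the range from -1..1 to 1..5.

```python
import numpy as np

x_hat = np.array([-1.0, 0.0, 1.0])   # outputs in the range -1 to 1
gamma, beta = 2.0, 3.0               # hand-picked here; in training these are learned

y = gamma * x_hat + beta             # scale by gamma, then shift by beta
print(y)  # [1. 3. 5.]
```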

## Weight Norm
TODO

## Data augmentation

It basically means transforming your original data in different ways to produce multiple new versions of it.

![](lesson6/lesson6-10.png)

Check out the image transformation library in fastai:

[Fastai Vision Transform](https://docs.fast.ai/vision.transform.html)
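
To make the idea concrete without relying on any particular fastai call, here is a hand-rolled NumPy sketch of two common transforms, a horizontal flip and a random crop, on a fake image. It is an illustration of the concept only, not the fastai transform API.

```python
import numpy as np

rng = np.random.default_rng(0)
image = np.arange(4 * 6 * 3).reshape(4, 6, 3)   # fake image: height 4, width 6, 3 channels

# Horizontal flip: reverse the width axis.
flipped = image[:, ::-1, :]

# Random 3x5 crop: pick a random top-left corner that keeps the crop inside the image.
top = rng.integers(0, 2)
left = rng.integers(0, 2)
cropped = image[top:top + 3, left:left + 5, :]

print(flipped.shape, cropped.shape)
```

Each call with a different random corner gives a slightly different training image with the same label.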

## Image Kernel
![](lesson6/lesson6-11.png)

So we have a 3x3 matrix that we call a kernel. For each 3x3 patch of pixels in the image (except where the kernel would hang over the edge), we multiply each pixel by the matching kernel value and add up the results to get one pixel of a new image. The new image is 1 pixel smaller on each edge.

For this particular kernel, the new image outlines the horizontal edges.

This is called convolution.
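
A minimal NumPy sketch of that sliding-window operation. The kernel values are an assumed example of a horizontal edge detector, not necessarily the one in the picture, and the loop is written for clarity rather than speed.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image with no padding; multiply and sum each patch."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Dark top half (0) and bright bottom half (1): a single horizontal edge.
image = np.zeros((6, 6))
image[3:, :] = 1.0

# Assumed example kernel that responds where brightness changes from top to bottom.
kernel = np.array([[-1.0, -1.0, -1.0],
                   [ 0.0,  0.0,  0.0],
                   [ 1.0,  1.0,  1.0]])

out = conv2d(image, kernel)
print(out.shape)  # (4, 4)
print(out)
```

The output is 4x4 for a 6x6 input, and only the rows that straddle the edge are non-zero.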

Check out this website:
[Image Kernel](http://setosa.io/ev/image-kernels/)



## Convolutional Neural Network

So now we can take this newly formed image from the first layer and apply another kernel to it, say one that highlights left edges. The result is a second layer that is good at finding things like top-left corners. That's the idea of a CNN.

![](lesson6/lesson6-12.png)
![](lesson6/lesson6-13.png)

The above 2 pictures show that convolution is just matrix multiplication.

![](lesson6/lesson6-14.png)
We need to think about padding, because the 3x3 kernel cannot be centered on the outermost pixels: to compute the output at image(0,0), the kernel centered there would reach outside the image. So we pad the edges of the image with zeros so that the output has the same size as the input image.
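
A short sketch of zero padding with `np.pad`, assuming a 5x5 image and a 3x3 kernel, to show why one pixel of padding keeps the output the same size as the input.

```python
import numpy as np

image = np.ones((5, 5))

# Add a border of zeros, one pixel wide, on every side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (7, 7)

# A 3x3 kernel fits (7 - 3 + 1) = 5 times along each axis,
# so the output is 5x5 again, the same as the original image.
out_size = padded.shape[0] - 3 + 1
print(out_size)  # 5
```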

![](lesson6/lesson6-15.png)

Normally an image has 3 channels: red, green and blue. So our input is h * w * 3 (height x width x channel).

Since there are 3 channels, we cannot just use a 3x3 kernel, which is 2D, and it also does not make sense to reuse the same 3x3 kernel for every channel.

So we now have a 3x3x3 kernel.

Convolving the h * w * 3 input with one 3x3x3 kernel (with padding) outputs a new h * w image, which is 2D.

We cannot do much with just one 2D activation.

So we use more 3x3x3 kernels, each different, in this case 16 kernels in total.

The output activations now have shape h * w * 16.
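
The shapes can be checked with a slow but explicit NumPy loop. The image size of 8x8, the random values, and the zero padding of 1 are assumptions for the sketch. Real libraries do the same computation far more efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 8, 8
image = rng.random((h, w, 3))          # h x w x 3 input
kernels = rng.random((16, 3, 3, 3))    # 16 different kernels, each 3x3x3

# Zero-pad height and width by 1 so the output keeps the same h and w.
padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))

out = np.zeros((h, w, 16))
for k in range(16):                    # one output channel per kernel
    for i in range(h):
        for j in range(w):
            # Multiply a 3x3x3 patch by the kernel and sum to a single number.
            out[i, j, k] = np.sum(padded[i:i + 3, j:j + 3, :] * kernels[k])

print(out.shape)  # (8, 8, 16)
```

Each kernel collapses all 3 input channels into one 2D plane, and stacking 16 of those planes gives the h * w * 16 output.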

## Stride 2 Convolution

![](lesson6/lesson6-16.png)
A traditional CNN, where the kernel is applied at every pixel in every layer, is computationally very heavy.

A stride 2 convolution does basically the same thing, but only at every other pixel.

This results in an output activation of shape h/2 * w/2 * channel.

We make up for this somewhat by doubling the number of kernels, so the new output activation has shape h/2 * w/2 * 32.
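
The halving follows from the usual output-size formula for a convolution. Here is a small helper, assuming a 3x3 kernel with padding 1 and a hypothetical input size of 224.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Output length along one axis: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

print(conv_output_size(224, stride=1))  # 224
print(conv_output_size(224, stride=2))  # 112
```

With stride 2 each spatial dimension is halved, so there are a quarter as many positions to compute per kernel.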

### Check lesson6-pets-more.ipynb #Convolution-kernel for doing convolution manually.

## Average pooling
![](lesson6/lesson6-17.png)

Deeper into the CNN, the activations have a shape of, for example, 11 * 11 * 512 (it does not have to be exactly 11, this is just an example). We can take the mean of each 11 * 11 plane, which gives an output array of 512 means.

So if we predict that this particular image is a Maine_Coon cat (highlighted in yellow), there has to be a high value in that yellow box of the output array.

Walking backward, the only way to get a high value there is through the matrix multiplication with the 512 * 37 weight matrix.

It represents a simple weighted linear combination of all the 512 values.

So if we are going to be able to say "I'm pretty confident this is a Maine Coon" just by taking a weighted sum of a bunch of inputs, those inputs have to represent features like how fluffy it is, what color its nose is, how long its legs are, and so on.

To decide that something is a bulldog, the model uses exactly the same 512 inputs with a different set of weights, because that is all a matrix multiplication is: a bunch of weighted sums, a different weighted sum for each output.
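
A NumPy sketch of average pooling followed by the final matrix multiplication. The activations and weights are random stand-ins, and the height x width x channel layout follows these notes rather than PyTorch's channel-first layout.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.random((11, 11, 512))        # final conv activations (made-up values)
weights = rng.normal(size=(512, 37))    # linear layer: 512 features -> 37 classes

pooled = acts.mean(axis=(0, 1))         # average pooling: one mean per channel
scores = pooled @ weights               # 37 weighted sums, one per class
pred = int(scores.argmax())             # index of the highest-scoring class

print(pooled.shape, scores.shape, pred)
```

Each column of `weights` is the set of 512 weights for one class, so every class score is a different weighted sum of the same 512 pooled features.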

## Average Activations
![](lesson6/lesson6-18.png)
Extending the explanation above: this CNN, with potentially dozens or even hundreds of convolution layers, must eventually come up with an 11 * 11 face for each of these features (held in the 11 * 11 * 512 tensor).

Each value says, for that part of the image (the blue box here), how much it looks like a pointy ear, how fluffy it is, how much it looks like a long leg.

So each face represents a different feature.

So, instead of doing what average pooling does, we can average across all 512 faces at each grid point of the 11 by 11 plane. The output is a single 11 by 11 plane showing how "activated" each grid point is.

That is, when it came to figuring out whether this was a Maine Coon, how many signs of Maine Coon-ishness there were in that part of the 11 by 11 grid.
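
The only difference from average pooling is which axes get averaged. A sketch with random stand-in activations, again assuming the height x width x channel layout used in these notes:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.random((11, 11, 512))    # made-up final conv activations

# Average pooling averages over the 11 x 11 grid and keeps the 512 channels.
# Here we average over the 512 channels and keep the 11 x 11 grid instead.
avg_acts = acts.mean(axis=2)
print(avg_acts.shape)  # (11, 11)
```

The resulting 11 x 11 array has one value per grid cell, which can be plotted over the image.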

## Heatmap
![](lesson6/lesson6-19.png)
It is a plot of those average activations.

### Check lesson6-pets-more.ipynb #Heatmap for drawing heatmap manually.

## Ethics and Data Science
[Jump to Lesson 6 on Ethics](https://youtu.be/hkBa9pU-H48?t=6546)