## Data Preparation

We want every feature to be as easy as possible for the neural network to pick up.

E.g. if the only feature we have is a date, it is hard for the neural network to learn that a pattern happens on weekends.

So we can pre-process the data and create a new feature called is_weekend.

![](lesson6/lesson6-1.png)
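
A minimal sketch of this kind of date pre-processing in plain pandas. The sample dates and the column name `Date` are made up for illustration; only `is_weekend` comes from the notes above.

```python
import pandas as pd

# Hypothetical sample data: a Friday, a Saturday and a Sunday
df = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2015-08-01", "2015-08-02"])})

# dayofweek is 0 for Monday up to 6 for Sunday, so 5 and 6 are the weekend
df["is_weekend"] = df["Date"].dt.dayofweek >= 5
```

After this, the network gets the weekend information directly instead of having to work it out from the raw date.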

## Fastai preprocessor - Categorify
![](lesson6/lesson6-2.png)

Some of the columns in the pandas dataframe hold strings, and this preprocessor converts them into categories that are stored internally as numbers.

E.g. "Jan,Apr,Jul,Oct" was a string; it becomes one category, and each row stores the integer code of its category internally.

This lists all the categories that appear in this column:
```
small_train_df.PromoInterval.cat.categories
```

This shows the internal integer codes:
```
small_train_df['PromoInterval'].cat.codes
```
A code of -1 means NaN (a missing value).
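
A small sketch of the pandas mechanism behind `.cat.categories` and `.cat.codes`. The dataframe is made up for illustration, and this uses plain pandas rather than the fastai preprocessor itself.

```python
import pandas as pd

# Hypothetical column in the style of PromoInterval, with one missing value
df = pd.DataFrame(
    {"PromoInterval": ["Jan,Apr,Jul,Oct", None, "Feb,May,Aug,Nov", "Jan,Apr,Jul,Oct"]}
)

# Convert the string column to a pandas categorical
df["PromoInterval"] = df["PromoInterval"].astype("category")

# The distinct strings, sorted
categories = df["PromoInterval"].cat.categories.tolist()
# One integer per row: the position of its string in `categories`, or -1 if missing
codes = df["PromoInterval"].cat.codes.tolist()
```

Here `categories` is `['Feb,May,Aug,Nov', 'Jan,Apr,Jul,Oct']` and `codes` is `[1, -1, 0, 1]`, so the missing row gets -1.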

## Fastai preprocessor - FillMissing
![](lesson6/lesson6-3.png)
This creates a new boolean feature named FEATURE_na, which is True where FEATURE was missing and False otherwise.

The missing values in the original column are replaced with the median.
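
The same idea sketched in plain pandas. This is not fastai's own implementation, and the column name and values are made up for illustration.

```python
import pandas as pd

# Hypothetical continuous column with one missing value
df = pd.DataFrame({"CompetitionDistance": [100.0, None, 300.0, 500.0]})

col = "CompetitionDistance"
# Flag the rows that were missing, before filling them
df[col + "_na"] = df[col].isna()
# Replace the missing values with the median of the observed values (300.0 here)
df[col] = df[col].fillna(df[col].median())
```

The `_na` column keeps the fact that a value was missing, which can itself be a useful signal for the model.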

## Fastai - treat label as continuous variable explicitly

We can pass label_cls=FloatList to explicitly treat the label as a continuous variable. Otherwise, fastai treats integer labels as a categorical variable.
```
data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                .split_by_idx(valid_idx)
                .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars, cont_names=cont_vars))
                .databunch())
```

## Root Mean Square Percentage Error
![](lesson6/lesson6-4.png)
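
For reference, the usual definition of RMSPE, with $y_i$ the actual value, $\hat{y}_i$ the prediction and $n$ the number of samples. The symbols are chosen here and may differ from the ones in the image above.

$$
\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}
$$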

## Usage of Log

The example above takes the log of the label right away (log=True). We use the log when the label has a long-tailed distribution, such as population or sales amounts, and when we care about percentage differences rather than absolute differences.

That is why the evaluation metric in the example above is RMSPE instead of RMSE.

It is claimed that once we take log(y) (so each yi becomes log(yi)), RMSPE turns into RMSE mathematically; I have not checked this yet.
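
One way to see why the claim is plausible: log(pred) - log(y) = log(pred / y), and log(1 + e) is close to e when e is small, so a difference of logs is roughly a relative error. That makes it an approximation rather than an exact identity. A quick numeric check, with made-up values and predictions that are off by a few percent:

```python
import math

# Hypothetical actual values and predictions (off by +2%, -3%, +1%, -1%)
y = [100.0, 2000.0, 35000.0, 800.0]
pred = [102.0, 1940.0, 35350.0, 792.0]

n = len(y)
# RMSPE computed on the raw values
rmspe = math.sqrt(sum(((a - p) / a) ** 2 for a, p in zip(y, pred)) / n)
# RMSE computed on the logged values
log_rmse = math.sqrt(sum((math.log(a) - math.log(p)) ** 2 for a, p in zip(y, pred)) / n)

print(rmspe, log_rmse)  # the two numbers are close, but not identical
```

The two numbers drift further apart as the relative errors get larger.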

## Dropout
![](lesson6/lesson6-5.png)

Dropout throws away some activations in each layer at random, with a given probability, for each batch. On the next batch, those activations are put back and a different random set is thrown away in each layer.

This means that no single activation can memorize some part of the input, which is what happens when we overfit. When we overfit, some part of the model is basically learning to recognize a particular image rather than a feature in general.
```
ps=[0.001,0.01]  # a list: one dropout probability per layer
ps=0.01          # or a single probability instead of a list
```
![](lesson6/lesson6-6.png)
At test time we always use all the nodes (dropout is turned off).

It also says that at test time you should multiply the weights by p (here p is the probability that a node was kept during training), since all the nodes are now present. Without this, the activations would be larger than they were during training, when some nodes were dropped.

PyTorch already handles this for us, so we do not need to worry about it.
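
A short sketch of how this looks with PyTorch's `nn.Dropout`. Note that PyTorch's `p` is the probability of dropping an activation, and that it scales the kept activations up by 1 / (1 - p) during training instead of scaling anything down at test time. The input tensor is made up for illustration.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # p is the probability of zeroing an activation
x = torch.ones(8)

drop.train()         # training mode: dropout is active
train_out = drop(x)  # each entry is either 0 or 1 / (1 - 0.5) = 2

drop.eval()          # evaluation mode: dropout is switched off
eval_out = drop(x)   # identical to the input, no rescaling needed
```

Because the rescaling happens during training, the expected size of the activations is the same in both modes.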