# fastai v2 Kernel Starter Code

The goal of this kernel is to show how to train a neural network with fastai 2.0 for this Kaggle competition.

## Grabbing the Library

First we need to enable internet access within this kernel and then `!git clone` the `fastai_dev` repository for us to import from.

In [None]:
!git clone https://github.com/fastai/fastai_dev.git
%cd fastai_dev/dev

We're going to need a variety of imports, most importantly the `tabular.core` module for building the dataset (the rest deal with training the model)

In [None]:
from local.data.all import *
from local.tabular.core import *
from local.tabular.model import *
from local.optimizer import *
from local.learner import *
from local.metrics import *
from local.callback.all import *

## Setting Up Our Data

Let's make a `Path` object pointing to our data and combine `train.csv` with `building_metadata.csv` to get more information about these meter readings. For simplicity we will use the first 5000 samples from the training set. For the `DataFrame` preparation, please see ryches' kernel [here](https://www.kaggle.com/ryches/simple-lgbm-solution).

In [None]:
path = Path('/kaggle/input/ashrae-energy-prediction')

In [None]:
train = pd.read_csv(path/'train.csv')
train = train.iloc[:5000]
bldg = pd.read_csv(path/'building_metadata.csv')
weather_train = pd.read_csv(path/"weather_train.csv")

In [None]:
train = train[np.isfinite(train['meter_reading'])]

In [None]:
train.head()

In [None]:
bldg.head()

In [None]:
train = train.merge(bldg, on='building_id', how='left')

In [None]:
train.head()

In [None]:
weather_train.head()

In [None]:
train = train.merge(weather_train, on=['site_id', 'timestamp'])
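
The two merges above do not behave the same way: the building merge passes `how='left'`, while the weather merge relies on the pandas default, which is an inner join, so readings without a matching weather row are dropped. The dependency-free sketch below imitates both behaviours on a few toy rows. The `merge` helper and the sample values are hypothetical and only illustrate the join semantics; they are not how pandas implements it.

```python
def merge(left, right, keys, how="inner"):
    "Join two lists of dict rows on `keys`; `how` is 'inner' or 'left'."
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    out = []
    for row in left:
        matches = index.get(tuple(row[k] for k in keys), [])
        if matches:
            out.extend({**row, **m} for m in matches)
        elif how == "left":
            out.append(dict(row))  # keep the row; right-hand columns stay absent
    return out

# Toy rows (made-up values) shaped like the competition tables
readings = [
    {"site_id": 0, "timestamp": "2016-01-01 00:00:00", "meter_reading": 1.5},
    {"site_id": 0, "timestamp": "2016-01-01 01:00:00", "meter_reading": 2.0},
]
weather = [{"site_id": 0, "timestamp": "2016-01-01 00:00:00", "air_temperature": 25.0}]

keys = ["site_id", "timestamp"]
inner = merge(readings, weather, keys)          # drops the unmatched reading
left = merge(readings, weather, keys, "left")   # keeps it, without weather columns
print(len(inner), len(left))
```

If keeping every meter reading matters, passing `how='left'` to the weather merge as well and letting `FillMissing` handle the gaps is one option to consider.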

In [None]:
del weather_train

In [None]:
train["timestamp"] = pd.to_datetime(train["timestamp"])
train["hour"] = train["timestamp"].dt.hour
train["day"] = train["timestamp"].dt.day
train["weekend"] = train["timestamp"].dt.weekday  # despite the name, this is the day of week (0 = Monday ... 6 = Sunday)
train["month"] = train["timestamp"].dt.month

In [None]:
train.drop('timestamp', axis=1, inplace=True)
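
To make the date features concrete, here is a small standard-library sketch of the same extraction for a single timestamp. It assumes the `YYYY-MM-DD HH:MM:SS` format; the `date_features` helper and the extra `is_weekend` flag are hypothetical additions for illustration and are not part of the kernel.

```python
from datetime import datetime

def date_features(ts):
    "Split a timestamp string into simple calendar features."
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    weekday = dt.weekday()  # 0 = Monday ... 6 = Sunday
    return {
        "hour": dt.hour,
        "day": dt.day,
        "weekday": weekday,
        "is_weekend": weekday >= 5,  # hypothetical extra flag, not in the kernel
        "month": dt.month,
    }

feats = date_features("2016-01-02 13:00:00")  # a Saturday
print(feats)
```

Note that `weekday` is an integer from 0 to 6, so a column built from it encodes the day of the week rather than a weekend indicator.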

## Making the DataBunch

Next, just like in fastai v1, we need to declare a few things: our categorical and continuous variables, our preprocessors (Normalization, Categorification, and FillMissing), and how we want to split our data. `fastai` v2 now includes a `RandomSplitter`, which is similar to `.split_by_rand_pct()` but is called on the collection of items we want to split (hence `range_of(train)`).

In [None]:
cat_vars = ["building_id", "primary_use", "hour", "day", "weekend", "month", "meter"]
cont_vars = ["square_feet", "year_built", "air_temperature", "cloud_coverage",
              "dew_temperature"]
dep_var = 'meter_reading'

In [None]:
procs = [Normalize, Categorify, FillMissing]
splits = RandomSplitter()(range_of(train))
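
Conceptually, a random splitter shuffles the indices it is given and holds out a fraction of them for validation. The plain-Python sketch below shows that idea; the `random_split` helper, its `valid_pct=0.2` fraction, and the `seed` argument are assumptions for illustration, not fastai's implementation.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    "Return (train_idxs, valid_idxs), a shuffled partition of range(n_items)."
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(5000, valid_pct=0.2, seed=42)
print(len(train_idxs), len(valid_idxs))
```

The result has the same shape as `splits`: a pair of index lists, training first and validation second.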

Now that those are defined, we can create a `TabularPandas` object by passing in our dataframe, the `procs`, our variables, what our `y` is, and how we want to split our data. `fastai` v2 is built on a Pipeline structure: first we declare what we want done to the data, then we build the DataBunch from it (the high-level API is not finished yet, so there is no one-line equivalent for creating a DataBunch directly).

In [None]:
to = TabularPandas(train, procs, cat_vars, cont_vars, y_names=dep_var, splits=splits)
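
To get a feel for what the three `procs` do to a column, here is a rough plain-Python sketch on toy values. It is a conceptual illustration only: the helper names, the `#na#` placeholder, the median fill, and the use of the population standard deviation are assumptions, and fastai's own transforms differ in detail.

```python
from statistics import mean, median, pstdev

def categorify(values):
    "Map each category to an integer code; 0 is reserved for missing/unknown."
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    lookup = {v: i for i, v in enumerate(vocab)}
    return [lookup.get(v, 0) for v in values], vocab

def fill_missing(values):
    "Replace None with the median of the observed values and flag filled rows."
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values], [v is None for v in values]

def normalize(values):
    "Shift to zero mean and scale to unit (population) standard deviation."
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy columns (made-up values)
primary_use = ["Office", "Education", None, "Office"]
year_built = [1990, None, 2010, 2000]

codes, vocab = categorify(primary_use)
filled, was_missing = fill_missing(year_built)
scaled = normalize(filled)
print(codes, vocab)
print(filled, was_missing)
print([round(s, 3) for s in scaled])
```

In this sketch the missing values are filled before normalizing, so the mean and standard deviation are computed on complete data.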

If we look at what `to` actually is, we can see what looks to be a bunch of batches of our data aligned into a dataframe that can easily be read!

In [None]:
to

We can then also easily look at our training and validation datasets by calling `.train` or `.valid`

In [None]:
to.train

From here we can create our DataBunch object in one of two ways. We can either directly call `dbch = to.databunch()`, *or* we can take it one step further and apply custom options to the individual dataloaders. First let's look at the basic version.

In [None]:
dbch = to.databunch()
dbch.valid_dl.show_batch()

Now let's try the second way. We can increase the batch size for the validation set, since it is much smaller than our training set. We can also specify a few options for our training set. To do this, we need to create `TabDataLoader` objects to, well, load the data!

We pass in a dataset, a batch size, our `num_workers`, and whether we want to shuffle the dataset and drop the last batch if the data does not split evenly. You should always shuffle and drop the last batch for the **training** dataset, but not for the validation dataset. By default both options are set to `False`.

In [None]:
trn_dl = TabDataLoader(to.train, bs=64, num_workers=0, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128, num_workers=0)
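
The effect of `shuffle` and `drop_last` is easy to see on a tiny example. The sketch below is a plain-Python stand-in for a dataloader's batching logic; the `batches` helper and its `seed` argument are hypothetical and are not part of fastai.

```python
import random

def batches(items, bs, shuffle=False, drop_last=False, seed=None):
    "Yield lists of at most `bs` items, optionally shuffled first."
    items = list(items)
    if shuffle:
        random.Random(seed).shuffle(items)
    for start in range(0, len(items), bs):
        batch = items[start:start + bs]
        if drop_last and len(batch) < bs:
            break  # discard the short final batch
        yield batch

data = range(10)
train_batches = list(batches(data, bs=4, shuffle=True, drop_last=True, seed=0))
valid_batches = list(batches(data, bs=4))
print([len(b) for b in train_batches])  # [4, 4]
print([len(b) for b in valid_batches])  # [4, 4, 2]
```

Dropping the short batch keeps every training batch the same size, while the validation loader keeps it so that every sample is scored.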

Lastly we can create a `DataBunch` object by calling `DataBunch()` and passing in our two `DataLoaders`

In [None]:
dbunch = DataBunch(trn_dl, val_dl)
dbunch.valid_dl.show_batch()

As you can see, there are a *lot* of ways we can customize our DataBunches now.

## Training the Model

First we need to create a `TabularModel`, which needs the embedding matrix sizes, how many continuous variables to expect, the number of outputs (classes), and how big we want our layers to be. To get the embedding matrix sizes, we can call `get_emb_sz` on a `TabularPandas` object.

First let's define our embedding size rule of thumb, along with our `get_emb_sz` function

In [None]:
def emb_sz_rule(n_cat): 
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))
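
To see how this rule behaves, the sketch below repeats the function and evaluates it for a few category counts. The counts are arbitrary illustrative values, not the cardinalities of this dataset.

```python
def emb_sz_rule(n_cat):
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))

# Arbitrary category counts, from tiny to very large
for n_cat in (2, 8, 25, 1500, 10**6):
    print(n_cat, emb_sz_rule(n_cat))
```

The width grows much more slowly than the number of categories and is capped at 600, so even a very high-cardinality column gets a bounded embedding.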

In [None]:
def _one_emb_sz(classes, n, sz_dict=None):
    "Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
    sz_dict = ifnone(sz_dict, {})
    n_cat = len(classes[n])
    sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))  # rule of thumb
    return n_cat,sz

In [None]:
def get_emb_sz(to, sz_dict=None):
    "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
    return [_one_emb_sz(to.procs.classes, n, sz_dict) for n in to.cat_names]

Now we pass in our `TabularPandas` object, `to`

In [None]:
emb_szs = get_emb_sz(to); print(emb_szs)

The last piece of the puzzle we need is our basic `TabularModel`

In [None]:
class TabularModel(Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont, out_sz, layers, ps=None, embed_p=0., y_range=None, use_bn=True, bn_final=False):
        ps = ifnone(ps, [0]*len(layers))
        if not is_listy(ps): ps = [ps]*len(layers)
        self.embeds = nn.ModuleList([Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(embed_p)
        self.bn_cont = nn.BatchNorm1d(n_cont)
        n_emb = sum(e.embedding_dim for e in self.embeds)
        self.n_emb,self.n_cont,self.y_range = n_emb,n_cont,y_range
        sizes = [n_emb + n_cont] + layers + [out_sz]
        actns = [nn.ReLU(inplace=True) for _ in range(len(sizes)-2)] + [None]
        _layers = [BnDropLin(sizes[i], sizes[i+1], bn=use_bn and i!=0, p=p, act=a)
                       for i,(p,a) in enumerate(zip([0.]+ps,actns))]
        if bn_final: _layers.append(nn.BatchNorm1d(sizes[-1]))
        self.layers = nn.Sequential(*_layers)
    
    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
            x = torch.cat(x, 1)
            x = self.emb_drop(x)
        if self.n_cont != 0:
            x_cont = self.bn_cont(x_cont)
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
        x = self.layers(x)
        if self.y_range is not None:
            x = (self.y_range[1]-self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]
        return x
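
The `y_range` branch at the end of `forward` squashes the raw output into a fixed interval with a sigmoid. The scalar sketch below shows the same formula in plain Python; the `scale_to_range` helper and the `(0.0, 10.0)` bounds are hypothetical and chosen only for illustration (this kernel leaves `y_range` as `None`).

```python
import math

def scale_to_range(x, y_range):
    "Squash a raw output `x` into the interval (low, high) with a sigmoid."
    low, high = y_range
    return (high - low) / (1 + math.exp(-x)) + low

y_range = (0.0, 10.0)  # hypothetical bounds, for illustration only
for raw in (-5.0, 0.0, 5.0):
    print(raw, round(scale_to_range(raw, y_range), 3))
```

A raw output of zero maps to the midpoint of the interval, and large positive or negative outputs approach the upper and lower bounds without reaching them.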

As you may have noticed, most of what changed with the v2 API is focused on the data loading and DataBunch creation. The rest of this kernel should look very familiar to fastai users.

In [None]:
model = TabularModel(emb_szs, len(to.cont_names), 1, [1000,500]); model
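
The width of the first linear layer follows from the embedding sizes and the number of continuous variables, using the same `sizes` arithmetic as `TabularModel.__init__`. A small sketch of that bookkeeping is below; the `layer_sizes` helper and the toy embedding sizes are hypothetical, not the values printed by `get_emb_sz(to)`.

```python
def layer_sizes(emb_szs, n_cont, layers, out_sz):
    "Return (in_features, out_features) for each linear layer."
    n_emb = sum(nf for _, nf in emb_szs)  # total width of all embeddings
    sizes = [n_emb + n_cont] + layers + [out_sz]
    return list(zip(sizes[:-1], sizes[1:]))

# Hypothetical (n_categories, embedding_width) pairs, not the kernel's real output
toy_emb_szs = [(25, 10), (13, 7), (4, 3)]
pairs = layer_sizes(toy_emb_szs, n_cont=5, layers=[1000, 500], out_sz=1)
print(pairs)  # [(25, 1000), (1000, 500), (500, 1)]
```

Here the three embeddings contribute 20 features and the continuous variables add 5, so the first layer takes 25 inputs.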

Now we can define our optimization function and create our `Learner`

In [None]:
opt_func = partial(Adam, wd=0.01, eps=1e-5)
learn = Learner(dbunch, model, MSELossFlat(), opt_func=opt_func)

In [None]:
learn.fit_one_cycle(5)

I still need to track down the bug that keeps the model from fitting properly, and keep in mind this is only a subset of the data. Hope this helps you get started! :)

- muellerzr