# ML/DL techniques for Tabular Modelling
> An overview of machine learning and deep learning techniques for modelling tabular data, starting with decision trees.

- toc: true 
- badges: true
- comments: true

In [1]:
#hide
# !pip install -Uqq fastbook

import fastbook
fastbook.setup_book()

In [2]:
#hide
from fastbook import *
from kaggle import api
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

In [3]:
#hide

# api.competition_download_cli('bluebook-for-bulldozers')
# file_extract('bluebook-for-bulldozers.zip')
# df = pd.read_csv('/home/nitish/Downloads/bluebook-bulldozers/TrainAndValid.csv', low_memory=False)

## Introduction
Tabular modelling takes data in the form of a table, where we generally want to learn one column's value from the values of all the other columns. The column we want to learn is known as the dependent variable, and the others are known as independent variables. The task can be either a classification problem or a regression problem. We will look at various machine learning models such as decision trees and random forests, and also at what deep learning has to offer for tabular modelling.

## Dataset
I will be using the [Kaggle competition](https://www.kaggle.com/c/bluebook-for-bulldozers) dataset for all the models, so that it is easier to understand and compare them. I have loaded it into a dataframe df.

In [5]:
df.head()

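| | SalesID | SalePrice | MachineID | ModelID | ... | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1139246 | 66000.0 | 999089 | 3157 | ... | NaN | NaN | Standard | Conventional |
| 1 | 1139248 | 57000.0 | 117657 | 77 | ... | NaN | NaN | Standard | Conventional |
| 2 | 1139249 | 10000.0 | 434808 | 7009 | ... | NaN | NaN | NaN | NaN |
| 3 | 1139251 | 38500.0 | 1026470 | 332 | ... | NaN | NaN | NaN | NaN |
| 4 | 1139253 | 11000.0 | 1057373 | 17311 | ... | NaN | NaN | NaN | NaN |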


The key fields in train.csv are:

- SalesID: the unique identifier of the sale
- MachineID: the unique identifier of a machine. A machine can be sold multiple times
- SalePrice: what the machine sold for at auction (only provided in train.csv)
- saledate: the date of the sale

For this competition, we need to predict the log of the sale price of bulldozers sold at auctions. We will build different ML and DL models that predict $\log(\text{sale price})$.
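
As a minimal sketch of this target transformation, the snippet below takes the natural log of a price column. It assumes the natural logarithm is intended, uses a toy frame holding the five SalePrice values from the table above rather than the full dataset, and the column name `LogSalePrice` is a hypothetical one chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the first few SalePrice values shown by df.head() above.
toy = pd.DataFrame({"SalePrice": [66000.0, 57000.0, 10000.0, 38500.0, 11000.0]})

# Natural log of the price; "LogSalePrice" is a hypothetical column name.
toy["LogSalePrice"] = np.log(toy["SalePrice"])

# np.exp inverts the transform when a price in the original units is needed.
recovered = np.exp(toy["LogSalePrice"])
print(toy)
```

Working on the log scale means that errors are measured in relative terms, so being off by 10% counts the same for a cheap machine as for an expensive one.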

## Decision Trees
A decision tree splits the data based on the values of a column. For example, suppose we have data on different people, such as their age, whether they eat healthily, whether they exercise, etc., and we want to predict whether they are fit or unfit. We can then use the following decision tree.

![](images/blog5_1.png "Credit:fast.ai")

At each level, the data is divided into 2 groups for the next level. For example, at the first level, whether age<30 or not divides the whole dataset into 2 smaller datasets. The data is split again in the same way until we reach a leaf node holding one of the 2 classes: FIT or UNFIT.

In the real world, data is far more complex and contains many columns. For example, our dataframe df has 53 columns. So the question arises of which column to choose for each split, and at what value it should be split. The answer is to try every column and every value present in that column. So if there are n columns and each column has x different values, then we need to try n\*x splits and choose the best one according to some criterion. Each candidate split divides the data at that level into 2 groups, so we can take the average sale price of a group as the predicted sale price for all the rows in that group, and calculate the root mean squared error (rmse) between the predictions and the actual sale prices. The larger this number, the further our predictions are from the actual sale prices, and vice versa. So the algorithm for building a decision tree could be written as:
1. Loop through all the columns in the training dataset.
1. Loop through all the possible values of a column. If the column contains categorical data, then split on "equal to" a category and "not equal to" that category. If the column contains continuous data, then for each distinct value split on "less than or equal to" and "greater than" the value.
1. Find the average sale price of each group; this is our prediction. Calculate the rmse against the actual values of the sale price.
1. The rmse of a split can be taken as the sum of the rmse values of the groups it produces.
1. After looping through all the columns and all possible splits for each column, choose the split with the lowest rmse.
1. Continue the same process recursively on the child groups until some stopping criterion is reached, such as a maximum number of leaf nodes, a minimum number of data items per group, etc.
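
The steps above can be sketched in a few lines of Python. This is only an illustrative, brute-force version of steps 1 to 5 for a single split (the recursion in the last step is left out), not how any particular library implements it. The helper names `rmse` and `find_best_split`, and the toy columns `Enclosure` and `LogSalePrice`, are hypothetical.

```python
import numpy as np
import pandas as pd

def rmse(values):
    """rmse obtained when every row is predicted as the group mean."""
    return float(np.sqrt(np.mean((values - values.mean()) ** 2)))

def find_best_split(df, target):
    """Try every column and every distinct value; return the lowest-score split."""
    best = None  # (score, column, condition, value)
    y = df[target]
    for col in df.columns.drop(target):                    # step 1
        numeric = pd.api.types.is_numeric_dtype(df[col])
        for val in df[col].dropna().unique():              # step 2
            # "<=" for continuous columns, "==" for categorical ones
            mask = (df[col] <= val) if numeric else (df[col] == val)
            if mask.all() or not mask.any():
                continue                                   # skip empty groups
            score = rmse(y[mask]) + rmse(y[~mask])         # steps 3 and 4
            if best is None or score < best[0]:            # step 5
                best = (score, col, "<=" if numeric else "==", val)
    return best

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "YearMade": [1990, 1995, 2000, 2005, 2010, 2015],
    "Enclosure": ["open", "cab", "open", "cab", "open", "cab"],
    "LogSalePrice": [9.0, 9.0, 9.0, 11.0, 11.0, 11.0],
})
best = find_best_split(toy, "LogSalePrice")
print(best)
```

On this toy data the search picks a split on YearMade at 2000, since it separates the two price levels perfectly. The score here is the plain sum of the two group rmse values, as in step 4; weighting each group by its number of rows is a common alternative.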

Below is an example of a decision tree. In the root node, the value is simply the average over the whole training dataset, so the simplest prediction we could make for a new datapoint is to predict 10.1 every time. The mean squared error (mse) is 0.48, and there are 404710 samples in total, which is the number of samples in the training dataset.

Now for the split, the algorithm would have tried all the columns at all the possible values, and it came up with the split $Coupler\_System \leq 0.5$. This splits the whole dataset into two smaller datasets. When the condition is false, it results in 360847 samples, with an mse of 0.42 and an average value of 10.21. When the condition is true, it results in 43863 samples, with an mse of 0.12 and an average value of 9.21. This split has improved our prediction, because the sample-weighted average mse of the two children is lower than the mse of the root node: $(360847 \times 0.42 + 43863 \times 0.12)/404710 \approx 0.39 < 0.48$.
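
Because the two children differ greatly in size, it can help to weight each child's mse by its number of samples when comparing against the parent. A quick check with the numbers quoted above (the variable names are just for illustration):

```python
# Numbers quoted in the text for the root node and its two children.
parent_mse, parent_n = 0.48, 404710
children = [(0.42, 360847), (0.12, 43863)]  # (mse, samples)

# The children together should cover every sample of the parent.
total = sum(n for _, n in children)

# Sample-weighted average of the child mse values.
weighted_mse = sum(mse * n for mse, n in children) / total
print(total, round(weighted_mse, 2))  # 404710 0.39
```

The weighted value stays close to 0.42 because the larger child holds about 89% of the samples, yet it is still below the parent's 0.48.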

Similarly, splitting the "True condition child" on $YearMade \leq 0.42$ further decreases the mse, which means our predictions move even closer to the actual values.

![](images/blog5_2.png "Credit:fast.ai")
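
A small tree of this kind could be fit with scikit-learn's `DecisionTreeRegressor`, roughly as sketched below. The data here is synthetic and made up purely for illustration (it only borrows the column names Coupler_System and YearMade), so the thresholds and values it prints will not match the figure.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the real dataset (an assumption for illustration).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Coupler_System": rng.integers(0, 2, n),
    "YearMade": rng.integers(1980, 2012, n),
})
y = (10.2 - 1.0 * X["Coupler_System"]
     + 0.02 * (X["YearMade"] - 1995)
     + rng.normal(0, 0.1, n))

# Limit the tree to 4 leaves so that it stays small enough to read.
m = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X, y)

# Text rendering of the learned splits and leaf values.
print(export_text(m, feature_names=list(X.columns)))
```

`max_leaf_nodes` is one of the stopping criteria mentioned earlier; `min_samples_leaf` is another that scikit-learn provides.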

## Overfitting and Underfitting in decision trees
Underfitting in decision trees happens when we make very few splits or no splits at all. For example, the average value in the root node is 10.1, and using this single value as the prediction for every row is clearly too simple a solution to a complex problem, which is underfitting.

Overfitting happens when there are far too many splits, so that in the extreme case there is one sample per leaf node, which means the model has memorized the training dataset. It is overfitting because, although the mse will be 0 for the training dataset, it will typically be much higher for the validation dataset, as the model fails to generalize to unseen datapoints.
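
Both effects can be seen on a small synthetic example. The sketch below is an illustration under assumed data (a noisy sine curve, not the bulldozers dataset) and compares a tree with a single split, a moderately sized tree, and a tree grown without any limit. The helper `fit_and_score` is a hypothetical name.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, made up for illustration only.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=400)
X_train, y_train = X[:300], y[:300]
X_valid, y_valid = X[300:], y[300:]

def fit_and_score(**kwargs):
    """Fit a tree and return (training mse, validation mse)."""
    m = DecisionTreeRegressor(random_state=0, **kwargs).fit(X_train, y_train)
    return (mean_squared_error(y_train, m.predict(X_train)),
            mean_squared_error(y_valid, m.predict(X_valid)))

underfit = fit_and_score(max_leaf_nodes=2)  # a single split
moderate = fit_and_score(max_leaf_nodes=8)  # a handful of splits
overfit = fit_and_score()                   # grown until every leaf is pure

for name, (train, valid) in [("underfit", underfit),
                             ("moderate", moderate),
                             ("overfit", overfit)]:
    print(f"{name:9s} train mse={train:.3f}  valid mse={valid:.3f}")
```

The unrestricted tree reaches a training mse of 0 because every training point ends up in its own leaf, while its validation mse stays well above 0. The single-split tree has a high error on both sets.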