In [1]:
%load_ext autoreload
%autoreload 2 

In [2]:
from fastai.tabular.all import * 
from tabnet.utils import *
from tabnet.model import *

In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Abstract 

I propose a method, based on the TabNet architecture, for tabular learning problems where data is abundant but only a small part of it is labeled.
I demonstrate that semi-supervised learning improves the model's performance in the small labeled set setting, and check what amount of labeled data is enough to make it unnecessary.
I also demonstrate that curriculum learning improves on this further by improving the self-supervised step.

# Introduction 

Tabular data problems are still very prevalent in today's world, especially in big corporations that amass large amounts of data for analysis.

Even though this domain is popular, it's not as widely researched as computer vision, audio etc. For example, there are [papers](https://arxiv.org/abs/1604.07379) using self-supervised learning in CV problems as far back as 2016, while the first known one for tabular data was released in August 2019.

Even though large corporations usually have large amounts of data, in many of their tabular problems they have very few labeled examples as those are very expensive to get. To address the scenario where there isn't an abundance of labeled data, the common approach is to use oversampling methods such as [SMOTE](https://arxiv.org/pdf/1106.1813.pdf). Even though these methods sometimes improve the model's performance, the improvement is usually minor at best.

For these reasons, I wanted to implement a self-supervised approach for Tabular Data by learning the underlying representation and then using the pretrained model with the labeled data we have. 

In this project I wanted to test:  
1. If a model trained in a self-supervised fashion gives better results in the small labeled setting.
1. At what number of samples self-supervision becomes unnecessary.
1. If and how curriculum learning improves the outcome and the self-supervised step.

To do so I've implemented a relatively new (Aug 2019) deep learning model for tabular data called [TabNet](https://arxiv.org/pdf/1908.07442.pdf), which uses sequential attention to choose which features to look at in each step and which introduced tabular self-supervision for the first time (I couldn't find any implementation of the self-supervision, which is why I wanted to implement it). Furthermore, TabNet enables interpretability through its sequential attention. I've yet to implement this feature but plan on doing so.
I've also taken the time to learn the [fastai framework](https://docs.fast.ai/) (a DL framework implemented using `pytorch`) for this project, which helped me decouple the different parts and run experiments efficiently.

I've tested this approach on 2 different datasets: 
1. Adult Census Income - where the task is to distinguish whether a person's income is above $50,000
1. Forest Cover - classifying the forest cover type from cartographic variables.

# Methods 

### Basis for our model - TabNet 

The TabNet architecture uses an encoder-head architecture. 
The encoder is used to learn a better representation of the features in a sequential manner by using masked attention. It is the focus of the TabNet paper. 
The head (a simple FC block for example) then receives the encoder's output to solve the task at hand (classification / regression / decoding). 


##### Encoder
![image.png](attachment:image.png)
TabNet's encoder works by sequentially calculating masks (using an attention block) that are applied to the features. The masked features are then transformed at each step. Half of the transformed features are passed on as that step's output (to the head or decoder), while the other half are used by the next step's attention block.

The Encoder is built from 2 basic blocks: 
1. Feature Transformer - multiple stacks of blocks made up of FC, BN, GLUs with residual connections. The first few blocks are usually shared since the input's transformations should be the same across all steps. 
1. Attentive Transformer - creates the mask. A block consisting of an FC layer, BN and a Sparsemax activation (with an additional prior that keeps the same features from being used too many times).
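
To make the Attentive Transformer's masking concrete, here is a minimal sketch in plain Python (the actual model works on PyTorch tensors). It assumes the prior update described in the TabNet paper, where each step's mask scales the prior by `gamma - mask`. The function names `sparsemax` and `sequential_masks`, and the per-step scores passed in, are illustrative and are not part of the `tabnet` package used in this notebook.

```python
def sparsemax(z):
    """Project scores onto the probability simplex; unlike softmax, entries can be exactly 0."""
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        if 1 + k * z_k > cumsum:       # z_k is still inside the support
            tau = (cumsum - 1) / k     # threshold for a support of size k
    return [max(z_i - tau, 0.0) for z_i in z]


def sequential_masks(step_scores, gamma=1.5):
    """Return one feature mask per step, discouraging reuse of already-selected features."""
    n_features = len(step_scores[0])
    prior = [1.0] * n_features         # every feature starts fully available
    masks = []
    for scores in step_scores:
        mask = sparsemax([p * s for p, s in zip(prior, scores)])
        masks.append(mask)
        # A feature that received mask weight m has its prior scaled by (gamma - m)
        prior = [p * (gamma - m) for p, m in zip(prior, mask)]
    return masks


# Hypothetical attention scores (the FC + BN output) for 2 steps over 3 features
masks = sequential_masks([[2.0, 0.0, 0.0], [2.0, 1.5, 0.0]], gamma=1.0)
```

With `gamma=1.0` a feature can be selected in only one step, so the second mask is pushed away from the first feature even though its raw score is the highest. Larger values, such as the `gamma=1.5` used in the experiments, relax that constraint.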


##### Head 
A simple layer that adds up the outputs from all of the encoder's steps and passes the sum through an FC layer.
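
As a rough illustration of the head described above, the sketch below sums the per-step outputs and applies one fully connected layer. It uses plain Python lists instead of tensors, and the name `head` and the example weights are made up for illustration.

```python
def head(step_outputs, weight, bias):
    """Sum the per-step encoder outputs, then apply a single fully connected layer."""
    summed = [sum(vals) for vals in zip(*step_outputs)]
    return [sum(w * s for w, s in zip(row, summed)) + b
            for row, b in zip(weight, bias)]


# Two steps, each producing a 2-dimensional output, mapped to a single logit
logits = head([[1.0, 2.0], [3.0, 4.0]], weight=[[0.5, -1.0]], bias=[0.25])
```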


##### Self Supervised Training 
![image.png](attachment:image.png)
The self-supervised training works by creating a mask `S`, applying it to keep some of the features, and then trying to reconstruct the remaining `1-S` features.
To implement it, we need to replace the problem's loss (MSE, CE etc.) with a loss that takes the mask into account, and to swap the model's head for a decoder.

1. Decoder - As seen above, we used the proposed architecture: a `Feature Transformer` followed by an FC layer for each step, with all the results added up.
1. Loss - We used the proposed `Reconstruction Loss`, which is similar to MSE/MAE over the reconstructed features (those hidden from the encoder), with each feature's error normalized since the features are scaled differently.
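
The loss described above can be sketched as follows. This is a simplified, plain-Python version that assumes the normalization from the TabNet paper (dividing each feature's error by its population standard deviation over the batch). Averaging over the batch and the name `reconstruction_loss` are my own choices for the example, and the implementation in the `tabnet` package may differ.

```python
from statistics import pstdev


def reconstruction_loss(x, x_hat, hidden):
    """Squared error on hidden features only, each scaled by that feature's batch std.

    x, x_hat: lists of rows (true and reconstructed features).
    hidden:   same shape, 1 where the feature was hidden from the encoder (the `1-S` part).
    """
    n_features = len(x[0])
    # Per-feature population std over the batch; fall back to 1.0 for constant columns
    stds = [pstdev([row[j] for row in x]) or 1.0 for j in range(n_features)]
    total = 0.0
    for row, row_hat, row_hidden in zip(x, x_hat, hidden):
        for j in range(n_features):
            total += (row_hidden[j] * (row_hat[j] - row[j]) / stds[j]) ** 2
    return total / len(x)  # average over the batch


x      = [[0.0, 10.0], [2.0, 30.0]]
x_hat  = [[1.0, 10.0], [2.0, 20.0]]
hidden = [[1, 0], [0, 1]]
loss = reconstruction_loss(x, x_hat, hidden)
```

In the example, an error of 1 on the first feature and an error of 10 on the second contribute equally, because the second feature's values are spread ten times wider.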



### Changes I made 

#### Self Supervised training 

I've implemented the self-supervised training in a curriculum learning fashion: instead of masking each feature independently with some probability `p`, which yields a varying number of masked features at every iteration, we progressively mask more features (making the problem harder) as the number of iterations grows.
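
A minimal sketch of such a schedule is shown below. The linear ramp, the `start_frac` and `end_frac` parameters, and the function names are assumptions made for illustration, and the schedule used in the experiments may differ.

```python
import random


def n_hidden(step, total_steps, n_features, start_frac=0.1, end_frac=0.5):
    """Number of features to hide at `step`, growing linearly over training."""
    progress = min(step / max(total_steps - 1, 1), 1.0)
    frac = start_frac + (end_frac - start_frac) * progress
    # Always hide at least one feature and keep at least one visible
    return max(1, min(n_features - 1, round(frac * n_features)))


def curriculum_hidden(step, total_steps, n_features, rng=random):
    """0/1 vector with exactly `n_hidden(...)` ones, at randomly chosen positions."""
    k = n_hidden(step, total_steps, n_features)
    chosen = set(rng.sample(range(n_features), k))
    return [1 if j in chosen else 0 for j in range(n_features)]


rng = random.Random(0)
early = curriculum_hidden(0, 100, 10, rng)    # few features hidden: easy task
late = curriculum_hidden(99, 100, 10, rng)    # more features hidden: harder task
```

Unlike per-feature Bernoulli masking, every sample at a given iteration has exactly the same number of hidden features, so the difficulty is controlled only by the schedule.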


1. RNN
1. Self Supervised Loss
1. Dropout 

### Datasets

I've tested this approach on 2 different datasets: 
1. Adult Census Income - where the task is to distinguish whether a person's income is above $50,000
1. Forest Cover - classifying the forest cover type from cartographic variables.

# Adult 

In [None]:
adult_path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(adult_path/'adult.csv')
params = dict(cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
            cont_names = ['age', 'fnlwgt', 'education-num'], y_names='salary')
model_params = dict(n_d=16, n_a=16, lambda_sparse=1e-4, bs=1024*4, 
                          virtual_batch_size=128, n_steps=5, gamma=1.5)

# Forest Cover DS

In [None]:
data_dir = Path('./data')

In [None]:
def extract_gzip(file, dest=None):
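    import gzip, shutil
    file = Path(file)
    # Default to the archive's own directory; `Path(None)` would raise a TypeError
    dest = Path(dest) if dest is not None else file.parent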
    with gzip.open(file, 'rb') as f_in:
        with open(dest / file.stem, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

In [None]:
forest_type_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'
forest_path = untar_data(forest_type_url, dest=data_dir, extract_func=extract_gzip)

In [None]:
target = "Covertype"

cat_names = [
    "Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3",
    "Wilderness_Area4", "Soil_Type1", "Soil_Type2", "Soil_Type3", "Soil_Type4",
    "Soil_Type5", "Soil_Type6", "Soil_Type7", "Soil_Type8", "Soil_Type9",
    "Soil_Type10", "Soil_Type11", "Soil_Type12", "Soil_Type13", "Soil_Type14",
    "Soil_Type15", "Soil_Type16", "Soil_Type17", "Soil_Type18", "Soil_Type19",
    "Soil_Type20", "Soil_Type21", "Soil_Type22", "Soil_Type23", "Soil_Type24",
    "Soil_Type25", "Soil_Type26", "Soil_Type27", "Soil_Type28", "Soil_Type29",
    "Soil_Type30", "Soil_Type31", "Soil_Type32", "Soil_Type33", "Soil_Type34",
    "Soil_Type35", "Soil_Type36", "Soil_Type37", "Soil_Type38", "Soil_Type39",
    "Soil_Type40"
]

cont_names = [
    "Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points"
]

feature_columns = (
    cont_names + cat_names + [target])

params = dict(cont_names = cont_names, y_names = target, cat_names = cat_names)
procs=[Categorify, FillMissing, Normalize]
model_params = dict(n_d=64, n_a=64, n_steps=5, virtual_batch_size=512, gamma=1.5, bs=1024*16, lambda_sparse=1e-5)

In [None]:
df = pd.read_csv(forest_path, header=None, names=feature_columns).sample(n=200_000)
df.shape

# Self Supervision + Number of Epochs
In this experiment we'll train 2 models, one without self-supervision and one with it, to see which one does better and whether the number of epochs matters.

In [None]:
%%capture 
res = L([score_before_after_ss(df, params, model_params) for i in range(10)])

In [None]:
before = res.itemgot(0).map(lambda b: accuracy(*b))
after = res.itemgot(1).map(lambda b: accuracy(*b))

pd.DataFrame({'before': before, 'after': after}).agg(['mean', 'std'])

In [None]:
%%capture 
res = L([score_before_after_ss(df, params, model_params, cycle_lr=[(20, 1e-1/2)]*3) for i in range(5)])

In [None]:
before = res.itemgot(0).map(lambda b: accuracy(*b))
after = res.itemgot(1).map(lambda b: accuracy(*b))

pd.DataFrame({'before': before, 'after': after}).agg(['mean', 'std'])

# Self Supervision + Curriculum
In this experiment we'll train 2 models with self-supervised learning, one with curriculum learning and one without, and check whether the curriculum improves the score.

In [None]:
%%capture 
res = L([score_before_after_ss(df, params, model_params, cycle_lr=[(15, 1e-1/2)]*3) 
                             for i in range(5)])

In [None]:
before = res.itemgot(0).map(lambda b: accuracy(*b))
after = res.itemgot(1).map(lambda b: accuracy(*b))

pd.DataFrame({'before': before, 'after': after}).agg(['mean', 'std'])

In [None]:
%%capture 
res = L([score_before_after_ss(df, params, model_params, cycle_lr=[(15, 1e-1/2)]*3, curriculum=True) 
                             for i in range(2)])

In [None]:
before = res.itemgot(0).map(lambda b: accuracy(*b))
after = res.itemgot(1).map(lambda b: accuracy(*b))

pd.DataFrame({'before': before, 'after': after}).agg(['mean', 'std'])

# Self Supervision For Small Label Regime 
In this experiment we'll check the effect of self-supervised learning on problems without a lot of labels.

In [None]:
df.shape

In [None]:
tp = tabular_pandas(df, **params, val_pct=0.2)
tp.train.ys[tp.y_names[0]].value_counts()

In [None]:
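new_params = {**model_params, 'bs': 100}  # same values as the later `new_params` cell; defined here so the notebook runs top to bottom
learn = tabnet_df_classifier(df, **params, tabnet_args=new_params, val_pct=0.2)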

In [None]:
learn.fit_one_cycle(30, 1e-1/2)

In [None]:
learn = tabnet_df_classifier(df, **params, tabnet_args=new_params, val_pct=0.995)

In [None]:
learn.fit_one_cycle(30, 1e-1/2)

In [None]:
learn_ss = tabnet_df_self_sup(df, **params, tabnet_args=model_params)
learn_ss.dls.train.n, learn_ss.dls.valid.n

In [None]:
learn_ss.fit_one_cycle(50, 1e-1/2)

In [None]:
learn_ss.dls.ys.iloc[:,0].value_counts()

In [None]:
tp = tabular_pandas(df, **params, val_pct=0.995)
tp.train.ys[tp.y_names[0]].value_counts()

In [None]:
new_params = model_params.copy()
new_params['bs'] = 100

In [None]:
learn = tabnet_df_classifier(df, **params, tabnet_args=new_params, enc=learn_ss.model.enc, val_pct=0.995)
learn.dls.train.n, learn.dls.valid.n

In [None]:
learn.fit_one_cycle(30, 1e-1/2)