<a href="https://colab.research.google.com/github/iloncka/DeepGBM/blob/master/DeepGBM_wine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installations

In [None]:
#@title
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#@title
!pip install torch==1.2.0 torchvision==0.4.0 tensorboardx LightGBM==2.2.1 scikit-learn==0.19.2 category-encoders tqdm 



In [None]:
#@title
!pip install jupyter_contrib_nbextensions



In [None]:
#@title
!jupyter contrib nbextension install --user

[I 18:54:54 InstallContribNbextensionsApp] jupyter contrib nbextension install --user
[I 18:54:54 InstallContribNbextensionsApp] Installing jupyter_contrib_nbextensions nbextension files to jupyter data directory
[I 18:54:54 InstallContribNbextensionsApp] Installing /usr/local/lib/python3.7/dist-packages/jupyter_contrib_nbextensions/nbextensions/comment-uncomment -> comment-uncomment
[I 18:54:54 InstallContribNbextensionsApp] Up to date: /root/.local/share/jupyter/nbextensions/comment-uncomment/icon.png
[I 18:54:54 InstallContribNbextensionsApp] Up to date: /root/.local/share/jupyter/nbextensions/comment-uncomment/main.js
[I 18:54:54 InstallContribNbextensionsApp] Up to date: /root/.local/share/jupyter/nbextensions/comment-uncomment/comment-uncomment.yaml
[I 18:54:54 InstallContribNbextensionsApp] Up to date: /root/.local/share/jupyter/nbextensions/comment-uncomment/readme.md
[I 18:54:54 InstallContribNbextensionsApp] - Va

In [None]:
#@title
!pip install jupyter_nbextensions_configurator



# DeepGBM

Guolin Ke, Zhenhui Xu, Jia Zhang, Jiang Bian, and Tie-Yan Liu. "DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks." In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2019.  

[Article](https://dl.acm.org/doi/10.1145/3292500.3330858)  

[GitHub](https://github.com/motefly/DeepGBM)


In [None]:
import pandas as pd
# pd.set_option('display.max_columns', 200)

In [None]:
import numpy as np
import category_encoders as ce
from tqdm import tqdm
import collections, os
import gc
import pdb

The preprocessing code below is taken from the paper's implementation.

In [None]:
#@title
def unpackbits(x,num_bits):
    xshape = list(x.shape)
    x = x.reshape([-1,1])
    to_and = 2**np.arange(num_bits).reshape([1,num_bits])
    return (x & to_and).astype(bool).astype(int).reshape(xshape + [num_bits])

class NumEncoder(object):
    def __init__(self, cate_col, nume_col, threshold, thresrate, label):
        self.label_name = label
        # cate_col = list(df.select_dtypes(include=['object']))
        self.cate_col = cate_col
        # nume_col = list(set(list(df)) - set(cate_col))
        self.dtype_dict = {}
        for item in cate_col:
            self.dtype_dict[item] = 'str'
        for item in nume_col:
            self.dtype_dict[item] = 'float'
        self.nume_col = nume_col
        self.tgt_nume_col = []
        self.encoder = ce.ordinal.OrdinalEncoder(cols=cate_col)
        self.threshold = threshold
        self.thresrate = thresrate
        # for online update, to do
        self.save_cate_avgs = {}
        self.save_value_filter = {}
        self.save_num_embs = {}
        self.Max_len = {}
        self.samples = 0

    def fit_transform(self, inPath, outPath):
        print('----------------------------------------------------------------------')
        print('Fitting and Transforming %s .'%inPath)
        print('----------------------------------------------------------------------')
        df = pd.read_csv(inPath, dtype=self.dtype_dict)
        self.samples = df.shape[0]
        print('Filtering and fillna features')
        for item in tqdm(self.cate_col):
            value_counts = df[item].value_counts()
            num = value_counts.shape[0]
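            # keep the most frequent values, then drop those at or below the count threshold
            top_values = value_counts[:int(num*self.thresrate)]
            self.save_value_filter[item] = list(top_values[top_values>self.threshold].index)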
            rm_values = set(value_counts.index)-set(self.save_value_filter[item])
            df[item] = df[item].map(lambda x: '<LESS>' if x in rm_values else x)
            df[item] = df[item].fillna('<UNK>')
            del value_counts
            gc.collect()

        for item in tqdm(self.nume_col):
            df[item] = df[item].fillna(df[item].mean())
            self.save_num_embs[item] = {'sum':df[item].sum(), 'cnt':df[item].shape[0]}

        print('Ordinal encoding cate features')
        # ordinal_encoding
        df = self.encoder.fit_transform(df)

        print('Target encoding cate features')
        # dynamic_targeting_encoding
        for item in tqdm(self.cate_col):
            feats = df[item].values
            labels = df[self.label_name].values
            feat_encoding = {'mean':[], 'count':[]}
            feat_temp_result = collections.defaultdict(lambda : [0, 0])
            self.save_cate_avgs[item] = collections.defaultdict(lambda : [0, 0])
            for idx in range(self.samples):
                cur_feat = feats[idx]
                # smoothing optional
                if cur_feat in self.save_cate_avgs[item]:
                    # feat_temp_result[cur_feat][0] = 0.9*feat_temp_result[cur_feat][0] + 0.1*self.save_cate_avgs[item][cur_feat][0]/self.save_cate_avgs[item][cur_feat][1]
                    # feat_temp_result[cur_feat][1] = 0.9*feat_temp_result[cur_feat][1] + 0.1*self.save_cate_avgs[item][cur_feat][1]/idx
                    feat_encoding['mean'].append(self.save_cate_avgs[item][cur_feat][0]/self.save_cate_avgs[item][cur_feat][1])
                    feat_encoding['count'].append(self.save_cate_avgs[item][cur_feat][1]/idx)
                else:
                    feat_encoding['mean'].append(0)
                    feat_encoding['count'].append(0)
                self.save_cate_avgs[item][cur_feat][0] += labels[idx]
                self.save_cate_avgs[item][cur_feat][1] += 1
            df[item+'_t_mean'] = feat_encoding['mean']
            df[item+'_t_count'] = feat_encoding['count']
            self.tgt_nume_col.append(item+'_t_mean')
            self.tgt_nume_col.append(item+'_t_count')
        
        print('Start manual binary encode')
        rows = None
        for item in tqdm(self.nume_col+self.tgt_nume_col):
            feats = df[item].values
            if rows is None:
                rows = feats.reshape((-1,1))
            else:
                rows = np.concatenate([rows,feats.reshape((-1,1))],axis=1)
            del feats
            gc.collect()
        for item in tqdm(self.cate_col):
            feats = df[item].values
            Max = df[item].max()
            bit_len = len(bin(Max)) - 2
            samples = self.samples
            self.Max_len[item] = bit_len
            res = unpackbits(feats, bit_len).reshape((samples,-1))
            rows = np.concatenate([rows,res],axis=1)
            del feats
            gc.collect()
        trn_y = np.array(df[self.label_name].values).reshape((-1,1))
        del df
        gc.collect()
        trn_x = np.array(rows)
        np.save(outPath+'_features.npy', trn_x)
        np.save(outPath+'_labels.npy', trn_y)

    # for test dataset
    def transform(self, inPath, outPath):
        print('----------------------------------------------------------------------')
        print('Transforming %s .'%inPath)
        print('----------------------------------------------------------------------')
        df = pd.read_csv(inPath, dtype=self.dtype_dict)
        samples = df.shape[0]
        print('Filtering and fillna features')
        for item in tqdm(self.cate_col):
            value_counts = df[item].value_counts()
            rm_values = set(value_counts.index)-set(self.save_value_filter[item])
            df[item] = df[item].map(lambda x: '<LESS>' if x in rm_values else x)
            df[item] = df[item].fillna('<UNK>')

        for item in tqdm(self.nume_col):
            mean = self.save_num_embs[item]['sum'] / self.save_num_embs[item]['cnt']
            df[item] = df[item].fillna(mean)

        print('Ordinal encoding cate features')
        # ordinal_encoding
        df = self.encoder.transform(df)

        print('Target encoding cate features')
        # dynamic_targeting_encoding
        for item in tqdm(self.cate_col):
            avgs = self.save_cate_avgs[item]
            df[item+'_t_mean'] = df[item].map(lambda x: avgs[x][0]/avgs[x][1] if x in avgs else 0)
            df[item+'_t_count'] = df[item].map(lambda x: avgs[x][1]/self.samples if x in avgs else 0)
        
        print('Start manual binary encode')
        rows = None
        for item in tqdm(self.nume_col+self.tgt_nume_col):
            feats = df[item].values
            if rows is None:
                rows = feats.reshape((-1,1))
            else:
                rows = np.concatenate([rows,feats.reshape((-1,1))],axis=1)
            del feats
            gc.collect()
        for item in tqdm(self.cate_col):
            feats = df[item].values
            bit_len = self.Max_len[item]
            res = unpackbits(feats, bit_len).reshape((samples,-1))
            rows = np.concatenate([rows,res],axis=1)
            del feats
            gc.collect()
        vld_y = np.array(df[self.label_name].values).reshape((-1,1))
        del df
        gc.collect()
        vld_x = np.array(rows)
        np.save(outPath+'_features.npy', vld_x)
        np.save(outPath+'_labels.npy', vld_y)
    
    # for update online dataset
    def refit_transform(self, inPath, outPath):
        print('----------------------------------------------------------------------')
        print('Refitting and Transforming %s .'%inPath)
        print('----------------------------------------------------------------------')
        df = pd.read_csv(inPath, dtype=self.dtype_dict)
        samples = df.shape[0]
        print('Filtering and fillna features')
        for item in tqdm(self.cate_col):
            value_counts = df[item].value_counts()
            rm_values = set(value_counts.index)-set(self.save_value_filter[item])
            df[item] = df[item].map(lambda x: '<LESS>' if x in rm_values else x)
            df[item] = df[item].fillna('<UNK>')

        for item in tqdm(self.nume_col):
            self.save_num_embs[item]['sum'] += df[item].sum()
            self.save_num_embs[item]['cnt'] += df[item].shape[0]
            mean = self.save_num_embs[item]['sum'] / self.save_num_embs[item]['cnt']
            df[item] = df[item].fillna(mean)

        print('Ordinal encoding cate features')
        # ordinal_encoding
        df = self.encoder.transform(df)

        print('Target encoding cate features')
        # dynamic_targeting_encoding
        for item in tqdm(self.cate_col):
            feats = df[item].values
            labels = df[self.label_name].values
            feat_encoding = {'mean':[], 'count':[]}
            for idx in range(samples):
                cur_feat = feats[idx]
                if self.save_cate_avgs[item][cur_feat][1] == 0:
                    # category not seen so far: fall back to 0, as fit_transform does
                    feat_encoding['mean'].append(0)
                    feat_encoding['count'].append(0)
                else:
                    feat_encoding['mean'].append(self.save_cate_avgs[item][cur_feat][0]/self.save_cate_avgs[item][cur_feat][1])
                    feat_encoding['count'].append(self.save_cate_avgs[item][cur_feat][1]/(self.samples+idx))
                self.save_cate_avgs[item][cur_feat][0] += labels[idx]
                self.save_cate_avgs[item][cur_feat][1] += 1
            df[item+'_t_mean'] = feat_encoding['mean']
            df[item+'_t_count'] = feat_encoding['count']

        self.samples += samples
            
        print('Start manual binary encode')
        rows = None
        for item in tqdm(self.nume_col+self.tgt_nume_col):
            feats = df[item].values
            if rows is None:
                rows = feats.reshape((-1,1))
            else:
                rows = np.concatenate([rows,feats.reshape((-1,1))],axis=1)
            del feats
            gc.collect()
        for item in tqdm(self.cate_col):
            feats = df[item].values
            bit_len = self.Max_len[item]
            res = unpackbits(feats, bit_len).reshape((samples,-1))
            rows = np.concatenate([rows,res],axis=1)
            del feats
            gc.collect()
        vld_y = np.array(df[self.label_name].values).reshape((-1,1))
        del df
        gc.collect()
        vld_x = np.array(rows)
        np.save(outPath+'_features.npy', vld_x)
        np.save(outPath+'_labels.npy', vld_y)
        # to do
        pass
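
The manual binary encoding step in `NumEncoder` turns each ordinal category code into a fixed-width vector of bits. The following is a minimal sketch of that step on its own; the `codes` array is a made-up example, not data from the notebook, and the helper is copied from the cell above so the snippet runs standalone.

```python
import numpy as np

def unpackbits(x, num_bits):
    # Same logic as the notebook's helper: bit k of each value, least-significant bit first.
    xshape = list(x.shape)
    x = x.reshape([-1, 1])
    to_and = 2 ** np.arange(num_bits).reshape([1, num_bits])
    return (x & to_and).astype(bool).astype(int).reshape(xshape + [num_bits])

# Hypothetical ordinal codes for one categorical column.
codes = np.array([1, 2, 5])

# Width is chosen as in fit_transform: the bit length of the largest code.
bit_len = len(bin(int(codes.max()))) - 2   # 5 -> '0b101' -> 3 bits

bits = unpackbits(codes, bit_len)
print(bits)
# [[1 0 0]
#  [0 1 0]
#  [1 0 1]]
```

Because the width is fixed from the training maximum (`self.Max_len`), any code in the test data that needs more bits than that would lose its high bits when encoded.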


In [None]:
#@title
class CateEncoder(object):
    def __init__(self, cate_col, nume_col, threshold, thresrate, bins, label):
        self.label_name = label
        # cate_col = list(df.select_dtypes(include=['object']))
        self.cate_col = cate_col 
        # nume_col = list(set(list(df)) - set(cate_col))
        self.dtype_dict = {}
        for item in cate_col:
            self.dtype_dict[item] = 'str'
        for item in nume_col:
            self.dtype_dict[item] = 'float'
        self.nume_col = nume_col
        self.encoder = ce.ordinal.OrdinalEncoder(cols=cate_col+nume_col)
        self.threshold = threshold
        self.thresrate = thresrate
        self.bins = bins
        # for online update, to do
        self.save_value_filter = {}
        self.save_num_bins = {}
        self.samples = 0

    def save2npy(self, df, out_dir):
        # create the output directory (and any missing parents) if needed
        os.makedirs(out_dir, exist_ok=True)
        result = {'label':[], 'index':[],'feature_sizes':[]}
        result['label'] = df[self.label_name].values
        result['index'] = df[self.cate_col+self.nume_col].values
        for item in self.cate_col+self.nume_col:
            result['feature_sizes'].append(df[item].max()+1)
        for item in result:
            result[item] = np.array(result[item])
            np.save(out_dir + '_' + item +'.npy', result[item])

    def fit_transform(self, inPath, outPath):
        print('----------------------------------------------------------------------')
        print('Fitting and Transforming %s .'%inPath)
        print('----------------------------------------------------------------------')
        df = pd.read_csv(inPath, dtype=self.dtype_dict)
        print('Filtering and fillna features')
        for item in tqdm(self.cate_col):
            value_counts = df[item].value_counts()
            num = value_counts.shape[0]
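            # keep the most frequent values, then drop those at or below the count threshold
            top_values = value_counts[:int(num*self.thresrate)]
            self.save_value_filter[item] = list(top_values[top_values>self.threshold].index)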
            rm_values = set(value_counts.index)-set(self.save_value_filter[item])
            df[item] = df[item].map(lambda x: '<LESS>' if x in rm_values else x)
            df[item] = df[item].fillna('<UNK>')

        print('Fillna and Bucketize numeric features')
        for item in tqdm(self.nume_col):
            q_res = pd.qcut(df[item], self.bins, labels=False, retbins=True, duplicates='drop')
            df[item] = q_res[0].fillna(-1).astype('int')
            self.save_num_bins[item] = q_res[1]

        print('Ordinal encoding cate features')
        # ordinal_encoding
        df = self.encoder.fit_transform(df)
        self.save2npy(df, outPath)
        # df.to_csv(outPath, index=False)

    # for test dataset
    def transform(self, inPath, outPath):
        print('----------------------------------------------------------------------')
        print('Transforming %s .'%inPath)
        print('----------------------------------------------------------------------')
        df = pd.read_csv(inPath, dtype=self.dtype_dict)
        print('Filtering and fillna features')
        for item in tqdm(self.cate_col):
            value_counts = df[item].value_counts()
            rm_values = set(value_counts.index)-set(self.save_value_filter[item])
            df[item] = df[item].map(lambda x: '<LESS>' if x in rm_values else x)
            df[item] = df[item].fillna('<UNK>')

        for item in tqdm(self.nume_col):
            df[item] = pd.cut(df[item], self.save_num_bins[item], labels=False, include_lowest=True).fillna(-1).astype('int')

        print('Ordinal encoding cate features')
        # ordinal_encoding
        df = self.encoder.transform(df)
        self.save2npy(df, outPath)
        # df.to_csv(outPath, index=False)
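
`CateEncoder` bucketizes numeric columns in two phases: `pd.qcut` learns quantile bin edges on the training data, and `pd.cut` reuses those edges on the test data. The sketch below isolates that pattern on toy series (the values and the choice of 4 bins are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy data standing in for one numeric column.
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
test = pd.Series([0.5, 1.0, 4.0, 8.0, 9.0, np.nan])

# Fit: quantile bins from the training data; retbins=True also returns the edges.
train_codes, edges = pd.qcut(train, 4, labels=False, retbins=True, duplicates='drop')

# Transform: reuse the stored edges. NaN and values outside
# [edges[0], edges[-1]] come back as NaN and are mapped to -1.
test_codes = (
    pd.cut(test, edges, labels=False, include_lowest=True)
    .fillna(-1)
    .astype('int')
)

print(train_codes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
print(test_codes.tolist())   # [-1, 0, 1, 3, -1, -1]
```

Note that test values outside the training range share the `-1` bucket with missing values, so the model cannot tell them apart.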

## Data loading

In [None]:
# file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
file_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
# file_url = "https://storage.yandexcloud.net/datasouls-ods/materials/3b9757b5/train_data.csv"
# file_url = "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv"
df = pd.read_csv(file_url, sep=';')
df.head(2)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |


In [None]:
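label_col = 'quality'
nume_col = df.select_dtypes('number').columns.tolist()
cate_col = df.select_dtypes('object').columns.tolist()
# 'quality' is numeric, so select_dtypes('number') also puts the label into nume_col
# (12 columns in total, as in the logs below). To keep the label out of the features:
# nume_col = [c for c in nume_col if c != label_col]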

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test = train_test_split(
    df, test_size=0.2, random_state=42)

In [None]:
out_dir = '/content/drive/MyDrive/DeepGBM/experiments/data/data_offline'
threshold = 10
thresrate = 0.99
num_bins = 32
test_csv_path = os.path.join(out_dir, 'test.csv')
train_csv_path = os.path.join(out_dir, 'train.csv')

In [None]:
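os.makedirs(out_dir, exist_ok=True)  # make sure the target directory exists
# index=False avoids writing the row index as an extra unnamed column
X_train.to_csv(train_csv_path, index=False)
X_test.to_csv(test_csv_path, index=False)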

## Numeric feature preprocessing

In [None]:
out_dir_num = '/content/drive/MyDrive/DeepGBM/experiments/data/data_offline_num'
os.makedirs(out_dir_num, exist_ok=True)
ec = NumEncoder(cate_col, nume_col, threshold, thresrate, label_col)
ec.fit_transform(train_csv_path, out_dir_num + '/train')
ec.transform(test_csv_path, out_dir_num + '/test')


----------------------------------------------------------------------
Fitting and Transforming /content/drive/MyDrive/DeepGBM/experiments/data/data_offline/train.csv .
----------------------------------------------------------------------
Filtering and fillna features


0it [00:00, ?it/s]
100%|██████████| 12/12 [00:00<00:00, 613.90it/s]


Ordinal encoding cate features
Target encoding cate features


0it [00:00, ?it/s]


Start manual binary encode


100%|██████████| 12/12 [00:01<00:00, 10.41it/s]
0it [00:00, ?it/s]


----------------------------------------------------------------------
Transforming /content/drive/MyDrive/DeepGBM/experiments/data/data_offline/test.csv .
----------------------------------------------------------------------
Filtering and fillna features


0it [00:00, ?it/s]
100%|██████████| 12/12 [00:00<00:00, 1559.46it/s]


Ordinal encoding cate features
Target encoding cate features


0it [00:00, ?it/s]


Start manual binary encode


100%|██████████| 12/12 [00:00<00:00, 13.16it/s]
0it [00:00, ?it/s]
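
The target-encoding progress bars in the log above read `0it` because `cate_col` is empty for the wine data, so that step never runs here. To show what it computes when categorical columns are present, here is a minimal sketch of the running ("dynamic") target encoding from `NumEncoder.fit_transform`: each row is encoded using only the labels of earlier rows. The function name `running_target_encode` and the toy inputs are hypothetical.

```python
import collections

def running_target_encode(feats, labels):
    """Encode each row from label statistics of the rows before it."""
    # category -> [sum of labels seen so far, number of rows seen so far]
    stats = collections.defaultdict(lambda: [0.0, 0])
    means, freqs = [], []
    for idx, (cat, y) in enumerate(zip(feats, labels)):
        if cat in stats:
            total, cnt = stats[cat]
            means.append(total / cnt)   # mean label of earlier rows with this category
            freqs.append(cnt / idx)     # share of earlier rows with this category
        else:
            # first occurrence: nothing to average yet
            means.append(0)
            freqs.append(0)
        # update only after encoding, so a row never sees its own label
        stats[cat][0] += y
        stats[cat][1] += 1
    return means, freqs, stats

feats = ['a', 'b', 'a', 'a', 'b']   # toy categorical column
labels = [1, 0, 0, 1, 1]            # toy targets
means, freqs, stats = running_target_encode(feats, labels)
print(means)  # [0, 0, 1.0, 0.5, 0.0]
```

Updating the statistics after the row is encoded is what keeps a row's own label out of its feature. The accumulated `stats` play the role of `save_cate_avgs`, which `transform` then applies to the test set as fixed averages.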


## Categorical features preprocessing

In [None]:
out_dir_cate = '/content/drive/MyDrive/DeepGBM/experiments/data/data_offline_cate'
os.makedirs(out_dir_cate, exist_ok=True)
ec = CateEncoder(cate_col, nume_col, threshold, thresrate, num_bins, label_col)
ec.fit_transform(train_csv_path, out_dir_cate + '/train/')
ec.transform(test_csv_path, out_dir_cate + '/test/')

----------------------------------------------------------------------
Fitting and Transforming /content/drive/MyDrive/DeepGBM/experiments/data/data_offline/train.csv .
----------------------------------------------------------------------
Filtering and fillna features


0it [00:00, ?it/s]


Fillna and Bucketize numeric features


100%|██████████| 12/12 [00:00<00:00, 293.22it/s]


Ordinal encoding cate features
----------------------------------------------------------------------
Transforming /content/drive/MyDrive/DeepGBM/experiments/data/data_offline/test.csv .
----------------------------------------------------------------------
Filtering and fillna features


0it [00:00, ?it/s]
100%|██████████| 12/12 [00:00<00:00, 501.93it/s]

Ordinal encoding cate features





In [None]:
import sys
sys.path.insert(0, '/content/drive/MyDrive/DeepGBM/experiments/data/models')

In [None]:
!python /content/drive/MyDrive/DeepGBM/experiments/data/main.py -data data_offline -batch_size 512 -plot_title 'paper_0201' \
-max_epoch 20 -lr 1e-3 -opt Adam -test_batch_size 100 -model deepgbm \
-task regression -l2_reg 1e-6 -test_freq 300 -seed 1,2,3,4,5 -group_method Random \
-emb_epoch 2 -loss_de 2 -loss_dr 0.7 -tree_lr 0.1 -cate_layers 16,16 -nslices 5 \
 -tree_layers 100,100,100,50 -embsize 20 -maxleaf 64 -log_freq 500

2021-12-26 19:02:06,100 [INFO] data loaded.
 train_x shape: (3918, 12). train_y shape: (3918, 1).
 test_x shape: (980, 12). test_y shape: (980, 1).
loaded from /content/drive/MyDrive/DeepGBM/experiments/data/data_offline_cate/train/.
loaded from /content/drive/MyDrive/DeepGBM/experiments/data/data_offline_cate/test/.
2021-12-26 19:02:06,113 [INFO] Categorical data loaded.
 train_x shape: (3918, 12). train_y shape: (3918, 1).
 test_x shape: (980, 12). test_y shape: (980, 1).
[LightGBM] [Info] Total Bins 1349
[LightGBM] [Info] Number of data: 3918, number of used features: 12
[LightGBM] [Info] Start training from score 5.871363
[1]	valid_0's l2: 0.629356
Training until validation scores don't improve for 20 rounds.
[2]	valid_0's l2: 0.510817
[3]	valid_0's l2: 0.414886
[4]	valid_0's l2: 0.337218
[5]	valid_0's l2: 0.273997
[6]	valid_0's l2: 0.256411
[7]	valid_0's l2: 0.208742
[8]	valid_0's l2: 0.170341
[9]	valid_0's l2: 0.139027
[10]	valid_0's l2: 0.113654
[11]	valid_0's l2: 0.0930975
[12]