# GPU LightGBM (Light Gradient Boosting) trained on a timestamp text dataset

1. Same emotion dataset from [NLP-dataset](https://github.com/huseinzol05/NLP-Dataset)
2. Same split: 80% training, 20% testing; results may vary with the random split
3. Same regex substitution `'[^\"\'A-Za-z0-9 ]+'`
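
As a quick illustration of the substitution in item 3, the sketch below applies the same pattern to a made-up sentence (the sample string is an assumption, not taken from the dataset). Everything except letters, digits, spaces and quote characters is dropped, then repeated spaces are collapsed.

```python
import re

# Same pattern as in item 3: keep letters, digits, spaces and quote characters.
PATTERN = '[^\"\'A-Za-z0-9 ]+'

def clean(text):
    """Strip disallowed characters, then collapse runs of spaces."""
    text = re.sub(PATTERN, '', text)
    return ' '.join(tok for tok in text.split(' ') if tok)

sample = "Hello, world! It's 2017 :)"   # hypothetical input
cleaned = clean(sample)
print(cleaned)
```

Note that apostrophes survive the cleaning, so contractions such as `It's` stay a single token.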

## Example

Each word is mapped to its position in a sorted dictionary:

text: 'module into which all the refactored classes', matrix: [167, 143, 12, 3, 4, 90]
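
A minimal sketch of how such a matrix could be produced. The toy corpus, the alphabetical sort order, and the choice to start indices at 1 (leaving 0 for padding and unknown words) are assumptions for illustration; the actual `dictionary_emotion.p` loaded below may have been built differently.

```python
# Toy corpus; hypothetical, not from the emotion dataset.
corpus = ['i feel happy today', 'i feel sad']

# Collect the unique words and sort them (alphabetical order is an assumption).
vocab = sorted({word for sentence in corpus for word in sentence.split()})

# Map each word to its 1-based position; 0 is left free for padding.
word_to_id = {word: pos for pos, word in enumerate(vocab, start=1)}

def encode(sentence):
    """Replace each word by its dictionary position; unknown words become 0."""
    return [word_to_id.get(w, 0) for w in sentence.split()]

encoded = encode('i feel sad today')
print(word_to_id)
print(encoded)
```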

In [1]:
import numpy as np
import sklearn.datasets
import re
import time
import lightgbm as lgb
import pickle
from sklearn.model_selection import train_test_split
import json



In [2]:
def clearstring(string):
    string = re.sub('[^\"\'A-Za-z0-9 ]+', '', string)
    string = string.split(' ')
    string = filter(None, string)
    string = [y.strip() for y in string]
    string = ' '.join(string)
    return string

# sklearn.datasets.load_files reads each document as a single element,
# so we split every document on newlines
def separate_dataset(trainset):
    datastring = []
    datatarget = []
    for i in range(len(trainset.data)):
        data_ = trainset.data[i].split('\n')
        # python3, if python2, just remove list()
        data_ = list(filter(None, data_))
        for n in range(len(data_)):
            data_[n] = clearstring(data_[n])
        datastring += data_
        for n in range(len(data_)):
            datatarget.append(trainset.target[i])
    return datastring, datatarget

In [3]:
trainset_data = sklearn.datasets.load_files(container_path = 'data', encoding = 'UTF-8')
trainset_data.data, trainset_data.target = separate_dataset(trainset_data)

In [4]:
with open('dictionary_emotion.p', 'rb') as fopen:
    dict_emotion = pickle.load(fopen)

In [5]:
len_sentences = np.array([len(i.split()) for i in trainset_data.data])
maxlen = np.ceil(len_sentences.mean()).astype('int')
data_X = np.zeros((len(trainset_data.data), maxlen))

In [6]:
for i in range(data_X.shape[0]):
    tokens = trainset_data.data[i].split()[:maxlen]
    for no, text in enumerate(tokens[::-1]):
        try:
            data_X[i, -1 - no] = dict_emotion[text]
        except KeyError:
            # word not in the dictionary: leave the slot as 0
            continue
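
To make the alignment in the loop above concrete, here is a small sketch with a made-up dictionary and `maxlen` (both hypothetical stand-ins for `dict_emotion` and the mean sentence length). Sentences longer than `maxlen` are truncated to their first `maxlen` tokens; shorter ones are right-aligned, so the zeros end up on the left. Unknown words stay 0.

```python
import numpy as np

# Hypothetical stand-ins for dict_emotion and maxlen.
toy_dict = {'i': 1, 'feel': 2, 'so': 3, 'happy': 4, 'today': 5}
toy_maxlen = 4
sentences = ['i feel happy', 'i feel so happy today', 'i feel unknownword']

X = np.zeros((len(sentences), toy_maxlen))
for i, sentence in enumerate(sentences):
    tokens = sentence.split()[:toy_maxlen]      # truncate long sentences
    for no, text in enumerate(tokens[::-1]):    # walk tokens from last to first
        X[i, -1 - no] = toy_dict.get(text, 0)   # fill the row from the right

print(X)
```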

In [7]:
train_X, test_X, train_Y, test_Y = train_test_split(data_X, trainset_data.target, test_size = 0.2)

In [8]:
params_lgb = {
    'max_depth': 27, 
    'learning_rate': 0.03,
    'verbose': 50, 
    'early_stopping_round': 200,
    'metric': 'multi_logloss',
    'objective': 'multiclass',
    'num_classes': len(trainset_data.target_names),
    'device': 'gpu',
    'gpu_platform_id': 0,
    'gpu_device_id': 0
    }

In [12]:
d_train = lgb.Dataset(train_X, train_Y)
d_valid = lgb.Dataset(test_X, test_Y)
watchlist = [d_train, d_valid]
t=time.time()
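# early_stopping_rounds and verbose_eval are accepted by older LightGBM releases;
# from 4.0 on, pass callbacks=[lgb.early_stopping(200), lgb.log_evaluation(100)] instead
clf = lgb.train(params_lgb, d_train, 100000, watchlist, early_stopping_rounds=200, verbose_eval=100)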
print(round(time.time()-t, 3), 'Seconds to train lgb')



Training until validation scores don't improve for 200 rounds.
[100]	training's multi_logloss: 1.53707	valid_1's multi_logloss: 1.54177
[200]	training's multi_logloss: 1.50251	valid_1's multi_logloss: 1.51202
[300]	training's multi_logloss: 1.48103	valid_1's multi_logloss: 1.49636
[400]	training's multi_logloss: 1.46418	valid_1's multi_logloss: 1.48556
[500]	training's multi_logloss: 1.44901	valid_1's multi_logloss: 1.47635
[600]	training's multi_logloss: 1.43547	valid_1's multi_logloss: 1.46905
[700]	training's multi_logloss: 1.42315	valid_1's multi_logloss: 1.4629
[800]	training's multi_logloss: 1.41188	valid_1's multi_logloss: 1.45773
[900]	training's multi_logloss: 1.40093	valid_1's multi_logloss: 1.45285
[1000]	training's multi_logloss: 1.39076	valid_1's multi_logloss: 1.44862
[1100]	training's multi_logloss: 1.3808	valid_1's multi_logloss: 1.44438
[1200]	training's multi_logloss: 1.3712	valid_1's multi_logloss: 1.4405
[1300]	training's multi_logloss: 1.36209	valid_1's multi_loglo
[... output truncated ...]
[11100]	training's multi_logloss: 0.883993	valid_1's multi_logloss: 1.35606
[11200]	training's multi_logloss: 0.880897	valid_1's multi_logloss: 1.35605
[11300]	training's multi_logloss: 0.877782	valid_1's multi_logloss: 1.35597
[11400]	training's multi_logloss: 0.874686	valid_1's multi_logloss: 1.35596
[11500]	training's multi_logloss: 0.871599	valid_1's multi_logloss: 1.3559
[11600]	training's multi_logloss: 0.868433	valid_1's multi_logloss: 1.35582
[11700]	training's multi_logloss: 0.865309	valid_1's multi_logloss: 1.3557
[11800]	training's multi_logloss: 0.862323	valid_1's multi_logloss: 1.3557
[11900]	training's multi_logloss: 0.859349	valid_1's multi_logloss: 1.35567
[12000]	training's multi_logloss: 0.85648	valid_1's multi_logloss: 1.35575
[12100]	training's multi_logloss: 0.853534	valid_1's multi_logloss: 1.35577
Early stopping, best iteration is:
[11908]	training's multi_logloss: 0.859119	valid_1's multi_logloss: 1.35566
1136.274 Seconds to train lgb


In [14]:
np.mean(test_Y == np.argmax(clf.predict(test_X), axis = 1))

0.47172572635013554
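
For context, `clf.predict` returns one probability per class for each row, and `np.argmax(..., axis = 1)` picks the most likely class. The sketch below repeats that step on made-up probabilities and also computes the majority-class baseline implied by the support counts in the classification report of this notebook (28074 `joy` rows out of 83362), which is a useful reference point for the 0.47 accuracy.

```python
import numpy as np

# Hypothetical class probabilities for 3 rows and 6 classes.
proba = np.array([
    [0.10, 0.05, 0.50, 0.05, 0.25, 0.05],
    [0.40, 0.10, 0.20, 0.05, 0.20, 0.05],
    [0.05, 0.05, 0.30, 0.05, 0.50, 0.05],
])
labels = np.array([2, 4, 4])                 # hypothetical true labels

pred = np.argmax(proba, axis=1)              # most probable class per row
toy_accuracy = np.mean(pred == labels)       # fraction of correct predictions

# Support counts from the classification report in this notebook.
support = np.array([11587, 9504, 28074, 6949, 24293, 2955])
majority_baseline = support.max() / support.sum()

print(pred, toy_accuracy)
print(round(majority_baseline, 4))
```

Always predicting `joy` would score about 0.34, so the model is above that baseline, but not by a wide margin.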

In [15]:
clf.save_model('lgb-timestamp.model')

In [16]:
from sklearn import metrics
print(metrics.classification_report(test_Y, np.argmax(clf.predict(test_X), axis = 1), target_names = trainset_data.target_names))

             precision    recall  f1-score   support

      anger       0.53      0.22      0.31     11587
       fear       0.50      0.18      0.27      9504
        joy       0.46      0.73      0.57     28074
       love       0.30      0.08      0.13      6949
    sadness       0.49      0.57      0.53     24293
   surprise       0.26      0.09      0.13      2955

avg / total       0.46      0.47      0.43     83362
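
The `avg / total` row is weighted by support, so the large `joy` and `sadness` classes dominate it. As a rough check, the sketch below recomputes the weighted F1 from the rounded per-class values in the table and compares it with the unweighted (macro) mean. Because the inputs are already rounded to two decimals, the weighted figure only approximately matches the 0.43 printed above.

```python
import numpy as np

# Per-class F1 and support, copied from the report above
# (anger, fear, joy, love, sadness, surprise).
f1 = np.array([0.31, 0.27, 0.57, 0.13, 0.53, 0.13])
support = np.array([11587, 9504, 28074, 6949, 24293, 2955])

weighted_f1 = np.average(f1, weights=support)  # what 'avg / total' reports
macro_f1 = f1.mean()                           # every class counts equally

print(round(weighted_f1, 3), round(macro_f1, 3))
```

The macro value is noticeably lower, reflecting the weak scores on `love`, `surprise`, `fear` and `anger`.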

