Once we have the tf-idf matrix, it is just a matter of running:

```python
from utils.lightgbm_optimizer import LGBOptimizer

if __name__ == '__main__':

    # Tune and fit LightGBM on the original dataset (50 hyperopt evaluations)
    opt = LGBOptimizer(dataset='original', save=True)
    opt.optimize(maxevals=50)

    # Same procedure on the EDA-augmented dataset
    opt = LGBOptimizer(dataset='augmented', save=True)
    opt.optimize(maxevals=50)
```
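The internals of `LGBOptimizer` are not shown here, so what follows is only a minimal sketch of the kind of loop that `optimize(maxevals=50)` presumably runs: sample hyperparameters, score them on a validation set, keep the best, and refit once with the winner. Plain random search stands in for `hyperopt`, `toy_validation_loss` is a hypothetical placeholder for a real `LightGBM` fit, and the parameter ranges are assumptions.

```python
import random


def sample_params(rng):
    # Hypothetical search space: the names mirror common LightGBM parameters,
    # the ranges are illustrative assumptions.
    return {
        "learning_rate": rng.uniform(0.01, 0.3),
        "num_leaves": rng.randint(16, 256),
        "subsample": rng.uniform(0.5, 1.0),
        "colsample_bytree": rng.uniform(0.5, 1.0),
    }


def toy_validation_loss(params):
    # Placeholder for "fit on train, score on valid".
    # A real objective would train LightGBM here and return a validation loss.
    return (params["learning_rate"] - 0.06) ** 2 + (params["subsample"] - 0.6) ** 2


def random_search(maxevals, seed=0):
    rng = random.Random(seed)
    best_params, best_loss, n_fits = None, float("inf"), 0
    for _ in range(maxevals):
        params = sample_params(rng)
        loss = toy_validation_loss(params)
        n_fits += 1
        if loss < best_loss:
            best_params, best_loss = params, loss
    # One extra fit with the best parameters found by the search.
    n_fits += 1
    return best_params, best_loss, n_fits


best_params, best_loss, n_fits = random_search(maxevals=50)
print(n_fits)  # 51: 50 search evaluations plus the final fit
```

The point of the sketch is the fit count: 50 search evaluations plus one final fit gives 51 model fits per dataset.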

I won't run the code here since it takes a long time. Let's have a look at the results.

In [1]:
import pandas as pd
import pickle

In [2]:
org_res = pickle.load(open("../results/original_results.p", "rb"))
aug_res = pickle.load(open("../results/augmented_results.p", "rb"))

In [8]:
print('RESULTS WITH ORIGINAL DATASET: {}'.format(org_res))
print('RESULTS WITH EDA AUGMENTED DATASET: {}'.format(aug_res))

RESULTS WITH ORIGINAL DATASET: {'acc': 0.7054056974105953, 'f1': 0.6832450020682459, 'prec': 0.6763347945728874, 'rec': 0.7054056974105953, 'cm': array([[ 1613,   406,   215,   423],
       [  549,   897,   864,   723],
       [  183,   455,  2144,  3037],
       [  137,   175,  1013, 14933]]), 'model': <lightgbm.basic.Booster object at 0x7f1f6e5c2320>, 'best_params': {'colsample_bytree': 0.7333187629999629, 'learning_rate': 0.0639446526095139, 'min_child_weight': 6.386698616449527, 'num_boost_round': 439, 'num_leaves': 236, 'reg_alpha': 0.08333569598110195, 'reg_lambda': 0.07805240800326783, 'subsample': 0.5715270609261924, 'verbose': -1, 'num_class': 4, 'objective': 'multiclass'}, 'running_time': 311.58}
RESULTS WITH EDA AUGMENTED DATASET: {'acc': 0.7060899629056073, 'f1': 0.6847974561849195, 'prec': 0.6781136678067431, 'rec': 0.7060899629056073, 'cm': array([[ 1598,   414,   217,   428],
       [  530,   919,   866,   718],
       [  161,   443,  2185,  3030],
       [  121,   175, 
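As a sanity check, the reported accuracy can be recomputed from the confusion matrix of the original dataset. The sketch below assumes the scikit-learn convention (rows are true classes, columns are predicted classes) and support-weighted averaging for recall. Both are assumptions, although they are consistent with `'rec'` being identical to `'acc'` in the output above.

```python
# Confusion matrix printed above for the original dataset
# (assumed layout: rows = true class, columns = predicted class).
cm = [
    [1613, 406, 215, 423],
    [549, 897, 864, 723],
    [183, 455, 2144, 3037],
    [137, 175, 1013, 14933],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Per-class recall: correct predictions for a class / true samples of that class.
recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

# Support-weighted recall: each class is weighted by its share of true samples.
weighted_recall = sum(r * sum(row) / total for r, row in zip(recalls, cm))

print(round(accuracy, 6))         # 0.705406
print(round(weighted_recall, 6))  # 0.705406
print([round(r, 3) for r in recalls])
```

Under these assumptions the support-weighted recall reduces algebraically to the accuracy, which is why the two values coincide. The per-class recalls are more informative, since they show how unevenly the classes are handled.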

In [10]:
keep_keys = ['acc', 'f1', 'prec', 'running_time']
o_res = {k:v for k,v in org_res.items() if k in keep_keys}
o_res['data'] = 'original'
a_res = {k:v for k,v in aug_res.items() if k in keep_keys}
a_res['data'] = 'augmented'

In [16]:
df_res = pd.concat([pd.DataFrame(o_res, index=[0]), pd.DataFrame(a_res, index=[0])])
df_res

|   | acc | f1 | prec | running_time | data |
|---|---|---|---|---|---|
| 0 | 0.705406 | 0.683245 | 0.676335 | 311.58 | original |
| 0 | 0.70609 | 0.684797 | 0.678114 | 1027.66 | augmented |


A few points are worth commenting on. First, and most important, all metrics have improved by $\sim$2% relative to the results in the directory `amazon_reviews_classification_without_DL`, where I did not use the `fastai` Tokenizer. Second, the difference between using and not using EDA on this dataset is negligible. To be honest, I was expecting this result, since EDA is better suited to DL algorithms and small datasets. Finally, a comment on the running time. For the augmented dataset we are feeding `LightGBM` a sparse matrix of nearly 0.9 million rows and 30k columns. Each full `hyperopt` iteration takes around 1000 sec, or roughly 17 min (note that the running time includes the final fit, i.e. the 51 `LightGBM` fits: 50 `hyperopt` iterations (using train and valid) plus the final one (using train+valid and test)). So fitting 0.9 million rows and 30k columns takes roughly 17 min.
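To put numbers on "negligible", here is a small sketch that compares the two rows of the results table. The values are copied from the table; the dictionary names are only illustrative.

```python
# Values copied from the results table above.
original = {"acc": 0.705406, "f1": 0.683245, "prec": 0.676335, "running_time": 311.58}
augmented = {"acc": 0.70609, "f1": 0.684797, "prec": 0.678114, "running_time": 1027.66}

for metric in ("acc", "f1", "prec"):
    abs_diff = augmented[metric] - original[metric]
    rel_diff = abs_diff / original[metric]
    print(f"{metric}: +{abs_diff:.6f} absolute, {100 * rel_diff:.2f}% relative")

# Cost side of the comparison: how much longer the augmented run takes.
time_ratio = augmented["running_time"] / original["running_time"]
print(f"running time ratio: {time_ratio:.2f}x")
print(f"augmented run: {augmented['running_time'] / 60:.1f} min")
```

Every metric gains well under one percent in relative terms, while the augmented run takes about 3.3 times longer.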

Anyway, the summary for the exercise in this repo is: **USE THE FASTAI TOKENIZER**