I initially ran 16 experiments:

```bash
python score.py --maxevals 50 --save
python score.py --maxevals 50 --with_bigrams --save
python score.py --packg spacy --maxevals 50 --save
python score.py --packg spacy --maxevals 50 --with_bigrams --save

python score.py --algo lda --n_topics 20 --with_cv --save
python score.py --algo lda --n_topics 50 --with_cv --save
python score.py --algo lda --n_topics 20 --with_bigrams --with_cv --save
python score.py --algo lda --n_topics 50 --with_bigrams --with_cv --save
python score.py --packg spacy --algo lda --n_topics 20 --with_cv --save
python score.py --packg spacy --algo lda --n_topics 50 --with_cv --save
python score.py --packg spacy --algo lda --n_topics 20 --with_bigrams --with_cv --save
python score.py --packg spacy --algo lda --n_topics 50 --with_bigrams --with_cv --save

python score.py --algo ensemb --n_topics 20 --with_cv --save
python score.py --algo ensemb --n_topics 20 --with_bigrams --with_cv --save
python score.py --packg spacy --algo ensemb --n_topics 20 --with_cv --save
python score.py --packg spacy --algo ensemb --n_topics 20 --with_bigrams --with_cv --save
```

I have used `LDA`, `EnStop` (pLSA+`UMAP`) and `tfidf` with `nltk` and `spacy` tokenization (see `preprocessing.py`) with and without bigrams. 

All the code necessary to run the experiments can be found in `score.py`. It is mostly this:

In [62]:
import pandas as pd
import numpy as np
import pickle
import argparse
import pdb

from pathlib import Path
# Note: I simply copied the `utils` dir into the notebooks dir so that I can run the next cell here
from utils.lightgbm_optimizer import LGBOptimizer

In [63]:
packg = 'nltk'           # nltk or spacy
with_bigrams = False
algo = 'lda'             # lda or ensemb
n_topics = '20'          # 20 or 50 when lda, only 20 for ensemb
with_cv = False          # hyperoptimize with cross validation
is_unbalance = True      # set the lightgbm is_unbalance param to True/False
with_focal_loss = False  # Use the Focal Loss (see here: https://github.com/jrzaurin/LightGBM-with-Focal-Loss)
eval_with_metric = False # hyperoptimize using the F1 score or the CE Loss
save = False             
maxevals = 1

In [64]:
FEAT_PATH = Path('../features')

wbigram = 'bigram_' if with_bigrams else ''
dataname = packg + '_tok_reviews_' + wbigram + algo
if algo != 'tfidf': dataname = dataname + '_' +  n_topics

dtrain = pickle.load(open(FEAT_PATH/('train/'+ dataname+'_feat_tr.p'), 'rb'))
dvalid = pickle.load(open(FEAT_PATH/('valid/'+ dataname+'_feat_val.p'), 'rb'))
dtest  = pickle.load(open(FEAT_PATH/('test/' + dataname+'_feat_te.p'),  'rb'))

opt = LGBOptimizer(
    dataname,
    dtrain,
    dvalid,
    dtest,
    with_cv=with_cv,
    is_unbalance=is_unbalance,
    with_focal_loss=with_focal_loss,
    eval_with_metric=eval_with_metric,
    save=save)
opt.optimize(maxevals=maxevals)

100%|██████████| 1/1 [00:03<00:00,  3.61s/it, best loss: 0.9695638766845442]
acc: 0.6034, f1 score: 0.5129, precision: 0.5108, recall: 0.6034
confusion_matrix
[[  483   110   355  1717]
 [  287   125   476  2154]
 [  287   113   572  4863]
 [  199    73   417 15633]]
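The naming scheme used to locate the feature files can be isolated in a small helper. This is only a sketch: `build_dataname` is a hypothetical function name (it does not exist in `score.py` as far as this notebook shows), and it compares strings by value with `!=`.

```python
def build_dataname(packg, algo, n_topics=None, with_bigrams=False):
    """Return the feature-set name used to locate the pickled features.

    Hypothetical helper that mirrors the string concatenation in the cell above.
    """
    wbigram = 'bigram_' if with_bigrams else ''
    dataname = packg + '_tok_reviews_' + wbigram + algo
    # tf-idf features carry no topic count, so nothing is appended for them
    if algo != 'tfidf':
        dataname = dataname + '_' + str(n_topics)
    return dataname


print(build_dataname('nltk', 'lda', 20))                    # nltk_tok_reviews_lda_20
print(build_dataname('spacy', 'tfidf', with_bigrams=True))  # spacy_tok_reviews_bigram_tfidf
```

Both names match the prefixes of the result files listed later in this notebook.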


Let's first comment a bit on what happens within `LGBOptimizer` (all related code can be found in the `utils` module). There I use `hyperopt` and the following parameter space to optimize `LightGBM` hyper-parameters:

```python
space = {
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2),
    'num_boost_round': hp.quniform('num_boost_round', 50, 500, 20),
    'num_leaves': hp.quniform('num_leaves', 31, 255, 4),
    'min_child_weight': hp.uniform('min_child_weight', 0.1, 10),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.),
    'subsample': hp.uniform('subsample', 0.5, 1.),
    'reg_alpha': hp.uniform('reg_alpha', 0.01, 0.1),
    'reg_lambda': hp.uniform('reg_lambda', 0.01, 0.1),
}
```

I think I could perhaps refine some of them a bit more, for example by allowing a slightly higher `learning_rate` (e.g. 0.3) or a smaller `num_boost_round` (e.g. 10). I will leave that to future visits to this repo or to you, the reader 🙂.
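One detail worth keeping in mind with a space like this: as far as I know, `hp.quniform` draws `round(uniform(low, high) / q) * q` and returns it as a float, while `num_leaves` and `num_boost_round` need to be integers. The sketch below uses a pure-Python stand-in (`quniform` and `INT_PARAMS` are names I made up for illustration, not `hyperopt` code) to show the grid and the cast.

```python
import random


def quniform(rng, low, high, q):
    """Stand-in for hp.quniform: round(uniform(low, high) / q) * q, as a float."""
    return float(round(rng.uniform(low, high) / q) * q)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
raw = {
    'num_boost_round': quniform(rng, 50, 500, 20),
    'num_leaves': quniform(rng, 31, 255, 4),
}

# Cast the quantized floats to int before handing them to the booster
INT_PARAMS = ('num_boost_round', 'num_leaves')
params = {k: int(v) if k in INT_PARAMS else v for k, v in raw.items()}
print(raw)
print(params)
```

Note that the sampled values are multiples of `q`, so `num_leaves` lands on multiples of 4 and can reach 256, slightly above the nominal upper bound of 255.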

Within the class `LGBOptimizer`, one can run the optimization process with a number of options. For example, using cross validation and the F1 score. This will basically run the next piece of code:

```python
cv_result = lgb.cv(
    params,
    dtrain,
    num_boost_round=params['num_boost_round'],
    metrics='multi_logloss',
    feval = lgb_f1 if self.eval_with_metric else None,
    nfold=3,
    stratified=True,
    early_stopping_rounds=20)
```

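where `lgb_f1` is built on `lgb_f1_score` (note the extra `num_class` argument, which has to be fixed beforehand, e.g. with `functools.partial`, because `LightGBM` calls `feval` with two arguments only):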

```python
import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(preds, lgbDataset, num_class):
    """
    Implementation of the f1 score to be used as evaluation score for lightgbm
    Parameters:
    -----------
    preds: numpy.ndarray
        array with the predictions
    lgbDataset: lightgbm.Dataset
    num_class: int
        number of classes
    """
    # regroup the flat predictions into one row of class scores per sample
    preds = preds.reshape(-1, num_class, order='F')
    cat_preds = np.argmax(preds, axis=1)
    y_true = lgbDataset.get_label()
    return 'f1', f1_score(y_true, cat_preds, average='weighted'), True
```
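The `order='F'` reshape is the least obvious line in that function. It implies that the predictions arrive as a flat array grouped by class (all scores for class 0 first, then class 1, and so on). The pure-Python sketch below reproduces that regrouping on a toy example; `to_rows` and `argmax` are illustrative helpers of mine, and the class-major layout is an assumption inferred from the reshape rather than something shown in this notebook.

```python
def to_rows(flat_preds, num_class):
    """Regroup a flat, class-major score list into one row of scores per sample.

    Pure-Python equivalent of np.reshape(flat_preds, (-1, num_class), order='F').
    """
    n_samples = len(flat_preds) // num_class
    return [[flat_preds[j * n_samples + i] for j in range(num_class)]
            for i in range(n_samples)]


def argmax(row):
    """Index of the largest score in a row."""
    return max(range(len(row)), key=row.__getitem__)


# 3 samples, 2 classes: all class-0 scores first, then all class-1 scores
flat = [0.9, 0.2, 0.4,   # class 0 for samples 0, 1, 2
        0.1, 0.8, 0.6]   # class 1 for samples 0, 1, 2
rows = to_rows(flat, num_class=2)
labels = [argmax(r) for r in rows]
print(rows)    # [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
print(labels)  # [0, 1, 1]
```

A plain row-major reshape of the same flat list would mix scores from different samples, which is why the order argument matters here.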

If you want to know more about custom losses and metrics for `LightGBM`, have a look at my repo [here](https://github.com/jrzaurin/LightGBM-with-Focal-Loss). Again, all the related code can be found in the `utils` module.

So, if we choose to run 100 iterations with cross validation, evaluating with the F1 score, on a set of features that is the result of using LDA with 20 topics, `LGBOptimizer` will take the train, validation and test datasets, merge the first two and run a 3-fold stratified CV experiment on them. Once the best parameters have been found based on the resulting F1 score, it will then predict on the test set and compute the success metrics:

```python
acc  = accuracy_score(self.lgtest.label, preds)
f1   = f1_score(self.lgtest.label, preds, average='weighted')
prec = precision_score(self.lgtest.label, preds, average='weighted')
rec  = recall_score(self.lgtest.label, preds, average='weighted')
cm   = confusion_matrix(self.lgtest.label, preds)
```
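To make the "stratified" part of the 3-fold CV concrete, here is a deliberately simplified sketch of how samples can be split so that every fold keeps the class mix. It is not what `lgb.cv` does internally (real implementations also shuffle); `stratified_folds` is a hypothetical helper and the labels are toy data.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds=3):
    """Assign each sample index to a fold so every fold keeps the class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # deal the samples of each class round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# toy labels: 4 classes with an imbalanced 3:3:6:18 mix (hypothetical data)
labels = [0] * 3 + [1] * 3 + [2] * 6 + [3] * 18
folds = stratified_folds(labels, n_folds=3)
for fold in folds:
    print(sorted(Counter(labels[i] for i in fold).items()))
    # each fold prints [(0, 1), (1, 1), (2, 2), (3, 6)]
```

With an imbalanced target like the review scores here, this is what stops a fold from ending up with almost none of the rarer classes.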

Have a look at the code in `utils` and the process will be pretty clear, I hope...and "that's it". Let's have a look at the results for the 16 experiments at the top of this notebook:

In [65]:
RESULT_PATH = Path('../results/')
results_fnames = list(RESULT_PATH.glob("*.p"))
results_fnames

[PosixPath('../results/nltk_tok_reviews_bigram_tfidf_results_unb.p'),
 PosixPath('../results/nltk_tok_reviews_ensemb_20_results_unb.p'),
 PosixPath('../results/spacy_tok_reviews_ensemb_20_results_unb.p'),
 PosixPath('../results/spacy_tok_reviews_lda_50_results_unb.p'),
 PosixPath('../results/spacy_tok_reviews_bigram_lda_20_results_unb.p'),
 PosixPath('../results/spacy_tok_reviews_bigram_tfidf_results_unb.p'),
 PosixPath('../results/nltk_tok_reviews_bigram_lda_50_results_unb.p'),
 PosixPath('../results/spacy_tok_reviews_lda_20_results_unb.p'),
 PosixPath('../results/spacy_tok_reviews_bigram_ensemb_20_results_unb.p'),
 PosixPath('../results/nltk_tok_reviews_bigram_lda_20_results_unb.p'),
 PosixPath('../results/nltk_tok_reviews_tfidf_results_unb.p'),
 PosixPath('../results/nltk_tok_reviews_bigram_ensemb_20_results_unb.p'),
 PosixPath('../results/spacy_tok_reviews_bigram_lda_50_results_unb.p'),
 PosixPath('../results/nltk_tok_reviews_lda_20_results_unb.p'),
 PosixPath('../results/spacy_tok

In [66]:
rnames = [str(rf).replace('../results/', '') for rf in results_fnames]
rnames = [str(rf).replace('_results_unb.p', '') for rf in rnames]

# let's have a look at one of the result files
pickle.load(open("../results/nltk_tok_reviews_bigram_tfidf_results_unb.p", "rb"))

{'acc': 0.681739879414298,
 'f1': 0.6528736886562243,
 'prec': 0.6454948246543382,
 'rec': 0.681739879414298,
 'cm': array([[ 1490,   430,   240,   505],
        [  551,   781,   733,   977],
        [  215,   462,  1778,  3380],
        [  207,   199,   969, 14947]]),
 'model': <lightgbm.basic.Booster at 0x7fb9c4e46f28>,
 'best_params': {'colsample_bytree': 0.5838279335270615,
  'learning_rate': 0.06754162601628992,
  'min_child_weight': 9.612941659095199,
  'num_boost_round': 302,
  'num_leaves': 224,
  'reg_alpha': 0.06187478435550153,
  'reg_lambda': 0.07577877965357216,
  'subsample': 0.6243898455966228,
  'verbose': -1,
  'num_class': 4,
  'objective': 'multiclass'},
 'running_time': 223.66}
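As a sanity check, the weighted metrics in this result file can be recomputed from the confusion matrix alone. The sketch below is pure Python and assumes the usual `sklearn` convention (rows are true classes, columns are predicted classes). It also shows why `rec` always equals `acc` here: support-weighted recall is accuracy by construction.

```python
cm = [[1490, 430, 240, 505],
      [551, 781, 733, 977],
      [215, 462, 1778, 3380],
      [207, 199, 969, 14947]]

n_class = len(cm)
support = [sum(row) for row in cm]  # true count per class (row sums)
predicted = [sum(cm[i][j] for i in range(n_class)) for j in range(n_class)]  # column sums
total = sum(support)

acc = sum(cm[k][k] for k in range(n_class)) / total
recall = [cm[k][k] / support[k] for k in range(n_class)]
precision = [cm[k][k] / predicted[k] for k in range(n_class)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]


def weighted(values):
    """Average per-class values weighted by class support."""
    return sum(v * s for v, s in zip(values, support)) / total


print(round(acc, 4), round(weighted(recall), 4))              # 0.6817 0.6817
print(round(weighted(precision), 4), round(weighted(f1), 4))  # 0.6455 0.6529
```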

In [67]:
keep = ['acc', 'f1', 'prec', 'running_time']
resd = [] 
for rn in rnames:
    res = pickle.load(open("../results/" + rn + "_results_unb.p", "rb"))
    res = {k:v for k,v in res.items() if k in keep}
    res['model_name'] = rn
    resd.append(res)

In [68]:
res_df = pd.DataFrame(resd)
res_df = res_df[['model_name', 'acc', 'f1', 'prec', 'running_time']]

In [69]:
res_df.sort_values('f1', ascending=False)

|  | model_name | acc | f1 | prec | running_time |
|---|---|---|---|---|---|
| 14 | spacy_tok_reviews_tfidf | 0.68759 | 0.661757 | 0.654098 | 217.83 |
| 5 | spacy_tok_reviews_bigram_tfidf | 0.687554 | 0.661738 | 0.653948 | 239.55 |
| 0 | nltk_tok_reviews_bigram_tfidf | 0.68174 | 0.652874 | 0.645495 | 223.66 |
| 10 | nltk_tok_reviews_tfidf | 0.680556 | 0.65118 | 0.643643 | 186.04 |
| 3 | spacy_tok_reviews_lda_50 | 0.623959 | 0.562297 | 0.555568 | 25.02 |
| 2 | spacy_tok_reviews_ensemb_20 | 0.618145 | 0.555991 | 0.548623 | 28.29 |
| 1 | nltk_tok_reviews_ensemb_20 | 0.620334 | 0.555944 | 0.548857 | 29.14 |
| 8 | spacy_tok_reviews_bigram_ensemb_20 | 0.616925 | 0.553605 | 0.546329 | 29.38 |
| 12 | spacy_tok_reviews_bigram_lda_50 | 0.618145 | 0.552436 | 0.547555 | 28.04 |
| 11 | nltk_tok_reviews_bigram_ensemb_20 | 0.612116 | 0.543171 | 0.536745 | 30.97 |


There are a number of interesting results to discuss from this table. 

In the first place, we see that the best results (by far) are obtained when using tf-idf (let me add that these types of results, i.e. the best solution being the simplest or most straightforward, are quite common in ML). The two main drawbacks of this technique are that, as soon as your vocabulary is large, the (sparse) feature matrix is going to be gigantic, and that, due to that data-size "*issue*", the algorithm is relatively slow compared to other solutions.

For example, the last column in the table shows the running time per iteration. For the tf-idf experiments I used 50 iterations without cross validation, while for the remaining ones I used cross validation and 100 iterations during optimization.

We can see that tf-idf is **notably slower** than topic modeling (note that a direct comparison is not entirely fair, since I did not use cross validation for tf-idf).

Nonetheless, let's reflect for a bit on `LightGBM`'s firepower. I have used a vocabulary size of 20,000 words and the dataset contains 278,677 reviews. Therefore, the tf-idf feature matrix is 278,677 x 20,000. Still, `LightGBM` manages to fit that in less than 4 min 😱.
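To get a feel for why the word "sparse" matters so much here, a quick back-of-the-envelope calculation helps. It takes the matrix shape quoted above at face value; the figure of 100 distinct vocabulary terms per review is purely an assumption of mine for illustration, not something measured on this dataset.

```python
n_reviews, vocab_size = 278_677, 20_000

# Dense storage: one float64 (8 bytes) per cell
dense_bytes = n_reviews * vocab_size * 8
print(round(dense_bytes / 1e9, 1), 'GB dense')   # 44.6 GB dense

# ASSUMPTION: ~100 distinct vocabulary terms per review (hypothetical figure)
nnz = n_reviews * 100
# CSR layout: 8-byte value + 4-byte column index per stored entry,
# plus one 4-byte row pointer per row (and one extra)
csr_bytes = nnz * (8 + 4) + (n_reviews + 1) * 4
print(round(csr_bytes / 1e6, 1), 'MB as CSR')    # 335.5 MB as CSR
```

Under that assumption the sparse representation is roughly two orders of magnitude smaller than the dense one, which is what makes fitting on the full matrix feasible at all.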

Before we discuss some other interesting aspects, let's see how long it took to run *all* experiments.

In [33]:
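# Weights assume rows 0-3 of res_df are the four tf-idf runs (50 iterations each); if res_df
# is in the file order listed above they sit at rows 0, 5, 10 and 14, so this is approximate.
print('Running the 16 experiments took: {} h'.format(
    round(sum(np.array([50]*4 + [100]*12) * res_df.running_time.values)/3600, 3)))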

Running the 16 experiments took: 28.647 h
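A variant of this calculation that does not depend on row order is to derive the number of iterations from the model name. The sketch below uses only the ten rows visible in the table above, so the figure it prints is a partial sum and not the total for all 16 experiments; `n_iterations` is a hypothetical helper.

```python
# (model_name, seconds per iteration) for the ten rows shown in the table above
rows = [
    ('spacy_tok_reviews_tfidf', 217.83),
    ('spacy_tok_reviews_bigram_tfidf', 239.55),
    ('nltk_tok_reviews_bigram_tfidf', 223.66),
    ('nltk_tok_reviews_tfidf', 186.04),
    ('spacy_tok_reviews_lda_50', 25.02),
    ('spacy_tok_reviews_ensemb_20', 28.29),
    ('nltk_tok_reviews_ensemb_20', 29.14),
    ('spacy_tok_reviews_bigram_ensemb_20', 29.38),
    ('spacy_tok_reviews_bigram_lda_50', 28.04),
    ('nltk_tok_reviews_bigram_ensemb_20', 30.97),
]


def n_iterations(model_name):
    """50 optimization rounds for tf-idf runs, 100 for the rest (as stated above)."""
    return 50 if model_name.endswith('tfidf') else 100


total_seconds = sum(n_iterations(name) * secs for name, secs in rows)
print(round(total_seconds / 3600, 3), 'h for the rows listed')  # 16.788 h for the rows listed
```

With the full `res_df`, the same idea would amount to weighting `running_time` by a per-row iteration count built from `model_name`.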


A bit more than a day of c5.4xlarge time 🙂.

A second interesting aspect is that, in terms of classifying the documents (or predicting their score), my `Spacy` tokenization seems to work a bit better than my `nltk` one. This is simply because the latter is a bit more "aggressive". In addition, interestingly, adding bigrams does not seem to make any difference. As we saw in previous notebooks, a better preprocessing step is possible. In the "*real world*" this would be another avenue to explore to improve the success metrics.

Moving on, it is perhaps a bit disappointing how far the topic modeling techniques are from the tf-idf results. It is true that we lose quite a bit of information by reducing the dimensionality to 20 or 50, but still, one would expect the topics to capture more information (the drop in F1 score is $\sim$0.10 in absolute terms). One might be tempted to use a higher number of topics (100 or 150). I will leave that for the reader to try...Spoiler alert, it does not improve that much (at least when using 100 topics).

Still focusing on the topic modeling techniques, we see that the topic ensemble technique in `EnStop` does indeed outperform the LDA technique in `sklearn` when using the same number of topics (remember that, due to memory issues related to the fact that `UMAP` does not accept sparse matrices, we can only use 20 topics when using `EnStop`). I think this result highlights the potential of the `EnStop` package and I just hope they add the ability to deal with sparse matrices in the near future.

So, in summary, I have spent 28.6 hours running 16 experiments, and the best accuracy and F1 score obtained are 0.688 and 0.662 respectively. The question is of course...**is this good at all?**

Let's see which results one would obtain by just doing a random guess given the observed distributions.

In [50]:
from collections import Counter
class_counts = [(k,v) for k,v in Counter(dtrain.y).items()]
class_counts = sorted(class_counts, key = lambda x: x[0])
class_prob   = [c[1]/dtrain.y.shape[0] for c in class_counts] 

In [71]:
from sklearn.metrics import accuracy_score

accs = []
for _ in range(100):
    # draw one random label per sample from the observed class distribution
    random_guess = np.random.choice(4, dtrain.y.shape[0], p=class_prob)
    accs.append(accuracy_score(dtrain.y, random_guess))

print(np.mean(accs))

0.40921285205422925
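The simulated number can be cross-checked analytically: if labels are guessed at random from the class prior $p_k$, the expected accuracy is $\sum_k p_k^2$. In the sketch below the class counts are an assumption on my side: they are the row sums of the test-set confusion matrix shown earlier, not the counts in `dtrain.y`, so the result differs slightly from the simulation. I also print the accuracy of always predicting the largest class, which is a somewhat tougher baseline.

```python
# ASSUMPTION: class counts taken from the row sums of the test-set confusion
# matrix shown earlier, not from dtrain.y, so the numbers differ slightly.
counts = [2665, 3042, 5835, 16322]
total = sum(counts)
probs = [c / total for c in counts]

expected_random_acc = sum(p * p for p in probs)  # guess drawn from the class prior
majority_acc = max(probs)                        # always predict the largest class

print(round(expected_random_acc, 3))  # 0.408
print(round(majority_acc, 3))         # 0.586
```

Both baselines sit below the 0.688 accuracy obtained with tf-idf, although the gap to the majority-class baseline is clearly smaller.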


So, "thankfully", we are doing notably better than just a random guess given the observed distributions, phew!

Ok, so there are only a couple of things I want to comment here. 

My first comment is related to the way I have approached the problem, i.e. as a multiclass classification problem. In reality, there is nothing that prevents us from treating this problem as a regression and comparing results. I will again leave that to the reader.

Secondly, I have also included code to run the classification using the [Focal Loss](https://arxiv.org/abs/1708.02002). The dataset is "mildly" imbalanced, so it might be worth it to give it a go. I will do this in a future visit or simply leave it to the reader.

For example, just run this: 

```bash
python score.py --packg spacy --maxevals 50 --with_focal_loss --is_unbalance --save
```
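For reference, the focal loss as defined in the paper linked above (in its binary form; how the repo extends it to the four classes used here is not shown in this notebook) is:

$$
\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)
$$

where $p_t$ is the predicted probability of the true class, $\alpha_t$ is a class-balancing weight and $\gamma \geq 0$ is the focusing parameter. With $\gamma = 0$ and $\alpha_t = 1$ this reduces to the usual cross-entropy loss, while larger values of $\gamma$ down-weight examples that are already classified with high confidence.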