Let's extract the features that will be fed to LightGBM for the Amazon review classification problem. Before starting, note that I ran `train_test_split.py` before any of the code in this notebook. As a result, the `data` directory contains three sub-directories holding the train, validation and test splits.

The algorithms I am going to use here to "extract" the features are: 

- Our good friend tf-idf
- The well-known LDA topic modelling algorithm
- And the promising EnsTop package, an ensemble-based approach to topic modelling using pLSA

Let me write a bit more about the latter technique. The `Enstop` package was created by [Leland McInnes](https://github.com/lmcinnes) and [Gökçen Eraslan](https://github.com/gokceneraslan). Leland McInnes is also the creator of [UMAP](https://github.com/lmcinnes/umap) (Uniform Manifold Approximation and Projection). When I first saw his YouTube [presentation](https://www.youtube.com/watch?v=nq6iPZVUxZU&t=8s) on UMAP, it made me want to go back to uni and do a PhD or MPhil in topology. Seriously, check it out.

Coming back to `Enstop`, the package is, as explained by the authors, an ensemble-based approach to topic modelling using pLSA. Leaving aside the use of `numba` for high performance, the package runs multiple topic models using pLSA, and then clusters the resulting topics using HDBSCAN to determine a set of stable topics. If you have a look at the source code of the package, you will see that the way they compute the stable topics can be described as follows:

- Compute an ensemble of topics using their pLSA implementation of NMF. For example, if you run 16 experiments (their default), this step will result in a `(n_runs * n_topics, n_words)` array
- Once we have that large array, they select a small list of stable topics using 3 optional methods:
    - HDBSCAN and the KL-divergence as a distance
    - HDBSCAN and Hellinger as a distance
    - First project the topics onto a low-dimensional space using UMAP with the Hellinger distance, then cluster them using HDBSCAN with the Euclidean distance in that lower-dimensional space. This is their default method.
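
To make the distance options above a bit more concrete, here is a minimal pure-Python sketch of the Hellinger distance and the KL-divergence between topic-word distributions, applied to a toy stacked ensemble. The numbers and the names `hellinger`, `kl_divergence`, `run_1` and `run_2` are made up for illustration; this is not `Enstop`'s implementation, which is compiled with `numba` and may treat these distances differently.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q). eps guards against log(0). Note that it is not symmetric."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


# Toy ensemble: 2 runs x 2 topics over a 4-word vocabulary (made-up values),
# stacked into (n_runs * n_topics) rows of length n_words.
run_1 = [[0.70, 0.20, 0.05, 0.05], [0.05, 0.05, 0.40, 0.50]]
run_2 = [[0.65, 0.25, 0.05, 0.05], [0.05, 0.10, 0.45, 0.40]]
stacked = run_1 + run_2

# Pairwise distances: a topic that is recovered in both runs sits close to its
# counterpart, which is the kind of structure a density-based clusterer can find.
dist = [[hellinger(p, q) for q in stacked] for p in stacked]
for row in dist:
    print(["%.3f" % d for d in row])
```

In this toy case, rows 0 and 2 (and rows 1 and 3) are close to each other and far from the rest, so each pair would be a candidate "stable" topic.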
    
Of course, there is more to it, since the package comes with its own topic coherence measure and a few more bells and whistles. However, there is a notable caveat. `UMAP` does not work with sparse matrices through the entire fit and transform pipeline (e.g. see [here](https://github.com/lmcinnes/umap/issues/81)): they are converted to dense arrays at some point in the process.

Therefore, when using `Enstop` with the default options, the number of topics must be kept small; otherwise memory usage explodes. For example, on an AWS c5.4xlarge EC2 instance (16 cores, 30.4 GB) with our Amazon reviews dataset, I can only use 20 topics, and that already uses 30 GB of RAM.
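
To get a feel for why a dense conversion hurts, here is a back-of-the-envelope comparison for a matrix the size of the tf-idf training matrix built at the end of this notebook (222906 x 20000 with 4724123 stored elements). This is only a sketch: it assumes `float64` values and 32-bit CSR index arrays, and it does not claim to identify which array `Enstop` or `UMAP` actually densifies.

```python
# Shape and number of stored elements of X_tr, as printed at the end of this notebook
n_rows, n_cols, nnz = 222906, 20000, 4724123

FLOAT64_BYTES = 8  # assumption: values stored as float64
INT32_BYTES = 4    # assumption: CSR index arrays stored as int32

# Dense: every cell is stored, zeros included
dense_bytes = n_rows * n_cols * FLOAT64_BYTES

# CSR: one value and one column index per stored element, plus n_rows + 1 row pointers
csr_bytes = nnz * FLOAT64_BYTES + nnz * INT32_BYTES + (n_rows + 1) * INT32_BYTES

print(f"dense: {dense_bytes / 1024**3:.1f} GiB")
print(f"CSR:   {csr_bytes / 1024**2:.1f} MiB")
print(f"ratio: {dense_bytes / csr_bytes:.0f}x")
```

Under these assumptions, a dense copy of a matrix this size would not fit in the memory of the instance mentioned above, while the sparse version takes a few tens of megabytes.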

I have not tried to use the other two options described before (KL-divergence and Hellinger), but I do hope UMAP supports sparse matrices in the near future because its potential is significant. 

Anyway, without further ado, let's move to the feature extraction process.

In [1]:
import pdb
import pandas as pd
import os
import pickle
import warnings

from pathlib import Path
from multiprocessing import Pool
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import Bunch
from sklearn.pipeline import Pipeline
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.utils.validation import check_is_fitted
from enstop import EnsembleTopics


warnings.filterwarnings("ignore")
cores = os.cpu_count()

In [2]:
class FeatureExtraction(object):
    def __init__(self, algo, n_topics=None, max_vocab_size=50000):
        super(FeatureExtraction, self).__init__()

        # Compare strings with ==, not `is` (identity checks on literals are unreliable)
        if algo == 'tfidf':
            vectorizer = TfidfVectorizer(max_features=max_vocab_size, preprocessor = lambda x: x,
                tokenizer = lambda x: x)
            self.fe = Pipeline([('vectorizer', vectorizer)])
        elif algo in ('lda', 'ensemb'):
            assert n_topics is not None, "n_topics is required for 'lda' and 'ensemb'"
            vectorizer = CountVectorizer(max_features=max_vocab_size, preprocessor = lambda x: x,
                tokenizer = lambda x: x)
            if algo == 'lda':
                model = LDA(n_components=n_topics, n_jobs=-1, random_state=0)
            else:
                model = EnsembleTopics(n_components=n_topics, n_jobs=cores, random_state=0)
            self.fe = Pipeline([('vectorizer', vectorizer), ('model', model)])
        else:
            raise ValueError("algo must be one of 'tfidf', 'lda' or 'ensemb'")

    def fit(self, X):
        self.fe.fit(X)
        return self

    def transform(self, X):
        out = self.fe.transform(X)
        return out

    def fit_transform(self,X):
        return self.fit(X).transform(X)
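
The class above passes identity functions as `preprocessor` and `tokenizer` because the reviews are already stored as lists of tokens. A minimal sketch of that pattern on a made-up three-document corpus follows; the `identity` helper and the toy documents are my own illustration, and `token_pattern=None` is only there to silence the warning about an unused pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Return the token list unchanged (the documents are already tokenized)."""
    return tokens


# Hypothetical pre-tokenized documents
docs = [
    ["buy", "sock", "help", "pain"],
    ["sock", "wash", "dry", "sock"],
    ["price", "quality", "worth"],
]

vectorizer = TfidfVectorizer(
    preprocessor=identity,  # skip lowercasing / accent stripping
    tokenizer=identity,     # skip regex tokenization
    token_pattern=None,     # unused when a tokenizer is supplied
)
X = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

print(X.shape)        # (n_documents, vocabulary size)
print(sorted(vocab))  # the learned vocabulary
```

One possible reason to prefer a named function over a `lambda` here: a fitted vectorizer holding a `lambda` cannot be saved with the standard `pickle` module, whereas one holding a module-level function can.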

I have already split the dataset into training, validation and test sets; see the `train_test_split.py` file.

In [21]:
dataset = 'nltk_tok_reviews'
algo='tfidf'
n_topics=None
max_vocab_size=20000
root='../data'
train_dir='train' 
valid_dir='valid'
test_dir='test'

In [22]:
TRAIN_PATH = Path(root) / train_dir
VALID_PATH = Path(root) / valid_dir
TEST_PATH  = Path(root) / test_dir

In [23]:
with open(TRAIN_PATH/(dataset+'_tr.p'), 'rb') as f:
    dtrain = pickle.load(f)
with open(VALID_PATH/(dataset+'_val.p'), 'rb') as f:
    dvalid = pickle.load(f)
with open(TEST_PATH/(dataset+'_te.p'), 'rb') as f:
    dtest = pickle.load(f)

These are sklearn's data bunches:

In [24]:
type(dtrain)

sklearn.utils.Bunch

In [25]:
dtrain.keys()

dict_keys(['X', 'y'])
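
For readers who have not met `Bunch` before, here is a small sketch of how such an object behaves. The toy reviews and labels are made up; how `train_test_split.py` actually builds the splits is not shown in this notebook.

```python
import pickle

from sklearn.utils import Bunch

# Hypothetical miniature split: X holds tokenized reviews, y holds labels
toy = Bunch(X=[["great", "sock"], ["poor", "fit"]], y=[1, 0])

# Keys can be read as attributes or as dictionary entries
print(toy.X is toy["X"])
print(list(toy.keys()))

# A Bunch survives a pickle round trip, which matches how the splits are loaded here
restored = pickle.loads(pickle.dumps(toy))
print(restored.X[0], restored.y[0])
```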

In [26]:
print(dtrain.X[0])

['buy', 'husband', 'have', 'leg', 'pain', 'vericose', 'vein', 'purchase', 'heavier', 'expensive', 'black', 'sock', 'seeif', 'help', 'ease', 'discomfort', 'work', 'stand', 'day', 'help', 'heavy', 'warm', 'summer', 'end', 'buy', 'twopairs', 'try', 'help', 'great', 'deal', 'eventually', 'purchase', 'pair', 'win', 'wear', 'dayand', 'machine', 'wash', 'air', 'dry', 'go', 'dryer', 'time', 'count', 'affect', 'fit', 'don', 'roll', 'like', 'match', 'sock', 'instead', 'fold', 'drawer', 'stretch', 'pretty', 'frugaland', 'search', 'better', 'price', 'quality', 'couldn', 'find', 'better', 'think', 'expensive', 'worth', 'themoney']


Then simply fit on the training split and transform the other two:

In [27]:
feature_extractor = FeatureExtraction(algo, n_topics, max_vocab_size)
X_tr  = feature_extractor.fit_transform(dtrain.X)
X_val = feature_extractor.transform(dvalid.X)
X_te  = feature_extractor.transform(dtest.X)

In [28]:
X_tr

<222906x20000 sparse matrix of type '<class 'numpy.float64'>'
	with 4724123 stored elements in Compressed Sparse Row format>

To use `LDA` or `EnsembleTopics` instead, change the settings accordingly: set `algo` to `'lda'` or `'ensemb'` and give `n_topics` a value.
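
As a rough idea of what the LDA variant produces, here is a self-contained sketch of a count-vectorizer plus LDA pipeline on a made-up corpus. The documents, the `identity` helper and `n_topics = 2` are assumptions chosen to keep the example tiny; they are not the settings used for the real dataset.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


def identity(tokens):
    """Return the token list unchanged (the documents are already tokenized)."""
    return tokens


# Hypothetical pre-tokenized documents
docs = [
    ["sock", "wash", "dry", "sock"],
    ["sock", "fit", "stretch"],
    ["price", "quality", "worth"],
    ["price", "expensive", "worth", "money"],
]

n_topics = 2  # toy value for illustration only
lda_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(preprocessor=identity, tokenizer=identity,
                                   token_pattern=None)),
    ("model", LatentDirichletAllocation(n_components=n_topics, random_state=0)),
])

# Each row is a document's distribution over topics
doc_topics = lda_pipeline.fit_transform(docs)
print(doc_topics.shape)
print(doc_topics.sum(axis=1))
```

Unlike the sparse tf-idf output, the result is a small dense array with one column per topic, and each row sums to one.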