# Topic Modeling and Document Clustering with LDA

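This notebook builds an unsupervised topic model on the training set of the IMDb movie reviews (aclImdb): it loads and cleans the reviews, turns them into bag-of-words counts, fits the topic model, and then saves and deploys the resulting pipeline with a custom scoring script.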

In [1]:
# add scripts/ folder to path
import os, sys

SCRIPTS_PATH = os.environ['DSX_PROJECT_DIR'] + '/scripts'
sys.path.insert(0, SCRIPTS_PATH)

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

import visualization # custom script

In [3]:
DATASET_PATH = "/user-home/libraries/text-analytics/datasets/aclImdb"
TRAIN_PATH = DATASET_PATH + "/train/"
TEST_PATH = DATASET_PATH + "/test/"

## 0. Load files

In [4]:
from sklearn.datasets import load_files

We load only the training set and ignore its labels, treating the reviews as unlabeled data:

In [5]:
reviews_train = load_files(TRAIN_PATH)
text_train = reviews_train.data
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))

type of text_train: <class 'list'>
length of text_train: 25000


## 1. Preprocessing

Even though the preprocessing is short and straightforward, we probably want to move this to a script at some point.

In [6]:
text_train = [doc.replace(b"<br />", b" ").decode('utf-8') for doc in text_train]

In [7]:
text_train = pd.DataFrame({"review": text_train})
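If this preprocessing is moved to a script later, it could look roughly like the sketch below. The function names `clean_review` and `clean_reviews` are hypothetical and are not part of the existing `scripts/` folder; the logic is the same replace-and-decode step as in the cell above.

```python
def clean_review(raw, encoding="utf-8"):
    """Replace HTML line breaks with spaces and decode the bytes to text.

    Hypothetical helper: mirrors the list comprehension used in the notebook.
    """
    return raw.replace(b"<br />", b" ").decode(encoding)


def clean_reviews(raw_docs):
    """Apply clean_review to every raw document (bytes) in a list."""
    return [clean_review(doc) for doc in raw_docs]


# Made-up sample in the same shape as the entries returned by load_files
sample = [b"Great film.<br /><br />Would watch again."]
cleaned = clean_reviews(sample)
print(cleaned[0])
```

Each `<br />` becomes a single space, so consecutive line breaks leave consecutive spaces; this is harmless for `CountVectorizer`, which splits on word boundaries.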

## 2. Feature Engineering

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

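We limit the number of features to speed up the topic modeling: `max_df=.15` drops terms that appear in more than 15% of the reviews, and `max_features=10000` keeps only the 10,000 most frequent of the remaining terms.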

In [9]:
vect = CountVectorizer(max_features=10000, max_df=.15)
X_train = vect.fit_transform(text_train.review)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<25000x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 1948677 stored elements in Compressed Sparse Row format>
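To make the effect of `max_df` and `max_features` concrete, here is a minimal pure-Python sketch of the pruning rule on a toy corpus. It is an illustration under stated assumptions, not `CountVectorizer`'s implementation: the helper `prune_vocabulary` is hypothetical, the tokenizer only approximates the default token pattern (lowercased words of two or more characters), and ties in term frequency are broken alphabetically here.

```python
import re
from collections import Counter


def prune_vocabulary(docs, max_df=0.15, max_features=10000):
    """Hypothetical helper: keep the most frequent terms that are not too common."""
    token_pattern = re.compile(r"\b\w\w+\b")
    tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # max_df as a proportion: drop terms present in more than max_df of the documents
    max_doc_count = max_df * len(docs)
    kept = [term for term in term_freq if doc_freq[term] <= max_doc_count]

    # max_features: keep the terms with the highest corpus-wide frequency
    kept.sort(key=lambda term: (-term_freq[term], term))
    return sorted(kept[:max_features])


docs = [
    "the movie was great",
    "the plot was thin",
    "the acting was great",
    "the ending surprised me",
]
vocab = prune_vocabulary(docs, max_df=0.5, max_features=3)
print(vocab)
```

On this toy corpus, "the" and "was" appear in more than half of the documents and are removed, which is the same mechanism that filters very common words from the reviews.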


In [10]:
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 10000
First 20 features:
['00', '000', '10', '100', '1000', '101', '11', '12', '13', '13th', '14', '15', '150', '16', '17', '18', '18th', '19', '1920', '1920s']
Every 2000th feature:
['00', 'conroy', 'graphic', 'named', 'sharp']


## 3. Build model

...

### 3.2 Non-Negative Matrix Factorization (NMF), 10 topics

In [21]:
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer

In [22]:
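pipe = make_pipeline(FunctionTransformer(pd.DataFrame.get, kw_args={'key':'review'}, validate=False),
                     # cap the vocabulary at 10,000 terms, consistent with the reported components_ shape
                     CountVectorizer(max_features=10000),
                     NMF(n_components=10, max_iter=25, random_state=0))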

In [25]:
%%time
document_topics_nmf = pipe.fit_transform(text_train)
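# `nmf` is not defined yet at this point, so read the fitted NMF step from the pipeline
print("nmf.components_.shape: {}".format(pipe.steps[2][1].components_.shape))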

nmf.components_.shape: (10, 10000)
CPU times: user 18.4 s, sys: 19 s, total: 37.4 s
Wall time: 16 s
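The matrix returned by `fit_transform` has one row per review and one column per topic. One simple way to use it for document clustering, sketched below with made-up numbers rather than the real `document_topics_nmf`, is to assign each review to the topic with the largest weight. The array `doc_topics` is a toy stand-in and its values are assumptions for illustration only.

```python
import numpy as np

# Toy stand-in for document_topics_nmf: 4 documents, 3 topics (made-up weights)
doc_topics = np.array([
    [0.0, 0.8, 0.1],
    [0.5, 0.2, 0.3],
    [0.1, 0.1, 0.9],
    [0.4, 0.6, 0.0],
])

# Index of the strongest topic for each document, used as a cluster label
dominant_topic = np.argmax(doc_topics, axis=1)

# Number of documents assigned to each topic
cluster_sizes = np.bincount(dominant_topic, minlength=doc_topics.shape[1])

print(dominant_topic)
print(cluster_sizes)
```

This hard assignment discards the information that a review can mix several topics, so it is best treated as a quick summary rather than a replacement for the full weight matrix.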


In [29]:
nmf = pipe.steps[2][1]
vect = pipe.steps[1][1]

In [30]:
# for each topic (a row in the components_), sort the features (ascending).
# Invert rows with [:, ::-1] to make sorting descending
sorting_nmf = np.argsort(nmf.components_, axis=1)[:, ::-1]
feature_names = np.array(vect.get_feature_names())
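The sorting step above can be checked on a small example. The sketch below uses a made-up 2-topic, 4-word `components` matrix and hypothetical feature names; it applies the same `argsort` and reversal, then indexes the feature names to get the top words per topic.

```python
import numpy as np

# Toy stand-in for nmf.components_: 2 topics (rows), 4 words (columns)
components = np.array([
    [0.1, 2.0, 0.5, 0.0],
    [1.5, 0.2, 0.0, 0.9],
])
feature_names = np.array(["actor", "horror", "plot", "song"])

# argsort sorts each row in ascending order; [:, ::-1] reverses it to descending
sorting = np.argsort(components, axis=1)[:, ::-1]

# The first n columns of `sorting` hold the indices of the n heaviest words per topic
n_words = 2
top_words = feature_names[sorting[:, :n_words]]
print(top_words)
```

Because `feature_names` is a NumPy array, indexing it with the 2-D index array returns a 2-D array of words with one row per topic.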

#### Explore the topics

In [31]:
# Print out the 10 topics:
visualization.print_topics(topics=range(10), feature_names=feature_names,
                           sorting=sorting_nmf, topics_per_chunk=5, n_words=10)

topic 0       topic 1       topic 2       topic 3       topic 4       
--------      --------      --------      --------      --------      
the           it            and           is            to            
of            that          with          that          that          
and           was           is            it            be            
to            but           to            are           have          
in            you           are           not           they          
on            to            all           this          for           
with          and           their         but           on            
is            so            very          the           you           
for           just          of            as            with          
from          not           as            there         who           


topic 5       topic 6       topic 7       topic 8       topic 9       
--------      --------      --------      --------      --------      
of  

## 4. Store model

### 4.1 Save model in ML repository

In [32]:
from dsx_ml.ml import save

In [36]:
deployment_info = save(name='simple-topic-modeling',
                        model=pipe,
                        algorithm_type='Classification', # Only classification and regression are supported
                        description='This is the first simple topic modeling with NMF',
                        source='simple-topic-modeling.ipynb')
print(deployment_info)

{'path': '/user-home/1055/DSX_Projects/text-deployment-demo/models/simple-topic-modeling/1', 'scoring_endpoint': 'https://dsxl-api/v3/project/score/Python35/scikit-learn-0.19/text-deployment-demo/simple-topic-modeling/1'}


### 4.2 Test model in Models UI

The Models UI doesn't support unsupervised scikit-learn models, so the model can't be tested there.

### 4.3 Test model with REST API call

Similarly, the automatically generated API can't be used directly with unsupervised scikit-learn models.

### 4.4 Create a custom scoring script

Out of the box, WS Local supports only classification and regression models for scikit-learn. Deploying an unsupervised model is still straightforward and requires only a few steps.

**1. In the "models" section, select the model we just saved, and click on "Generate custom scoring script"**
<img style="float: left;" src="https://i.imgur.com/0LsTt5o.png" alt="Step 1 - Create custom script" width=900 />

**2. This generates a script containing the same functions that are created automatically when the model is scored through the raw API or the UI. Generating the script is also a good way to debug a deployment that failed. By default, the script is set to run as a web service; to debug it, we can set it to run as a job in the "Run Configuration" panel.**
<img style="float: left;" src="https://i.imgur.com/xi7vF9B.png" alt="Step 2 - Modify custom script: switch to job" width=900 />

**3. To switch from classification to topic modeling, all we need to do is to identify the part where the predict() function is called, and switch it to transform() instead.**
<img style="float: left;" src="https://i.imgur.com/f1A4v5Q.png" alt="Step 3 - Modify custom script: switch to transform()" width=900 />

**4. To try out the code modification, call the test_score() function within the script and print the result. Then save the script and run it. If the console output is hidden, drag it from the right of the screen.**
<img style="float: left;" src="https://i.imgur.com/v4yQHvK.png" alt="Step 4 - Test the custom script" width=900 />

**5. Once the code is running, switch the "Run configuration" back to "Web Service", save it, and select the run button again. The UI will show the steps needed to call the newly created API and score the model.**
<img style="float: left;" src="https://i.imgur.com/GmmS5kf.png" alt="Step 5 - Switch the run configuration back to web service" width=900 />
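The change described in step 3 can be sketched as follows. This is not the script that WS Local generates: the function name `score_topics`, the response key `topic_weights`, and the `StubPipeline` class are all hypothetical, and the stub only stands in for the saved pipeline so that the example runs on its own. The one point it illustrates is calling `transform()` where a classifier script would call `predict()`.

```python
class StubPipeline:
    """Hypothetical stand-in for the saved scikit-learn pipeline.

    Returns two placeholder numbers per review (word count and character
    count) instead of real topic weights.
    """

    def transform(self, reviews):
        return [[float(len(text.split())), float(len(text))] for text in reviews]


def score_topics(model, reviews):
    """Score reviews with an unsupervised model.

    A classification script would call model.predict(reviews) here;
    a topic model has no predict(), so we call transform() instead.
    """
    topic_weights = model.transform(reviews)
    # Convert to plain Python floats so the result can be serialized as JSON
    return {"topic_weights": [[float(w) for w in row] for row in topic_weights]}


result = score_topics(StubPipeline(), ["a fine film", "dull"])
print(result)
```

With the real pipeline, each inner list would hold one weight per topic, in the same layout as `document_topics_nmf`.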