# Introduction
***

Amazon SageMaker NTM (Neural Topic Model) is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. NTM is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.
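
To make the "mixture of topics" idea concrete, here is a minimal sketch in plain Python. The two topics, the four-word vocabulary, and all probabilities are made-up illustrative values, not output from NTM. The sketch only shows how a document's expected word distribution can arise as a weighted sum of per-topic word distributions.

```python
# Hypothetical toy example: 2 topics over a 4-word vocabulary.
vocabulary = ["goal", "match", "vote", "senate"]

# Each topic is a probability distribution over the vocabulary (each row sums to 1).
topic_word = [
    [0.5, 0.4, 0.05, 0.05],   # a "sports"-like topic
    [0.05, 0.05, 0.4, 0.5],   # a "politics"-like topic
]

# Each document is a mixture of topics (the weights sum to 1).
doc_topic_weights = [0.75, 0.25]


def document_word_distribution(weights, topics):
    """Weighted sum of the topic distributions, one entry per vocabulary word."""
    n_words = len(topics[0])
    return [
        sum(w * topic[j] for w, topic in zip(weights, topics))
        for j in range(n_words)
    ]


word_probs = document_word_distribution(doc_topic_weights, topic_word)
for word, p in zip(vocabulary, word_probs):
    print("{}: {:.4f}".format(word, p))
```

Because both the topic rows and the document weights are probability distributions, the resulting word probabilities also sum to 1.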

In this notebook we will use the Amazon SageMaker NTM algorithm to train a model on bag-of-words data extracted from NY Times comments. We will then use this model to classify (perform inference on) the data. The main goals of this notebook are to:

* create an Amazon SageMaker training job on a data set to produce an NTM model,
* use the model to perform inference with an Amazon SageMaker endpoint.

In [5]:
import os
import sagemaker
import boto3
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

role = get_execution_role()

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-228889150161


## Training

Once the data is preprocessed and available in a recommended format, the next step is to train our model on the data. The NTM algorithm requires a number of parameters to configure the model and to define the computational environment in which training will take place. The first of these points to a container image which holds the algorithm's training and hosting code.

In [2]:
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/ntm:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/ntm:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/ntm:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/ntm:latest'}
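
The estimator cell later indexes this dictionary with `containers[boto3.Session().region_name]`, which raises a bare `KeyError` when the notebook runs in a region that is not listed. A minimal sketch of a friendlier lookup follows. The helper name `ntm_image_for_region` is hypothetical and not part of the SageMaker SDK, and the dictionary simply repeats the four entries above.

```python
# Same four region-to-image entries as in the cell above.
containers = {
    'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/ntm:latest',
    'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/ntm:latest',
    'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/ntm:latest',
    'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/ntm:latest',
}


def ntm_image_for_region(region_name, images=containers):
    """Return the NTM image URI for a region, or raise a descriptive error.

    Hypothetical helper for illustration; it is not an SDK function.
    """
    try:
        return images[region_name]
    except KeyError:
        listed = ', '.join(sorted(images))
        raise ValueError(
            "No NTM image listed for region {!r}; listed regions: {}".format(
                region_name, listed)
        )


print(ntm_image_for_region('us-east-1'))
```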

An NTM model uses the following hyperparameters:

- **num_topics** - The number of topics or categories in the NTM model. 
- **feature_dim** - The size of the "vocabulary". In this case, this has been set to 1000 by the nytimes pyspark data prep.

In addition to these NTM model hyperparameters, we provide additional parameters defining things like the EC2 instance type on which training will run, the S3 bucket containing the data, and the AWS access role.

> Note: Try adjusting the mini_batch_size if running on a GPU. 

In [4]:
num_topics=20
vocabulary_size=1000
output = 's3://{}/data/nytimes-model/sagemaker-ntm'.format(sagemaker_session.default_bucket())

ntm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.c4.xlarge',
                                    output_path=output,
                                    sagemaker_session=sagemaker_session)

ntm.set_hyperparameters(num_topics=num_topics,
                        feature_dim=vocabulary_size,
                        mini_batch_size=256)

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-228889150161


We'll train against the bag-of-words extracted from the NY Times comments.
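
The data preparation itself is not shown in this notebook. As a rough illustration of what a bag-of-words vector looks like, the sketch below counts vocabulary words in a single comment. The five-word vocabulary, the sample comment, and the whitespace tokenizer are all assumptions made for the example; the real preparation produces 1000-dimensional vectors in RecordIO format.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`.

    Words outside the vocabulary are ignored, so the vector length always
    equals len(vocabulary), which plays the role of feature_dim.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical 5-word vocabulary; the notebook's vocabulary has 1000 words.
vocabulary = ["budget", "city", "school", "tax", "vote"]
comment = "The city school budget needs a vote and the school tax needs a vote"

vector = bag_of_words(comment, vocabulary)
print(vector)  # [1, 1, 2, 1, 2]
```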

In [10]:
import boto3
s3_client = boto3.client('s3')
objects = s3_client.list_objects(Bucket=bucket, Prefix='data/nyt-record-io/training.rec')
training_key = objects['Contents'][0]['Key']
training_input = 's3://{}/{}'.format(bucket, training_key)

In [11]:
ntm.fit({'train': training_input})

INFO:sagemaker:Creating training-job with name: ntm-2018-06-01-21-13-08-814


.....................
Docker entrypoint called with argument(s): train
[06/01/2018 21:16:30 INFO 140428234315584] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'epochs': u'50', u'weight_decay': u'0.0', u'_num_kv_servers': u'auto', u'encoder_layers_activation': u'sigmoid', u'mini_batch_size': u'256', u'tolerance': u'0.001', u'batch_norm': u'false'}
[06/01/2018 21:16:30 INFO 140428234315584] Reading provided configuration from /opt/ml/input/config/hyperparameters.json: {u'feature_dim': u'1000', u'mini_batch_size': u'256', u'num_topics': u'20'}
[06/01/2018 21:16:30 INFO 140428234315584] Final configuration: {u'num_patience_epoch

[06/01/2018 21:17:44 INFO 140428234315584] # Finished training epoch 4 on 246918 examples from 965 batches, each of size 256.
[06/01/2018 21:17:44 INFO 140428234315584] Metrics for Training:
[06/01/2018 21:17:44 INFO 140428234315584] Loss (name: value) total: 6.43195738619
[06/01/2018 21:17:44 INFO 140428234315584] Loss (name: value) kld: 0.0505995121278
[06/01/2018 21:17:44 INFO 140428234315584] Loss (name: value) recons: 6.38135787267
[06/01/2018 21:17:44 INFO 140428234315584] Loss (name: value) logppx: 6.43195738619
[06/01/2018 21:17:44 INFO 140428234315584] #quality_metric: host=algo-1, epoch=4, train total_loss <loss>=6.43195738619
[06/01/2018 21:17:44 INFO 140428234315584] patience losses:[6.4783105405501136, 6.4561878028928925, 6.4456030966704372] min patience loss:6.44560309667 current loss:6.43195738619 absolute loss difference:0.0136457104757
[06/01/2018 21:17:44 INFO 140428234315584] #progress_metri

[06/01/2018 21:18:58 INFO 140428234315584] # Finished training epoch 8 on 246918 examples from 965 batches, each of size 256.
[06/01/2018 21:18:58 INFO 140428234315584] Metrics for Training:
[06/01/2018 21:18:58 INFO 140428234315584] Loss (name: value) total: 6.40672978515
[06/01/2018 21:18:58 INFO 140428234315584] Loss (name: value) kld: 0.0810080907004
[06/01/2018 21:18:58 INFO 140428234315584] Loss (name: value) recons: 6.32572168563
[06/01/2018 21:18:58 INFO 140428234315584] Loss (name: value) logppx: 6.40672978515
[06/01/2018 21:18:58 INFO 140428234315584] #quality_metric: host=algo-1, epoch=8, train total_loss <loss>=6.40672978515
[06/01/2018 21:18:58 INFO 140428234315584] patience losses:[6.4251035154174643, 6.4178449648031917, 6.410954800294471] min patience loss:6.41095480029 current loss:6.40672978515 absolute loss difference:0.00422501514613
[06/01/2018 21:18:58 INFO 140428234315584] #progress_metri

[06/01/2018 21:20:11 INFO 140428234315584] # Finished training epoch 12 on 246918 examples from 965 batches, each of size 256.
[06/01/2018 21:20:11 INFO 140428234315584] Metrics for Training:
[06/01/2018 21:20:11 INFO 140428234315584] Loss (name: value) total: 6.38788782797
[06/01/2018 21:20:11 INFO 140428234315584] Loss (name: value) kld: 0.106702757824
[06/01/2018 21:20:11 INFO 140428234315584] Loss (name: value) recons: 6.28118507109
[06/01/2018 21:20:11 INFO 140428234315584] Loss (name: value) logppx: 6.38788782797
[06/01/2018 21:20:11 INFO 140428234315584] #quality_metric: host=algo-1, epoch=12, train total_loss <loss>=6.38788782797
[06/01/2018 21:20:11 INFO 140428234315584] patience losses:[6.4030098003426978, 6.398642103412608, 6.3932190287298489] min patience loss:6.39321902873 current loss:6.38788782797 absolute loss difference:0.00533120076273
[06/01/2018 21:20:11 INFO 140428234315584] #progress_metr

[06/01/2018 21:21:25 INFO 140428234315584] # Finished training epoch 16 on 246918 examples from 965 batches, each of size 256.
[06/01/2018 21:21:25 INFO 140428234315584] Metrics for Training:
[06/01/2018 21:21:25 INFO 140428234315584] Loss (name: value) total: 6.37562277626
[06/01/2018 21:21:25 INFO 140428234315584] Loss (name: value) kld: 0.126761106521
[06/01/2018 21:21:25 INFO 140428234315584] Loss (name: value) recons: 6.24886167136
[06/01/2018 21:21:25 INFO 140428234315584] Loss (name: value) logppx: 6.37562277626
[06/01/2018 21:21:25 INFO 140428234315584] #quality_metric: host=algo-1, epoch=16, train total_loss <loss>=6.37562277626
[06/01/2018 21:21:25 INFO 140428234315584] patience losses:[6.3854912844346599, 6.3830994314480325, 6.3797528617740298] min patience loss:6.37975286177 current loss:6.37562277626 absolute loss difference:0.00413008551523
[06/01/2018 21:21:25 INFO 140428234315584] #progress_met

[06/01/2018 21:22:39 INFO 140428234315584] # Finished training epoch 20 on 246918 examples from 965 batches, each of size 256.
[06/01/2018 21:22:39 INFO 140428234315584] Metrics for Training:
[06/01/2018 21:22:39 INFO 140428234315584] Loss (name: value) total: 6.36674192458
[06/01/2018 21:22:39 INFO 140428234315584] Loss (name: value) kld: 0.141850165776
[06/01/2018 21:22:39 INFO 140428234315584] Loss (name: value) recons: 6.22489174882
[06/01/2018 21:22:39 INFO 140428234315584] Loss (name: value) logppx: 6.36674192458
[06/01/2018 21:22:39 INFO 140428234315584] #quality_metric: host=algo-1, epoch=20, train total_loss <loss>=6.36674192458
[06/01/2018 21:22:39 INFO 140428234315584] patience losses:[6.3726115681346833, 6.3700844641176531, 6.3691815168746393] min patience loss:6.36918151687 current loss:6.36674192458 absolute loss difference:0.00243959229227
[06/01/2018 21:22:39 INFO 140428234315584] #progress_met

[06/01/2018 21:23:52 INFO 140428234315584] # Finished training epoch 24 on 246918 examples from 965 batches, each of size 256.
[06/01/2018 21:23:52 INFO 140428234315584] Metrics for Training:
[06/01/2018 21:23:52 INFO 140428234315584] Loss (name: value) total: 6.36119400454
[06/01/2018 21:23:52 INFO 140428234315584] Loss (name: value) kld: 0.152351276129
[06/01/2018 21:23:52 INFO 140428234315584] Loss (name: value) recons: 6.20884273139
[06/01/2018 21:23:52 INFO 140428234315584] Loss (name: value) logppx: 6.36119400454
[06/01/2018 21:23:52 INFO 140428234315584] #quality_metric: host=algo-1, epoch=24, train total_loss <loss>=6.36119400454
[06/01/2018 21:23:52 INFO 140428234315584] patience losses:[6.3637985992925774, 6.3620918338162911, 6.3610591923016955] min patience loss:6.3610591923 current loss:6.36119400454 absolute loss difference:0.000134812241392
[06/01/2018 21:23:52 INFO 140428234315584] Bad epoch: lo

===== Job Complete =====
Billable seconds: 593
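
A few numbers in the training log can be checked by hand. The sketch below uses values copied from the epoch 4, 20, and 24 log lines to confirm that the total loss is the sum of the `kld` and `recons` terms and that 246918 examples in mini-batches of 256 give 965 batches. It also shows one possible reading of the patience check. The `is_bad_epoch` rule is an assumption inferred from the logged values and the `tolerance` default of 0.001; it is not taken from the algorithm's documentation.

```python
import math

# Values copied from the epoch 4 log lines above.
total, kld, recons = 6.43195738619, 0.0505995121278, 6.38135787267
# The total loss matches the KL term plus the reconstruction term.
print(abs(total - (kld + recons)) < 1e-6)

# 246918 examples in mini-batches of 256: the last batch is partial.
print(math.ceil(246918 / 256))  # 965


def is_bad_epoch(patience_losses, current_loss, tolerance=0.001):
    """Assumed rule: an epoch is 'bad' when it fails to improve on the best
    recent loss by at least `tolerance`."""
    improvement = min(patience_losses) - current_loss
    return improvement < tolerance


# Epoch 20: improvement of about 0.0024, which exceeds the tolerance.
print(is_bad_epoch(
    [6.3726115681346833, 6.3700844641176531, 6.3691815168746393],
    6.36674192458))
# Epoch 24: the loss is slightly above the best recent loss; the log flags this epoch as bad.
print(is_bad_epoch(
    [6.3637985992925774, 6.3620918338162911, 6.3610591923016955],
    6.36119400454))
```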


## Inference

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the topic mixture representing a given document or comment.

This is simplified by the deploy function provided by the Amazon SageMaker Python SDK.

In [5]:
ntm_predictor = ntm.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: ntm-2018-05-23-14-49-44-899
INFO:sagemaker:Creating endpoint with name ntm-2018-05-23-14-34-25-654


--------------------------------------------------------------------------!

## Perform Inference

With this real-time endpoint at our fingertips we can finally perform inference on our training and test data.  We should first discuss the meaning of the SageMaker NTM inference output.

For each document we wish to compute its corresponding `topic_weights`. Each set of topic weights is a probability distribution over the topics, of which there are 20 in this example (`num_topics=20`). Each element of the topic weights is the proportion to which the input document is represented by the corresponding topic.

For example, in a smaller model with only 5 topics, if the topic weights of an input document $\mathbf{w}$ are,

$$\theta = \left[ 0.3, 0.2, 0, 0.5, 0 \right]$$

then $\mathbf{w}$ is 30% generated from Topic #1, 20% from Topic #2, and 50% from Topic #4. Below, we compute the topic mixtures for a set of test comments.
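
As a small illustration of how such a vector can be read, the sketch below checks that the weights form a probability distribution and lists the topics that contribute. The helper name `describe_mixture` is hypothetical, and the topic numbers are 1-based to match the wording above.

```python
def describe_mixture(topic_weights, tol=1e-6):
    """Return (topic_number, percent) pairs for topics with non-zero weight.

    Topic numbers are 1-based to match the "Topic #1" wording in the text.
    """
    if any(w < 0 for w in topic_weights):
        raise ValueError("topic weights must be non-negative")
    if abs(sum(topic_weights) - 1.0) > tol:
        raise ValueError("topic weights must sum to 1")
    return [(i + 1, round(100 * w)) for i, w in enumerate(topic_weights) if w > 0]


theta = [0.3, 0.2, 0, 0.5, 0]
for topic, percent in describe_mixture(theta):
    print("Topic #{}: {}%".format(topic, percent))
```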

First, we set up our serializer and deserializer. The serializer converts NumPy arrays to CSV strings, which we pass in the HTTP POST request to our hosted endpoint; the deserializer parses the JSON response that comes back.

In [6]:
from sagemaker.predictor import csv_serializer, json_deserializer
ntm_predictor.content_type = 'text/csv'
ntm_predictor.serializer = csv_serializer
ntm_predictor.deserializer = json_deserializer
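
Conceptually, the serializer turns each row of numbers into a line of CSV text and the deserializer turns the JSON response back into Python objects. The sketch below mimics that round trip with the standard library only. `rows_to_csv` and `parse_topic_weights` are hypothetical helpers, not SDK functions, and the sample response is made up; only its `predictions` and `topic_weights` keys follow what the prediction cell in this notebook reads.

```python
import json


def rows_to_csv(rows):
    """Serialize rows of numbers as CSV text, one document per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_topic_weights(response_body):
    """Extract one topic-weight list per document from a JSON response body."""
    payload = json.loads(response_body)
    return [p["topic_weights"] for p in payload["predictions"]]


# Two hypothetical bag-of-words rows (the real ones have 1000 columns).
request_body = rows_to_csv([[0, 2, 1], [1, 0, 0]])
print(request_body)

# Made-up response using the same keys the notebook reads from the endpoint.
response_body = (
    '{"predictions": [{"topic_weights": [0.7, 0.3]},'
    ' {"topic_weights": [0.1, 0.9]}]}'
)
print(parse_topic_weights(response_body))
```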

In [7]:
import numpy as np
test_comments = np.load('test_comments.npy')

In [8]:
results = ntm_predictor.predict(test_comments)
predictions = np.array([prediction['topic_weights'] for prediction in results['predictions']])
print(predictions[5])

[0.03881462 0.0330111  0.02953554 0.44098815 0.03035911 0.0249091
 0.01516498 0.02849843 0.02895007 0.02496313 0.02657804 0.0277223
 0.02970661 0.0321646  0.03495577 0.03026009 0.0334294  0.02988204
 0.03124016 0.02886663]


In [9]:
np.argmax(predictions, axis=1)

array([3, 3, 0, 3, 3, 3, 3, 3, 3, 3])
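
`np.argmax` keeps only the single strongest topic per document. To see how the remaining weight is spread, it can help to rank the topics instead. The sketch below does this for the weights printed for `predictions[5]` above, using plain Python; `top_topics` is a hypothetical helper name.

```python
# Topic weights copied from the printed `predictions[5]` output above.
weights = [0.03881462, 0.0330111, 0.02953554, 0.44098815, 0.03035911, 0.0249091,
           0.01516498, 0.02849843, 0.02895007, 0.02496313, 0.02657804, 0.0277223,
           0.02970661, 0.0321646, 0.03495577, 0.03026009, 0.0334294, 0.02988204,
           0.03124016, 0.02886663]


def top_topics(topic_weights, k=3):
    """Return the k (index, weight) pairs with the largest weights, best first."""
    ranked = sorted(enumerate(topic_weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


print(len(weights))                      # one weight per topic (num_topics=20)
print(abs(sum(weights) - 1.0) < 1e-6)    # the weights sum to 1 up to rounding
print(top_topics(weights))               # topic index 3 dominates
```

In this document, topic index 3 carries about 44% of the weight, while every other topic stays below 4%.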

In [10]:
sagemaker.Session().delete_endpoint(ntm_predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: ntm-2018-05-22-22-03-44-121
