# Introduction
***

Amazon SageMaker NTM (Neural Topic Model) is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. NTM is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.
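To make the "mixture of topics" idea concrete, here is a minimal NumPy sketch. The vocabulary, topic matrix, and document weights are all made-up toy values, and the linear mixture shown is the generic topic-model picture rather than the exact decoder NTM uses internally.

```python
import numpy as np

# Hypothetical toy setup: 3 topics over a 4-word vocabulary.
vocab = ["game", "team", "vote", "tax"]

# Each row is one topic, expressed as a probability distribution over the vocabulary.
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],  # a "sports"-like topic
    [0.05, 0.05, 0.50, 0.40],  # a "politics"-like topic
    [0.25, 0.25, 0.25, 0.25],  # a diffuse background topic
])

# One document described as a mixture of the topics (weights sum to 1).
doc_topics = np.array([0.7, 0.2, 0.1])

# Expected word distribution of the document: the weighted sum of the topic rows.
doc_words = doc_topics @ topic_word

for word, prob in zip(vocab, doc_words):
    print("{:>5}: {:.3f}".format(word, prob))
```

Because every topic row and the mixture weights each sum to one, the resulting word distribution also sums to one.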

In this notebook we will use the Amazon SageMaker NTM algorithm to train a model on bag-of-words data extracted from NY Times comments. We will then use this model to perform inference on the data, that is, to estimate the topic mixture of each document. The main goals of this notebook are to:

* create an Amazon SageMaker training job on a data set to produce an NTM model,
* use the model to perform inference with an Amazon SageMaker endpoint.

In [1]:
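import os
import sagemaker
import boto3
from sagemaker import get_execution_role

# Reuse one session for every SageMaker call in this notebook.
sagemaker_session = sagemaker.Session()
# Point the session at the bucket that holds the training data and model output.
# Note: _default_bucket is a private attribute of the session object.
sagemaker_session._default_bucket = 'scw-use1-cors-test-3'

# IAM role that SageMaker assumes for training and hosting.
role = get_execution_role()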

## Training

Once the data is preprocessed and available in a recommended format, the next step is to train our model on the data. The NTM algorithm requires a number of parameters to configure the model and to define the computational environment in which training will take place. The first of these points to a container image which holds the algorithm's training and hosting code.

In [2]:
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/ntm:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/ntm:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/ntm:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/ntm:latest'}
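The dictionary above lists only four regions, so indexing it with any other region name fails with a bare `KeyError`. A small guard can make that failure easier to read. This is a sketch only: `ntm_image_for_region` is a hypothetical helper name, not part of the SageMaker SDK, and the image paths are copied from the cell above.

```python
# Image paths copied from the cell above.
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/ntm:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/ntm:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/ntm:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/ntm:latest'}


def ntm_image_for_region(region_name, registry=containers):
    """Return the NTM image path for a region, or raise a readable error.

    Hypothetical helper for illustration; not an SDK function.
    """
    if region_name not in registry:
        listed = ", ".join(sorted(registry))
        raise ValueError(
            "No NTM image listed for region {!r}; listed regions: {}".format(
                region_name, listed))
    return registry[region_name]


print(ntm_image_for_region('us-east-1'))
```

In the notebook, the region name would come from `boto3.Session().region_name`.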

An NTM model uses the following hyperparameters:

- **num_topics** - The number of topics or categories in the NTM model. 
- **feature_dim** - The size of the "vocabulary". In this case, this has been set to 1000 by the nytimes pyspark data prep.
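To illustrate what `feature_dim` counts, the sketch below turns one comment into a bag-of-words vector whose length equals the vocabulary size. The five-word vocabulary and the `bag_of_words` helper are hypothetical; the actual data prep for this notebook was done by a separate pyspark job that is not shown here and may tokenize differently.

```python
from collections import Counter

import numpy as np

# Hypothetical 5-word vocabulary; the notebook's real vocabulary has 1000 entries.
vocabulary = ["tax", "vote", "game", "team", "court"]
word_index = {word: i for i, word in enumerate(vocabulary)}


def bag_of_words(text, word_index):
    """Count occurrences of each vocabulary word in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    vector = np.zeros(len(word_index), dtype=np.float32)
    for word, count in counts.items():
        if word in word_index:  # out-of-vocabulary words are dropped
            vector[word_index[word]] = count
    return vector


comment = "The court vote on the tax is a vote on fairness"
vector = bag_of_words(comment, word_index)
print(vector)        # one count per vocabulary word
print(vector.shape)  # (feature_dim,) for this toy vocabulary
```

Every document becomes a vector of the same length, which is the value passed as `feature_dim`.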

In addition to these NTM model hyperparameters, we provide additional parameters defining things like the EC2 instance type on which training will run, the S3 bucket containing the data, and the AWS access role.

> Note: Try adjusting the mini_batch_size if running on a GPU. 

In [3]:
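# Model size: number of topics to learn and length of each input vector.
num_topics = 20
vocabulary_size = 1000

# Generic Estimator pointed at the NTM container for the current region.
ntm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.p3.2xlarge',
                                    output_path='s3://{}/data/nytimes-model/sagemaker-ntm'.format('scw-use1-cors-test-3'),
                                    sagemaker_session=sagemaker_session)

# feature_dim must match the vocabulary size used during data prep.
ntm.set_hyperparameters(num_topics=num_topics,
                        feature_dim=vocabulary_size,
                        mini_batch_size=1024)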

We'll train against the bag-of-words extracted from the NY Times comments.

In [4]:
ntm.fit({'train': 's3://{}/data/nytimes-model/recordio/ntm.data'.format('scw-use1-cors-test-3')})

INFO:sagemaker:Creating training-job with name: ntm-2018-05-23-14-34-25-654


.......................
Docker entrypoint called with argument(s): train
[05/23/2018 14:38:07 INFO 140609680480064] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'epochs': u'50', u'weight_decay': u'0.0', u'_num_kv_servers': u'auto', u'encoder_layers_activation': u'sigmoid', u'mini_batch_size': u'256', u'tolerance': u'0.001', u'batch_norm': u'false'}
[05/23/2018 14:38:07 INFO 140609680480064] Reading provided configuration from /opt/ml/input/config/hyperparameters.json: {u'feature_dim': u'1000', u'mini_batch_size': u'1024', u'num_topics': u'20'}
[05/23/2018 14:38:07 INFO 140609680480064] Final configuration: {u'num_patience_ep

[05/23/2018 14:38:34 INFO 140609680480064] # Finished training epoch 4 on 260977 examples from 255 batches, each of size 1024.
[05/23/2018 14:38:34 INFO 140609680480064] Metrics for Training:
[05/23/2018 14:38:34 INFO 140609680480064] Loss (name: value) total: 6.29366907793
[05/23/2018 14:38:34 INFO 140609680480064] Loss (name: value) kld: 0.036844597377
[05/23/2018 14:38:34 INFO 140609680480064] Loss (name: value) recons: 6.25682444853
[05/23/2018 14:38:34 INFO 140609680480064] Loss (name: value) logppx: 6.29366907793
[05/23/2018 14:38:34 INFO 140609680480064] #quality_metric: host=algo-1, epoch=4, train total_loss <loss>=6.29366907793
[05/23/2018 14:38:34 INFO 140609680480064] patience losses:[6.3625331355076211, 6.3203200246773514, 6.3013154478634101] min patience loss:6.30131544786 current loss:6.29366907793 absolute loss difference:0.00764636993408
[05/23/2018 14:38:34 INFO 140609680480064] #progress_metr

[05/23/2018 14:38:50 INFO 140609680480064] # Finished training epoch 8 on 260977 examples from 255 batches, each of size 1024.
[05/23/2018 14:38:50 INFO 140609680480064] Metrics for Training:
[05/23/2018 14:38:50 INFO 140609680480064] Loss (name: value) total: 6.27092050852
[05/23/2018 14:38:50 INFO 140609680480064] Loss (name: value) kld: 0.0512845783841
[05/23/2018 14:38:50 INFO 140609680480064] Loss (name: value) recons: 6.21963592043
[05/23/2018 14:38:50 INFO 140609680480064] Loss (name: value) logppx: 6.27092050852
[05/23/2018 14:38:50 INFO 140609680480064] #quality_metric: host=algo-1, epoch=8, train total_loss <loss>=6.27092050852
[05/23/2018 14:38:50 INFO 140609680480064] patience losses:[6.2903975673750336, 6.2868399975346581, 6.2806985051024196] min patience loss:6.2806985051 current loss:6.27092050852 absolute loss difference:0.00977799658682
[05/23/2018 14:38:50 INFO 140609680480064] #progress_metr

[05/23/2018 14:39:15 INFO 140609680480064] # Finished training epoch 14 on 260977 examples from 255 batches, each of size 1024.
[05/23/2018 14:39:15 INFO 140609680480064] Metrics for Training:
[05/23/2018 14:39:15 INFO 140609680480064] Loss (name: value) total: 6.23776557773
[05/23/2018 14:39:15 INFO 140609680480064] Loss (name: value) kld: 0.08228455817
[05/23/2018 14:39:15 INFO 140609680480064] Loss (name: value) recons: 6.15548101687
[05/23/2018 14:39:15 INFO 140609680480064] Loss (name: value) logppx: 6.23776557773
[05/23/2018 14:39:15 INFO 140609680480064] #quality_metric: host=algo-1, epoch=14, train total_loss <loss>=6.23776557773
[05/23/2018 14:39:15 INFO 140609680480064] patience losses:[6.2451230666216686, 6.2423242868161672, 6.2399860531676046] min patience loss:6.23998605317 current loss:6.23776557773 absolute loss difference:0.00222047543993
[05/23/2018 14:39:15 INFO 140609680480064] #progress_met

[05/23/2018 14:39:31 INFO 140609680480064] # Finished training epoch 18 on 260977 examples from 255 batches, each of size 1024.
[05/23/2018 14:39:31 INFO 140609680480064] Metrics for Training:
[05/23/2018 14:39:31 INFO 140609680480064] Loss (name: value) total: 6.22858788733
[05/23/2018 14:39:31 INFO 140609680480064] Loss (name: value) kld: 0.094045179907
[05/23/2018 14:39:31 INFO 140609680480064] Loss (name: value) recons: 6.13454268399
[05/23/2018 14:39:31 INFO 140609680480064] Loss (name: value) logppx: 6.22858788733
[05/23/2018 14:39:31 INFO 140609680480064] #quality_metric: host=algo-1, epoch=18, train total_loss <loss>=6.22858788733
[05/23/2018 14:39:31 INFO 140609680480064] patience losses:[6.2367464720034134, 6.2347331047058105, 6.233091971453498] min patience loss:6.23309197145 current loss:6.22858788733 absolute loss difference:0.00450408411961
[05/23/2018 14:39:31 INFO 140609680480064] #progress_met

[05/23/2018 14:39:52 INFO 140609680480064] # Finished training epoch 23 on 260977 examples from 255 batches, each of size 1024.
[05/23/2018 14:39:52 INFO 140609680480064] Metrics for Training:
[05/23/2018 14:39:52 INFO 140609680480064] Loss (name: value) total: 6.21752161138
[05/23/2018 14:39:52 INFO 140609680480064] Loss (name: value) kld: 0.110444693881
[05/23/2018 14:39:52 INFO 140609680480064] Loss (name: value) recons: 6.10707690669
[05/23/2018 14:39:52 INFO 140609680480064] Loss (name: value) logppx: 6.21752161138
[05/23/2018 14:39:52 INFO 140609680480064] #quality_metric: host=algo-1, epoch=23, train total_loss <loss>=6.21752161138
[05/23/2018 14:39:52 INFO 140609680480064] patience losses:[6.2234382087109132, 6.2209809284584194, 6.2184555240705901] min patience loss:6.21845552407 current loss:6.21752161138 absolute loss difference:0.000933912688611
[05/23/2018 14:39:52 INFO 140609680480064] Bad epoch: 

[05/23/2018 14:40:16 INFO 140609680480064] # Finished training epoch 29 on 260977 examples from 255 batches, each of size 1024.
[05/23/2018 14:40:16 INFO 140609680480064] Metrics for Training:
[05/23/2018 14:40:16 INFO 140609680480064] Loss (name: value) total: 6.21296051624
[05/23/2018 14:40:16 INFO 140609680480064] Loss (name: value) kld: 0.116266710706
[05/23/2018 14:40:16 INFO 140609680480064] Loss (name: value) recons: 6.09669382993
[05/23/2018 14:40:16 INFO 140609680480064] Loss (name: value) logppx: 6.21296051624
[05/23/2018 14:40:16 INFO 140609680480064] #quality_metric: host=algo-1, epoch=29, train total_loss <loss>=6.21296051624
[05/23/2018 14:40:16 INFO 140609680480064] patience losses:[6.2144057367362224, 6.2144779728908164, 6.2142862301246788] min patience loss:6.21428623012 current loss:6.21296051624 absolute loss difference:0.00132571388693
[05/23/2018 14:40:16 INFO 140609680480064] #progress_me

[05/23/2018 14:40:33 INFO 140609680480064] # Finished training epoch 33 on 260977 examples from 255 batches, each of size 1024.
[05/23/2018 14:40:33 INFO 140609680480064] Metrics for Training:
[05/23/2018 14:40:33 INFO 140609680480064] Loss (name: value) total: 6.21146457523
[05/23/2018 14:40:33 INFO 140609680480064] Loss (name: value) kld: 0.117858771132
[05/23/2018 14:40:33 INFO 140609680480064] Loss (name: value) recons: 6.09360578574
[05/23/2018 14:40:33 INFO 140609680480064] Loss (name: value) logppx: 6.21146457523
[05/23/2018 14:40:33 INFO 140609680480064] #quality_metric: host=algo-1, epoch=33, train total_loss <loss>=6.21146457523
[05/23/2018 14:40:33 INFO 140609680480064] patience losses:[6.2127918841792091, 6.2124407581254548, 6.2124714271695005] min patience loss:6.21244075813 current loss:6.21146457523 absolute loss difference:0.000976182900223
[05/23/2018 14:40:33 INFO 140609680480064] Bad epoch: 
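The log can be read with a little arithmetic. The sketch below uses numbers copied from the epoch 4 and epoch 23 lines and checks two things that the log suggests: that the total loss is the sum of the `kld` and `recons` terms, and that an epoch is flagged as "Bad epoch" when the improvement over the best of the last three losses falls below the logged `tolerance` of 0.001. The comparison rule is our reading of the log output, not a documented description of the algorithm's early-stopping logic.

```python
# Values copied from the epoch 4 and epoch 23 log lines above.
epochs = {
    4: {"total": 6.29366907793, "kld": 0.036844597377, "recons": 6.25682444853,
        "patience_losses": [6.3625331355076211, 6.3203200246773514, 6.3013154478634101]},
    23: {"total": 6.21752161138, "kld": 0.110444693881, "recons": 6.10707690669,
         "patience_losses": [6.2234382087109132, 6.2209809284584194, 6.2184555240705901]},
}
TOLERANCE = 0.001  # 'tolerance' in the logged default configuration


def improvement(entry):
    """Best recent loss minus current loss (the log's 'absolute loss difference')."""
    return min(entry["patience_losses"]) - entry["total"]


status_by_epoch = {}
for epoch, entry in sorted(epochs.items()):
    # How far the logged total is from kld + recons (rounding noise only).
    gap = abs(entry["total"] - (entry["kld"] + entry["recons"]))
    # Assumed rule: an epoch counts as "bad" if it improves by less than the tolerance.
    status = "improving" if improvement(entry) >= TOLERANCE else "bad epoch"
    status_by_epoch[epoch] = status
    print("epoch {:2d}: improvement={:.6f} decomposition gap={:.1e} -> {}".format(
        epoch, improvement(entry), gap, status))
```

Under this reading, epoch 4 still improves comfortably while epoch 23 falls just under the tolerance, which matches the "Bad epoch" line in the log.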

## Inference

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the topic mixture representing a given document or comment.

This is simplified by the deploy function provided by the Amazon SageMaker Python SDK.

In [5]:
ntm_predictor = ntm.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: ntm-2018-05-23-14-49-44-899
INFO:sagemaker:Creating endpoint with name ntm-2018-05-23-14-34-25-654


--------------------------------------------------------------------------!

## Perform Inference

With this real-time endpoint at our fingertips we can finally perform inference on our training and test data.  We should first discuss the meaning of the SageMaker NTM inference output.

For each document we wish to compute its corresponding `topic_weights`. Each set of topic weights is a probability distribution over the topics, of which there are 20 in this example (`num_topics=20`). Each element of the topic weights is the proportion to which the input document is represented by the corresponding topic.

For example, in a simplified model with only five topics, if the topic weights of an input document $\mathbf{w}$ are

$$\theta = \left[ 0.3, 0.2, 0, 0.5, 0 \right]$$

then $\mathbf{w}$ is 30% generated from Topic #1, 20% from Topic #2, and 50% from Topic #4. Below, we compute the topic mixtures for the ten comments in `test_comments.npy`.
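The example vector $\theta$ can be checked in a few lines of NumPy. This is a small illustrative sketch; note that array indices are zero-based while the topic numbers in the text start at 1.

```python
import numpy as np

# The example topic weights from the text.
theta = np.array([0.3, 0.2, 0.0, 0.5, 0.0])

# A valid topic mixture is non-negative and sums to 1.
is_distribution = bool(np.all(theta >= 0) and np.isclose(theta.sum(), 1.0))

# Zero-based index of the dominant topic; add 1 for the "Topic #k" numbering.
dominant = int(np.argmax(theta))

# Topics that contribute to the document at all, using 1-based numbering.
active_topics = [i + 1 for i, weight in enumerate(theta) if weight > 0]

print("valid distribution:", is_distribution)
print("dominant topic: Topic #{}".format(dominant + 1))
print("contributing topics:", active_topics)
```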

First, we set up our serializer and deserializer. The serializer converts NumPy arrays to CSV strings, which we pass in the HTTP POST request to our hosted endpoint, and the deserializer parses the JSON response.
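To show roughly what travels over the wire, the sketch below builds a CSV payload from a small count matrix and parses a JSON body shaped like the endpoint's response. `to_csv_payload` is a hypothetical helper and not the SDK's implementation, and the response string is made up; only its `predictions` / `topic_weights` structure follows the parsing code used later in this notebook.

```python
import io
import json

import numpy as np


def to_csv_payload(array):
    """Render a 1-D or 2-D array as CSV text, one document per line."""
    buffer = io.StringIO()
    np.savetxt(buffer, np.atleast_2d(array), delimiter=",", fmt="%g")
    return buffer.getvalue()


# Two hypothetical documents over a 3-word vocabulary.
docs = np.array([[0, 2, 1], [1, 0, 0]])
payload = to_csv_payload(docs)
print(payload)

# Made-up response body with the same structure the notebook parses.
response_body = '{"predictions": [{"topic_weights": [0.7, 0.3]}, {"topic_weights": [0.1, 0.9]}]}'
weights = [p["topic_weights"] for p in json.loads(response_body)["predictions"]]
print(weights)
```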

In [6]:
from sagemaker.predictor import csv_serializer, json_deserializer
ntm_predictor.content_type = 'text/csv'
ntm_predictor.serializer = csv_serializer
ntm_predictor.deserializer = json_deserializer

In [7]:
import numpy as np
test_comments = np.load('test_comments.npy')

In [8]:
results = ntm_predictor.predict(test_comments)
predictions = np.array([prediction['topic_weights'] for prediction in results['predictions']])
print(predictions[5])

[0.03881462 0.0330111  0.02953554 0.44098815 0.03035911 0.0249091
 0.01516498 0.02849843 0.02895007 0.02496313 0.02657804 0.0277223
 0.02970661 0.0321646  0.03495577 0.03026009 0.0334294  0.02988204
 0.03124016 0.02886663]


In [9]:
np.argmax(predictions, axis=1)

array([3, 3, 0, 3, 3, 3, 3, 3, 3, 3])
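`argmax` keeps only the single strongest topic per document. To see how the remaining weight is spread, we can rank the topics instead. The sketch below uses the weights printed above for `predictions[5]`; with 20 topics a uniform mixture would put 0.05 on each, which gives a rough baseline for what counts as a strong topic.

```python
import numpy as np

# Topic weights printed above for predictions[5].
weights = np.array([
    0.03881462, 0.0330111, 0.02953554, 0.44098815, 0.03035911, 0.0249091,
    0.01516498, 0.02849843, 0.02895007, 0.02496313, 0.02657804, 0.0277223,
    0.02970661, 0.0321646, 0.03495577, 0.03026009, 0.0334294, 0.02988204,
    0.03124016, 0.02886663,
])

# Indices of the three largest weights, strongest first.
top3 = np.argsort(weights)[::-1][:3]

# Topics whose weight exceeds the uniform baseline of 1 / num_topics.
uniform = 1.0 / len(weights)
above_uniform = np.flatnonzero(weights > uniform)

for idx in top3:
    print("topic {:2d}: {:.4f}".format(idx, weights[idx]))
print("topics above uniform baseline:", above_uniform.tolist())
```

For this comment only one topic rises above the uniform baseline, and the rest of the weight is spread thinly across the other topics.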

In [10]:
sagemaker.Session().delete_endpoint(ntm_predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: ntm-2018-05-22-22-03-44-121
