# Introduction
***

Amazon SageMaker NTM (Neural Topic Model) is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. NTM is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.
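
To make the "mixture of topics" idea concrete, the following toy sketch combines two hand-made topics into a word distribution for one document. The words, topic names, and probabilities are invented for illustration, and the linear mixing shown here is only a conceptual picture; it is not how NTM computes its output internally.

```python
# Toy illustration of a document as a mixture of topics.
# All words, topic names, and probabilities are made up for this example.

# Each topic is a probability distribution over the vocabulary.
topics = {
    "sports": {"game": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"game": 0.1, "team": 0.1, "vote": 0.8},
}

# Each document is described by topic weights that sum to 1.
doc_topic_weights = {"sports": 0.25, "politics": 0.75}


def word_distribution(topic_weights, topics):
    """Return the word distribution implied by a document's topic mixture."""
    words = {}
    for topic, weight in topic_weights.items():
        for word, prob in topics[topic].items():
            words[word] = words.get(word, 0.0) + weight * prob
    return words


doc_words = word_distribution(doc_topic_weights, topics)
print(doc_words)
```

Because the document leans towards the "politics" topic, "vote" ends up as its most probable word.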

In this notebook we use the Amazon SageMaker NTM algorithm to train a model on bag-of-words data extracted from NY Times comments. We then use this model to perform inference on the data, that is, to predict the topic mixture of a document. The main goals of this notebook are to:

* create an Amazon SageMaker training job on a data set to produce an NTM model,
* use the model to perform inference with an Amazon SageMaker endpoint.

In [1]:
import os
import sagemaker
import boto3
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

role = get_execution_role()

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-228889150161


## Training

Once the data is preprocessed and available in a recommended format, the next step is to train our model on the data. The NTM algorithm requires a number of parameters to configure the model and define the computational environment in which training will take place. The first of these points to a container image that holds the algorithm's training and hosting code.

In [2]:
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/ntm:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/ntm:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/ntm:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/ntm:latest'}
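
The dictionary above lists only four regions, so a direct lookup fails with a bare `KeyError` anywhere else. A minimal sketch of a more descriptive lookup follows; `resolve_ntm_image` is a hypothetical helper written for this example, not part of the SageMaker SDK, and the mapping is repeated so the snippet runs on its own.

```python
def resolve_ntm_image(containers, region):
    """Return the NTM image URI for `region`, or raise a descriptive error."""
    try:
        return containers[region]
    except KeyError:
        supported = ", ".join(sorted(containers))
        raise ValueError(
            "No NTM image listed for region {!r}; listed regions: {}".format(region, supported)
        )


# Same mapping as the cell above, repeated so the sketch is self-contained.
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/ntm:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/ntm:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/ntm:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/ntm:latest'}

print(resolve_ntm_image(containers, 'us-east-1'))
```

In the notebook itself the region would come from `boto3.Session().region_name`, as in the estimator cell.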

An NTM model uses the following hyperparameters:

- **num_topics** - The number of topics or categories in the NTM model. 
- **feature_dim** - The size of the "vocabulary". In this case it is 5000, matching the vocabulary size used by the nytimes pyspark data prep.

In addition to these NTM model hyperparameters, we provide additional parameters defining things like the EC2 instance type on which training will run, the S3 bucket containing the data, and the AWS access role.

> Note: Try adjusting the mini_batch_size if running on a GPU. 

In [3]:
num_topics = 20
vocabulary_size = 5000
output = 's3://{}/data/nytimes-model/sagemaker-ntm'.format(sagemaker_session.default_bucket())

ntm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.p3.2xlarge',
                                    output_path=output,
                                    sagemaker_session=sagemaker_session)

ntm.set_hyperparameters(num_topics=num_topics,
                        feature_dim=vocabulary_size,
                        mini_batch_size=1024)
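
The `mini_batch_size` chosen above determines how many batches make up one epoch. The sketch below does that arithmetic for the 246918 training examples reported in the training log in this notebook, assuming a final partial batch still counts as a batch (which agrees with the 242 batches the log reports for a batch size of 1024). `batches_per_epoch` is a helper name made up for this example.

```python
import math


def batches_per_epoch(num_examples, mini_batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / mini_batch_size)


num_examples = 246918  # taken from the training log in this notebook

for size in (256, 1024, 4096):
    print(size, batches_per_epoch(num_examples, size))
```

Larger batches mean fewer parameter updates per epoch, which is one reason the batch size is worth revisiting when moving between CPU and GPU instances.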

We'll train against the bag-of-words extracted from the NY Times comments.

In [4]:
import boto3
s3_client = boto3.client('s3')
objects = s3_client.list_objects(Bucket=bucket, Prefix='data/nyt-record-io/training.rec')
training_key = objects['Contents'][0]['Key']
training_input = 's3://{}/{}'.format(bucket, training_key)
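
The cell above indexes `objects['Contents'][0]` directly, which raises a `KeyError` when nothing matches the prefix, because `list_objects` omits the `Contents` key in that case. A hedged sketch of a guarded version follows. `first_object_uri` and the bucket name `example-bucket` are hypothetical, and the response dictionaries are hand-written stand-ins rather than real S3 output.

```python
def first_object_uri(bucket, response):
    """Build an s3:// URI from the first key in a list_objects response."""
    contents = response.get('Contents')
    if not contents:
        raise FileNotFoundError(
            'No objects found in bucket {!r} for the given prefix'.format(bucket)
        )
    return 's3://{}/{}'.format(bucket, contents[0]['Key'])


# Hand-written stand-ins for list_objects responses (not real S3 output).
found = {'Contents': [{'Key': 'data/nyt-record-io/training.rec'}]}
empty = {}

print(first_object_uri('example-bucket', found))

try:
    first_object_uri('example-bucket', empty)
except FileNotFoundError as err:
    missing_message = str(err)
    print(missing_message)
```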

In [5]:
ntm.fit({'train': training_input})

INFO:sagemaker:Creating training-job with name: ntm-2018-06-03-17-15-41-948


......................
Docker entrypoint called with argument(s): train
[06/03/2018 17:19:11 INFO 139893340792640] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'epochs': u'50', u'weight_decay': u'0.0', u'_num_kv_servers': u'auto', u'encoder_layers_activation': u'sigmoid', u'mini_batch_size': u'256', u'tolerance': u'0.001', u'batch_norm': u'false'}
[06/03/2018 17:19:11 INFO 139893340792640] Reading provided configuration from /opt/ml/input/config/hyperparameters.json: {u'feature_dim': u'5000', u'mini_batch_size': u'1024', u'num_topics': u'20'}
[06/03/2018 17:19:11 INFO 139893340792640] Final configuration: {u'num_patience_epo

[06/03/2018 17:19:43 INFO 139893340792640] # Finished training epoch 5 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:19:43 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:19:43 INFO 139893340792640] Loss (name: value) total: 7.67566204317
[06/03/2018 17:19:43 INFO 139893340792640] Loss (name: value) kld: 0.0193515777624
[06/03/2018 17:19:43 INFO 139893340792640] Loss (name: value) recons: 7.65631049256
[06/03/2018 17:19:43 INFO 139893340792640] Loss (name: value) logppx: 7.67566204317
[06/03/2018 17:19:43 INFO 139893340792640] #quality_metric: host=algo-1, epoch=5, train total_loss <loss>=7.67566204317
[06/03/2018 17:19:43 INFO 139893340792640] patience losses:[7.700584696097807, 7.6894773175893736, 7.6823241252544499] min patience loss:7.68232412525 current loss:7.67566204317 absolute loss difference:0.006662082081
[06/03/2018 17:19:43 INFO 139893340792640] #progress_metric

[06/03/2018 17:20:05 INFO 139893340792640] # Finished training epoch 10 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:20:05 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:20:05 INFO 139893340792640] Loss (name: value) total: 7.66044490175
[06/03/2018 17:20:05 INFO 139893340792640] Loss (name: value) kld: 0.0243345500366
[06/03/2018 17:20:05 INFO 139893340792640] Loss (name: value) recons: 7.63611034347
[06/03/2018 17:20:05 INFO 139893340792640] Loss (name: value) logppx: 7.66044490175
[06/03/2018 17:20:05 INFO 139893340792640] #quality_metric: host=algo-1, epoch=10, train total_loss <loss>=7.66044490175
[06/03/2018 17:20:05 INFO 139893340792640] patience losses:[7.6688663466902804, 7.6654402012667378, 7.6629434779655838] min patience loss:7.66294347797 current loss:7.66044490175 absolute loss difference:0.00249857621745
[06/03/2018 17:20:05 INFO 139893340792640] #progress_m

[06/03/2018 17:20:36 INFO 139893340792640] # Finished training epoch 17 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:20:36 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:20:36 INFO 139893340792640] Loss (name: value) total: 7.63342382248
[06/03/2018 17:20:36 INFO 139893340792640] Loss (name: value) kld: 0.0447227580007
[06/03/2018 17:20:36 INFO 139893340792640] Loss (name: value) recons: 7.58870105335
[06/03/2018 17:20:36 INFO 139893340792640] Loss (name: value) logppx: 7.63342382248
[06/03/2018 17:20:36 INFO 139893340792640] #quality_metric: host=algo-1, epoch=17, train total_loss <loss>=7.63342382248
[06/03/2018 17:20:36 INFO 139893340792640] patience losses:[7.6516125813003413, 7.6491427731908059, 7.6419481484850573] min patience loss:7.64194814849 current loss:7.63342382248 absolute loss difference:0.00852432600723
[06/03/2018 17:20:36 INFO 139893340792640] #progress_m

[06/03/2018 17:21:02 INFO 139893340792640] # Finished training epoch 23 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:21:02 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:21:02 INFO 139893340792640] Loss (name: value) total: 7.60737993082
[06/03/2018 17:21:02 INFO 139893340792640] Loss (name: value) kld: 0.0674365092885
[06/03/2018 17:21:02 INFO 139893340792640] Loss (name: value) recons: 7.53994339384
[06/03/2018 17:21:02 INFO 139893340792640] Loss (name: value) logppx: 7.60737993082
[06/03/2018 17:21:02 INFO 139893340792640] #quality_metric: host=algo-1, epoch=23, train total_loss <loss>=7.60737993082
[06/03/2018 17:21:02 INFO 139893340792640] patience losses:[7.6205358758938218, 7.6163235232849749, 7.6129691521983505] min patience loss:7.6129691522 current loss:7.60737993082 absolute loss difference:0.00558922138096
[06/03/2018 17:21:02 INFO 139893340792640] #progress_me

[06/03/2018 17:21:24 INFO 139893340792640] # Finished training epoch 28 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:21:24 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:21:24 INFO 139893340792640] Loss (name: value) total: 7.58106571387
[06/03/2018 17:21:24 INFO 139893340792640] Loss (name: value) kld: 0.0874590917783
[06/03/2018 17:21:24 INFO 139893340792640] Loss (name: value) recons: 7.49360664965
[06/03/2018 17:21:24 INFO 139893340792640] Loss (name: value) logppx: 7.58106571387
[06/03/2018 17:21:24 INFO 139893340792640] #quality_metric: host=algo-1, epoch=28, train total_loss <loss>=7.58106571387
[06/03/2018 17:21:24 INFO 139893340792640] patience losses:[7.5945136697331739, 7.5889624108460323, 7.5849436028929782] min patience loss:7.58494360289 current loss:7.58106571387 absolute loss difference:0.0038778890263
[06/03/2018 17:21:24 INFO 139893340792640] #progress_me

[06/03/2018 17:21:46 INFO 139893340792640] # Finished training epoch 33 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:21:46 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:21:46 INFO 139893340792640] Loss (name: value) total: 7.5643788682
[06/03/2018 17:21:46 INFO 139893340792640] Loss (name: value) kld: 0.103331442568
[06/03/2018 17:21:46 INFO 139893340792640] Loss (name: value) recons: 7.461047392
[06/03/2018 17:21:46 INFO 139893340792640] Loss (name: value) logppx: 7.5643788682
[06/03/2018 17:21:46 INFO 139893340792640] #quality_metric: host=algo-1, epoch=33, train total_loss <loss>=7.5643788682
[06/03/2018 17:21:46 INFO 139893340792640] patience losses:[7.5749946773544812, 7.5722230160531918, 7.5680219831545488] min patience loss:7.56802198315 current loss:7.5643788682 absolute loss difference:0.00364311495103
[06/03/2018 17:21:46 INFO 139893340792640] #progress_metric: 

[06/03/2018 17:22:09 INFO 139893340792640] # Finished training epoch 38 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:22:09 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:22:09 INFO 139893340792640] Loss (name: value) total: 7.55659563187
[06/03/2018 17:22:09 INFO 139893340792640] Loss (name: value) kld: 0.111448525143
[06/03/2018 17:22:09 INFO 139893340792640] Loss (name: value) recons: 7.4451471045
[06/03/2018 17:22:09 INFO 139893340792640] Loss (name: value) logppx: 7.55659563187
[06/03/2018 17:22:09 INFO 139893340792640] #quality_metric: host=algo-1, epoch=38, train total_loss <loss>=7.55659563187
[06/03/2018 17:22:09 INFO 139893340792640] patience losses:[7.5608535918815081, 7.5597764554102556, 7.5585755109786987] min patience loss:7.55857551098 current loss:7.55659563187 absolute loss difference:0.0019798791113
[06/03/2018 17:22:09 INFO 139893340792640] #progress_metr

[06/03/2018 17:22:35 INFO 139893340792640] # Finished training epoch 44 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:22:35 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:22:35 INFO 139893340792640] Loss (name: value) total: 7.55175155672
[06/03/2018 17:22:35 INFO 139893340792640] Loss (name: value) kld: 0.116382418247
[06/03/2018 17:22:35 INFO 139893340792640] Loss (name: value) recons: 7.43536913567
[06/03/2018 17:22:35 INFO 139893340792640] Loss (name: value) logppx: 7.55175155672
[06/03/2018 17:22:35 INFO 139893340792640] #quality_metric: host=algo-1, epoch=44, train total_loss <loss>=7.55175155672
[06/03/2018 17:22:35 INFO 139893340792640] patience losses:[7.5537119751626793, 7.5532537127329302, 7.5531142041210302] min patience loss:7.55311420412 current loss:7.55175155672 absolute loss difference:0.00136264739943
[06/03/2018 17:22:35 INFO 139893340792640] #progress_me

[06/03/2018 17:22:57 INFO 139893340792640] # Finished training epoch 49 on 246918 examples from 242 batches, each of size 1024.
[06/03/2018 17:22:57 INFO 139893340792640] Metrics for Training:
[06/03/2018 17:22:57 INFO 139893340792640] Loss (name: value) total: 7.54704588233
[06/03/2018 17:22:57 INFO 139893340792640] Loss (name: value) kld: 0.122145494954
[06/03/2018 17:22:57 INFO 139893340792640] Loss (name: value) recons: 7.42490036773
[06/03/2018 17:22:57 INFO 139893340792640] Loss (name: value) logppx: 7.54704588233
[06/03/2018 17:22:57 INFO 139893340792640] #quality_metric: host=algo-1, epoch=49, train total_loss <loss>=7.54704588233
[06/03/2018 17:22:57 INFO 139893340792640] patience losses:[7.5505730593500058, 7.5500576570014326, 7.5481454156154442] min patience loss:7.54814541562 current loss:7.54704588233 absolute loss difference:0.00109953328598
[06/03/2018 17:22:57 INFO 139893340792640] #progress_me
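
The numbers in the log can be checked by hand. The sketch below uses the epoch 49 values to show two things: the total loss is, up to rounding, the sum of the `kld` and `recons` terms, and the logged "absolute loss difference" equals the smallest patience loss minus the current loss. The early-stopping rule in `should_stop` is an assumption inferred from the log wording and the default `tolerance` of 0.001 and `num_patience_epochs` of 3 shown in the configuration; it is not taken from the algorithm's source.

```python
def loss_improvement(patience_losses, current_loss):
    """Difference between the best recent loss and the current loss."""
    return min(patience_losses) - current_loss


def should_stop(patience_losses, current_loss, tolerance):
    """Assumed rule: stop once the loss changes by less than `tolerance`."""
    return abs(loss_improvement(patience_losses, current_loss)) < tolerance


# Values copied from the epoch 49 lines of the training log.
patience_losses = [7.5505730593500058, 7.5500576570014326, 7.5481454156154442]
current_loss = 7.54704588233
kld = 0.122145494954
recons = 7.42490036773
tolerance = 0.001  # default value shown in the logged configuration

print(loss_improvement(patience_losses, current_loss))
print(should_stop(patience_losses, current_loss, tolerance))
print(abs((kld + recons) - current_loss) < 1e-6)
```

Under this reading, the improvement at epoch 49 is still slightly above the tolerance, which is consistent with training running through its final configured epoch rather than stopping early.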

## Inference

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the topic mixture representing a given document or comment.

This is simplified by the deploy function provided by the Amazon SageMaker Python SDK.

In [None]:
ntm_predictor = ntm.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge',
                           endpoint_name='ntm-nyt')

INFO:sagemaker:Creating model with name: ntm-2018-06-03-17-25-37-392
INFO:sagemaker:Creating endpoint with name ntm-nyt


--------------------------------
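
Once the endpoint is up, each prediction describes a document as a set of topic weights. The sketch below picks the dominant topic per document from a response. It assumes the JSON layout described in the SageMaker NTM documentation (a `predictions` list whose entries carry `topic_weights`); the response shown is hand-written with three illustrative weights per document, not output from this model, and `dominant_topics` is a hypothetical helper.

```python
# Hand-written example response; the weights are illustrative, not model output.
response = {
    "predictions": [
        {"topic_weights": [0.1, 0.7, 0.2]},
        {"topic_weights": [0.6, 0.3, 0.1]},
    ]
}


def dominant_topics(response):
    """Return (topic_index, weight) for the strongest topic of each document."""
    result = []
    for prediction in response["predictions"]:
        weights = prediction["topic_weights"]
        best = max(range(len(weights)), key=weights.__getitem__)
        result.append((best, weights[best]))
    return result


print(dominant_topics(response))  # [(1, 0.7), (0, 0.6)]
```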

## Model Exploration 

This next section is based on ["An Introduction to SageMaker Neural Topic Model"](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/ntm_20newsgroups_topic_modeling/ntm_20newsgroups_topic_model.ipynb). While this section is not required for model deployment, it does offer some explanation of the topics the model has learned.

In [None]:
!pip install mxnet 
import mxnet as mx

In [None]:
model_path = os.path.join('data/nytimes-model/sagemaker-ntm', ntm._current_job_name, 'output/model.tar.gz')
boto3.resource('s3').Bucket(bucket).download_file(model_path, 'downloaded_model.tar.gz')
!tar -xzvf 'downloaded_model.tar.gz'
!unzip -o model_algo-1

In [None]:
model = mx.ndarray.load('params')
W = model['arg:projection_weight']

In [None]:
!pip install wordcloud
import wordcloud as wc

In [None]:
import boto3
import json

s3 = boto3.resource('s3')
# Use the session's default bucket (defined in the first cell) instead of a hard-coded name.
obj = s3.Object(bucket, 'data/nyt-features/vocab.json')
obj.download_file('vocab.json')

def load_vocab():
    with open('vocab.json', 'r') as json_file:
        return json.load(json_file)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

word_to_id = load_vocab()

limit = 24
n_col = 4
counter = 0

plt.figure(figsize=(20,16))
for ind in range(num_topics):

    if counter >= limit:
        break

    title_str = 'Topic{}'.format(ind)

    # Softmax over this topic's projection weights gives a distribution over the vocabulary.
    pvals = mx.nd.softmax(mx.nd.array(W[:, ind])).asnumpy()

    # Map each vocabulary word to its probability under this topic.
    word_freq = dict()
    for k in word_to_id.keys():
        i = word_to_id[k]
        word_freq[k] = pvals[i]

    wordcloud = wc.WordCloud(background_color='white').fit_words(word_freq)

    plt.subplot(limit // n_col, n_col, counter+1)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(title_str)

    counter += 1
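
The softmax step used for the word clouds can also be sketched without MXNet, which makes it easy to list the top words of a topic as plain text. The vocabulary and weight column below are toy values invented for this example, and `softmax` and `top_words` are hypothetical helpers; in the notebook the column would be `W[:, ind]` and the vocabulary would come from `load_vocab()`.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_words(column, word_to_id, k=3):
    """Return the k most probable (word, probability) pairs for one topic."""
    pvals = softmax(column)
    ranked = sorted(word_to_id, key=lambda w: pvals[word_to_id[w]], reverse=True)
    return [(w, pvals[word_to_id[w]]) for w in ranked[:k]]


# Toy vocabulary and weight column (illustrative values only).
word_to_id = {"game": 0, "team": 1, "vote": 2, "court": 3}
column = [2.0, 1.0, -1.0, 0.0]

print(top_words(column, word_to_id, k=2))
```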
