# Amazon Comprehend Topic Modeling 

You can use Amazon Comprehend to examine the content of a collection of documents to determine common themes. For example, you can give Amazon Comprehend a collection of news articles, and it will determine the subjects, such as sports, politics, or entertainment. The text in the documents doesn't need to be annotated.

Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word. The set of words that frequently belong to the same context across the entire document set makes up a topic.

A word is associated with a topic in a document based on how prevalent that topic is in the document and how much affinity the topic has to the word. The same word can be associated with different topics in different documents, depending on the topic distribution in each document.

For example, the word "glucose" in an article that talks predominantly about sports can be assigned to the topic "sports," while the same word in an article about "medicine" will be assigned to the topic "medicine."
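
The "glucose" example can be made concrete with a toy calculation. The sketch below is a simplified illustration of the idea (topic prevalence in the document multiplied by the topic's affinity to the word), not Amazon Comprehend's actual implementation. The function name `assign_topic` and all of the numbers are made up for illustration.

```python
# Illustrative numbers only; they are NOT produced by Amazon Comprehend.
# Affinity of each topic to the word "glucose" (assumed values).
glucose_affinity = {"sports": 0.02, "medicine": 0.03}


def assign_topic(doc_topic_mix, word_affinity):
    """Score each topic as (prevalence in the document) x (affinity to the word).

    Returns the best-scoring topic and the normalized scores.
    """
    scores = {
        topic: doc_topic_mix.get(topic, 0.0) * affinity
        for topic, affinity in word_affinity.items()
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("The word has no affinity to any topic in this document.")
    normalized = {topic: score / total for topic, score in scores.items()}
    best = max(normalized, key=normalized.get)
    return best, normalized


# Assumed topic distributions for two documents.
sports_article = {"sports": 0.9, "medicine": 0.1}
medical_article = {"sports": 0.1, "medicine": 0.9}

print(assign_topic(sports_article, glucose_affinity)[0])
print(assign_topic(medical_article, glucose_affinity)[0])
```

Even though "glucose" has a slightly higher assumed affinity to "medicine," the dominant topic of the sports article outweighs it, so the word is assigned to "sports" there.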

Each word associated with a topic is given a weight that indicates how much the word helps define the topic. The weight is an indication of how many times the word occurs in the topic compared to other words in the topic, across the entire document set.

For the most accurate results, give Amazon Comprehend the largest possible corpus to work with. In particular:

* Use at least 1,000 documents in each topic modeling job.
* Make sure each document is at least 3 sentences long.
* Remove any document that consists mostly of numeric data from the corpus.
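
The guidelines above can be applied as a pre-processing step before uploading a corpus. The following is a minimal sketch using rough heuristics; the function names, the sentence-splitting rule, and the 0.5 numeric-ratio threshold are assumptions chosen for illustration, not rules defined by Amazon Comprehend.

```python
import re


def count_sentences(text):
    """Rough sentence count: split on ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text)
    return len([p for p in parts if p.strip()])


def numeric_ratio(text):
    """Fraction of whitespace-separated tokens that look like numbers."""
    tokens = text.split()
    if not tokens:
        return 1.0  # treat an empty document as unusable
    numeric = sum(1 for t in tokens if re.fullmatch(r"[-+]?\d[\d.,eE+-]*", t))
    return numeric / len(tokens)


def filter_corpus(docs, min_sentences=3, max_numeric_ratio=0.5):
    """Keep documents that are long enough and not mostly numeric (assumed thresholds)."""
    return [
        d for d in docs
        if count_sentences(d) >= min_sentences and numeric_ratio(d) <= max_numeric_ratio
    ]


docs = [
    "The team won. The crowd cheered. The coach smiled.",
    "1.5 2.0 -3.1e-01 4",
    "Too short.",
]
kept = filter_corpus(docs)
print(len(kept), "of", len(docs), "documents kept")
```

After filtering, check that `len(kept)` is still at least 1,000 before starting a job.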

Topic modeling is an asynchronous process. You submit your list of documents to Amazon Comprehend from an Amazon S3 bucket using the StartTopicsDetectionJob operation. The response is sent to an Amazon S3 bucket. You can configure both the input and output buckets. Get a list of the topic modeling jobs that you have submitted using the ListTopicsDetectionJobs operation and view information about a job using the DescribeTopicsDetectionJob operation. Content delivered to Amazon S3 buckets might contain customer content.
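
As a minimal sketch of the listing step, the helper below calls ListTopicsDetectionJobs through the boto3 `list_topics_detection_jobs` method and follows `NextToken` until all pages are read. The helper name `list_topic_jobs` and its parameters are hypothetical; it takes the Comprehend client as an argument so that it can be used with any session.

```python
def list_topic_jobs(comprehend, status=None, max_results=50):
    """Return (JobName, JobId, JobStatus) tuples for submitted topic modeling jobs.

    `comprehend` is a boto3 Comprehend client, e.g. boto3.client("comprehend").
    `status` optionally filters by job status, e.g. "COMPLETED".
    """
    kwargs = {"MaxResults": max_results}
    if status:
        kwargs["Filter"] = {"JobStatus": status}

    jobs = []
    while True:
        page = comprehend.list_topics_detection_jobs(**kwargs)
        for props in page.get("TopicsDetectionJobPropertiesList", []):
            jobs.append((props.get("JobName"), props["JobId"], props["JobStatus"]))
        token = page.get("NextToken")
        if not token:
            return jobs
        # Request the next page of results.
        kwargs["NextToken"] = token
```

For example, `list_topic_jobs(boto3.client("comprehend"), status="COMPLETED")` would return the completed jobs in the current region, assuming the caller has permission to list them.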


This lab includes step-by-step instructions for topic modeling using Amazon Comprehend.

## Setup

Let's start by specifying:

* The AWS Region.
* The IAM role ARN that grants access to the Comprehend API and the S3 bucket.
* The S3 bucket that you want to use for the input data and the job output.


In [8]:

import os
import boto3
import re
import json
import sagemaker
from sagemaker import get_execution_role

region = boto3.Session().region_name

role = get_execution_role()

bucket = sagemaker.Session().default_bucket()

In [9]:
# Customize the prefix to the location in your bucket where you store the data
prefix = "sagemaker/topic-modeling"
bucketuri = f"s3://{bucket}/{prefix}"
print(bucketuri)

s3://sagemaker-us-east-1-340280328827/sagemaker/topic-modeling


## Data
Let's start by downloading the sample dataset so that it can be uploaded to the S3 bucket. The sample dataset is the Human Activity Recognition Using Smartphones Data Set.
According to the dataset description, the experiments were carried out with a group of 30 volunteers aged 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a smartphone (Samsung Galaxy S II) on the waist. Using the phone's embedded accelerometer and gyroscope, the dataset authors captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. The experiments were video-recorded so that the data could be labeled manually. The resulting dataset was randomly partitioned into two sets: 70% of the volunteers were selected to generate the training data and 30% to generate the test data.

We will perform topic modeling for these six activities.

More details about the dataset: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Now let's download the dataset and read it into a pandas DataFrame.


In [23]:
# Download the data set

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip
!apt-get install unzip -y
!unzip -o "UCI HAR Dataset.zip"

--2021-08-04 21:40:41--  https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60999314 (58M) [application/x-httpd-php]
Saving to: ‘UCI HAR Dataset.zip’


2021-08-04 21:40:45 (18.7 MB/s) - ‘UCI HAR Dataset.zip’ saved [60999314/60999314]

Reading package lists... Done
Building dependency tree       
Reading state information... Done
unzip is already the newest version (6.0-23+deb10u1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.
Archive:  UCI HAR Dataset.zip
   creating: UCI HAR Dataset/
  inflating: UCI HAR Dataset/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/UCI HAR Dataset/
  inflating: __MACOSX/UCI HAR Dataset/._.DS_Store  
  inflating: UCI HAR Dataset/activity_labels.txt  
  inflating: __MACOSX/UCI HAR Dat

In [None]:
# Download the Amazon Comprehend tutorial reviews data set

!wget https://docs.aws.amazon.com/comprehend/latest/dg/samples/tutorial-reviews-data.zip
!apt-get install unzip -y
!unzip -o tutorial-reviews-data.zip



In [11]:
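import os
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd

# data = pd.read_csv('./amazon-reviews.csv')
# X_test.txt has no header row, so pass header=None to keep the first sample as data.
read_file = pd.read_csv('./UCI HAR Dataset/test/X_test.txt', header=None)
# to_csv does not create missing directories, so create the output folder first.
os.makedirs('./Test', exist_ok=True)
# to_csv returns None when given a path, so its result is not assigned.
read_file.to_csv('./Test/human_activity_data.csv', index=False, header=False)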


## Asynchronous Batch Processing using StartTopicsDetectionJob

To analyze large documents and large collections of documents, use one of the Amazon Comprehend asynchronous operations. There is an asynchronous version of each of the Amazon Comprehend operations and an additional set of operations for topic modeling.

To analyze a collection of documents, you typically perform the following steps:

   * Store the documents in an Amazon S3 bucket.

   * Start one or more jobs to analyze the documents.

   * Monitor the progress of an analysis job.

   * Retrieve the results of the analysis from an S3 bucket when the job is complete.

The following sections describe using the Amazon Comprehend API to run asynchronous operations. 

We will use the following API:

StartTopicsDetectionJob: Start a job to detect the topics in a collection of documents.

In [12]:
import boto3
s3 = boto3.resource('s3')

# Upload under the same key that the job's input S3 URI points to.
s3.Bucket(bucket).upload_file("./Test/human_activity_data.csv", prefix + "/HumanActivityRecognition-Test.csv")
comprehend = boto3.client('comprehend')

In [13]:
import uuid

# Give the job a unique name so repeated runs do not collide.
job_uuid = uuid.uuid1()
job_name = f"topicmodeling-job-{job_uuid}"
inputs3uri = bucketuri + "/HumanActivityRecognition-Test.csv"

asyncresponse = comprehend.start_topics_detection_job(
    InputDataConfig={
        'S3Uri': inputs3uri,
        'InputFormat': 'ONE_DOC_PER_LINE'   # each line of the file is one document
    },
    OutputDataConfig={
        'S3Uri': bucketuri
    },
    DataAccessRoleArn=role,
    JobName=job_name,
    NumberOfTopics=6                        # one topic per activity
)

In [14]:
events_job_id = asyncresponse['JobId']
job = comprehend.describe_topics_detection_job(JobId=events_job_id)
print(job)

{'TopicsDetectionJobProperties': {'JobId': 'c17013807e1b5807b0c46a5335f2dc51', 'JobName': 'topicmodeling-job-6eae0d9c-f56f-11eb-ac13-d558d32189d7', 'JobStatus': 'IN_PROGRESS', 'SubmitTime': datetime.datetime(2021, 8, 4, 22, 0, 51, 260000, tzinfo=tzlocal()), 'InputDataConfig': {'S3Uri': 's3://sagemaker-us-east-1-340280328827/sagemaker/topic-modeling/HumanActivityRecognition-Test.csv', 'InputFormat': 'ONE_DOC_PER_LINE'}, 'OutputDataConfig': {'S3Uri': 's3://sagemaker-us-east-1-340280328827/sagemaker/topic-modeling/340280328827-TOPICS-c17013807e1b5807b0c46a5335f2dc51/output/output.tar.gz'}, 'NumberOfTopics': 6, 'DataAccessRoleArn': 'arn:aws:iam::340280328827:role/SagemakerFullAccessPolicy'}, 'ResponseMetadata': {'RequestId': '178eae27-9e72-45fb-a592-6eeeac5236ba', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '178eae27-9e72-45fb-a592-6eeeac5236ba', 'content-type': 'application/x-amz-json-1.1', 'content-length': '625', 'date': 'Wed, 04 Aug 2021 22:00:52 GMT'}, 'RetryAttempts': 

In [None]:
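from time import sleep
# Get current job status
job = comprehend.describe_topics_detection_job(JobId=events_job_id)
print(job)
# Loop until the job reaches a terminal state or the timeout expires
waited = 0
timeout_minutes = 40
status = job['TopicsDetectionJobProperties']['JobStatus']
while status not in ('COMPLETED', 'FAILED', 'STOPPED'):
    if waited // 60 >= timeout_minutes:
        raise TimeoutError("Job timed out after %d seconds." % waited)
    sleep(60)
    waited += 60
    job = comprehend.describe_topics_detection_job(JobId=events_job_id)
    status = job['TopicsDetectionJobProperties']['JobStatus']
# Stop here if the job did not finish successfully
if status != 'COMPLETED':
    raise RuntimeError("Job ended with status %s." % status)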

{'TopicsDetectionJobProperties': {'JobId': 'c17013807e1b5807b0c46a5335f2dc51', 'JobName': 'topicmodeling-job-6eae0d9c-f56f-11eb-ac13-d558d32189d7', 'JobStatus': 'IN_PROGRESS', 'SubmitTime': datetime.datetime(2021, 8, 4, 22, 0, 51, 260000, tzinfo=tzlocal()), 'InputDataConfig': {'S3Uri': 's3://sagemaker-us-east-1-340280328827/sagemaker/topic-modeling/HumanActivityRecognition-Test.csv', 'InputFormat': 'ONE_DOC_PER_LINE'}, 'OutputDataConfig': {'S3Uri': 's3://sagemaker-us-east-1-340280328827/sagemaker/topic-modeling/340280328827-TOPICS-c17013807e1b5807b0c46a5335f2dc51/output/output.tar.gz'}, 'NumberOfTopics': 6, 'DataAccessRoleArn': 'arn:aws:iam::340280328827:role/SagemakerFullAccessPolicy'}, 'ResponseMetadata': {'RequestId': 'c11fde40-9d4f-4536-8892-8254057d0c8f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'c11fde40-9d4f-4536-8892-8254057d0c8f', 'content-type': 'application/x-amz-json-1.1', 'content-length': '625', 'date': 'Wed, 04 Aug 2021 22:00:56 GMT'}, 'RetryAttempts': 

The job takes roughly 6-8 minutes to complete, after which you can download the output from the output location that you specified in the job parameters. You can also open Amazon Comprehend in the AWS console and check the job details there. The asynchronous method is useful when you have many documents and want to analyze them as a single batch.
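
Once the job has completed, the output archive can be inspected locally. The sketch below assumes that `output.tar.gz` contains a `topic-terms.csv` file with the columns `topic`, `term`, and `weight`, which is how the Amazon Comprehend documentation describes the topic modeling output; verify this against your own archive. The helper names `split_s3_uri` and `read_topic_terms` are hypothetical.

```python
import csv
import io
import tarfile
from collections import defaultdict
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/path' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("Not an S3 URI: %s" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


def read_topic_terms(archive_path):
    """Return {topic: [(term, weight), ...]} sorted by descending weight.

    Assumes the archive holds topic-terms.csv with columns topic, term, weight.
    """
    topics = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as tar:
        member = next(m for m in tar.getmembers() if m.name.endswith("topic-terms.csv"))
        text = tar.extractfile(member).read().decode("utf-8")
    for row in csv.DictReader(io.StringIO(text)):
        topics[row["topic"]].append((row["term"], float(row["weight"])))
    return {
        topic: sorted(terms, key=lambda tw: tw[1], reverse=True)
        for topic, terms in topics.items()
    }
```

To use it, take the output URI from `job['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']`, split it with `split_s3_uri`, download the object with `boto3.client('s3').download_file(bucket_name, key, 'output.tar.gz')`, and then call `read_topic_terms('output.tar.gz')`. The terms with the highest weights are the ones that best characterize each topic.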

