---
## Topic Modeling Pipeline with AWS Comprehend

This Jupyter notebook implements a pipeline for topic modeling using AWS Comprehend. The pipeline takes as input a file containing interview questions and responses, and generates topics and corresponding keywords using the Comprehend topic modeling API.

The pipeline consists of the following steps:

1. __Data Preprocessing__: The input file is read and preprocessed to remove any unwanted characters or formatting. This is done using Python's pandas and re libraries.

2. __Calling AWS Comprehend Topic Modeling API__: The preprocessed data is then passed to the Comprehend topic modeling API using the AWS SDK for Python (boto3). The API generates topics and corresponding keywords based on the input data.

3. __Extracting Results__: The generated results are then extracted and stored in a pandas dataframe for further sentiment analysis.

This pipeline can be customized to fit different types of input data and analysis requirements.

---

### Set up environment:

In [1]:
#!pip install sagemaker ipywidgets --upgrade --quiet

In [17]:
import time
import boto3
from botocore.exceptions import ClientError
import requests

In [18]:
comprehend = boto3.client('comprehend', region_name='us-east-1')

In [19]:
import pandas as pd
import os

### Read input file with interview questions and responses

In [20]:
BUCKET='cnatest' # Or whatever you called your bucket
data_key = 'transcript_with_mapped_questions_and_answers.csv' # Where the file is within your bucket
data_location = 's3://{}/{}'.format(BUCKET, data_key)
df = pd.read_csv(data_location)

In [21]:
# Transformed review, input for Comprehend
LOCAL_TRANSFORMED_INTERVIEW = os.path.join('data', 'Transformed.txt')
S3_OUT = 's3://' + BUCKET + '/out/' + 'Transformed.txt'

# Final dataframe where topics and sentiments are going to be joined
S3_TOPICS = 's3://' + BUCKET + '/out/' + 'FinalDataframe.csv'

df.head()

Unnamed: 0,responce_to_question,text
0,As a national savings for the healthcare syste...,if this is running correctly? And we get buy i...
1,Fees will be paid to GPS based on the health r...,start first. What I used to hear a lot of is t...
2,From your point of view What is Healthier SG,I think healthier S. G. Is a whole rethinking ...
3,If there's one thing that you could you know c...,for me I think information to unify all the in...
4,Should the typical patient be concerned about ...,over time. We will see a general increase in c...


### Cleaning responses

In [22]:
def clean_text(df):
    """Preprocessing review text.
    The text becomes Comprehend compatible as a result.
    This is the most important preprocessing step.
    """
    # Encode and decode reviews
    df['text'] = df['text'].str.encode("utf-8", "ignore")
    df['text'] = df['text'].str.decode('ascii')
    
    df['text'] = df['text'].str.replace('[PII]','', regex=True)
    # Replacing characters with whitespace
    df['text'] = df['text'].replace(r'\r+|\n+|\t+|\u2028',' ', regex=True)

    # Replacing punctuations
    df['text'] = df['text'].str.replace('[^\w\s]','', regex=True)

    # Lowercasing reviews
    df['text'] = df['text'].str.lower()
    return df
df = clean_text(df)
df.head(10)

Unnamed: 0,responce_to_question,text
0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...
1,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...
2,From your point of view What is Healthier SG,think healthier s g s a whole rethinking of h...
3,If there's one thing that you could you know c...,for me think information to unify all the inf...
4,Should the typical patient be concerned about ...,over time we will see a general increase in co...
5,The idea of having family doctors is something...,dont think were only doing it now think weve...
6,Will this also add on to increasing burdens fo...,be honest and say that it is going to be certa...
7,when you look at our health care system what w...,would actually just extend a little bit from...


In [23]:
def prepare_input_data(df):
    """Encoding and getting reviews in byte size.
    Review gets encoded to utf-8 format and getting the size of the reviews in bytes. 
    Comprehend requires each review input to be no more than 5000 Bytes
    """
    df['textsize'] = df['text'].apply(lambda x:len(x.encode('utf-8')))
    df = df[(df['textsize'] > 0) & (df['textsize'] < 5000)]
    df = df.drop(columns=['textsize'])
    return df
df = prepare_input_data(df)

In [24]:
# We first save the input file locally
with open(LOCAL_TRANSFORMED_INTERVIEW, "w") as outfile:
    outfile.write("\n".join(df['text'].tolist()))
    

### Run an Amazon Comprehend topic modeling job

In [25]:
# Client and session information
session = boto3.Session()
s3 = boto3.resource('s3')

# Account id. Required downstream.
account_id = boto3.client('sts').get_caller_identity().get('Account')

# Initializing Comprehend client
comprehend = boto3.client(service_name='comprehend', 
                          region_name=session.region_name)


In [27]:
# Number of topics set to 5 after having a human-in-the-loop
# This needs to be fully aligned with topicMaps dictionary in the third script 
NUMBER_OF_TOPICS = 5

# Input file format of one review per line
input_doc_format = "ONE_DOC_PER_LINE"

# Role arn (Hard coded, masked)
data_access_role_arn = ""



In [28]:
# Constants for S3 bucket and input data file
BUCKET='cnatest' 
input_s3_url = 's3://' + BUCKET + '/out/' + 'Transformed.txt'
output_s3_url = 's3://' + BUCKET + '/out/' + 'output/'

# Final dataframe where we will join Comprehend outputs later
S3_FEEDBACK_TOPICS = 's3://' + BUCKET + '/out/' + 'FinalDataframe.csv'

# Local copy of Comprehend output
LOCAL_COMPREHEND_OUTPUT_DIR = os.path.join('comprehend_out', '')
LOCAL_COMPREHEND_OUTPUT_FILE = os.path.join(LOCAL_COMPREHEND_OUTPUT_DIR, 'output.tar.gz')

INPUT_CONFIG={
    # The S3 URI where Comprehend input is placed.
    'S3Uri':    input_s3_url,
    # Document format
    'InputFormat': input_doc_format,
}
OUTPUT_CONFIG={
    # The S3 URI where Comprehend output is placed.
    'S3Uri':    output_s3_url,
}

In [30]:
import json
# Start Comprehend topic modelling job.
# Specifies the number of topics, input and output config and IAM role ARN 
# that grants Amazon Comprehend read access to data.
start_topics_detection_job_result = comprehend.start_topics_detection_job(
                                                    NumberOfTopics=NUMBER_OF_TOPICS,
                                                    InputDataConfig=INPUT_CONFIG,
                                                    OutputDataConfig=OUTPUT_CONFIG,
DataAccessRoleArn=data_access_role_arn)

print('start_topics_detection_job_result' )

# Job ID is required downstream for extracting the Comprehend results
job_id = start_topics_detection_job_result["JobId"]
#print('job_id: ', job_id)


start_topics_detection_job_result


In [None]:
# Keeping track if Comprehend has finished its job
description = comprehend.describe_topics_detection_job(JobId=job_id)

topic_detection_job_status = description['TopicsDetectionJobProperties']["JobStatus"]
print(topic_detection_job_status)
while topic_detection_job_status not in ["COMPLETED", "FAILED"]:
    time.sleep(120)
    topic_detection_job_status = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']["JobStatus"]
    print(topic_detection_job_status)

topic_detection_job_status = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']["JobStatus"]
print(topic_detection_job_status)

SUBMITTED
IN_PROGRESS


In [None]:
comprehend.describe_topics_detection_job(JobId=job_id)

### Extracting Results:

When the job is successfully complete, it returns a compressed archive containing two files: topic-terms.csv and doc-topics.csv. The first output file, topic-terms.csv, is a list of topics in the collection. For each topic, the list includes, by default, the top terms by topic according to their weight. The second file, doc-topics.csv, lists the documents associated with a topic and the proportion of the document that is concerned with the topic. 

The outputs of Amazon Comprehend are copied locally for our next steps.

In [40]:
# Bucket prefix where model artifacts are stored
prefix = f'{account_id}-TOPICS-{job_id}'

# Model artifact zipped file
artifact_file = 'output.tar.gz'

# Location on S3 where model artifacts are stored
target = f's3://{BUCKET}/out/output/{prefix}/output/{artifact_file}'





In [None]:
# Copy Comprehend output from S3 to local notebook instance
! aws s3 cp {target}  ./comprehend-out/

### Extract output file and read them with pandas

In [44]:
import tarfile

In [47]:
# Unzip the Comprehend output file. 
# Two files are now saved locally- 
#       (1) comprehend-out/doc-topics.csv and 
#       (2) comprehend-out/topic-terms.csv

comprehend_tars = tarfile.open('comprehend-out/output.tar.gz')
comprehend_tars.extractall('comprehend-out')
comprehend_tars.close()

In [48]:
df_topics = pd.read_csv('comprehend-out/topic-terms.csv')

In [49]:
df_docs_by_topics = pd.read_csv('comprehend-out/doc-topics.csv')

In [50]:
df_topics.head(3)

Unnamed: 0,topic,term,weight
0,0,direction,0.071109
1,0,cost,0.057836
2,0,good,0.031432


In [60]:
df_docs_by_topics.head(20)

Unnamed: 0,docname,topic,proportion
0,Transformed.txt:0,4,0.34043
1,Transformed.txt:0,0,0.222851
2,Transformed.txt:0,1,0.220077
3,Transformed.txt:0,3,0.142298
4,Transformed.txt:0,2,0.074345
5,Transformed.txt:3,2,1.0
6,Transformed.txt:2,1,1.0
7,Transformed.txt:1,3,1.0
8,Transformed.txt:5,2,0.405211
9,Transformed.txt:5,4,0.208956


### Overall we have 5 topics. Lets look closer to terms:

In [53]:
for i in range(5):
    df_tmp = df_topics[df_topics.topic==i]
    print(df_tmp.head(20))

   topic        term    weight
0      0   direction  0.071109
1      0        cost  0.057836
2      0        good  0.031432
3      0  scientific  0.021970
4      0        work  0.021009
5      0    diabetes  0.021630
6      0     primary  0.021193
7      0    hospital  0.020882
8      0         key  0.020837
9      0  prevention  0.020256
    topic        term    weight
10      1       focus  0.046187
11      1     illness  0.046187
12      1      doctor  0.043520
13      1     chronic  0.044497
14      1     healthy  0.042997
15      1        back  0.033492
16      1     disease  0.032519
17      1  healthcare  0.032567
18      1      health  0.044762
19      1      school  0.021253
    topic     term    weight
20      2   system  0.064898
21      2  vaccine  0.046083
22      2    covid  0.033421
23      2    unify  0.033421
24      2     weve  0.040784
25      2     good  0.030318
26      2     huge  0.021204
27      2      ago  0.021204
28      2  blanket  0.021204
29      2   safet

### Topic names for 5 topics created by human-in-the-loop or SME feed

In [56]:
topicMaps = {
    0: 'Cost',
    1: 'Managment of Chronic Conditions',
    2: 'System Operation',
    3: 'Patient Experience',
    4: 'Changes in Healthcare',
}

Proceed to the notebook _"04_topic_mapping_sentiment.ipynb"_.