---
## A Pipeline to Understand Topics and Sentiment

We may encounter multiple individuals from diverse demographics undergoing an interview in a real-life scenario. Therefore, on top of generated topics, we can build a pipeline for sentiment analysis to find how interviewees  are reacting to questions, and also analyze aggregated information on user affinity towards a particular question.

The pipeline consists of the following steps:
1. __Data Preprocessing__: The input file is read and preprocessed to remove any unwanted characters or formatting. This is done using Python's pandas and re libraries.
2. __Name the topics__: use a human-in-the-loop with enough domain knowledge and subject matter expertise to name the topics by looking at their associated terms.
3. __Calling AWS Comprehend Detect Sentiment API__: The preprocessed data is then passed to the Comprehend detect sentiment API using the AWS SDK for Python (boto3). The API generates sentiment for each "mock-inteviewee's" each question answer text. 
4. __Aggregating topics and sentiments__: Both topics and sentiments are tightly coupled with answers. We will be aggregating topics and sentiments at question level and count the composite keys for each question. This final step helps us better understand the granularity of the answers per questions and categorizing it per topic in an aggregated manner.

This analysis is inspired by this [example](https://aws.amazon.com/blogs/machine-learning/get-better-insight-from-reviews-using-amazon-comprehend/)

---


In [1]:
import time
import boto3
from botocore.exceptions import ClientError
import requests
import pandas as pd

In [2]:
session = boto3.Session()
comprehend = boto3.client(  'comprehend')

### Step 1-2: Data Preprocessing and Name the topics

In [3]:
topicMaps = {
    0: 'Cost',
    1: 'Managment of Chronic Conditions',
    2: 'System Operation',
    3: 'Patient Experience',
    4: 'Changes in Healthcare',
}

In [4]:
topic_map_df = pd.DataFrame(topicMaps.items())
topic_map_df.columns=['topic','topic_name']
topic_map_df.head(3)

Unnamed: 0,topic,topic_name
0,0,Cost
1,1,Managment of Chronic Conditions
2,2,System Operation


In [5]:
BUCKET='cnatest'
# Final dataframe where we will join Comprehend outputs later
S3_FEEDBACK_TOPICS = 's3://' + BUCKET + '/out/' + 'FinalDataframe.csv'
# Final output
S3_FINAL_OUTPUT = 's3://' + BUCKET + '/out/' + 'TopicsSentiments.csv'
language_code = 'en'
TOP_TOPICS=3

In [6]:
# Loading documents and topics assigned to each of them by Comprehend
DOC_TOPIC_FILE='comprehend-out/doc-topics.csv'
docTopics = pd.read_csv(DOC_TOPIC_FILE)
docTopics.head()

# Creating a field with doc number. 
# This doc number is the line number of the input file to Comprehend.
docTopics['doc'] = docTopics['docname'].str.split(':').str[1]
docTopics['doc'] = docTopics['doc'].astype(int)
docTopics.head()

Unnamed: 0,docname,topic,proportion,doc
0,Transformed.txt:0,4,0.34043,0
1,Transformed.txt:0,0,0.222851,0
2,Transformed.txt:0,1,0.220077,0
3,Transformed.txt:0,3,0.142298,0
4,Transformed.txt:0,2,0.074345,0


In [7]:
# Load topics and associated terms
DOC_TOPIC_FILE='comprehend-out/topic-terms.csv'
topicTerms = pd.read_csv(DOC_TOPIC_FILE)
# Consolidate terms for each topic
aggregatedTerms = topicTerms.groupby('topic')['term'].aggregate(lambda term: term.unique().tolist()).reset_index()
# Sneak peek
aggregatedTerms = pd.merge(aggregatedTerms, topic_map_df, on='topic',how='left')
aggregatedTerms.head()

Unnamed: 0,topic,term,topic_name
0,0,"[direction, cost, good, scientific, work, diab...",Cost
1,1,"[focus, illness, doctor, chronic, healthy, bac...",Managment of Chronic Conditions
2,2,"[system, vaccine, covid, unify, weve, good, hu...",System Operation
3,3,"[patient, case, outcome, check, bite, difficul...",Patient Experience
4,4,"[health, shift, early, thing, 80s, longevity, ...",Changes in Healthcare


### We will use the same file with responses to simulate that we have multiple interviewees
1. We read the same input file 3 times
2. Concatenate them into one dataframe
3. Join with the results from the topic modelling
4. Then for each reply we will generate sentiment 
5. Finally we aggregate the data to question level 

In [16]:
data_key = 'transcript_with_mapped_questions_and_answers.csv' # Where the file is within your bucket
data_location = 's3://{}/{}'.format(BUCKET, data_key)

interviewee_1 = pd.read_csv(data_location)
interviewee_1 = interviewee_1.reset_index()
interviewee_2 = pd.read_csv(data_location)
interviewee_2 = interviewee_2.reset_index()
interviewee_3 = pd.read_csv(data_location)
interviewee_3 = interviewee_3.reset_index()
interviewee_1.head(10)

Unnamed: 0,index,responce_to_question,text
0,0,As a national savings for the healthcare syste...,if this is running correctly? And we get buy i...
1,1,Fees will be paid to GPS based on the health r...,start first. What I used to hear a lot of is t...
2,2,From your point of view What is Healthier SG,I think healthier S. G. Is a whole rethinking ...
3,3,If there's one thing that you could you know c...,for me I think information to unify all the in...
4,4,Should the typical patient be concerned about ...,over time. We will see a general increase in c...
5,5,The idea of having family doctors is something...,I don't think we're only doing it now. I think...
6,6,Will this also add on to increasing burdens fo...,be honest and say that it is going to be certa...
7,7,when you look at our health care system what w...,[PII] I would actually just extend a little bi...


In [None]:
df_mock = interviewee_1.append(interviewee_2)
df_mock = df_mock.append(interviewee_3)
df_mock.shape

In [18]:
df_mock.columns=['doc', 'question', 'text']

In [19]:
def clean_text(df):
    """Preprocessing review text.
    The text becomes Comprehend compatible as a result.
    This is the most important preprocessing step.
    """
    # Encode and decode reviews
    df['text'] = df['text'].str.encode("utf-8", "ignore")
    df['text'] = df['text'].str.decode('ascii')
    
    df['text'] = df['text'].str.replace('[PII]','', regex=True)
    # Replacing characters with whitespace
    df['text'] = df['text'].replace(r'\r+|\n+|\t+|\u2028',' ', regex=True)

    # Replacing punctuations
    df['text'] = df['text'].str.replace('[^\w\s]','', regex=True)

    # Lowercasing reviews
    df['text'] = df['text'].str.lower()
    return df
df_mock = clean_text(df_mock)
def prepare_input_data(df):
    """Encoding and getting reviews in byte size.
    Review gets encoded to utf-8 format and getting the size of the reviews in bytes. 
    Comprehend requires each review input to be no more than 5000 Bytes
    """
    df['textsize'] = df['text'].apply(lambda x:len(x.encode('utf-8')))
    
    df = df[(df['textsize'] > 0) & (df['textsize'] < 5000)]
  
    df = df.drop(columns=['textsize'])
    return df
df_mock = prepare_input_data(df_mock)


In [20]:
# The index of feedbackTopics is referring to doc field of docTopics dataframe
repliesTopics = pd.merge(df_mock, 
                          docTopics, 
                          on='doc', 
                          how='left')

In [21]:
repliesTopics.head(15)

Unnamed: 0,doc,question,text,docname,topic,proportion
0,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,4.0,0.34043
1,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,0.0,0.222851
2,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,1.0,0.220077
3,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,3.0,0.142298
4,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,2.0,0.074345
5,1,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,Transformed.txt:1,3.0,1.0
6,2,From your point of view What is Healthier SG,think healthier s g s a whole rethinking of h...,Transformed.txt:2,1.0,1.0
7,3,If there's one thing that you could you know c...,for me think information to unify all the inf...,Transformed.txt:3,2.0,1.0
8,4,Should the typical patient be concerned about ...,over time we will see a general increase in co...,Transformed.txt:4,0.0,1.0
9,5,The idea of having family doctors is something...,dont think were only doing it now think weve...,Transformed.txt:5,2.0,0.405211


In [22]:
# Reviews will now have topic numbers, associated terms and topics names
repliesTopics = repliesTopics.merge(aggregatedTerms, 
                                      on='topic', 
                                      how='left')
repliesTopics.head()

Unnamed: 0,doc,question,text,docname,topic,proportion,term,topic_name
0,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,4.0,0.34043,"[health, shift, early, thing, 80s, longevity, ...",Changes in Healthcare
1,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,0.0,0.222851,"[direction, cost, good, scientific, work, diab...",Cost
2,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,1.0,0.220077,"[focus, illness, doctor, chronic, healthy, bac...",Managment of Chronic Conditions
3,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,3.0,0.142298,"[patient, case, outcome, check, bite, difficul...",Patient Experience
4,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,2.0,0.074345,"[system, vaccine, covid, unify, weve, good, hu...",System Operation


### Step 3: Generate sentiment for the replies text using detect_sentiment. 
It inspects text and returns an inference of the prevailing sentiment (POSITIVE, NEUTRAL, MIXED, or NEGATIVE).

In [23]:
def detect_sentiment(text, language_code):
    """Detects sentiment for a given text and language
    """
    comprehend_json_out = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    return comprehend_json_out

In [24]:
# Comprehend output for sentiment in raw json 
repliesTopics['comprehend_sentiment_json_out'] = repliesTopics['text'].apply(lambda x: detect_sentiment(x, language_code))

# Extracting the exact sentiment from raw Comprehend Json
repliesTopics['sentiment'] = repliesTopics['comprehend_sentiment_json_out'].apply(lambda x: x['Sentiment'])

# Sneak peek
repliesTopics.head(2)

Unnamed: 0,doc,question,text,docname,topic,proportion,term,topic_name,comprehend_sentiment_json_out,sentiment
0,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,4.0,0.34043,"[health, shift, early, thing, 80s, longevity, ...",Changes in Healthcare,"{'Sentiment': 'MIXED', 'SentimentScore': {'Pos...",MIXED
1,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,0.0,0.222851,"[direction, cost, good, scientific, work, diab...",Cost,"{'Sentiment': 'MIXED', 'SentimentScore': {'Pos...",MIXED


In [25]:
# Creating a composite key of topic name and sentiment.
# This is because we are counting frequency of this combination.
repliesTopics['TopicSentiment'] = repliesTopics['topic_name'] + '_' + repliesTopics['sentiment']

In [26]:
repliesTopics.head(3)

Unnamed: 0,doc,question,text,docname,topic,proportion,term,topic_name,comprehend_sentiment_json_out,sentiment,TopicSentiment
0,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,4.0,0.34043,"[health, shift, early, thing, 80s, longevity, ...",Changes in Healthcare,"{'Sentiment': 'MIXED', 'SentimentScore': {'Pos...",MIXED,Changes in Healthcare_MIXED
1,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,0.0,0.222851,"[direction, cost, good, scientific, work, diab...",Cost,"{'Sentiment': 'MIXED', 'SentimentScore': {'Pos...",MIXED,Cost_MIXED
2,0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,Transformed.txt:0,1.0,0.220077,"[focus, illness, doctor, chronic, healthy, bac...",Managment of Chronic Conditions,"{'Sentiment': 'MIXED', 'SentimentScore': {'Pos...",MIXED,Managment of Chronic Conditions_MIXED


In [27]:
repliesTopics['sentiment'].value_counts()

MIXED       21
NEGATIVE    15
NEUTRAL      9
Name: sentiment, dtype: int64

In [28]:
repliesTopics['TopicSentiment'].value_counts()

Cost_MIXED                                  6
Changes in Healthcare_MIXED                 3
Managment of Chronic Conditions_MIXED       3
Patient Experience_MIXED                    3
System Operation_MIXED                      3
Patient Experience_NEUTRAL                  3
Managment of Chronic Conditions_NEUTRAL     3
System Operation_NEUTRAL                    3
System Operation_NEGATIVE                   3
Changes in Healthcare_NEGATIVE              3
Patient Experience_NEGATIVE                 3
Managment of Chronic Conditions_NEGATIVE    3
Cost_NEGATIVE                               3
Name: TopicSentiment, dtype: int64

### Step 4: Aggregating topics and sentiments
This final step helps us better understand the granularity of the replies per question 

In [33]:
# Create question id group
questions_DF = repliesTopics.groupby('question')

In [34]:
questions_DF

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd9d122ff40>

In [35]:
# Each product now has a list of topics and sentiment combo (topics can appear multiple times)
topicDF = questions_DF['TopicSentiment'].apply(lambda x:list(x)).reset_index()
topicDF.head()

Unnamed: 0,question,TopicSentiment
0,As a national savings for the healthcare syste...,"[Changes in Healthcare_MIXED, Cost_MIXED, Mana..."
1,Fees will be paid to GPS based on the health r...,"[Patient Experience_NEUTRAL, Patient Experienc..."
2,From your point of view What is Healthier SG,"[Managment of Chronic Conditions_NEUTRAL, Mana..."
3,If there's one thing that you could you know c...,"[System Operation_NEUTRAL, System Operation_NE..."
4,Should the typical patient be concerned about ...,"[Cost_MIXED, Cost_MIXED, Cost_MIXED]"


In [36]:
# Count appreances of topics-sentiment combo for product
from collections import Counter

topicDF['TopTopics'] = topicDF['TopicSentiment'].apply(Counter)
topicDF.head()

Unnamed: 0,question,TopicSentiment,TopTopics
0,As a national savings for the healthcare syste...,"[Changes in Healthcare_MIXED, Cost_MIXED, Mana...","{'Changes in Healthcare_MIXED': 3, 'Cost_MIXED..."
1,Fees will be paid to GPS based on the health r...,"[Patient Experience_NEUTRAL, Patient Experienc...",{'Patient Experience_NEUTRAL': 3}
2,From your point of view What is Healthier SG,"[Managment of Chronic Conditions_NEUTRAL, Mana...",{'Managment of Chronic Conditions_NEUTRAL': 3}
3,If there's one thing that you could you know c...,"[System Operation_NEUTRAL, System Operation_NE...",{'System Operation_NEUTRAL': 3}
4,Should the typical patient be concerned about ...,"[Cost_MIXED, Cost_MIXED, Cost_MIXED]",{'Cost_MIXED': 3}


In [37]:
# Sorting topics-sentiment combo based on their appearance
topicDF['TopTopics'] = topicDF['TopTopics'].apply(lambda x: sorted(x, key=x.get, reverse=True))

# Select Top k topics-sentiment combo for each product/review
topicDF['TopTopics'] = topicDF['TopTopics'].apply(lambda x: x[:TOP_TOPICS])

# Sneak peek
topicDF.head()

Unnamed: 0,question,TopicSentiment,TopTopics
0,As a national savings for the healthcare syste...,"[Changes in Healthcare_MIXED, Cost_MIXED, Mana...","[Changes in Healthcare_MIXED, Cost_MIXED, Mana..."
1,Fees will be paid to GPS based on the health r...,"[Patient Experience_NEUTRAL, Patient Experienc...",[Patient Experience_NEUTRAL]
2,From your point of view What is Healthier SG,"[Managment of Chronic Conditions_NEUTRAL, Mana...",[Managment of Chronic Conditions_NEUTRAL]
3,If there's one thing that you could you know c...,"[System Operation_NEUTRAL, System Operation_NE...",[System Operation_NEUTRAL]
4,Should the typical patient be concerned about ...,"[Cost_MIXED, Cost_MIXED, Cost_MIXED]",[Cost_MIXED]


In [41]:
for q in topicDF.question.tolist():
    print(f'QUESTION: {q}')
    print('TOPICS:')
    [print(val) for val in topicDF[topicDF.question==q].TopTopics.tolist()[0]]
    print('---')

QUESTION: As a national savings for the healthcare system will we save money is this is running correctly?
TOPICS:
Changes in Healthcare_MIXED
Cost_MIXED
Managment of Chronic Conditions_MIXED
---
QUESTION: Fees will be paid to GPS based on the health risk profiles of patients and depending on the outcomes.How are you going to measure this when outcome is going to be so different for everyone?
TOPICS:
Patient Experience_NEUTRAL
---
QUESTION: From your point of view What is Healthier SG
TOPICS:
Managment of Chronic Conditions_NEUTRAL
---
QUESTION: If there's one thing that you could you know change right now fixed right now with regards to our health care system what would it be?
TOPICS:
System Operation_NEUTRAL
---
QUESTION: Should the typical patient be concerned about rising costs when they go see the doctor world prices increase over time?
TOPICS:
Cost_MIXED
---
QUESTION: The idea of having family doctors is something so common elsewhere in the world. Why are we only doing it now?
TO

### Saving results:

In [43]:
topicDF.to_csv(S3_FINAL_OUTPUT,index=False)