---
## Key phrase sentiment analysis pipeline

This Jupyter notebook contains a pipeline for extracting targeted sentiment for key phrases in interview responses using AWS Comprehend. The pipeline first identifies the key phrases in the responses using the AWS Comprehend Key Phrases Detect API, and then extracts the context surrounding each key phrase to provide additional context for sentiment analysis. The pipeline can be used to identify patterns in how interviewees express positive or negative sentiment towards specific key phrase.

### Pipeline steps
The pipeline consists of the following steps:

1. __Loading the data__: The interview transcript, speaker classification results, and reference question list are loaded into the notebook.

2. __Preprocessing__: The transcript is preprocessed to remove any unnecessary elements, such as timestamps or speaker tags.

3. __Identifying key phrases__: The pipeline calls the AWS Comprehend Key Phrases Detect API for each response in the transcript, and selects only the top key phrases with a score above 0.99. These key phrases are then used as the basis for sentiment analysis.

4. __Extracting context__: For each top key phrase, the pipeline extracts the context surrounding the phrase by taking 10 words before and after the phrase in the response. This provides additional context for sentiment analysis and helps to ensure that the sentiment is targeted specifically to the key phrase.

5. __Detecting sentiment__: The pipeline calls the AWS Comprehend Detect Sentiment API for each extracted context, and records the sentiment score and sentiment label (positive, negative, neutral, or mixed) for each key phrase.

6. __Aggregating results__: The pipeline can be used to aggregate the sentiment analysis results across multiple interviewees and/or questions, providing a more comprehensive understanding of the sentiment towards specific topics or themes.

### Conclusion
This pipeline provides a powerful tool for analyzing the sentiment expressed in interview responses. By identifying key phrases and extracting the context surrounding them, the pipeline enables targeted sentiment analysis that can reveal patterns and insights that might otherwise be missed.

---


In [1]:
import time
import boto3
from botocore.exceptions import ClientError
import requests
import pandas as pd
import os

In [2]:
comprehend = boto3.client('comprehend', region_name='us-east-1')

### Step 1: Loading the data

In [3]:
BUCKET='cnatest' # Or whatever you called your bucket
data_key = 'transcript_with_mapped_questions_and_answers.csv' # Where the file is within your bucket
data_location = 's3://{}/{}'.format(BUCKET, data_key)
df = pd.read_csv(data_location)

In [4]:
df.head(10)

Unnamed: 0,responce_to_question,text
0,As a national savings for the healthcare syste...,if this is running correctly? And we get buy i...
1,Fees will be paid to GPS based on the health r...,start first. What I used to hear a lot of is t...
2,From your point of view What is Healthier SG,I think healthier S. G. Is a whole rethinking ...
3,If there's one thing that you could you know c...,for me I think information to unify all the in...
4,Should the typical patient be concerned about ...,over time. We will see a general increase in c...
5,The idea of having family doctors is something...,I don't think we're only doing it now. I think...
6,Will this also add on to increasing burdens fo...,be honest and say that it is going to be certa...
7,when you look at our health care system what w...,[PII] I would actually just extend a little bi...


In [5]:
list_questions = df.responce_to_question.tolist()

### Step 2: Preprocessing

In [6]:
def clean_text(df):
    """Preprocessing review text.
    The text becomes Comprehend compatible as a result.
    This is the most important preprocessing step.
    """
    # Encode and decode reviews
    df['text'] = df['text'].str.encode("utf-8", "ignore")
    df['text'] = df['text'].str.decode('ascii')
    
    df['text'] = df['text'].str.replace('[PII]','', regex=True)
    # Replacing characters with whitespace
    df['text'] = df['text'].replace(r'\r+|\n+|\t+|\u2028',' ', regex=True)

    # Replacing punctuations
    df['text'] = df['text'].str.replace('[^\w\s]','', regex=True)

    # Lowercasing reviews
    df['text'] = df['text'].str.lower()
    return df
df = clean_text(df)


### Step 3: Identifying key phrases

In [7]:
def get_insights(sample):
    # Key phrases
    phrases = comprehend.detect_key_phrases(Text=sample, LanguageCode='en')  
    
    return {'phrases':phrases}

In [8]:
keyphrases_dict = dict()

for answer_i, question_i in zip(df.text.tolist(), list_questions):
    keyphrases_dict[question_i] = get_insights(answer_i)
    

In [15]:
keyphrases_dict[df.responce_to_question.tolist()[0]].get('phrases').get('KeyPhrases')[0]

{'Score': 0.9994125962257385,
 'Text': 'the populist',
 'BeginOffset': 52,
 'EndOffset': 64}

### Save Results of all KeyPhrases to data frame:

In [None]:
key_phrases_df = pd.DataFrame()
for q_i in df.responce_to_question.tolist():
    tmp = pd.DataFrame(keyphrases_dict[q_i].get('phrases').get('KeyPhrases'))
    tmp['question'] = q_i
    key_phrases_df = key_phrases_df.append(tmp)

In [17]:
key_phrases_df.head()

Unnamed: 0,Score,Text,BeginOffset,EndOffset,question
0,0.999413,the populist,52,64,As a national savings for the healthcare syste...
1,0.999762,the long run,79,91,As a national savings for the healthcare syste...
2,0.986046,benefit,104,111,As a national savings for the healthcare syste...
3,0.999053,a decade,121,129,As a national savings for the healthcare syste...
4,0.95075,two decades,130,141,As a national savings for the healthcare syste...


####  Consider only TOP key phrases whith confidence score >0.98

In [18]:
top_key_phrases = key_phrases_df[key_phrases_df.Score>0.98]

In [19]:
set_of_top_key_phrases = set(top_key_phrases.Text.tolist())

In [20]:
len(set_of_top_key_phrases)

461

### Detect overall sentiment for answers:


In [21]:
def get_insights_sentiment(sample):
    # Key phrases
    sentiment = comprehend.detect_sentiment(Text=sample, LanguageCode='en')  
    
    return {'sentiment':sentiment}

In [22]:
def get_top_500_charackters(input_str, textsize):
    if textsize<5000:
        return input_str
    else:
        return input_str[:500]
    

In [23]:
def prepare_input_data(df):
    """Encoding and getting reviews in byte size.
    Review gets encoded to utf-8 format and getting the size of the reviews in bytes. 
    Comprehend requires each review input to be no more than 5000 Bytes
    """
    df['textsize'] = df['text'].apply(lambda x:len(x.encode('utf-8')))
    df['text_for_sentiment'] = ""
    df['text_for_sentiment'] = df[['text','textsize']].apply(lambda x:get_top_500_charackters(x[0],x[1]),axis=1)
    #df = df[(df['textsize'] > 0) & (df['textsize'] < 5000)]
    df = df.drop(columns=['textsize'])
    return df
df = prepare_input_data(df)

In [24]:
df.head()

Unnamed: 0,responce_to_question,text,text_for_sentiment
0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...
1,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,start first what used to hear a lot of is tha...
2,From your point of view What is Healthier SG,think healthier s g s a whole rethinking of h...,think healthier s g s a whole rethinking of h...
3,If there's one thing that you could you know c...,for me think information to unify all the inf...,for me think information to unify all the inf...
4,Should the typical patient be concerned about ...,over time we will see a general increase in co...,over time we will see a general increase in co...


In [25]:
sentiment_dict = dict()

for answer_i, question_i in zip(df.text_for_sentiment.tolist(), list_questions):
    sentiment_dict[question_i] = get_insights_sentiment(answer_i)

In [26]:
q_i='As a national savings for the healthcare system will we save money is this is running correctly?'
sentiment_dict[q_i].get('sentiment').get('Sentiment')

'MIXED'

In [27]:
sentiment_dict[q_i].get('sentiment')

{'Sentiment': 'MIXED',
 'SentimentScore': {'Positive': 0.034217070788145065,
  'Negative': 0.061298761516809464,
  'Neutral': 0.12324418872594833,
  'Mixed': 0.781239926815033},
 'ResponseMetadata': {'RequestId': '52e6fcc5-885d-46d7-8c03-479e159cc46c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '52e6fcc5-885d-46d7-8c03-479e159cc46c',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '160',
   'date': 'Thu, 09 Mar 2023 03:15:31 GMT'},
  'RetryAttempts': 0}}

### Add a column with overall sentiment to input transcript

In [28]:
df.head()

Unnamed: 0,responce_to_question,text,text_for_sentiment
0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...
1,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,start first what used to hear a lot of is tha...
2,From your point of view What is Healthier SG,think healthier s g s a whole rethinking of h...,think healthier s g s a whole rethinking of h...
3,If there's one thing that you could you know c...,for me think information to unify all the inf...,for me think information to unify all the inf...
4,Should the typical patient be concerned about ...,over time we will see a general increase in co...,over time we will see a general increase in co...


In [29]:
import numpy as np

In [30]:
df['overall_sentiment']=""
for q_i in list_questions:
    sentiment_i = sentiment_dict[q_i].get('sentiment').get('Sentiment')
    index_q_i = np.where(df['responce_to_question']==q_i)
    df.at[index_q_i[0][0],'overall_sentiment'] = sentiment_i

In [31]:
df.head(10)

Unnamed: 0,responce_to_question,text,text_for_sentiment,overall_sentiment
0,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED
1,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,start first what used to hear a lot of is tha...,NEUTRAL
2,From your point of view What is Healthier SG,think healthier s g s a whole rethinking of h...,think healthier s g s a whole rethinking of h...,NEUTRAL
3,If there's one thing that you could you know c...,for me think information to unify all the inf...,for me think information to unify all the inf...,NEUTRAL
4,Should the typical patient be concerned about ...,over time we will see a general increase in co...,over time we will see a general increase in co...,MIXED
5,The idea of having family doctors is something...,dont think were only doing it now think weve...,dont think were only doing it now think weve...,NEGATIVE
6,Will this also add on to increasing burdens fo...,be honest and say that it is going to be certa...,be honest and say that it is going to be certa...,NEUTRAL
7,when you look at our health care system what w...,would actually just extend a little bit from...,would actually just extend a little bit from...,MIXED


In [32]:
[ len(val) for val in df.text.tolist()]

[2628, 3768, 2057, 1881, 2049, 4161, 6465, 1824]

### Step 4: Extracting context

### Analyse sentiment for top key phrases
We will consider a context +- 10 words around the "key phrase"

In [33]:
top_key_phrases.columns = ['Score', 'Text', 'BeginOffset', 'EndOffset', 'responce_to_question']

In [34]:
df_sentiment_top_key_phrases = pd.merge(top_key_phrases,df,on="responce_to_question", how="left")

In [35]:
df_sentiment_top_key_phrases.head()

Unnamed: 0,Score,Text,BeginOffset,EndOffset,responce_to_question,text,text_for_sentiment,overall_sentiment
0,0.999413,the populist,52,64,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED
1,0.999762,the long run,79,91,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED
2,0.986046,benefit,104,111,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED
3,0.999053,a decade,121,129,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED
4,0.99957,the road,147,155,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED


In [36]:
def get_top_substring_10_words(BeginOffset,input_str):
    top = input_str[:BeginOffset]
    words = top.split(" ")
    if len(words)>=10:
        # return last 10
        return ' '.join(words[-10:])
    else:
        return ' '.join(words) 
    
def get_last_substring_10_words(EndOffset,input_str):
    top = input_str[EndOffset+1:]
    words = top.split(" ")
    if len(words)>=10:
        # return last 10
        return ' '.join(words[:10])
    else:
        return  ' '.join(words)  

In [37]:
df_sentiment_top_key_phrases["top_10_words"]=""
df_sentiment_top_key_phrases["bottom_10_words"]=""
df_sentiment_top_key_phrases["context"]=""

In [38]:
df_sentiment_top_key_phrases["top_10_words"]=df_sentiment_top_key_phrases[["BeginOffset","text"]].apply(
lambda x: get_top_substring_10_words(x[0],x[1]),axis=1)
df_sentiment_top_key_phrases["bottom_10_words"]=df_sentiment_top_key_phrases[["EndOffset","text"]].apply(
lambda x: get_last_substring_10_words(x[0],x[1]),axis=1)

In [39]:
def concat_str(top_5_words,Text,bottom_5_words):
    return str(top_5_words)+" "+str(Text)+" "+str(bottom_5_words)

In [40]:
df_sentiment_top_key_phrases["context"]=df_sentiment_top_key_phrases[["top_10_words",
                                                                      "Text",
                                                                     "bottom_10_words"
                                                                     ]].apply(lambda x:concat_str(x[0],x[1],x[0]),axis=1)

In [41]:
df_sentiment_top_key_phrases.head()

Unnamed: 0,Score,Text,BeginOffset,EndOffset,responce_to_question,text,text_for_sentiment,overall_sentiment,top_10_words,bottom_10_words,context
0,0.999413,the populist,52,64,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED,is running correctly and we get buy in from,would say in the long run we will see,is running correctly and we get buy in from t...
1,0.999762,the long run,79,91,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED,buy in from the populist would say in,we will see benefit you know a decade two decades,buy in from the populist would say in the lo...
2,0.986046,benefit,104,111,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED,would say in the long run we will see,you know a decade two decades down the road for,would say in the long run we will see benefit...
3,0.999053,a decade,121,129,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED,the long run we will see benefit you know,two decades down the road for sure feel that it,the long run we will see benefit you know a d...
4,0.99957,the road,147,155,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,if this is running correctly and we get buy in...,MIXED,see benefit you know a decade two decades down,for sure feel that it takes a generation thats...,see benefit you know a decade two decades down...


In [42]:
df_sentiment_top_key_phrases.shape

(559, 11)

### Step 5: Detecting sentiment for key phrases:

In [43]:
context = df_sentiment_top_key_phrases.context.tolist()

In [44]:
sentiment_dict = dict()
key_phrases_list = df_sentiment_top_key_phrases.Text.tolist()
ind = 0
for phrase_i, context_i in zip(key_phrases_list, context):
    sentiment_dict[ind] = get_insights_sentiment(context_i)
    ind = ind+1

In [45]:
df_sentiment_top_key_phrases =df_sentiment_top_key_phrases[['Score', 'Text', 'BeginOffset', 'EndOffset', 'responce_to_question',
       'text', 'overall_sentiment','context']]

In [46]:
for i, phrase in enumerate(key_phrases_list):
    sentiment_i = sentiment_dict[i].get('sentiment').get('Sentiment')
    df_sentiment_top_key_phrases.at[i,'key_phrase_sentiment'] = sentiment_i

In [47]:
df_sentiment_top_key_phrases[df_sentiment_top_key_phrases.key_phrase_sentiment == "NEGATIVE"].head(30)

Unnamed: 0,Score,Text,BeginOffset,EndOffset,responce_to_question,text,overall_sentiment,context,key_phrase_sentiment
26,0.999519,rescue,1224,1230,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,MIXED,with the condition fairly late where the cost ...,NEGATIVE
31,0.999938,the very long term,1534,1552,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,MIXED,of which we cant afford as a nation in the ve...,NEGATIVE
38,0.999848,the fact,1855,1863,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,MIXED,the beginning of life and didnt allude to th...,NEGATIVE
40,0.988433,greater risk,1908,1920,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,MIXED,we had to start with the ones that are greate...,NEGATIVE
47,0.984725,this particular family doctor family doctor group,2179,2228,As a national savings for the healthcare syste...,if this is running correctly and we get buy in...,MIXED,over time even the kids will end up seeing th...,NEGATIVE
59,0.999603,the individual case,53,72,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,NEUTRAL,used to hear a lot of is that when the indivi...,NEGATIVE
73,0.999572,a bit,757,762,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,NEUTRAL,get amply compensated in that sense so that re...,NEGATIVE
76,0.997937,the patient,793,804,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,NEUTRAL,a bit of the burden and the risk of the patie...,NEGATIVE
77,0.999288,our best efforts,848,864,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,NEUTRAL,risk of the patient just being completely nonc...,NEGATIVE
78,0.995523,very large scale kind,905,926,Fees will be paid to GPS based on the health r...,start first what used to hear a lot of is tha...,NEUTRAL,then the other part about it is that on very ...,NEGATIVE


### Saving Results:

In [48]:
df_sentiment_top_key_phrases.to_csv("data/key_phrases_sentiment.csv", index=False)

In [49]:
df.to_csv("data/overall_sentiment_for_question_replies.csv", index=False)

In [50]:
key_phrases_df.to_csv("data/all_key_phrases.csv",index=False)