## AWS Hands on Labs - Machine Learning - AWS Comprehend (2020/04/05)

##### Notes if you are setting the notebook up in your personal environment:

1. Ensure the ComprehendFullAccess policy is attached to the IAM role that SageMaker uses
2. Ensure SageMaker has access to the S3 bucket defined in `DATA_BUCKET` below, via an appropriate S3 policy (e.g. AmazonS3FullAccess) attached to the same IAM role

##### For the Topic Modelling section:
1. Ensure the role defined in `DATA_ACCESS_ROLE_ARN` below has access to the `DATA_BUCKET`. Remember that this role needs to have a trust relationship that allows the Comprehend service to assume it
2. Ensure that the SageMaker role has a policy granting `iam:PassRole` on the `DATA_ACCESS_ROLE_ARN`, so that the notebook can pass the data access role to the Comprehend service
3. Ensure that the STS service has been enabled for the `REGION` in the account. 
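
The two IAM requirements for topic modelling can be sketched as policy documents. The snippet below only builds and prints the JSON; it does not create or attach anything. The variable names and the placeholder role ARN are illustrative assumptions, not values from this lab.

```python
import json

# Placeholder ARN for illustration only; substitute the value of DATA_ACCESS_ROLE_ARN.
example_role_arn = "arn:aws:iam::123456789012:role/ExampleComprehendDataAccessRole"

# Trust policy for the data access role: lets the Comprehend service assume it.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "comprehend.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy for the SageMaker role: lets the notebook pass the
# data access role when it starts a Comprehend job.
pass_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": example_role_arn,
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(pass_role_policy, indent=2))
```

The data access role additionally needs its own S3 permissions on `DATA_BUCKET`, which are not shown here.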

In [1]:
import pandas as pd
from sklearn import metrics
import boto3
import time

In [2]:
REGION = 'us-east-2'

DATA_ACCESS_ROLE_ARN = 'arn:aws:iam::632354576168:role/HandsOnLabsComprehendDataAccessRole'
DATA_BUCKET = 'awshandson-comprehend'
IMDB_DATA_PREFIX = 'data/imdb/imdb-data-sample.csv'

TOPICS_OUTPUT_PREFIX = 'imdb-topics'

SAMPLE_SIZE = 1000
MAX_TEXT_LENGTH = 4900
BATCH_SIZE = 25
PREDICTION_THRESHOLD = 0.5

## Overview of AWS Comprehend functionality and API

##### Get a client to access the AWS Comprehend API

In [3]:
comprehend = boto3.client('comprehend', region_name=REGION)

##### Sample sentence to demonstrate key Comprehend functionalities

In [4]:
sample_sentence = "It’s always a great day when I can randomly put my equestrian knowledge to good use at work! #AWS #BePeculiar"

##### Key phrase detection

In [5]:
phrases = comprehend.detect_key_phrases(Text=sample_sentence, LanguageCode='en')
for phrase in phrases['KeyPhrases']:
    print(phrase)

{'Score': 0.9999999403953552, 'Text': 'a great day', 'BeginOffset': 12, 'EndOffset': 23}
{'Score': 1.0, 'Text': 'my equestrian knowledge', 'BeginOffset': 48, 'EndOffset': 71}
{'Score': 0.9999207258224487, 'Text': 'good use', 'BeginOffset': 75, 'EndOffset': 83}
{'Score': 0.9999996423721313, 'Text': 'work', 'BeginOffset': 87, 'EndOffset': 91}


##### Named Entity Recognition

In [6]:
sample_sentence_ner = """
500 men just stood there and watched. All the great knights of the Seven Kingdoms. You think anyone said a word, lifted a finger?
No, Lord Stark. 500 men and this room was silent as a crypt. Except for the screams, of course, and the Mad King laughing. 
And later... When I watched the Mad King die, I remembered him laughing as your father burned... It felt like justice.
"""

##### Write the code to do Named Entity Recognition using AWS Comprehend and output the results. 
##### Hint: Follow the pattern of the key phrase detection section above

In [7]:
### Enter your code here ###
# Pass the NER sample defined above (not the key-phrase sample sentence)
entities = comprehend.detect_entities(Text=sample_sentence_ner, LanguageCode='en')
for entity in entities['Entities']:
    print(entity)


##### Sentiment Analysis

1. Which of the three feedback messages below would get the highest Positive score?
2. Which of the two sarcastic sentences would Comprehend be able to detect?
3. Replace the words 'call out' with 'mention' in 'feedback_1'. How does the sentiment score change?

For each of the three questions, explain why.

In [8]:
sentiment_analysis_sentences = dict()
sentiment_analysis_sentences['feedback_1'] = """Hey John,

I wanted to call out all the work you've been doing for the social front of the Melbourne office. 

It's been great having more people actively involved in setting up events on a regular basis to give consultants an opportunity to catch up and team build outside of work contexts. 
Between the Hiking group and the Computer Games initiative the office has been much more active socially thanks to you and your inputs. 

Keep up the good work and thanks for making Servian a more interesting place to work.

Regards, 
Clare"""

sentiment_analysis_sentences['feedback_2'] = """
Laura, great work on creating the model deployment pipeline. You delivered independently with minimal direction and worked through blockers independently.
And with extensive documentation too - once again excellent work.
"""

sentiment_analysis_sentences['feedback_3'] = """
Phillip is fantastic at doing her job in a way that impacts others in a not positive way, I would like to mention that Phillip is excellent at 
arriving not on time and with a very ungood attitude
"""

sentiment_analysis_sentences['sarcasm_1'] = "His ignorance was an Empire State Building of ignorance - you had to admire it for its size"

sentiment_analysis_sentences['sarcasm_2'] = "You have the aroma and intelligence of a great ape"

In [9]:
sentiment = comprehend.detect_sentiment(Text=sentiment_analysis_sentences['feedback_1'], LanguageCode='en')
print(sentiment['SentimentScore'])

{'Positive': 0.8853756189346313, 'Negative': 0.0004674461670219898, 'Neutral': 0.11415435373783112, 'Mixed': 2.640103730300325e-06}
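
To compare all five sentences rather than scoring one at a time, the call above can be wrapped in a loop. This is a minimal sketch: `score_all`, `most_positive`, and `StubComprehend` are hypothetical names introduced here, and the stub returns made-up scores so that the snippet runs without AWS credentials. It is not real Comprehend output.

```python
def score_all(client, sentences, language_code="en"):
    """Return {name: SentimentScore dict} for every entry in `sentences`."""
    scores = {}
    for name, text in sentences.items():
        response = client.detect_sentiment(Text=text, LanguageCode=language_code)
        scores[name] = response["SentimentScore"]
    return scores


def most_positive(scores):
    """Return the name whose 'Positive' score is highest."""
    return max(scores, key=lambda name: scores[name]["Positive"])


class StubComprehend:
    """Stand-in for the boto3 client; the scores are invented for the demo."""

    def detect_sentiment(self, Text, LanguageCode):
        positive = 0.9 if "great" in Text else 0.1
        return {
            "SentimentScore": {
                "Positive": positive,
                "Negative": 1 - positive,
                "Neutral": 0.0,
                "Mixed": 0.0,
            }
        }


demo_sentences = {"a": "great work", "b": "late again"}
demo_scores = score_all(StubComprehend(), demo_sentences)
print(most_positive(demo_scores))
```

In the notebook, the equivalent call would be `score_all(comprehend, sentiment_analysis_sentences)`.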


## Text analysis on the IMDB Dataset (Movie Reviews)

##### Read IMDB dataset from S3 into a pandas dataframe

In [10]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=DATA_BUCKET, Key=IMDB_DATA_PREFIX)

df = pd.read_csv(obj['Body'])
df.head()

|Unnamed: 0|review|sentiment|
| --- | --- | --- |
|0|One of the other reviewers has mentioned that ...|positive|
|1|A wonderful little production. <br /><br />The...|positive|
|2|I thought this was a wonderful way to spend ti...|positive|
|3|Basically there's a family where a little boy ...|negative|
|4|Petter Mattei's "Love in the Time of Money" is...|positive|


##### Alternative method of reading dataset to dataframe

In [11]:
df = pd.read_csv('s3://' + DATA_BUCKET + "/" + IMDB_DATA_PREFIX)

##### Detecting sentiment of one review selected at random. 

##### Have a play by running the cell multiple times. Do the predicted sentiment scores align with your human-level understanding and the labels provided?

In [12]:
import random as rand

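# randrange excludes len(df), so the result is always a valid 0-based row position
review_number = rand.randrange(len(df))

sample_review_object = df.iloc[review_number]
sample_review_text = sample_review_object['review']

# Truncate the text sent to Comprehend so long reviews stay under the per-document size limit
sample_review_sentiment = comprehend.detect_sentiment(Text=sample_review_text[:MAX_TEXT_LENGTH], LanguageCode='en')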

print(f"Review: {review_number}\n\n\
Review Text: {sample_review_text}\n\n\
Predicted sentiment: {sample_review_sentiment['SentimentScore']},\n\n\
Actual sentiment: {sample_review_object['sentiment']}")

Review: 643

Review Text: Angela (Sandra Bullock) is a computer expert but, being shy and somewhat of a recluse, she does all of her work from the confines of her condo. Just as she is about to take a vacation in Mexico, a co-worker sends her a computer disc with disturbing information on it. Angela agrees to meet with her fellow employee but he mysteriously dies in a plane crash. Angela heads to Mexico but takes the disc with her. While she is sunning on the beach, a terrific looking gentleman named Jack (Jeremy Northam) makes overtures to her. She falls for them and the two end up on a boat to Cozumel. However, Jack works for the folks who generated the secret information on the disc and he is out to get it. Even after Angela escapes from his clutches and lands back in the USA, Jack makes things difficult. He changes Angela's identity on every computer across the nation, making her lose her condo, her bank account, everything. Can Angela, a computer whiz, beat Jack at his own game? T

##### Detecting sentiment of a batch of texts: Preprocessing

In [13]:
df['label'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df['review_processed'] = df['review'].apply(lambda x: x[:MAX_TEXT_LENGTH])
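
The slice above counts characters, whereas Comprehend's per-document size limit for sentiment detection is, to my understanding, expressed in UTF-8 bytes (5 KB). A review containing many non-ASCII characters could therefore still be too large after a 4,900-character cut. A minimal sketch of a byte-aware truncation follows; `truncate_utf8` is a hypothetical helper, not part of the lab.

```python
def truncate_utf8(text, max_bytes):
    """Cut `text` so its UTF-8 encoding is at most `max_bytes` long,
    without splitting a multi-byte character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops the partial character that the byte slice may leave at the end.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


ascii_review = "a" * 10
accented_review = "é" * 10  # each 'é' takes 2 bytes in UTF-8

print(len(truncate_utf8(ascii_review, 5)))     # 5 characters, 5 bytes
print(len(truncate_utf8(accented_review, 5)))  # 2 characters, 4 bytes
```

It could be applied with `df['review'].apply(lambda x: truncate_utf8(x, MAX_TEXT_LENGTH))`.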

##### Detecting sentiment of each batch. Loop through the batches and keep extending the `results` list with the output of `batch_detect_sentiment`

In [14]:
batch_size = BATCH_SIZE
max_batch_index = SAMPLE_SIZE // batch_size
results = []

for batch_index in range(max_batch_index):
    if batch_index % 5 == 0:
        print(f"Processing batch {batch_index}")
        
    # Rows [start, start + batch_size) form one batch
    start = batch_index * batch_size
    batch_text_list = df['review_processed'].iloc[start:start + batch_size].tolist()
    results.extend(comprehend.batch_detect_sentiment(TextList=batch_text_list, LanguageCode='en')['ResultList'])

Processing batch 0
Processing batch 5
Processing batch 10
Processing batch 15
Processing batch 20
Processing batch 25
Processing batch 30
Processing batch 35
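
The batching logic can also be expressed as a small generator, which handles a final partial batch when the number of texts is not a multiple of the batch size. This is a sketch only; `chunked` is a hypothetical helper name and the texts are placeholders.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


texts = [f"review {i}" for i in range(7)]
batches = list(chunked(texts, 3))
print([len(batch) for batch in batches])  # [3, 3, 1]
```

In the notebook this would look like `for batch in chunked(df['review_processed'].head(SAMPLE_SIZE).tolist(), BATCH_SIZE): ...`. It is also worth inspecting the `ErrorList` key of each `batch_detect_sentiment` response, because documents that fail are reported there rather than in `ResultList`.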


##### Printing out an element of the `results` list

In [15]:
print(results[0])

{'Index': 0, 'Sentiment': 'POSITIVE', 'SentimentScore': {'Positive': 0.9654746055603027, 'Negative': 0.01993785984814167, 'Neutral': 0.014541912823915482, 'Mixed': 4.556140265776776e-05}}


##### Creating `sentiment_pred` column in DataFrame by using the 'Positive' scores reported

In [16]:
### Enter your code here ###
df['sentiment_pred'] = pd.Series([result['SentimentScore']['Positive'] for result in results])

##### Sanity Check

In [17]:
df[['review', 'label', 'sentiment_pred']].head(n=10)

| |review|label|sentiment_pred|
| --- | --- | --- | --- |
|0|One of the other reviewers has mentioned that ...|1|0.965475|
|1|A wonderful little production. <br /><br />The...|1|0.998547|
|2|I thought this was a wonderful way to spend ti...|1|0.815299|
|3|Basically there's a family where a little boy ...|0|0.454442|
|4|Petter Mattei's "Love in the Time of Money" is...|1|0.991763|
|5|Probably my all-time favorite movie, a story o...|1|0.997527|
|6|I sure would like to see a resurrection of a u...|1|0.926993|
|7|This show was an amazing, fresh & innovative i...|0|0.01471|
|8|Encouraged by the positive comments about this...|0|0.014309|
|9|If you like original gut wrenching laughter yo...|1|0.999517|


##### Calculating accuracy metrics: AUC

In [18]:
true_values = df['label']
predicted_values = df['sentiment_pred']

In [19]:
fpr, tpr, thresholds = metrics.roc_curve(true_values, predicted_values, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print(roc_auc)

0.888231568990533
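
One way to read the AUC: it is the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative review, with ties counted as half. The sketch below computes that directly on a toy example; `pairwise_auc` is a hypothetical helper and the labels and scores are made up for illustration.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.7, 0.8, 0.2]
print(pairwise_auc(toy_labels, toy_scores))  # 5 of 6 pairs ranked correctly
```

The pairwise loop is quadratic in the number of examples, so it is meant for intuition rather than as a replacement for `metrics.roc_curve` and `metrics.auc`.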


##### Plot ROC curve

In [20]:
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

<Figure size 640x480 with 1 Axes>

##### Calculating accuracy metrics: Precision and Recall

In [21]:
### Enter your code here ###
predicted_values_binary = df['sentiment_pred'].apply(lambda x: 1 if x > PREDICTION_THRESHOLD else 0)

print(f"Precision: {metrics.precision_score(true_values, predicted_values_binary)},\n\
Recall: {metrics.recall_score(true_values, predicted_values_binary)}") 

Precision: 0.8112033195020747,
Recall: 0.780439121756487
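
Precision and recall depend on the threshold used to turn the Positive score into a binary prediction. The sketch below computes both from the confusion counts on a toy example and shows how they move as the threshold changes; `precision_recall` is a hypothetical helper and the labels and scores are made up.

```python
def precision_recall(labels, scores, threshold=0.5):
    """Return (precision, recall) when scores above `threshold` are predicted positive."""
    predictions = [1 if score > threshold else 0 for score in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.7, 0.4, 0.2]

for threshold in (0.3, 0.5, 0.8):
    precision, recall = precision_recall(toy_labels, toy_scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold trades recall for precision, which is the same trade-off the ROC curve above summarises.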


##### Topic Modelling

In [None]:
# Settings for the topic detection job
input_s3_url = "s3://" + DATA_BUCKET + "/" + IMDB_DATA_PREFIX
input_doc_format = "ONE_DOC_PER_LINE"
output_s3_url = "s3://" + DATA_BUCKET + "/" + TOPICS_OUTPUT_PREFIX

data_access_role_arn = DATA_ACCESS_ROLE_ARN
number_of_topics = 4
job_name = "IMDB_Topic_Modelling_Job"

input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}
output_data_config = {"S3Uri": output_s3_url}

# Starts an asynchronous topic detection job.
response = comprehend.start_topics_detection_job(NumberOfTopics=number_of_topics,
                                                 InputDataConfig=input_data_config,
                                                 OutputDataConfig=output_data_config,
                                                 DataAccessRoleArn=data_access_role_arn,
                                                 JobName=job_name)

# Gets job_id
job_id = response["JobId"]
print('job_id: ' + job_id)

# Loop until JobStatus reaches a terminal state ('COMPLETED', 'FAILED' or 'STOPPED').
while True:
    result = comprehend.describe_topics_detection_job(JobId=job_id)
    job_status = result["TopicsDetectionJobProperties"]["JobStatus"]

    if job_status in ['COMPLETED', 'FAILED', 'STOPPED']:
        print("job_status: " + job_status)
        break
    else:
        print("job_status: " + job_status)
        time.sleep(30)
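
The polling loop can be factored into a reusable helper with an upper bound on the waiting time, so that a stuck job does not block the notebook indefinitely. This is a sketch under assumptions: `wait_for_job` and its parameters are hypothetical, and the demo feeds it canned statuses instead of calling AWS.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call `get_status()` until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with canned statuses and a no-op sleep, so nothing actually waits.
canned = iter(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])
final_status = wait_for_job(lambda: next(canned), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `get_status` could be `lambda: comprehend.describe_topics_detection_job(JobId=job_id)["TopicsDetectionJobProperties"]["JobStatus"]`.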

In [None]:
# Use to debug if the job above fails
# comprehend.describe_topics_detection_job(JobId=job_id)

##### Example topics generated (`topic-terms.csv` inside `output.tar.gz`, which is written out to `DATA_BUCKET/TOPICS_OUTPUT_PREFIX` at the end of the job above)

|topic|term|weight|
| --- | --- | --- |
|000|film|0.07013403|
|000|positive|0.016125802|
|000|work|0.0061104186|
|000|performance|0.0053255544|
|000|great|0.009274145|
|000|feel|0.0068337116|
|000|story|0.009006502|
|000|director|0.0038702376|
|000|character|0.006694102|
|000|love|0.0063105687|
|001|br|0.14845546|
|001|show|0.003921953|
|001|game|0.0022336794|
|001|character|0.006027758|
|001|woman|0.0021741905|
|001|play|0.003455709|
|001|shakespeare|0.0012397129|
|001|write|0.002254727|
|001|episode|0.001209998|
|001|house|0.0016670021|
|002|movie|0.10096688|
|002|positive|0.02110654|
|002|great|0.012690482|
|002|story|0.0113774035|
|002|love|0.008182541|
|002|good|0.013601423|
|002|message|0.004564077|
|002|young|0.004554768|
|002|fan|0.0047151013|
|002|favorite|0.003373193|
|003|movie|0.08092111|
|003|bad|0.029862158|
|003|negative|0.029115958|
|003|wrong|0.014751226|
|003|rate|0.009013734|
|003|watch|0.01766251|
|003|plot|0.010764287|
|003|waste|0.0074185627|
|003|stupid|0.0071981|
|003|act|0.012075577|
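
A table like the one above can be loaded from the job output with the standard library alone. The sketch below reads `topic-terms.csv` out of a gzipped tar archive held in memory and groups the terms by topic, sorted by weight. `read_topic_terms` is a hypothetical helper; the member name at the archive root and the `topic,term,weight` header are assumptions based on the table above, and the demo archive is built locally from a few of its rows rather than downloaded from S3.

```python
import csv
import io
import tarfile
from collections import defaultdict


def read_topic_terms(tar_bytes, member_name="topic-terms.csv"):
    """Return {topic: [(term, weight), ...]} sorted by descending weight."""
    topics = defaultdict(list)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        member = archive.extractfile(member_name)
        reader = csv.DictReader(io.TextIOWrapper(member, encoding="utf-8"))
        for row in reader:
            topics[row["topic"]].append((row["term"], float(row["weight"])))
    return {
        topic: sorted(terms, key=lambda pair: pair[1], reverse=True)
        for topic, terms in topics.items()
    }


def make_demo_archive():
    """Build a small output.tar.gz look-alike in memory for the demo."""
    content = (
        "topic,term,weight\n"
        "000,film,0.07013403\n"
        "000,great,0.009274145\n"
        "003,bad,0.029862158\n"
        "003,movie,0.08092111\n"
    ).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="topic-terms.csv")
        info.size = len(content)
        archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


top_terms = read_topic_terms(make_demo_archive())
for topic, terms in top_terms.items():
    print(topic, terms)
```

In the notebook, the bytes would come from `s3.get_object(...)['Body'].read()` on the `output.tar.gz` object that the job writes under the output prefix; the exact key depends on the job and is not shown here.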