## Lab 1 -- Comprehend Lab
Text classification can be used to solve use cases such as sentiment analysis, spam detection, and hashtag prediction. This lab uses Amazon Comprehend: first the built-in sentiment, entity, and key-phrase detection APIs, which need no model training, and then a custom document classifier trained on labeled Airbnb reviews.

In [26]:
# IMPORT DEPENDENCIES
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import botocore
import os
import re
import logging
import time
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))


In [27]:
#Prepare for the lab by learning the current IAM Role and the S3 bucket we will use...

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

# REPLACE THE POD BELOW WITH THE NUMBER ASSIGNED TO YOU
pod = '9996'

# REPLACE THE REGION BELOW WITH THE REGION YOU ARE OPERATING IN (e.g. N. Virginia=us-east-1, Ohio=us-east-2)
region = 'us-east-1'

prefix = 'comprehend' # S3 prefix under which the dataset for the custom classifier is stored

bucket = "rga-aws-ai-workshop-pod-" + pod # Replace with your own bucket name if needed
print("The S3 bucket you will be using to upload your dataset is:", bucket)

arn:aws:iam::683164714817:role/service-role/AmazonSageMaker-ExecutionRole-20191016T110583
The S3 bucket you will be using to upload your dataset is: rga-aws-ai-workshop-pod-9996


## Instructions to Update the IAM Role for the SageMaker Notebook
The notebook's execution role needs access to:

- The S3 bucket we will create
- Comprehend
- SageMaker resources
- IAM, in order to create a new role for Comprehend (`DataAccessRoleArn`) that can access the S3 bucket that was created

In [28]:
# Run the first Airbnb review through the native Comprehend service.
client = boto3.client('comprehend')

# Let's first run the first Airbnb review through Comprehend's native sentiment analysis to see what's available
# without customization. Note that this is a simple API call with no ML training required.
# Feel free to re-run this cell with another review you would like to extract information from.
textreview = 'We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again.'
printmd("**Raw text Airbnb review:**")
print(textreview + '\n')

response = client.detect_sentiment(
    Text=textreview,
    LanguageCode='en'
)
printmd("**Comprehend native sentiment analysis:**")
print(json.dumps(response, indent=4))

#Note the ability to extract the overall sentiment from the reviews. Comprehend can provide: positive, negative,
#neutral, and mixed from the text with no training/tuning
#This can also be run from the Comprehend AWS Console interface


**Raw text Airbnb review:**

We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again.



**Comprehend native sentiment analysis:**

{
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Positive": 0.999503493309021,
        "Negative": 6.865667819511145e-05,
        "Neutral": 0.0004137569048907608,
        "Mixed": 1.4093851859797724e-05
    },
    "ResponseMetadata": {
        "RequestId": "e058c109-64ed-4b88-aebc-44864017599c",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "x-amzn-requestid": "e058c109-64ed-4b88-aebc-44864017599c",
            "content-type": "application/x-amz-json-1.1",
            "content-length": "165",
            "date": "Thu, 17 Oct 2019 14:29:17 GMT"
        },
        "RetryAttempts": 0
    }
}
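
The `SentimentScore` block above holds one confidence value per sentiment. As a minimal sketch (the helper name `top_sentiment` is hypothetical and not part of boto3), the dominant label and its score can be pulled out of any response shaped like the one shown:

```python
# Hypothetical helper, not part of boto3: it works on any dict shaped like
# the detect_sentiment response printed above.
def top_sentiment(response):
    """Return (label, score) for the highest-scoring sentiment."""
    scores = response["SentimentScore"]
    label = max(scores, key=scores.get)
    return label.upper(), scores[label]

# Sample values copied from the output above (ResponseMetadata omitted).
sample_response = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Positive": 0.999503493309021,
        "Negative": 6.865667819511145e-05,
        "Neutral": 0.0004137569048907608,
        "Mixed": 1.4093851859797724e-05,
    },
}

label, score = top_sentiment(sample_response)
print(label, round(score, 4))
```

In the lab, the same helper could be called on the `response` returned by `client.detect_sentiment`.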


In [29]:
# Next, let's use Comprehend to detect entities and key phrases for our Airbnb review...
response = client.detect_entities(
    Text=textreview,
    LanguageCode='en'
)
printmd("**Raw text Airbnb review:**")
print(textreview + '\n')

printmd("**Comprehend entities detected:**")
print(json.dumps(response, indent=4) + '\n')

response = client.detect_key_phrases(
    Text=textreview,
    LanguageCode='en'
)

printmd("**Raw text Airbnb review:**")
print(textreview + '\n')

printmd("**Comprehend key phrases detected:**")
print(json.dumps(response, indent=4))

#Make note of the different information that can be extracted from the review with no ML model training or tuning....


**Raw text Airbnb review:**

We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again.



**Comprehend entities detected:**

{
    "Entities": [
        {
            "Score": 0.9984429478645325,
            "Type": "LOCATION",
            "Text": "Chicago",
            "BeginOffset": 11,
            "EndOffset": 18
        },
        {
            "Score": 0.46189820766448975,
            "Type": "PERSON",
            "Text": "Hamilton",
            "BeginOffset": 26,
            "EndOffset": 34
        },
        {
            "Score": 0.9971076846122742,
            "Type": "PERSON",
            "Text": "Rob",
            "BeginOffset": 311,
            "EndOffset": 314
        },
        {
            "Score": 0.9949488639831543,
            "Type": "LOCATION",
            "Text": "Chicago",
            "BeginOffset": 454,
            "EndOffset": 461
        },
        {
            "Score": 0.9917650818824768,
            "Type": "PERSON",
            "Text": "Rob",
            "BeginOffset": 492,
            "EndOffset": 495
        }
    ],
    "ResponseMetadata": {
        "RequestId": "57386afc-bca

**Raw text Airbnb review:**

We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again.



**Comprehend key phrases detected:**

{
    "KeyPhrases": [
        {
            "Score": 0.9986284971237183,
            "Text": "Chicago",
            "BeginOffset": 11,
            "EndOffset": 18
        },
        {
            "Score": 0.5389164090156555,
            "Text": "Hamilton",
            "BeginOffset": 26,
            "EndOffset": 34
        },
        {
            "Score": 0.9982622265815735,
            "Text": "our teenaged daughter",
            "BeginOffset": 53,
            "EndOffset": 74
        },
        {
            "Score": 0.9282022714614868,
            "Text": "this location and neighborhood",
            "BeginOffset": 85,
            "EndOffset": 115
        },
        {
            "Score": 0.9946176409721375,
            "Text": "tons",
            "BeginOffset": 145,
            "EndOffset": 149
        },
        {
            "Score": 0.9767205715179443,
            "Text": "nearby places",
            "BeginOffset": 153,
            "EndOffset": 166
        },
        {
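
Each entity and key phrase in the responses above carries a `Score`. The following is a minimal sketch, assuming a list shaped like the `Entities` array in the `detect_entities` output shown earlier, that keeps only confident entities and groups them by type. The helper name `entities_by_type` and the 0.9 threshold are illustrative choices, not Comprehend defaults.

```python
from collections import defaultdict

def entities_by_type(entities, min_score=0.9):
    """Group unique entity texts by type, dropping low-confidence detections."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["Score"] < min_score:
            continue
        if entity["Text"] not in grouped[entity["Type"]]:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)

# Values copied from the entities output above (offsets omitted).
sample_entities = [
    {"Score": 0.9984429478645325, "Type": "LOCATION", "Text": "Chicago"},
    {"Score": 0.46189820766448975, "Type": "PERSON", "Text": "Hamilton"},
    {"Score": 0.9971076846122742, "Type": "PERSON", "Text": "Rob"},
    {"Score": 0.9949488639831543, "Type": "LOCATION", "Text": "Chicago"},
    {"Score": 0.9917650818824768, "Type": "PERSON", "Text": "Rob"},
]

confident = entities_by_type(sample_entities)
print(confident)
```

With this threshold, "Hamilton" is dropped because its score (about 0.46) is well below the others.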
          

In [30]:
# CREATE THE BUCKET IN YOUR AWS ACCOUNT to be used for the dataset
from botocore.exceptions import ClientError  # required by the except clause below

def create_bucket(bucket_name, region=None):
    """Create an S3 bucket in a specified region

    If a region is not specified, the bucket is created in the S3 default
    region (us-east-1).

    :param bucket_name: Bucket to create
    :param region: String region to create bucket in, e.g., 'us-west-2'
    :return: True if bucket created, else False
    """

    # Create bucket
    try:
        if (region is None) or (region == 'us-east-1'):
            s3_client = boto3.client('s3')
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client = boto3.client('s3', region_name=region)
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name,
                                    CreateBucketConfiguration=location)
    except ClientError as e:
        logging.error(e)
        return False
    # Report the bucket that was actually created (the parameter, not the global)
    print(bucket_name + ' bucket created successfully!')
    return True

created = create_bucket(bucket, region)

rga-aws-ai-workshop-pod-9996 bucket created successfully!


In [31]:
#Inspect the dataset. We are reviewing only the first two entries.
printmd("**Comprehend working directory contents:**")
!ls comprehend/ -la
!pwd

printmd("**First 2 entries from the training dataset:**")
!head comprehend/airbnb-reviews-training.csv -n 2

# We are assuming that the data transformations have occurred upstream and that the data is already formatted as
# required by Comprehend. The data has two columns: the first is the custom label ('great' or 'notgreat'),
# the second contains the raw review. This is the format required for the Comprehend custom classifier.


**Comprehend working directory contents:**

total 772
drwxrwxr-x 3 ec2-user ec2-user   4096 Oct 16 20:40 .
drwxrwxr-x 6 ec2-user ec2-user   4096 Oct 17 14:28 ..
-rw-rw-r-- 1 ec2-user ec2-user  14880 Oct 16 19:21 airbnb-reviews-holdout.csv
-rw-rw-r-- 1 ec2-user ec2-user 759203 Oct 16 19:21 airbnb-reviews-training.csv
drwxrwxr-x 2 ec2-user ec2-user   4096 Oct 16 20:00 .ipynb_checkpoints
/home/ec2-user/SageMaker/rga-aws-ai-workshop


**First 2 entries from the training dataset:**

﻿notgreat,We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again
notgreat,"Chris' place is really lovely! Plenty of space for 2 people, with lots of thoughtful touches like biscuits, bottled water, tea and coffee etc! The kitchen was well equipped to make meals, and Chris quickly provided a can opener when we asked to borrow one. The washer and dryer in the apartment were also very handy. Good location in a residential neighbourhood, about 10-15 min walk to the local Aldi and 10 mins to shops, including the currency exchange to get bus tickets. 
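
The two-column layout described above (label first, review text second) can be checked before uploading. The following is a minimal sketch using only the standard library. The function name `label_counts` and the inline sample rows are illustrative, and the `utf-8-sig` handling is an assumption based on the byte-order mark visible at the start of the training file.

```python
import csv
import io
from collections import Counter

def label_counts(lines):
    """Count labels in 'label,text' CSV rows and reject malformed rows."""
    counts = Counter()
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 2 or not row[0].strip() or not row[1].strip():
            raise ValueError(f"row {line_number} is not 'label,text'")
        counts[row[0].strip()] += 1
    return counts

# Illustrative rows; the leading \ufeff mimics the byte-order mark in the real file.
sample_csv = '\ufeffnotgreat,"Clean, spacious condo."\ngreat,Best stay ever\n'
# Decoding with utf-8-sig strips the byte-order mark so it does not leak into the first label.
decoded = sample_csv.encode("utf-8").decode("utf-8-sig")

counts = label_counts(io.StringIO(decoded))
print(counts)
```

For the real file, the same function could be fed `open('comprehend/airbnb-reviews-training.csv', newline='', encoding='utf-8-sig')`.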

In [32]:
# UPLOAD THE DATASET TO S3 FROM THE LOCAL SYSTEM FOR COMPREHEND CUSTOM TRAINING
from botocore.exceptions import ClientError  # required by the except clause below

def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    print(file_name + ' object uploaded successfully!')
    return True

for file_name in ('comprehend/airbnb-reviews-training.csv', 'comprehend/airbnb-reviews-holdout.csv'):
    upload_file(file_name, bucket)


comprehend/airbnb-reviews-training.csv object uploaded successfully!
comprehend/airbnb-reviews-holdout.csv object uploaded successfully!


## Create IAM Role for Comprehend to read S3 data from your bucket

**Create Data Access Role for Comprehend custom classifier job:**


**Create Policy:**

`
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": [
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject"
        ],
        "Resource": [
            "arn:aws:s3:::UPDATEBUCKETNAMEHERE",
            "arn:aws:s3:::UPDATEBUCKETNAMEHERE/*"
        ]
    }
}
`



In [33]:
# CREATE THE IAM POLICY & ROLE FOR COMPREHEND TO USE TO TRAIN THE CUSTOM MODEL
# CREATE IAM POLICY
policyDocumentStr = '''
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": [
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject"
        ],
        "Resource": [
            "arn:aws:s3:::%s",
            "arn:aws:s3:::%s/*"
        ]
    }
}
'''%(bucket, bucket)  # bucket ARN for s3:ListBucket, bucket/* for the object-level actions
pattern = re.compile(r'[\s\r\n]+')
policyDocumentStr = re.sub(pattern, '', policyDocumentStr)

client = boto3.client('iam')
response = client.create_policy(
    PolicyName= bucket + '-policy',
    PolicyDocument=policyDocumentStr,
    Description='IAM permissions policy for Comprehend access to S3 bucket for RGA AWS AI Workshop'
)
print('The ' + response['Policy']['PolicyName'] + ' IAM policy was created')
policyArn= response['Policy']['Arn']

# CREATE IAM ROLE
trustPolicyDocumentStr = '''
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "comprehend.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}'''
pattern = re.compile(r'[\s\r\n]+')
trustPolicyDocumentStr = re.sub(pattern, '', trustPolicyDocumentStr)

response = client.create_role(
    RoleName= bucket + '-role',
    AssumeRolePolicyDocument= trustPolicyDocumentStr,
    Description='IAM role for Comprehend access to S3 bucket for RGA AWS AI Workshop'
)
print('The ' + response['Role']['RoleName'] + ' IAM role was created')
dataaccessarn=response['Role']['Arn']

# ASSIGN POLICIES TO ROLE
response = client.attach_role_policy(
    RoleName= bucket + '-role',
    PolicyArn= policyArn
)
response = client.attach_role_policy(
    RoleName= bucket + '-role',
    PolicyArn= 'arn:aws:iam::aws:policy/service-role/ComprehendDataAccessRolePolicy'
)


The rga-aws-ai-workshop-pod-9996-policy IAM policy was created
The rga-aws-ai-workshop-pod-9996-role IAM role was created
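
The cell above fills the bucket name into a JSON string and then strips whitespace with a regular expression. An alternative sketch, assuming the same three S3 actions as the policy above, builds the document as a Python dict and serializes it with `json.dumps`, which guarantees valid JSON. The function name `build_bucket_policy` is hypothetical, and listing both the bucket ARN and the `/*` object ARN is an assumption about the intended scope (bucket-level `s3:ListBucket`, object-level `s3:GetObject` and `s3:PutObject`).

```python
import json

def build_bucket_policy(bucket_name):
    """Return a compact JSON policy document granting access to one bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",    # the bucket itself
                f"arn:aws:s3:::{bucket_name}/*",  # every object in the bucket
            ],
        },
    }
    # Compact separators produce output without any whitespace.
    return json.dumps(policy, separators=(",", ":"))

policy_json = build_bucket_policy("rga-aws-ai-workshop-pod-9996")
print(policy_json)
```

The resulting string could be passed as `PolicyDocument` to `create_policy` in place of `policyDocumentStr`.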


In [34]:
# Instantiate Boto3 SDK:
client = boto3.client('comprehend', region_name=region)
s3_uri = 's3://' + bucket + '/' + prefix + '/' + 'airbnb-reviews-training.csv'

#UPDATE THE ARN FROM THE ROLE YOU CREATED IN THE PREVIOUS SECTION
#account_id = role.split(':')[4]
#dataaccessarn='arn:aws:iam::12345678901234:role/rga-aws-ai-workshop-pod-xxxx'
                    
docclassifier='RGAAirbnb-' + pod

# Create a document classifier
create_response = client.create_document_classifier(
    InputDataConfig={
        'S3Uri': s3_uri
    },
    DataAccessRoleArn=dataaccessarn,
    DocumentClassifierName=docclassifier,
    LanguageCode='en'
)
printmd("**Create response output**")
print(str(create_response) + "\n")

# Check the status of the classifier
docclassifierarn = create_response['DocumentClassifierArn']
describe_response = client.describe_document_classifier(
    DocumentClassifierArn=docclassifierarn)
printmd("**Describe response output:**")
print(str(describe_response))


**Create response output**

{'DocumentClassifierArn': 'arn:aws:comprehend:us-east-1:683164714817:document-classifier/RGAAirbnb-9996', 'ResponseMetadata': {'RequestId': '8e44b531-e573-47ec-9562-4035a517df6e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '8e44b531-e573-47ec-9562-4035a517df6e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '104', 'date': 'Thu, 17 Oct 2019 14:30:32 GMT'}, 'RetryAttempts': 0}}



**Describe response output:**

{'DocumentClassifierProperties': {'DocumentClassifierArn': 'arn:aws:comprehend:us-east-1:683164714817:document-classifier/RGAAirbnb-9996', 'LanguageCode': 'en', 'Status': 'SUBMITTED', 'SubmitTime': datetime.datetime(2019, 10, 17, 14, 30, 32, 534000, tzinfo=tzlocal()), 'InputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9996/comprehend/airbnb-reviews-training.csv'}, 'OutputDataConfig': {}, 'DataAccessRoleArn': 'arn:aws:iam::683164714817:role/rga-aws-ai-workshop-pod-9996-role'}, 'ResponseMetadata': {'RequestId': '4dfd15a5-d78a-42a7-874b-620325a5f5f6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '4dfd15a5-d78a-42a7-874b-620325a5f5f6', 'content-type': 'application/x-amz-json-1.1', 'content-length': '441', 'date': 'Thu, 17 Oct 2019 14:30:32 GMT'}, 'RetryAttempts': 0}}


In [35]:
# Check the status of the classifier
# Look for a status of TRAINED before moving to the next step
describe_response = client.describe_document_classifier(DocumentClassifierArn=docclassifierarn)
printmd("**Document classifier status (Wait for the status to show TRAINED):**")
status = describe_response['DocumentClassifierProperties']['Status']
print(status)
# Poll only while training is pending or running, so a failed run (e.g. IN_ERROR) does not loop forever
while status in ('SUBMITTED', 'TRAINING'):
    time.sleep(20)
    describe_response = client.describe_document_classifier(DocumentClassifierArn=docclassifierarn)
    status = describe_response['DocumentClassifierProperties']['Status']
    print(status)

printmd("**Custom Model Accuracy:**")
if status == 'TRAINED':
    print(str(describe_response['DocumentClassifierProperties']['ClassifierMetadata']['EvaluationMetrics']))
else:
    # Evaluation metrics are only expected once training has succeeded
    print(describe_response['DocumentClassifierProperties'].get('Message', 'Training did not complete'))

# Alternatively, review the Comprehend AWS Console to check whether your custom classifier is still training

**Document classifier status (Wait for the status to show TRAINED):**

SUBMITTED
SUBMITTED
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINING
TRAINED


**Custom Model Accuracy:**

{'Accuracy': 0.9055, 'Precision': 0.4528, 'Recall': 0.5, 'F1Score': 0.4752}
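
The metrics above can be cross-checked against each other. As a minimal sketch, the standard F1 formula (the harmonic mean of precision and recall) applied to the reported precision and recall reproduces the reported `F1Score`. How Comprehend aggregates these metrics across the two classes is not shown in the output, so this is only a consistency check, not a statement of how the service computes them.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation metrics printed above.
reported = {'Accuracy': 0.9055, 'Precision': 0.4528, 'Recall': 0.5, 'F1Score': 0.4752}

f1 = f1_score(reported['Precision'], reported['Recall'])
print(round(f1, 4))
```

Note that the accuracy (0.9055) is much higher than the precision, recall, and F1 values, so it is worth looking beyond accuracy when judging this classifier.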


In [38]:
client = boto3.client('comprehend', region_name=region)
s3_uri_in = 's3://' + bucket + '/' + prefix + '/' + 'airbnb-reviews-holdout.csv'
s3_uri_out = 's3://' + bucket + '/' + prefix + '/' + 'output'
jobname='RGAAirbnb-Job-' + pod

start_response = client.start_document_classification_job(
    InputDataConfig={
        'S3Uri': s3_uri_in,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_uri_out
    },
    DataAccessRoleArn=dataaccessarn,
    DocumentClassifierArn=docclassifierarn,
    JobName=jobname
)

printmd("**Start response output:**")
print(str(start_response) + '\n')

# Check the status of the job
describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])
printmd("**Describe response output:**")
print(describe_response)


**Start response output:**

{'JobId': '968b3307ae4833cf7f2f2f548b7c2457', 'JobStatus': 'SUBMITTED', 'ResponseMetadata': {'RequestId': 'b61f0f4f-498e-4e4d-bdf8-d7d7f1135131', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'b61f0f4f-498e-4e4d-bdf8-d7d7f1135131', 'content-type': 'application/x-amz-json-1.1', 'content-length': '68', 'date': 'Thu, 17 Oct 2019 14:56:59 GMT'}, 'RetryAttempts': 0}}



**Describe response output:**

{'DocumentClassificationJobProperties': {'JobId': '968b3307ae4833cf7f2f2f548b7c2457', 'JobName': 'RGAAirbnb-Job-9996', 'JobStatus': 'SUBMITTED', 'SubmitTime': datetime.datetime(2019, 10, 17, 14, 57, 0, 535000, tzinfo=tzlocal()), 'DocumentClassifierArn': 'arn:aws:comprehend:us-east-1:683164714817:document-classifier/RGAAirbnb-9996', 'InputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9996/comprehend/airbnb-reviews-holdout.csv', 'InputFormat': 'ONE_DOC_PER_LINE'}, 'OutputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9996/comprehend/output/683164714817-CLN-968b3307ae4833cf7f2f2f548b7c2457/output/output.tar.gz'}, 'DataAccessRoleArn': 'arn:aws:iam::683164714817:role/rga-aws-ai-workshop-pod-9996-role'}, 'ResponseMetadata': {'RequestId': '49f2eb3a-b9a6-4c29-a9e3-aef211b19919', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '49f2eb3a-b9a6-4c29-a9e3-aef211b19919', 'content-type': 'application/x-amz-json-1.1', 'content-length': '648', 'date': 'Thu, 17 Oct 2019 14:56:59 

In [None]:
# Check the status of the job.
# When JobStatus is COMPLETED you can move to the next step.
describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])

printmd("**Job training status (Wait for the status to show COMPLETED):**")
job_status = describe_response['DocumentClassificationJobProperties']['JobStatus']
print(job_status)

# Poll only while the job is pending or running, so a FAILED job does not loop forever
while job_status in ('SUBMITTED', 'IN_PROGRESS'):
    time.sleep(20)
    describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])
    job_status = describe_response['DocumentClassificationJobProperties']['JobStatus']
    print(job_status)

**Job training status (Wait for the status to show COMPLETED):**

IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
COMPLETED
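
Both polling cells in this lab follow the same pattern: call a describe API, sleep, and repeat until the status settles. A minimal, generic sketch of that pattern with an upper bound on the number of polls is shown below. The helper name `wait_for_status`, its parameters, and the fake status sequence are all hypothetical, chosen so the example runs without AWS access.

```python
import time

def wait_for_status(fetch_status, pending_states, poll_seconds=20, max_polls=90, sleep=time.sleep):
    """Poll fetch_status() until it leaves pending_states or max_polls is reached."""
    status = fetch_status()
    polls = 1
    while status in pending_states:
        if polls >= max_polls:
            raise TimeoutError(f"still {status} after {polls} polls")
        sleep(poll_seconds)
        status = fetch_status()
        polls += 1
    return status

# Stand-in for a describe_* call so the sketch runs offline.
fake_statuses = iter(["SUBMITTED", "IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])

final_status = wait_for_status(
    fetch_status=lambda: next(fake_statuses),
    pending_states={"SUBMITTED", "IN_PROGRESS"},
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(final_status)
```

For the job above, `fetch_status` could wrap `client.describe_document_classification_job(JobId=start_response['JobId'])` and return the `JobStatus` field. Because the loop stops on any non-pending status, the caller should still check that the returned value is the expected one.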


In [41]:
output_s3 = describe_response['DocumentClassificationJobProperties']['OutputDataConfig']['S3Uri']
#print(output_s3)

printmd("**Download predictions from S3 to local filesystem:**")
!aws s3 cp {output_s3} comprehend/output.tar.gz

printmd("**Uncompress predictions:**")
!tar -xvzf comprehend/output.tar.gz -C comprehend

printmd("**Review groundtruth (first 20):**")
!head -n 20 comprehend/airbnb-reviews-holdout-groundtruth.csv

printmd("**Review predictions (first 20):**")
!head -n 20 comprehend/predictions.jsonl

**Download predictions from S3 to local filesystem:**

Completed 433 Bytes/433 Bytes (860 Bytes/s) with 1 file(s) remainingdownload: s3://rga-aws-ai-workshop-pod-9996/comprehend/output/683164714817-CLN-968b3307ae4833cf7f2f2f548b7c2457/output/output.tar.gz to comprehend/output.tar.gz


**Uncompress predictions:**

predictions.jsonl


**Review groundtruth (first 20):**

﻿notgreat,"We have had an incredible trip! Rod’s apartment was perfect. His description and photos of the apartment painted an incredible picture and it still exceeded our expectations. The location couldn’t have been better as we were just a few steps from public transportation and a market for groceries was right at the corner. - - Even when we came inside to escape the rain, we had a spectacular view of the city. The whole building was super clean, comfortable, stylish, and safe. Rod was always super quick to respond to any questions we had and gave us great recommendations for local restaurants. We would love to stay here again next time we’re in Chicago."
notgreat,"Just like everyone else said, the view is spectacular. This apt is sparkling clean and breathtakingly beautiful. I was in Chicago for a work meeting and loved coming back in the evenings to this comfortable, homely, beautiful apt. I had no trouble getting in each time.. the women at the front desk were all friendly (I

**Review predictions (first 20):**

{"File": "airbnb-reviews-holdout.csv", "Line": "0", "Classes": [{"Name": "notgreat", "Score": 0.8848}, {"Name": "great", "Score": 0.1152}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "1", "Classes": [{"Name": "notgreat", "Score": 0.8371}, {"Name": "great", "Score": 0.1629}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "2", "Classes": [{"Name": "notgreat", "Score": 0.9114}, {"Name": "great", "Score": 0.0886}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "3", "Classes": [{"Name": "notgreat", "Score": 0.897}, {"Name": "great", "Score": 0.103}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "4", "Classes": [{"Name": "notgreat", "Score": 0.8981}, {"Name": "great", "Score": 0.1019}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "5", "Classes": [{"Name": "notgreat", "Score": 0.8554}, {"Name": "great", "Score": 0.1446}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "6", "Classes": [{"Name": "notgreat", "Score": 0.8696}, {"Name": "great", "Score": 0.1304}]}
{"File": "airbn
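
Each line of `predictions.jsonl` is a standalone JSON object with one score per class. The following is a minimal sketch, assuming records shaped like the output above, that reduces each record to its highest-scoring class. The function name `top_classes` is hypothetical.

```python
import json

def top_classes(jsonl_lines):
    """Return (line_number, class_name, score) for the best class of each record."""
    results = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        best = max(record["Classes"], key=lambda c: c["Score"])
        results.append((int(record["Line"]), best["Name"], best["Score"]))
    return results

# First two records copied from the predictions output above.
sample_lines = [
    '{"File": "airbnb-reviews-holdout.csv", "Line": "0", "Classes": [{"Name": "notgreat", "Score": 0.8848}, {"Name": "great", "Score": 0.1152}]}',
    '{"File": "airbnb-reviews-holdout.csv", "Line": "1", "Classes": [{"Name": "notgreat", "Score": 0.8371}, {"Name": "great", "Score": 0.1629}]}',
]

for line_number, name, score in top_classes(sample_lines):
    print(line_number, name, score)
```

To process the downloaded file, the open file object for `comprehend/predictions.jsonl` could be passed in place of `sample_lines`.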

## Lab 2 -- SageMaker model training & delivery using a native algorithm (BlazingText)

Text classification can be used to solve use cases such as sentiment analysis, spam detection, and hashtag prediction. This notebook demonstrates the use of SageMaker BlazingText to perform supervised binary or multi-class text classification with single or multiple labels. BlazingText can train the model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels.

## Setup

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [None]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import botocore
import os
import re
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))


sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

#pod = '9998'
#region = 'us-east-1'
prefix = 'blazingtext' # S3 prefix (and local directory) holding the BlazingText dataset

bucket = "rga-aws-ai-workshop-pod-" + pod # Replace with your own bucket name if needed
print(bucket)

### Data Preparation

Now we'll download a dataset from the web on which we want to train the text classification model. BlazingText expects a single preprocessed text file with space separated tokens and each line of the file should contain a single sentence and the corresponding label(s) prefixed by "\__label\__".

In this example, let us train the text classification model on the [DBPedia Ontology Dataset](https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014#2) as done by [Zhang et al](https://arxiv.org/pdf/1509.01626.pdf). The DBpedia ontology dataset is constructed by picking 14 nonoverlapping classes from DBpedia 2014. It has 560,000 training samples and 70,000 testing samples. The fields we used for this dataset contain title and abstract of each Wikipedia article. 
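
To make the expected input format concrete, here is a minimal sketch that turns a label and a sentence into one BlazingText training line. The function name `to_blazingtext_line` and the sample sentence are illustrative, and `str.split()` is used as a simple stand-in for the `nltk` tokenizer that this notebook uses for the real preprocessing.

```python
def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> token token ...'."""
    # str.split() is a simplification; the notebook itself tokenizes with nltk.
    tokens = text.lower().split()
    return " ".join(["__label__" + label] + tokens)

line = to_blazingtext_line("great", "Rob was a great host")
print(line)
```

Each line of the training file has this shape: the prefixed label first, followed by the space-separated tokens of the sentence.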

In [None]:

labels_file = "classes.txt"
train_file = "train.csv"
test_file = "test.csv"

!ls {prefix} -la


Let us inspect the dataset and the classes to understand how the data and the labels are provided in the dataset.

In [None]:
!head {prefix}/train.csv -n 3

As can be seen from the above output, the CSV has 3 fields - Label index, title and abstract. Let us first create a label index to label name mapping and then proceed to preprocess the dataset for ingestion by BlazingText.

Next we will print the labels file (`classes.txt`) to see all possible labels followed by creating an index to label mapping.

In [None]:
!cat {prefix}/classes.txt


The following code creates the mapping from integer indices to class label which will later be used to retrieve the actual class name during inference. 

In [None]:
#index_to_label = {} 
#with open("dbpedia_csv/classes.txt") as f:
#    for i,label in enumerate(f.readlines()):
#        index_to_label[str(i+1)] = label.strip()
#print(index_to_label)

index_to_label = {} 
with open("blazingtext/classes.txt", mode='r', encoding='utf-8-sig') as f:
    for i,label in enumerate(f.readlines()):
        index_to_label[str(i)] = label.strip()
print(index_to_label)

## Data Preprocessing
We need to preprocess the training data into **space-separated tokenized text** format, which can be consumed by the `BlazingText` algorithm. Also, as mentioned previously, the class label(s) should be prefixed with `__label__` and appear on the same line as the original sentence. We'll use the `nltk` library to tokenize the input sentences from the DBPedia dataset.

Download the nltk tokenizer and other libraries:

In [None]:
from random import shuffle
import multiprocessing
from multiprocessing import Pool
import csv
import nltk
nltk.download('punkt')

In [None]:
def transform_instance(row):
    cur_row = []
    label = "__label__" + index_to_label[row[0]]  #Prefix the index-ed label with __label__
    cur_row.append(label)
    cur_row.extend(nltk.word_tokenize(row[1].lower()))
    cur_row.extend(nltk.word_tokenize(row[2].lower()))
    return cur_row

The `transform_instance` function will be applied to each data instance in parallel using Python's multiprocessing module.

In [None]:
def preprocess(input_file, output_file, keep=1):
    all_rows = []
    with open(input_file, mode='r', encoding='utf-8-sig') as csvinfile:
        csv_reader = csv.reader(csvinfile, delimiter=',')
        for row in csv_reader:
            all_rows.append(row)
    shuffle(all_rows)
    all_rows = all_rows[:int(keep*len(all_rows))]
    pool = Pool(processes=multiprocessing.cpu_count())
    transformed_rows = pool.map(transform_instance, all_rows)
    pool.close() 
    pool.join()
    
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        csv_writer.writerows(transformed_rows)
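
`preprocess` shuffles with the module-level `shuffle`, so a `keep` value below 1 selects a different subset of rows on every run. If a repeatable subset is wanted, one option is a seeded local generator. This is a minimal sketch; the function name `subsample`, the seed value, and the sample rows are hypothetical and not part of the notebook.

```python
import random

def subsample(rows, keep=1.0, seed=42):
    """Shuffle a copy of rows reproducibly and keep the leading fraction."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[:int(keep * len(shuffled))]

# Illustrative rows with the same three-field shape as the CSV (label index, title, abstract).
rows = [[str(i % 2), f"title {i}", f"abstract {i}"] for i in range(10)]

first = subsample(rows, keep=0.2)
second = subsample(rows, keep=0.2)
print(len(first), first == second)
```

The same seed always yields the same subset, which makes results from different runs easier to compare.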

In [None]:
%%time

# Preparing the training dataset

# keep=1 preprocesses the complete training dataset. If that takes too long,
# lower keep (for example 0.2) to use only a random fraction of the rows.
preprocess('blazingtext/train.csv', 'blazingtext/airbnb.train', keep=1)

# Preparing the validation dataset
preprocess('blazingtext/test.csv', 'blazingtext/airbnb.validation')

The data preprocessing cell might take a minute to run. After the data preprocessing is complete, we need to upload it to S3 so that it can be consumed by SageMaker to execute training jobs. We'll use Python SDK to upload these two files to the bucket and prefix location that we have set above.   

In [None]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path='blazingtext/airbnb.train', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='blazingtext/airbnb.validation', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)

Next we need to set up an output location in S3 where the model artifacts will be stored. These artifacts are the output of the algorithm's training job.

In [None]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

## Training
Now that we are done with all the setup that is needed, we are ready to train our text classifier. To begin, let us create a `sagemaker.estimator.Estimator` object. This estimator will launch the training job.

In [None]:
region_name = boto3.Session().region_name

In [None]:
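# get_image_uri is the SageMaker Python SDK v1 call; in SDK v2 and later the counterpart is
# sagemaker.image_uris.retrieve (framework "blazingtext" with the same region).
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")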
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

## Training the BlazingText model for supervised text classification

Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using negative sampling, on CPUs and additionally on GPUs. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354).




Besides skip-gram and CBOW, SageMaker BlazingText also supports the "Batch Skipgram" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more.

BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

To summarize, the following modes are supported by BlazingText on different types of instances:

|          Modes         	| cbow (supports subwords training) 	| skipgram (supports subwords training) 	| batch_skipgram 	| supervised |
|:----------------------:	|:----:	|:--------:	|:--------------:	| :--------------:	|
|   Single CPU instance  	|   ✔  	|     ✔    	|        ✔       	|  ✔  |
|   Single GPU instance  	|   ✔  	|     ✔    	|                	|  ✔ (Instance with 1 GPU only)  |
| Multiple CPU instances 	|      	|          	|        ✔       	|     |
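
The table above can also be written down as a small lookup, which is handy for checking a planned configuration before launching a job. This is only a sketch: the names `SUPPORTED_MODES` and `is_supported` and the setup keys are hypothetical and are not part of the SageMaker SDK.

```python
# Mode support per instance setup, transcribed from the table above.
SUPPORTED_MODES = {
    "single_cpu": {"cbow", "skipgram", "batch_skipgram", "supervised"},
    "single_gpu": {"cbow", "skipgram", "supervised"},  # supervised: 1 GPU only
    "multi_cpu": {"batch_skipgram"},
}


def is_supported(mode, instance_setup):
    """Return True if the table lists `mode` for the given instance setup."""
    return mode in SUPPORTED_MODES.get(instance_setup, set())


print(is_supported("supervised", "single_cpu"))
print(is_supported("supervised", "multi_cpu"))
```

In this lab we use "supervised" mode on a single CPU instance, which the table lists as supported.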

Now, let's define the SageMaker `Estimator` with resource configurations and hyperparameters to train text classification on the Airbnb review dataset, using "supervised" mode on an `ml.m5.large` instance.


In [None]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.m5.large',
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.

In [None]:
bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)
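
To make the interaction between `early_stopping`, `patience`, and `min_epochs` more concrete, here is a minimal sketch of patience-based early stopping on validation accuracy. It is an illustration under assumptions, not BlazingText's internal implementation: the function name `stopping_epoch` is hypothetical, and the exact rule the algorithm applies may differ in detail.

```python
def stopping_epoch(val_accuracies, min_epochs=5, patience=4):
    """Return the 1-based epoch at which training would stop.

    Assumed rule: the early stopping logic only becomes active at
    `min_epochs`, and training stops once validation accuracy has failed
    to improve for `patience` consecutive epochs.
    """
    best = float("-inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if epoch < min_epochs:
            continue  # early stopping logic not active yet
        if acc > best:
            best = acc
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    # Never triggered: training runs for all configured epochs.
    return len(val_accuracies)


# Made-up accuracies for a 10-epoch run, for illustration only.
history = [0.70, 0.75, 0.80, 0.82, 0.85, 0.84, 0.85, 0.83, 0.84, 0.86]
print(stopping_epoch(history))
```

With `epochs=10`, `min_epochs=5`, and `patience=4` as configured above, stopping can only cut the run short if accuracy stalls soon after epoch 5.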

Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [None]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}
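
Under the hood, each channel corresponds to one entry in the `InputDataConfig` list of the SageMaker `CreateTrainingJob` API. The sketch below builds that structure by hand to show what the settings above map to. The helper name `channel_config` and the bucket name `example-bucket` are hypothetical, and the SDK may add further fields when it submits the job.

```python
def channel_config(name, s3_uri, content_type="text/plain"):
    """Build one InputDataConfig entry mirroring the s3_input settings above."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
    }


# 'example-bucket' is a placeholder, not the workshop bucket.
input_data_config = [
    channel_config("train", "s3://example-bucket/comprehend/train"),
    channel_config("validation", "s3://example-bucket/comprehend/validation"),
]
print([channel["ChannelName"] for channel in input_data_config])
```

`FullyReplicated` means every training instance receives a full copy of the data, which is the natural choice for the single-instance job used here.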

We now have an `Estimator` object with its hyperparameters set, and data channels linked to the algorithm. The only remaining step is to train it with the command below. Training involves a few steps. First, the instance requested when creating the `Estimator` is provisioned and set up with the appropriate libraries. Then the data from our channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data download take some time, depending on the size of the data, so it might be a few minutes before training logs start to appear. After the job has run `min_epochs` epochs, the logs also report accuracy on the validation data for every epoch. This metric is a proxy for the quality of the model.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 bucket that was set up as `output_path` in the estimator.

In [None]:
bt_model.fit(inputs=data_channels, logs=True)

## Hosting / Inference
Once training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because endpoints stay up and running for a long time, it's advisable to choose a cheaper instance for inference.

In [None]:
text_classifier = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m5.large')

#### Use JSON format for inference
BlazingText supports `application/json` as the content type for inference. The payload passed to the endpoint should contain a list of sentences under the key "**instances**".

In [None]:
#!head test.csv -n 3

all_rows = []
with open('blazingtext/test.csv', mode='r', encoding='utf-8-sig') as csvinfile:
    csv_reader = csv.reader(csvinfile, delimiter=',')
    for row in csv_reader:
        all_rows.append(row)
#all_rows = all_rows[:int(keep*len(all_rows))]
#print(all_rows)
#pool = Pool(processes=multiprocessing.cpu_count())
#transformed_rows = pool.map(transform_instance, all_rows)
#pool.close() 
#pool.join()

#print(all_rows[0][0])
#print(all_rows[0][2])



#sentences = ["Convair was an american aircraft manufacturing company which later expanded into rockets and spacecraft.",
#            "Berwick secondary college is situated in the outer melbourne metropolitan suburb of berwick ."]

# The test data consists of 30 sentences (0-29), select a sentence to validate the prediction
sentence = 29

sentences = [all_rows[sentence][2]]

printmd("**Raw Sentence:**")
print(sentences)

printmd("**Groundtruth (0=Not Great, 1=Great):**")
print(all_rows[sentence][0])

# using the same nltk tokenizer that we used during data preparation for training
tokenized_sentences = [' '.join(nltk.word_tokenize(sent)) for sent in sentences]

payload = {"instances" : tokenized_sentences}

response = text_classifier.predict(json.dumps(payload))

predictions = json.loads(response)

printmd("**Prediction:**")
print(json.dumps(predictions, indent=2))

By default, the model will return only one prediction, the one with the highest probability. For retrieving the top k predictions, you can set `k` in the configuration as shown below:

In [None]:
payload = {"instances" : tokenized_sentences,
          "configuration": {"k": 2}}

response = text_classifier.predict(json.dumps(payload))

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))
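
The response is a JSON list with one entry per input sentence, where each entry holds parallel `label` and `prob` lists. A minimal sketch of turning such an entry into readable (name, probability) pairs follows. The helper `summarize` and the `LABEL_NAMES` mapping are hypothetical, and the sample response is made up for illustration rather than copied from a real endpoint call.

```python
import json

LABEL_PREFIX = "__label__"
# Assumed mapping, following the ground-truth legend used earlier.
LABEL_NAMES = {"0": "Not Great", "1": "Great"}


def summarize(prediction):
    """Return (class name, probability) pairs, most probable first."""
    pairs = []
    for label, prob in zip(prediction["label"], prediction["prob"]):
        # Strip the '__label__' prefix that BlazingText puts on class ids.
        class_id = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        pairs.append((LABEL_NAMES.get(class_id, class_id), prob))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Illustrative response body for k=2; the values are invented.
sample_response = '[{"label": ["__label__1", "__label__0"], "prob": [0.93, 0.07]}]'
for prediction in json.loads(sample_response):
    print(summarize(prediction))
```

In the notebook, the same helper could be applied to each element of `predictions`.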

### Stop / Close the Endpoint (Optional)
Finally, we should delete the endpoint before we close the notebook, unless we need to keep it running to serve real-time predictions. The Resource Cleanup section below removes it.

## Lab 3 -- BERT Custom Model Lab

In [None]:
# COPY CUSTOMER CODE

## Resource Cleanup
Let's remove the resources we've used during this lab...

In [None]:
# REMOVE THE SAGEMAKER ENDPOINT
sess.delete_endpoint(text_classifier.endpoint)

In [49]:
# DELETE REMAINING RESOURCES
# S3 Bucket
!aws s3 rb s3://{bucket} --force

# Comprehend Classifier
client = boto3.client('comprehend')
response = client.delete_document_classifier(
    DocumentClassifierArn=docclassifierarn
)

# REMOVE IAM ROLE
client = boto3.client('iam')
response = client.detach_role_policy(
    RoleName= bucket + '-role',
    PolicyArn= policyArn
)
response = client.detach_role_policy(
    RoleName= bucket + '-role',
    PolicyArn= 'arn:aws:iam::aws:policy/service-role/ComprehendDataAccessRolePolicy'
)
response = client.delete_role(
    RoleName= bucket + '-role'
)

# REMOVE IAM POLICY
response = client.delete_policy(
    PolicyArn= policyArn
)


NoSuchEntityException: An error occurred (NoSuchEntity) when calling the DetachRolePolicy operation: The role with name rga-aws-ai-workshop-pod-9996-role cannot be found.
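
The `NoSuchEntity` error above typically appears when the cleanup cell is run a second time, after the role has already been deleted. One way to make cleanup safe to rerun is to treat "already gone" errors as success. The sketch below assumes the exceptions carry a botocore-style `response` dictionary with an `Error.Code` entry, as `botocore.exceptions.ClientError` does. The helper name `run_cleanup` is hypothetical.

```python
def run_cleanup(steps, ignorable_codes=("NoSuchEntity",)):
    """Run (name, callable) cleanup steps, tolerating missing resources.

    Errors whose code is in `ignorable_codes` are recorded as
    'already gone'; any other error is re-raised.
    """
    results = {}
    for name, step in steps:
        try:
            step()
            results[name] = "deleted"
        except Exception as exc:
            # botocore's ClientError exposes the service error code here.
            response = getattr(exc, "response", None) or {}
            code = response.get("Error", {}).get("Code")
            if code in ignorable_codes:
                results[name] = "already gone"
            else:
                raise
    return results


# Possible usage with the notebook's variables (not executed here):
# iam = boto3.client('iam')
# run_cleanup([
#     ("detach policy", lambda: iam.detach_role_policy(
#         RoleName=bucket + '-role', PolicyArn=policyArn)),
#     ("delete role", lambda: iam.delete_role(RoleName=bucket + '-role')),
# ])
```

Other services report missing resources under different error codes, so `ignorable_codes` would need to be extended for the Comprehend and S3 steps.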