## Lab 1 -- Comprehend Lab
Amazon Comprehend can analyze text without any model training. This lab first runs an Airbnb review through Comprehend's built-in sentiment, entity, and key-phrase APIs. It then trains a Comprehend custom document classifier on labeled Airbnb reviews and uses it to classify a holdout set.

In [5]:
# IMPORT DEPENDENCIES
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import botocore
import os
import re
import logging


In [7]:
#Prepare for the lab by learning the current IAM Role and the S3 bucket we will use...

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

# REPLACE THE POD BELOW WITH THE NUMBER ASSIGNED TO YOU
pod = '9999'
region = 'us-east-2'
prefix = 'comprehend'  # S3 prefix under which we store the dataset used to train the custom classifier

bucket = "rga-aws-ai-workshop-pod-" + pod # Replace with your own bucket name if needed
print("The S3 bucket you are using to upload your dataset is:", bucket)

arn:aws:iam::969219788367:role/service-role/AmazonSageMaker-ExecutionRole-20181112T160917
The S3 bucket you are using to upload your dataset is: rga-aws-ai-workshop-pod-9999


In [33]:
# Run the first review through Comprehend native service.
client = boto3.client('comprehend')

# Let's first run the first Airbnb review through Comprehend's built-in sentiment analysis to see what is available
# without customization
textreview = 'We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again.'
print(textreview + '\n')

response = client.detect_sentiment(
    Text=textreview,
    LanguageCode='en'
)
print(json.dumps(response, indent=4))

#Note the ability to extract the overall sentiment from the reviews. Comprehend can provide: positive, negative,
#neutral, and mixed from the text with no training/tuning
#This can also be run from the Comprehend AWS Console interface


We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again.

{
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Positive": 0.999503493309021,
        "Negative": 6.865667819511145e-05,
        "Neutral": 0.0004137569048907608,
        "Mixed": 1.4093851859797724e-05
    },
    "ResponseMetadata": {
        "RequestId": "d553d6a2-c55d-4d9d-a2bf-4588e13f0551",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "x-amzn-requestid": "d553d6a2-c55d-4d9d-a2bf-4588e13f0551",
            "content-type": "application/x-amz-json-1.1
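
The full response is verbose, so it can help to pull out just the winning label and its score. The following is a minimal sketch; `summarize_sentiment` is a hypothetical helper name, and the sample dictionary simply copies the values printed above so the example runs without calling AWS.

```python
def summarize_sentiment(response):
    """Return (label, score) for the winning sentiment in a detect_sentiment response."""
    label = response["Sentiment"]  # e.g. "POSITIVE"
    # The label is upper case, while the SentimentScore keys are capitalized ("Positive").
    score = response["SentimentScore"][label.capitalize()]
    return label, score


# Values copied from the output above (ResponseMetadata omitted).
sample_response = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Positive": 0.999503493309021,
        "Negative": 6.865667819511145e-05,
        "Neutral": 0.0004137569048907608,
        "Mixed": 1.4093851859797724e-05,
    },
}

label, score = summarize_sentiment(sample_response)
print(f"{label}: {score:.4f}")
```

In the notebook, the same helper could be called directly on `response` after `client.detect_sentiment(...)`.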

In [36]:
# Let's also use Comprehend to detect entities and key phrases in our Airbnb review...
response = client.detect_entities(
    Text=textreview,
    LanguageCode='en'
)
print(textreview)
print(json.dumps(response, indent=4) + '\n')


response = client.detect_key_phrases(
    Text=textreview,
    LanguageCode='en'
)
print(textreview)
print(json.dumps(response, indent=4))

#Make note of the different information that can be extracted from the review with no ML model training or tuning....


We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again.
{
    "Entities": [
        {
            "Score": 0.9984429478645325,
            "Type": "LOCATION",
            "Text": "Chicago",
            "BeginOffset": 11,
            "EndOffset": 18
        },
        {
            "Score": 0.46189820766448975,
            "Type": "PERSON",
            "Text": "Hamilton",
            "BeginOffset": 26,
            "EndOffset": 34
        },
        {
            "Score": 0.9971076846122742,
            "Type": "PERSON",
            "Text": "Rob",
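
Each entity carries a confidence score, and in the output above "Hamilton" is tagged as a PERSON with a score below 0.5. A minimal sketch of filtering on that score follows. The helper name `confident_entities` and the 0.9 threshold are assumptions chosen for illustration, and the sample data copies only the fields visible in the (truncated) output.

```python
def confident_entities(response, min_score=0.9):
    """Keep (type, text) pairs whose confidence is at least min_score."""
    return [
        (entity["Type"], entity["Text"])
        for entity in response["Entities"]
        if entity["Score"] >= min_score
    ]


# Entities copied from the output above; the offsets for "Rob" are cut off
# in the captured output, so they are left out here.
sample_response = {
    "Entities": [
        {"Score": 0.9984429478645325, "Type": "LOCATION", "Text": "Chicago",
         "BeginOffset": 11, "EndOffset": 18},
        {"Score": 0.46189820766448975, "Type": "PERSON", "Text": "Hamilton",
         "BeginOffset": 26, "EndOffset": 34},
        {"Score": 0.9971076846122742, "Type": "PERSON", "Text": "Rob"},
    ]
}

kept = confident_entities(sample_response)
print(kept)
```

With this threshold the low-confidence "Hamilton" entry is dropped while "Chicago" and "Rob" are kept; a suitable threshold depends on the use case.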
 

## Instructions to Update the IAM Role for the SageMaker Notebook
The notebook's execution role needs access to:

- The S3 bucket we will create
- Comprehend
- SageMaker (training jobs, models, and endpoints, which are used in Lab 2)

In addition, create a new IAM role to pass to Comprehend as `DataAccessRoleArn`, and give that role access to the S3 bucket that was created.

In [9]:
#CREATE THE BUCKET to be used for the dataset
def create_bucket(bucket_name, region=None):
    """Create an S3 bucket in a specified region

    If a region is not specified, the bucket is created in the S3 default
    region (us-east-1).

    :param bucket_name: Bucket to create
    :param region: String region to create bucket in, e.g., 'us-west-2'
    :return: True if bucket created, else False
    """

    # Create bucket
    try:
        if region is None:
            s3_client = boto3.client('s3')
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client = boto3.client('s3', region_name=region)
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name,
                                    CreateBucketConfiguration=location)
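    # ClientError lives in botocore.exceptions; the bare name was never imported,
    # which caused "NameError: name 'ClientError' is not defined" on failure.
    except botocore.exceptions.ClientError as e:
        logging.error(e)
        return False
    return True

create_bucket(bucket, region)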

In [42]:
#Inspect the dataset
!ls comprehend/
!head comprehend/airbnb-reviews-training.csv -n 2

#Note that the data has two columns: the first is the custom label ('great' or 'notgreat')
#and the second is the review text


airbnb-reviews-training.csv
﻿notgreat,We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again
notgreat,"Chris' place is really lovely! Plenty of space for 2 people, with lots of thoughtful touches like biscuits, bottled water, tea and coffee etc! The kitchen was well equipped to make meals, and Chris quickly provided a can opener when we asked to borrow one. The washer and dryer in the apartment were also very handy. Good location in a residential neighbourhood, about 10-15 min walk to the local Aldi and 10 mins to shops, including the currency exc

True

In [67]:
#UPLOAD THE DATASET TO S3 FROM THE LOCAL SYSTEM FOR COMPREHEND CUSTOM TRAINING
def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except botocore.exceptions.ClientError as e:
        logging.error(e)
        return False
    return True

upload_file('comprehend/airbnb-reviews-training.csv', bucket)
upload_file('comprehend/airbnb-reviews-holdout.csv', bucket)


True

## Create IAM Role for Comprehend to read S3 data from your bucket

Create Policy:
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": [
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject"
        ],
        "Resource": [
            "arn:aws:s3:::rga-aws-ai-workshop-pod-9999",
            "arn:aws:s3:::rga-aws-ai-workshop-pod-9999/*"
        ]
    }
}

Create Role:
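
One way to create the role is with the IAM API instead of the console. The sketch below is an illustration under assumptions, not the workshop's official procedure: the helper names are hypothetical, the role is given a trust policy that lets the Comprehend service (`comprehend.amazonaws.com`) assume it, and the bucket policy is attached inline. Note that object-level actions such as `s3:GetObject` are matched against object ARNs (`bucket/*`), while `s3:ListBucket` is matched against the bucket ARN itself.

```python
import json


def comprehend_trust_policy():
    """Trust policy allowing the Comprehend service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "comprehend.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def bucket_access_policy(bucket_name):
    """Permissions policy covering the bucket and the objects inside it."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",      # bucket-level (ListBucket)
                f"arn:aws:s3:::{bucket_name}/*",    # object-level (Get/PutObject)
            ],
        }],
    }


def create_comprehend_role(role_name, bucket_name):
    """Create the role, attach the inline policy, and return the role ARN."""
    import boto3  # imported lazily so the policy builders work without AWS access

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(comprehend_trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=role_name + "-s3-access",  # hypothetical policy name
        PolicyDocument=json.dumps(bucket_access_policy(bucket_name)),
    )
    return role["Role"]["Arn"]
```

Calling `create_comprehend_role('comprehend-rga', bucket)` would return an ARN that can be used as `dataaccessarn` in the next cell, provided the notebook's own role is allowed to create IAM roles.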


In [68]:
# Instantiate Boto3 SDK:
client = boto3.client('comprehend', region_name=region)
s3_uri = 's3://' + bucket + '/' + prefix + '/' + 'airbnb-reviews-training.csv'

#UPDATE THE ARN FROM THE ROLE YOU CREATED IN THE PREVIOUS SECTION
dataaccessarn='arn:aws:iam::969219788367:role/comprehend-rga'

docclassifier='RGAAirbnb-' + pod

# Create a document classifier
create_response = client.create_document_classifier(
    InputDataConfig={
        'S3Uri': s3_uri
    },
    DataAccessRoleArn=dataaccessarn,
    DocumentClassifierName=docclassifier,
    LanguageCode='en'
)
print('Create response:\n' + str(create_response))

# Check the status of the classifier
docclassifierarn = create_response['DocumentClassifierArn']
describe_response = client.describe_document_classifier(
    DocumentClassifierArn=docclassifierarn)
print('Describe response:\n' + str(describe_response))


Create response:
{'DocumentClassifierArn': 'arn:aws:comprehend:us-east-2:969219788367:document-classifier/RGAAirbnb-9999', 'ResponseMetadata': {'RequestId': '5b7dce71-8f02-4d85-a1a1-766ef15683e7', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '5b7dce71-8f02-4d85-a1a1-766ef15683e7', 'content-type': 'application/x-amz-json-1.1', 'content-length': '104', 'date': 'Tue, 15 Oct 2019 19:09:38 GMT'}, 'RetryAttempts': 0}}
Describe response:
{'DocumentClassifierProperties': {'DocumentClassifierArn': 'arn:aws:comprehend:us-east-2:969219788367:document-classifier/RGAAirbnb-9999', 'LanguageCode': 'en', 'Status': 'SUBMITTED', 'SubmitTime': datetime.datetime(2019, 10, 15, 19, 9, 39, 78000, tzinfo=tzlocal()), 'InputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9999/comprehend/airbnb-reviews-training.csv'}, 'OutputDataConfig': {}, 'DataAccessRoleArn': 'arn:aws:iam::969219788367:role/comprehend-rga'}, 'ResponseMetadata': {'RequestId': 'aeae082b-3970-412d-ae8b-a706348d663e', 'HTTPStat

In [140]:
# Check the status of the classifier
# Look for a status of TRAINED before moving to the next step
describe_response = client.describe_document_classifier(
    DocumentClassifierArn=docclassifierarn)
print('Describe response:\n' + str(describe_response))

if describe_response['DocumentClassifierProperties']['Status'] == 'TRAINED':
    print('\n\nCustom Model Accuracy:\n' + str(describe_response['DocumentClassifierProperties']['ClassifierMetadata']['EvaluationMetrics']))

# Alternatively please review the Comprehend AWS Console GUI to validate if your custom job is still training

Describe response:
{'DocumentClassifierProperties': {'DocumentClassifierArn': 'arn:aws:comprehend:us-east-2:969219788367:document-classifier/RGAAirbnb-9999', 'LanguageCode': 'en', 'Status': 'TRAINED', 'SubmitTime': datetime.datetime(2019, 10, 15, 19, 9, 39, 78000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2019, 10, 15, 19, 22, 19, 874000, tzinfo=tzlocal()), 'TrainingStartTime': datetime.datetime(2019, 10, 15, 19, 12, 5, 663000, tzinfo=tzlocal()), 'TrainingEndTime': datetime.datetime(2019, 10, 15, 19, 20, 47, 490000, tzinfo=tzlocal()), 'InputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9999/comprehend/airbnb-reviews-training.csv'}, 'OutputDataConfig': {}, 'ClassifierMetadata': {'NumberOfLabels': 2, 'NumberOfTrainedDocuments': 1152, 'NumberOfTestDocuments': 127, 'EvaluationMetrics': {'Accuracy': 0.9055, 'Precision': 0.4528, 'Recall': 0.5, 'F1Score': 0.4752}}, 'DataAccessRoleArn': 'arn:aws:iam::969219788367:role/comprehend-rga'}, 'ResponseMetadata': {'RequestId': '66f65791-9
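
The evaluation metrics deserve a second look: accuracy is 0.9055, yet precision is 0.4528 and recall is exactly 0.5. One reading that reproduces all four numbers is a classifier that assigns the same label to every test document, with precision, recall, and F1 macro-averaged over the two labels. The sketch below checks that arithmetic. The 115/12 split of the 127 test documents, the macro averaging, and the choice of which label is the majority are assumptions inferred from the numbers, not values reported by Comprehend.

```python
def macro_metrics(confusion):
    """Compute accuracy and macro-averaged precision/recall/F1.

    confusion[actual][predicted] holds the number of documents.
    """
    labels = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[label][label] for label in labels)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        true_pos = confusion[label][label]
        predicted = sum(confusion[actual][label] for actual in labels)
        actual_count = sum(confusion[label].values())
        # A label that is never predicted gets precision 0 by convention here.
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual_count if actual_count else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "Accuracy": correct / total,
        "Precision": sum(precisions) / n,
        "Recall": sum(recalls) / n,
        "F1Score": sum(f1s) / n,
    }


# Assumed test split: 115 'notgreat' and 12 'great' documents, all predicted 'notgreat'.
confusion = {
    "notgreat": {"notgreat": 115, "great": 0},
    "great": {"notgreat": 12, "great": 0},
}

metrics = {name: round(value, 4) for name, value in macro_metrics(confusion).items()}
print(metrics)
```

If this reading is right, the high accuracy mostly reflects class imbalance, so it is worth checking per-label results before relying on the classifier.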

In [100]:
client = boto3.client('comprehend', region_name=region)
s3_uri_in = 's3://' + bucket + '/' + prefix + '/' + 'airbnb-reviews-holdout.csv'
s3_uri_out = 's3://' + bucket + '/' + prefix + '/' + 'output'
jobname='RGAAirbnb-Job-' + pod

start_response = client.start_document_classification_job(
    InputDataConfig={
        'S3Uri': s3_uri_in,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_uri_out
    },
    DataAccessRoleArn=dataaccessarn,
    DocumentClassifierArn=docclassifierarn,
    JobName=jobname
)

print("Start response:\n", start_response)

# Check the status of the job
describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])
print("Describe response:\n", describe_response)


Start response:
 {'JobId': 'f2bcf157993c25ad18f34e08227ec7ed', 'JobStatus': 'SUBMITTED', 'ResponseMetadata': {'RequestId': '626922fa-84dc-47f2-8176-750325487cb0', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '626922fa-84dc-47f2-8176-750325487cb0', 'content-type': 'application/x-amz-json-1.1', 'content-length': '68', 'date': 'Tue, 15 Oct 2019 19:58:02 GMT'}, 'RetryAttempts': 0}}
Describe response:
 {'DocumentClassificationJobProperties': {'JobId': 'f2bcf157993c25ad18f34e08227ec7ed', 'JobName': 'RGAAirbnb-Job-9999', 'JobStatus': 'SUBMITTED', 'SubmitTime': datetime.datetime(2019, 10, 15, 19, 58, 2, 195000, tzinfo=tzlocal()), 'DocumentClassifierArn': 'arn:aws:comprehend:us-east-2:969219788367:document-classifier/RGAAirbnb-9999', 'InputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9999/comprehend/airbnb-reviews-holdout.csv', 'InputFormat': 'ONE_DOC_PER_LINE'}, 'OutputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9999/comprehend/output/969219788367-CLN-f2bcf157993

In [108]:
# Check the status of the job.
# When JobStatus is COMPLETED you can move to the next step.
describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])
print("Describe response:\n", describe_response)

Describe response:
 {'DocumentClassificationJobProperties': {'JobId': 'f2bcf157993c25ad18f34e08227ec7ed', 'JobName': 'RGAAirbnb-Job-9999', 'JobStatus': 'COMPLETED', 'SubmitTime': datetime.datetime(2019, 10, 15, 19, 58, 2, 195000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2019, 10, 15, 20, 2, 38, 293000, tzinfo=tzlocal()), 'DocumentClassifierArn': 'arn:aws:comprehend:us-east-2:969219788367:document-classifier/RGAAirbnb-9999', 'InputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9999/comprehend/airbnb-reviews-holdout.csv', 'InputFormat': 'ONE_DOC_PER_LINE'}, 'OutputDataConfig': {'S3Uri': 's3://rga-aws-ai-workshop-pod-9999/comprehend/output/969219788367-CLN-f2bcf157993c25ad18f34e08227ec7ed/output/output.tar.gz'}, 'DataAccessRoleArn': 'arn:aws:iam::969219788367:role/comprehend-rga'}, 'ResponseMetadata': {'RequestId': 'fecd57e7-e40f-409f-9e32-ea39a75279da', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'fecd57e7-e40f-409f-9e32-ea39a75279da', 'content-type': 'applica
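
Rather than re-running the status cell by hand, the check can be wrapped in a polling loop. This is a minimal sketch: `wait_for_status` is a hypothetical helper, the poll interval and limit are arbitrary, and the demonstration uses a fake status source so it runs without AWS access.

```python
import time


def wait_for_status(get_status, success, failure=("IN_ERROR", "FAILED", "STOPPED"),
                    poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns `success`; raise on failure or timeout."""
    for _ in range(max_polls):
        status = get_status()
        if status == success:
            return status
        if status in failure:
            raise RuntimeError(f"stopped with status {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"status was still not {success} after {max_polls} polls")


# Offline demonstration: a fake status source stands in for the describe call.
fake_statuses = iter(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])
final = wait_for_status(lambda: next(fake_statuses), success="COMPLETED",
                        sleep=lambda seconds: None)
print(final)
```

In this notebook, `get_status` could be `lambda: client.describe_document_classification_job(JobId=start_response['JobId'])['DocumentClassificationJobProperties']['JobStatus']` with `success='COMPLETED'`. The same helper would work for the classifier with `describe_document_classifier`, the `Status` field, and `success='TRAINED'`.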

In [113]:
output_s3 = describe_response['DocumentClassificationJobProperties']['OutputDataConfig']['S3Uri']
#print(output_s3)
!aws s3 cp {output_s3} comprehend/output.tar.gz
!tar -xvzf comprehend/output.tar.gz -C comprehend
!head comprehend/predictions.jsonl -n 20


download: s3://rga-aws-ai-workshop-pod-9999/comprehend/output/969219788367-CLN-f2bcf157993c25ad18f34e08227ec7ed/output/output.tar.gz to comprehend/output.tar.gz
predictions.jsonl
{"File": "airbnb-reviews-holdout.csv", "Line": "0", "Classes": [{"Name": "notgreat", "Score": 0.8829}, {"Name": "great", "Score": 0.1172}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "1", "Classes": [{"Name": "notgreat", "Score": 0.8356}, {"Name": "great", "Score": 0.1644}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "2", "Classes": [{"Name": "notgreat", "Score": 0.9084}, {"Name": "great", "Score": 0.0917}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "3", "Classes": [{"Name": "notgreat", "Score": 0.8934}, {"Name": "great", "Score": 0.1066}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "4", "Classes": [{"Name": "notgreat", "Score": 0.8968}, {"Name": "great", "Score": 0.1032}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "5", "Classes": [{"Name": "notgreat", "Score": 0.8526}, {"Name": "great", 
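
Scanning the JSON Lines output by eye gets tedious, so a short script can tally the top-scoring label per document. This is a sketch with hypothetical helper names; the sample lines are copied from the first three records above.

```python
import json
from collections import Counter


def top_label(record):
    """Return the name of the highest-scoring class in one prediction record."""
    best = max(record["Classes"], key=lambda cls: cls["Score"])
    return best["Name"]


def tally_predictions(lines):
    """Count the top label across an iterable of JSON Lines strings."""
    return Counter(top_label(json.loads(line)) for line in lines if line.strip())


# First three records from predictions.jsonl, as printed above.
sample_lines = [
    '{"File": "airbnb-reviews-holdout.csv", "Line": "0", "Classes": [{"Name": "notgreat", "Score": 0.8829}, {"Name": "great", "Score": 0.1172}]}',
    '{"File": "airbnb-reviews-holdout.csv", "Line": "1", "Classes": [{"Name": "notgreat", "Score": 0.8356}, {"Name": "great", "Score": 0.1644}]}',
    '{"File": "airbnb-reviews-holdout.csv", "Line": "2", "Classes": [{"Name": "notgreat", "Score": 0.9084}, {"Name": "great", "Score": 0.0917}]}',
]

counts = tally_predictions(sample_lines)
print(counts)
```

To tally the whole file, open `comprehend/predictions.jsonl` and pass the file object to `tally_predictions`, since iterating a text file yields one line at a time.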

## Lab 2 -- SageMaker model training & delivery using a native algorithm (BlazingText)

Text classification can be used to solve various use cases such as sentiment analysis, spam detection, and hashtag prediction. This notebook demonstrates the use of SageMaker BlazingText to perform supervised text classification (binary or multi-class, with single or multiple labels). BlazingText can train the model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels.

## Setup

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region. 
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [2]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import botocore
import os
import re

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = "sagemaker-us-east-2-969219788367" # Replace with your own bucket name if needed
print(bucket)
prefix = 'blazingtext/supervised' #Replace with the prefix under which you want to store the data if needed

arn:aws:iam::969219788367:role/service-role/AmazonSageMaker-ExecutionRole-20181112T160917
sagemaker-us-east-2-969219788367


### Data Preparation

Now we'll download the dataset on which we want to train the text classification model from S3. BlazingText expects a single preprocessed text file with space-separated tokens; each line of the file should contain a single sentence and the corresponding label(s) prefixed by "\__label\__".

This notebook is adapted from the BlazingText example that trains on the [DBPedia Ontology Dataset](https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014#2) as done by [Zhang et al](https://arxiv.org/pdf/1509.01626.pdf). That dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014 and has 560,000 training samples and 70,000 testing samples. Here we instead train on Airbnb reviews stored in the same three-column CSV layout (`train.csv` and `test.csv`), with the two class names listed in `classes.txt`.

In [4]:
#!wget https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz

labels_file = "classes.txt"
train_file = "train.csv"
test_file = "test.csv"

s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(labels_file, labels_file)
s3.Bucket(bucket).download_file(train_file, train_file)
s3.Bucket(bucket).download_file(test_file, test_file)
!ls -la


total 125968
drwxrwxrwx 4 ec2-user ec2-user     4096 Sep 26 17:03 .
drwxr-xr-x 5 ec2-user ec2-user     4096 Sep 26 15:35 ..
-rw-rw-rw- 1 ec2-user ec2-user    38604 Sep 26 16:47 blazingtext_text_classification_dbpedia.ipynb
-rw-rw-r-- 1 ec2-user ec2-user    38428 Sep 26 17:03 blazingtext_text_classification_rga_airbnb.ipynb
-rw-rw-r-- 1 ec2-user ec2-user       18 Sep 26 17:03 classes.txt
drwxrwxr-x 2 ec2-user ec2-user     4096 Mar 29  2015 dbpedia_csv
-rw-rw-r-- 1 ec2-user ec2-user 68431223 Sep 26 15:37 dbpedia_csv.tar.gz
-rw-rw-r-- 1 ec2-user ec2-user 36675292 Sep 26 16:29 dbpedia.train
-rw-rw-r-- 1 ec2-user ec2-user 22950227 Sep 26 16:29 dbpedia.validation
drwxrwxr-x 2 ec2-user ec2-user     4096 Sep 26 16:47 .ipynb_checkpoints
-rw-rw-r-- 1 ec2-user ec2-user    22881 Sep 26 17:03 test.csv
-rw-rw-r-- 1 ec2-user ec2-user   799616 Sep 26 17:03 train.csv


In [7]:
#!tar -xzvf dbpedia_csv.tar.gz



Let us inspect the dataset and the classes to understand how the data and the labels are provided in the dataset.

In [8]:
#!head dbpedia_csv/train.csv -n 3
!head train.csv -n 3

﻿0,"Explore Logan Square from Renovated Condo with Free Parking","The apartment was lovely and the beds were very comfortable. I was traveling with 3 younger girls (12, 11, and 11) and I loved that the location was near plenty of coffee shops, restaurants, and parks. The area is definitely a mix of lower to mid income residents and students. With plenty of funky, high end, hipster spots. We felt very safe walking the neighborhood. And we had plenty of room in the space to spread out! The biggest plus was the free parking. Leo and Alex were gracious enough to allow us to use the lot a bit early, before check in, and a bit later, after check out. All things were accessible by Uber in 5 minutes or so. ~20 Uber ride to downtown. I would definitely give this spot another visit and recommend it to others."
0,"Explore Logan Square from Renovated Condo with Free Parking","Our family enjoyed the space. The check-in process was easy. The photos and description are accurate. The house was very 

As can be seen from the above output, the CSV has 3 fields: label index, listing title, and review text. Let us first create a label index to label name mapping and then proceed to preprocess the dataset for ingestion by BlazingText.

Next we will print the labels file (`classes.txt`) to see all possible labels followed by creating an index to label mapping.

In [9]:
#!cat dbpedia_csv/classes.txt
!cat classes.txt


﻿NotGreat
Great

The following code creates the mapping from integer indices to class labels, which will later be used to retrieve the actual class name during inference.

In [14]:
#index_to_label = {} 
#with open("dbpedia_csv/classes.txt") as f:
#    for i,label in enumerate(f.readlines()):
#        index_to_label[str(i+1)] = label.strip()
#print(index_to_label)

index_to_label = {} 
with open("classes.txt", mode='r', encoding='utf-8-sig') as f:
    for i,label in enumerate(f.readlines()):
        index_to_label[str(i)] = label.strip()
print(index_to_label)

{'0': 'NotGreat', '1': 'Great'}
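
The files in this lab are opened with `encoding='utf-8-sig'` because they appear to start with a UTF-8 byte order mark (BOM). The sketch below shows why that matters. The byte string is an assumed stand-in for the contents of `classes.txt`, not the actual file.

```python
# Assumed stand-in for classes.txt: a UTF-8 BOM followed by the two labels.
raw = b"\xef\xbb\xbfNotGreat\nGreat\n"

plain = raw.decode("utf-8").splitlines()      # BOM survives as '\ufeff'
clean = raw.decode("utf-8-sig").splitlines()  # BOM is stripped while decoding

print(repr(plain[0]))
print(repr(clean[0]))

# str.strip() does not help, because U+FEFF is not treated as whitespace.
print(plain[0].strip() == clean[0])
```

With plain `utf-8`, the first label (or the first label index in the CSV files) would carry the invisible `'\ufeff'` prefix, and a lookup such as `index_to_label[row[0]]` would fail for the first row.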


## Data Preprocessing
We need to preprocess the training data into the **space-separated tokenized text** format that can be consumed by the `BlazingText` algorithm. Also, as mentioned previously, the class label(s) should be prefixed with `__label__` and should be present on the same line as the original sentence. We'll use the `nltk` library to tokenize the input sentences from the dataset.
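
To make the target format concrete, here is a minimal sketch that builds one BlazingText training line. It deliberately uses a simple regular expression as a stand-in for `nltk.word_tokenize`, so token boundaries may differ in detail from what the notebook produces. The helper name and the sample title and review text are made up for illustration.

```python
import re


def to_blazingtext_line(label, *fields):
    """Build '__label__<label> tok1 tok2 ...' from one or more text fields."""
    tokens = []
    for field in fields:
        # Stand-in tokenizer: runs of word characters, or single punctuation marks.
        tokens.extend(re.findall(r"\w+|[^\w\s]", field.lower()))
    return " ".join(["__label__" + label] + tokens)


line = to_blazingtext_line("Great", "Cozy Condo", "We loved it!")
print(line)
```

Each line of the training file has this shape: the prefixed label first, followed by the lowercased tokens separated by single spaces.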

Download the nltk tokenizer and other libraries

In [19]:
from random import shuffle
import multiprocessing
from multiprocessing import Pool
import csv
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [20]:
def transform_instance(row):
    cur_row = []
    label = "__label__" + index_to_label[row[0]]  #Prefix the index-ed label with __label__
    cur_row.append(label)
    cur_row.extend(nltk.word_tokenize(row[1].lower()))
    cur_row.extend(nltk.word_tokenize(row[2].lower()))
    return cur_row

The `transform_instance` function will be applied to each data instance in parallel using Python's multiprocessing module.

In [21]:
def preprocess(input_file, output_file, keep=1):
    all_rows = []
    with open(input_file, mode='r', encoding='utf-8-sig') as csvinfile:
        csv_reader = csv.reader(csvinfile, delimiter=',')
        for row in csv_reader:
            all_rows.append(row)
    shuffle(all_rows)
    all_rows = all_rows[:int(keep*len(all_rows))]
    pool = Pool(processes=multiprocessing.cpu_count())
    transformed_rows = pool.map(transform_instance, all_rows)
    pool.close() 
    pool.join()
    
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        csv_writer.writerows(transformed_rows)

In [22]:
%%time

# Preparing the training dataset

# Preprocessing a large dataset can take a couple of minutes; the `keep`
# argument subsamples it (for example, keep=0.2 retains 20% of the rows).
# The Airbnb training file is small, so we use the complete dataset (keep=1).
preprocess('train.csv', 'airbnb.train', keep=1)

# Preparing the validation dataset
preprocess('test.csv', 'airbnb.validation')

CPU times: user 72.5 ms, sys: 34.4 ms, total: 107 ms
Wall time: 807 ms


The data preprocessing cell might take a minute to run on larger datasets. After the data preprocessing is complete, we need to upload the results to S3 so that they can be consumed by SageMaker to execute training jobs. We'll use the SageMaker Python SDK to upload these two files to the bucket and prefix location that we set above.

In [23]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path='airbnb.train', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='airbnb.validation', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)

CPU times: user 48.9 ms, sys: 9.56 ms, total: 58.4 ms
Wall time: 251 ms


Next we need to set up an output location in S3, where the model artifacts will be stored. These artifacts are the output of the algorithm's training job.

In [24]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

## Training
Now that we are done with all the setup that is needed, we are ready to train our text classifier. To begin, let us create a `sagemaker.estimator.Estimator` object. This estimator will launch the training job.

In [25]:
region_name = boto3.Session().region_name

In [26]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

Using SageMaker BlazingText container: 825641698319.dkr.ecr.us-east-2.amazonaws.com/blazingtext:latest (us-east-2)


## Training the BlazingText model for supervised text classification

Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using Negative Sampling, on CPUs and additionally on GPU[s]. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354).




Besides skip-gram and CBOW, SageMaker BlazingText also supports the "Batch Skipgram" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more.

BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

To summarize, the following modes are supported by BlazingText on different types of instances:

| Modes | cbow (supports subwords training) | skipgram (supports subwords training) | batch_skipgram | supervised |
|:----------------------:|:----:|:--------:|:--------------:|:--------------:|
| Single CPU instance | ✔ | ✔ | ✔ | ✔ |
| Single GPU instance | ✔ | ✔ | | ✔ (Instance with 1 GPU only) |
| Multiple CPU instances | | | ✔ | |

Now, let's define the SageMaker `Estimator` with resource configurations and hyperparameters to train text classification on the Airbnb review dataset, using "supervised" mode on an `ml.c4.4xlarge` instance.


In [27]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.c4.4xlarge',
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.

In [31]:
bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)

Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create the `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [32]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

We have our `Estimator` object, we have set its hyperparameters, and we have linked our data channels with the algorithm. The only remaining thing to do is to train the algorithm, which the following command does. Training involves a few steps. First, the instance that we requested while creating the `Estimator` is provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instance. Once this is done, the training job begins. The provisioning and data downloading take some time, depending on the size of the data, so it might be a few minutes before we start getting logs for our training job. The logs also print the accuracy on the validation data for every epoch after the training job has executed `min_epochs`. This metric is a proxy for the quality of the algorithm.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 bucket that was set up as `output_path` in the estimator.

In [33]:
bt_model.fit(inputs=data_channels, logs=True)

2019-09-26 17:21:48 Starting - Starting the training job...
2019-09-26 17:21:50 Starting - Launching requested ML instances...
2019-09-26 17:22:45 Starting - Preparing the instances for training......
2019-09-26 17:23:44 Downloading - Downloading input data
2019-09-26 17:23:44 Training - Downloading the training image..
Arguments: train
[09/26/2019 17:23:58 INFO 139821395216192] nvidia-smi took: 0.0252521038055 secs to identify 0 gpus
[09/26/2019 17:23:58 INFO 139821395216192] Running single machine CPU BlazingText training using supervised mode.
[09/26/2019 17:23:58 INFO 139821395216192] Processing /opt/ml/input/data/train/airbnb.train . File size: 0 MB
[09/26/2019 17:23:58 INFO 139821395216192] Processing /opt/ml/input/data/validation/airbnb.validation . File size: 0 MB
Read 0M words
Number of words:  3574
Loading validation data from /opt/ml/input/data/validation/airbnb.validation
Loaded validation data.

## Hosting / Inference
Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because endpoints stay up and running for a long time, it's advisable to choose a cheaper instance for inference.

In [34]:
text_classifier = bt_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-------------------------------------------------------------------------------------!

#### Use JSON format for inference
BlazingText supports `application/json` as the content-type for inference. The payload passed to the endpoint should contain a list of sentences under the key "**instances**".

In [53]:
import csv
import nltk  # assumes the tokenizer data used during data preparation is already available

# Read every row of the test set into memory
all_rows = []
with open('test.csv', mode='r', encoding='utf-8-sig') as csvinfile:
    csv_reader = csv.reader(csvinfile, delimiter=',')
    for row in csv_reader:
        all_rows.append(row)

# Pick a single row to classify: column 2 holds the review text;
# column 0 is printed for comparison with the prediction
sentences = [all_rows[29][2]]
print(all_rows[29][0])

# using the same nltk tokenizer that we used during data preparation for training
tokenized_sentences = [' '.join(nltk.word_tokenize(sent)) for sent in sentences]

# BlazingText expects the sentences under the "instances" key
payload = {"instances": tokenized_sentences}

response = text_classifier.predict(json.dumps(payload))

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

1
[
  {
    "prob": [
      0.8895270228385925
    ],
    "label": [
      "__label__NotGreat"
    ]
  }
]


By default, the model returns only one prediction, the one with the highest probability. To retrieve the top k predictions, set `k` in the configuration as shown below:

In [24]:
payload = {"instances" : tokenized_sentences,
          "configuration": {"k": 2}}

response = text_classifier.predict(json.dumps(payload))

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

[
  {
    "prob": [
      0.9971234798431396,
      0.0017487191362306476
    ],
    "label": [
      "__label__Company",
      "__label__MeanOfTransportation"
    ]
  },
  {
    "prob": [
      0.9984437823295593,
      0.0005028279265388846
    ],
    "label": [
      "__label__EducationalInstitution",
      "__label__OfficeHolder"
    ]
  }
]
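The raw response pairs each label with its probability by position and keeps the `__label__` prefix used in the training data. The sketch below shows one way to flatten a response into `(label, probability)` pairs per input sentence. The helper name `tidy_predictions` is hypothetical, and the sample input is simply the top-2 response printed above; in the notebook you would pass the `response` returned by `text_classifier.predict` instead.

```python
import json

LABEL_PREFIX = "__label__"


def tidy_predictions(response_body):
    """Return one list of (label, probability) pairs per input sentence.

    `response_body` is the JSON text (str or bytes) returned by the endpoint.
    """
    predictions = json.loads(response_body)
    tidy = []
    for pred in predictions:
        pairs = []
        # Labels and probabilities are matched by position in the response
        for label, prob in zip(pred["label"], pred["prob"]):
            if label.startswith(LABEL_PREFIX):
                label = label[len(LABEL_PREFIX):]
            pairs.append((label, prob))
        tidy.append(pairs)
    return tidy


# Sample input copied from the top-2 response above
sample = json.dumps([
    {"prob": [0.9971234798431396, 0.0017487191362306476],
     "label": ["__label__Company", "__label__MeanOfTransportation"]},
    {"prob": [0.9984437823295593, 0.0005028279265388846],
     "label": ["__label__EducationalInstitution", "__label__OfficeHolder"]},
])

for pairs in tidy_predictions(sample):
    print(pairs)
```

Because the helper loops over whatever the response contains, it works unchanged for the default single-prediction response and for any value of `k`.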


### Stop / Close the Endpoint (Optional)
Finally, if we don't need to keep the endpoint running to serve real-time predictions, we should delete it before closing the notebook.

In [25]:
sess.delete_endpoint(text_classifier.endpoint)