## Introduction

We will illustrate the different ways you can consume AI/ML services in AWS with this notebook. We see the Machine Learning stack as having three key layers: **Frameworks and Interfaces** for machine learning practitioners, **Platform Services** that make it easy for any developer to get started and get deep with ML, and **Application Services** that enable developers to plug pre-built AI functionality into their apps without having to worry about the machine learning models that power these services.



## Lab 1 AWS Application Services Layer - Amazon Comprehend

**Application services** serve developers and companies who want to add solution-oriented intelligence to their applications through an API call rather than developing and training their own models. For example, in computer vision, we developed Amazon Rekognition to allow developers to easily build intelligent video and image analysis into their applications. [C-SPAN](https://aws.amazon.com/solutions/case-studies/cspan/) uses Amazon Rekognition to automate footage tagging. Before Rekognition, they could index only about half of their footage because of the amount of manual work involved. They built a solution using Rekognition in 3 weeks and can now scan all of their footage to identify and retrieve 99,000 political figures, saving approximately 9,000 hours a year in labor.

**Lab 1** will use **Amazon Comprehend**, a service that uses machine learning to find insights and relationships in text. Amazon Comprehend can identify the language of the text; extract key phrases, places, people, brands, or events; understand how positive or negative the text is; and automatically organize a collection of text files by topic.

We will show how to extract sentiment, entities, and key phrases using the native service with no ML training required. Next, we will show how to train a Comprehend [Custom Classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html) using the R/GA Airbnb dataset of 1,300 reviews, each labeled `great` or `notgreat` to separate the truly outstanding 5-star reviews (`great`) from the other 5-star reviews that are less enthusiastic (`notgreat`). We will use this custom model to score additional reviews.

## Lab 2 AWS Platform Services Layer - Amazon SageMaker

The Platform Layer provides an end-to-end platform that coordinates the ML workflow using Amazon SageMaker. SageMaker removes the complexity that holds back developer success with the many steps in the ML workflow, such as data collection and preparation, algorithm and framework selection, model training and tuning, and model deployment. SageMaker includes modules that can be used together or independently to build, train, and deploy your machine learning models. SageMaker makes it easy to build ML models and get them ready for training by providing everything you need to quickly connect to your training data and to select and optimize the best algorithm and framework for your application. Amazon SageMaker includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3. You can connect directly to data in S3, or use AWS Glue to move data from Amazon RDS, Amazon DynamoDB, and Amazon Redshift into S3 for analysis in your notebook. SageMaker [includes several common machine learning algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) which have been pre-installed and optimized to deliver up to 10 times the performance you’ll find running these algorithms anywhere else.

**Lab 2** will use SageMaker and a built-in algorithm for natural language processing (NLP) called BlazingText to train a model with the same Airbnb dataset. Once it is trained, we will host the model on a highly available endpoint in SageMaker and call that endpoint to run inference on additional reviews.

## AWS Frameworks and Interfaces
The Frameworks and Interfaces layer is for expert machine learning practitioners. These are people comfortable building deep learning models, working with deep learning frameworks, building clusters, etc. They can get extremely deep. For this group, most of the machine learning and deep learning done in the cloud today is done on AWS with our GPU instances, which are optimized for ML and DL. In 2017, we launched the P3 instance, which is fourteen times faster than the P2 instance. P3 provides a huge boost to the speed and efficiency of deep learning and machine learning workloads.

AWS provides choice in the frameworks we support and offer (e.g., TensorFlow, Caffe2, MXNet, Keras, CNTK), and we strive to support them all. We will always provide a great solution for all the frameworks and choices that people want to make.

**Lab 3** will demonstrate using a pre-existing model, [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta), with the same Airbnb dataset in a Jupyter notebook on our SageMaker notebook instance. We will run the model directly from the SageMaker notebook; if desired, you can [host custom models in SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html).





## Lab 1 -- Comprehend Lab
Text Classification can be used to solve various use-cases like sentiment analysis, spam detection, hashtag prediction etc.

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend processes any text file in UTF-8 format. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.

You work with one or more documents at a time to evaluate their content and gain insights about them. Some of the insights that Amazon Comprehend develops about a document include:

- **Entities** – Amazon Comprehend returns a list of entities, such as people, places, and locations, identified in a document. For more information, see Detect Entities.

- **Key phrases** – Amazon Comprehend extracts key phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score. For more information, see Locate Key Phrases.

- **Language** – Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can identify 100 languages. For more information, see Detect the Dominant Language.

- **Sentiment** – Amazon Comprehend determines the emotional sentiment of a document. Sentiment can be positive, neutral, negative, or mixed. For more information, see Determine the Sentiment.

- **Syntax** – Amazon Comprehend parses each word in your document and determines the part of speech for the word. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun. For more information, see Analyze Syntax.
    
More information about Amazon Comprehend is [here](https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html).

In [None]:
# IMPORT DEPENDENCIES
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import botocore
import os
import re
import logging
import time
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))


In [None]:
#Prepare for the lab by looking up the current IAM role and the S3 bucket we will use. We will need to update the
#IAM role. Please review the README.md of the Git repo for more details.

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

# REPLACE THE POD BELOW WITH THE NUMBER ASSIGNED TO YOU
pod = '9998'

# REPLACE THE REGION BELOW WITH THE REGION YOU ARE OPERATING IN (ie N. Virginia=us-east-1, Ohio=us-east-2)
region = 'us-east-1'

prefix = 'comprehend' #we will store the dataset we will use to train a custom classification classifier

bucket = "rga-aws-ai-workshop-pod-" + pod # Replace with your own bucket name if needed
print("The S3 bucket you will be using to upload your dataset is:", bucket)

## Instructions to Update the IAM Role for the SageMaker Notebook
Please follow the instructions in the README.md of the Git repo. The role needs access to:

- The S3 bucket we will create
- Comprehend
- SageMaker
- IAM, to create a new role for the Comprehend `DataAccessRoleArn` that includes the S3 bucket that was created

In [None]:
# Run the first Airbnb review through Comprehend native service.
client = boto3.client('comprehend')

#Let's first run the first Airbnb review through the native sentiment analysis of Comprehend to see what's available
#without customization. Please note this is a simple API call with no ML training required.
#Feel free to re-run this cell with another review you would like to extract information on.
textreview = 'We were In Chicago to see Hamilton and sightsee with our teenaged daughter. We loved this location and neighborhood. It felt safe and there were tons of nearby places to eat or get groceries. The condo itself was extremely clean and was well stocked with necessities. It was spacious and we never felt crowded. Rob was a great host and was very responsive to our questions. It was a plus to have parking available as we never moved our car until we left Chicago. I would definitely rent from Rob again.'
printmd("**Raw text Airbnb review:**")
print(textreview + '\n')

response = client.detect_sentiment(
    Text=textreview,
    LanguageCode='en'
)
printmd("**Comprehend native sentiment analysis:**")
print(json.dumps(response, indent=4))

#Note the ability to extract the overall sentiment from the reviews. Comprehend can provide: positive, negative,
#neutral, and mixed from the text with no training/tuning
#This can also be run from the Comprehend AWS Console interface
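
The response printed above is a plain dictionary. As a minimal sketch of how its sentiment fields might be consumed, the helper below reads the overall label and its confidence from a hand-written sample. The sample values and the `summarize_sentiment` name are illustrative assumptions, not output from the service.

```python
# Hand-written sample shaped like a detect_sentiment response (values are made up).
sample_response = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Positive": 0.97,
        "Negative": 0.01,
        "Neutral": 0.02,
        "Mixed": 0.00,
    },
}

def summarize_sentiment(response):
    """Return (label, confidence) for the overall sentiment of one document."""
    label = response["Sentiment"]            # upper case, e.g. "POSITIVE"
    scores = response["SentimentScore"]      # keys are capitalized, e.g. "Positive"
    return label, scores[label.capitalize()]

label, confidence = summarize_sentiment(sample_response)
print(f"{label} ({confidence:.2f})")
```

The same helper could be applied to the `response` returned by `client.detect_sentiment` in the cell above.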


In [None]:
#Next let's use Comprehend to detect entities and key phrases for our Airbnb review...
response = client.detect_entities(
    Text=textreview,
    LanguageCode='en'
)
printmd("**Raw text Airbnb review:**")
print(textreview + '\n')

printmd("**Comprehend entities detected:**")
print(json.dumps(response, indent=4) + '\n')

response = client.detect_key_phrases(
    Text=textreview,
    LanguageCode='en'
)

printmd("**Raw text Airbnb review:**")
print(textreview + '\n')

printmd("**Comprehend key phrases detected:**")
print(json.dumps(response, indent=4))

#Make note of the different information that can be extracted from the review with no ML model training or tuning....
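
The entity response is a list of dictionaries, each with a type, the matched text, and a confidence score. The sketch below groups confident entities by type. The sample entries, the 0.9 threshold, and the `entities_by_type` name are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hand-written sample shaped like a detect_entities response (values are made up).
sample_entities = {
    "Entities": [
        {"Text": "Chicago", "Type": "LOCATION", "Score": 0.99},
        {"Text": "Hamilton", "Type": "TITLE", "Score": 0.62},
        {"Text": "Rob", "Type": "PERSON", "Score": 0.98},
    ]
}

def entities_by_type(response, min_score=0.9):
    """Group entity texts by type, keeping only entities at or above min_score."""
    grouped = defaultdict(list)
    for entity in response["Entities"]:
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)

print(entities_by_type(sample_entities))
```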


In [None]:
#CREATE THE BUCKET IN YOUR AWS ACCT to be used for the dataset
from botocore.exceptions import ClientError  # needed by the except clause below

def create_bucket(bucket_name, region=None):
    """Create an S3 bucket in a specified region

    If a region is not specified, the bucket is created in the S3 default
    region (us-east-1).

    :param bucket_name: Bucket to create
    :param region: String region to create bucket in, e.g., 'us-west-2'
    :return: True if bucket created, else False
    """

    # Create bucket
    try:
        if (region is None) or (region == 'us-east-1'):
            # us-east-1 is the default region, so no LocationConstraint is passed
            s3_client = boto3.client('s3')
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client = boto3.client('s3', region_name=region)
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name,
                                    CreateBucketConfiguration=location)
    except ClientError as e:
        logging.error(e)
        return False
    print(bucket_name + ' S3 bucket created successfully!')
    return True

create_bucket(bucket, region)
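
The two branches in `create_bucket` differ only in whether a `CreateBucketConfiguration` is sent. As a small sketch of that rule, the hypothetical helper `bucket_kwargs` below builds the keyword arguments without calling AWS, so the region logic can be checked on its own.

```python
def bucket_kwargs(bucket_name, region=None):
    """Build keyword arguments for an S3 create_bucket call (no AWS call is made)."""
    kwargs = {"Bucket": bucket_name}
    # Only regions other than the default us-east-1 carry a LocationConstraint.
    if region not in (None, "us-east-1"):
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

print(bucket_kwargs("rga-aws-ai-workshop-pod-9998", "us-east-1"))
print(bucket_kwargs("rga-aws-ai-workshop-pod-9998", "us-west-2"))
```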

In [None]:
#Inspect the dataset. We are reviewing only the first two entries.
printmd("**Comprehend working directory contents:**")
!ls comprehend/ -la
!pwd

printmd("**First 2 entries from the training dataset:**")
!head comprehend/airbnb-reviews-training.csv -n 2

# We are assuming that the data transformations have occurred upstream and that the dataset included in the repo
# is already formatted as Comprehend requires. The data has two comma-separated columns:
# the first column is the custom label ('great' or 'notgreat') and the second column contains the raw review.
# This is the format required for the Comprehend custom classifier.
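
To make the two-column layout concrete, here is a minimal sketch that writes and re-reads a few rows with Python's `csv` module and counts the labels. The review texts are invented for illustration and are not taken from the workshop dataset.

```python
import csv
import io
from collections import Counter

# Made-up rows in the layout described above: label first, review text second.
rows = [
    ("great", "Spotless condo, great host, and we would stay again."),
    ("notgreat", "Nice place, but check-in was slow."),
    ("great", 'Loved it. "Best stay ever," said our daughter.'),
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)   # fields containing commas or quotes are quoted

buffer.seek(0)
label_counts = Counter(label for label, _text in csv.reader(buffer))
print(label_counts)
```

Using the `csv` module rather than splitting on commas keeps reviews that contain commas in a single column.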


In [None]:
#UPLOAD THE DATASET TO S3 FROM THE LOCAL SYSTEM FOR COMPREHEND CUSTOM TRAINING
from botocore.exceptions import ClientError
from boto3.exceptions import S3UploadFailedError  # raised by upload_file when the transfer fails

def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, bucket, object_name)
    except (ClientError, S3UploadFailedError) as e:
        logging.error(e)
        return False
    print(file_name + ' object uploaded successfully!')
    return True

upload_file('comprehend/airbnb-reviews-training.csv', bucket)
upload_file('comprehend/airbnb-reviews-holdout.csv', bucket)


## Create IAM Role for Comprehend to read S3 data from your bucket

Comprehend requires permissions to access the dataset in S3. In order to do this we will create an IAM role (Data Access Role) with the required policies to allow Comprehend to assume this role. This role has rights to read from the S3 bucket we created earlier as well as the following native Comprehend policy: _ComprehendDataAccessRolePolicy_.

Please see the Comprehend docs for more detail [here](https://docs.aws.amazon.com/comprehend/latest/dg/access-control-managing-permissions.html#auth-role-permissions).
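
As a sketch of what such a permissions policy can look like, the hypothetical helper `build_bucket_policy` below builds the document as a Python dictionary and serializes it with `json.dumps`. Splitting the statement so that `s3:ListBucket` targets the bucket ARN and the object actions target `bucket/*` is an assumption about how the policy should be scoped, not a copy of the workshop's policy.

```python
import json

def build_bucket_policy(bucket_name):
    """Return an IAM policy document (as a dict) scoped to one S3 bucket."""
    bucket_arn = "arn:aws:s3:::" + bucket_name
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # Reading and writing apply to the objects inside the bucket.
            {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
             "Resource": [bucket_arn + "/*"]},
        ],
    }

# Compact separators produce a whitespace-free string suitable for PolicyDocument.
policy_json = json.dumps(build_bucket_policy("rga-aws-ai-workshop-pod-9998"), separators=(",", ":"))
print(policy_json)
```

Building the document from a dictionary avoids string formatting and whitespace-stripping with regular expressions.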

In [None]:
# CREATE THE IAM POLICY & ROLE FOR COMPREHEND TO USE TO TRAIN THE CUSTOM MODEL
# CREATE IAM POLICY
policyDocumentStr = '''
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": [
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject"
        ],
        "Resource": [
            "arn:aws:s3:::%s",
            "arn:aws:s3:::%s/*"
        ]
    }
}
'''%(bucket, bucket)
pattern = re.compile(r'[\s\r\n]+')
policyDocumentStr = re.sub(pattern, '', policyDocumentStr)

client = boto3.client('iam')
response = client.create_policy(
    PolicyName= bucket + '-policy',
    PolicyDocument=policyDocumentStr,
    Description='IAM permissions policy for Comprehend access to S3 bucket for RGA AWS AI Workshop'
)
print('The ' + response['Policy']['PolicyName'] + ' IAM policy was created')
policyArn= response['Policy']['Arn']

# CREATE IAM ROLE
trustPolicyDocumentStr = '''
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "comprehend.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}'''
pattern = re.compile(r'[\s\r\n]+')
trustPolicyDocumentStr = re.sub(pattern, '', trustPolicyDocumentStr)

response = client.create_role(
    RoleName= bucket + '-role',
    AssumeRolePolicyDocument= trustPolicyDocumentStr,
    Description='IAM role for Comprehend access to S3 bucket for RGA AWS AI Workshop'
)
print('The ' + response['Role']['RoleName'] + ' IAM role was created')
dataaccessarn=response['Role']['Arn']

# ASSIGN POLICIES TO ROLE
response = client.attach_role_policy(
    RoleName= bucket + '-role',
    PolicyArn= policyArn
)
response = client.attach_role_policy(
    RoleName= bucket + '-role',
    PolicyArn= 'arn:aws:iam::aws:policy/service-role/ComprehendDataAccessRolePolicy'
)


## Build the Comprehend Custom Classifier
You can customize Comprehend for your specific requirements without the skillset required to build machine learning-based NLP solutions. Using automatic machine learning, or AutoML, Comprehend Custom builds customized NLP models on your behalf, using data you already have.

Amazon Comprehend uses a proprietary, state-of-the-art sequence tagging deep neural network model that powers the same Amazon Comprehend detect entities service to train your custom entity recognizer models. In addition, we understand that acquiring training data could be costly. To help customers build a highly accurate model with a limited amount of data, Amazon Comprehend uses a technique called transfer learning to train your custom models based on a sophisticated general-purpose entity recognition model that was pre-trained with a large amount of data we collected from multiple domains. Offline experiments showed that transfer learning significantly improved custom entity recognizer model accuracy, especially when the amount of training data is small.

To learn more about training a Comprehend Custom Classifier please review the docs [here](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification-training.html).


In [None]:
# CREATE THE COMPREHEND CUSTOM CLASSIFIER
client = boto3.client('comprehend', region_name=region)
s3_uri = 's3://' + bucket + '/' + prefix + '/' + 'airbnb-reviews-training.csv'

#UPDATE THE ARN FROM THE ROLE YOU CREATED IN THE PREVIOUS SECTION
#account_id = role.split(':')[4]
#dataaccessarn='arn:aws:iam::12345678901234:role/rga-aws-ai-workshop-pod-xxxx'
                    
docclassifier='RGAAirbnb-' + pod

# Create a document classifier
create_response = client.create_document_classifier(
    InputDataConfig={
        'S3Uri': s3_uri
    },
    DataAccessRoleArn=dataaccessarn,
    DocumentClassifierName=docclassifier,
    LanguageCode='en'
)
printmd("**Create response output**")
print(str(create_response) + "\n")

# Check the status of the classifier
docclassifierarn = create_response['DocumentClassifierArn']
describe_response = client.describe_document_classifier(
    DocumentClassifierArn=docclassifierarn)
printmd("**Describe response output:**")
print(str(describe_response))

# Review the output below. We have requested that the Comprehend custom classifier be created. It should take
# several minutes to train the model using the Airbnb dataset.


In [None]:
# Check the status of the classifier
# Look for a status of TRAINED before moving on to the next step
describe_response = client.describe_document_classifier(DocumentClassifierArn=docclassifierarn)
printmd("**Document classifier status (Wait for the status to show TRAINED):**")
status = describe_response['DocumentClassifierProperties']['Status']
print(status)
# Poll only while training is pending or running, so a failed job cannot loop forever
while status in ('SUBMITTED', 'TRAINING'):
    time.sleep(20)
    describe_response = client.describe_document_classifier(DocumentClassifierArn=docclassifierarn)
    status = describe_response['DocumentClassifierProperties']['Status']
    print(status)

if status == 'TRAINED':
    printmd("**Custom Model Accuracy:**")
    print(str(describe_response['DocumentClassifierProperties']['ClassifierMetadata']['EvaluationMetrics']))
else:
    print('Training did not complete: ' + str(describe_response['DocumentClassifierProperties'].get('Message')))

# Alternatively, review the Comprehend AWS Console to check whether your custom job is still training.
# Note the output below: the custom model accuracy is tested and reported in the output of the model after
# training is complete.
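
The polling pattern above can be factored into a reusable helper. The sketch below is one possible shape, with a timeout and an explicit set of failure states. The `wait_for_status` name, its parameters, and the replayed status sequence are assumptions for illustration, and no AWS call is made.

```python
import time

def wait_for_status(get_status, done, failed, poll_seconds=20,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns `done`; raise on a failed state or timeout."""
    waited = 0
    while True:
        status = get_status()
        if status == done:
            return status
        if status in failed:
            raise RuntimeError("stopped with status " + status)
        if waited >= timeout_seconds:
            raise TimeoutError("still " + status + " after " + str(waited) + " seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Stand-in for describe_document_classifier: replays a fixed sequence of statuses.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final_status = wait_for_status(lambda: next(fake_statuses), done="TRAINED",
                               failed={"IN_ERROR", "STOPPED"},
                               sleep=lambda seconds: None)  # skip real waiting in the demo
print(final_status)
```

In the notebook, `get_status` could be a small function that calls `describe_document_classifier` and returns the `Status` field.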

In [None]:
client = boto3.client('comprehend', region_name=region)
s3_uri_in = 's3://' + bucket + '/' + prefix + '/' + 'airbnb-reviews-holdout.csv'
s3_uri_out = 's3://' + bucket + '/' + prefix + '/' + 'output'
jobname='RGAAirbnb-Job-' + pod

start_response = client.start_document_classification_job(
    InputDataConfig={
        'S3Uri': s3_uri_in,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_uri_out
    },
    DataAccessRoleArn=dataaccessarn,
    DocumentClassifierArn=docclassifierarn,
    JobName=jobname
)

printmd("**Start response output:**")
print(str(start_response) + '\n')

# Check the status of the job
describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])
printmd("**Describe response output:**")
print(describe_response)


In [None]:
# Check the status of the job.
# When JobStatus is COMPLETED you can move to the next step.
describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])

printmd("**Classification job status (Wait for the status to show COMPLETED):**")
job_status = describe_response['DocumentClassificationJobProperties']['JobStatus']
print(job_status)

# Poll only while the job is pending or running, so a failed job cannot loop forever
while job_status in ('SUBMITTED', 'IN_PROGRESS'):
    time.sleep(20)
    describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])
    job_status = describe_response['DocumentClassificationJobProperties']['JobStatus']
    print(job_status)

if job_status != 'COMPLETED':
    print('Job did not complete: ' + str(describe_response['DocumentClassificationJobProperties'].get('Message')))

In [None]:
output_s3 = describe_response['DocumentClassificationJobProperties']['OutputDataConfig']['S3Uri']
#print(output_s3)

printmd("**Download job output from S3 to local filesystem:**")
!aws s3 cp {output_s3} comprehend/output.tar.gz

printmd("**Uncompress predictions:**")
!tar -xvzf comprehend/output.tar.gz -C comprehend

printmd("**Review groundtruth (first 5):**")
!head comprehend/airbnb-reviews-holdout-groundtruth.csv -n 5

printmd("**Review predictions (first 5):**")
!head comprehend/predictions.jsonl -n 5

In [None]:
# Review a single holdout record: set predict to the line number you want to compare

predict=1

printmd("**Review groundtruth:**")
!sed '{predict}!d' comprehend/airbnb-reviews-holdout-groundtruth.csv

printmd("**Review predictions:**")
!sed '{predict}!d' comprehend/predictions.jsonl
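
Comparing one line at a time does not scale, so a short script can score every prediction at once. The sketch below assumes that each line of `predictions.jsonl` is a JSON object with a `Line` index and a `Classes` list of `Name`/`Score` pairs, and that the ground truth reduces to one label per line in the same order. Both layouts, and all sample values, are assumptions to check against your own files.

```python
import json

# Made-up records in the assumed predictions.jsonl layout.
predictions_jsonl = """\
{"File": "airbnb-reviews-holdout.csv", "Line": "0", "Classes": [{"Name": "great", "Score": 0.91}, {"Name": "notgreat", "Score": 0.09}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "1", "Classes": [{"Name": "notgreat", "Score": 0.77}, {"Name": "great", "Score": 0.23}]}
{"File": "airbnb-reviews-holdout.csv", "Line": "2", "Classes": [{"Name": "great", "Score": 0.58}, {"Name": "notgreat", "Score": 0.42}]}
"""
groundtruth_labels = ["great", "notgreat", "notgreat"]  # made-up labels, one per holdout line

predicted = {}
for line in predictions_jsonl.splitlines():
    record = json.loads(line)
    best = max(record["Classes"], key=lambda c: c["Score"])  # highest-scoring class
    predicted[int(record["Line"])] = best["Name"]

correct = sum(predicted[i] == label for i, label in enumerate(groundtruth_labels))
accuracy = correct / len(groundtruth_labels)
print(f"accuracy: {accuracy:.2f}")
```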

## Lab 2 -- SageMaker model training & delivery using a native algorithm (BlazingText)

Text Classification can be used to solve various use-cases like sentiment analysis, spam detection, hashtag prediction etc. This notebook demonstrates the use of SageMaker BlazingText to perform supervised binary/multi class with single or multi label text classification. BlazingText can train the model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels.

## Setup

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region. 
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [None]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import botocore
import os
import re
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))


sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

#pod = '9998'
#region = 'us-east-1'
prefix = 'blazingtext' #we will store the dataset we will use to train a custom classification classifier

bucket = "rga-aws-ai-workshop-pod-" + pod # Replace with your own bucket name if needed
print(bucket)

### Data Preparation

Next we prepare the dataset on which we want to train the text classification model. BlazingText expects a single preprocessed text file with space-separated tokens; each line of the file should contain a single sentence and the corresponding label(s) prefixed by "\__label\__".

The steps below were written for the [DBPedia Ontology Dataset](https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014#2) as used by [Zhang et al](https://arxiv.org/pdf/1509.01626.pdf). The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. It has 560,000 training samples and 70,000 testing samples, and the fields used contain the title and abstract of each Wikipedia article. In this lab, the same steps are applied to the Airbnb files in the `blazingtext/` directory (`classes.txt`, `train.csv`, and `test.csv`).
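
As a quick illustration of the expected input format, the sketch below turns one labeled sentence into a BlazingText-style line. It uses a plain whitespace split in place of a real tokenizer, and the `to_blazingtext_line` name and the sample sentence are assumptions for illustration.

```python
def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> token token ...'."""
    tokens = text.lower().split()  # whitespace split, standing in for a real tokenizer
    return " ".join(["__label__" + label] + tokens)

line = to_blazingtext_line("great", "The condo was extremely clean")
print(line)
```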

In [None]:

labels_file = "classes.txt"
train_file = "train.csv"
test_file = "test.csv"

!ls {prefix} -la


Let us inspect the dataset and the classes to get some understanding about how the data and the label is provided in the dataset. 

In [None]:
!head {prefix}/train.csv -n 3

As can be seen from the above output, the CSV has 3 fields - Label index, title and abstract. Let us first create a label index to label name mapping and then proceed to preprocess the dataset for ingestion by BlazingText.

Next we will print the labels file (`classes.txt`) to see all possible labels followed by creating an index to label mapping.

In [None]:
!cat {prefix}/classes.txt


The following code creates the mapping from integer indices to class label which will later be used to retrieve the actual class name during inference. 

In [None]:
#index_to_label = {} 
#with open("dbpedia_csv/classes.txt") as f:
#    for i,label in enumerate(f.readlines()):
#        index_to_label[str(i+1)] = label.strip()
#print(index_to_label)

index_to_label = {} 
with open("blazingtext/classes.txt", mode='r', encoding='utf-8-sig') as f:
    for i,label in enumerate(f.readlines()):
        index_to_label[str(i)] = label.strip()
print(index_to_label)
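
The file is opened with `encoding='utf-8-sig'` so that a UTF-8 byte-order mark at the start of the file, if present, does not become part of the first label. The sketch below shows the difference on made-up bytes; the label order here is illustrative and may not match `classes.txt`.

```python
# Made-up file contents that start with a UTF-8 byte-order mark (BOM).
raw = b"\xef\xbb\xbfnotgreat\ngreat\n"

plain_labels = raw.decode("utf-8").split()      # BOM survives as a leading '\ufeff'
sig_labels = raw.decode("utf-8-sig").split()    # BOM is stripped

print(repr(plain_labels[0]))
print(repr(sig_labels[0]))

# Same zero-based mapping as in the cell above, built from the clean labels.
demo_index_to_label = {str(i): label for i, label in enumerate(sig_labels)}
print(demo_index_to_label)
```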

## Data Preprocessing
We need to preprocess the training data into **space separated tokenized text** format which can be consumed by `BlazingText` algorithm. Also, as mentioned previously, the class label(s) should be prefixed with `__label__` and it should be present in the same line along with the original sentence. We'll use `nltk` library to tokenize the input sentences from our dataset. 

Download the nltk tokenizer and other libraries

In [None]:
from random import shuffle
import multiprocessing
from multiprocessing import Pool
import csv
import nltk
nltk.download('punkt')

In [None]:
def transform_instance(row):
    cur_row = []
    label = "__label__" + index_to_label[row[0]]  #Prefix the index-ed label with __label__
    cur_row.append(label)
    cur_row.extend(nltk.word_tokenize(row[1].lower()))
    cur_row.extend(nltk.word_tokenize(row[2].lower()))
    return cur_row

The `transform_instance` function will be applied to each data instance in parallel using Python's multiprocessing module.

In [None]:
def preprocess(input_file, output_file, keep=1):
    all_rows = []
    with open(input_file, mode='r', encoding='utf-8-sig') as csvinfile:
        csv_reader = csv.reader(csvinfile, delimiter=',')
        for row in csv_reader:
            all_rows.append(row)
    shuffle(all_rows)
    all_rows = all_rows[:int(keep*len(all_rows))]
    pool = Pool(processes=multiprocessing.cpu_count())
    transformed_rows = pool.map(transform_instance, all_rows)
    pool.close() 
    pool.join()
    
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        csv_writer.writerows(transformed_rows)
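
The `keep` argument of `preprocess` controls what fraction of the shuffled rows is retained. The sketch below isolates that step. The `subsample` name and the fixed seed are assumptions added so that the example is repeatable; `preprocess` itself shuffles without a seed.

```python
import random

def subsample(rows, keep=1.0, seed=0):
    """Shuffle rows and keep the first `keep` fraction of them."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)     # seeded only to make the demo repeatable
    return rows[:int(keep * len(rows))]

sample = subsample(range(10), keep=0.2)
print(len(sample))
```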

In [None]:
%%time

#open('blazingtext/train.csv', mode='r', encoding='utf-8-sig')
# Preparing the training dataset

# Preprocessing a large dataset might take a couple of minutes; lowering keep
# (for example keep=0.2) processes only that fraction of the rows.
# keep=1, as used here, processes the complete training dataset.
preprocess('blazingtext/train.csv', 'blazingtext/airbnb.train', keep=1)

# Preparing the validation dataset        
preprocess('blazingtext/test.csv', 'blazingtext/airbnb.validation')

The data preprocessing cell might take a minute to run. After the data preprocessing is complete, we need to upload it to S3 so that it can be consumed by SageMaker to execute training jobs. We'll use Python SDK to upload these two files to the bucket and prefix location that we have set above.   

In [None]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path='blazingtext/airbnb.train', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='blazingtext/airbnb.validation', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)
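
Each file uploaded with a `key_prefix` ends up under that prefix with its original file name, which is why the training job can be pointed at the prefix rather than at a single object. The hypothetical helper `uploaded_object_uri` below sketches the resulting object URI without calling AWS.

```python
import posixpath

def uploaded_object_uri(local_path, bucket, key_prefix):
    """Return the S3 URI a local file would have after upload under key_prefix."""
    filename = posixpath.basename(local_path)
    return "s3://{}/{}/{}".format(bucket, key_prefix, filename)

print(uploaded_object_uri("blazingtext/airbnb.train",
                          "rga-aws-ai-workshop-pod-9998",
                          "blazingtext/train"))
```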

Next we need to set up an output location in S3, where the model artifact will be stored. These artifacts are also the output of the algorithm's training job.

In [None]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

## Training
Now that we are done with all the setup that is needed, we are ready to train our text classifier. To begin, let us create a `sagemaker.estimator.Estimator` object. This estimator will launch the training job.

In [None]:
region_name = boto3.Session().region_name

In [None]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

## Training the BlazingText model for supervised text classification

Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using Negative Sampling, on CPUs and additionally on GPU[s]. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354).




Besides skip-gram and CBOW, SageMaker BlazingText also supports the "Batch Skipgram" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more.

BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

To summarize, the following modes are supported by BlazingText on different types of instances:

| Modes | cbow (supports subwords training) | skipgram (supports subwords training) | batch_skipgram | supervised |
|:----------------------:|:----:|:--------:|:--------------:|:--------------:|
| Single CPU instance | ✔ | ✔ | ✔ | ✔ |
| Single GPU instance | ✔ | ✔ | | ✔ (Instance with 1 GPU only) |
| Multiple CPU instances | | | ✔ | |
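
The table can also be expressed as a small lookup, which is handy for validating a configuration before launching a job. This is only a sketch of the table's contents; the `SUPPORTED_MODES` and `is_supported` names and the instance-kind keys are assumptions, not part of the SageMaker SDK.

```python
# Modes supported per instance setup, transcribed from the table above.
SUPPORTED_MODES = {
    "single_cpu": {"cbow", "skipgram", "batch_skipgram", "supervised"},
    "single_gpu": {"cbow", "skipgram", "supervised"},  # supervised: 1-GPU instances only
    "multi_cpu": {"batch_skipgram"},
}

def is_supported(mode, instance_kind):
    """Return True if the table lists `mode` for the given instance setup."""
    return mode in SUPPORTED_MODES[instance_kind]

print(is_supported("supervised", "single_cpu"))
print(is_supported("supervised", "multi_cpu"))
```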

Now, let's define the SageMaker `Estimator` with resource configurations and hyperparameters to train text classification on the Airbnb dataset, using "supervised" mode on an `ml.m5.large` instance.


In [None]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.m5.large',
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

Please refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.

In [None]:
bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)

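The `early_stopping`, `patience`, and `min_epochs` settings above work together. The sketch below illustrates the general patience-based early-stopping technique under the assumption that validation accuracy is only checked once `min_epochs` has been reached. It is not BlazingText's internal implementation; the function name `epochs_before_stop` and the accuracy values are made up for illustration.

```python
def epochs_before_stop(val_accuracies, min_epochs=5, patience=4):
    """Return the 1-based epoch at which training would stop.

    val_accuracies holds one validation accuracy per epoch, in order.
    Assumption: epochs before min_epochs are never used for stopping.
    """
    best = float("-inf")
    stale = 0  # consecutive checked epochs without improvement
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if epoch < min_epochs:
            continue  # too early to apply the stopping rule
        if accuracy > best:
            best, stale = accuracy, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch  # no improvement for `patience` epochs
    return len(val_accuracies)  # ran all epochs without stopping early


# Hypothetical accuracy trace for a 10-epoch job.
history = [0.80, 0.85, 0.88, 0.90, 0.91, 0.91, 0.90, 0.91, 0.90, 0.91]
print(epochs_before_stop(history, min_epochs=5, patience=4))
```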
Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [None]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

We now have our `Estimator` object, its hyperparameters, and the data channels linked to the algorithm. The only remaining step is to train the model, which the following command does. Training involves a few steps. First, the instance that we requested while creating the `Estimator` object is provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data downloading take some time, depending on the size of the data, so it might be a few minutes before we start getting logs for the training job. The logs will also print the accuracy on the validation data for every epoch after the training job has executed `min_epochs` epochs. This metric is a proxy for the quality of the model.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 bucket that was set up as `output_path` in the estimator.

In [None]:
bt_model.fit(inputs=data_channels, logs=True)

## Review the output above to see the model statistics and accuracy

## Hosting / Inference
Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because endpoints stay up and running for a long time, it's advisable to choose a cheaper instance for inference.

In [None]:
text_classifier = bt_model.deploy(initial_instance_count=1, instance_type='ml.m5.large')

#### Use JSON format for inference
BlazingText supports `application/json` as the content type for inference. The payload passed to the endpoint should contain a list of sentences under the key "**instances**".

In [None]:
import csv

# Load the held-out test reviews.
# Column 0 holds the ground-truth label and column 2 holds the review text.
all_rows = []
with open('blazingtext/test.csv', mode='r', encoding='utf-8-sig') as csvinfile:
    csv_reader = csv.reader(csvinfile, delimiter=',')
    for row in csv_reader:
        all_rows.append(row)

By default, the model returns only one prediction, the one with the highest probability. To retrieve the top k predictions, set `k` in the `configuration` field of the payload. The cell below uses the default and requests a single prediction:

In [None]:
# The test data consists of 30 Airbnb reviews (0-29), select a sentence to validate the prediction
sentence = 10

sentences = [all_rows[sentence][2]]

printmd("**Raw Sentence:**")
print(sentences)

printmd("**Groundtruth (0=Not Great, 1=Great):**")
print(all_rows[sentence][0])

# using the same nltk tokenizer that we used during data preparation for training
tokenized_sentences = [' '.join(nltk.word_tokenize(sent)) for sent in sentences]

payload = {"instances" : tokenized_sentences}

response = text_classifier.predict(json.dumps(payload))

predictions = json.loads(response)

printmd("**Prediction:**")
print(json.dumps(predictions, indent=2))

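The top-k option mentioned above can be sketched as follows. The helpers `build_payload` and `top_labels` are hypothetical names for this example. The sample response is made up to mirror the list of `label` and `prob` entries that the endpoint returns, and the `__label__` prefix is assumed to match the labels used in the training data. Real values come from `text_classifier.predict`.

```python
import json


def build_payload(tokenized_sentences, k=None):
    """Serialize sentences for the endpoint, optionally asking for top-k labels."""
    payload = {"instances": list(tokenized_sentences)}
    if k is not None:
        payload["configuration"] = {"k": k}
    return json.dumps(payload)


def top_labels(predictions, prefix="__label__"):
    """Pair each predicted label (prefix removed) with its probability."""
    results = []
    for prediction in predictions:
        labels = [
            label[len(prefix):] if label.startswith(prefix) else label
            for label in prediction["label"]
        ]
        results.append(list(zip(labels, prediction["prob"])))
    return results


# Request the two most likely labels for one tokenized review.
body = build_payload(["the host was wonderful !"], k=2)
print(body)

# Made-up response for illustration only, not real model output.
sample_response = '[{"label": ["__label__1", "__label__0"], "prob": [0.93, 0.07]}]'
parsed = top_labels(json.loads(sample_response))
print(parsed)
```

With a live endpoint, `body` would be passed to `text_classifier.predict(body)` in place of the default payload.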

## Resource Cleanup
Let's remove the resources we've used during this lab.

We should delete the SageMaker endpoint before we close the notebook if we don't need to keep it running to serve real-time predictions.

In [None]:
# REMOVE THE SAGEMAKER ENDPOINT
sess.delete_endpoint(text_classifier.endpoint)

In [None]:
# DELETE REMAINING RESOURCES
# S3 Bucket
!aws s3 rb s3://{bucket} --force

# Comprehend Classifier
client = boto3.client('comprehend')
response = client.delete_document_classifier(
    DocumentClassifierArn=docclassifierarn
)

# REMOVE IAM ROLE
client = boto3.client('iam')
response = client.detach_role_policy(
    RoleName= bucket + '-role',
    PolicyArn= policyArn
)
response = client.detach_role_policy(
    RoleName= bucket + '-role',
    PolicyArn= 'arn:aws:iam::aws:policy/service-role/ComprehendDataAccessRolePolicy'
)
response = client.delete_role(
    RoleName= bucket + '-role'
)

# REMOVE IAM POLICY
response = client.delete_policy(
    PolicyArn= policyArn
)
