# Text classification model using BlazingText with Amazon SageMaker, Spark Pipeline, and AWS Glue 

In this example, we train a text classification model with the SageMaker BlazingText algorithm on the DBPedia Ontology Dataset, as done by Zhang et al. Many of the training steps resemble the 'Predict the age of Abalone' notebook, so explanations are limited here and we dive into the code directly.

The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. It has 560,000 training samples and 70,000 testing samples. The fields we use from this dataset are the title and abstract of each Wikipedia article.
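
To make the data layout concrete, here is a minimal sketch of reading rows from the CSV files. It assumes the three-column layout of the public DBpedia CSV release (class index from 1 to 14, title, abstract); the sample rows and the `read_rows` helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows in the assumed layout: class index, title, abstract.
SAMPLE = (
    '1,"Example Corp","Example Corp is a company that makes widgets."\n'
    '2,"Example College","Example College is a school, founded in 1900."\n'
)


def read_rows(handle):
    """Yield (label, title, abstract) tuples from a DBpedia-style CSV stream."""
    for class_index, title, abstract in csv.reader(handle):
        yield int(class_index), title.strip(), abstract.strip()


rows = list(read_rows(io.StringIO(SAMPLE)))
for label, title, abstract in rows:
    print(label, title, "->", abstract)
```

The `csv` module handles the quoted fields, so commas inside an abstract do not split the row.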

In many cases, when a trained model serves real-time or batch prediction requests, it receives data in a format that needs to be pre-processed (e.g. featurized) before it can be passed to the algorithm. This notebook demonstrates how to build an ML Pipeline from Spark Feature Transformers and the SageMaker BlazingText algorithm. After the model is trained, the Pipeline (Feature Transformer and BlazingText) is deployed as an Inference Pipeline behind a single Endpoint for real-time inference, and used for batch inference with Amazon SageMaker Batch Transform.

In this notebook, we use AWS Glue to run serverless Spark. Though the notebook demonstrates the end-to-end flow on a small dataset, the same setup scales to larger datasets.

**Methods**
The Notebook consists of a few high-level steps:

- Using AWS Glue for executing the SparkML feature processing job.
- Using SageMaker BlazingText to train on the processed dataset produced by SparkML job.
- Building an Inference Pipeline consisting of SparkML & BlazingText models for a realtime inference endpoint.
- Building an Inference Pipeline consisting of SparkML & BlazingText models for a single Batch Transform job.
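
The feature processing step has to turn each abstract into the line format that BlazingText expects in supervised mode: a `__label__<class>` prefix followed by space-separated tokens. The sketch below shows that target format only. The regex tokenizer is a simplifying assumption and is not the tokenizer used by the SparkML job in this notebook.

```python
import re


def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> token token ...'."""
    # Simplified tokenizer (an assumption): lowercase, split words from punctuation.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return "__label__{} {}".format(label, " ".join(tokens))


line = to_blazingtext_line(1, "Convair was an American aircraft manufacturer.")
print(line)
```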

In order to enable this Notebook to run AWS Glue jobs, we need to add one additional permission to the default execution role of this notebook. Please follow the steps listed in [this notebook example](https://github.com/hyunjoonbok/amazon-sagemaker/blob/master/Predict%20the%20age%20of%20Abalone%20(regression%20problem)%20with%20Amazon%20SageMaker%2C%20Spark%20Pipeline%2C%20and%20AWS%20Glue%20.ipynb).  

In [1]:
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
import matplotlib.ticker as ticker
%matplotlib inline

from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import time
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import boto3
import botocore
from botocore.exceptions import ClientError

import csv
import io
import re
import s3fs
import mxnet as mx
import seaborn as sns
import pickle
import gzip
import urllib

import cv2

import sagemaker                                 
from sagemaker.predictor import csv_serializer 
from sagemaker.predictor import json_deserializer
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role

### 1. Preparation (Specifying SageMaker roles)

In [2]:
sess = sagemaker.Session()
boto_session = sess.boto_session
s3 = boto_session.resource('s3')
account = boto_session.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
default_bucket = 'aws-glue-{}-{}'.format(account, region)                     
role = 'arn:aws:iam::570447867175:role/SageMakerNotebookRole' # replace with the ARN of your own IAM role

print('Sagemaker session :', sess)
print('BoTo3 session :', boto_session)
print('Account:', account)
print('S3 bucket :', default_bucket)
print('Region selected :', region)
print('IAM role :', role)

Sagemaker session : <sagemaker.session.Session object at 0x000001A4364E5748>
BoTo3 session : Session(region_name='us-west-2')
Account: 570447867175
S3 bucket : aws-glue-570447867175-us-west-2
Region selected : us-west-2
IAM role : arn:aws:iam::570447867175:role/SageMakerNotebookRole
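
Rather than hardcoding an account ID in the role ARN, the ARN and the bucket name can be derived from the account and region that the session reports. This is a small sketch with hypothetical helper names (`build_role_arn`, `default_glue_bucket`) and a placeholder account ID. It assumes the standard `aws` partition.

```python
def build_role_arn(account_id, role_name, partition="aws"):
    """Compose an IAM role ARN from an account ID and a role name."""
    return "arn:{}:iam::{}:role/{}".format(partition, account_id, role_name)


def default_glue_bucket(account_id, region):
    """Mirror the 'aws-glue-<account>-<region>' bucket naming used above."""
    return "aws-glue-{}-{}".format(account_id, region)


# Placeholder account ID for illustration only.
print(build_role_arn("123456789012", "SageMakerNotebookRole"))
print(default_glue_bucket("123456789012", "us-west-2"))
```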


In [3]:
from sagemaker.amazon.amazon_estimator import get_image_uri

training_image = get_image_uri(region, 'blazingtext', repo_version="latest")
print(training_image)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest


### 2. Load Data

The dataset can be downloaded from the [DBpedia website](https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014#2), as done by Zhang et al.
Alternatively, use the wget commands below to download it from the official AWS SageMaker S3 bucket.

In [4]:
# !wget https://s3-us-west-2.amazonaws.com/sparkml-mleap/data/dbpedia/train.csv
# !wget https://s3-us-west-2.amazonaws.com/sparkml-mleap/data/dbpedia/test.csv

In [19]:
try:
    if region == 'us-east-1':
        s3.create_bucket(Bucket=default_bucket)
    else:
        s3.create_bucket(Bucket=default_bucket, CreateBucketConfiguration={'LocationConstraint': region})
except ClientError as e:
    error_code = e.response['Error']['Code']
    if error_code == 'BucketAlreadyOwnedByYou':
        print('A bucket with the same name already exists in your account - using the same bucket.')
    else:
        raise  # surface any other failure instead of silently ignoring it

# Uploading the training and test data to S3
# (expects train.csv and test.csv in the working directory)
sess.upload_data(path='train.csv', bucket=default_bucket, key_prefix='input/dbpedia')    
sess.upload_data(path='test.csv', bucket=default_bucket, key_prefix='input/dbpedia')

A bucket with the same name already exists in your account - using the same bucket.


's3://aws-glue-570447867175-us-west-2/input/dbpedia/test.csv'
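
The region branch in the cell above exists because S3 does not accept a `LocationConstraint` of `us-east-1`. One way to keep that rule in a single place is a helper that builds the keyword arguments. `create_bucket_kwargs` is a hypothetical name, not part of boto3.

```python
def create_bucket_kwargs(bucket, region):
    """Build keyword arguments for an S3 create_bucket call in the given region."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and must not be passed as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("my-bucket", "us-east-1"))
print(create_bucket_kwargs("my-bucket", "us-west-2"))
```

With the notebook's variables, the call would then read `s3.create_bucket(**create_bucket_kwargs(default_bucket, region))`.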

###### Upload the featurizer script to S3

In [6]:
script_location = sess.upload_data(path='./tools/dbpedia_processing.py', bucket=default_bucket, key_prefix='codes')
print(script_location)

s3://aws-glue-570447867175-us-west-2/codes/dbpedia_processing.py


###### Upload MLeap dependencies to S3

In [7]:
# !wget https://s3-us-west-2.amazonaws.com/sparkml-mleap/0.9.6/python/python.zip
# !wget https://s3-us-west-2.amazonaws.com/sparkml-mleap/0.9.6/jar/mleap_spark_assembly.jar

In [10]:
python_dep_location = sess.upload_data(path='python.zip', bucket=default_bucket, key_prefix='dependencies/python')
jar_dep_location = sess.upload_data(path='mleap_spark_assembly.jar', bucket=default_bucket, key_prefix='dependencies/jar')

In [11]:
print('python Dependency Location :', python_dep_location)
print('Jar file Location', jar_dep_location)

python Dependency Location : s3://aws-glue-570447867175-us-west-2/dependencies/python/python.zip
Jar file Location s3://aws-glue-570447867175-us-west-2/dependencies/jar/mleap_spark_assembly.jar


###### Defining output locations

In [12]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Input location of the data. We uploaded train.csv and test.csv under this key prefix earlier
s3_input_bucket = default_bucket
s3_input_key_prefix = 'input/dbpedia'

# Output location of the data. The input data will be split, transformed, and 
# uploaded to <output prefix>/train and <output prefix>/validation
s3_output_bucket = default_bucket
s3_output_key_prefix = timestamp_prefix + '/dbpedia'

# the MLeap serialized SparkML model will be uploaded to output/mleap
s3_model_bucket = default_bucket
s3_model_key_prefix = s3_output_key_prefix + '/mleap'
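
The prefixes defined above determine where every later step reads and writes. The sketch below spells out the resulting S3 layout for a given timestamp. `s3_uri` and `job_locations` are hypothetical helpers, and the bucket name is a placeholder.

```python
def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an s3:// URI."""
    return "s3://{}/{}".format(bucket, "/".join(p.strip("/") for p in parts))


def job_locations(bucket, timestamp_prefix):
    """Return the S3 locations used by the Glue job and the training job."""
    output_prefix = timestamp_prefix + "/dbpedia"
    return {
        "input": s3_uri(bucket, "input/dbpedia"),
        "train": s3_uri(bucket, output_prefix, "train"),
        "validation": s3_uri(bucket, output_prefix, "validation"),
        "mleap": s3_uri(bucket, output_prefix, "mleap"),
    }


locations = job_locations("my-bucket", "2020-07-13-05-08-47")
for name, uri in locations.items():
    print(name, uri)
```

Because the output prefix starts with a timestamp, each run writes to a fresh location and does not overwrite earlier results.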

###### Calling Glue APIs

In [13]:
glue_client = boto_session.client('glue')
job_name = 'sparkml-dbpedia-' + timestamp_prefix
response = glue_client.create_job(
    Name=job_name,
    Description='PySpark job to featurize the DBPedia dataset',
    Role=role, # you can pass your existing AWS Glue role here if you have used Glue before
    ExecutionProperty={
        'MaxConcurrentRuns': 1
    },
    Command={
        'Name': 'glueetl',
        'ScriptLocation': script_location
    },
    DefaultArguments={
        '--job-language': 'python',
        '--extra-jars' : jar_dep_location,
        '--extra-py-files': python_dep_location
    },
    AllocatedCapacity=10,
    Timeout=60,
)
glue_job_name = response['Name']
print(glue_job_name)

sparkml-dbpedia-2020-07-13-05-08-47


In [14]:
job_run_id = glue_client.start_job_run(JobName=job_name,
                                       Arguments = {
                                        '--S3_INPUT_BUCKET': s3_input_bucket,
                                        '--S3_INPUT_KEY_PREFIX': s3_input_key_prefix,
                                        '--S3_OUTPUT_BUCKET': s3_output_bucket,
                                        '--S3_OUTPUT_KEY_PREFIX': s3_output_key_prefix,
                                        '--S3_MODEL_BUCKET': s3_model_bucket,
                                        '--S3_MODEL_KEY_PREFIX': s3_model_key_prefix
                                       })['JobRunId']
print(job_run_id)

jr_199c05a2bda8a761946a552a41b1ebffe74b8be40899d3c2fbf89b24b02d8b47


###### Checking Glue job status

In [15]:
job_run_status = glue_client.get_job_run(JobName=job_name,RunId=job_run_id)['JobRun']['JobRunState']
# 'TIMEOUT' is also a terminal state; without it the loop never exits if the job hits its Timeout
while job_run_status not in ('FAILED', 'SUCCEEDED', 'STOPPED', 'TIMEOUT'):
    job_run_status = glue_client.get_job_run(JobName=job_name,RunId=job_run_id)['JobRun']['JobRunState']
    print(job_run_status)
    time.sleep(30)

RUNNING
RUNNING
RUNNING
SUCCEEDED
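
A polling loop like the one above can also be wrapped in a function with an upper bound on the number of polls, so the notebook cannot hang indefinitely. The sketch below takes the Glue client as a parameter and is exercised here with a stand-in client. `wait_for_job_run` and `FakeGlueClient` are hypothetical names, and the set of terminal states is an assumption based on the Glue `JobRunState` values used in this notebook plus `TIMEOUT`.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_job_run(client, job_name, run_id, poll_seconds=30,
                     max_polls=120, sleep=time.sleep):
    """Poll a Glue job run until it reaches a terminal state or polls run out."""
    for _ in range(max_polls):
        run = client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job run {} did not finish in time".format(run_id))


class FakeGlueClient:
    """Stand-in for a boto3 Glue client that replays a fixed list of states."""

    def __init__(self, states):
        self._states = iter(states)

    def get_job_run(self, JobName, RunId):
        return {"JobRun": {"JobRunState": next(self._states)}}


final_state = wait_for_job_run(
    FakeGlueClient(["RUNNING", "RUNNING", "SUCCEEDED"]),
    "demo-job", "jr_demo", sleep=lambda seconds: None)
print(final_state)
```

In the notebook, the call would pass `glue_client`, `job_name` and `job_run_id` and keep the default `sleep`.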


### 3. Create a Model

In [16]:
s3_train_data = 's3://{}/{}/{}'.format(s3_output_bucket, s3_output_key_prefix, 'train')
s3_validation_data = 's3://{}/{}/{}'.format(s3_output_bucket, s3_output_key_prefix, 'validation')
s3_output_location = 's3://{}/{}/{}'.format(s3_output_bucket, s3_output_key_prefix, 'bt_model')

bt_model = sagemaker.estimator.Estimator(training_image,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.p2.xlarge',
                                         train_volume_size = 20,
                                         train_max_run = 3600,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)

train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')

data_channels = {'train': train_data, 'validation': validation_data}
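
The `early_stopping`, `patience` and `min_epochs` hyperparameters interact: training stops once validation accuracy has not improved for `patience` consecutive epochs, but not before `min_epochs` epochs have run. The sketch below is one plausible reading of those documented semantics, written in plain Python. It is not BlazingText's internal implementation, and the accuracy values are invented.

```python
def stopping_epoch(val_accuracies, patience=4, min_epochs=5):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
        # Stop only after the minimum number of epochs has been completed.
        if epoch >= min_epochs and epochs_without_gain >= patience:
            return epoch
    return len(val_accuracies)


# Invented validation accuracies for ten epochs; the best value is at epoch 2.
history = [0.90, 0.95, 0.94, 0.94, 0.93, 0.93, 0.94, 0.94, 0.94, 0.94]
print(stopping_epoch(history, patience=4, min_epochs=5))
```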



### 4. Start Training

In [17]:
bt_model.fit(inputs=data_channels, logs=True)

2020-07-13 05:11:44 Starting - Starting the training job...
2020-07-13 05:11:46 Starting - Launching requested ML instances......
2020-07-13 05:12:55 Starting - Preparing the instances for training......
2020-07-13 05:13:59 Downloading - Downloading input data...
2020-07-13 05:14:42 Training - Training image download completed. Training in progress.
Arguments: train
[07/13/2020 05:14:43 INFO 139844829321024] nvidia-smi took: 0.100651025772 secs to identify 1 gpus
[07/13/2020 05:14:43 INFO 139844829321024] Running BlazingText on singe GPU using supervised
[07/13/2020 05:14:43 INFO 139844829321024] Processing /opt/ml/input/data/train/part-00000 . File size: 158 MB
[07/13/2020 05:14:43 INFO 139844829321024] Processing /opt/ml/input/data/validation/part-00000 . File size: 19 MB
Read 10M words
Read 20M words
Read 26M words
Number of words:  381326
Initialized GPU 0 successfully! Now starting training....

### 5. Inference Pipeline in Spark

In [18]:
# Passing the schema of the payload via environment variable

schema = {
    "input": [
        {
            "name": "abstract",
            "type": "string"
        }
    ],
    "output": 
        {
            "name": "tokenized_abstract",
            "type": "string",
            "struct": "array"
        }
}
schema_json = json.dumps(schema)
print(schema_json)

{"input": [{"name": "abstract", "type": "string"}], "output": {"name": "tokenized_abstract", "type": "string", "struct": "array"}}
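
The schema fixes the order and names of the input columns, so a request payload can be built from a record by walking the schema's `input` list. The sketch below produces the `{"data": [...]}` JSON shape used later in this notebook. `build_payload` is a hypothetical helper, not part of the SageMaker SDK.

```python
import json

schema = {
    "input": [{"name": "abstract", "type": "string"}],
    "output": {"name": "tokenized_abstract", "type": "string", "struct": "array"},
}


def build_payload(schema, record):
    """Order a dict of column values by the schema's input columns."""
    names = [column["name"] for column in schema["input"]]
    missing = [name for name in names if name not in record]
    if missing:
        raise KeyError("missing input columns: {}".format(", ".join(missing)))
    return json.dumps({"data": [record[name] for name in names]})


payload_json = build_payload(
    schema, {"abstract": "Convair was an american aircraft manufacturing company."})
print(payload_json)
```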


In [19]:
# In order for the sagemaker-sparkml-serving to emit the output with the right format,
# we need to pass a second environment variable SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT 
# with the value application/jsonlines;data=text to ensure that sagemaker-sparkml-serving container emits response in the proper format which BlazingText can parse.

from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sparkml.model import SparkMLModel

sparkml_data = 's3://{}/{}/{}'.format(s3_model_bucket, s3_model_key_prefix, 'model.tar.gz')
# passing the schema defined above by using an environment variable that sagemaker-sparkml-serving understands
sparkml_model = SparkMLModel(model_data=sparkml_data,
                             env={'SAGEMAKER_SPARKML_SCHEMA' : schema_json, 
                                  'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': "application/jsonlines;data=text"})
bt_model = Model(model_data=bt_model.model_data, image=training_image)

model_name = 'inference-pipeline-' + timestamp_prefix
sm_model = PipelineModel(name=model_name, role=role, models=[sparkml_model, bt_model])



In [20]:
# Deploying the PipelineModel to SageMaker Endpoint for realtime inference
endpoint_name = 'inference-pipeline-ep-' + timestamp_prefix
sm_model.deploy(initial_instance_count=1, 
                instance_type='ml.p2.xlarge', 
                endpoint_name=endpoint_name)

-------------!

In [21]:
# Passing the payload as a CSV string
from sagemaker.predictor import json_serializer, csv_serializer, json_deserializer, RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_CSV, CONTENT_TYPE_JSON
payload = "Convair was an american aircraft manufacturing company which later expanded into rockets and spacecraft."
predictor = RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sess, serializer=csv_serializer,
                                content_type=CONTENT_TYPE_CSV, accept='application/jsonlines')
print(predictor.predict(payload))

b'{"label": ["__label__1"], "prob": [0.9886016249656677]}\n'


In [22]:
# Passing the payload as a JSON object
payload = {"data": ["Berwick secondary college is situated in the outer melbourne metropolitan suburb of berwick ."]}
predictor = RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sess, serializer=json_serializer,
                                content_type=CONTENT_TYPE_JSON)

print(predictor.predict(payload))

b'{"label": ["__label__2"], "prob": [0.9784008860588074]}\n'
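
The endpoint returns raw bytes holding one JSON object with a `label` list and a `prob` list. The sketch below decodes such a response and maps the label to a class name. The class list is an assumption taken from the `classes.txt` file shipped with the public DBpedia CSV release, and `parse_prediction` is a hypothetical helper.

```python
import json

# Assumed class order (index 1 to 14) from the dataset's classes.txt.
DBPEDIA_CLASSES = [
    "Company", "EducationalInstitution", "Artist", "Athlete", "OfficeHolder",
    "MeanOfTransportation", "Building", "NaturalPlace", "Village", "Animal",
    "Plant", "Album", "Film", "WrittenWork",
]


def parse_prediction(raw):
    """Return (class_name, probability) for the top label in a response."""
    result = json.loads(raw.decode("utf-8"))
    index = int(result["label"][0].replace("__label__", ""))
    return DBPEDIA_CLASSES[index - 1], result["prob"][0]


raw = b'{"label": ["__label__1"], "prob": [0.9886016249656677]}\n'
print(parse_prediction(raw))
```

Under that assumed class order, the two predictions above correspond to Company for the Convair sentence and EducationalInstitution for the Berwick sentence.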


### 6. Delete the SageMaker Endpoint

In [23]:
sm_client = boto_session.client('sagemaker')
sm_client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'e0244f7d-01f7-48a7-a677-0f00496c6a1d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e0244f7d-01f7-48a7-a677-0f00496c6a1d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Mon, 13 Jul 2020 05:26:52 GMT'},
  'RetryAttempts': 0}}
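
Deleting the endpoint stops the hosting charges, but the endpoint configuration and the model definition remain in the account. The sketch below is an alternative to the single call above that removes all three. It assumes the endpoint configuration shares the endpoint's name, which is how the SDK names it on `deploy`. `delete_inference_resources` and `RecordingClient` are hypothetical names, and the stand-in client only records calls. Leave `model_name` as `None` if you still plan to run the Batch Transform step, which reuses the model.

```python
def delete_inference_resources(sm_client, endpoint_name, model_name=None):
    """Delete an endpoint, its configuration, and optionally the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    # Assumption: the endpoint config was created with the endpoint's name.
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
    if model_name is not None:
        sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stand-in for a boto3 SageMaker client that records method calls."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return record


client = RecordingClient()
delete_inference_resources(client, "demo-ep", model_name="demo-model")
print(client.calls)
```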

### 7. (optional) Building an Inference Pipeline for a single Batch Transform job

SageMaker Batch Transform also supports chaining multiple containers together. As in the real-time use case above, an Inference Pipeline can be deployed so that a single Batch Transform job transforms your data.

**Preparing data for Batch Transform**
Batch Transform requires data in the same format described above, with one CSV or JSON record per line. For this notebook, the SageMaker team has created a sample input in CSV format that Batch Transform can process: a simple CSV file with one input string per line.

Next, we download a sample of this data (named batch_input_dbpedia.csv) from one of the SageMaker buckets and upload it to your S3 bucket. We also inspect the first few rows of the data after downloading.
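
If you want to run Batch Transform on your own text instead of the sample file, the input can be produced with the `csv` module. The sketch below writes one string per line and collapses embedded newlines, since the job splits records by line. `write_batch_input` is a hypothetical helper, and the second sample sentence is altered for illustration.

```python
import csv
import io


def write_batch_input(texts, handle):
    """Write one input string per CSV line."""
    writer = csv.writer(handle, lineterminator="\n")
    for text in texts:
        # Collapse whitespace so that each record stays on a single line.
        writer.writerow([" ".join(text.split())])


buffer = io.StringIO()
write_batch_input(
    [
        "Convair was an american aircraft manufacturing company .",
        "Berwick secondary college is situated in berwick ,\n melbourne .",
    ],
    buffer,
)
print(buffer.getvalue())
```

The writer quotes any string that contains a comma, so it is still read back as a single column.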

In [None]:
# !wget https://s3-us-west-2.amazonaws.com/sparkml-mleap/data/batch_input_dbpedia.csv
# !printf "\n\nShowing first three lines\n\n"    
# !head -n 3 batch_input_dbpedia.csv
# !printf "\n\nAs we can see, it is just one input string per line.\n\n"

In [None]:
batch_input_loc = sess.upload_data(path='batch_input_dbpedia.csv', bucket=default_bucket, key_prefix='batch')

In [None]:
# Invoking the Transform API to create a Batch Transform job

input_data_path = 's3://{}/{}/{}'.format(default_bucket, 'batch', 'batch_input_dbpedia.csv')
output_data_path = 's3://{}/{}/{}'.format(default_bucket, 'batch_output/dbpedia', timestamp_prefix)
transformer = sagemaker.transformer.Transformer(
    model_name = model_name,
    instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    strategy = 'SingleRecord',
    assemble_with = 'Line',
    output_path = output_data_path,
    base_transform_job_name='serial-inference-batch',
    sagemaker_session=sess,
    accept = CONTENT_TYPE_CSV
)
transformer.transform(data = input_data_path, 
                        content_type = CONTENT_TYPE_CSV, 
                        split_type = 'Line')
transformer.wait()
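
Once the job finishes, Batch Transform writes one output object per input object under the output path, named after the input file with `.out` appended. The sketch below computes that location. `transform_output_uri` is a hypothetical helper, and the bucket name and timestamp are placeholders.

```python
import posixpath


def transform_output_uri(output_path, input_uri):
    """Return the S3 URI where Batch Transform stores results for one input file."""
    file_name = posixpath.basename(input_uri)
    return "{}/{}.out".format(output_path.rstrip("/"), file_name)


result_uri = transform_output_uri(
    "s3://my-bucket/batch_output/dbpedia/2020-07-13-05-08-47",
    "s3://my-bucket/batch/batch_input_dbpedia.csv",
)
print(result_uri)
```

With the notebook's variables, the call would be `transform_output_uri(output_data_path, input_data_path)`.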