# Data Prep with AWS Glue

While the New York Times comments dataset could be prepared in a notebook, this tutorial prepares it with Spark ML. The Spark job runs on AWS Glue, a managed Spark service. Glue provides ETL (Extract, Transform, and Load) functionality for a data warehouse. This approach allows for automated updates when the dataset changes.

> Note: This isn't the only approach. The data prep could also be performed by a series of containerized microservices and/or Lambda functions.

In [1]:
# Note: We're only using the SageMaker API here to look up the S3 bucket and the execution role.

import sagemaker
from sagemaker import get_execution_role
role = get_execution_role()

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-228889150161


# Create a Glue Job

In this cell, we create a Glue job that runs a Scala/Spark script.

The job executes the following tasks:
1. Removes HTML markup.
2. Tokenizes the comment text.
3. Removes stop words.
4. Creates a vocabulary.
5. Creates a bag-of-words representation (i.e., word counts for each comment).
6. Saves the bag-of-words representation in a SageMaker-friendly format.

The job produces the following outputs:
1. Vocabulary of the top N words.
2. Sample validation data (TBD).
3. The training data in RecordIO/Protocol Buffer format.

For a deeper explanation of the bag-of-words representation, see the data preparation section of [this example](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/ntm_20newsgroups_topic_modeling/ntm_20newsgroups_topic_model.ipynb).
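
To make the task list above concrete, here is a minimal pure-Python sketch of the same steps on two made-up comments. It is not the Scala/Spark script that the Glue job runs; the stop-word list, the helper names, and the sample comments are all assumptions chosen for illustration.

```python
import re
from collections import Counter
from html import unescape

# Hypothetical, tiny stop-word list (a real job would use a much larger one).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}

def strip_html(text):
    """Replace tags with spaces, then decode entities such as &amp;."""
    return unescape(re.sub(r"<[^>]+>", " ", text))

def tokenize(text):
    """Lowercase the text and split on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(docs, top_n):
    """Keep the top_n most frequent tokens across all documents."""
    counts = Counter(t for doc in docs for t in doc)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Made-up sample comments, for illustration only.
comments = [
    "<p>The market is <b>up</b> &amp; the mood is good</p>",
    "Good news: the market rallied",
]
docs = [remove_stop_words(tokenize(strip_html(c))) for c in comments]
vocab = build_vocabulary(docs, top_n=3)
vectors = [bag_of_words(d, vocab) for d in docs]
```

Each entry in `vectors` lines up with `vocab`, so a comment becomes a fixed-length list of word counts. The final step of the job, writing those vectors in RecordIO/Protocol Buffer format, is left out of this sketch.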

In [18]:
import boto3 
client = boto3.client('glue')

script_loc = 's3://{}/glue/nyt-comments/tokenize.scala'.format(bucket)
sagemaker_jar = 's3://{}/glue/nyt-comments/jars/sagemaker-spark_2.11-spark_2.2.0-1.0.5.jar'.format(bucket)

client.create_job(Name='sagemaker-nyt-prep',
    Description='Spark ML job to create a bag-of-words vector',
    Role=role,
    Command={
        'Name':'glueetl', 
        'ScriptLocation':script_loc},
    DefaultArguments={
        '--job-language': 'scala',
        '--class': 'Tokenizer',
        '--extra-jars': sagemaker_jar,
        '--sagemaker_bucket': bucket
    })

{'Name': 'sagemaker-nyt-prep',
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
   'content-length': '29',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Sat, 02 Jun 2018 00:51:34 GMT',
   'x-amzn-requestid': '18a98a2d-65ff-11e8-9a26-e3673eeba1bc'},
  'HTTPStatusCode': 200,
  'RequestId': '18a98a2d-65ff-11e8-9a26-e3673eeba1bc',
  'RetryAttempts': 0}}

# Start the Job

The job's status and logs are accessible via the AWS Console.

![Job Status](images/GlueJobStatus.png)

In [19]:
client.start_job_run(JobName='sagemaker-nyt-prep', 
                     AllocatedCapacity=10, 
                     Arguments={'--sagemaker_bucket': bucket})

{'JobRunId': 'jr_0249a7a7353017d8a2c93eb6b1b9b7b21166176640eff3a186434d89084589bf',
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
   'content-length': '82',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Sat, 02 Jun 2018 00:56:33 GMT',
   'x-amzn-requestid': 'cb2119a0-65ff-11e8-87d5-efaa57ac0484'},
  'HTTPStatusCode': 200,
  'RequestId': 'cb2119a0-65ff-11e8-87d5-efaa57ac0484',
  'RetryAttempts': 0}}
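
Besides the console, the run's status can also be polled from the notebook. The following is a minimal sketch that assumes the `JobRunId` returned above and uses boto3's `get_job_run` call; the helper name `wait_for_job_run`, the polling interval, and the choice of which states count as terminal are assumptions of this example.

```python
import time

# States treated as terminal in this sketch (an assumption; adjust as needed).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def wait_for_job_run(glue_client, job_name, run_id, poll_seconds=30, max_polls=120):
    """Poll a Glue job run until it reaches a terminal state and return that state."""
    for _ in range(max_polls):
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Job run {} did not finish after {} polls".format(run_id, max_polls)
    )
```

For example, keeping the response of `start_job_run` in a variable `run` and calling `wait_for_job_run(client, 'sagemaker-nyt-prep', run['JobRunId'])` would block until the run finishes and return a state such as `'SUCCEEDED'`.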

# View the Labeled Dataset

If a Glue crawler crawls the job's results at ```s3://sagemaker-bucket/data/nyt-features/labeled_data.parquet```, then the labeled data can be queried with Athena.

![Athena Query](images/AthenaQuery.png)
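
The crawler itself can be created and started with boto3. This is a rough sketch rather than part of the original tutorial: the helper name, the crawler name, and the database name are hypothetical placeholders, and the role is assumed to have permission to read the bucket and write to the Glue Data Catalog.

```python
def crawl_job_output(glue_client, role, s3_path,
                     database="nyt_comments",            # hypothetical database name
                     crawler_name="nyt-features-crawler"):  # hypothetical crawler name
    """Create a crawler over the job's output path and start it once."""
    glue_client.create_crawler(
        Name=crawler_name,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue_client.start_crawler(Name=crawler_name)
    return crawler_name
```

Assuming `sagemaker-bucket` in the path above stands for the default bucket looked up earlier, a call might look like `crawl_job_output(client, role, 's3://{}/data/nyt-features/labeled_data.parquet'.format(bucket))`. Once the crawler finishes, the resulting table should appear under the chosen database in Athena.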