# Using AWS Glue Python Shell Jobs

1. [Introduction](#Introduction)
2. [Activity 1 : Executing Amazon Athena Queries](#Activity-1-:-Executing-Amazon-Athena-Queries)
3. [Activity 2 : Deploying the AWS Glue Python Shell Job](#Activity-2-:-Deploying-the-AWS-Glue-Python-Shell-Job)
4. [Wrap-up](#Wrap-up)

## Introduction

In this notebook, we explore AWS Glue Python Shell jobs. Not every use case needs the power of Apache Spark, and Python is a very versatile language for data processing. Use cases that suit AWS Glue Python Shell jobs include:
    
- Orchestrating SQL in databases such as Amazon Redshift and Amazon Aurora.
- Light-weight ETL using Amazon Athena.
- Data processing using Python Pandas or Numpy libraries.
- Building Python ML models using Python Scikit-Learn.
- And anything else that Python can accomplish.

Let's build a SQL-driven ETL pipeline that uses Amazon Athena to execute SQL scripts over an Amazon S3 data lake.

The architecture for this module is shown below:


<img src="../resources/module3_architecture_diagram.png" alt="Module3 Architecture Diagram" style="width: 700px;"/>

## Activity 1 : Executing Amazon Athena Queries


In [1]:
import boto3,time
import pandas as pd

defaultdb="default"

default_output='s3://###s3_bucket###/athena-sql/data/output/'
default_write_location='s3://###s3_bucket###/athena-sql/data/'
default_script_location='s3://###s3_bucket###/scripts/'
default_script_logs_location = 's3://###s3_bucket###/athena-sql/logs/'
sql_script_file='athena-sql-script.sql'

We will write a simple helper function that sends a SQL statement to Amazon Athena and polls until the query finishes:

In [2]:
def executeQuery(query, database=defaultdb, s3_output=default_output, poll=10):
    log_output ("Executing Query : \n") 
    start = time.time()
    log_output (query+"\n")
    athena = boto3.client('athena')
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
            },
        ResultConfiguration={
            'OutputLocation': s3_output,
            }
        )

    log_output('Execution ID: ' + response['QueryExecutionId'])
    queryExecutionId=response['QueryExecutionId']
    state='QUEUED'
    while( state=='RUNNING' or state=='QUEUED'):
        response = athena.get_query_execution(QueryExecutionId=queryExecutionId)
        state=response['QueryExecution']['Status']['State']
        log_output (state)
        if  state=='RUNNING' or state=='QUEUED':
            time.sleep(poll)
        elif (state=='FAILED'):
             log_output (response['QueryExecution']['Status']['StateChangeReason'])
              
    done = time.time()
    log_output ("Elapsed Time (in seconds) : %f \n"%(done - start))
    return response

def log_output(s):
    log_output_string.append(s)
    
def read_from_athena(sql):
    response=executeQuery(sql)
    return pd.read_csv(response['QueryExecution']['ResultConfiguration']['OutputLocation'])
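
The helper above logs a failure but still returns, so a caller cannot easily stop the pipeline when a statement fails. The following is a minimal sketch of the same polling idea that raises on `FAILED` or `CANCELLED`. The names `wait_for_query`, `AthenaQueryError` and `FakeAthena` are hypothetical and not part of this lab; the fake client only exists so the loop can be tried without AWS credentials.

```python
import time

# Terminal states reported by Athena's get_query_execution call.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


class AthenaQueryError(RuntimeError):
    """Hypothetical exception type, raised when a query does not succeed."""


def wait_for_query(athena, query_execution_id, poll=10, sleep=time.sleep):
    """Poll until the query reaches a terminal state.

    `athena` is any object exposing get_query_execution, normally
    boto3.client('athena'). Returns the final response on success and
    raises AthenaQueryError on FAILED or CANCELLED.
    """
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_execution_id)
        status = response["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            break
        sleep(poll)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason reported")
        raise AthenaQueryError("%s: %s" % (status["State"], reason))
    return response


class FakeAthena:
    """Stand-in client (an assumption of this sketch) that replays states."""

    def __init__(self, states):
        self._states = list(states)

    def get_query_execution(self, QueryExecutionId):
        state = self._states.pop(0)
        return {"QueryExecution": {"Status": {"State": state}}}


final = wait_for_query(
    FakeAthena(["QUEUED", "RUNNING", "SUCCEEDED"]),
    "demo-id",
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final["QueryExecution"]["Status"]["State"])
```

With a real client, the call would look like `wait_for_query(boto3.client('athena'), response['QueryExecutionId'])`, using the ID returned by `start_query_execution`.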

The script we are going to use is [athena-sql-script.sql](athena-sql-script.sql).

We will read the SQL file from our S3 bucket:

In [3]:
s3_location= default_script_location+sql_script_file
bucket_name,script_location=s3_location.split('/',2)[2].split('/',1)
print (bucket_name)
print (script_location)

s3 = boto3.client('s3')

fileobj = s3.get_object(Bucket=bucket_name,Key=script_location)
contents = fileobj['Body'].read().decode('utf-8')
contents

###s3_bucket###
scripts/athena-sql-script.sql


"--\n-- Athena SQL script\n--\n\n-- Drop table\nDROP TABLE default.nyc_trips_pq_1;\n\n-- Create table\nCREATE EXTERNAL TABLE `nyc_trips_pq_1`(\n  `vendor_name` string, \n  `trip_pickup_datetime` string, \n  `trip_dropoff_datetime` string, \n  `passenger_count` int, \n  `trip_distance` float, \n  `payment_type` string, \n  `are_amt` float, \n  `surcharge` float, \n  `mta_tax` float, \n  `tip_amt` float, \n  `tolls_amt` float, \n  `total_amt` float)\nPARTITIONED BY ( \n  `year` string, \n  `month` string)\nROW FORMAT SERDE \n  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' \nSTORED AS INPUTFORMAT \n  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' \nOUTPUTFORMAT \n  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'\nLOCATION\n  's3://neilawspublic/dataset2'\nTBLPROPERTIES (\n  'parquet.compress'='SNAPPY');\n\n-- Load the partitions\nMSCK REPAIR TABLE nyc_trips_pq_1;\n\n-- Drop the Report table.\nDROP TABLE default.nyc_top_trips_report;\n\n-- 

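The bucket and key split in the cell above relies on counting `/` characters. As a hedged alternative, the standard library's `urllib.parse` can do the same split and reject malformed input. `split_s3_uri` and the bucket name `my-example-bucket` are hypothetical names used only for this sketch; note that the `###s3_bucket###` placeholder must be replaced by a real bucket name first, because `#` is parsed as a URL fragment marker.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: %r" % uri)
    # The path keeps its leading '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, script_location = split_s3_uri(
    "s3://my-example-bucket/scripts/athena-sql-script.sql"
)
print(bucket_name)
print(script_location)
```

The returned pair can be passed directly as `Bucket` and `Key` to `s3.get_object`.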
Now let's execute the script. This step issues the SQL commands to Amazon Athena and should take around 2 minutes. You can navigate to the Amazon Athena console and view the queries being submitted:

- Navigate to the Amazon Athena console.
- Click on the History tab to view the queries submitted.

In [4]:
log_output_string=[]
for sql in str(contents).split(";")[:-1]:
    response=executeQuery(sql)
print ("\n".join(log_output_string))

Executing Query : 

--
-- Athena SQL script
--

-- Drop table
DROP TABLE default.nyc_trips_pq_1

Execution ID: 684cdef4-6646-4495-ae9b-1cea3f158900
QUEUED
SUCCEEDED
Elapsed Time (in seconds) : 10.257477 

Executing Query : 



-- Create table
CREATE EXTERNAL TABLE `nyc_trips_pq_1`(
  `vendor_name` string, 
  `trip_pickup_datetime` string, 
  `trip_dropoff_datetime` string, 
  `passenger_count` int, 
  `trip_distance` float, 
  `payment_type` string, 
  `are_amt` float, 
  `surcharge` float, 
  `mta_tax` float, 
  `tip_amt` float, 
  `tolls_amt` float, 
  `total_amt` float)
PARTITIONED BY ( 
  `year` string, 
  `month` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://neilawspublic/dataset2'
TBLPROPERTIES (
  'parquet.compress'='SNAPPY')

Execution ID: c278c1c9-
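
The loop above splits the script on `;` and drops the last chunk, which works as long as the script ends with a terminated statement followed only by comments or whitespace. A slightly more defensive sketch is shown below: it drops any chunk that contains nothing but blank lines and `--` comments. `split_sql_statements` is a hypothetical helper, and the approach is still naive: a `;` inside a string literal or comment would be treated as a statement boundary.

```python
def split_sql_statements(script):
    """Split a SQL script on ';' and drop chunks with no executable SQL."""
    statements = []
    for chunk in script.split(";"):
        # Collect lines that are neither blank nor '--' comments.
        code_lines = [
            line for line in chunk.splitlines()
            if line.strip() and not line.strip().startswith("--")
        ]
        if code_lines:
            # Keep the chunk with its comments; only trim outer whitespace.
            statements.append(chunk.strip())
    return statements


sample = """--
-- Athena SQL script
--

-- Drop table
DROP TABLE default.nyc_trips_pq_1;

-- Load the partitions
MSCK REPAIR TABLE nyc_trips_pq_1;

-- trailing comment only
"""

for statement in split_sql_statements(sample):
    print(repr(statement))
```

Each returned statement could then be passed to `executeQuery` in place of the raw `split(";")[:-1]` chunks.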

Our pipeline is complete and we can see the results of our SQL script run above: the Amazon Athena execution ID of each query as well as the execution time for each query.

Python shell jobs in AWS Glue come pre-loaded with libraries such as Boto3, NumPy, SciPy, Pandas and others. You can also load custom Python libraries into the AWS Glue Python Shell environment, packaged as an .egg or a .whl file.
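
As a rough sketch of how custom libraries are usually attached, the S3 paths of the .egg or .whl files can be passed through the `--extra-py-files` job argument as a comma-separated string. The helper `python_shell_arguments`, the bucket name and the wheel file name below are hypothetical and only illustrate the shape of the `DefaultArguments` dictionary.

```python
def python_shell_arguments(bucket, library_keys=()):
    """Build a DefaultArguments dict for glue.create_job.

    `library_keys` are S3 object keys of .egg or .whl files in `bucket`;
    both values are placeholders for this sketch.
    """
    arguments = {
        "--TempDir": "s3://%s/temp" % bucket,
        "--S3_BUCKET": bucket,
    }
    if library_keys:
        # Several libraries are passed as one comma-separated string.
        arguments["--extra-py-files"] = ",".join(
            "s3://%s/%s" % (bucket, key) for key in library_keys
        )
    return arguments


print(python_shell_arguments(
    "my-example-bucket",
    ["libs/my_helpers-0.1-py3-none-any.whl"],
))
```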

You can read more about AWS Glue Python Shell features here: https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html

Let us read the final report data as a Pandas dataframe:

In [5]:
read_from_athena("Select * from default.nyc_top_trips_report")

Unnamed: 0,year,month,total_passengers,total_trips
0,2012,3,26866837,16146923
1,2011,3,26091246,16066350
2,2013,3,26965079,15749228
3,2011,10,26287953,15707756
4,2009,10,26202049,15604551
5,2012,5,26278817,15567525
6,2011,5,25508952,15554868
7,2010,9,25533166,15540209
8,2010,5,26002858,15481351
9,2012,4,25900645,15477914


## Activity 2 : Deploying the AWS Glue Python Shell Job

As a final step, we will deploy this pipeline as an AWS Glue Python Shell job and execute it.

Note that an AWS Glue Python Shell job can use 1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A single DPU provides processing capacity that consists of 4 vCPUs of compute and 16 GB of memory. For our use case, 0.0625 DPU is sufficient.
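
To make the fractional setting concrete, the per-DPU figures quoted above can be scaled down. This sketch assumes resources scale linearly with the DPU fraction, which gives 0.25 vCPU and 1 GB of memory for 0.0625 DPU; `resources_for` is a hypothetical helper name.

```python
from fractions import Fraction

# Per-DPU resources as stated in the text above.
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16


def resources_for(dpu):
    """Return (vCPUs, memory in GB), scaling linearly (an assumption)."""
    dpu = Fraction(dpu)
    return float(dpu * VCPUS_PER_DPU), float(dpu * MEMORY_GB_PER_DPU)


for capacity in (Fraction(1, 16), Fraction(1)):
    vcpus, memory_gb = resources_for(capacity)
    print("%s DPU -> %.2f vCPU, %.0f GB" % (float(capacity), vcpus, memory_gb))
```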

In [10]:
import boto3

acct_number=boto3.client('sts').get_caller_identity().get('Account')
bucket='###s3_bucket###'

# Create the Glue Python Shell job
glue = boto3.client("glue")

for job_name in ['Build_Top_Flight_Delays_Report']:
    response=glue.create_job(Name=job_name,
                         Role="arn:aws:iam::%s:role/###iam_role###"%acct_number,
                         ExecutionProperty={'MaxConcurrentRuns': 1},
                         Command={'Name': 'pythonshell',
                                  'ScriptLocation': 's3://%s/scripts/%s.py'%(bucket,job_name),
                                  'PythonVersion': '3'},
                         DefaultArguments={'--TempDir': 's3://%s/temp'%bucket,
                                           '--enable-metrics': '',
                                           '--job-language': 'python',
                                           '--S3_BUCKET': bucket },
                         MaxRetries=0,
                         Timeout=2880,
                         MaxCapacity=0.0625,
                         GlueVersion='1.0',
                         Tags={'Owner': 'AWS_Glue_Labs'}
                        )
    print (response)

{'Name': 'Build_Top_Flight_Delays_Report', 'ResponseMetadata': {'RequestId': '6c8ff977-577a-42a1-bb58-b50544585d16', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Thu, 07 May 2020 16:46:06 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '41', 'connection': 'keep-alive', 'x-amzn-requestid': '6c8ff977-577a-42a1-bb58-b50544585d16'}, 'RetryAttempts': 0}}


Now that the AWS Glue job is deployed, let's execute it:

- Navigate to the AWS Glue console -> Jobs.
- Select the 'Build_Top_Flight_Delays_Report' job.
- Click 'Action -> Run Job' to execute the job.

We can monitor the execution details from the AWS Glue console and, once the job has finished, view the logs by clicking the 'Logs' link.
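
The same run can also be started and monitored from code instead of the console. Below is a minimal sketch built on the Boto3 Glue calls `start_job_run` and `get_job_run`. The helper `run_job_and_wait` and the `FakeGlue` stand-in are hypothetical names for this example; the fake client only replays states so the loop can be tried without AWS credentials.

```python
import time

# States after which a Glue job run no longer changes.
FINISHED_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name, poll=30, sleep=time.sleep):
    """Start a job run and poll until it finishes; returns the final state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        job_run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = job_run["JobRunState"]
        if state in FINISHED_STATES:
            return state
        sleep(poll)


class FakeGlue:
    """Stand-in client (an assumption of this sketch) that replays states."""

    def __init__(self, states):
        self._states = list(states)

    def start_job_run(self, JobName):
        return {"JobRunId": "jr_demo"}

    def get_job_run(self, JobName, RunId):
        return {"JobRun": {"JobRunState": self._states.pop(0)}}


final_state = run_job_and_wait(
    FakeGlue(["STARTING", "RUNNING", "SUCCEEDED"]),
    "Build_Top_Flight_Delays_Report",
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final_state)
```

With real credentials, the call would be `run_job_and_wait(boto3.client("glue"), "Build_Top_Flight_Delays_Report")`.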

## Wrap-up

In this notebook, we ran exercises to:

1. Execute a light-weight, SQL-driven ETL pipeline using Amazon Athena, and
2. Deploy the pipeline as an AWS Glue Python Shell job.

We hope this lab helped you to understand how to leverage the simplicity and power of Python in your Data Pipelines using AWS Glue Python Shell Jobs.