# Using AWS Glue Python Shell Jobs

1. [Introduction](#Introduction)
2. [Activity 1 : Executing Amazon Athena Queries](#Activity-1-:-Executing-Amazon-Athena-Queries)
3. [Activity 2 : Deploying the AWS Glue Python Shell Job](#Activity-2-:-Deploying-the-AWS-Glue-Python-Shell-Job)

## Introduction

In this notebook, we explore AWS Glue Python Shell jobs. Not every use case needs the power of Apache Spark, and Python is a very versatile language for data processing. AWS Glue Python Shell jobs suit use cases such as:
    
- Orchestrating SQL in databases such as Amazon Redshift and Amazon Aurora.
- Light-weight ETL using Amazon Athena.
- Data processing using Python Pandas or Numpy libraries.
- Building Python ML models using Python Scikit-Learn.
- And anything else that Python can accomplish.

## Activity 1 : Executing Amazon Athena Queries

In this lab, we will deploy a light-weight ETL pipeline by executing a SQL script in Amazon Athena from an AWS Glue Python Shell job.

In [None]:
import time

import boto3
import pandas as pd

region='eu-west-1'
defaultdb="default"

default_output='s3://glue-labs-959874710265/athena-sql/data/output/'
default_write_location='s3://glue-labs-959874710265/athena-sql/data/'
default_script_location='s3://glue-labs-959874710265/athena-sql/scripts/'
default_script_logs_location = 's3://glue-labs-959874710265/athena-sql/logs/'
sql_script_file='athena-sql-script.sql'

We will write a simple helper function that sends a SQL statement to Amazon Athena and waits for it to finish:

In [None]:
def executeQuery(query, database=defaultdb, s3_output=default_output, poll=10):
    log_output ("Executing Query : \n") 
    start = time.time()
    log_output (query+"\n")
    athena = boto3.client('athena',region_name=region)
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
            },
        ResultConfiguration={
            'OutputLocation': s3_output,
            }
        )

    log_output('Execution ID: ' + response['QueryExecutionId'])
    queryExecutionId=response['QueryExecutionId']
    # Poll until the query leaves the QUEUED/RUNNING states.
    state='QUEUED'
    while state in ('RUNNING', 'QUEUED'):
        response = athena.get_query_execution(QueryExecutionId=queryExecutionId)
        state=response['QueryExecution']['Status']['State']
        log_output (state)
        if state in ('RUNNING', 'QUEUED'):
            time.sleep(poll)
        elif state=='FAILED':
            # Record the reason Athena gives for the failure.
            log_output (response['QueryExecution']['Status']['StateChangeReason'])

    done = time.time()
    log_output ("Elapsed Time (in seconds) : %f \n"%(done - start))
    return response

# Log lines are collected in this module-level list and printed after a run.
log_output_string=[]

def log_output(s):
    log_output_string.append(s)
    
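def read_from_athena(sql):
    # Athena writes the result of a SELECT as a CSV file at OutputLocation.
    # Reading an s3:// path with pandas requires the s3fs package.
    response=executeQuery(sql)
    return pd.read_csv(response['QueryExecution']['ResultConfiguration']['OutputLocation'])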

The script we are going to use is [athena-sql-script.sql](athena-sql-script.sql).

We will read the SQL file from our S3 bucket:

In [None]:
s3_location= default_script_location+sql_script_file
bucket_name,script_location=s3_location.split('/',2)[2].split('/',1)
print (bucket_name)
print (script_location)

s3 = boto3.client('s3',region_name=region)

fileobj = s3.get_object(Bucket=bucket_name,Key=script_location)
contents = fileobj['Body'].read().decode('utf-8')
contents
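
The chained `split` calls above work, but they accept any string with two slashes near the front. As a minimal sketch, the same bucket/key split can be done with the standard library's `urlparse`, which also makes it easy to reject a malformed location. The helper name `split_s3_uri` is an assumption made for illustration and is not part of the lab code.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key); raise on anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: %r" % uri)
    # The path keeps its leading slash, which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, script_location = split_s3_uri(
    "s3://glue-labs-959874710265/athena-sql/scripts/athena-sql-script.sql"
)
print(bucket_name)
print(script_location)
```

With this helper, a mistyped prefix such as `3://` raises a `ValueError` instead of silently producing a bucket name.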

Now let's execute the script:

In [None]:
log_output_string=[]
for sql in str(contents).split(";")[:-1]:
    response=executeQuery(sql)
print ("\n".join(log_output_string))
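
The loop above assumes the script ends with a `;`, because `[:-1]` always discards the last fragment, and it keeps going after a failed statement. The sketch below shows one possible variation: it drops only empty fragments and stops at the first statement that does not succeed. The names `split_statements` and `run_script` are hypothetical, and the split is still naive: a semicolon inside a string literal or a comment would break it.

```python
def split_statements(script):
    """Split a SQL script on ';' and drop fragments that are empty or whitespace."""
    return [part.strip() for part in script.split(";") if part.strip()]


def run_script(script, execute):
    """Run each statement with `execute`; stop at the first that does not succeed.

    `execute` is expected to return an Athena get_query_execution response,
    as the executeQuery helper in this notebook does.
    """
    executed = []
    for sql in split_statements(script):
        response = execute(sql)
        state = response['QueryExecution']['Status']['State']
        if state != 'SUCCEEDED':
            raise RuntimeError("Statement ended in state %s: %s" % (state, sql))
        executed.append(sql)
    return executed


# Works with or without a trailing semicolon.
print(split_statements("SELECT 1;\nSELECT 2;\n"))
print(split_statements("SELECT 1; SELECT 2"))
```

In the notebook this could be called as `run_script(contents, executeQuery)`.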

Our pipeline is complete, and the log shows the results of the SQL script run. Let us read the final report data as a pandas DataFrame:

In [None]:
read_from_athena("Select * from default.nyc_top_trips_report")
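
`read_from_athena` reads the result CSV straight from S3. An alternative is to fetch the rows through the Athena API with the `get_query_results` paginator. The sketch below is illustrative only: `rows_to_records` and `fetch_records` are hypothetical helper names, the sample page uses made-up column names and values rather than real query output, and the code assumes the query is a `SELECT`, for which Athena returns the column names as the first row.

```python
def rows_to_records(pages):
    """Turn get_query_results pages into a list of dicts keyed by column name."""
    header = None
    records = []
    for page in pages:
        for row in page['ResultSet']['Rows']:
            # Every value arrives as a string; a NULL cell has no 'VarCharValue'.
            values = [cell.get('VarCharValue') for cell in row['Data']]
            if header is None:
                header = values  # first row of a SELECT result: column names
            else:
                records.append(dict(zip(header, values)))
    return records


def fetch_records(athena, query_execution_id):
    """Page through the results of a finished query using a boto3 Athena client."""
    paginator = athena.get_paginator('get_query_results')
    return rows_to_records(paginator.paginate(QueryExecutionId=query_execution_id))


# Made-up page in the shape Athena returns, for illustration only.
sample_pages = [{'ResultSet': {'Rows': [
    {'Data': [{'VarCharValue': 'zone'}, {'VarCharValue': 'trips'}]},
    {'Data': [{'VarCharValue': 'A'}, {'VarCharValue': '10'}]},
    {'Data': [{'VarCharValue': 'B'}, {}]},
]}}]
print(rows_to_records(sample_pages))
```

The resulting list can be passed to `pd.DataFrame(...)`; since all values are strings, numeric columns would still need an explicit conversion.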

## Activity 2 : Deploying the AWS Glue Python Shell Job

As a final step, we will deploy this pipeline as an AWS Glue Python Shell job and execute it.

Note that an AWS Glue Python Shell job can use 1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A single DPU provides processing capacity that consists of 4 vCPUs of compute and 16 GB of memory. For our use case, 0.0625 DPU is sufficient.
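
As a quick worked example of what the smaller setting means, the arithmetic below scales the per-DPU figures quoted above by the chosen capacity. It assumes resources scale linearly with the DPU fraction; `dpu_share` is a hypothetical helper written for this illustration.

```python
VCPUS_PER_DPU = 4       # from the DPU definition above
MEMORY_GB_PER_DPU = 16  # from the DPU definition above


def dpu_share(max_capacity):
    """Return (vCPUs, memory in GB) for a DPU fraction, assuming linear scaling."""
    return max_capacity * VCPUS_PER_DPU, max_capacity * MEMORY_GB_PER_DPU


print(dpu_share(0.0625))  # (0.25, 1.0)
print(dpu_share(1))       # (4, 16)
```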

In [None]:
import boto3

acct_number=boto3.client('sts').get_caller_identity().get('Account')
bucket='glue-labs-959874710265'

# Create the Glue Python Shell job
glue = boto3.client("glue")

for job_name in ['Build_NYC_Top_Trips_Report']:
    response=glue.create_job(Name=job_name,
                         Role="arn:aws:iam::%s:role/GlueServiceRole"%acct_number,
                         ExecutionProperty={'MaxConcurrentRuns': 1},
                         Command={'Name': 'pythonshell',
                                  'ScriptLocation': 's3://%s/athena-sql/scripts/%s.py'%(bucket,job_name),
                                  'PythonVersion': '3'},
                         DefaultArguments={'--TempDir': 's3://%s/temp'%bucket,
                                           '--enable-metrics': '',
                                           '--job-language': 'python',
                                           '--S3_BUCKET': bucket },
                         MaxRetries=0,
                         Timeout=2880,
                         MaxCapacity=0.0625,
                         GlueVersion='1.0',
                         Tags={'Owner': 'AWS_Glue_Labs'}
                        )
    print (response)
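
The `DefaultArguments` above, such as `--S3_BUCKET`, reach the job script on its command line. The job script itself is not shown in this notebook; inside a Glue job, `getResolvedOptions` from `awsglue.utils` is the usual way to read such arguments. Because `awsglue` is only available in the Glue environment, the sketch below uses `argparse` to mimic that behavior so it runs anywhere. `read_job_arguments` is a hypothetical helper, and the sample `argv` is an assumption about how the arguments would appear.

```python
import argparse


def read_job_arguments(argv, names):
    """Return the named '--NAME value' arguments from argv as a dict.

    Arguments that were not asked for (for example --TempDir) are ignored.
    """
    parser = argparse.ArgumentParser(allow_abbrev=False)
    for name in names:
        parser.add_argument('--' + name, required=True)
    known, _unknown = parser.parse_known_args(argv)
    return vars(known)


# Sample command-line arguments built from the DefaultArguments above.
sample_argv = ['--S3_BUCKET', 'glue-labs-959874710265',
               '--TempDir', 's3://glue-labs-959874710265/temp']
args = read_job_arguments(sample_argv, ['S3_BUCKET'])
print(args['S3_BUCKET'])
```

In a real job script the list passed in would typically be `sys.argv[1:]`.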

Now that the AWS Glue job is deployed, let's execute it:

- Navigate to the AWS Glue Console -> Jobs.
- Select the 'Build_NYC_Top_Trips_Report' job.
- Click Action -> Run Job to execute the job.

We can monitor the execution details from the AWS Glue console and, once the job finishes, view the logs by clicking the 'Logs' link.
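
The console steps above can also be scripted. The sketch below starts a run with the boto3 Glue client's `start_job_run` and polls `get_job_run` until the run leaves its in-progress states. `run_glue_job` is a hypothetical helper, the 30-second poll interval is an arbitrary choice, and the client is passed in so the function can be exercised without AWS access.

```python
import time

# States in which a Glue job run is still in progress; anything else is final.
IN_PROGRESS_STATES = {'STARTING', 'RUNNING', 'STOPPING', 'WAITING'}


def run_glue_job(glue, job_name, poll=30, sleep=time.sleep):
    """Start a Glue job run and block until it reaches a final state.

    `glue` is a boto3 Glue client. Returns the final JobRun description.
    """
    run_id = glue.start_job_run(JobName=job_name)['JobRunId']
    print('Job run ID: ' + run_id)
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)['JobRun']
        state = run['JobRunState']
        print(state)
        if state not in IN_PROGRESS_STATES:
            return run
        sleep(poll)
```

A call such as `run_glue_job(boto3.client('glue', region_name='eu-west-1'), 'Build_NYC_Top_Trips_Report')` would then run the job deployed above; if the final `JobRunState` is not `SUCCEEDED`, the returned dictionary may carry an `ErrorMessage` worth checking.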

## Wrap-up

In this notebook, we ran exercises to:

1. Execute a light-weight, SQL-driven ETL pipeline using Amazon Athena.
2. Deploy the pipeline as an AWS Glue Python Shell job.

We hope this lab helped you to understand how to leverage the simplicity and power of Python in your Data Pipelines using AWS Glue Python Shell Jobs.