#### Configure Crawler
* Under Analytics, choose the AWS Glue service.
* In the Crawlers section, click Add crawler.
    * Crawler: A crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your data catalog.
* Give your crawler a name. I have named mine My-Glue-Catalog.
* Choose the crawler source type. I have chosen Data stores.
    * Crawler source types:
        * Data stores – Amazon S3, JDBC, DynamoDB, MongoDB etc.
        * Existing catalog tables
<img src="Data_store.png" width="700">
* Choose a data store. In my case I have chosen S3.
* Choose an existing IAM Role. I have created an IAM Role My-Glue-Access-Role with the following policies:
    * AmazonS3FullAccess
    * AWSGlueServiceRole
<img src="IAM_Role.png" width="700">
* Choose a schedule to run this crawler (Daily, Hourly, Weekly, Monthly). I have chosen the default On-demand.
* Choose the crawler's output database. This is where the crawler stores the metadata table that describes the S3 data source. Our Python script reads this table as its input catalog table.
<img src="Output.png" width="700">
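The same crawler could also be set up programmatically. The following is a minimal sketch using the boto3 Glue client (`create_crawler` and `start_crawler`), not the method used in this walkthrough. The helper names and the input S3 path are assumptions made for illustration; the crawler name, role and database name are the ones used on this page.

```python
def build_crawler_config(name, role, database, s3_path):
    """Assemble keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role,              # IAM role the crawler assumes
        "DatabaseName": database,  # Data Catalog database that receives the table
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # No "Schedule" entry: the crawler then runs only on demand.
    }


def create_and_start_crawler(glue_client, config):
    """Create the crawler, then trigger a single on-demand run."""
    glue_client.create_crawler(**config)
    glue_client.start_crawler(Name=config["Name"])


crawler_config = build_crawler_config(
    name="My-Glue-Catalog",
    role="My-Glue-Access-Role",
    database="input_database",
    s3_path="s3://covid-19-tracker-2020/glue/input_data",  # hypothetical input path
)

# To run against a real account (requires boto3 and AWS credentials):
#   import boto3
#   create_and_start_crawler(boto3.client("glue"), crawler_config)
```

Passing the client in as an argument keeps the configuration easy to inspect before anything is created in AWS.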

#### Configure Jobs
A job is the business logic that performs the ETL work.
* Choose a name for your job. I have chosen the name My-Spark-Job.
* Attach your IAM role to this job.
* Select the type of job (Spark, Spark Streaming or Python shell). I have selected Spark because this is a batch job.
* Select the Glue version. I have selected Glue 2.0 since it has support for Python 3.
<img src="Job.png" width="700">
* Choose the S3 path where the script will be saved.
* Under job parameters, set Worker type to Standard and Number of workers to the minimum of 2, since we are dealing with a small dataset.
    * Worker types supported:
        * Standard
        * G.1X – for memory-intensive jobs
        * G.2X – for Machine Learning applications
* Choose your data source. In our case it is the Data Catalog table that we created with the crawler.
<img src="Data_Source.png" width="700">
* Set the transformation type to Change Schema and click Next.
* Under data target, choose Amazon S3 as the data store and define a path. This creates an output catalog table and stores the table in our data store in CSV format.
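The console choices above map onto the boto3 Glue `create_job` and `start_job_run` calls. The sketch below is one way this might look, under assumptions: the helper names and the script location are hypothetical, and the Standard worker type with Glue 2.0 simply mirrors the settings chosen on this page rather than a verified recommendation.

```python
def build_job_config(name, role, script_location):
    """Assemble keyword arguments for the Glue create_job call."""
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "2.0",
        "WorkerType": "Standard",
        "NumberOfWorkers": 2,
    }


def create_and_run_job(glue_client, config):
    """Create the job, start one run and return the run id."""
    glue_client.create_job(**config)
    run = glue_client.start_job_run(JobName=config["Name"])
    return run["JobRunId"]


job_config = build_job_config(
    name="My-Spark-Job",
    role="My-Glue-Access-Role",
    # Hypothetical script location; use the S3 path chosen in the console.
    script_location="s3://covid-19-tracker-2020/glue/scripts/My-Spark-Job.py",
)

# To run against a real account (requires boto3 and AWS credentials):
#   import boto3
#   run_id = create_and_run_job(boto3.client("glue"), job_config)
```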

#### Proposed script and updates (My-Spark-Job.py)
* Script updates
<code>
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
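from awsglue.context import GlueContext  # Wraps the Apache Spark SQL SQLContext object
from awsglue.dynamicframe import DynamicFrame  # Used below by DynamicFrame.fromDF
from awsglue.job import Job  # Provides the job object on which job.commit() is called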
</code>
<img src="Script.png" width="700">
* Using the glueContext object, we create a DynamicFrame (datasource0).
<code>
# DynamicFrame for the Public Health Infobase catalog table
## @type: DataSource
## @args: [database = "input_database", table_name = "phi_ca_csv", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="input_database", table_name="phi_ca_csv", transformation_ctx="datasource0")  # Returns a DynamicFrame
</code>
DynamicFrames:
* In a DynamicFrame, records are represented in a flexible, self-describing way that preserves information about schema inconsistencies in the data.
* DynamicFrames are also integrated with the AWS Glue Data Catalog, so creating frames from tables is a simple operation.
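To make "schema inconsistencies" concrete, here is a plain-Python illustration. It is not Glue code and not how Glue is implemented. It mimics what happens when one field holds values of different types: the set of types seen per field is tracked, and a `make_cols`-style resolution (the option that appears commented out in the sink step of this script) splits the ambiguous field into one column per type. The records and field names are invented for the example, and Glue itself names the columns after its own type names (for example `cases_string` rather than `cases_str`).

```python
from collections import defaultdict


def infer_field_types(records):
    """Collect every Python type name observed for each field."""
    types = defaultdict(set)
    for record in records:
        for field, value in record.items():
            types[field].add(type(value).__name__)
    return {field: sorted(names) for field, names in types.items()}


def make_cols(records):
    """Split each mixed-type field into one column per observed type."""
    field_types = infer_field_types(records)
    resolved = []
    for record in records:
        row = {}
        for field, value in record.items():
            if len(field_types[field]) > 1:
                # Ambiguous field: suffix the column with the value's type.
                row[f"{field}_{type(value).__name__}"] = value
            else:
                row[field] = value
        resolved.append(row)
    return resolved


# Invented sample rows: "cases" is an int in one row and a string in the other.
records = [
    {"province": "Ontario", "cases": 120},
    {"province": "Quebec", "cases": "n/a"},
]

print(infer_field_types(records))
print(make_cols(records))
```

A plain Spark DataFrame would have to commit to a single type for `cases` up front, whereas a DynamicFrame keeps both possibilities until a resolution is chosen.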
<code>
# DynamicFrame for the Johns Hopkins catalog table
datasource1 = glueContext.create_dynamic_frame.from_catalog(database="input_database", table_name="john_hopkins_csv", transformation_ctx="datasource1")
</code>
* Convert your DynamicFrame to an Apache Spark DataFrame by converting DynamicRecords into DataFrame fields.
<code>
phi_df = datasource0.toDF()  # Returns a DataFrame
jh_df = datasource1.toDF()
</code>
* After performing our DataFrame operations, we convert the resulting DataFrame back into a DynamicFrame.
<code>
# df_Final is our DataFrame, glueContext is the GlueContext object,
# and 'df_Final_Dynamic' is the name given to the resulting DynamicFrame.
df_Final_Dynamic = DynamicFrame.fromDF(df_Final, glueContext, 'df_Final_Dynamic')
</code>
* Finally, we write the DynamicFrame to our output sink.
<code>
# df1 = ResolveChoice.apply(df_Final_Dynamic, choice = "make_cols")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://covid-19-tracker-2020/glue/output_data"}, format = "csv", transformation_ctx = "datasink2"]
## @return: datasink2
## @inputs: [frame = df_Final_Dynamic]
datasink2 = glueContext.write_dynamic_frame.from_options(frame=df_Final_Dynamic, connection_type="s3", connection_options={"path": "s3://covid-19-tracker-2020/glue/output_data"}, format="csv", transformation_ctx="datasink2")
job.commit()
</code>
The from_options method of the DynamicFrameWriter class takes, among others, the following parameters:
* frame – The DynamicFrame to write.
* connection_type – The connection type. Valid values include s3, mysql, postgresql, redshift, sqlserver, and oracle.
* connection_options – Connection options, such as path and database table (optional). For a connection_type of s3, an Amazon S3 path is defined.
<code>
connection_options={"path": "s3://covid-19-tracker-2020/glue/output_data"}
</code>
* Output CSV data is stored in s3://covid-19-tracker-2020/glue/output_data
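
Once the job has finished, the output location can be checked from Python. This is a small sketch assuming a boto3 S3 client and its `list_objects_v2` call; the helper name is made up for illustration, and the bucket and prefix are the ones from the script's connection_options.

```python
def list_output_keys(s3_client, bucket, prefix):
    """Return the object keys found under the given prefix.

    A single list_objects_v2 call returns at most 1000 keys, which is
    enough for the handful of CSV part files a small job writes.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when nothing matches the prefix.
    return [obj["Key"] for obj in response.get("Contents", [])]


# To run against a real account (requires boto3 and AWS credentials):
#   import boto3
#   keys = list_output_keys(
#       boto3.client("s3"), "covid-19-tracker-2020", "glue/output_data"
#   )
#   print(keys)
```

An empty list suggests that the job has not written anything yet or that the prefix does not match the path given to the sink.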