#### Updates to our Spark-Job.py script
* The paths to our S3 data are now passed as command-line arguments instead of being hardcoded:
<code>
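import sys
phi_path = sys.argv[1]           # first argument: input path of the PHI data
john_hopkins_path = sys.argv[2]  # second argument: input path of the Johns Hopkins data
output_path = sys.argv[3]        # third argument: path where the output is written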
</code>
<code>
phi_df = spark. \
    read. \
    format("csv"). \
    option("inferSchema","true"). \
    option("header", "true"). \
    load(phi_path)  # path passed as an argument
</code>
<code>
jh_df = spark. \
    read. \
    format("csv"). \
    schema("jh_Date date, jh_Country string, jh_Province string, jh_Lat double, jh_Long double, \
    jh_Confirmed integer, jh_Recovered integer, jh_Deaths integer"). \
    option("header", "true"). \
    load(john_hopkins_path)
</code>
<code>
df_Final.write. \
    format("csv"). \
    mode("overwrite"). \
    save(output_path)
</code>
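
Because the script now relies on three positional arguments, it can help to fail fast with a usage message when one is missing. The following is a minimal sketch, not part of the original script; the helper name `parse_paths` is hypothetical, and the paths in the example are the ones used later in this guide.

```python
def parse_paths(argv):
    """Return (phi_path, john_hopkins_path, output_path) from an argv-style list.

    Hypothetical helper: argv[0] is the script name, followed by exactly
    three paths, in the same order the script reads them from sys.argv.
    """
    if len(argv) != 4:
        raise SystemExit(
            "usage: Spark-Job.py <phi_path> <john_hopkins_path> <output_path>"
        )
    return argv[1], argv[2], argv[3]


# Example with an explicit list; inside the job you would pass sys.argv instead.
phi_path, john_hopkins_path, output_path = parse_paths([
    "Spark-Job.py",
    "s3://covid-19-tracker-2020/data/phi_CA.csv",
    "s3://covid-19-tracker-2020/data/john_hopkins.csv",
    "s3://covid-19-tracker-2020/output/tracker_output",
])
print(phi_path, john_hopkins_path, output_path)
```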

#### Steps to Create a new AWS Data Pipeline service to automate creation of AWS EMR jobs
* Under the Analytics section, choose the Data Pipeline service and click Create new pipeline.
* Give your pipeline a name. I have named mine My-Spark-Data-Pipeline.
* Under Build using a template, choose Run Job on Elastic MapReduce cluster.
  * Under the EMR step(s) section, enter the following:
  <code>
  command-runner.jar,spark-submit,--deploy-mode,cluster,s3://covid-19-tracker-2020/python/Spark-Job.py,s3://covid-19-tracker-2020/data/phi_CA.csv,s3://covid-19-tracker-2020/data/john_hopkins.csv,s3://covid-19-tracker-2020/output/tracker_output
  </code>
  command-runner.jar – This application can run a number of tools as EMR steps, including the following:
* <b>hadoop-streaming</b>
    * Submit a Hadoop streaming program. In the console and some SDKs, this is a streaming step.
* <b>hive-script</b>
    * Run a Hive script. In the console and SDKs, this is a Hive step.
* <b>spark-submit</b>
    * Run a Spark application. In the console, this is a Spark step. This is the one we use: instead of running spark-submit manually in bash, we pass it to command-runner.jar as an EMR step.
* Set both the core and master instance types to m5a.xlarge and the EMR release label to emr-5.31.0.
* Under Bootstrap action(s), enter the bootstrap script path:
<code>
s3://covid-19-tracker-2020/bootstrap/bootstrap.sh
</code>
* Specify a schedule for running your pipeline:
    I have configured the job to run once every day, starting on pipeline activation.
    <img src="Schedule.png" width="700">
* Optionally, enable logging, and choose the default security roles.
* Click on Edit in Architect.
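
The EMR step(s) field above takes a single comma-separated string, which is easy to mistype by hand. As a rough sketch (the function `build_emr_step` and the `BUCKET` constant are hypothetical names introduced here for illustration), the same string can be assembled from its parts:

```python
def build_emr_step(script_path, *script_args):
    """Build the comma-separated step string: command-runner.jar, then
    spark-submit in cluster deploy mode, then the script and its arguments."""
    parts = [
        "command-runner.jar",
        "spark-submit",
        "--deploy-mode",
        "cluster",
        script_path,
        *script_args,
    ]
    for part in parts:
        # A comma inside a part would be read as a separator, so reject it.
        if "," in part:
            raise ValueError(f"step part must not contain a comma: {part!r}")
    return ",".join(parts)


BUCKET = "s3://covid-19-tracker-2020"  # bucket used throughout this guide

step = build_emr_step(
    f"{BUCKET}/python/Spark-Job.py",
    f"{BUCKET}/data/phi_CA.csv",
    f"{BUCKET}/data/john_hopkins.csv",
    f"{BUCKET}/output/tracker_output",
)
print(step)
```

The three arguments after the script path arrive in Spark-Job.py as `sys.argv[1]`, `sys.argv[2]` and `sys.argv[3]`, in that order.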

#### Pipeline definition updates
* In your JSON configuration, add a new key-value pair, "applications": "spark", to the EmrCluster object:
<code>
{
    "taskInstanceType": "#{myTaskInstanceType}",
    "coreInstanceCount": "#{myCoreInstanceCount}",
    "masterInstanceType": "#{myMasterInstanceType}",
    "releaseLabel": "#{myEMRReleaseLabel}",
    "type": "EmrCluster",
    "terminateAfter": "50 Minutes",
    "bootstrapAction": "#{myBootstrapAction}",
    "taskInstanceCount": "#{myTaskInstanceCount}",
    "name": "EmrClusterObj",
    "coreInstanceType": "#{myCoreInstanceType}",
    "keyPair": "#{myEC2KeyPair}",
    "id": "EmrClusterObj",
    "applications": "spark"
}
</code>

  By default, AWS Data Pipeline does not install the Spark libraries when the cluster boots, as AWS EMR does, so we have to specify the application manually.
* Now click on Activate to activate the pipeline.
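
If you keep the pipeline definition as a JSON file rather than editing it in Architect, the same change can be scripted. This is only a sketch under stated assumptions: it assumes the definition has a top-level `"objects"` list, the function name `add_spark_application` is hypothetical, and the sample definition below is trimmed down for illustration (the `EmrActivityObj` entry is a placeholder).

```python
import json


def add_spark_application(definition):
    """Set "applications": "spark" on every EmrCluster object in the definition."""
    for obj in definition.get("objects", []):
        if obj.get("type") == "EmrCluster":
            obj["applications"] = "spark"
    return definition


# Trimmed-down, illustrative definition; a real one has more fields per object.
definition = {
    "objects": [
        {
            "id": "EmrClusterObj",
            "name": "EmrClusterObj",
            "type": "EmrCluster",
            "releaseLabel": "#{myEMRReleaseLabel}",
        },
        {"id": "EmrActivityObj", "type": "EmrActivity"},
    ]
}

updated = add_spark_application(definition)
print(json.dumps(updated, indent=4))
```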

#### Tracking the pipeline progress on AWS EMR
* Under the Steps section, we can see that the cluster has executed two steps:
<img src="Steps.png" width="700">
    * Install <b>TaskRunner</b> – In this step, our pipeline object EmrClusterObj unpacks various Hadoop and JDBC libraries on our cluster, including
    <code>
    ZIP_FILE = http://datapipeline-us-east-1.s3.amazonaws.com/us-east-1/software/latest/TaskRunner/TaskRunner-1.0.zip
    MYSQL_FILE = http://datapipeline-us-east-1.s3.amazonaws.com/us-east-1/software/latest/TaskRunner/mysql-connector-java-bin.jar
    HIVE_CSV_SERDE_FILE = http://datapipeline-us-east-1.s3.amazonaws.com/us-east-1/software/latest/TaskRunner/csv-serde.jar
    HADOOP_CLASSPATH:/mnt/taskRunner/common/mysql-connector-java-bin.jar:/etc/hadoop/hive/lib/hive-exec.jar
    </code>
    * <b>EmrActivity</b> – This step executes our main Spark-Job.py script through command-runner.jar, which runs the spark-submit command.
    The first run of our Data Pipeline has finished successfully:
    <img src="EmrActivity.png" width="700">
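
The same check can be done programmatically instead of in the console. The sketch below only summarizes a response shaped like the one returned by boto3's EMR `list_steps` call (a `"Steps"` list whose entries carry a `"Name"` and a `"Status"` with a `"State"`); the function names are hypothetical, and the sample response is made up for illustration rather than copied from a real cluster.

```python
def summarize_steps(response):
    """Map each step name to its state from a list_steps-style response dict."""
    return {
        step["Name"]: step["Status"]["State"]
        for step in response.get("Steps", [])
    }


def all_steps_completed(response):
    """Return True only if there is at least one step and all are COMPLETED."""
    states = list(summarize_steps(response).values())
    return bool(states) and all(state == "COMPLETED" for state in states)


# Illustrative sample. With AWS credentials configured, a real response could
# come from: boto3.client("emr").list_steps(ClusterId=<your cluster id>)
sample_response = {
    "Steps": [
        {"Name": "EmrActivity", "Status": {"State": "COMPLETED"}},
        {"Name": "Install TaskRunner", "Status": {"State": "COMPLETED"}},
    ]
}

print(summarize_steps(sample_response))
print(all_steps_completed(sample_response))
```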