#### Log in to the AWS Management Console and choose EMR under the Analytics section. Then follow the steps below in the EMR service
* Go to Create Cluster.
* Go to Advanced Options.
* Under the Software Configuration section, choose emr-5.32.0 as the Release and select the software you need. I selected Hadoop 2.10.1, Hive 2.3.7 and Spark 2.4.7. Click Next.
* Under the Cluster Nodes and Instances section, choose 1 m5a.xlarge Master node and 2 m5a.xlarge Core nodes. You can also add Task nodes, which provide extra compute capacity without storing HDFS data. Click Next.
* Give your cluster a name. I named mine Spark Cluster. Enable logging on the cluster and choose an S3 folder for the logs.
* Under Bootstrap actions, add your bootstrap script. Mine is bootstrap_EMR.sh, stored under the path s3://covid-19-tracker-2020/bootstrap/.
The bootstrap script contains the following code:
    <code>
    #!/bin/bash
    aws s3 cp s3://covid-19-tracker-2020/data/CA__covid19__latest.csv /home/hadoop/data/
    aws s3 cp s3://covid-19-tracker-2020/data/time-series-19-covid-combined.csv /home/hadoop/data/
    aws s3 cp s3://covid-19-tracker-2020/python/Spark-Job.py /home/hadoop/python/
    </code>
    
* Choose your custom EC2 Key pair. 
* Click on Create Cluster.
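
The console steps above can also be expressed as a single API request. The following is a minimal sketch rather than the method used in this walkthrough: it builds the keyword arguments that boto3's `run_job_flow` call accepts, using the release, applications, instance types and bootstrap path from the steps above. The log folder, the key pair name and the default EMR roles (`EMR_DefaultRole`, `EMR_EC2_DefaultRole`) are assumptions, so replace them with your own values.

```python
def build_cluster_request(log_uri, key_pair_name):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    log_uri and key_pair_name are placeholders supplied by the caller;
    they are not taken from the walkthrough.
    """
    return {
        "Name": "Spark Cluster",
        "ReleaseLabel": "emr-5.32.0",
        "LogUri": log_uri,
        "Applications": [{"Name": name} for name in ("Hadoop", "Hive", "Spark")],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5a.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5a.xlarge",
                    "InstanceCount": 2,
                },
            ],
            "Ec2KeyName": key_pair_name,
            # Keep the cluster running so that we can SSH in afterwards.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "Copy data and script",
                "ScriptBootstrapAction": {
                    "Path": "s3://covid-19-tracker-2020/bootstrap/bootstrap_EMR.sh"
                },
            }
        ],
        # Assumed: the default roles that EMR creates for new accounts.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request):
    """Send the request to EMR and return the new cluster ID."""
    import boto3  # imported lazily so the request can be built without boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]
```

Building the request separately from sending it makes it easy to print and review the configuration before any cluster is started.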
    

#### SSH into the master node of your EMR cluster and perform the following steps
* Make a new directory in the HDFS file system
    <code>
    hdfs dfs -mkdir -p /user/hadoop/data/
    </code>

* Copy the data files from the local Linux file system to HDFS
    <code>
    hdfs dfs -put /home/hadoop/data/CA__covid19__latest.csv /user/hadoop/data/
    hdfs dfs -put /home/hadoop/data/time-series-19-covid-combined.csv /user/hadoop/data/
    </code>
    
* Update the Spark-Job.py Python script as follows
    <code>
    # The master("local") option is removed because the SparkSession is now built on the cluster instead of our local machine
    spark = SparkSession. \
    builder. \
    appName("Covid Tracker"). \
    getOrCreate()
    </code>
    <code>
    phi_df = spark. \
    read. \
    format("csv"). \
    option("inferSchema","true"). \
    option("header", "true"). \
    load("/user/hadoop/data/CA__covid19__latest.csv")  # HDFS path in place of the local machine path
    </code>
    <code>
    jh_df = spark. \
    read. \
    format("csv"). \
    schema("jh_Date date, jh_Country string, jh_Province string, jh_Lat double, jh_Long double, jh_Confirmed integer, \
    jh_Recovered integer, jh_Deaths integer"). \
    option("header", "true"). \
    load("/user/hadoop/data/time-series-19-covid-combined.csv")  # HDFS path containing our source data
    </code>
    <code>
    df_Final.write. \
    format("csv"). \
    mode("overwrite"). \
    save("s3://covid-19-tracker-2020/output/tracker_results/")  # The output path is an AWS S3 directory
    </code>

    The rest of the code remains the same.
* Run the Python job using the command below
    <code>
    spark-submit /home/hadoop/python/Spark-Job.py
    </code>
    
* The output containing the COVID-19 tracking data for Canada is written to the location below

    https://covid-19-tracker-2020.s3.amazonaws.com/output/tracker_results/
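
The shell commands in this section can be collected into a small driver script so that they run in the same order every time. This is a sketch under stated assumptions, not part of the original workflow: the `-p` and `-f` flags are added so that a second run does not fail on an existing directory or file, and the function names `build_commands` and `run_all` are hypothetical.

```python
import subprocess

DATA_FILES = ["CA__covid19__latest.csv", "time-series-19-covid-combined.csv"]
LOCAL_DATA_DIR = "/home/hadoop/data/"
HDFS_DATA_DIR = "/user/hadoop/data/"


def build_commands(script_path):
    """Return the commands from this section as argument lists, in run order."""
    # -p creates parent directories and does not fail if the directory exists.
    commands = [["hdfs", "dfs", "-mkdir", "-p", HDFS_DATA_DIR]]
    for name in DATA_FILES:
        # -f overwrites a copy left in HDFS by an earlier run.
        commands.append(
            ["hdfs", "dfs", "-put", "-f", LOCAL_DATA_DIR + name, HDFS_DATA_DIR]
        )
    commands.append(["spark-submit", script_path])
    return commands


def run_all(script_path):
    """Run each command and stop at the first one that fails."""
    for argv in build_commands(script_path):
        subprocess.run(argv, check=True)
```

On the master node, call `run_all` with the path of the Spark script that the bootstrap action copied to `/home/hadoop/python/`.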

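The `s3://` path passed to `save` and the HTTPS location listed above point to the same S3 prefix. The helper below is a small illustrative sketch of that mapping, using the virtual-hosted-style URL form; the function name `s3_uri_to_https` is hypothetical. Note that Spark writes a directory of part files rather than a single CSV, and that the objects can be fetched over HTTPS only if the bucket permissions allow it.

```python
from urllib.parse import urlparse


def s3_uri_to_https(uri):
    """Convert s3://bucket/key to https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected a URI of the form s3://bucket/key, got " + repr(uri))
    # netloc holds the bucket name; path holds the key with a leading slash.
    return "https://" + parsed.netloc + ".s3.amazonaws.com" + parsed.path


print(s3_uri_to_https("s3://covid-19-tracker-2020/output/tracker_results/"))
```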