**Installing the Spark Dependencies**

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Older Spark releases such as 3.3.0 are served from archive.apache.org
!wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar -xzf spark-3.3.0-bin-hadoop3.tgz
!pip install -q findspark
# Keep the pip package on the same version as the Spark build downloaded above
!pip install -q pyspark==3.3.0

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"
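
If Spark fails to start later, a wrong path in one of these two variables is a common cause. The following is a minimal sketch of a sanity check; `missing_dirs` is a hypothetical helper introduced here for illustration, not part of Spark or findspark.

```python
import os

def missing_dirs(names=("JAVA_HOME", "SPARK_HOME"), env=None):
    """Return the variables in `names` that are unset or do not point to a directory."""
    env = os.environ if env is None else env
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

problems = missing_dirs()
if problems:
    print("Check these variables before starting Spark:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```

When running locally rather than in Colab, both paths will most likely differ from the ones set above.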

Copy the dataset to local storage.

Download from: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

(**In Colab:** the downloaded file is stored in the "/content" folder.)

Dataset description can be found here: https://github.com/nytimes/covid-19-data

In [None]:
! wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

**Start your program**

In [None]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession

Create a Spark Session

In [None]:
# Put the master and UI port in one configuration so both settings take effect
conf = SparkConf().setMaster('local[*]').set('spark.ui.port', '4050')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

Load the dataset into a DataFrame

In [None]:
# Load the dataset
# NOTE: Fix your dataset location in case you run locally on your machine
data = spark.read.load('/content/us-counties.csv', format='csv', inferSchema=True, header=True)

# Print schema
data.printSchema()

In [None]:
# The number of rows in the dataset
data.count()

In [None]:
# See first 10 rows of the dataset
data.show(10)

**Task 0. Find the daily new cases across the entire US and plot them**
(*You do NOT need to do this; the code is given for your assistance.*)

In [None]:
from pyspark.sql import functions as F

# Aggregate by day, sum the cases for all counties for each day
daily_cumulative = data.groupby('date').agg(F.sum('cases').alias('total_cases'))
daily_cumulative = daily_cumulative.sort('date')
daily_cumulative.show(10)

# Convert the Spark DataFrame to a pandas DataFrame for plotting
plot_data = daily_cumulative.toPandas()
dates = plot_data['date']
values = plot_data['total_cases']

# Find daily new cases from cumulative cases
daily_cases = [values[i+1] - values[i] for i in range(len(values)-1)]
ddates = [dates[i+1] for i in range(len(values)-1)]
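
The differencing step above works because `cases` is a running total: each day's new cases are that day's total minus the previous day's. Here is a small sketch of the same idea on made-up numbers; `daily_new` and `toy_totals` are hypothetical names used only for illustration.

```python
def daily_new(cumulative):
    """Turn cumulative totals into day-over-day increases (the first day is dropped)."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

# Made-up cumulative totals for four consecutive days
toy_totals = [10, 15, 15, 22]
print(daily_new(toy_totals))  # [5, 0, 7]
```

The result has one element fewer than the input, which is why `ddates` starts at the second date.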

In [None]:
# Plot daily cases against dates
import matplotlib.pyplot as plt
import numpy as np

plt.plot(ddates, daily_cases)
plt.xlabel('Date')
plt.ylabel('Daily new cases')
plt.xticks(rotation=90)
plt.show()

*Now, solve the following tasks.*

Feel free to add more code blocks within each task to separate the code for better clarity and understanding.

**Task 1: Find the total number of new cases added in the entire US in the month of March 2020.**
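
Since the dataset reports cumulative counts, one way to reason about "new cases in a month" is to subtract the total on the last day before the month from the total on the month's last day. The sketch below shows this on made-up numbers in plain Python. `new_cases_in_month` and `toy` are hypothetical, and the dates are assumed to be ISO `YYYY-MM-DD` strings, which may differ from the column type Spark infers.

```python
def new_cases_in_month(daily_totals, month):
    """daily_totals maps 'YYYY-MM-DD' to cumulative cases; month is 'YYYY-MM'."""
    in_month = [d for d in daily_totals if d.startswith(month)]
    # ISO date strings sort chronologically, so plain string comparison works
    before = [d for d in daily_totals if d < month]
    if not in_month:
        return 0
    end_total = daily_totals[max(in_month)]
    start_total = daily_totals[max(before)] if before else 0
    return end_total - start_total

# Made-up cumulative totals, not real data
toy = {
    "2020-02-28": 60,
    "2020-02-29": 70,
    "2020-03-01": 90,
    "2020-03-31": 500,
    "2020-04-01": 650,
}
print(new_cases_in_month(toy, "2020-03"))  # 430
```

The same idea carries over to a Spark DataFrame once the daily totals have been aggregated as in Task 0.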

**Task 2: Calculate the total new cases added in each of the three consecutive months June, July, and August 2020 in Jackson County, Missouri (FIPS code 29095).**

The output should look like this:

June 2020 `cases`

July 2020 `cases`

August 2020 `cases`

**Task 3: Find the daily new cases per month per 1000 population in the state of Missouri (MO) since the beginning of the pandemic (assume MO's population is 6,154,913). [Plot the data]**

**Task 4: On which date did all 50 US states first have at least 100 cases? At least one death?**

**Task 5: Which single day in the years 2020 and 2021 had the largest number of deaths in the entire US (if there are multiple such dates, choose the earliest one)?**

Your programming assignment ends here.
Thank you.