Download the COVID-19 confirmed-cases CSV from the Johns Hopkins GitHub repository, clean it, and visualize it with Pandas. Then save the data to S3 in Parquet format and create a Hive table.

In [None]:
import boto3
import pandas as pd
from io import StringIO # python3; python2: BytesIO 

data_url = 'https://bit.ly/3d93pa1'  # confirmed cases from the Johns Hopkins GitHub repo
pdf = pd.read_csv(data_url)

Transpose and extract selected columns (countries)

In [None]:
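# Transpose so each row is a date and each column a country, dropping the non-date rows
pdf1 = pdf.T.drop(['Lat', 'Long', 'Province/State'])
new_header = pdf1.iloc[0]  # the 'Country/Region' row holds the country names
pdf2 = pdf1[1:]
pdf2.columns = new_header
# Transposing leaves the counts as object dtype; cast to integers so they can be plotted.
# astype returns a new frame, so adding 'date' below does not write into a slice of pdf2.
pdf3 = pdf2[['Brazil', 'India', 'Germany', 'Italy', 'Spain', 'US']].astype('int64')
pdf3['date'] = pd.date_range(start='1/22/2020', periods=len(pdf3), freq='D')
pdf3.columns = ['Brazil', 'India', 'Germany', 'Italy', 'Spain', 'US', 'date']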
pdf3.head()
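
The selection above works for countries that occupy a single row in the source file. For countries reported by province, one possible approach is to sum the provinces first and to parse the dates from the column labels rather than assuming a fixed start date. The following is a minimal sketch on a made-up frame; the `raw` and `by_country` names and all values are hypothetical, and the `m/d/yy` label format is an assumption about the source file.

```python
import pandas as pd

# Tiny synthetic stand-in for the source layout (all values are made up).
raw = pd.DataFrame({
    'Province/State': ['Ontario', 'Quebec', None],
    'Country/Region': ['Canada', 'Canada', 'Italy'],
    'Lat': [51.2, 52.9, 41.9],
    'Long': [-85.3, -73.5, 12.6],
    '1/22/20': [0, 0, 0],
    '1/23/20': [1, 2, 3],
})

# Sum provinces into one row per country, then transpose so dates become rows.
by_country = (raw.drop(columns=['Province/State', 'Lat', 'Long'])
                 .groupby('Country/Region')
                 .sum()
                 .T)

# Parse the m/d/yy labels into real timestamps (assumed label format).
by_country.index = pd.to_datetime(by_country.index, format='%m/%d/%y')
by_country.index.name = 'date'
print(by_country)
```

Because the counts never pass through a mixed-type transpose here, they stay numeric and need no cast.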

Plot the case increase over time

In [None]:
import matplotlib.pyplot as plt
# Line plot of cumulative confirmed cases over time, one line per country
pdf3.plot('date', ['US', 'Brazil', 'Italy', 'India', 'Spain', 'Germany'])
%matplot plt
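
The plot above shows cumulative totals. If the day-over-day increase is of interest instead, it can be derived with `diff()`. This is a small sketch on made-up numbers; the `cumulative` and `daily` names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up cumulative counts, for illustration only.
cumulative = pd.DataFrame({
    'date': pd.date_range(start='1/22/2020', periods=4, freq='D'),
    'US': [1, 1, 2, 5],
    'Italy': [0, 0, 0, 2],
})

# diff() subtracts the previous row from each row; the first row has no
# predecessor and becomes NaN, so it is filled with 0 before casting back to int.
daily = cumulative.set_index('date').diff().fillna(0).astype(int)
print(daily)
```

The resulting frame is indexed by date, so `daily.plot()` would draw one line per country without naming the x column.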

Plot a pie chart

In [None]:
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Brazil','India','Germany','Italy','Spain','US'
confirmed = pdf3.tail(1).iloc[0,0:6]
explode = (0, 0, 0, 0, 0, 0.1)  # only "explode" the 6th slice (i.e. 'US')

fig1, ax1 = plt.subplots()
ax1.pie(confirmed, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
%matplot plt
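
As a rough illustration of what the pie chart encodes, the sketch below builds the `explode` tuple from the labels, so the offset slice does not depend on its position, and computes the percentage that `autopct` prints for each slice. The counts are made up and the `share` name is hypothetical.

```python
import pandas as pd

labels = ['Brazil', 'India', 'Germany', 'Italy', 'Spain', 'US']
# Made-up latest counts, for illustration only.
confirmed = pd.Series([40, 30, 5, 5, 5, 15], index=labels)

# Offset only the slice whose label matches, wherever it sits in the list.
explode = tuple(0.1 if name == 'US' else 0 for name in labels)

# Each slice's percentage is its value divided by the total.
share = (confirmed / confirmed.sum() * 100).round(1)
print(share)
```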

Save the Pandas dataframe to S3 as CSV

In [None]:
bucket = 'mybucket' # already created on S3
csv_buffer = StringIO()
pdf3.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'pandas/cv.csv').put(Body=csv_buffer.getvalue())
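
To see what ends up in the uploaded object, the buffer can be inspected and read back locally without touching S3. The sketch below uses a made-up two-row frame; the `df`, `body`, and `restored` names are hypothetical. Passing `index=False` is an optional variation, assumed here because the index only repeats the dates already stored in the `date` column.

```python
import pandas as pd
from io import StringIO

# Made-up frame shaped like the one above: date-string index plus a date column.
df = pd.DataFrame({
    'US': [1, 2],
    'date': pd.date_range(start='1/22/2020', periods=2, freq='D'),
}, index=['1/22/20', '1/23/20'])

csv_buffer = StringIO()
df.to_csv(csv_buffer, index=False)  # skip the index column
body = csv_buffer.getvalue()        # the string that would be passed as Body=

# CSV carries no type information, so dates must be parsed again on read.
restored = pd.read_csv(StringIO(body), parse_dates=['date'])
print(restored.dtypes)
```

This loss of type information on a CSV round trip is the reason the Parquet step later in the notebook is useful.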

Convert Pandas dataframe (pdf3) to Spark dataframe (sdf)

In [None]:
from pyspark.sql import *
sdf = spark.createDataFrame(pdf3)
sdf.show(10)

Save the Spark dataframe to S3 in Parquet format so that the schema is preserved

In [None]:
sdf.write.parquet("s3://mybucket/parquet/",mode="overwrite")

Save the Spark dataframe (sdf) to a Hive table

In [None]:
sdf.createOrReplaceTempView("mytempTable")
sqlContext.sql("drop table if exists CV19")
sqlContext.sql("create table if not exists CV19 as select * from mytempTable")

Query the table

In [None]:
sqlContext.sql("select * from CV19").show(10)

Or use the %%sql magic to run queries directly

In [None]:
%%sql
select * from CV19 where US >= 6000000