Stream to parquet file
======================

This notebook sets up and runs a stream that reads the downloaded csv
files and a query that writes the streamed data to parquet files. The
idea is to perform the subsequent analysis on the parquet files.

Note that this notebook assumes that several "Our World in Data" dataset
csv files have already been downloaded. This can be done by first running
"DownloadFilesPeriodicallyScript" at least once.

Content is based on "038\_StructuredStreamingProgGuide" by Raazesh
Sainudiin.

Start by copying the latest downloaded csv data to the data analysis folder.

In [None]:
dbutils.fs.cp("file:///databricks/driver/projects/group12/logsEveryXSecs/","/datasets/group12/",true)

  

>     res0: Boolean = true

  

Check that the data is in the group12 folder.

In [None]:
display(dbutils.fs.ls("/datasets/group12/"))

  

[TABLE]

  

Check the schema of the csv files.

In [None]:
val df_csv = spark.read.format("csv").option("header", "true").option("inferSchema", "true").csv("/datasets/group12/21_01_07_09_05_33.csv")

  

>     df_csv: org.apache.spark.sql.DataFrame = [iso_code: string, continent: string ... 52 more fields]

In [None]:
df_csv.printSchema

  

>     root
>      |-- iso_code: string (nullable = true)
>      |-- continent: string (nullable = true)
>      |-- location: string (nullable = true)
>      |-- date: string (nullable = true)
>      |-- total_cases: double (nullable = true)
>      |-- new_cases: double (nullable = true)
>      |-- new_cases_smoothed: double (nullable = true)
>      |-- total_deaths: double (nullable = true)
>      |-- new_deaths: double (nullable = true)
>      |-- new_deaths_smoothed: double (nullable = true)
>      |-- total_cases_per_million: double (nullable = true)
>      |-- new_cases_per_million: double (nullable = true)
>      |-- new_cases_smoothed_per_million: double (nullable = true)
>      |-- total_deaths_per_million: double (nullable = true)
>      |-- new_deaths_per_million: double (nullable = true)
>      |-- new_deaths_smoothed_per_million: double (nullable = true)
>      |-- reproduction_rate: double (nullable = true)
>      |-- icu_patients: double (nullable = true)
>      |-- icu_patients_per_million: double (nullable = true)
>      |-- hosp_patients: double (nullable = true)
>      |-- hosp_patients_per_million: double (nullable = true)
>      |-- weekly_icu_admissions: double (nullable = true)
>      |-- weekly_icu_admissions_per_million: double (nullable = true)
>      |-- weekly_hosp_admissions: double (nullable = true)
>      |-- weekly_hosp_admissions_per_million: double (nullable = true)
>      |-- new_tests: double (nullable = true)
>      |-- total_tests: double (nullable = true)
>      |-- total_tests_per_thousand: double (nullable = true)
>      |-- new_tests_per_thousand: double (nullable = true)
>      |-- new_tests_smoothed: double (nullable = true)
>      |-- new_tests_smoothed_per_thousand: double (nullable = true)
>      |-- positive_rate: double (nullable = true)
>      |-- tests_per_case: double (nullable = true)
>      |-- tests_units: string (nullable = true)
>      |-- total_vaccinations: double (nullable = true)
>      |-- new_vaccinations: double (nullable = true)
>      |-- total_vaccinations_per_hundred: double (nullable = true)
>      |-- new_vaccinations_per_million: double (nullable = true)
>      |-- stringency_index: double (nullable = true)
>      |-- population: double (nullable = true)
>      |-- population_density: double (nullable = true)
>      |-- median_age: double (nullable = true)
>      |-- aged_65_older: double (nullable = true)
>      |-- aged_70_older: double (nullable = true)
>      |-- gdp_per_capita: double (nullable = true)
>      |-- extreme_poverty: double (nullable = true)
>      |-- cardiovasc_death_rate: double (nullable = true)
>      |-- diabetes_prevalence: double (nullable = true)
>      |-- female_smokers: double (nullable = true)
>      |-- male_smokers: double (nullable = true)
>      |-- handwashing_facilities: double (nullable = true)
>      |-- hospital_beds_per_thousand: double (nullable = true)
>      |-- life_expectancy: double (nullable = true)
>      |-- human_development_index: double (nullable = true)

  

The stream requires a user-defined schema. Note that the January 2021
schema differs from the December 2020 schema: it adds vaccination columns
and orders some of the test columns differently. One user-defined schema
is therefore created for each layout.
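
One way to obtain such a schema without typing every column by hand is to reuse the schema that a batch read infers. The following is only a minimal sketch under stated assumptions: the local session, the temporary folder and the two-column sample file are hypothetical stand-ins and are not part of this notebook.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

// In a Databricks notebook a session already exists; getOrCreate() then returns it.
val spark = SparkSession.builder().master("local[1]").appName("schema-sketch").getOrCreate()

// Throwaway sample file standing in for one downloaded csv (hypothetical contents).
val sampleDir = Files.createTempDirectory("owid_schema_sketch")
Files.write(
  sampleDir.resolve("sample.csv"),
  "iso_code,total_cases\nSWE,100.5\nNOR,50.0\n".getBytes(StandardCharsets.UTF_8))

// Batch read with inference, keeping only the resulting StructType.
val inferredSchema = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(sampleDir.toUri.toString)
  .schema

// The same StructType can then be handed to a streaming reader.
val sketchStream = spark.readStream
  .schema(inferredSchema)
  .option("header", "true")
  .csv(sampleDir.toUri.toString)
```

Writing the schema out explicitly, as this notebook does, keeps the column types under the author's control; reusing an inferred schema trades that control for brevity.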

In [None]:
import org.apache.spark.sql.types._

val OurWorldinDataSchema2021 = new StructType()                      
                      .add("iso_code", "string")
                      .add("continent", "string")
                      .add("location", "string")
                      .add("date", "string")
                      .add("total_cases","double")
                      .add("new_cases","double")
                      .add("new_cases_smoothed","double")
                      .add("total_deaths","double")
                      .add("new_deaths","double")
                      .add("new_deaths_smoothed","double")
                      .add("total_cases_per_million","double")
                      .add("new_cases_per_million","double")
                      .add("new_cases_smoothed_per_million","double")
                      .add("total_deaths_per_million","double")
                      .add("new_deaths_per_million","double")
                      .add("new_deaths_smoothed_per_million","double")
                      .add("reproduction_rate", "double")
                      .add("icu_patients", "double")
                      .add("icu_patients_per_million", "double")
                      .add("hosp_patients", "double")
                      .add("hosp_patients_per_million", "double")
                      .add("weekly_icu_admissions", "double")
                      .add("weekly_icu_admissions_per_million", "double")
                      .add("weekly_hosp_admissions", "double")
                      .add("weekly_hosp_admissions_per_million", "double")
                      .add("new_tests", "double")
                      .add("total_tests", "double")
                      .add("total_tests_per_thousand", "double")
                      .add("new_tests_per_thousand", "double")
                      .add("new_tests_smoothed", "double")
                      .add("new_tests_smoothed_per_thousand", "double")
                      .add("positive_rate", "double")
                      .add("tests_per_case", "double")
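                      .add("tests_units", "double") // note: inferSchema reported this column as string in the printSchema output above
                      .add("total_vaccinations", "double")
                      .add("new_vaccinations", "double")
                      // note: the inferred schema above also lists total_vaccinations_per_hundred
                      // and new_vaccinations_per_million at this position; they are not declared here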
                      .add("stringency_index","double")
                      .add("population","double")
                      .add("population_density","double")
                      .add("median_age", "double")
                      .add("aged_65_older", "double")
                      .add("aged_70_older", "double")
                      .add("gdp_per_capita","double")
                      .add("extreme_poverty","double")
                      .add("cardiovasc_death_rate","double")
                      .add("diabetes_prevalence","double")
                      .add("female_smokers", "double")
                      .add("male_smokers", "double")
                      .add("handwashing_facilities", "double")
                      .add("hospital_beds_per_thousand", "double")
                      .add("life_expectancy","double")
                      .add("human_development_index","double")

val OurWorldinDataSchema2020 = new StructType()                      
                      .add("iso_code", "string")
                      .add("continent", "string")
                      .add("location", "string")
                      .add("date", "string")
                      .add("total_cases","double")
                      .add("new_cases","double")
                      .add("new_cases_smoothed","double")
                      .add("total_deaths","double")
                      .add("new_deaths","double")
                      .add("new_deaths_smoothed","double")
                      .add("total_cases_per_million","double")
                      .add("new_cases_per_million","double")
                      .add("new_cases_smoothed_per_million","double")
                      .add("total_deaths_per_million","double")
                      .add("new_deaths_per_million","double")
                      .add("new_deaths_smoothed_per_million","double")
                      .add("reproduction_rate", "double")
                      .add("icu_patients", "double")
                      .add("icu_patients_per_million", "double")
                      .add("hosp_patients", "double")
                      .add("hosp_patients_per_million", "double")
                      .add("weekly_icu_admissions", "double")
                      .add("weekly_icu_admissions_per_million", "double")
                      .add("weekly_hosp_admissions", "double")
                      .add("weekly_hosp_admissions_per_million", "double")
                      .add("total_tests", "double")
                      .add("new_tests", "double")
                      .add("total_tests_per_thousand", "double")
                      .add("new_tests_per_thousand", "double")
                      .add("new_tests_smoothed", "double")
                      .add("new_tests_smoothed_per_thousand", "double")
                      .add("tests_per_case", "double")
                      .add("positive_rate", "double")
                      .add("tests_units", "double")
                      .add("stringency_index","double")
                      .add("population","double")
                      .add("population_density","double")
                      .add("median_age", "double")
                      .add("aged_65_older", "double")
                      .add("aged_70_older", "double")
                      .add("gdp_per_capita","double")
                      .add("extreme_poverty","double")
                      .add("cardiovasc_death_rate","double")
                      .add("diabetes_prevalence","double")
                      .add("female_smokers", "double")
                      .add("male_smokers", "double")
                      .add("handwashing_facilities", "double")
                      .add("hospital_beds_per_thousand", "double")
                      .add("life_expectancy","double")
                      .add("human_development_index","double")

  

>     import org.apache.spark.sql.types._
>     OurWorldinDataSchema2021: org.apache.spark.sql.types.StructType = StructType(StructField(iso_code,StringType,true), StructField(continent,StringType,true), StructField(location,StringType,true), StructField(date,StringType,true), StructField(total_cases,DoubleType,true), StructField(new_cases,DoubleType,true), StructField(new_cases_smoothed,DoubleType,true), StructField(total_deaths,DoubleType,true), StructField(new_deaths,DoubleType,true), StructField(new_deaths_smoothed,DoubleType,true), StructField(total_cases_per_million,DoubleType,true), StructField(new_cases_per_million,DoubleType,true), StructField(new_cases_smoothed_per_million,DoubleType,true), StructField(total_deaths_per_million,DoubleType,true), StructField(new_deaths_per_million,DoubleType,true), StructField(new_deaths_smoothed_per_million,DoubleType,true), StructField(reproduction_rate,DoubleType,true), StructField(icu_patients,DoubleType,true), StructField(icu_patients_per_million,DoubleType,true), StructField(hosp_patients,DoubleType,true), StructField(hosp_patients_per_million,DoubleType,true), StructField(weekly_icu_admissions,DoubleType,true), StructField(weekly_icu_admissions_per_million,DoubleType,true), StructField(weekly_hosp_admissions,DoubleType,true), StructField(weekly_hosp_admissions_per_million,DoubleType,true), StructField(new_tests,DoubleType,true), StructField(total_tests,DoubleType,true), StructField(total_tests_per_thousand,DoubleType,true), StructField(new_tests_per_thousand,DoubleType,true), StructField(new_tests_smoothed,DoubleType,true), StructField(new_tests_smoothed_per_thousand,DoubleType,true), StructField(positive_rate,DoubleType,true), StructField(tests_per_case,DoubleType,true), StructField(tests_units,DoubleType,true), StructField(total_vaccinations,DoubleType,true), StructField(new_vaccinations,DoubleType,true), StructField(stringency_index,DoubleType,true), StructField(population,DoubleType,true), StructField(population_density,DoubleType,true), 
StructField(median_age,DoubleType,true), StructField(aged_65_older,DoubleType,true), StructField(aged_70_older,DoubleType,true), StructField(gdp_per_capita,DoubleType,true), StructField(extreme_poverty,DoubleType,true), StructField(cardiovasc_death_rate,DoubleType,true), StructField(diabetes_prevalence,DoubleType,true), StructField(female_smokers,DoubleType,true), StructField(male_smokers,DoubleType,true), StructField(handwashing_facilities,DoubleType,true), StructField(hospital_beds_per_thousand,DoubleType,true), StructField(life_expectancy,DoubleType,true), StructField(human_development_index,DoubleType,true))
>     OurWorldinDataSchema2020: org.apache.spark.sql.types.StructType = StructType(StructField(iso_code,StringType,true), StructField(continent,StringType,true), StructField(location,StringType,true), StructField(date,StringType,true), StructField(total_cases,DoubleType,true), StructField(new_cases,DoubleType,true), StructField(new_cases_smoothed,DoubleType,true), StructField(total_deaths,DoubleType,true), StructField(new_deaths,DoubleType,true), StructField(new_deaths_smoothed,DoubleType,true), StructField(total_cases_per_million,DoubleType,true), StructField(new_cases_per_million,DoubleType,true), StructField(new_cases_smoothed_per_million,DoubleType,true), StructField(total_deaths_per_million,DoubleType,true), StructField(new_deaths_per_million,DoubleType,true), StructField(new_deaths_smoothed_per_million,DoubleType,true), StructField(reproduction_rate,DoubleType,true), StructField(icu_patients,DoubleType,true), StructField(icu_patients_per_million,DoubleType,true), StructField(hosp_patients,DoubleType,true), StructField(hosp_patients_per_million,DoubleType,true), StructField(weekly_icu_admissions,DoubleType,true), StructField(weekly_icu_admissions_per_million,DoubleType,true), StructField(weekly_hosp_admissions,DoubleType,true), StructField(weekly_hosp_admissions_per_million,DoubleType,true), StructField(total_tests,DoubleType,true), StructField(new_tests,DoubleType,true), StructField(total_tests_per_thousand,DoubleType,true), StructField(new_tests_per_thousand,DoubleType,true), StructField(new_tests_smoothed,DoubleType,true), StructField(new_tests_smoothed_per_thousand,DoubleType,true), StructField(tests_per_case,DoubleType,true), StructField(positive_rate,DoubleType,true), StructField(tests_units,DoubleType,true), StructField(stringency_index,DoubleType,true), StructField(population,DoubleType,true), StructField(population_density,DoubleType,true), StructField(median_age,DoubleType,true), StructField(aged_65_older,DoubleType,true), 
StructField(aged_70_older,DoubleType,true), StructField(gdp_per_capita,DoubleType,true), StructField(extreme_poverty,DoubleType,true), StructField(cardiovasc_death_rate,DoubleType,true), StructField(diabetes_prevalence,DoubleType,true), StructField(female_smokers,DoubleType,true), StructField(male_smokers,DoubleType,true), StructField(handwashing_facilities,DoubleType,true), StructField(hospital_beds_per_thousand,DoubleType,true), StructField(life_expectancy,DoubleType,true), StructField(human_development_index,DoubleType,true))
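
Because the csv reader by default applies a user-supplied schema by column position rather than by header name (the `enforceSchema` option), both the set of columns and their order matter. The sketch below shows one way to compare two schemas; `addedColumns`, `positionMismatches` and the toy schemas are hypothetical helpers introduced for illustration, not part of this notebook.

```scala
import org.apache.spark.sql.types.StructType

// Column names present in `newer` but absent from `older`.
def addedColumns(older: StructType, newer: StructType): List[String] =
  newer.fieldNames.filterNot(name => older.fieldNames.contains(name)).toList

// Positions at which the two schemas name different columns
// (compared only up to the length of the shorter schema).
def positionMismatches(a: StructType, b: StructType): List[(Int, String, String)] =
  a.fieldNames.zip(b.fieldNames).zipWithIndex.collect {
    case ((left, right), i) if left != right => (i, left, right)
  }.toList

// Toy schemas, much shorter than the real ones.
val olderToy = new StructType()
  .add("iso_code", "string")
  .add("total_tests", "double")
  .add("new_tests", "double")

val newerToy = new StructType()
  .add("iso_code", "string")
  .add("new_tests", "double")
  .add("total_tests", "double")
  .add("total_vaccinations", "double")

val added = addedColumns(olderToy, newerToy)
val moved = positionMismatches(olderToy, newerToy)
```

Applied to `OurWorldinDataSchema2020` and `OurWorldinDataSchema2021`, the same two calls would list the columns that differ between the two layouts.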

  

### Start stream

In January 2021, the schema was updated compared to the schema in
December 2020. Two streams are therefore defined, one for each type of
csv file; choose the one that matches the files to be streamed.
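
The two streams differ only in the schema and in the file glob, which is keyed on the two-digit year at the start of each file name. As a small sketch of that selection, assuming the file names always begin with the two-digit year (as in `21_01_07_09_05_33.csv`); `CsvSource` and `csvSourceFor` are hypothetical names:

```scala
// Hypothetical helper: pairs a two-digit year with the glob that selects its csv files.
final case class CsvSource(year: Int, glob: String)

def csvSourceFor(year: Int, baseDir: String = "/datasets/group12"): CsvSource = {
  require(year >= 0 && year <= 99, s"expected a two-digit year, got $year")
  // %02d keeps a leading zero, so that e.g. year 9 would become "09*.csv".
  CsvSource(year, f"$baseDir/$year%02d*.csv")
}

val source2020 = csvSourceFor(20)
val source2021 = csvSourceFor(21)
```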

Stream for 2020

In [None]:
import org.apache.spark.sql.types._

val OurWorldinDataStream = spark
  .readStream
  .schema(OurWorldinDataSchema2020) 
  .option("maxFilesPerTrigger", 1)
  .option("latestFirst", "true")
  .format("csv")
  .option("header", "true")
  .load("/datasets/group12/20*.csv")
  .dropDuplicates()

  

>     import org.apache.spark.sql.types._
>     OurWorldinDataStream: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iso_code: string, continent: string ... 48 more fields]

  

Stream for 2021

In [None]:
import org.apache.spark.sql.types._

val OurWorldinDataStream2021 = spark
  .readStream
  .schema(OurWorldinDataSchema2021) 
  .option("maxFilesPerTrigger", 1)
  .option("latestFirst", "true")
  .format("csv")
  .option("header", "true")
  .load("/datasets/group12/21*.csv")
  .dropDuplicates()

  

>     import org.apache.spark.sql.types._
>     OurWorldinDataStream2021: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iso_code: string, continent: string ... 50 more fields]

  

Display the 2020 stream.

In [None]:
OurWorldinDataStream.isStreaming

  

>     res81: Boolean = true

In [None]:
display(OurWorldinDataStream) 

  

  

### Query to File (2020)

A query that writes the stream to parquet files at periodic intervals.
The analysis is thereafter performed on the parquet files.

Create folders for the parquet files and the checkpoint data.

In [None]:
// remove any previous folders, if they exist (absolute paths, matching the paths used by the query)
dbutils.fs.rm("/datasets/group12/chkpoint", recurse = true)
dbutils.fs.rm("/datasets/group12/analysis", recurse = true)

  

>     res14: Boolean = true

In [None]:
dbutils.fs.mkdirs("/datasets/group12/chkpoint")

  

>     res15: Boolean = true

In [None]:
dbutils.fs.mkdirs("/datasets/group12/analysis")

  

>     res16: Boolean = true

  

Initialize the query that stores the selected columns in parquet files.

In [None]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val query = OurWorldinDataStream
                 .select($"iso_code", $"continent", $"location", $"date", $"total_cases", $"new_cases", $"new_cases_smoothed", $"total_deaths", $"new_deaths",$"new_deaths_smoothed", $"total_cases_per_million", $"new_cases_per_million", $"new_cases_smoothed_per_million", $"total_deaths_per_million", $"new_deaths_per_million", $"new_deaths_smoothed_per_million", $"reproduction_rate", $"icu_patients", $"icu_patients_per_million", $"hosp_patients", $"hosp_patients_per_million", $"weekly_icu_admissions", $"weekly_icu_admissions_per_million", $"weekly_hosp_admissions", $"weekly_hosp_admissions_per_million", $"total_tests",$"new_tests", $"total_tests_per_thousand", $"new_tests_per_thousand", $"new_tests_smoothed",$"new_tests_smoothed_per_thousand", $"tests_per_case", $"positive_rate", $"tests_units", $"stringency_index", $"population", $"population_density", $"median_age", $"aged_65_older", $"aged_70_older", $"gdp_per_capita", $"extreme_poverty", $"cardiovasc_death_rate", $"diabetes_prevalence", $"female_smokers", $"male_smokers", $"handwashing_facilities", $"hospital_beds_per_thousand", $"life_expectancy", $"human_development_index")
                 .writeStream
                 //.trigger(Trigger.ProcessingTime("20 seconds")) // debugging
                 .trigger(Trigger.ProcessingTime("86400 seconds")) // once per day (24 * 60 * 60 seconds)
                 .option("checkpointLocation", "/datasets/group12/chkpoint")
                 .format("parquet")  
                 .option("path", "/datasets/group12/analysis")
                 .start()
                 
query.awaitTermination() // hit cancel to terminate
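
The trigger interval in the query is written as a number of seconds. As a sketch of how a one-day interval can be expressed without computing the seconds by hand, using `java.util.concurrent.TimeUnit` (the value names are illustrative only):

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.streaming.Trigger

// Seconds in one day, derived rather than typed by hand.
val secondsPerDay: Long = TimeUnit.DAYS.toSeconds(1)

// Two ways of asking for one micro-batch per day.
val dailyFromUnits  = Trigger.ProcessingTime(1, TimeUnit.DAYS)
val dailyFromString = Trigger.ProcessingTime(s"$secondsPerDay seconds")
```

Either value can be passed to `.trigger(...)` in place of a hand-written string.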

  

Check the contents of the saved parquet files.

In [None]:
display(dbutils.fs.ls("/datasets/group12/analysis"))

  

[TABLE]

Truncated to 30 rows

In [None]:
val parquetFileDF = spark.read.parquet("dbfs:/datasets/group12/analysis/*.parquet")

  

>     parquetFileDF: org.apache.spark.sql.DataFrame = [iso_code: string, continent: string ... 48 more fields]

In [None]:
display(parquetFileDF.describe())

  

[TABLE]

Truncated to 12 cols

In [None]:
display(parquetFileDF.orderBy($"date".desc))

  

[TABLE]

Truncated to 30 rows

Truncated to 12 cols
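
The stream drops rows that are identical in every column, but rows that share a location and date while differing in some value are kept. A sketch of how such repeated keys could be counted, assuming `location` and `date` together are meant to identify a row; the toy rows and the local session are hypothetical stand-ins for `parquetFileDF`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}

// In a Databricks notebook a session already exists; getOrCreate() then returns it.
val spark = SparkSession.builder().master("local[1]").appName("dup-check-sketch").getOrCreate()
import spark.implicits._

// Toy rows standing in for the parquet contents (hypothetical values).
val toyRows = Seq(
  ("Sweden", "2020-12-01", 100.0),
  ("Sweden", "2020-12-01", 101.0), // same key, different value
  ("Norway", "2020-12-01", 50.0)
).toDF("location", "date", "total_cases")

// Keys (location, date) that occur more than once.
val repeatedKeys = toyRows
  .groupBy(col("location"), col("date"))
  .agg(count("*").as("n"))
  .filter(col("n") > 1)
```

Running the same aggregation on `parquetFileDF` would show whether the saved data contains more than one row per location and date.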

In [None]:
parquetFileDF.count()

  

>     res5: Long = 62500

  

### Query to File (2021)

A query that writes the stream to parquet files at periodic intervals.

In [None]:
// remove any previous folders, if they exist (absolute paths, matching the paths used by the query)
dbutils.fs.rm("/datasets/group12/chkpoint2021", recurse = true)
dbutils.fs.rm("/datasets/group12/analysis2021", recurse = true)

In [None]:
dbutils.fs.mkdirs("/datasets/group12/chkpoint2021")
dbutils.fs.mkdirs("/datasets/group12/analysis2021")

  

>     res18: Boolean = true

In [None]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val query = OurWorldinDataStream2021
                 .select($"iso_code", $"continent", $"location", $"date", $"total_cases", $"new_cases", $"new_cases_smoothed", $"total_deaths", $"new_deaths",$"new_deaths_smoothed", $"total_cases_per_million", $"new_cases_per_million", $"new_cases_smoothed_per_million", $"total_deaths_per_million", $"new_deaths_per_million", $"new_deaths_smoothed_per_million", $"reproduction_rate", $"icu_patients", $"icu_patients_per_million", $"hosp_patients", $"hosp_patients_per_million", $"weekly_icu_admissions", $"weekly_icu_admissions_per_million", $"weekly_hosp_admissions", $"weekly_hosp_admissions_per_million", $"total_tests",$"new_tests", $"total_tests_per_thousand", $"new_tests_per_thousand", $"new_tests_smoothed",$"new_tests_smoothed_per_thousand", $"tests_per_case", $"positive_rate", $"tests_units", $"stringency_index", $"population", $"population_density", $"median_age", $"aged_65_older", $"aged_70_older", $"gdp_per_capita", $"extreme_poverty", $"cardiovasc_death_rate", $"diabetes_prevalence", $"female_smokers", $"male_smokers", $"handwashing_facilities", $"hospital_beds_per_thousand", $"life_expectancy", $"human_development_index")
                 .writeStream
                 //.trigger(Trigger.ProcessingTime("20 seconds")) // debugging
                 .trigger(Trigger.ProcessingTime("86400 seconds")) // once per day (24 * 60 * 60 seconds)
                 .option("checkpointLocation", "/datasets/group12/chkpoint2021")
                 .format("parquet")  
                 .option("path", "/datasets/group12/analysis2021")
                 .start()
                 
query.awaitTermination() // hit cancel to terminate
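
With a processing-time trigger, `awaitTermination()` blocks until the query is cancelled. For a bounded run, for example while testing the pipeline, a one-shot trigger can be used instead. The following is a self-contained miniature of the csv-to-parquet flow under stated assumptions: the local session, the temporary folders, the two-column schema and the sample file are all hypothetical and only stand in for the real data.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType

// In a Databricks notebook a session already exists; getOrCreate() then returns it.
val spark = SparkSession.builder().master("local[1]").appName("once-sketch").getOrCreate()

// Temporary local folders standing in for the csv, parquet and checkpoint folders.
val inDir  = Files.createTempDirectory("sketch_csv")
val outDir = Files.createTempDirectory("sketch_parquet")
val chkDir = Files.createTempDirectory("sketch_chk")
Files.write(
  inDir.resolve("21_01_07_09_05_33.csv"),
  "iso_code,total_cases\nSWE,100.5\nNOR,50.0\n".getBytes(StandardCharsets.UTF_8))

val toySchema = new StructType().add("iso_code", "string").add("total_cases", "double")

val onceQuery = spark.readStream
  .schema(toySchema)
  .option("header", "true")
  .csv(inDir.toUri.toString)
  .writeStream
  .trigger(Trigger.Once()) // process the files that are available, then stop
  .option("checkpointLocation", chkDir.toUri.toString)
  .format("parquet")
  .option("path", outDir.toUri.toString)
  .start()

onceQuery.awaitTermination() // returns when the single batch has finished

// Batch read of what the stream wrote.
val written = spark.read.parquet(outDir.toUri.toString)
```

Newer Spark versions also offer `Trigger.AvailableNow()` for the same purpose; which one is available depends on the cluster's Spark version.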

In [None]:
val parquetFile2021DF = spark.read.parquet("dbfs:/datasets/group12/analysis2021/*.parquet")

  

>     parquetFile2021DF: org.apache.spark.sql.DataFrame = [iso_code: string, continent: string ... 48 more fields]

In [None]:
display(parquetFile2021DF.describe())

  

[TABLE]

Truncated to 12 cols