A project that uses Apache Spark with Scala to perform trend analysis on the JHU CSSE COVID-19 Dataset.
3D pulse-point map for confirmed COVID-19 cases.
Created with flourish.studio.
The main contributions of this project are:
- A Spark environment in which developers can perform high-speed interactive analysis on the JHU CSSE COVID-19 Dataset.
- Tools to load & pre-process the dataset's CSV files with the Spark SQL Dataset API.
- Tools to write query results to disk as CSV files for portability.
- Pre-defined queries to extract simple patterns & trends from the original dataset to facilitate more complex queries.
- A Driver program that demonstrates the above functionalities.
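
The load, clean and write steps listed above could look roughly like the following. This is a minimal sketch rather than the project's actual tools package: the object and method names, the file paths in `main`, the renamed column names and the choice to drop rows without coordinates are all assumptions made for illustration. It also uses an untyped `DataFrame` (a `Dataset[Row]`) instead of a typed Dataset.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadCleanWriteSketch {
  // Hypothetical helper: read a CSV file with a header row, rename the
  // slash-separated JHU column names, and drop rows with missing coordinates.
  def loadAndClean(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)
      .withColumnRenamed("Province/State", "province_state")
      .withColumnRenamed("Country/Region", "country_region")
      .na.drop(Seq("Lat", "Long"))

  // Hypothetical helper: write a query result as CSV with a header row.
  // coalesce(1) yields a single part file inside the output directory.
  def writeCsv(df: DataFrame, outDir: String): Unit =
    df.coalesce(1)
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv(outDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-clean-write-sketch")
      .master("local[*]")
      .getOrCreate()

    // Assumed input and output locations; adjust to the files in raw_data.
    val confirmed = loadAndClean(spark, "raw_data/time_series_covid19_confirmed_global.csv")
    writeCsv(confirmed, "output/confirmed_clean")

    spark.stop()
  }
}
```

Spark writes CSV output as a directory of part files, so the result of `writeCsv` is a folder rather than a single named file.
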
- Scala - 2.12.10
- sbt - 1.5.4
- Spark-Core - 3.1.2
- Spark-SQL - 3.1.2
- tools package - Tools to load, pre-process & clean the original CSV data (which contains errors) into Dataset objects, and to write Dataset query results to disk in CSV format.
- queries package - Tools to:
  - Merge segregated tables (confirmed, recovered & deaths) into unified tables.
  - Extract patterns & trends of interest, such as the growth rate of confirmed, recovered and death cases.
  - Convert the default column-timeseries to row-timeseries, to increase versatility for use with different visualization tools.
  - Run other queries of key interest, such as filtering by tropical vs. non-tropical countries or partitioning by season.
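
As an illustration of the queries described above, the sketch below converts a column-timeseries into a row-timeseries, merges two tables, and computes a day-over-day growth rate. It is not the project's queries package. The object and method names, the single `country_region` identifier column, the `M/d/yy` date-header format and the tiny in-memory tables are assumptions chosen to keep the example self-contained.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object QuerySketch {
  // Assumed identifier column; every other column is treated as a date column.
  val idCols: Seq[String] = Seq("country_region")

  // Column-timeseries -> row-timeseries: one output row per (id, date).
  def toRows(wide: DataFrame, valueName: String): DataFrame = {
    val dateCols = wide.columns.filterNot(c => idCols.contains(c))
    val pairs: Array[Column] = dateCols.map { c =>
      struct(lit(c).as("date_str"), col(s"`$c`").cast("long").as("value"))
    }
    val ids: Seq[Column] = idCols.map(c => col(c))
    val exploded: Seq[Column] = ids :+ explode(array(pairs: _*)).as("kv")
    val flattened: Seq[Column] = ids ++ Seq(
      to_date(col("kv.date_str"), "M/d/yy").as("date"),
      col("kv.value").as(valueName)
    )
    wide.select(exploded: _*).select(flattened: _*)
  }

  // Merge two segregated tables into one table keyed by (id, date).
  def merge(confirmed: DataFrame, deaths: DataFrame): DataFrame =
    toRows(confirmed, "confirmed")
      .join(toRows(deaths, "deaths"), idCols :+ "date", "outer")

  // Growth rate relative to the previous date; null when there is no
  // previous value or the previous value is zero.
  def withGrowth(rows: DataFrame, valueName: String): DataFrame = {
    val w = Window.partitionBy(idCols.map(c => col(c)): _*).orderBy(col("date"))
    val prev = lag(col(valueName), 1).over(w)
    rows.withColumn(s"${valueName}_growth",
      when(prev > 0, (col(valueName) - prev) / prev))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("query-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample values, for illustration only.
    val confirmed = Seq(("Testland", 10L, 15L, 30L))
      .toDF("country_region", "1/22/20", "1/23/20", "1/24/20")
    val deaths = Seq(("Testland", 0L, 1L, 1L))
      .toDF("country_region", "1/22/20", "1/23/20", "1/24/20")

    withGrowth(merge(confirmed, deaths), "confirmed").orderBy("date").show()
    spark.stop()
  }
}
```

The row-based shape is what most charting tools expect, since each date becomes a value in a column instead of a column name.
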
To-do list:
- Deploy as web application on AWS for public-use.
- Include built-in visualization tools, such as Apache Superset, to provide an integrated on-demand analysis & visualization platform.
Cloning the repository:
In your CLI, run: git clone https://github.com/kylejwolff/COVID-19-SPARK.git
Setting up the environment:
- Navigate to the root directory of the cloned repository, parent-directory/COVID-19-SPARK/, and rename build-template.sbt to build.sbt.
- In the build.sbt file, uncomment the appropriate code block to run the application on Spark 2.4.8 (DEPRECATED) or Spark 3.1.2.
- In your CLI, enter sbt run to import dependencies and run the application.
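
For orientation, a build.sbt for the Spark 3.1.2 configuration might contain something like the fragment below. It is a sketch built from the versions listed in this README, not a copy of build-template.sbt; the project name and version are placeholders, and the real template may include further settings.

```scala
// build.sbt (illustrative sketch; the repository's template may differ)
name := "covid-19-spark" // placeholder project name
version := "0.1.0"       // placeholder version

scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.2",
  "org.apache.spark" %% "spark-sql"  % "3.1.2"
)
```

The `%%` operator appends the Scala binary version (here `_2.12`) to the artifact name, so the Scala and Spark versions have to be compatible with each other.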
On start-up, the application by default loads and cleans all of the dataset CSV files in /raw_data into DataSet objects, so no further pre-processing is required. Once all pre-processing is completed, the Driver program pauses and waits for user input, as shown below:
Selecting any of the options runs the corresponding query. Details of each query are documented in the source code in Driver.main().
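
The pause-and-select behaviour described above can be sketched as a small menu loop. This is not the code in Driver.main(): the menu labels, the `q` quit key, and the `MenuEntry` and `handle` names are hypothetical, and the query bodies are stand-ins that only print a message.

```scala
import scala.io.StdIn

object DriverMenuSketch {
  // Hypothetical menu entry: a label plus the action to run when selected.
  final case class MenuEntry(label: String, run: () => Unit)

  // Placeholder queries; a real driver would call into the queries package.
  val queries: Map[String, MenuEntry] = Map(
    "1" -> MenuEntry("Confirmed cases by location", () => println("Running query 1 ...")),
    "2" -> MenuEntry("Confirmed cases as row-based timeseries", () => println("Running query 2 ..."))
  )

  // Runs the selected query. Returns false only when the user asks to quit.
  def handle(choice: String): Boolean = choice.trim match {
    case "q" =>
      false
    case key if queries.contains(key) =>
      queries(key).run()
      true
    case other =>
      println(s"Unknown option: $other")
      true
  }

  def main(args: Array[String]): Unit = {
    queries.toSeq.sortBy(_._1).foreach { case (key, entry) =>
      println(s"[$key] ${entry.label}")
    }
    println("[q] Quit")

    var running = true
    while (running) {
      val line = StdIn.readLine("> ")
      // readLine returns null at end of input, which also ends the loop.
      running = line != null && handle(line)
    }
  }
}
```

Keeping the dispatch in `handle` separate from the input loop makes it possible to exercise the menu logic without a terminal.
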
Although visualization is not yet a built-in feature, we have included several visuals produced with Flourish & Tableau from the exported results of queries run in this application.
Query 1: A pulse-point world map of all confirmed COVID-19 cases.
Query 2: A row-based timeseries for all confirmed COVID-19 cases visualized as a bar-chart race.
Kyle Wolff, Brian Jackman, Vincent Chooi.
This project uses the following license: <license_name>.