COVID-19-SPARK

A project that uses Apache Spark with Scala to perform trend analysis on the JHU CSSE COVID-19 Dataset.

3D pulse-point map for confirmed COVID-19 cases.
Created with flourish.studio.

The main contributions of this project are:

- A Spark environment in which developers can perform fast, interactive analysis on the JHU CSSE COVID-19 Dataset.
- Tools to load & pre-process the CSV dataset files with the Spark SQL DataSet API.
- Tools to write query results to disk as CSV files for portability.
- Pre-defined queries that extract simple patterns & trends from the original dataset, as building blocks for more complex queries.
- A Driver program that demonstrates the above functionality.

Technologies Used

Scala - 2.12.10
sbt - 1.5.4
Spark-Core - 3.1.2
Spark-SQL - 3.1.2

Features

tools package - Tools to load, pre-process & clean the original CSV data, which contains errors, into DataSet objects, and to write DataSet query results to disk in CSV format.
queries package - Tools to:
- Merge the separate tables (confirmed, recovered & deaths) into unified tables.
- Extract patterns & trends of interest, such as the growth rate of confirmed, recovered and death cases.
- Convert the default column-timeseries to row-timeseries, for easier use with different visualization tools.
- Run other queries of key interest, such as filtering by tropical vs. non-tropical countries or partitioning by season.
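
To make the column-to-row conversion and the growth-rate query concrete, here is a minimal sketch in plain Scala collections. It is not the code in the queries package, which works on Spark DataSet objects; the names WideRow, LongRow, toRowTimeseries and growthRates are hypothetical and exist only for this illustration. It assumes the usual JHU CSSE layout, where each date is a separate column.

```scala
// One wide (column-timeseries) record: a location plus one count per date column.
final case class WideRow(country: String, counts: Seq[Long])

// One long (row-timeseries) record: a single (location, date, count) observation.
final case class LongRow(country: String, date: String, count: Long)

object TimeseriesSketch {
  // Unpivot: pair every count with the header of the date column it came from.
  def toRowTimeseries(dates: Seq[String], rows: Seq[WideRow]): Seq[LongRow] =
    for {
      row <- rows
      (date, count) <- dates.zip(row.counts)
    } yield LongRow(row.country, date, count)

  // Day-over-day growth rate, (current - previous) / previous.
  // The result is None where the previous count is zero, since the rate is undefined there.
  def growthRates(counts: Seq[Long]): Seq[Option[Double]] =
    counts.zip(counts.drop(1)).map { case (prev, curr) =>
      if (prev == 0L) None else Some((curr - prev).toDouble / prev)
    }
}
```

The row-timeseries form yields one record per location and date, which is the shape most charting tools expect; the growth-rate helper returns one value fewer than its input because the first day has no predecessor.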

To-do list:

Deploy as a web application on AWS for public use.
Include built-in visualization tools, such as Apache Superset, to provide an integrated on-demand analysis & visualization platform.

Getting Started

Cloning the repository: In your CLI, run git clone https://github.com/kylejwolff/COVID-19-SPARK.git.

Setting up the environment:

Navigate to the root directory of the cloned repository (parent-directory/COVID-19-SPARK/) and rename build-template.sbt to build.sbt.
In the build.sbt file, uncomment the appropriate code block to run the application on either Spark 2.4.8 (DEPRECATED) or Spark 3.1.2.
In your CLI, enter sbt run to import dependencies and run the application.
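
For orientation, a build.sbt matching the versions listed under Technologies Used might look roughly like the fragment below. This is a sketch and not the contents of build-template.sbt, which may be organized differently; the project name and version are placeholders.

```scala
// Hypothetical build.sbt for the Spark 3.1.2 configuration.
name := "covid-19-spark"        // placeholder project name
version := "0.1.0"              // placeholder version

// Spark 3.1.2 artifacts are published for Scala 2.12.
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.2",
  "org.apache.spark" %% "spark-sql"  % "3.1.2"
)
```

The sbt version itself (1.5.4) is normally pinned in project/build.properties rather than in build.sbt.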

Usage

On start-up, the application by default loads and cleans all of the dataset CSV files in /raw_data into DataSet objects, so no further pre-processing is required. Once pre-processing is complete, the Driver program pauses and waits for user input.

Selecting any of the options runs the corresponding query. Details of each query are documented in the source code in Driver.main().
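
For readers new to Spark, the load-and-export cycle described in this section might be sketched as follows. This is an assumption-laden illustration, not the project's tools package: the object name, file name and output path are hypothetical, the only cleaning shown is replacing missing numeric values with 0, and it uses the untyped DataFrame form of the DataSet API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadAndExportSketch {
  // Read one CSV file with a header row, infer column types,
  // and replace missing numeric values with 0 (an assumed cleaning rule).
  def loadCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)
      .na.fill(0L)

  // Write a result as CSV with a header row.
  // Spark creates a directory of part files at outDir, not a single file.
  def writeCsv(result: DataFrame, outDir: String): Unit =
    result.write
      .option("header", "true")
      .mode("overwrite")
      .csv(outDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("covid-19-spark-sketch")
      .master("local[*]") // run locally on all available cores
      .getOrCreate()
    try {
      val confirmed = loadCsv(spark, "raw_data/confirmed.csv") // hypothetical file name
      writeCsv(confirmed, "output/confirmed_clean")            // hypothetical output path
    } finally spark.stop()
  }
}
```

Because Spark writes one part file per partition, a single output file would need something like coalesce(1) before the write, at the cost of parallelism.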

Visualizations

Although visualization is not yet a built-in feature, we have included several visuals produced with Flourish & Tableau from exported results of queries performed in this application.

Query 1: A pulse-point world map for all confirmed COVID-19 cases.

Query 2: A row-based timeseries for all confirmed COVID-19 cases visualized as a bar-chart race.

Contributors

Kyle Wolff, Brian Jackman, Vincent Chooi.

License

This project uses the following license: <license_name>.
