Project Description and Introduction
====================================

Project Members:

-   Chi Zhang, zchi@chalmers.se
-   Shuangshuang Chen, shuche@kth.se
-   Magnus Tarle, tarle@kth.se

The recorded presentation is available here:
https://drive.google.com/drive/folders/13XwlItZ\_qtOeBZ5TJfnCP1hqtQ9imRFq

Because the cluster was quite slow while we were recording and there was
too much to run within 20 minutes, we recorded the video after all the
code had finished running. Some parts of the video are sped up to meet
the 20-minute requirement.

Project Introduction - Analysis and Prediction of COVID-19 Data
---------------------------------------------------------------

In this project, we use both Scala (for the data processing part) and
Python (for the algorithm part). We process the data in a scalable way,
applying what we learned in the course.

### 1. Project Plan

In this project, we work with COVID-19 data and carry out the following
tasks:

1.  Introduction
2.  Structured streaming - update our database once a day.
3.  Preprocessing - clean the data.
4.  Visualization - visualize new cases on a world map, over different
    time spans.
5.  Statistical analysis - get the distributions, mean and standard
    deviation of different variables.
6.  Model 1: K-means - clustering of different countries.
7.  Model 2: Linear Regression - predict new cases of some countries from
    other countries.
8.  Model 3: Autoregressive integrated moving average (ARIMA) -
    prediction of new cases and new deaths from past values.
9.  Model 4: Gaussian Processes (GP) - apply Gaussian Processes (GP) to
    predict the mean and covariance of new cases and new deaths from
    past values.
10. Ending

### 2. Tools and methods

Deep learning methods are not well suited to this particular task.
Instead, we chose four machine learning methods for our clustering and
prediction tasks:

1.  Clustering model - K-means
2.  Time series model - Autoregressive integrated moving average (ARIMA)
3.  Gaussian Processes (GP)
4.  Linear Regression (LR)

### 3. Data resources

We found the following data resources and finally chose the 3rd dataset
because it has the most features for us to use.

1.  WHO: https://covid19.who.int/WHO-COVID-19-global-data.csv

2.  COVID Tracking Project API at Cambridge:
    https://api.covidtracking.com

3.  Data on COVID-19 (coronavirus) by Our World in Data: git repo
    https://github.com/owid/covid-19-data/tree/master/public/data

4.  Sweden (processed by Apify) API:
    https://api.apify.com/v2/datasets/Nq3XwHX262iDwsFJS/items?format=json&clean=1

Note that the preprocessing, visualization and analysis within these
notebooks were done on this 3rd dataset, downloaded to Databricks in
December 2020. One dataset from December 2020, "owid-covid-data.csv", can
also be found in the same Google Drive folder as the video presentation.

<!-- ### Some additional ideas (added by Magnus)
1. Join any of the above-mentioned COVID-19 APIs with another API source
to get new insights by looking at correlations and clustering. It might
be difficult to find a good, interesting source, but there seem to be
some data sources listed e.g. here:
https://www.octoparse.com/blog/big-data-70-amazing-free-data-sources-you-should-know-for-2017.
One could imagine e.g. crime vs covid or economic sector vs covid etc.
2. Perhaps a bit algorithm specific - apply Gaussian Processes (GP) to
predict mean and covariance. The reason is that I have for some time
wanted to learn GP as they seem pretty useful. If you want to have an
intro to GP, you could look at:
http://krasserm.github.io/2018/03/19/gaussian-processes/ and
https://medium.com/@dan_90739/being-bayesian-and-thinking-deep-time-series-prediction-with-uncertainty-25ff581b056c
-->

### 4. Useful links

1.  Description of each column in the "Our World in Data" dataset
    https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv
2.  If you want to have an intro to GP, you could look at:
    http://krasserm.github.io/2018/03/19/gaussian-processes/ and
    https://medium.com/@dan\_90739/being-bayesian-and-thinking-deep-time-series-prediction-with-uncertainty-25ff581b056c
3.  ARIMA - Autoregressive Integrated Moving Average model. It is widely
    used in time series analysis. See the definition here:
    https://en.wikipedia.org/wiki/Autoregressive\_integrated\_moving\_average

### 5. Meeting Records:

Weekly meetings for group discussion:

1.  2020-12-01 Tuesday:
    -   Discussed the project topic
    -   Found data resources
2.  2020-12-04 Friday:
    -   Understood dataset 3 and managed the data download
    -   Set up the processing pipeline into a dataframe
    -   Prepared for statistical analysis
    -   Selected the columns for analysis and machine learning
    -   Finished structured streaming for updating data
3.  2020-12-08 Tuesday:
    -   Explored each chosen column and plotted statistics
    -   Dealt with missing data
    -   Post-processed existing columns into useful features (for
        statistical analysis and prediction/regression)
4.  2020-12-16 Wednesday:
    -   Finished statistical analysis, including which correlations
        would be interesting to show
    -   Progress on visualization
    -   Progress on GP model
    -   Progress on LR model
    -   Progress on ARIMA model
5.  2020-12-18 Friday:
    -   Progress on K-means model
    -   Finished visualization
    -   Finished other models
    -   Finished evaluation
6.  2021-01-07 Thursday:
    -   Finished all models
    -   Discussed the final presentation

Project Progress
----------------

### 0. Introduction of the data

The columns we selected to analyze are:

-   `continent`
-   `location`
-   `date`
-   `total_cases`
-   `new_cases`
-   `total_deaths`
-   `new_deaths`
-   `reproduction_rate`
-   `icu_patients`
-   `hosp_patients`
-   `weekly_icu_admissions`
-   `weekly_hosp_admissions`
-   `total_tests`
-   `new_tests`
-   `stringency_index`
-   `population`
-   `population_density`
-   `median_age`
-   `aged_65_older`
-   `aged_70_older`
-   `gdp_per_capita`
-   `extreme_poverty`
-   `cardiovasc_death_rate`
-   `diabetes_prevalence`
-   `female_smokers`
-   `male_smokers`
-   `hospital_beds_per_thousand`
-   `life_expectancy`
-   `human_development_index`

### 1. Streaming

This part has been moved to separate files:
"DownloadFilesPeriodicallyScript" and "StreamToFile".

### 2. Preprocessing

To **rerun steps 1-5** done in the notebook at
`Workspace -> PATH_TO -> DataPreprocess`,

just `run` the following command as shown in the cell below:

`%scala   %run "PATH_TO/DataPreprocess"`

-   *Note:* If you have already evaluated the `%run ...` command above, then:
    -   first delete the cell by pressing the `x` in the top-right corner
        of the cell, and
    -   re-evaluate the `run` command above.

`%run "./02_DataPreprocess"`

  

### 3. Exploratory Analysis

`%run "./03_ExplosiveAnalysis"`

  

### 4. Visualization

`%run "./04_DataVisualize"`

  

### 5. Model 1: Clustering K-means

Clustering of countries based on the dataset features of each country.

This part is in notebook Clustering
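
To illustrate the idea behind K-means, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not the code from the Clustering notebook, which may use a library implementation; the toy points, the value of `k` and the helper names are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Toy 2-D "country feature" vectors (made up): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

In practice each country would be represented by a vector of normalized dataset features rather than two made-up coordinates.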

`%run "./05_Clustering"`

  

### 6. Model 2: Linear Regression (LR) Model

Prediction with constant values: predict new cases of some countries from
other countries.

This part is in notebook DataPrediction\_LR

`%run "./06_DataPredicton_LR"`

  

### 7. Model 3: Autoregressive integrated moving average (ARIMA)

Prediction of new cases and new deaths from past values.

This part is in notebook DataPrediction\_ARIMA
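
To make "prediction from past values" concrete, here is a hand-rolled sketch of the simplest non-trivial case, ARIMA(1,1,0): difference the series once, fit an AR(1) model to the differences by least squares, and integrate the forecasts back. The model order, the toy series and the function names are assumptions for illustration; the notebook may use a library implementation with a different order.

```python
def difference(series):
    """First-order differencing: the 'I' (d=1) part of ARIMA."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(x):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    var = sum((p - mean_prev) ** 2 for p in prev)
    if var == 0:
        # Constant differences: no slope can be estimated.
        return mean_curr, 0.0
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = cov / var
    return mean_curr - phi * mean_prev, phi


def forecast_arima_110(series, steps):
    """Forecast `steps` values ahead with an ARIMA(1,1,0) model."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    last_value, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # AR(1) step on the differences
        last_value = last_value + last_diff  # undo the differencing
        forecasts.append(last_value)
    return forecasts


# Toy cumulative series (made up) growing by one each day.
linear_forecast = forecast_arima_110([1, 2, 3, 4, 5, 6], steps=2)
# Toy series whose daily increments double: 1, 2, 4, 8.
doubling_forecast = forecast_arima_110([0, 1, 3, 7, 15], steps=1)
print(linear_forecast, doubling_forecast)
```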

`%run "./07_DataPredicton_ARIMA"`

  

### 8. Model 4: Gaussian Processes (GP)

Apply Gaussian Processes (GP) to predict the mean and covariance of new
cases and new deaths from past values.

This part is in notebook DataPrediction\_GP
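
As a rough illustration of what a GP prediction computes, here is a minimal sketch of GP regression in plain Python. It assumes an RBF kernel with unit length scale and a small noise term, and it returns the posterior mean and the pointwise variance (the diagonal of the posterior covariance). The kernel choice, hyperparameters, toy data and function names are assumptions, not the settings used in the DataPrediction\_GP notebook.

```python
import math


def rbf(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel for scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_predict(train_x, train_y, test_x, noise=1e-6):
    """Posterior mean and pointwise variance at each test input."""
    K = [
        [rbf(a, b) + (noise if i == j else 0.0)
         for j, b in enumerate(train_x)]
        for i, a in enumerate(train_x)
    ]
    alpha = solve(K, train_y)  # K^{-1} y
    means, variances = [], []
    for xs in test_x:
        k_star = [rbf(xs, a) for a in train_x]
        v = solve(K, k_star)  # K^{-1} k_*
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        variances.append(rbf(xs, xs) - sum(k * w for k, w in zip(k_star, v)))
    return means, variances


# Toy data (made up): day index vs. an observed value.
train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 4.0]
means, variances = gp_predict(train_x, train_y, [1.0, 10.0])
print(means, variances)
```

Near an observed point the posterior mean follows the data and the variance shrinks towards the noise level; far from the data the prediction falls back to the prior (mean 0, variance 1 under these assumptions).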


  

>     res14: Long = 62500

  

>     root
>      |-- iso_code: string (nullable = true)
>      |-- continent: string (nullable = false)
>      |-- location: string (nullable = true)
>      |-- date: string (nullable = true)
>      |-- total_cases: double (nullable = false)
>      |-- new_cases: double (nullable = true)
>      |-- new_cases_smoothed: double (nullable = false)
>      |-- total_deaths: double (nullable = false)
>      |-- new_deaths: double (nullable = true)
>      |-- new_deaths_smoothed: double (nullable = false)
>      |-- reproduction_rate: double (nullable = false)
>      |-- icu_patients: double (nullable = true)
>      |-- icu_patients_per_million: double (nullable = true)
>      |-- hosp_patients: double (nullable = true)
>      |-- hosp_patients_per_million: double (nullable = true)
>      |-- weekly_icu_admissions: double (nullable = true)
>      |-- weekly_icu_admissions_per_million: double (nullable = true)
>      |-- weekly_hosp_admissions: double (nullable = true)
>      |-- weekly_hosp_admissions_per_million: double (nullable = true)
>      |-- total_tests: double (nullable = false)
>      |-- new_tests: double (nullable = true)
>      |-- total_tests_per_thousand: double (nullable = true)
>      |-- new_tests_per_thousand: double (nullable = true)
>      |-- new_tests_smoothed: double (nullable = true)
>      |-- new_tests_smoothed_per_thousand: double (nullable = true)
>      |-- tests_per_case: double (nullable = true)
>      |-- positive_rate: double (nullable = true)
>      |-- tests_units: double (nullable = true)
>      |-- stringency_index: double (nullable = false)
>      |-- population: double (nullable = true)
>      |-- population_density: double (nullable = true)
>      |-- median_age: double (nullable = true)
>      |-- aged_65_older: double (nullable = true)
>      |-- aged_70_older: double (nullable = true)
>      |-- gdp_per_capita: double (nullable = true)
>      |-- extreme_poverty: double (nullable = true)
>      |-- cardiovasc_death_rate: double (nullable = true)
>      |-- diabetes_prevalence: double (nullable = true)
>      |-- female_smokers: double (nullable = true)
>      |-- male_smokers: double (nullable = true)
>      |-- handwashing_facilities: double (nullable = true)
>      |-- hospital_beds_per_thousand: double (nullable = true)
>      |-- life_expectancy: double (nullable = true)
>      |-- human_development_index: double (nullable = true)
>      |-- total_cases_per_million: double (nullable = true)
>      |-- new_cases_per_million: double (nullable = true)
>      |-- new_cases_smoothed_per_million: double (nullable = true)
>      |-- total_deaths_per_million: double (nullable = true)
>      |-- new_deaths_per_million: double (nullable = true)
>      |-- new_deaths_smoothed_per_million: double (nullable = true)

  

>     iso_code: 0
>     continent: 0
>     location: 0
>     date: 0
>     total_cases: 0
>     new_cases: 0
>     new_cases_smoothed: 0
>     total_deaths: 0
>     new_deaths: 0
>     new_deaths_smoothed: 0
>     reproduction_rate: 0
>     icu_patients: 36018
>     icu_patients_per_million: 36018
>     hosp_patients: 34870
>     hosp_patients_per_million: 34870
>     weekly_icu_admissions: 41062
>     weekly_icu_admissions_per_million: 41062
>     weekly_hosp_admissions: 40715
>     weekly_hosp_admissions_per_million: 40715
>     total_tests: 0
>     new_tests: 20510
>     total_tests_per_thousand: 0
>     new_tests_per_thousand: 20510
>     new_tests_smoothed: 18176
>     new_tests_smoothed_per_thousand: 18176
>     tests_per_case: 19301
>     positive_rate: 19749
>     tests_units: 41600
>     stringency_index: 0
>     population: 0
>     population_density: 0
>     median_age: 0
>     aged_65_older: 0
>     aged_70_older: 0
>     gdp_per_capita: 0
>     extreme_poverty: 11168
>     cardiovasc_death_rate: 0
>     diabetes_prevalence: 0
>     female_smokers: 0
>     male_smokers: 0
>     handwashing_facilities: 24124
>     hospital_beds_per_thousand: 0
>     life_expectancy: 0
>     human_development_index: 0
>     total_cases_per_million: 0
>     new_cases_per_million: 0
>     new_cases_smoothed_per_million: 0
>     total_deaths_per_million: 0
>     new_deaths_per_million: 0
>     new_deaths_smoothed_per_million: 0
>     import org.apache.spark.sql.functions._
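
The listing above appears to report, for each column, how many rows are still missing a value. Assuming that reading, the counting step can be sketched in plain Python as follows; the rows are made up and `count_missing` is a hypothetical helper, not the project's Scala code.

```python
def count_missing(rows):
    """Return {column: number of rows where the value is None or empty}."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts.setdefault(col, 0)
            if value is None or value == "":
                counts[col] += 1
    return counts


# Illustrative rows (made-up values), not taken from the dataset.
rows = [
    {"location": "A", "new_cases": 5.0, "icu_patients": None},
    {"location": "B", "new_cases": None, "icu_patients": None},
    {"location": "C", "new_cases": 2.0, "icu_patients": 7.0},
]
missing = count_missing(rows)
for col, n in missing.items():
    print(f"{col}: {n}")
```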

  


  

>     import org.apache.spark.sql.functions._
>     import org.apache.spark.sql.expressions.Window

  

>     res32: Long = 159

  

>     df_filteredLocation: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iso_code: string, continent: string ... 48 more fields]
>     df_fillContinentNull: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iso_code: string, continent: string ... 48 more fields]
>     res15: df_filteredLocation.type = [iso_code: string, continent: string ... 48 more fields]

  


  

>     root
>      |-- iso_code: string (nullable = true)
>      |-- stringency_index: double (nullable = false)
>      |-- population: double (nullable = true)
>      |-- population_density: double (nullable = true)
>      |-- gdp_per_capita: double (nullable = true)
>      |-- diabetes_prevalence: double (nullable = true)
>      |-- total_cases_per_million: double (nullable = true)
>      |-- total_cases: double (nullable = false)
>      |-- normal_stringency_index: double (nullable = true)
>      |-- normal_population: double (nullable = true)
>      |-- normal_population_density: double (nullable = true)
>      |-- normal_gdp_per_capita: double (nullable = true)
>      |-- normal_diabetes_prevalence: double (nullable = true)
>      |-- log_total_cases_per_million: double (nullable = true)
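
The `normal_*` and `log_*` columns in this schema suggest that the features were rescaled and the target log-transformed. A minimal sketch of both steps follows, assuming min-max scaling to [0, 1] and a natural logarithm; both are guesses from the column names and the value ranges in the output, and the numbers are made up.

```python
import math


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Natural logarithm of strictly positive values."""
    return [math.log(v) for v in values]


# Made-up values standing in for gdp_per_capita and total_cases_per_million.
gdp_per_capita = [1000.0, 5000.0, 9000.0]
total_cases_per_million = [10.0, 1000.0, 50000.0]

normal_gdp_per_capita = min_max(gdp_per_capita)
log_total_cases_per_million = log_transform(total_cases_per_million)
print(normal_gdp_per_capita)
print(log_total_cases_per_million)
```

Scaling puts features with very different units on a comparable range, and the log transform compresses the heavy right tail of the case counts.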

  

>     df_by_location_normalized_selected: org.apache.spark.sql.DataFrame = [normal_stringency_index: double, normal_population: double ... 4 more fields]

  


  

>     df_filtered_date: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iso_code: string, continent: string ... 48 more fields]
>     res17: df_fillContinentNull.type = [iso_code: string, continent: string ... 48 more fields]

  

  

>     import org.apache.spark.ml.feature.VectorAssembler
>     vectorizer: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_d4f952d2fd6f, handleInvalid=error, numInputCols=5
>     dataset: org.apache.spark.sql.DataFrame = [features: vector, log_total_cases_per_million: double]

  


  

>     split20: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, log_total_cases_per_million: double]
>     split80: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, log_total_cases_per_million: double]

  

>     testSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, log_total_cases_per_million: double]
>     trainingSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, log_total_cases_per_million: double]
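
The names `split20` and `split80` suggest a random 20/80 split into test and training sets. A minimal sketch in plain Python, assuming each row is assigned independently at random; the seed and the function name are arbitrary choices made for illustration.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train with probability `train_fraction`."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# 159 placeholder rows, one per location.
rows = list(range(159))
training_set, test_set = random_split(rows)
print(len(training_set), len(test_set))
```

With per-row random assignment the split sizes are only approximately 80/20, which would be consistent with the counts 26 and 133 (summing to 159) reported in the output that follows.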

  

>     res43: Long = 26

  


  

>     res44: Long = 133

  

>     import org.apache.spark.ml.regression.LinearRegression
>     import org.apache.spark.ml.regression.LinearRegressionModel
>     import org.apache.spark.ml.Pipeline
>     lr: org.apache.spark.ml.regression.LinearRegression = linReg_1077c2744dad
>     lrModel: org.apache.spark.ml.regression.LinearRegressionModel = LinearRegressionModel: uid=linReg_1077c2744dad, numFeatures=5

  

>     Coefficients: [2.6214237261444726,-1.364306221013195,-1.3234981005291744,4.903123743799156,1.0283056897021905], Intercept: 6.69144939405342
>     RMSE: 1.6058962464052953
>     trainingSummary: org.apache.spark.ml.regression.LinearRegressionTrainingSummary = org.apache.spark.ml.regression.LinearRegressionTrainingSummary@7545e754

  

>     +------------------+---------------------------+--------------------+
>     |        prediction|log_total_cases_per_million|            features|
>     +------------------+---------------------------+--------------------+
>     |  6.32849475311866|          2.142543078223737|[0.15958180147058...|
>     | 8.115061043502047|          7.528810569839765|[0.36167279411764...|
>     |7.4629978084926085|          6.223671398897003|[0.64889705882352...|
>     | 7.831137150153837|          6.524287884057365|[0.79779411764705...|
>     |10.269420769779098|          5.843997267897207|[0.40429687499999...|
>     | 7.289562258542387|         2.6304048908829563|[0.49471507352941...|
>     | 9.963465130096221|          9.237218853465539|[0.57444852941176...|
>     |13.213369998258525|         10.784078124343976|[0.74460018382352...|
>     | 9.213228147281594|         10.572977149641726|[0.75528492647058...|
>     | 9.085379666714468|          9.143147511395949|[0.82444852941176...|
>     | 6.750739656770264|         10.952187342908323|[0.0,3.6806047717...|
>     | 8.135800825140635|         10.373288860272808|[0.51056985294117...|
>     | 7.507644816227433|         10.203096222487993|[0.55319393382352...|
>     |10.048805671611255|          8.817232647019427|[0.56376378676470...|
>     | 9.404481665413662|         10.158884909113675|[0.61695772058823...|
>     | 9.488257390217871|         10.351014339169227|[0.64889705882352...|
>     |  9.08870238448793|         10.291011744026449|[0.71806066176470...|
>     |  8.89927809231843|         10.041688001727678|[0.72334558823529...|
>     |  8.89927809231843|         10.041688001727678|[0.72334558823529...|
>     |  9.35306553340342|          9.427461709192647|[0.75528492647058...|
>     +------------------+---------------------------+--------------------+
>     only showing top 20 rows
>
>     Root Mean Squared Error (RMSE) on test data = 2.225906214656472
>     import org.apache.spark.ml.evaluation.RegressionEvaluator
>     predictions: org.apache.spark.sql.DataFrame = [features: vector, log_total_cases_per_million: double ... 1 more field]
>     evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = RegressionEvaluator: uid=regEval_4b8a404d0038, metricName=rmse, throughOrigin=false
>     rmse: Double = 2.225906214656472
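
To show how the fitted model and the RMSE relate, here is a small sketch that applies the coefficients and intercept printed above and computes an RMSE by hand. The order of the five features and the example feature vector are assumptions; the three prediction/label pairs are the first rows of the table above, so the resulting value covers only those rows, not the full test set.

```python
import math

# Coefficients and intercept as printed in the training output above.
COEFFICIENTS = [
    2.6214237261444726,
    -1.364306221013195,
    -1.3234981005291744,
    4.903123743799156,
    1.0283056897021905,
]
INTERCEPT = 6.69144939405342


def predict(features):
    """Linear model: intercept plus the dot product with the coefficients."""
    return INTERCEPT + sum(w * x for w, x in zip(COEFFICIENTS, features))


def rmse(predictions, labels):
    """Root mean squared error between predictions and labels."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical normalized feature vector (all five features at 0.5).
example_prediction = predict([0.5] * 5)

# First three (prediction, label) pairs from the table above.
predictions = [6.32849475311866, 8.115061043502047, 7.4629978084926085]
labels = [2.142543078223737, 7.528810569839765, 6.223671398897003]
partial_rmse = rmse(predictions, labels)
print(example_prediction, partial_rmse)
```

Because the target is `log_total_cases_per_million`, this RMSE is measured in log units.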

  

>     df_cleaned_feature_permillion: org.apache.spark.sql.DataFrame = [iso_code: string, continent: string ... 48 more fields]

  

>     Root Mean Squared Error (RMSE) on test data = $rmse
>     evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = RegressionEvaluator: uid=regEval_4bcd4d071608, metricName=rmse, throughOrigin=false
>     rmse: Double = 99893.56063834664

  


`%run "./08_DataPrediction_GP"`