# Notebook 1
# ML Prediction of COVID-19 fatalities
This notebook extracts data from [COVID-19 Data Lake](https://azure.microsoft.com/en-au/services/open-datasets/catalog/ecdc-covid-19-cases/) published on Microsoft Azure Open Datasets.<br>
The dataset is the latest available public data on the geographic distribution of COVID-19 cases worldwide from the *European Centre for Disease Prevention and Control (ECDC)*. Each row contains the number of new cases and deaths reported per day and per country.
More information on the dataset and COVID-19 Data Lake can be found [*here*](https://azure.microsoft.com/en-au/services/open-datasets/catalog/ecdc-covid-19-cases/).

Import the PySpark ML modules used in this notebook.

In [3]:
#Load all the necessary pyspark libraries required for this notebook
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer, VectorAssembler, StringIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

Access details for the Azure Blob Storage account that hosts the latest COVID-19 cases dataset (the SAS token is left empty).

In [5]:
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet"
blob_sas_token = r""

Configure Spark with the details above so that it can read from Blob Storage.

In [7]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
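
The path and the configuration key above follow a fixed string pattern. As a minimal sketch in plain Python (no Spark session required), the helpers `build_wasbs_path` and `build_sas_conf_key` below are hypothetical names that only mirror the formatting used in the cell above:

```python
def build_wasbs_path(container: str, account: str, relative_path: str) -> str:
    """Return the wasbs:// URI for a blob, formatted as in the cell above."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def build_sas_conf_key(container: str, account: str) -> str:
    """Return the Spark configuration key that holds the container's SAS token."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


path = build_wasbs_path(
    "public",
    "pandemicdatalake",
    "curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet",
)
key = build_sas_conf_key("public", "pandemicdatalake")
print(path)
print(key)
```

Keeping the two strings in small functions makes it easier to point the notebook at another container or account without retyping the pattern.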

Read the COVID-19 dataset from Blob and display a sample set

In [9]:
dfRaw = spark.read.parquet(wasbs_path)
display(dfRaw)

date_rep,day,month,year,cases,deaths,countries_and_territories,geo_id,country_territory_code,pop_data_2018,continent_exp,load_date,iso_country
2020-06-24,24,6,2020,338,20,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-23,23,6,2020,310,17,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-22,22,6,2020,409,12,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-21,21,6,2020,546,21,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-20,20,6,2020,346,2,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-19,19,6,2020,658,42,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-18,18,6,2020,564,13,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-17,17,6,2020,783,13,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-16,16,6,2020,761,7,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF
2020-06-15,15,6,2020,664,20,Afghanistan,AF,AFG,,Asia,2020-06-25T00:05:24.875+0000,AF


To start feature engineering, drop the columns that are unlikely to have any impact on the model.

In [11]:
# drop() accepts several column names in a single call
dfClean = dfRaw.drop('geo_id', 'country_territory_code', 'continent_exp',
                     'load_date', 'iso_country', 'date_rep', 'pop_data_2018')
display(dfClean)

day,month,year,cases,deaths,countries_and_territories
24,6,2020,338,20,Afghanistan
23,6,2020,310,17,Afghanistan
22,6,2020,409,12,Afghanistan
21,6,2020,546,21,Afghanistan
20,6,2020,346,2,Afghanistan
19,6,2020,658,42,Afghanistan
18,6,2020,564,13,Afghanistan
17,6,2020,783,13,Afghanistan
16,6,2020,761,7,Afghanistan
15,6,2020,664,20,Afghanistan


The model does not accept string columns, so convert the country names to a numeric index that can be used as a feature.

In [13]:
stringindexer = StringIndexer(inputCol="countries_and_territories", outputCol="countries_index")
dfTransformed = stringindexer.fit(dfClean).transform(dfClean)
display(dfTransformed)

day,month,year,cases,deaths,countries_and_territories,countries_index
24,6,2020,338,20,Afghanistan,62.0
23,6,2020,310,17,Afghanistan,62.0
22,6,2020,409,12,Afghanistan,62.0
21,6,2020,546,21,Afghanistan,62.0
20,6,2020,346,2,Afghanistan,62.0
19,6,2020,658,42,Afghanistan,62.0
18,6,2020,564,13,Afghanistan,62.0
17,6,2020,783,13,Afghanistan,62.0
16,6,2020,761,7,Afghanistan,62.0
15,6,2020,664,20,Afghanistan,62.0
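
By default, `StringIndexer` orders labels by descending frequency, so the most frequent label receives index 0.0. The sketch below imitates that rule in plain Python. The function name `fit_string_index`, the toy data, and the alphabetical tie-break are assumptions of this sketch rather than a description of Spark's internals:

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index ordered by descending frequency.

    Ties are broken alphabetically here; this is an assumption of the sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy data (made up for illustration, not taken from the ECDC dataset)
countries = ["China", "China", "China", "Denmark", "Denmark", "Afghanistan"]
index = fit_string_index(countries)
indexed = [index[c] for c in countries]
print(index)
print(indexed)
```

The resulting number reflects only how often a label occurs, not any ordering between countries, which is why the index is later treated as a categorical feature.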


Finalise feature engineering by assembling all feature columns into a single feature vector.<br>
Then index the categorical features: any feature with at most `maxCategories` distinct values is treated as categorical.

In [15]:
featuresCols = dfTransformed.columns
featuresCols.remove('deaths')
featuresCols.remove('countries_and_territories')
vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="assembledFeatures", handleInvalid="skip")
# maxCategories value is set to 210 to consider the countries column as a category (there are roughly 210 countries in the dataset)
vectorIndexer = VectorIndexer(inputCol="assembledFeatures", outputCol="finalFeatures", maxCategories=210)
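
`VectorIndexer` decides which features are categorical by counting distinct values: a feature with at most `maxCategories` distinct values is indexed as categorical, and the rest stay continuous. A minimal sketch of that decision rule in plain Python follows. The function `categorical_features` and the toy rows are hypothetical and only illustrate the rule:

```python
def categorical_features(rows, feature_names, max_categories):
    """Return the features whose number of distinct values is <= max_categories."""
    categorical = []
    for position, name in enumerate(feature_names):
        distinct = {row[position] for row in rows}
        if len(distinct) <= max_categories:
            categorical.append(name)
    return categorical


# Toy rows (made up): day, cases, countries_index
names = ["day", "cases", "countries_index"]
rows = [
    (1, 0.0, 4.0),
    (2, 17.0, 4.0),
    (3, 59.0, 5.0),
    (1, 120.0, 5.0),
    (2, 338.0, 62.0),
]
selected = categorical_features(rows, names, max_categories=3)
print(selected)
```

Under this rule, a threshold of 210 would presumably also mark low-cardinality numeric columns such as `day`, `month` and `year` as categorical, not only `countries_index`.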

Split the dataset into ***training set* (70%)** and ***test set* (30%)**

In [17]:
train, test = dfTransformed.randomSplit([0.7, 0.3])
display(train)

day,month,year,cases,deaths,countries_and_territories,countries_index
1,1,2020,0.0,0,Afghanistan,62.0
1,1,2020,0.0,0,Armenia,60.0
1,1,2020,0.0,0,Australia,13.0
1,1,2020,0.0,0,Belarus,54.0
1,1,2020,0.0,0,Brazil,12.0
1,1,2020,0.0,0,Cambodia,57.0
1,1,2020,0.0,0,Cases_on_an_international_conveyance_Japan,205.0
1,1,2020,0.0,0,China,4.0
1,1,2020,0.0,0,Croatia,33.0
1,1,2020,0.0,0,Denmark,5.0
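
`randomSplit` assigns rows to splits by random sampling, so the 70/30 proportions are approximate, and without a seed the split changes between runs (the method accepts an optional `seed` argument). The plain-Python sketch below illustrates both points. The function `random_split` is a hypothetical stand-in, not Spark's implementation:

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by an independent random draw.

    Split sizes are therefore only approximately proportional to the weights.
    """
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalised boundaries, roughly [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train_a, test_a = random_split(rows, [0.7, 0.3], seed=42)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_a), len(test_a))
```

Fixing the seed makes the training and test sets repeatable, which helps when comparing models across runs.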


Train a Gradient Boosted Trees (GBT) model. <br>
***Note that GBT is chosen in this demo purely for the purpose of an experiment*** and should not be considered as the most effective model for this dataset.
More on GBT [*here*](https://en.wikipedia.org/wiki/Gradient_boosting).<br>
This cell also defines a grid of hyperparameters to search, an evaluation metric, and a cross validator that selects the combination with the lowest error.

In [19]:
gbt = GBTRegressor(featuresCol="finalFeatures", labelCol="deaths", maxIter=10)
paramGrid = ParamGridBuilder()\
  .addGrid(gbt.maxDepth, [10, 15])\
  .addGrid(gbt.maxBins, [210, 250])\
  .build()
# Using the Mean Absolute Error as an evaluation metric
evaluator = RegressionEvaluator(metricName="mae", labelCol=gbt.getLabelCol(), predictionCol=gbt.getPredictionCol())
# tune the model using cross validator
cv = CrossValidator(estimator=gbt, evaluator=evaluator, estimatorParamMaps=paramGrid)
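
To make the cost of this search concrete, the sketch below enumerates the same grid in plain Python and computes the mean absolute error by hand. The helpers `build_param_grid` and `mean_absolute_error` are hypothetical, the label and prediction values are made up, and the fold count assumes `CrossValidator`'s default of 3 since the cell above does not set `numFolds`:

```python
from itertools import product


def build_param_grid(**param_values):
    """Return every combination of the given hyperparameter values."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]


def mean_absolute_error(labels, predictions):
    """Average of |label - prediction| over all rows."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


grid = build_param_grid(maxDepth=[10, 15], maxBins=[210, 250])
num_folds = 3  # assumed: CrossValidator's default numFolds
# One fit per combination and fold, plus a final refit of the best combination
total_fits = len(grid) * num_folds + 1

mae = mean_absolute_error([20, 17, 12], [18.0, 17.0, 16.0])
print(grid)
print(total_fits)
print(mae)
```

Each extra value added to the grid multiplies the number of model fits, so deep trees combined with a wide grid can make training noticeably slower.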

Create a Pipeline by chaining the assembler, the indexer and the GBT cross validator

In [21]:
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])

Train the model using the training set

In [23]:
pipelineModel = pipeline.fit(train)

Run predictions for the entire dataset (training and test rows alike)

In [25]:
predictions = pipelineModel.transform(dfTransformed)

Rename the output columns and cast them to the appropriate data types (to match the schema of the staging table in SQL MI) before writing the dataset to Azure SQL Managed Instance

In [27]:
dfPredicted = predictions.select("deaths", "prediction", *featuresCols, "countries_and_territories") \
              .withColumnRenamed("countries_and_territories", "CountryName") \
              .withColumnRenamed("countries_index", "CountryMLIndex")

dfPredicted = dfPredicted.withColumn("CountryMLIndex", dfPredicted.CountryMLIndex.cast('integer')) \
                .withColumn("Day", dfPredicted.day.cast('integer')) \
                .withColumn("Month", dfPredicted.month.cast('integer')) \
                .withColumn("Year", dfPredicted.year.cast('integer')) \
                .withColumn("Cases", dfPredicted.cases.cast('integer')) \
                .withColumn("Deaths", dfPredicted.deaths.cast('integer')) \
                .withColumn("Prediction", dfPredicted.prediction.cast('integer'))
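
One detail worth checking is how the fractional predictions become integers. A cast from double to integer drops the fractional part rather than rounding, so a predicted 2.9 deaths would be stored as 2. The plain-Python sketch below shows the difference on made-up values, using `int()` as a stand-in for the truncating cast:

```python
import math

# Made-up prediction values for illustration
predictions = [0.4, 2.9, 19.5, 41.99]

# int() drops the fractional part, which mirrors a truncating cast
truncated = [int(p) for p in predictions]

# Round half up before converting (valid for the non-negative values used here)
rounded = [math.floor(p + 0.5) for p in predictions]

print(truncated)
print(rounded)
```

If rounded values are preferred in the staging table, one option in PySpark is to apply `pyspark.sql.functions.round` to the prediction column before the cast.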

Use the [SQL Spark connector](https://github.com/microsoft/sql-spark-connector) to write the predicted dataset to staging table "***dbo.StagingPredictedCovid19***" in SQL MI.<br>
This staging table will be used for further transformation in the subsequent notebook to finalise the data for visualisation.

In [29]:
sqlmiconnection = dbutils.secrets.get(scope = "sqlmi-kv-secrets", key = "sqlmiconn")
sqlmiuser = dbutils.secrets.get(scope = "sqlmi-kv-secrets", key = "sqlmiuser")
sqlmipwd = dbutils.secrets.get(scope = "sqlmi-kv-secrets", key = "sqlmipwd")
dbname = "Covid19datamart"
servername = "jdbc:sqlserver://" + sqlmiconnection
database_name = dbname
# The SQL Server JDBC driver expects the property name databaseName
url = servername + ";" + "databaseName=" + dbname + ";"
table_name = "[Covid19datamart].[dbo].[StagingPredictedCovid19]"

try:
  dfPredicted.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", sqlmiuser) \
        .option("port", 3342) \
        .option("password", sqlmipwd) \
        .option("applicationintent", "ReadWrite") \
        .mode("append") \
        .save()
except Exception as error:  # JVM-side failures surface as Py4J or PySpark errors, not ValueError
  print("Connector write failed", error)
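
The connection string follows the SQL Server JDBC URL pattern `jdbc:sqlserver://host[:port];property=value;`, where the database is selected with the `databaseName` property. As a minimal sketch, the helper `build_sqlserver_jdbc_url` and the host name below are hypothetical; in the notebook the real host comes from the secret scope:

```python
def build_sqlserver_jdbc_url(host, database, port=None):
    """Build a URL of the form jdbc:sqlserver://host[:port];databaseName=database;"""
    server = f"jdbc:sqlserver://{host}"
    if port is not None:
        server += f":{port}"
    return f"{server};databaseName={database};"


# "example-mi.public.abc123.database.windows.net" is a made-up host name
url = build_sqlserver_jdbc_url(
    "example-mi.public.abc123.database.windows.net",
    "Covid19datamart",
    port=3342,
)
print(url)
```

Placing the port in the URL is one alternative to passing it as a separate option. Either way, the URL should never embed the password, which stays in the secret scope.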

Connect to the Azure SQL MI database to verify the data written to *dbo.StagingPredictedCovid19* <br>
--End of notebook--