# AzureDay - Part - 2: COVID-19 Time Series Analysis and Prediction using ML.NET framework

### Dataset

- [2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE - Time Series](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series).

### Introduction 

This is **Part-2** of our analysis of the COVID-19 dataset provided by Johns Hopkins CSSE. In [**Part-1**](https://github.com/praveenraghuvanshi1512/TechnicalSessions/tree/31052020-virtualmlnet/31052020-virtualmlnet/src/part-1), I analyzed the dataset and created some tables and plots to get insights from it. In Part-2, I'll focus on applying machine learning to make predictions using the time-series APIs provided by the ML.NET framework. I'll build a model from scratch on the number of confirmed cases and forecast the next 7/14 days. Later on, I'll plot these numbers for better visualization.

[**ML.NET**](https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet) is a cross-platform framework from Microsoft for developing machine learning models in the .NET ecosystem. It allows .NET developers to solve business problems with machine learning algorithms in their preferred language, such as C# or F#. It's highly scalable and is used within Microsoft in many products, such as Bing and PowerPoint.



### Summary

Below is a summary of the steps we'll perform:

1. Define application level items
    - Nuget packages
    - Namespaces
    - Constants
     
2. Utility Functions
    - Formatters    

3. Dataset and Transformations
    - Actual from [Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series)
    - Transformed [time_series_covid19_confirmed_global_transposed.csv](time_series_covid19_confirmed_global_transposed.csv)
    
4. Data Classes
    - ConfirmedData : Provides a map between columns in a dataset
    - ConfirmedForecast : Holds predicted values

5. Data Analysis
    - Visualize Data using DataFrame API
    - Display Top 10 Rows - dataframe.Head(10)
    - Display Last 10 Rows - dataframe.Tail(10)
    - Display Dataset Statistics - dataframe.Description()
    - Plot of TotalConfirmed cases vs Date

6. Load Data - MLContext
7. ML Pipeline
8. Train Model
9. Prediction/Forecasting
10. Prediction Visualization
11. Prediction Analysis

### 1. Define Application wide Items

#### Nuget Packages


In [4]:
// ML.NET Nuget packages installation
#r "nuget:Microsoft.ML"
#r "nuget:Microsoft.ML.TimeSeries"
#r "nuget:Microsoft.Data.Analysis"

// Install XPlot package
#r "nuget:XPlot.Plotly"
    
// CSV Helper Package for reading CSV
#r "nuget:CsvHelper"

#### Namespaces

In [5]:
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Globalization;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.Data.Analysis;
using Microsoft.ML.Transforms.TimeSeries;
using Microsoft.AspNetCore.Html;
using XPlot.Plotly;
using CsvHelper;
using CsvHelper.Configuration;

#### Constants

In [22]:
const string CONFIRMED_DATASET_FILE = "time_series_covid19_confirmed_global_transposed.csv";

// Forecast API
const int WINDOW_SIZE = 5;
const int SERIES_LENGTH = 10;
const int TRAIN_SIZE = 100;
const int HORIZON = 14;

// Dataset
const int DEFAULT_ROW_COUNT = 10;
const string TOTAL_CONFIRMED_COLUMN = "TotalConfirmed";
const string DATE_COLUMN = "Date";

### 2. Utility Functions

#### Formatters

By default, a DataFrame is not rendered as a table in the notebook output. To display it as an HTML table, we register a custom formatter, as shown in the next cell.

In [7]:
Formatter<DataFrame>.Register((df, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c.Name)));
    var rows = new List<List<IHtmlContent>>();
    var take = DEFAULT_ROW_COUNT;
    for (var i = 0; i < Math.Min(take, df.Rows.Count); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var obj in df.Rows[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }

    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));

    writer.Write(t);
}, "text/html");

### 3. Dataset and Transformations

#### Download Dataset
- Actual Dataset: [Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series)
- Transformed Dataset: [time_series_covid19_confirmed_global_transposed.csv](time_series_covid19_confirmed_global_transposed.csv)


I'll be using COVID-19 time series dataset from [Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series) and will be performing predictions using **time_series_covid19_confirmed_global.csv** file.

In these files, each row is a country or region and each column is a date, which makes it difficult to map the data to our classes when loading the CSV. The data is also broken down per country. To keep things simple, I'll work with the global count of COVID-19 cases rather than a specific country.

I applied the following transformations to the dataset and saved the result as a new CSV:
- Summed the cases from all the countries for each date
- Kept only two rows: Date and Total
- Transposed the CSV so that rows become columns and vice versa. [Refer](https://support.office.com/en-us/article/transpose-rotate-data-from-rows-to-columns-or-vice-versa-3419f2e3-beab-4318-aae5-d0f862209744) to this article for the transposition steps.
- Saved the transposed file below in the current GitHub directory. The underlying data is unchanged. The file has data up to 05-27-2020
    - [time_series_covid19_confirmed_global_transposed.csv](time_series_covid19_confirmed_global_transposed.csv) : Columns - **Date, TotalConfirmed**
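
The transformation steps above can also be sketched in plain C#. This is a minimal sketch, not the procedure used to produce the transposed file: `SumByDate` is a hypothetical helper, the sample rows are made up, and it assumes the Johns Hopkins layout of four metadata columns followed by one `M/d/yy` column per date.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper (not part of the original notebook): sums every date column
// and returns one (Date, TotalConfirmed) pair per date, i.e. the transposed layout.
List<(DateTime Date, double TotalConfirmed)> SumByDate(IEnumerable<string> csvLines)
{
    const int metadataColumns = 4; // Province/State, Country/Region, Lat, Long

    var lines = csvLines.Where(line => !string.IsNullOrWhiteSpace(line)).ToList();
    var header = lines[0].Split(',');
    var dateCount = header.Length - metadataColumns;

    var dates = header
        .Skip(metadataColumns)
        .Select(h => DateTime.ParseExact(h.Trim(), "M/d/yy", CultureInfo.InvariantCulture))
        .ToList();
    var totals = new double[dateCount];

    foreach (var line in lines.Skip(1))
    {
        var fields = line.Split(',');
        // Quoted names such as "Korea, South" add extra fields on the left,
        // so the counts are read relative to the end of the row.
        var offset = fields.Length - dateCount;
        for (var i = 0; i < dateCount; i++)
        {
            totals[i] += double.Parse(fields[offset + i], CultureInfo.InvariantCulture);
        }
    }

    return dates.Select((date, i) => (date, totals[i])).ToList();
}

// Made-up sample rows in the Johns Hopkins layout, for illustration only
var sample = new[]
{
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20",
    ",Afghanistan,33.0,65.0,0,1",
    ",\"Korea, South\",36.0,128.0,1,2",
};

Console.WriteLine("Date,TotalConfirmed");
foreach (var (date, total) in SumByDate(sample))
{
    Console.WriteLine($"{date.ToString("M/d/yyyy", CultureInfo.InvariantCulture)},{total}");
}
```

Reading the counts from the right-hand side of each row avoids a full CSV parser, at the cost of assuming that only the name columns can contain commas.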

#### Before transformation

<img src="time-series-before-transformation.png" alt="Time Series data before transformation" style="zoom: 80%;" />

#### After transformation

<img src="time-series-after-transformation.png" alt="Time Series data after transformation" style="zoom: 80%;" />

### 4. Data Classes (Input and Output for ML)

Now we need to create a few data structures that map to the columns in our dataset.

#### Confirmed cases

In [8]:
/// <summary>
/// Represent data for confirmed cases with a mapping to columns in a dataset
/// </summary>
public class ConfirmedData
{
    /// <summary>
    /// Date of confirmed case
    /// </summary>
    [LoadColumn(0)]
    public DateTime Date;

    /// <summary>
    /// Total no of confirmed cases on a particular date
    /// </summary>
    [LoadColumn(1)]
    public float TotalConfirmed;
}

In [9]:
/// <summary>
/// Prediction/Forecast for Confirmed cases
/// </summary>
internal class ConfirmedForecast
{
    /// <summary>
    /// No of predicted confirmed cases for multiple days
    /// </summary>
    public float[] Forecast { get; set; }
}

### 5. Data Analysis

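Before training, it helps to inspect the dataset with the DataFrame API. Loading the data for training happens later, in section 6, where an MLContext (the starting point for creating a machine learning model in ML.NET) reads the CSV. Two things to note for that step:
- Set hasHeader to true, because our dataset has a header row
- Set separatorChar to ',', because the file is a CSV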

#### Visualize Data - DataFrame

In [10]:
const char SEPARATOR = ',';
const char SEPARATOR_REPLACEMENT = '_';
const int DEFAULT_ROW_COUNT = 10;

var remoteFilePath ="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/";

var originalFileName= "time_series_covid19_confirmed_global.csv";

var originalFileName_2= "time_series_covid19_confirmed_global-old.csv";

var newFileName= "time_series_covid19_confirmed_global-new.csv";

var contents = new HttpClient().GetStringAsync(remoteFilePath+originalFileName).Result;
        
File.WriteAllText(originalFileName_2, contents);


// Copy a csv file, replacing the separator character inside header and cell values with a replacement character
private void CreateCsvAndReplaceSeparatorInCells(string inputFile, string outputFile, char separator, char separatorReplacement)
{
    var culture = CultureInfo.InvariantCulture;
    using var reader = new StreamReader(inputFile);
    using var csvIn = new CsvReader(reader, new CsvConfiguration(culture));
    using var recordsIn = new CsvDataReader(csvIn);
    using var writer = new StreamWriter(outputFile);
    using var outCsv = new CsvWriter(writer, culture);

    // Write Header
    csvIn.ReadHeader();
    var headers = csvIn.Context.HeaderRecord;
    foreach (var header in headers)
    {
        outCsv.WriteField(header.Replace(separator, separatorReplacement));
    }
    outCsv.NextRecord();

    // Write rows
    while (recordsIn.Read())
    {
        var columns = recordsIn.FieldCount;
        for (var index = 0; index < columns; index++)
        {
            var cellValue = recordsIn.GetString(index);
            outCsv.WriteField(cellValue.Replace(separator, separatorReplacement));
        }
        outCsv.NextRecord();
    }
}

CreateCsvAndReplaceSeparatorInCells(originalFileName_2, newFileName, SEPARATOR, SEPARATOR_REPLACEMENT);

var covid19Dataframe = DataFrame.LoadCsv(newFileName);

List<ConfirmedData> inMemoryCollection = new List<ConfirmedData>();


PrimitiveDataFrameColumn<DateTime> dateTimes = new PrimitiveDataFrameColumn<DateTime>("Date"); 
PrimitiveDataFrameColumn<double> totals = new PrimitiveDataFrameColumn<double>("TotalConfirmed");

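// Every column other than the four metadata columns holds the counts for one date
var metadataColumns = new HashSet<string> { "Province/State", "Country/Region", "Lat", "Long" };
foreach (var col in covid19Dataframe.Columns)
{
    if (metadataColumns.Contains(col.Name))
    {
        continue;
    }

    // Parse with the invariant culture so the M/d/yy headers are read the same way on every machine
    var date = DateTime.Parse(col.Name, CultureInfo.InvariantCulture);
    // Global total for this date: sum across all countries and regions
    var total = Convert.ToDouble(col.Sum());

    dateTimes.Append(date);
    totals.Append(total);
    inMemoryCollection.Add(new ConfirmedData { Date = date, TotalConfirmed = (float)total });
}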


  

In [11]:
var predictedDf = DataFrame.LoadCsv(CONFIRMED_DATASET_FILE);

//var predictedDf = new DataFrame(dateTimes, totals);
//display(predictedDf)

In [12]:
predictedDf.Head(DEFAULT_ROW_COUNT) // top

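| index | Date | TotalConfirmed |
| --- | --- | --- |
| 0 | 1/22/2020 | 555 |
| 1 | 1/23/2020 | 654 |
| 2 | 1/24/2020 | 941 |
| 3 | 1/25/2020 | 1434 |
| 4 | 1/26/2020 | 2118 |
| 5 | 1/27/2020 | 2927 |
| 6 | 1/28/2020 | 5578 |
| 7 | 1/29/2020 | 6166 |
| 8 | 1/30/2020 | 8234 |
| 9 | 1/31/2020 | 9927 |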


In [13]:
predictedDf.Tail(DEFAULT_ROW_COUNT) //last

| index | Date | TotalConfirmed |
| --- | --- | --- |
| 0 | 4/28/2020 | 3097229 |
| 1 | 4/29/2020 | 3172287 |
| 2 | 4/30/2020 | 3256910 |
| 3 | 5/1/2020 | 3343777 |
| 4 | 5/2/2020 | 3427584 |
| 5 | 5/3/2020 | 3506729 |
| 6 | 5/4/2020 | 3583055 |
| 7 | 5/5/2020 | 3662691 |
| 8 | 5/6/2020 | 3755341 |
| 9 | 5/7/2020 | 3845718 |


In [14]:
predictedDf.Sample(DEFAULT_ROW_COUNT) //random

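| index | Date | TotalConfirmed |
| --- | --- | --- |
| 0 | 4/8/2020 | 1480200 |
| 1 | 4/3/2020 | 1095876 |
| 2 | 4/29/2020 | 3172287 |
| 3 | 3/23/2020 | 378282 |
| 4 | 2/23/2020 | 78958 |
| 5 | 4/10/2020 | 1657929 |
| 6 | 2/14/2020 | 66885 |
| 7 | 3/29/2020 | 720285 |
| 8 | 2/27/2020 | 82746 |
| 9 | 3/10/2020 | 118620 |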


In [15]:
predictedDf.Description()

| index | Description | TotalConfirmed |
| --- | --- | --- |
| 0 | Length (excluding null values) | 107.0 |
| 1 | Max | 3845718.0 |
| 2 | Min | 555.0 |
| 3 | Mean | 923109.56 |


##### Number of Confirmed cases over Time

In [16]:
// Number of confirmed cases over time
var totalConfirmedDateColumn = predictedDf.Columns[DATE_COLUMN];
var totalConfirmedColumn = predictedDf.Columns[TOTAL_CONFIRMED_COLUMN];

var dates = new List<string>();
var totalConfirmedCases = new List<string>();
for (int index = 0; index < totalConfirmedDateColumn.Length; index++)
{
    dates.Add(totalConfirmedDateColumn[index].ToString());
    totalConfirmedCases.Add(totalConfirmedColumn[index].ToString());
}

In [17]:
var title = "Number of Confirmed Cases over Time";
var confirmedTimeGraph = new Graph.Scattergl()
    {
        x = dates.ToArray(),
        y = totalConfirmedCases.ToArray(),
        mode = "lines+markers"
    };
    


var chart = Chart.Plot(confirmedTimeGraph);
chart.WithTitle(title);
display(chart);

### 6. Load Data - MLContext

In [18]:
var context = new MLContext();

In [19]:
var data = context.Data.LoadFromTextFile<ConfirmedData>(CONFIRMED_DATASET_FILE, hasHeader: true, separatorChar: ',');
//IDataView data = context.Data.LoadFromEnumerable<ConfirmedData>(inMemoryCollection );


### 7. ML Pipeline

To create the ML pipeline for time-series analysis, we'll use [Singular Spectrum Analysis](https://en.wikipedia.org/wiki/Singular_spectrum_analysis). ML.NET provides a built-in API for it; more details can be found at [TimeSeriesCatalog.ForecastBySsa](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.timeseriescatalog.forecastbyssa?view=ml-dotnet).

In [23]:
var pipeline = context.Forecasting.ForecastBySsa(
                nameof(ConfirmedForecast.Forecast),
                nameof(ConfirmedData.TotalConfirmed),
                WINDOW_SIZE, 
                SERIES_LENGTH,
                TRAIN_SIZE,
                HORIZON);
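
The positional arguments above are easy to mix up, so the same kind of call can be written with named parameters. The following is a minimal sketch on a synthetic, linearly growing series rather than the COVID-19 data; `Observation` and `ForecastWithBounds` are hypothetical classes introduced only for this example, and the confidence-bound columns are optional `ForecastBySsa` parameters that the notebook itself does not use.

```csharp
#r "nuget:Microsoft.ML"
#r "nuget:Microsoft.ML.TimeSeries"

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms.TimeSeries;

var ml = new MLContext(seed: 0);

// Synthetic series (120 points growing by 10 per step), for illustration only
var series = Enumerable.Range(0, 120).Select(i => new Observation { Value = 100f + 10f * i });
var view = ml.Data.LoadFromEnumerable(series);

var ssa = ml.Forecasting.ForecastBySsa(
    outputColumnName: nameof(ForecastWithBounds.Forecast),
    inputColumnName: nameof(Observation.Value),
    windowSize: 5,     // length of the window used to build the trajectory matrix
    seriesLength: 10,  // number of recent values kept in the buffer for modeling
    trainSize: 100,    // number of values from the start of the series used for training
    horizon: 14,       // number of future values to forecast
    confidenceLevel: 0.95f,
    confidenceLowerBoundColumn: nameof(ForecastWithBounds.LowerBound),
    confidenceUpperBoundColumn: nameof(ForecastWithBounds.UpperBound));

var engine = ssa.Fit(view).CreateTimeSeriesEngine<Observation, ForecastWithBounds>(ml);
var result = engine.Predict();

for (var day = 0; day < result.Forecast.Length; day++)
{
    Console.WriteLine($"Day {day + 1}: {result.Forecast[day]} [{result.LowerBound[day]}, {result.UpperBound[day]}]");
}

// Hypothetical input type for this sketch
public class Observation
{
    public float Value;
}

// Hypothetical output type for this sketch: forecast plus confidence bounds
public class ForecastWithBounds
{
    public float[] Forecast { get; set; }
    public float[] LowerBound { get; set; }
    public float[] UpperBound { get; set; }
}
```

The bounds give a rough sense of how much to trust each forecast value, which a single point forecast cannot convey.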

### 8. Train Model

The pipeline is ready, so we can now train the model.

In [24]:
var model = pipeline.Fit(data);

### 9. Prediction/Forecasting - 14 days

The model is trained, so we can forecast the next 14 (HORIZON) days.
ML.NET provides a dedicated time-series engine for making predictions, which is similar to the PredictionEngine used elsewhere in ML.NET. The predicted values show an increasing trend, in line with the recent past values.

In [25]:
var forecastingEngine = model.CreateTimeSeriesEngine<ConfirmedData, ConfirmedForecast>(context);
var forecasts = forecastingEngine.Predict();
display(forecasts.Forecast.Select(x => (int) x))

| index | value |
| --- | --- |
| 0 | 3348757 |
| 1 | 3450497 |
| 2 | 3563966 |
| 3 | 3690067 |
| 4 | 3830293 |
| 5 | 3985411 |
| 6 | 4156335 |
| 7 | 4343499 |
| 8 | 4547278 |
| 9 | 4767698 |


### 10. Prediction Visualization

In [26]:
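// The Date column uses M/d/yyyy, so parse and format with the invariant culture
var lastDate = DateTime.Parse(dates.Last(), CultureInfo.InvariantCulture);
var predictionStartDate = lastDate.AddDays(1);

// Append one date and one forecast value for each day of the horizon
for (int index = 0; index < HORIZON; index++)
{
    dates.Add(lastDate.AddDays(index + 1).ToString("M/d/yyyy", CultureInfo.InvariantCulture));
    totalConfirmedCases.Add(forecasts.Forecast[index].ToString(CultureInfo.InvariantCulture));
}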

In [27]:
var title = "Number of Confirmed Cases over Time";
var layout = new Layout.Layout();
layout.shapes = new List<Graph.Shape>
{
    new Graph.Shape
    {
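        // Vertical line at the first forecast date, in the same M/d/yyyy format as the x axis values
        x0 = predictionStartDate.ToString("M/d/yyyy", CultureInfo.InvariantCulture),
        x1 = predictionStartDate.ToString("M/d/yyyy", CultureInfo.InvariantCulture),
        y0 = "0",
        y1 = "1",
        xref = "x",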
        yref = "paper",
        line = new Graph.Line() {color = "red", width = 2}
    }
};

var chart1 = Chart.Plot(
new [] 
    {
        new Graph.Scattergl()
        {
            x = dates.ToArray(),
            y = totalConfirmedCases.ToArray(),
            mode = "lines+markers"
        }
    },
    layout
);

chart1.WithTitle(title);
display(chart1);

### 11. Prediction Analysis

Comparing the plots before and after prediction, the model appears to perform reasonably well. The red vertical line marks the first forecast date (5/8/2020); everything to its right is predicted, covering the next HORIZON (14) days. The plot shows a sudden drop on 5/8/2020, which is probably due to insufficient data, as the dataset shown above has only 107 daily records. After that drop, the forecast follows an increasing trend, in line with the previously confirmed cases. The model can forecast any number of days by changing the value of the HORIZON constant.
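
One way to put a number on "reasonably well" is to hold back the most recent days, forecast them, and compare the forecast with the actual values using the mean absolute error (MAE) and root mean squared error (RMSE). The sketch below is not part of the original notebook: `Evaluate` is a hypothetical helper and the arrays are made-up numbers, not values from the dataset.

```csharp
using System;
using System.Linq;

// Hypothetical helper: MAE and RMSE between actual and forecast values of equal length
(double Mae, double Rmse) Evaluate(float[] actual, float[] forecast)
{
    if (actual.Length == 0 || actual.Length != forecast.Length)
    {
        throw new ArgumentException("Both series must be non-empty and of equal length.");
    }

    // Per-day error: actual minus forecast
    var errors = actual.Zip(forecast, (a, f) => (double)a - f).ToArray();

    var mae = errors.Average(e => Math.Abs(e));
    var rmse = Math.Sqrt(errors.Average(e => e * e));
    return (mae, rmse);
}

// Made-up numbers purely for illustration
var actualValues = new float[] { 100f, 110f, 120f, 130f };
var forecastValues = new float[] { 98f, 112f, 117f, 134f };

var (mae, rmse) = Evaluate(actualValues, forecastValues);
Console.WriteLine($"MAE: {mae:F2}, RMSE: {rmse:F2}");
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two hints at a few days with unusually large errors, such as the drop at the start of the forecast.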

## References
- [Tutorial: Forecast bike rental service demand with time series analysis and ML.NET](https://docs.microsoft.com/en-us/dotnet/machine-learning/tutorials/time-series-demand-forecasting#evaluate-the-model)
- [Time Series Forecasting in ML.NET and Azure ML notebooks](https://github.com/gvashishtha/time-series-mlnet/blob/master/time-series-forecast.ipynb) by Gopal Vashishtha