# COVID-19 Exploratory Data Analysis using ML.NET

## COVID-19
- According to [Wikipedia](https://en.wikipedia.org/wiki/Coronavirus_disease_2019), **Coronavirus disease 2019** (**COVID-19**) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in 2019 in Wuhan, the capital of China's Hubei province, and has since spread globally, resulting in the ongoing 2019–20 coronavirus pandemic.
- The virus has caused a pandemic that has spread across the globe and affected most nations.
- The purpose of this notebook is to visualize the trends of virus spread in various countries and to explore features of ML.NET such as DataFrame.

### Acknowledgement
- [Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data) for dataset
- [COVID-19 data visualization](https://www.kaggle.com/akshaysb/covid-19-data-visualization) by Akshay Sb

### Links

- **Dataset :** [2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE - Daily reports](https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports).

### Summary

Below is a summary of the steps we'll be performing:

1. Import necessary libraries and modules
    - Nuget packages
    - Namespaces      
     
2. Utility Functions
    - Formatters    

3. Load Dataset
    - Download Dataset from [Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data)
    - Load dataset in DataFrame
    
4. Analyse Data
    - Date Range
    - display(dataframe)
    - dataframe.Head(5)
    - dataframe.Sample(6)    
    - dataframe.Description()
    - dataframe.Info()

5. Data Cleaning
    - Remove Invalid Active cases

6. Data Visualization
    - Global
        - Confirmed Vs Deaths Vs Recovered
        - Top 5 Countries with Confirmed cases
        - Top 5 Countries with Death cases
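        - Top 5 Countries with Recovered cases
        - Number of Confirmed cases over Time
        - Number of Deaths over Time
        - Number of Recovered cases over Time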
    - India
        - Confirmed Vs Deaths Vs Recovered
        
**Note**: Graphs/plots will not be rendered in GitHub due to security reasons; however, they will render if you run this notebook locally.

### 1. Import necessary libraries and modules

In [1]:
// ML.NET Nuget packages installation
#r "nuget:Microsoft.ML"
#r "nuget:Microsoft.Data.Analysis"

//Install XPlot package
#r "nuget:XPlot.Plotly"

#### Namespaces

In [2]:
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.Data.Analysis;
using XPlot.Plotly;
using Microsoft.AspNetCore.Html;
using System.IO;
using System.Net.Http;

### 2. Utility Functions

#### Formatters

In [3]:
Formatter<DataFrame>.Register((df, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c.Name)));
    var rows = new List<List<IHtmlContent>>();
    var take = 20;
    for (var i = 0; i < Math.Min(take, df.Rows.Count); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var obj in df.Rows[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }

    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));

    writer.Write(t);
}, "text/html");

### 3. Load Dataset

#### Download Dataset from [Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data)

**NOTE**: This notebook uses the dataset of 04-01-2020 (April 1, 2020). To analyze a different day, pick a file from [Johns Hopkins CSSE dataset - daily](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports) and set `lastUpdatedFile` to its file name.

In [4]:
string lastUpdatedFile = "04-01-2020.csv";
if (!File.Exists(lastUpdatedFile))
{
    var contents = new HttpClient()
        .GetStringAsync($"https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{lastUpdatedFile}").Result;
        
    File.WriteAllText(lastUpdatedFile, contents);
}
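
The file name above follows a month-day-year pattern. As a minimal sketch, assuming every daily report is named `MM-dd-yyyy.csv` (inferred from `04-01-2020.csv`, not confirmed by this notebook), the name for another day could be built from a date. `DailyReportFileName` and `exampleFile` are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of the original notebook): builds the daily-report
// file name for a date, assuming the MM-dd-yyyy.csv naming seen in "04-01-2020.csv".
string DailyReportFileName(DateTime reportDate) =>
    reportDate.ToString("MM-dd-yyyy", CultureInfo.InvariantCulture) + ".csv";

string exampleFile = DailyReportFileName(new DateTime(2020, 4, 1));
Console.WriteLine(exampleFile); // prints 04-01-2020.csv
```

The resulting string can be assigned to `lastUpdatedFile` before running the download cell.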

#### Load dataset in DataFrame

In [5]:
var covid19Dataframe = DataFrame.LoadCsv(lastUpdatedFile);

### 4. Analyse Data

#### Constants

In [6]:
// Column Names
const string FIPS = "FIPS";
const string ADMIN = "Admin2";
const string STATE = "Province_State";
const string COUNTRY = "Country_Region";
const string LAST_UPDATE = "Last_Update";
const string LATITUDE = "Lat";
const string LONGITUDE = "Long_";
const string CONFIRMED = "Confirmed";
const string DEATHS = "Deaths";
const string RECOVERED = "Recovered";
const string ACTIVE = "Active";
const string COMBINED_KEY = "Combined_Key";

const int TOP_COUNT = 5;
const string VALUES = "Values";
const string INDIA = "India";


In [7]:
var dateRangeDataFrame = covid19Dataframe.Columns[LAST_UPDATE].ValueCounts();
var dataRange = dateRangeDataFrame.Columns[VALUES].Sort();
var lastElementIndex = dataRange.Length - 1;

var startDate = DateTime.Parse(dataRange[0].ToString()).ToShortDateString();
var lastDate  = DateTime.Parse(dataRange[lastElementIndex].ToString()).ToShortDateString(); // Last Element

display(h4($"The data is between {startDate} and {lastDate}"));
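
The cell above sorts the `Last_Update` values as strings, which gives the right order here because the `yyyy-MM-dd HH:mm:ss` format sorts chronologically. As an alternative sketch, the timestamps could be parsed into `DateTime` values first, so the range does not depend on the text format. The sample values are copied from rows shown in this notebook; the variable names are hypothetical and not part of the original code.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Sample Last_Update values copied from rows displayed in this notebook.
string[] sampleUpdates =
{
    "2020-04-01 21:58:49",
    "2020-03-24 04:29:15",
    "2020-04-01 21:58:34"
};

// Parse with an explicit format so the result does not depend on the machine's culture.
var parsedUpdates = sampleUpdates
    .Select(s => DateTime.ParseExact(s, "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture))
    .ToList();

string earliest = parsedUpdates.Min().ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
string latest = parsedUpdates.Max().ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
Console.WriteLine($"{earliest} to {latest}"); // prints 2020-03-24 to 2020-04-01
```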

In [8]:
display(covid19Dataframe)

| index | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223335 | -82.46171 | 4 | 0 | 0 | 0 | "Abbeville, South Carolina, US" |
| 1 | 22001 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.4142 | 47 | 1 | 0 | 0 | "Acadia, Louisiana, US" |
| 2 | 51001 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.76707 | -75.63235 | 7 | 0 | 0 | 0 | "Accomack, Virginia, US" |
| 3 | 16001 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452656 | -116.241554 | 195 | 3 | 0 | 0 | "Ada, Idaho, US" |
| 4 | 19001 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330757 | -94.47106 | 1 | 0 | 0 | 0 | "Adair, Iowa, US" |
| 5 | 29001 | Adair | Missouri | US | 2020-04-01 21:58:49 | 40.190586 | -92.600784 | 3 | 0 | 0 | 0 | "Adair, Missouri, US" |
| 6 | 40001 | Adair | Oklahoma | US | 2020-04-01 21:58:49 | 35.88494 | -94.65859 | 8 | 0 | 0 | 0 | "Adair, Oklahoma, US" |
| 7 | 8001 | Adams | Colorado | US | 2020-04-01 21:58:49 | 39.87432 | -104.33626 | 181 | 2 | 0 | 0 | "Adams, Colorado, US" |
| 8 | 17001 | Adams | Illinois | US | 2020-04-01 21:58:49 | 39.988155 | -91.18787 | 2 | 0 | 0 | 0 | "Adams, Illinois, US" |
| 9 | 18001 | Adams | Indiana | US | 2020-04-01 21:58:49 | 40.745766 | -84.936714 | 1 | 0 | 0 | 0 | "Adams, Indiana, US" |


In [9]:
covid19Dataframe.Head(5)

| index | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223335 | -82.46171 | 4 | 0 | 0 | 0 | "Abbeville, South Carolina, US" |
| 1 | 22001 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.4142 | 47 | 1 | 0 | 0 | "Acadia, Louisiana, US" |
| 2 | 51001 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.76707 | -75.63235 | 7 | 0 | 0 | 0 | "Accomack, Virginia, US" |
| 3 | 16001 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452656 | -116.241554 | 195 | 3 | 0 | 0 | "Ada, Idaho, US" |
| 4 | 19001 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330757 | -94.47106 | 1 | 0 | 0 | 0 | "Adair, Iowa, US" |


In [10]:
covid19Dataframe.Sample(6)

| index | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18045 | Fountain | Indiana | US | 2020-04-01 21:58:49 | 40.12362 | -87.24218 | 1 | 0 | 0 | 0 | "Fountain, Indiana, US" |
| 1 | 6083 | Santa Barbara | California | US | 2020-04-01 21:58:49 | 34.653294 | -120.01885 | 99 | 0 | 0 | 0 | "Santa Barbara, California, US" |
| 2 | 51095 | James City | Virginia | US | 2020-04-01 21:58:49 | 37.31157 | -76.76951 | 95 | 2 | 0 | 0 | "James City, Virginia, US" |
| 3 | 5111 | Poinsett | Arkansas | US | 2020-04-01 21:58:49 | 35.574337 | -90.66269 | 5 | 0 | 0 | 0 | "Poinsett, Arkansas, US" |
| 4 | `<null>` |  | Isle of Man | United Kingdom | 2020-04-01 21:58:34 | 54.2361 | -4.5481 | 68 | 1 | 0 | 67 | "Isle of Man, United Kingdom" |
| 5 | 1115 | St. Clair | Alabama | US | 2020-04-01 21:58:49 | 33.71902 | -86.310295 | 18 | 0 | 0 | 0 | "St. Clair, Alabama, US" |


In [11]:
covid19Dataframe.Description()

| index | Description | FIPS | Lat | Long_ | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|---|---|
| 0 | Length (excluding null values) | 2171.0 | 2484.0 | 2484.0 | 2485.0 | 2485.0 | 2485.0 | 2485.0 |
| 1 | Max | 99999.0 | 71.7069 | 178.065 | 110574.0 | 13155.0 | 63326.0 | 80572.0 |
| 2 | Min | 0.0 | -42.8821 | -159.59668 | 0.0 | 0.0 | 0.0 | -6.0 |
| 3 | Mean | 26224.918 | 35.639614 | -77.22377 | 375.29376 | 18.83662 | 77.73722 | 194.90987 |


In [12]:
covid19Dataframe.Info()

| index | Info | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DataType | System.Single | System.String | System.String | System.String | System.String | System.Single | System.Single | System.Single | System.Single | System.Single | System.Single | System.String |
| 1 | Length (excluding null values) | 2171 | 2485 | 2485 | 2485 | 2485 | 2484 | 2484 | 2485 | 2485 | 2485 | 2485 | 2485 |


### 5. Data Cleaning

#### Remove invalid Active cases

In [13]:
PrimitiveDataFrameColumn<bool> invalidActiveFilter = covid19Dataframe.Columns[ACTIVE].ElementwiseLessThan(0.0);
var invalidActiveDataFrame = covid19Dataframe.Filter(invalidActiveFilter);
display(invalidActiveDataFrame)

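| index | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<null>` |  | Hainan | China | 2020-03-24 04:29:15 | 19.1959 | 109.7453 | 168 | 6 | 168 | -6 | "Hainan, China" |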


**The table above shows one row (Hainan, China) with 168 confirmed cases, 168 recovered cases and 6 deaths, which results in an Active value of -6. A negative number of active cases is invalid, so let's remove this row in the next step.**
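
To see where the negative value comes from, here is a small sketch. It assumes that Active is derived as Confirmed minus Deaths minus Recovered; the notebook itself does not state this, although the assumption is consistent with both the Hainan row above and the India row shown later in this notebook. `ComputeActive` is a hypothetical helper, not part of the original code.

```csharp
using System;

// Assumption (not stated in the notebook): Active = Confirmed - Deaths - Recovered.
int ComputeActive(int confirmedTotal, int deathTotal, int recoveredTotal) =>
    confirmedTotal - deathTotal - recoveredTotal;

// Hainan, China: 168 confirmed, 6 deaths, 168 recovered (Active column shows -6).
int hainanActive = ComputeActive(168, 6, 168);

// India: 1998 confirmed, 58 deaths, 148 recovered (Active column shows 1792).
int indiaActive = ComputeActive(1998, 58, 148);

Console.WriteLine($"Hainan: {hainanActive}, India: {indiaActive}");
```

Under this assumption, a row can only go negative when recovered cases plus deaths exceed confirmed cases, which points to a reporting inconsistency rather than a real count.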

In [14]:
PrimitiveDataFrameColumn<bool> activeFilter = covid19Dataframe.Columns[ACTIVE].ElementwiseGreaterThanOrEqual(0.0);
covid19Dataframe = covid19Dataframe.Filter(activeFilter);
display(covid19Dataframe.Description());

| index | Description | FIPS | Lat | Long_ | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|---|---|
| 0 | Length (excluding null values) | 2171.0 | 2483.0 | 2483.0 | 2484.0 | 2484.0 | 2484.0 | 2484.0 |
| 1 | Max | 99999.0 | 71.7069 | 178.065 | 110574.0 | 13155.0 | 63326.0 | 80572.0 |
| 2 | Min | 0.0 | -42.8821 | -159.59668 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Mean | 26235.475 | 35.646233 | -77.29904 | 375.37723 | 18.841787 | 77.70088 | 194.99074 |


**We have removed the rows with a negative Active value. As the table above shows, the minimum value for Active cases is now zero.**

### 6. Data Visualization

#### Global

##### Collect Data

In [15]:
var confirmed = covid19Dataframe.Columns[CONFIRMED];
var deaths = covid19Dataframe.Columns[DEATHS];
var recovered = covid19Dataframe.Columns[RECOVERED];

var totalConfirmed = Convert.ToDouble(confirmed.Sum());
var totalDeaths = Convert.ToDouble(deaths.Sum());
var totaRecovered = Convert.ToDouble(recovered.Sum());

##### Confirmed Vs Deaths Vs Recovered cases

In [16]:
display(Chart.Plot(
    new Graph.Pie()
    {
        values = new double[]{totalConfirmed, totalDeaths, totaRecovered},
        labels = new string[] {CONFIRMED, DEATHS, RECOVERED}
    }
));

##### Top 5 Countries with Confirmed cases

In [17]:
// Get the data
var countryConfirmedGroup = covid19Dataframe.GroupBy(COUNTRY).Sum(CONFIRMED).OrderByDescending(CONFIRMED);
var topCountriesColumn = countryConfirmedGroup.Columns[COUNTRY];
var topConfirmedCasesByCountry = countryConfirmedGroup.Columns[CONFIRMED];

// Lists keep the ranking order and any duplicate totals, which a HashSet does not guarantee
List<string> countries = new List<string>(TOP_COUNT);
List<long> confirmedCases = new List<long>(TOP_COUNT);
for(int index = 0; index < TOP_COUNT; index++)
{
    countries.Add(topCountriesColumn[index].ToString());
    confirmedCases.Add(Convert.ToInt64(topConfirmedCasesByCountry[index]));
}

In [18]:
var title = "Top 5 Countries : Confirmed";
var series1 = new Graph.Bar{
        x = countries.ToArray(),
        y = confirmedCases.ToArray()
    };

var chart = Chart.Plot(new []{series1});
chart.WithTitle(title);
display(chart);
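
The group, sum, sort and take-top-N steps used above can also be sketched with plain LINQ, which may help when checking the DataFrame result by hand. This is only an illustration under stated assumptions: the rows below are a handful of values copied from tables in this notebook (not full country totals), and `sampleRows`, `topN` and the other names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A few (Country_Region, Confirmed) pairs copied from rows shown in this notebook.
// They are a tiny sample, so the sums are NOT real country totals.
var sampleRows = new List<(string Country, long Confirmed)>
{
    ("US", 4), ("US", 47), ("US", 195), ("United Kingdom", 68), ("India", 1998)
};

const int topN = 2;

// Group by country, sum the confirmed cases, sort descending and keep the top N.
var topCountries = sampleRows
    .GroupBy(r => r.Country)
    .Select(g => (Country: g.Key, Confirmed: g.Sum(r => r.Confirmed)))
    .OrderByDescending(t => t.Confirmed)
    .Take(topN)
    .ToList();

// Arrays keep the ranking order, so labels and totals stay aligned for a bar chart.
string[] labels = topCountries.Select(t => t.Country).ToArray();
long[] totals = topCountries.Select(t => t.Confirmed).ToArray();

Console.WriteLine(string.Join(", ", labels.Zip(totals, (c, v) => $"{c}={v}")));
// prints India=1998, US=246
```

Ordered collections are used here because a bar chart needs the x and y values to line up index by index, and two countries could in principle have the same total.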

##### Top 5 Countries with Deaths

In [19]:
// Get the data
var countryDeathsGroup = covid19Dataframe.GroupBy(COUNTRY).Sum(DEATHS).OrderByDescending(DEATHS);
var topCountriesColumn = countryDeathsGroup.Columns[COUNTRY];
var topDeathCasesByCountry = countryDeathsGroup.Columns[DEATHS];

// Lists keep the ranking order and any duplicate totals, which a HashSet does not guarantee
List<string> countries = new List<string>(TOP_COUNT);
List<long> deathCases = new List<long>(TOP_COUNT);
for(int index = 0; index < TOP_COUNT; index++)
{
    countries.Add(topCountriesColumn[index].ToString());
    deathCases.Add(Convert.ToInt64(topDeathCasesByCountry[index]));
}

In [20]:
var title = "Top 5 Countries : Deaths";
var series1 = new Graph.Bar{
        x = countries.ToArray(),
        y = deathCases.ToArray()
    };

var chart = Chart.Plot(new []{series1});
chart.WithTitle(title);
display(chart);

##### Top 5 Countries with Recovered cases

In [21]:
// Get the data
var countryRecoveredGroup = covid19Dataframe.GroupBy(COUNTRY).Sum(RECOVERED).OrderByDescending(RECOVERED);
var topCountriesColumn = countryRecoveredGroup.Columns[COUNTRY];
var topRecoveredCasesByCountry = countryRecoveredGroup.Columns[RECOVERED];

// Lists keep the ranking order and any duplicate totals, which a HashSet does not guarantee
List<string> countries = new List<string>(TOP_COUNT);
List<long> recoveredCases = new List<long>(TOP_COUNT);
for(int index = 0; index < TOP_COUNT; index++)
{
    countries.Add(topCountriesColumn[index].ToString());
    recoveredCases.Add(Convert.ToInt64(topRecoveredCasesByCountry[index]));
}

In [22]:
var title = "Top 5 Countries : Recovered";
var series1 = new Graph.Bar{
        x = countries.ToArray(),
        y = recoveredCases.ToArray()
    };

var chart = Chart.Plot(new []{series1});
chart.WithTitle(title);
display(chart);

##### Number of Confirmed cases over Time

In [31]:
var confirmedOverTimeGroup = covid19Dataframe.GroupBy(LAST_UPDATE).Sum(CONFIRMED).OrderBy(LAST_UPDATE);
var confirmedColumn = confirmedOverTimeGroup.Columns[CONFIRMED];
var timeSeriesColumn = confirmedOverTimeGroup.Columns[LAST_UPDATE];

var count = confirmedOverTimeGroup.Rows.Count;

List<string> timeSeriesConfirmed = new List<string>();
List<long> confirmedSeries = new List<long>();
for(int index = 0; index < count; index++)
{
    var time = timeSeriesColumn[index].ToString();
    var confirmedCount = Convert.ToInt64(confirmedColumn[index]);

    // display($"Index: {index}, Time: {time}, Confirmed: {confirmedCount}");

    timeSeriesConfirmed.Add(time);
    confirmedSeries.Add(confirmedCount);
}

In [32]:
var title = "Number of Confirmed Cases over Time";
var confirmedTimeGraph = new Graph.Scattergl()
    {
        x = timeSeriesConfirmed.ToArray(),
        y = confirmedSeries.ToArray(),
        mode = "lines+markers"
    };

var chart = Chart.Plot(confirmedTimeGraph);
chart.WithTitle(title);
display(chart);

##### Number of Deaths over Time

In [33]:
var deathsOverTimeGroup = covid19Dataframe.GroupBy(LAST_UPDATE).Sum(DEATHS).OrderBy(LAST_UPDATE);
var deathsColumn = deathsOverTimeGroup.Columns[DEATHS];
var timeSeriesColumn = deathsOverTimeGroup.Columns[LAST_UPDATE];

var count = deathsOverTimeGroup.Rows.Count;

List<string> timeSeries = new List<string>();
List<long> deathSeries = new List<long>();
for(int index = 0; index < count; index++)
{
    var time = timeSeriesColumn[index].ToString();
    var death = Convert.ToInt64(deathsColumn[index]);

    // display($"Index: {index}, Time: {time}, Deaths: {death}");

    timeSeries.Add(time);
    deathSeries.Add(death);
}

In [34]:

var title = "Number of Deaths over Time";
var deathTimeGraph = new Graph.Scattergl()
    {
        x = timeSeries.ToArray(),
        y = deathSeries.ToArray(),
        mode = "lines+markers"
    };

var chart = Chart.Plot(deathTimeGraph);
chart.WithTitle(title);
display(chart);

##### Number of Recovered cases over Time

In [35]:
var recoveredOverTimeGroup = covid19Dataframe.GroupBy(LAST_UPDATE).Sum(RECOVERED).OrderBy(LAST_UPDATE);
var recoveredColumn = recoveredOverTimeGroup.Columns[RECOVERED];
var timeSeriesColumn = recoveredOverTimeGroup.Columns[LAST_UPDATE];

var count = recoveredOverTimeGroup.Rows.Count;

List<string> timeSeries = new List<string>();
List<long> recoveredSeries = new List<long>();
for(int index = 0; index < count; index++)
{
    var time = timeSeriesColumn[index].ToString();
    var recoveredCount = Convert.ToInt64(recoveredColumn[index]);

    // display($"Index: {index}, Time: {time}, Recovered: {recoveredCount}");

    timeSeries.Add(time);
    recoveredSeries.Add(recoveredCount);
}

In [36]:
var title = "Number of Recovered cases over Time";
var recoveredTimegraph = new Graph.Scattergl()
    {
        x = timeSeries.ToArray(),
        y = recoveredSeries.ToArray(),
        mode = "lines+markers"
    };

var chart = Chart.Plot(recoveredTimegraph);
chart.WithTitle(title);
display(chart);

#### India

##### Collect Data

##### Confirmed Vs Deaths Vs Recovered cases

In [29]:
PrimitiveDataFrameColumn<bool> indiaFilter = covid19Dataframe.Columns[COUNTRY].ElementwiseEquals(INDIA);
var indiaDataFrame = covid19Dataframe.Filter(indiaFilter);
display(indiaDataFrame.Head((int)indiaDataFrame.Rows.Count));
            
var indiaConfirmed = indiaDataFrame.Columns[CONFIRMED];
var indiaDeaths = indiaDataFrame.Columns[DEATHS];
var indiaRecovered = indiaDataFrame.Columns[RECOVERED];

var indiaTotalConfirmed = Convert.ToDouble(indiaConfirmed.Sum());
var indiaTotalDeaths = Convert.ToDouble(indiaDeaths.Sum());
var indiaTotaRecovered = Convert.ToDouble(indiaRecovered.Sum());

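| index | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<null>` |  |  | India | 2020-04-01 21:58:34 | 20.593683 | 78.96288 | 1998 | 58 | 148 | 1792 | India |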


In [30]:
display(Chart.Plot(
    new Graph.Pie()
    {
        values = new double[]{indiaTotalConfirmed, indiaTotalDeaths, indiaTotaRecovered},
        labels = new string[] {CONFIRMED, DEATHS, RECOVERED}
    }
));

## References
- [Using ML.NET in Jupyter notebooks](https://devblogs.microsoft.com/cesardelatorre/using-ml-net-in-jupyter-notebooks/)
- [An Introduction to DataFrame](https://devblogs.microsoft.com/dotnet/an-introduction-to-dataframe/)
- [DataFrame - Sample](https://github.com/dotnet/interactive/blob/master/NotebookExamples/csharp/Samples/HousingML.ipynb)
- [Getting started with ML.NET in Jupyter Notebooks](https://xamlbrewer.wordpress.com/2020/02/20/getting-started-with-ml-net-in-jupyter-notebooks/)
- [Tips and tricks for C# Jupyter notebook](https://ewinnington.github.io/posts/jupyter-tips-csharp)
- [Jupyter notebooks with C# and R running](https://github.com/ewinnington/noteb)
- [Data analysis using F# and Jupyter notebook — Samuele Resca](https://medium.com/@samueleresca/data-analysis-using-f-and-jupyter-notebook-samuele-resca-66a229e25306)

#  ******************** Be Safe **********************