<a href="https://colab.research.google.com/github/plaban1981/MachieHack/blob/master/COV19_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The Objective Of The Hackathon
In the coming weeks and months, we at MachineHack (an Analytics India Magazine initiative), along with our community members, will closely examine how the coronavirus could affect different nations.

We therefore invite MachineHackers to predict potential COVID-19 cases across the globe on a daily basis. The objective of the hackathon is to forecast three COVID-19 metrics (confirmed cases, recovered cases and deaths) for the next day, using the historical data available as of a given date.

As sad as it is to analyse the data around COVID-19 events, it is critical to keep tabs on the disease metrics to track the outbreak. The hackathon is based on the data published by various agencies and the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), which can be found below.

## A Note For Hackathon Participants
The univariate time series data is provided for each individual country affected by COVID-19, from 22nd January 2020 to the present date.

Here is an example for your reference. The provided .csv files contain the counts per country up to 10th March 2020 for the three target variables (confirmed cases, recovered cases and deaths).

The dataset is updated daily at 00:00 UTC with the latest counts of the target variables. Note that the published data is dynamic: each day's counts are added as a new column. The values in existing rows may also change as the reported figures for the COVID-19 outbreak are revised in various parts of the world.
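Because a new date column is appended each day, code that needs "the latest day" cannot hard-code a column name. The following is a minimal sketch, on a made-up frame (`demo` and its values are assumptions, not the hackathon data), of picking the most recent date column by parsing the headers rather than comparing them as strings.

```python
import pandas as pd

# Toy wide table in the same layout as the hackathon files (values are made up).
demo = pd.DataFrame({'Country/Region': ['A'], '3/9/20': [4], '3/10/20': [5]})

# Every column except the region name is assumed to be a date header.
date_cols = [c for c in demo.columns if c != 'Country/Region']

# Compare headers as dates: as plain strings, '3/9/20' would sort after '3/10/20'.
latest_col = max(date_cols, key=lambda c: pd.to_datetime(c, format='%m/%d/%y'))
latest_counts = demo[['Country/Region', latest_col]]
```

Parsing the headers matters once the day of the month reaches two digits, because string comparison orders headers character by character.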

The submission file from participants must contain the projected counts for the next day, i.e. 11th March 2020, as per the sample_submission.xlsx format.

## Data Description:
Features:

**Country/Region**: Name of Country/Region.

**Date Stamp:** The sequence of historical counts since 22nd Jan 2020.

The 3 .csv files contain historical counts per country (Confirmed, Recovered and Death).

* 1. **covid_confirmed_daily_updates.csv** – Contains the count of confirmed COVID cases.

* 2. **covid_deaths_daily_updates.csv** – Contains the count of COVID patient deaths.

* 3. **covid_recovered_daily_updates.csv** – Contains the count of recovered COVID patients.

* 4. **sample_submission.xlsx** – Submission format for the model evaluation.

The datasets are dynamic and are automatically updated at 00:00 UTC every day with the latest count of the respective target variables.

In [4]:
from google.colab import files
files.upload()

Saving Sample_Submission.xlsx to Sample_Submission.xlsx



In [0]:
import requests, pandas as pd, numpy as np
from io import StringIO
import time, json
from datetime import date
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_squared_error
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6

In [0]:
confirmed = pd.read_csv('/content/covid_confirmed_daily_updates.csv')
death = pd.read_csv('/content/covid_deaths_daily_updates.csv')
recovered = pd.read_csv('/content/covid_recovered_daily_updates.csv')

In [74]:
confirmed.head()

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20,2/26/20,2/27/20,2/28/20,2/29/20,3/1/20,3/2/20,3/3/20,3/4/20,3/5/20,3/6/20,3/7/20,3/8/20,3/9/20,3/10/20,3/11/20,3/12/20,3/13/20,3/14/20
0,Afghanistan,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,4,4,5,7,7,7,0.0
1,Albania,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,10,12,23,33,0.0
2,Algeria,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,3,5,12,12,17,17,19,20,20,20,24,26,0.0
3,Andorra,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0.0
4,Antigua and Barbuda,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.0


In [0]:
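# Date columns: every column after 'Country/Region'
dt = confirmed.columns[1:].tolist()
# Distinct region names, in order of first appearance
reg = confirmed['Country/Region'].unique().tolist()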

In [169]:
dt[0]

'1/22/20'

In [171]:
# Select by position with .iloc so the lookup does not depend on the row label being 0
confirmed.loc[confirmed['Country/Region'] == 'Afghanistan', dt[0]].iloc[0]

0

In [0]:
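# Reshape the wide table (one column per date) into long format:
# each list position holds one (region, date, count) triple.
region = []
date = []  # note: this rebinds the name `date` imported from datetime above
counts = []
for regions in reg:
  for d in dt:
    region.append(regions)
    date.append(d)
    # First matching row for this region, value in date column d
    counts.append(confirmed[confirmed['Country/Region'] == regions][d].values[0])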

In [176]:
len(region),len(date),len(counts)

(6625, 6625, 6625)

In [179]:
confirmed_cases = pd.DataFrame({'Date':date,'Region':region,'Confirmed':counts})
confirmed_cases.head()

Unnamed: 0,Date,Region,Confirmed
0,1/22/20,Afghanistan,0.0
1,1/23/20,Afghanistan,0.0
2,1/24/20,Afghanistan,0.0
3,1/25/20,Afghanistan,0.0
4,1/26/20,Afghanistan,0.0


In [182]:
confirmed_cases.shape

(6625, 3)
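The nested loop above performs a wide-to-long reshape one cell at a time. As a possible alternative, here is a minimal sketch of the same reshape with `pd.melt`, checked against the loop on a small made-up frame (`wide_demo` and its values are assumptions for illustration, not the hackathon data).

```python
import pandas as pd

# Toy wide table in the same layout as the hackathon files (values are made up).
wide_demo = pd.DataFrame({
    'Country/Region': ['A', 'B'],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# Loop-based reshape, mirroring the cell above.
rows = []
for name in wide_demo['Country/Region'].unique():
    for d in wide_demo.columns[1:]:
        value = wide_demo[wide_demo['Country/Region'] == name][d].values[0]
        rows.append((d, name, value))
loop_long = pd.DataFrame(rows, columns=['Date', 'Region', 'Confirmed'])

# Vectorised reshape: one row per (region, date) pair.
melt_long = (
    wide_demo
    .rename(columns={'Country/Region': 'Region'})
    .melt(id_vars='Region', var_name='Date', value_name='Confirmed')
)

# melt walks the date columns in the outer loop, so sort both before comparing.
cols = ['Date', 'Region', 'Confirmed']
loop_sorted = loop_long.sort_values(['Region', 'Date'])[cols]
melt_sorted = melt_long.sort_values(['Region', 'Date'])[cols]
same = loop_sorted.values.tolist() == melt_sorted.values.tolist()
```

One difference to keep in mind: the loop keeps only the first row for each region name, whereas `melt` keeps every row, so the two agree only when each region appears once in the wide table.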

## Recovered

In [0]:
# Same wide-to-long reshape as for the confirmed counts
dt = recovered.columns[1:].tolist()
reg = recovered['Country/Region'].unique().tolist()
region = []
date = []
counts = []
for regions in reg:
  for d in dt:
    region.append(regions)
    date.append(d)
    counts.append(recovered[recovered['Country/Region'] == regions][d].values[0])

In [181]:
recovered_cases = pd.DataFrame({'Date':date,'Region':region,'Recovered':counts})
recovered_cases.head()

Unnamed: 0,Date,Region,Recovered
0,1/22/20,Afghanistan,0.0
1,1/23/20,Afghanistan,0.0
2,1/24/20,Afghanistan,0.0
3,1/25/20,Afghanistan,0.0
4,1/26/20,Afghanistan,0.0


In [183]:
recovered_cases.shape

(6625, 3)

## Death

In [0]:
# Same wide-to-long reshape as for the confirmed counts
dt = death.columns[1:].tolist()
reg = death['Country/Region'].unique().tolist()
region = []
date = []
counts = []
for regions in reg:
  for d in dt:
    region.append(regions)
    date.append(d)
    counts.append(death[death['Country/Region'] == regions][d].values[0])

In [185]:
death_cases = pd.DataFrame({'Date':date,'Region':region,'Death':counts})
death_cases.head()

Unnamed: 0,Date,Region,Death
0,1/22/20,Afghanistan,0.0
1,1/23/20,Afghanistan,0.0
2,1/24/20,Afghanistan,0.0
3,1/25/20,Afghanistan,0.0
4,1/26/20,Afghanistan,0.0


In [0]:
data = pd.merge(confirmed_cases,recovered_cases,how='inner',on=['Date','Region'])

In [191]:
data.head()

Unnamed: 0,Date,Region,Confirmed,Recovered
0,1/22/20,Afghanistan,0.0,0.0
1,1/23/20,Afghanistan,0.0,0.0
2,1/24/20,Afghanistan,0.0,0.0
3,1/25/20,Afghanistan,0.0,0.0
4,1/26/20,Afghanistan,0.0,0.0


In [192]:
data.shape

(6625, 4)

In [0]:
final = pd.merge(data,death_cases,how='inner',on=['Date','Region'])

In [197]:
final.head()

Unnamed: 0,Date,Region,Confirmed,Recovered,Death
0,1/22/20,Afghanistan,0.0,0.0,0.0
1,1/23/20,Afghanistan,0.0,0.0,0.0
2,1/24/20,Afghanistan,0.0,0.0,0.0
3,1/25/20,Afghanistan,0.0,0.0,0.0
4,1/26/20,Afghanistan,0.0,0.0,0.0
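The two inner merges above silently drop any (Date, Region) pair that is missing from one of the tables, and would multiply rows if a pair were duplicated. The sketch below, on made-up frames (`conf_demo`, `rec_demo`, `death_demo` and the helper `merge_on_keys` are hypothetical names), chains the merges with `functools.reduce` and uses the `validate` argument of `pd.merge` to raise an error on duplicated keys.

```python
from functools import reduce

import pandas as pd

keys = ['Date', 'Region']

# Toy long-format frames (values are made up).
conf_demo = pd.DataFrame({'Date': ['1/22/20', '1/23/20'], 'Region': ['A', 'A'], 'Confirmed': [1, 2]})
rec_demo = pd.DataFrame({'Date': ['1/22/20', '1/23/20'], 'Region': ['A', 'A'], 'Recovered': [0, 1]})
death_demo = pd.DataFrame({'Date': ['1/22/20', '1/23/20'], 'Region': ['A', 'A'], 'Death': [0, 0]})

def merge_on_keys(left, right):
    # validate='one_to_one' raises pandas.errors.MergeError if a key pair
    # appears more than once on either side.
    return pd.merge(left, right, how='inner', on=keys, validate='one_to_one')

# Fold the three frames into one, left to right.
final_demo = reduce(merge_on_keys, [conf_demo, rec_demo, death_demo])
```

Comparing `len(final_demo)` with the length of each input is a quick way to confirm that the inner join dropped nothing.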


In [0]:
final['Date'] = pd.to_datetime(final['Date'])
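The date strings here follow a month/day/two-digit-year pattern such as `1/22/20`. A minimal sketch, assuming that pattern holds for every header, of passing an explicit `format` to `pd.to_datetime` so that parsing does not rely on inference (`sample_dates` is a made-up series):

```python
import pandas as pd

# Made-up sample in the same style as the date headers.
sample_dates = pd.Series(['1/22/20', '2/1/20', '3/14/20'])

# %m and %d accept values without a leading zero; %y is the two-digit year.
parsed = pd.to_datetime(sample_dates, format='%m/%d/%y')
```

With an explicit format, a string that does not match raises an error instead of being parsed under a different day/month convention.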

In [0]:
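# Each (Date, Region) pair occurs once, so 'mean' leaves the values unchanged;
# the pivot mainly sorts the rows by Date, then Region.
cov19_table = pd.pivot_table(final, index=['Date','Region'], aggfunc='mean').reset_index()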


In [204]:
cov19_table.head()

Unnamed: 0,Date,Region,Confirmed,Death,Recovered
0,2020-01-22,Afghanistan,0.0,0.0,0.0
1,2020-01-22,Albania,0.0,0.0,0.0
2,2020-01-22,Algeria,0.0,0.0,0.0
3,2020-01-22,Andorra,0.0,0.0,0.0
4,2020-01-22,Antigua and Barbuda,0.0,0.0,0.0
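Since `aggfunc='mean'` would average any repeated (Date, Region) pair, it can be worth checking for duplicates before pivoting. A small sketch on a made-up frame (`long_demo` and its values are assumptions):

```python
import pandas as pd

# Toy long-format frame with unique (Date, Region) pairs, deliberately unsorted.
long_demo = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-23', '2020-01-22', '2020-01-22']),
    'Region': ['A', 'B', 'A'],
    'Confirmed': [2.0, 0.0, 1.0],
    'Death': [0.0, 0.0, 0.0],
})

# True if any (Date, Region) pair occurs more than once.
has_duplicates = long_demo.duplicated(subset=['Date', 'Region']).any()

# With unique keys, each group holds a single value, so the mean equals that value.
pivoted = pd.pivot_table(long_demo, index=['Date', 'Region'], aggfunc='mean').reset_index()
```

The result is ordered by Date and then Region, with the value columns in alphabetical order, which matches the column order seen in the output above.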


In [206]:
cov19_table.set_index(['Date'],inplace=True)
cov19_table.head()

Date,Region,Confirmed,Death,Recovered
2020-01-22,Afghanistan,0.0,0.0,0.0
2020-01-22,Albania,0.0,0.0,0.0
2020-01-22,Algeria,0.0,0.0,0.0
2020-01-22,Andorra,0.0,0.0,0.0
2020-01-22,Antigua and Barbuda,0.0,0.0,0.0


In [232]:
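# Note: Region shows integer codes in the output because this cell was run
# after the label-encoding cell further down.
cov19_table[cov19_table['Death'] > 1]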

Date,Region,Confirmed,Death,Recovered
2020-01-22,74,548.0,17.0,28.0
2020-01-23,74,643.0,18.0,30.0
2020-01-24,74,920.0,26.0,36.0
2020-01-25,74,1406.0,42.0,39.0
2020-01-26,74,2075.0,56.0,49.0
...,...,...,...,...
2020-03-13,102,80.0,5.0,0.0
2020-03-13,110,5232.0,133.0,193.0
2020-03-13,114,1139.0,11.0,4.0
2020-03-13,120,2179.0,47.0,12.0


In [0]:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
lb.fit(cov19_table['Region'])

cov19_table['Region'] = lb.transform(cov19_table['Region'])

In [210]:
cov19_table.head()

Date,Region,Confirmed,Death,Recovered
2020-01-22,0,0.0,0.0,0.0
2020-01-22,1,0.0,0.0,0.0
2020-01-22,2,0.0,0.0,0.0
2020-01-22,3,0.0,0.0,0.0
2020-01-22,4,0.0,0.0,0.0
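After encoding, the `Region` column holds integers, so the names are no longer visible in the table. A minimal sketch of how the fitted encoder maps codes back to names with `inverse_transform` (`demo_regions` and `demo_encoder` are made-up names for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up region names, with a repeat and not in alphabetical order.
demo_regions = ['Albania', 'Afghanistan', 'Algeria', 'Afghanistan']

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(demo_regions)

# classes_ is sorted, so codes follow alphabetical order of the names.
known_names = list(demo_encoder.classes_)

# Map the integer codes back to the original names.
names_back = demo_encoder.inverse_transform(codes)
```

Keeping the fitted encoder around (here `lb` in the cell above) is what makes it possible to report predictions by region name later.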
