# Global Covid-19 Forecasting using a Random Forest

This is a very simple starter submission kernel using a random forest. Feature engineering and hyperparameter tuning should improve performance.

### As it turns out, it is very tough to make an RF algorithm extrapolate properly. A decision tree is a set of conditional statements that recursively split the input space, so each prediction is built from the training targets that land in a single leaf. There are ways to get random forests to predict values that fall outside the range of the targets in the training set, but I am not yet familiar with these techniques. Take a look at the following: https://www.statworx.com/de/blog/time-series-forecasting-with-random-forest/

Nevertheless, I will leave this notebook posted as an illustration, and we can consider week 1's submission as a bit of an experiment :-)
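
To make the extrapolation problem concrete, here is a minimal sketch of a hand-rolled one-feature regression tree in plain Python. It is not scikit-learn's implementation and not a full forest; the function names and the toy doubling series are assumptions made for illustration. Because every leaf stores the mean of the training targets it contains, a query far beyond the training range gets the same answer as the last training day.

```python
def fit_tree(xs, ys, depth=0, max_depth=4):
    """Fit a tiny 1-D regression tree; each leaf stores the mean of its targets."""
    if depth == max_depth or len(set(xs)) == 1:
        return {"leaf": sum(ys) / len(ys)}

    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_threshold, best_cost = None, float("inf")
    # Skip the smallest x so that both sides of every split are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost

    left = [(x, y) for x, y in zip(xs, ys) if x < best_threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= best_threshold]
    return {
        "threshold": best_threshold,
        "left": fit_tree(*zip(*left), depth + 1, max_depth),
        "right": fit_tree(*zip(*right), depth + 1, max_depth),
    }


def predict(tree, x):
    """Walk from the root to a leaf and return the stored mean."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["leaf"]


# Toy data (made up): a count that doubles every day for ten days.
days = list(range(10))
cases = [2 ** d for d in days]
tree = fit_tree(days, cases)

inside = predict(tree, 9)        # last training day
beyond = predict(tree, 20)       # eleven days past the training range
far_beyond = predict(tree, 100)  # far past the training range
```

All three queries end in the same right-most leaf, so the forecast stays flat no matter how far ahead we ask. Averaging many such trees in a forest does not change this bound.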

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/countrypopulations/CountryPopulations.csv
/kaggle/input/covid19-global-forecasting-week-1/submission.csv
/kaggle/input/covid19-global-forecasting-week-1/test.csv
/kaggle/input/covid19-global-forecasting-week-1/train.csv
/kaggle/input/covid19-global-forecasting-week-2/submission.csv
/kaggle/input/covid19-global-forecasting-week-2/test.csv
/kaggle/input/covid19-global-forecasting-week-2/train.csv
/kaggle/input/covid19-local-us-ca-forecasting-week-1/ca_submission.csv
/kaggle/input/covid19-local-us-ca-forecasting-week-1/ca_train.csv
/kaggle/input/covid19-local-us-ca-forecasting-week-1/ca_test.csv


In [2]:
import numpy as np
import pandas as pd

## Import Data

In [3]:
train = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-1/train.csv")
test = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-1/test.csv")
submission = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-1/submission.csv")
train.tail()


Unnamed: 0,Id,Province/State,Country/Region,Lat,Long,Date,ConfirmedCases,Fatalities
17887,26378,,Zambia,-15.4167,28.2833,2020-03-20,2.0,0.0
17888,26379,,Zambia,-15.4167,28.2833,2020-03-21,2.0,0.0
17889,26380,,Zambia,-15.4167,28.2833,2020-03-22,3.0,0.0
17890,26381,,Zambia,-15.4167,28.2833,2020-03-23,3.0,0.0
17891,26382,,Zambia,-15.4167,28.2833,2020-03-24,3.0,0.0


## Data Cleaning

In [4]:
# Encode the date as an integer, e.g. "2020-01-22" -> 20200122
train["Date"] = train["Date"].str.replace("-", "").astype(int)
train.head()

Unnamed: 0,Id,Province/State,Country/Region,Lat,Long,Date,ConfirmedCases,Fatalities
0,1,,Afghanistan,33.0,65.0,20200122,0.0,0.0
1,2,,Afghanistan,33.0,65.0,20200123,0.0,0.0
2,3,,Afghanistan,33.0,65.0,20200124,0.0,0.0
3,4,,Afghanistan,33.0,65.0,20200125,0.0,0.0
4,5,,Afghanistan,33.0,65.0,20200126,0.0,0.0
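
One thing to keep in mind with this integer encoding is that it is not evenly spaced: consecutive days across a month boundary are far apart numerically. The sketch below contrasts it with a days-since-start encoding. The helper names are hypothetical, and the origin date is an assumption taken from the first row shown above.

```python
from datetime import date


def yyyymmdd_int(iso_date):
    """The notebook's encoding: '2020-01-31' -> 20200131."""
    return int(iso_date.replace("-", ""))


def days_since(iso_date, origin="2020-01-22"):
    """Alternative encoding: whole days elapsed since `origin` (assumed start date)."""
    return (date.fromisoformat(iso_date) - date.fromisoformat(origin)).days


# Jan 31 and Feb 1 are one day apart, but the integer encoding sees a jump of 70.
gap_int = yyyymmdd_int("2020-02-01") - yyyymmdd_int("2020-01-31")
gap_days = days_since("2020-02-01") - days_since("2020-01-31")
```

A tree only compares the feature against thresholds, so the ordering is preserved either way, but an evenly spaced day counter is easier to reason about and to reuse with other models.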


### Drop NaNs

In [5]:
# Drop Province/State first (it contains NaNs), then drop any remaining rows with NaNs
train = train.drop(['Province/State'], axis=1)
train = train.dropna()
train.isnull().sum()

Id                0
Country/Region    0
Lat               0
Long              0
Date              0
ConfirmedCases    0
Fatalities        0
dtype: int64
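
The order of the two steps matters. The following sketch uses a made-up three-row frame (the province name and counts are invented) to show what `dropna()` would do if the `Province/State` column were still present, and one possible alternative that keeps the column by filling the missing values.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; only the column names mirror the training data.
demo = pd.DataFrame({
    "Province/State": [np.nan, "ExampleProvince", np.nan],
    "Country/Region": ["Zambia", "ExampleCountry", "Afghanistan"],
    "ConfirmedCases": [3.0, 10.0, 0.0],
})

# Calling dropna() first would discard every row without a province.
rows_if_dropna_first = len(demo.dropna())

# Dropping the column first keeps all rows, as in the cell above.
rows_if_column_dropped = len(demo.drop(columns=["Province/State"]).dropna())

# Alternative (an assumption, not what the notebook does): keep the column
# and mark missing provinces with an empty string.
filled = demo.fillna({"Province/State": ""})
```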

In [6]:
# Apply the same date encoding to the test data
test["Date"] = test["Date"].str.replace("-", "").astype(int)
# Do not drop NaN rows here: every test row needs a prediction,
# and the only NaNs are in Province/State, which is not used as a feature.
test.isnull().sum()



ForecastId           0
Province/State    6622
Country/Region       0
Lat                  0
Long                 0
Date                 0
dtype: int64

### Prepare Training

In [7]:
x = train[['Lat', 'Long', 'Date']]
y1 = train['ConfirmedCases']  # 1-D targets, as scikit-learn expects
y2 = train['Fatalities']
x_test = test[['Lat', 'Long', 'Date']]

In [8]:
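from sklearn.ensemble import RandomForestClassifier
# Note: a classifier treats each distinct count as a separate class label,
# so it can only ever predict values that appear in the training targets.
Tree_model = RandomForestClassifier(max_depth=200, random_state=0)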
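
Since case counts are numeric rather than categorical, a regressor is arguably the more natural choice. The sketch below is an alternative to the cell above, not what this notebook ran: it fits scikit-learn's `RandomForestRegressor` on synthetic data standing in for the three features. The variable names and the synthetic target are assumptions. Predictions are averages of training targets, so they stay within the observed range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the (Lat, Long, Date) features; not the competition data.
X_demo = rng.uniform(0, 100, size=(200, 3))
# Synthetic target that grows with the third feature, plus a little noise.
y_demo = 2.0 * X_demo[:, 2] + rng.normal(0, 1, size=200)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_demo, y_demo)            # 1-D target array
pred_demo = reg.predict(X_demo[:5])

# Query well outside the training range of the third feature.
pred_outside = reg.predict(np.array([[50.0, 50.0, 500.0]]))
```

Even with a regressor, `pred_outside` cannot exceed the largest training target, which is the extrapolation limit discussed at the top of this notebook.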

### Train Confirmed Cases Tree

In [9]:
# Fit on confirmed cases and predict for the test rows
Tree_model.fit(x, y1)
pred1 = Tree_model.predict(x_test)
pred1 = pd.DataFrame(pred1, columns=["ConfirmedCases_prediction"])

In [10]:
pred1.head()

Unnamed: 0,ConfirmedCases_prediction
0,7.0
1,7.0
2,11.0
3,21.0
4,21.0


### Train Deaths Tree

In [11]:
# Refit the same estimator on fatalities (this overwrites the confirmed-cases fit)
Tree_model.fit(x, y2)
pred2 = Tree_model.predict(x_test)
pred2 = pd.DataFrame(pred2, columns=["Death_prediction"])

### Prepare for Submission

In [12]:

Sub = pd.read_csv("../input/covid19-global-forecasting-week-1/submission.csv")
sub_new = Sub[["ForecastId"]]
sub_new

Unnamed: 0,ForecastId
0,1
1,2
2,3
3,4
4,5
...,...
12207,12208
12208,12209
12209,12210
12210,12211


In [13]:
# Combine the predictions with the forecast IDs (rows are matched by index)

submit = pd.concat([pred1,pred2,sub_new],axis=1)
submit.head()


Unnamed: 0,ConfirmedCases_prediction,Death_prediction,ForecastId
0,7.0,0.0,1
1,7.0,0.0,2
2,11.0,0.0,3
3,21.0,0.0,4
4,21.0,0.0,5


In [14]:
# Rename and reorder the columns to match the submission format, then cast to integers
submit.columns = ['ConfirmedCases', 'Fatalities', 'ForecastId']
submit = submit[['ForecastId','ConfirmedCases', 'Fatalities']]

submit["ConfirmedCases"] = submit["ConfirmedCases"].astype(int)
submit["Fatalities"] = submit["Fatalities"].astype(int)

In [15]:

submit.describe()


Unnamed: 0,ForecastId,ConfirmedCases,Fatalities
count,12212.0,12212.0,12212.0
mean,6106.5,1208.889125,53.222486
std,3525.445078,6234.287452,417.608734
min,1.0,0.0,0.0
25%,3053.75,6.0,0.0
50%,6106.5,81.0,0.0
75%,9159.25,367.0,3.0
max,12212.0,67800.0,6077.0


In [16]:
# Write the submission file to the working directory
submit.to_csv('submission.csv', index=False)