# COVID19: EDA and forecasting
## Data
All the data in this notebook come from Kaggle and originate from the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). 
Link: https://www.kaggle.com/c/covid19-global-forecasting-week-5/data?select=train.csv

## Aims
- **EDA**: I show several visualizations of the spread of COVID19 around the world, including the evolution of deaths and cases for each country;
- **Forecasting**: several countries have already passed the peak of cases and deaths; others have not yet. Here, my aim is to find the peak of deaths in Brazil. 

## Documentation and links
The project is published at:
- **Medium link:** https://medium.com/@operti.felipe/covid19-brazilians-pandemic-peak-forecasting-885373cf22d6
- **Github:** https://github.com/felipeoperti/covid19

In [1]:
import pandas as pd

In [2]:
# Import data
data = pd.read_csv("../data/train.csv")

In [3]:
data.head()

Unnamed: 0,Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue
0,1,,,Afghanistan,27657145,0.058359,2020-01-23,ConfirmedCases,0.0
1,2,,,Afghanistan,27657145,0.583587,2020-01-23,Fatalities,0.0
2,3,,,Afghanistan,27657145,0.058359,2020-01-24,ConfirmedCases,0.0
3,4,,,Afghanistan,27657145,0.583587,2020-01-24,Fatalities,0.0
4,5,,,Afghanistan,27657145,0.058359,2020-01-25,ConfirmedCases,0.0


In [4]:
data.shape

(796490, 9)

In [5]:
data.columns

Index(['Id', 'County', 'Province_State', 'Country_Region', 'Population',
       'Weight', 'Date', 'Target', 'TargetValue'],
      dtype='object')

In [6]:
countries = data["Country_Region"].unique()
targets = data["Target"].unique()

In [7]:
countries

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burma', 'Burundi',
       'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Costa Rica',
       "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Denmark',
       'Diamond Princess', 'Djibouti', 'Dominica', 'Dominican Republic',
       'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea',
       'Estonia', 'Eswatini', 'Ethiopia', 'Fiji', 'Finland', 'France',
       'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece',
       'Grenada', 'Guatemala', 'Guin

In [8]:
# Additional info (mean age and GDP) for several countries
countries_dict = {"Italy":{
                      "mean_age":45.5,
                      "GDP":1988636},
                  "Brazil":{
                      "mean_age":32.6,
                      "GDP":1847020},
                  "US":{
                      "mean_age":38.1,
                      "GDP":21439453},
                  "United Kingdom":{
                      "mean_age":40.5,
                      "GDP":2743586},
                  "Portugal":{
                      "mean_age":42.2,
                      "GDP":236408},
                  "Canada":{
                      "mean_age":42.2,
                      "GDP":1730914},
                  "Austria":{
                      "mean_age":44.0,
                      "GDP":447718},
                  "Belgium":{
                      "mean_age":41.4,
                      "GDP":517609},
                  "Germany":{
                      "mean_age":47.1,
                      "GDP":3863344},
                  "Greece":{
                      "mean_age":44.5,
                      "GDP":214012},
                  "Finland":{
                      "mean_age":42.5,
                      "GDP":269654}, 
                  "France":{
                      "mean_age":41.4,
                      "GDP":2707074},
                  "Netherlands":{
                      "mean_age":42.6,
                      "GDP":902355},
                  "Russia":{
                      "mean_age":39.6,
                      "GDP":1637892},
                  "Poland":{
                      "mean_age":40.7,
                      "GDP":565854},
                  "Sweden":{
                      "mean_age":41.2,
                      "GDP":528929},
                  "Spain":{
                      "mean_age":42.7,
                      "GDP":1397870},
                  "Japan":{
                      "mean_age":47.3,
                      "GDP":5154475},
                  "Korea, South":{
                      "mean_age":41.8,
                      "GDP":1629532},
                  "China":{
                      "mean_age":37.4,
                      "GDP":14140163
                     
                  }
                  
                }

In [9]:
# Keep only country-level rows (drop county and province rows). I am only interested in country data
data['Date'] =  pd.to_datetime(data['Date'], format='%Y-%m-%d')
data = data[(data["County"].isnull()) & (data["Province_State"].isnull())] 
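
The filter above can be checked on a tiny frame. This is only a sketch: the toy rows below are made up for illustration, and it assumes (as the cell above does) that country-level rows are the ones where both `County` and `Province_State` are missing.

```python
import pandas as pd

# Hypothetical toy rows: one country-level row, one county row, one state row
toy = pd.DataFrame({
    "County": [None, "Autauga", None],
    "Province_State": [None, "Alabama", "Alabama"],
    "Country_Region": ["US", "US", "US"],
    "TargetValue": [100.0, 1.0, 5.0],
})

# Same condition as in the notebook: both sub-national columns must be null
country_level = toy[toy["County"].isnull() & toy["Province_State"].isnull()]
print(country_level)
```

Only the first row survives, so later sums per country do not double count county and state records.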

In [10]:
data

Unnamed: 0,Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue
0,1,,,Afghanistan,27657145,0.058359,2020-01-23,ConfirmedCases,0.0
1,2,,,Afghanistan,27657145,0.583587,2020-01-23,Fatalities,0.0
2,3,,,Afghanistan,27657145,0.058359,2020-01-24,ConfirmedCases,0.0
3,4,,,Afghanistan,27657145,0.583587,2020-01-24,Fatalities,0.0
4,5,,,Afghanistan,27657145,0.058359,2020-01-25,ConfirmedCases,0.0
...,...,...,...,...,...,...,...,...,...
796485,969586,,,Zimbabwe,14240168,0.607106,2020-05-14,Fatalities,0.0
796486,969587,,,Zimbabwe,14240168,0.060711,2020-05-15,ConfirmedCases,5.0
796487,969588,,,Zimbabwe,14240168,0.607106,2020-05-15,Fatalities,0.0
796488,969589,,,Zimbabwe,14240168,0.060711,2020-05-16,ConfirmedCases,0.0


In [11]:
def create_dataframe_country_target(data, country, target, countries_dict):
    """
    This function creates a dataframe for a country
    Args:
        - data: dataframe with all the data
        - country: country to analyze
        - target: target to add. Could be "Fatalities" or "ConfirmedCases"
        - countries_dict: dictionary with additional info
    Return:
        A dataframe for the country with the target value and additional info
    
    """
    
    df_country = data[(data.Country_Region == country) & (data.Target == target)]
    df_country = df_country.groupby(["Country_Region","Target","Date"]).agg({"Population":"max",
                                                            "TargetValue":"sum"
                                                           })
    df_country = df_country.reset_index().set_index("Date",drop=True)
    df_country["mean_age"] = countries_dict[country]["mean_age"]
    df_country["GDP"] = countries_dict[country]["GDP"]
    df_country["Rate_over_population"] =  df_country["TargetValue"]*100000/df_country["Population"]
    return df_country
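
As a quick sanity check of `Rate_over_population` (daily target value per 100,000 inhabitants), the arithmetic can be reproduced by hand. The sketch below uses the US population and the 7 fatalities reported for 2020-03-13 in the table further down; `rate_per_100k` is a hypothetical helper name, not part of the notebook.

```python
def rate_per_100k(target_value, population):
    """Daily target value expressed per 100,000 inhabitants."""
    return target_value * 100000 / population

# US, 2020-03-13: 7 fatalities, population 324141489
us_rate = rate_per_100k(7.0, 324141489)
print(round(us_rate, 6))
```

The rounded value should agree with the `Rate_over_population` column shown for that date.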

In [12]:
# Some examples
df_Italy = create_dataframe_country_target(data, "Italy", "Fatalities",countries_dict)
df_Brazil = create_dataframe_country_target(data, "Brazil", "Fatalities",countries_dict)
df_US = create_dataframe_country_target(data, "US", "Fatalities",countries_dict)

In [41]:
df_US.iloc[50:100]

Date,Country_Region,Target,Population,TargetValue,mean_age,GDP,Rate_over_population
2020-03-13,US,Fatalities,324141489,7.0,38.1,21439453,0.00216
2020-03-14,US,Fatalities,324141489,7.0,38.1,21439453,0.00216
2020-03-15,US,Fatalities,324141489,9.0,38.1,21439453,0.002777
2020-03-16,US,Fatalities,324141489,22.0,38.1,21439453,0.006787
2020-03-17,US,Fatalities,324141489,23.0,38.1,21439453,0.007096
2020-03-18,US,Fatalities,324141489,10.0,38.1,21439453,0.003085
2020-03-19,US,Fatalities,324141489,82.0,38.1,21439453,0.025298
2020-03-20,US,Fatalities,324141489,44.0,38.1,21439453,0.013574
2020-03-21,US,Fatalities,324141489,63.0,38.1,21439453,0.019436
2020-03-22,US,Fatalities,324141489,119.0,38.1,21439453,0.036712


In [14]:
df_US["Rate_over_population"].max()

1.4163567934989032

In [15]:
df_Brazil.head()

Date,Country_Region,Target,Population,TargetValue,mean_age,GDP,Rate_over_population
2020-01-23,Brazil,Fatalities,206135893,0.0,32.6,1847020,0.0
2020-01-24,Brazil,Fatalities,206135893,0.0,32.6,1847020,0.0
2020-01-25,Brazil,Fatalities,206135893,0.0,32.6,1847020,0.0
2020-01-26,Brazil,Fatalities,206135893,0.0,32.6,1847020,0.0
2020-01-27,Brazil,Fatalities,206135893,0.0,32.6,1847020,0.0


In [36]:
# Plot deaths for each country
import plotly.express as px
fig = px.line()
for country in list(countries_dict.keys()):
    df = create_dataframe_country_target(data, country, "Fatalities",countries_dict)
    fig.add_scatter(x=df.reset_index()['Date'], 
                y=df.reset_index()['TargetValue'],
                mode='lines',   name=country)
fig.update_layout(
    xaxis_title="Time",
    yaxis_title="N. of deaths",
    font=dict(
        family="Courier New, monospace",
        size=8,
        color="#7f7f7f"
    )
)

fig.show()

In [34]:
# Plot deaths over 100K inhabitants for each country
import plotly.express as px
fig = px.line()
for country in list(countries_dict.keys()):
    df = create_dataframe_country_target(data, country, "Fatalities",countries_dict)
    fig.add_scatter(x=df.reset_index()['Date'], 
                y=df.reset_index()['Rate_over_population'],
                mode='lines',   name=country)

fig.show()

In [20]:
import numpy as np
def create_training_df(df_country,n_days,target,population=True,mean_age=True,gdp=True):
    """
    This function creates a dataframe for training the models based on the number of lookback days.
    Args:
        - df_country: dataframe of the country
        - n_days: number of days as lookback
        - target: target column
        - population: if True population data will be used
        - mean_age: if True mean age data will be used
        - gdp: if True gdp data will be used
    Return:
        A dataframe that can be used to train a machine learning model
    """
    df_country = df_country.sort_index().reset_index(drop=False)
    # One row per day i: the target at day i plus the n_days values before it.
    # DataFrame.append was removed in pandas 2.0, so rows are collected in a list.
    rows = []
    for i in range(n_days, df_country.shape[0]):
        row = {"Target": df_country.loc[i, target],
               "Date": df_country.loc[i, "Date"]}
        for day in range(n_days):
            row[day] = df_country.loc[i - n_days + day, target]
        rows.append(row)
    columns = ["Target", "Date"] + list(range(n_days))
    df_train = pd.DataFrame(rows, columns=columns)
    if population:
        df_train["Population"] = df_country["Population"].max()
    if mean_age:
        df_train["mean_age"] = df_country["mean_age"].max()
    if gdp:
        df_train["GDP"] = df_country["GDP"].max()
    return df_train
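
The lookback layout built by `create_training_df` (columns `0 .. n_days-1` hold the previous days, `Target` holds the day to predict) can also be produced without explicit loops. The following is a minimal sketch, assuming NumPy 1.20 or later for `sliding_window_view`; `make_lagged_frame` is a hypothetical helper and leaves out the `Date`, `Population`, `mean_age` and `GDP` columns.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def make_lagged_frame(series, n_days):
    """Return a frame with n_days lookback columns and the next value as Target."""
    values = np.asarray(series, dtype=float)
    # Each window has n_days + 1 values: n_days features followed by the target
    windows = sliding_window_view(values, n_days + 1)
    frame = pd.DataFrame(windows[:, :n_days], columns=list(range(n_days)))
    frame.insert(0, "Target", windows[:, n_days])
    return frame

toy_series = [0.0, 1.0, 2.0, 3.0, 4.0]
lagged = make_lagged_frame(toy_series, n_days=3)
print(lagged)
```

With five values and a lookback of three days there are two rows: `[0, 1, 2] -> 3` and `[1, 2, 3] -> 4`. The series must contain at least `n_days + 1` values, otherwise `sliding_window_view` raises a `ValueError`.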

In [21]:
# Prepare the training dataset using several countries. Brazil will be used as the test set.
# I decided to use 30 days to predict the number of deaths at day 31
countries_to_train = list(countries_dict.keys())
countries_to_train.remove("Brazil")
print(countries_to_train)
target = "Fatalities"
n_days = 30
target_deaths = "Rate_over_population"
def create_training_countries(countries_to_train, data, target, target_deaths, n_days):
    """
    This function creates the training dataset for several countries
    Args:
        - countries_to_train: list with countries to use
        - data: original data
        - target: target to use. Ex: "Fatalities"
        - target_deaths: target of deaths to use
        - n_days: number of days to lookback
    Return:
        A dataframe with the training data using several countries
    """
    df_train = create_training_df(create_dataframe_country_target(data, countries_to_train[0], target,countries_dict),n_days, target=target_deaths)
    for i in range(1, len(countries_to_train)):
        df_train = pd.concat([df_train,
                         create_training_df(create_dataframe_country_target(data, 
                                                                            countries_to_train[i], 
                                                                            target,
                                                                            countries_dict),
                                            n_days,
                                            target=target_deaths)])
    return df_train
df_train = create_training_countries(countries_to_train, data, target, target_deaths, n_days)

['Italy', 'US', 'United Kingdom', 'Portugal', 'Canada', 'Austria', 'Belgium', 'Germany', 'Greece', 'Finland', 'France', 'Netherlands', 'Russia', 'Poland', 'Sweden', 'Spain', 'Japan', 'Korea, South', 'China']


In [22]:
df_train.head()

Unnamed: 0,Target,Date,0,1,2,3,4,5,6,7,...,23,24,25,26,27,28,29,Population,mean_age,GDP
0,0.001648,2020-02-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.001648,60665551,45.5,1988636
1,0.001648,2020-02-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.001648,0.001648,60665551,45.5,1988636
2,0.006594,2020-02-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.001648,0.001648,0.001648,60665551,45.5,1988636
3,0.004945,2020-02-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.001648,0.001648,0.001648,0.006594,60665551,45.5,1988636
4,0.003297,2020-02-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.001648,0.001648,0.001648,0.006594,0.004945,60665551,45.5,1988636


In [23]:
# Models
from sklearn.linear_model import Ridge, Lasso
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
from sklearn.kernel_ridge import KernelRidge

def rfr(X,y):
    """
    Random forest regression model
    Args:
        - X: features
        - y: target
    Return:
        Trained model
    
    """
    param_grid = { 
        'rf__n_estimators': [100],
        'rf__max_depth' : [2,3,4,5,6,7,8],
        'rf__min_samples_leaf':[3,5]
    }    
    pipeline = Pipeline([
        ('rf', RandomForestRegressor())
    ])
    
    CV_regr = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
    CV_regr.fit(X, y)
    return CV_regr

def ridge(X,y):    
    """
    Linear Ridge regression model
    Args:
        - X: features
        - y: target
    Return:
        Trained model
    
    """
    parameters = {'ridge__alpha':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                  'ridge__solver':['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
                  'ridge__fit_intercept':[True,False],
                 }
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', Ridge())
    ])

    CV_regr = GridSearchCV(pipeline,  param_grid=parameters, cv=5)
    CV_regr.fit(X,y)
    return CV_regr
    
def lasso(X,y):    
    """
    Linear Lasso regression model
    Args:
        - X: features
        - y: target
    Return:
        Trained model
    
    """
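    # 'normalize' is omitted: it was removed from Lasso in scikit-learn 1.2,
    # and the StandardScaler step below already scales the features.
    parameters = {'lasso__alpha':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                  'lasso__fit_intercept':[True,False],
                  'lasso__positive':[True, False],
                  'lasso__selection':['cyclic', 'random']
                 }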
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('lasso', Lasso())
    ])

    CV_regr = GridSearchCV(pipeline,  param_grid=parameters, cv=5)
    CV_regr.fit(X,y)
    return CV_regr
 
def kernel_ridge_regr(X,y):    
    """
    Non-linear (kernel) Ridge regression model
    Args:
        - X: features
        - y: target
    Return:
        Trained model
    
    """
    parameters = {'kr__alpha':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                  'kr__gamma':[None,0.1,1,2.0],
                  'kr__kernel':['rbf','linear','polynomial'],
                  'kr__degree':[1,2,3,4,5]
                 }
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('kr', KernelRidge())
    ])

    CV_regr = GridSearchCV(pipeline,  param_grid=parameters, cv=5)
    CV_regr.fit(X,y)
    return CV_regr
 
    

In [24]:
def apply_model(model,X,y):
    """
    This function applies the chosen model
    Args:
        - model: model to apply
        - X: features
        - y: target
    Returns:
        Model trained
    """
    
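    if model == "RFR":
        CV_regr = rfr(X,y)
    elif model == "ridge":
        CV_regr = ridge(X,y)
    elif model == "lasso":
        CV_regr = lasso(X,y)
    elif model == "kernel_ridge":
        CV_regr = kernel_ridge_regr(X,y)
    else:
        # Fail early instead of hitting an unbound CV_regr below
        raise ValueError(f"Unknown model: {model}")
    return CV_regr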

In [25]:
# Apply model
CV_regr = apply_model("kernel_ridge",df_train.drop(["Target","Date"],axis=1), df_train["Target"])


Ill-conditioned matrix (rcond=7.607e-18): result may not be accurate.



In [26]:
# Create base for test
df_Brazil_test = create_training_df(create_dataframe_country_target(data, "Brazil", target,countries_dict),n_days,target="Rate_over_population")

In [27]:
df_Brazil_test.head()

Unnamed: 0,Target,Date,0,1,2,3,4,5,6,7,...,23,24,25,26,27,28,29,Population,mean_age,GDP
0,0.0,2020-02-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020
1,0.0,2020-02-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020
2,0.0,2020-02-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020
3,0.0,2020-02-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020
4,0.0,2020-02-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020


In [28]:
# Prediction for the test data
df_Brazil_test["Prediction"] = CV_regr.predict(df_Brazil_test.drop(["Target","Date"],axis=1))

In [29]:
df_Brazil_test.head()

Unnamed: 0,Target,Date,0,1,2,3,4,5,6,7,...,24,25,26,27,28,29,Population,mean_age,GDP,Prediction
0,0.0,2020-02-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020,0.042617
1,0.0,2020-02-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020,0.042617
2,0.0,2020-02-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020,0.042617
3,0.0,2020-02-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020,0.042617
4,0.0,2020-02-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,206135893,32.6,1847020,0.042617


In [30]:
# Show prediction and ground truth
fig = px.line()
fig.add_scatter(y=df_Brazil_test['Prediction'],
                mode='lines',   name="Prediction")
fig.add_scatter(y=df_Brazil_test['Target'],
                mode='lines',   name="Target")
fig.update_layout(
    xaxis_title="Time",
    yaxis_title="Deaths per 100K inhabitants",
    font=dict(
        family="Courier New, monospace",
        size=11,
        color="#7f7f7f"
    )
)

fig.show()

In [31]:
import random
def roll_predictions(day_to_roll,start_row,df_test,noise,CV_regr):
    """
    This function extends the prediction over several following days. Each prediction is fed back
    as the most recent lookback value for the next prediction
    Args:
        - day_to_roll: number of days to predict
        - start_row: row (day) from which the prediction starts
        - df_test: dataframe with the data
        - noise: scale of the uniform random noise added to each prediction
        - CV_regr: trained model
    Returns:
        Dataframe with the result of the prediction and the comparison with the real values
    """
    cols = list(df_test.columns)
    cols.remove("Target")
    cols.remove("Population")
    cols.remove("mean_age")
    cols.remove("GDP")
    cols.remove("Date")
    df_test_to_compare = df_test.copy()
    df_test = df_test.drop("Date",axis=1).reset_index(drop=True)
    row = df_test.iloc[start_row,1:len(df_test.columns)].to_frame().transpose()
    preds = []
    for day in range(day_to_roll):
        prediction = CV_regr.predict(row)
        prediction = prediction[0]+ random.uniform(0, 1)*noise
        preds.append(prediction)
        for i in range(len(cols)-1):
            row[cols[i]] =row[cols[i+1]] 
        row[len(cols)-1] = prediction     
        
    predicted = pd.DataFrame(np.column_stack((list(np.arange(start_row,day_to_roll+start_row,1)),preds)),
                        columns=["Day","Target_predicted"])
    real = df_test_to_compare[["Target","Date"]].reset_index(drop=False)
    real.columns = [ "Day","Target_real","Date"]
    df_comparison = pd.merge(real,predicted,on="Day",how="outer")
    df_date_to_fill = pd.Series(pd.date_range(start=df_comparison["Date"].max(),periods=df_comparison["Date"].isnull().sum()+1)).iloc[1:]
    df_exists = df_comparison[df_comparison["Date"].notnull()]["Date"]
    # Series.append was removed in pandas 2.0; pd.concat gives the same result
    df_comparison["Date"] = pd.concat([df_exists, df_date_to_fill]).values
    
    return df_comparison
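
The core idea of `roll_predictions` is recursive forecasting: predict one step, push the prediction into the lookback window, drop the oldest value, and repeat. Here is a stripped-down sketch of that loop. `roll_forward` and `linear_extrapolation` are hypothetical names; the latter is only a stand-in for `CV_regr.predict`, not the model used in this notebook.

```python
import numpy as np

def roll_forward(predict_one, last_window, n_steps):
    """Forecast n_steps ahead, feeding each prediction back into the window."""
    window = np.asarray(last_window, dtype=float).copy()
    forecasts = []
    for _ in range(n_steps):
        next_value = float(predict_one(window))
        forecasts.append(next_value)
        # Shift the window left by one day and append the new prediction
        window = np.append(window[1:], next_value)
    return forecasts

def linear_extrapolation(window):
    """Toy one-step model: continue the last observed difference."""
    return window[-1] + (window[-1] - window[-2])

print(roll_forward(linear_extrapolation, [1.0, 2.0, 3.0], n_steps=3))
```

Starting from `[1, 2, 3]` the toy model continues the trend with 4, 5 and 6. Because every step consumes earlier predictions, errors accumulate over the horizon, which is worth keeping in mind when reading a 100-day roll-out.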

In [32]:
from datetime import datetime
# Compute the rolling prediction starting from the 22nd of April (start_row=60)
df_test = create_training_df(create_dataframe_country_target(data, "Brazil", target,countries_dict),n_days,target="Rate_over_population")
day_to_roll=100
start_row=60
noise=0.00
df_comparison = roll_predictions(day_to_roll,start_row,df_test,noise,CV_regr)

In [33]:
# Plot the results
fig = px.line()
fig.add_scatter(x=df_comparison["Date"],
                y=df_comparison['Target_real'],
                mode='lines',   name="Real")
fig.add_scatter(x=df_comparison["Date"],
                y=df_comparison['Target_predicted'],
                mode='lines',   name="Predicted")
fig.update_layout(
    xaxis_title="Time",
    yaxis_title="Deaths per 100K inhabitants",
    font=dict(
        family="Courier New, monospace",
        size=11,
        color="#7f7f7f"
    )
)

fig.show()

In [32]:
# Calculate the RMSE of the prediction
from sklearn.metrics import mean_squared_error
from math import sqrt
# dropna() keeps only the days where both the real and the predicted value exist
pred = df_comparison.dropna()['Target_predicted']
true = df_comparison.dropna()['Target_real']
rmse = sqrt(mean_squared_error(true, pred))
rmse

0.06252680629083622
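
For reference, the RMSE is the square root of the mean squared difference between real and predicted values, so it is expressed in the same unit as the target (here, deaths per 100K inhabitants). A minimal NumPy sketch with made-up numbers follows; the local `rmse_by_hand` function is hypothetical and only mirrors what `sqrt(mean_squared_error(...))` computes.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values: every prediction is off by exactly 1 in either direction
real = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 1.0, 4.0, 3.0]
print(rmse_by_hand(real, predicted))
```

The metric is symmetric, so swapping the two arguments gives the same number.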

In [33]:
# Show the predicted peak of deaths in Brazil (row with the highest predicted rate)
df_comparison.iloc[df_comparison["Target_predicted"].argmax()]

Day                                 110
Target_real                         NaN
Date                2020-06-11 00:00:00
Target_predicted                0.53835
Name: 110, dtype: object

In [34]:
df_comparison.iloc[df_comparison["Target_predicted"].argmax()]["Date"].strftime("%Y-%m-%d")

'2020-06-11'