In [153]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import date
import math

In [154]:
# Jupyter configuration
pd.set_option('display.max_rows', 500)

In [155]:
dataset = pd.read_csv('owid-covid-data.csv')

Separate the data into two datasets: one with the per-location rows and one with the aggregated `World` rows.

In [156]:
countries_dataset = dataset[dataset['location'] != 'World']
world_dataset = dataset[dataset['location'] == 'World']

countries_dataset.to_csv('countries-covid-data.csv')
world_dataset.to_csv('world-covid-data.csv')

dataset = countries_dataset


In [157]:
# The shape of the dataset
dataset.shape

(127167, 65)

### Variables description

| Variable                             | Description                                                                                                    |
|:-------------------------------------|:---------------------------------------------------------------------------------------------------------------|
| `icu_patients`                       | Number of COVID-19 patients in intensive care units (ICUs) on a given day                                      |
| `icu_patients_per_million`           | Number of COVID-19 patients in intensive care units (ICUs) on a given day per 1,000,000 people                 |
| `hosp_patients`                      | Number of COVID-19 patients in hospital on a given day                                                         |
| `hosp_patients_per_million`          | Number of COVID-19 patients in hospital on a given day per 1,000,000 people                                    |
| `weekly_icu_admissions`              | Number of COVID-19 patients newly admitted to intensive care units (ICUs) in a given week                      |
| `weekly_icu_admissions_per_million`  | Number of COVID-19 patients newly admitted to intensive care units (ICUs) in a given week per 1,000,000 people |
| `weekly_hosp_admissions`             | Number of COVID-19 patients newly admitted to hospitals in a given week                                        |
| `weekly_hosp_admissions_per_million` | Number of COVID-19 patients newly admitted to hospitals in a given week per 1,000,000 people                   |

### Policy responses
| Variable           | Description                                                                                                                                                                                                         |
|:-------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `stringency_index` | Government Response Stringency Index: composite measure based on 9 response indicators including school closures, workplace closures, and travel bans, rescaled to a value from 0 to 100 (100 = strictest response) |

### Reproduction rate
| Variable            | Description                                                                                                                                   |
|:--------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
| `reproduction_rate` | Real-time estimate of the effective reproduction rate (R) of COVID-19. See https://github.com/crondonm/TrackingR/tree/main/Estimates-Database |

### Tests & positivity
| Variable                          | Description                                                                                                                                                                                                                                                                                                          |
|:----------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `total_tests`                     | Total tests for COVID-19                                                                                                                                                                                                                                                                                             |
| `new_tests`                       | New tests for COVID-19 (only calculated for consecutive days)                                                                                                                                                                                                                                                        |
| `total_tests_per_thousand`        | Total tests for COVID-19 per 1,000 people                                                                                                                                                                                                                                                                            |
| `new_tests_per_thousand`          | New tests for COVID-19 per 1,000 people                                                                                                                                                                                                                                                                              |
| `new_tests_smoothed`              | New tests for COVID-19 (7-day smoothed). For countries that don't report testing data on a daily basis, we assume that testing changed equally on a daily basis over any periods in which no data was reported. This produces a complete series of daily figures, which is then averaged over a rolling 7-day window |
| `new_tests_smoothed_per_thousand` | New tests for COVID-19 (7-day smoothed) per 1,000 people                                                                                                                                                                                                                                                             |
| `positive_rate`                   | The share of COVID-19 tests that are positive, given as a rolling 7-day average (this is the inverse of tests_per_case)                                                                                                                                                                                              |
| `tests_per_case`                  | Tests conducted per new confirmed case of COVID-19, given as a rolling 7-day average (this is the inverse of positive_rate)                                                                                                                                                                                          |
| `tests_units`                     | Units used by the location to report its testing data 
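The description of `new_tests_smoothed` can be illustrated with a small sketch. This is not the data provider's actual pipeline; it is one plausible reading of "testing changed equally on a daily basis", implemented as linear interpolation of a cumulative series followed by a 7-day rolling mean. The series values and dates are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative test counts; NaN marks days with no report.
total_tests = pd.Series(
    [100.0, np.nan, np.nan, 400.0, 500.0, np.nan, 700.0, 800.0, 900.0, 1000.0],
    index=pd.date_range("2021-01-01", periods=10, freq="D"),
)

# Assumption: spread the change evenly across unreported days
# (linear interpolation of the cumulative total).
filled_total = total_tests.interpolate(method="linear")

# Daily new tests derived from the completed cumulative series.
daily_new = filled_total.diff()

# Average the daily figures over a rolling 7-day window.
new_tests_smoothed = daily_new.rolling(window=7).mean()

print(new_tests_smoothed)
```

The first seven entries of the smoothed series are `NaN` because a full 7-day window of daily figures is not yet available.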

Several of these variables have a paired variant whose name ends in `per_thousand` or `per_million`. These variants express the same measurement relative to population size, so within a given location they behave exactly like the raw counts. For example, `icu_patients_per_million` carries the same information as `icu_patients`. We therefore exclude the per-capita variants, along with `new_tests_smoothed`, which is a 7-day smoothed version of `new_tests`. The variables to be excluded are:

* `icu_patients_per_million`
* `hosp_patients_per_million`
* `weekly_icu_admissions_per_million`
* `weekly_hosp_admissions_per_million`
* `total_tests_per_thousand`
* `new_tests_per_thousand`
* `new_tests_smoothed`
* `new_tests_smoothed_per_thousand`
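To see why a per-million variable adds little beyond the raw count, here is a minimal sketch on made-up numbers for a single location. The `population` value and the counts are hypothetical, and the scaling formula is an assumption based on the variable descriptions.

```python
import pandas as pd

# Hypothetical figures for one location; population is held constant.
population = 5_000_000
df = pd.DataFrame({"icu_patients": [10.0, 25.0, 40.0, 30.0]})

# Per-million variant: the raw count rescaled by population.
df["icu_patients_per_million"] = df["icu_patients"] / population * 1_000_000

# Within one location the two columns are perfectly correlated.
correlation = df["icu_patients"].corr(df["icu_patients_per_million"])

print(df)
print(correlation)
```

Note that this holds within a location. Across locations with different populations, the raw and per-capita columns can rank countries differently.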

In [158]:
dataset.drop([
        'icu_patients_per_million',
        'hosp_patients_per_million', 
        'weekly_icu_admissions_per_million',
        'weekly_hosp_admissions_per_million',
        'total_tests_per_thousand',
        'new_tests_per_thousand',
        'new_tests_smoothed',
        'new_tests_smoothed_per_thousand'
    ],
    axis='columns',
    inplace=True
)

dataset.shape

(127167, 57)

Select only a subset of variables to work with

In [159]:
dataset = dataset[[
    'location',
    'date',
    'icu_patients',
    'hosp_patients',
    'weekly_icu_admissions', 
    'weekly_hosp_admissions', 
    'stringency_index', 
    'reproduction_rate', 
    'total_tests', 
    'new_tests', 
    'positive_rate', 
    'tests_per_case', 
    'tests_units'
]]

dataset.shape

(127167, 13)

In [160]:
# All available variables
dataset.columns.tolist()

['location',
 'date',
 'icu_patients',
 'hosp_patients',
 'weekly_icu_admissions',
 'weekly_hosp_admissions',
 'stringency_index',
 'reproduction_rate',
 'total_tests',
 'new_tests',
 'positive_rate',
 'tests_per_case',
 'tests_units']

In [161]:
# Fixing date column datatype
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
dataset = dataset.assign(date=pd.to_datetime(dataset['date']))
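pandas can raise `SettingWithCopyWarning` when a value is assigned into a DataFrame that was itself obtained by selecting from another one, as `dataset` was above. A minimal sketch, on hypothetical toy data, of taking an explicit `.copy()` of the subset before modifying it:

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical values).
raw = pd.DataFrame({
    "location": ["A", "A", "B"],
    "date": ["2021-10-30", "2021-10-31", "2021-11-01"],
    "new_tests": [10.0, None, 7.0],
})

# .copy() makes the subset an independent DataFrame, so later
# assignments modify it directly and leave `raw` untouched.
subset = raw[["location", "date"]].copy()
subset["date"] = pd.to_datetime(subset["date"])

print(subset.dtypes)
```

The original `raw` frame keeps its string dates, while `subset` holds proper datetimes.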


In [162]:
dataset.describe()

Unnamed: 0,icu_patients,hosp_patients,weekly_icu_admissions,weekly_hosp_admissions,stringency_index,reproduction_rate,total_tests,new_tests,positive_rate,tests_per_case
count,15022.0,17713.0,1262.0,2186.0,106613.0,101931.0,54699.0,54508.0,61972.0,61316.0
mean,915.236786,4034.903066,214.015366,2976.750189,56.498595,0.999924,11600520.0,56475.72,0.086487,159.830328
std,2879.425099,11403.399705,494.954694,10333.090304,20.64832,0.341983,48439860.0,195016.6,0.095893,841.823266
min,0.0,0.0,0.0,0.0,0.0,-0.03,0.0,1.0,0.0,1.0
25%,25.0,100.0,5.091,43.085,42.59,0.83,250664.5,2098.0,0.016,8.0
50%,122.0,488.0,25.8565,250.761,57.41,1.01,1262217.0,7641.5,0.05,19.5
75%,525.75,2278.0,150.8335,1206.8665,72.22,1.17,5162169.0,29509.5,0.124,57.8
max,28891.0,133253.0,4002.456,116307.0,100.0,5.96,621566200.0,3740296.0,0.97,50000.0


In [163]:
# Variable types
dataset.dtypes

location                          object
date                      datetime64[ns]
icu_patients                     float64
hosp_patients                    float64
weekly_icu_admissions            float64
weekly_hosp_admissions           float64
stringency_index                 float64
reproduction_rate                float64
total_tests                      float64
new_tests                        float64
positive_rate                    float64
tests_per_case                   float64
tests_units                       object
dtype: object

#### There are three variables that are not numerical: location, date and tests_units. 
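A quick way to list the non-numerical columns programmatically is `DataFrame.select_dtypes`. The sketch below uses a small hypothetical frame with the same column types as the dataset.

```python
import pandas as pd

# Small stand-in with the same kinds of columns as the dataset.
df = pd.DataFrame({
    "location": ["Zimbabwe", "Zimbabwe"],
    "date": pd.to_datetime(["2021-10-29", "2021-10-30"]),
    "new_tests": [2740.0, 3755.0],
    "tests_units": ["tests performed", "tests performed"],
})

# Keep every column whose dtype is not a numeric type.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print(non_numeric)
```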

### Before dealing with missing values, let's take a look at the head and tail of the resulting dataset. 

In [164]:
dataset.head(10)

Unnamed: 0,location,date,icu_patients,hosp_patients,weekly_icu_admissions,weekly_hosp_admissions,stringency_index,reproduction_rate,total_tests,new_tests,positive_rate,tests_per_case,tests_units
0,Afghanistan,2020-02-24,,,,,8.33,,,,,,
1,Afghanistan,2020-02-25,,,,,8.33,,,,,,
2,Afghanistan,2020-02-26,,,,,8.33,,,,,,
3,Afghanistan,2020-02-27,,,,,8.33,,,,,,
4,Afghanistan,2020-02-28,,,,,8.33,,,,,,
5,Afghanistan,2020-02-29,,,,,8.33,,,,,,
6,Afghanistan,2020-03-01,,,,,27.78,,,,,,
7,Afghanistan,2020-03-02,,,,,27.78,,,,,,
8,Afghanistan,2020-03-03,,,,,27.78,,,,,,
9,Afghanistan,2020-03-04,,,,,27.78,,,,,,


In [165]:
dataset.tail(10)

Unnamed: 0,location,date,icu_patients,hosp_patients,weekly_icu_admissions,weekly_hosp_admissions,stringency_index,reproduction_rate,total_tests,new_tests,positive_rate,tests_per_case,tests_units
127807,Zimbabwe,2021-10-23,,,,,44.44,0.72,1350237.0,3183.0,0.013,78.7,tests performed
127808,Zimbabwe,2021-10-24,,,,,44.44,0.74,1351732.0,1495.0,0.014,71.9,tests performed
127809,Zimbabwe,2021-10-25,,,,,44.44,0.74,1355031.0,3299.0,0.016,61.9,tests performed
127810,Zimbabwe,2021-10-26,,,,,44.44,0.75,1357443.0,2412.0,0.017,60.3,tests performed
127811,Zimbabwe,2021-10-27,,,,,44.44,0.75,1361473.0,4030.0,0.018,56.8,tests performed
127812,Zimbabwe,2021-10-28,,,,,44.44,0.76,1365439.0,3966.0,0.015,65.5,tests performed
127813,Zimbabwe,2021-10-29,,,,,44.44,,1368179.0,2740.0,0.016,62.5,tests performed
127814,Zimbabwe,2021-10-30,,,,,44.44,,1371934.0,3755.0,0.016,64.2,tests performed
127815,Zimbabwe,2021-10-31,,,,,44.44,,,,,,
127816,Zimbabwe,2021-11-01,,,,,44.44,,,,,,
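The Zimbabwe rows above offer a way to check the documented claim that `positive_rate` is the inverse of `tests_per_case`. A minimal sketch, with the values copied from those rows; the tolerance is an assumption that accounts for both columns being rounded in the source:

```python
import numpy as np
import pandas as pd

# Values copied from the Zimbabwe rows shown above (2021-10-23 to 2021-10-30).
rows = pd.DataFrame({
    "positive_rate": [0.013, 0.014, 0.016, 0.017, 0.018, 0.015, 0.016, 0.016],
    "tests_per_case": [78.7, 71.9, 61.9, 60.3, 56.8, 65.5, 62.5, 64.2],
})

rows["inverse_tests_per_case"] = 1 / rows["tests_per_case"]

# Both columns are rounded in the source, so compare with a tolerance.
rows["matches"] = np.isclose(
    rows["positive_rate"], rows["inverse_tests_per_case"], atol=1e-3
)

print(rows)
```

Since one column can be derived from the other, keeping both adds little independent information.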


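All of the numerical variables are stored by `pandas` as `float64`, including count variables such as `icu_patients`, because a NumPy integer column cannot hold missing values (`NaN`). This is fine: there are many missing values, and we will soon replace them with the mean or the median, depending on the skewness of each variable, and these replacements will generally not be integers. 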
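The mean-or-median choice can be sketched as follows. In the summary above, the mean of `icu_patients` (915.2) is far above its median (122.0), which suggests strong right skew. The helper name `impute_by_skewness` and the threshold of 1.0 are assumptions for illustration, not something fixed by this notebook.

```python
import numpy as np
import pandas as pd

def impute_by_skewness(series: pd.Series, threshold: float = 1.0) -> pd.Series:
    """Fill NaNs with the median if |skewness| > threshold, else with the mean."""
    skewness = series.skew()  # NaNs are skipped by default
    if abs(skewness) > threshold:
        fill_value = series.median()  # robust to extreme values
    else:
        fill_value = series.mean()
    return series.fillna(fill_value)

# Hypothetical examples: one symmetric column, one strongly right-skewed.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 100.0, np.nan])

print(impute_by_skewness(symmetric).iloc[-1])
print(impute_by_skewness(skewed).iloc[-1])
```

For time series grouped by country, filling per location (for example with `groupby('location')`) may be more appropriate than a single global value; that choice is left open here.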

For now, let's take a look at all the unique values in the `tests_units` column.

Because this variable is not numerical, it is likely to be only weakly related to the numerical variables and is unlikely to help much with the later prediction, so we drop it.

In [166]:
dataset['tests_units'].value_counts()
# Reassign instead of using inplace=True to avoid pandas' SettingWithCopyWarning
dataset = dataset.drop('tests_units', axis='columns')
dataset.shape

(127167, 12)