# Understanding COVID-19 Case Counts Using Machine Learning

In [72]:
# used library imports
import pandas as pd

### Covid Data from Our World in Data

Last downloaded: 16th of May 2024

https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv

In [73]:
df = pd.read_csv('resources/owid-covid-data.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397677 entries, 0 to 397676
Data columns (total 67 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   iso_code                                    397677 non-null  object 
 1   continent                                   378618 non-null  object 
 2   location                                    397677 non-null  object 
 3   date                                        397677 non-null  object 
 4   total_cases                                 358581 non-null  float64
 5   new_cases                                   386444 non-null  float64
 6   new_cases_smoothed                          385214 non-null  float64
 7   total_deaths                                336105 non-null  float64
 8   new_deaths                                  386794 non-null  float64
 9   new_deaths_smoothed                         385564 non-null  float64
 

### Inspecting the Data

Findings from a first look at the data:
- we need to decide whether to use the smoothed or the raw variables; using both would be redundant
- total tests are probably redundant
- total vaccinations are redundant if we keep people vaccinated, people fully vaccinated and boosters
- some columns have a high number of missing values, such as handwashing_facilities and the smoker columns
- for ICU and hospital patients, the per-million variants are probably better because they are comparable across countries
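
The missing-value finding in the list above can be quantified per column. The following is a minimal sketch on a small made-up frame standing in for `df`; the name `toy` and all of its values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the OWID frame; values are invented for illustration.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "new_cases": [1.0, 2.0, np.nan, 4.0],
    "handwashing_facilities": [np.nan, np.nan, np.nan, 37.7],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real frame, the same two calls would give a ranked list from which a cut-off for dropping columns could be chosen.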



In [74]:
df

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-01-05,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
1,AFG,Asia,Afghanistan,2020-01-06,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
2,AFG,Asia,Afghanistan,2020-01-07,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
3,AFG,Asia,Afghanistan,2020-01-08,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
4,AFG,Asia,Afghanistan,2020-01-09,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397672,ZWE,Africa,Zimbabwe,2024-04-24,266359.0,0.0,0.0,5740.0,0.0,0.0,...,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
397673,ZWE,Africa,Zimbabwe,2024-04-25,266359.0,0.0,0.0,5740.0,0.0,0.0,...,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
397674,ZWE,Africa,Zimbabwe,2024-04-26,266359.0,0.0,0.0,5740.0,0.0,0.0,...,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
397675,ZWE,Africa,Zimbabwe,2024-04-27,266359.0,0.0,0.0,5740.0,0.0,0.0,...,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,


In [75]:
df.describe()

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
count,358581.0,386444.0,385214.0,336105.0,386794.0,385564.0,358581.0,386444.0,385214.0,336105.0,...,228740.0,151351.0,272868.0,366406.0,299646.0,397677.0,13276.0,13276.0,13276.0,13276.0
mean,7560483.0,8503.695,8530.716,90969.52,76.169995,76.40955,115014.660831,129.532618,129.933273,929.946747,...,32.9109,50.790431,3.097428,73.710804,0.722574,129315500.0,55652.16,9.762172,10.98641,1778.714099
std,44764740.0,236536.3,89195.75,460885.5,1405.682671,527.419277,162125.095103,1526.641288,565.267399,1139.151038,...,13.572257,31.952834,2.548117,7.394839,0.149003,663527200.0,155499.0,11.990399,24.514719,1989.522648
min,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.7,1.188,0.1,53.28,0.394,47.0,-37726.1,-44.23,-95.92,-2936.4531
25%,9631.0,0.0,0.0,138.0,0.0,0.0,3040.311,0.0,0.0,66.82,...,22.6,20.859,1.3,69.59,0.602,449002.0,177.6251,2.12,-1.46,122.933152
50%,81581.0,0.0,15.571,1407.0,0.0,0.0,33973.672,0.0,3.801,433.799,...,33.1,49.839,2.5,75.05,0.74,5882259.0,6806.349,8.15,5.725,1259.3004
75%,902161.0,0.0,374.857,12753.0,0.0,3.714,157521.073,0.0,64.734,1493.483,...,41.3,82.502,4.2,79.46,0.829,28301700.0,38941.65,15.0225,15.68,2880.678125
max,775379800.0,44236230.0,6319461.0,7047396.0,103719.0,14817.0,770966.075,240325.866,34332.267,6485.57,...,78.1,100.0,13.8,86.75,0.957,7975105000.0,1345330.0,78.08,377.83,10293.515


Selecting a subset of columns that can be used for predictions

In [76]:
df_subset = df[['continent','location','date','new_cases_per_million','new_deaths_per_million','reproduction_rate','icu_patients_per_million','hosp_patients_per_million','new_tests_per_thousand','people_vaccinated_per_hundred','people_fully_vaccinated_per_hundred','total_boosters_per_hundred','stringency_index','population_density','median_age','aged_65_older','aged_70_older','gdp_per_capita','cardiovasc_death_rate','diabetes_prevalence','female_smokers','male_smokers','hospital_beds_per_thousand','life_expectancy','human_development_index',]]
df_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397677 entries, 0 to 397676
Data columns (total 25 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   continent                            378618 non-null  object 
 1   location                             397677 non-null  object 
 2   date                                 397677 non-null  object 
 3   new_cases_per_million                386444 non-null  float64
 4   new_deaths_per_million               386794 non-null  float64
 5   reproduction_rate                    184817 non-null  float64
 6   icu_patients_per_million             38837 non-null   float64
 7   hosp_patients_per_million            40377 non-null   float64
 8   new_tests_per_thousand               75403 non-null   float64
 9   people_vaccinated_per_hundred        79959 non-null   float64
 10  people_fully_vaccinated_per_hundred  76865 non-null   float64
 11  total_booster

In [77]:
df_subset.isnull().sum()

continent                               19059
location                                    0
date                                        0
new_cases_per_million                   11233
new_deaths_per_million                  10883
reproduction_rate                      212860
icu_patients_per_million               358840
hosp_patients_per_million              357300
new_tests_per_thousand                 322274
people_vaccinated_per_hundred          317718
people_fully_vaccinated_per_hundred    320812
total_boosters_per_hundred             345300
stringency_index                       200385
population_density                      59412
median_age                              83279
aged_65_older                           94084
aged_70_older                           86431
gdp_per_capita                          89356
cardiovasc_death_rate                   88789
diabetes_prevalence                     72815
female_smokers                         165785
male_smokers                      

In [78]:
df_subset.describe()

Unnamed: 0,new_cases_per_million,new_deaths_per_million,reproduction_rate,icu_patients_per_million,hosp_patients_per_million,new_tests_per_thousand,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,stringency_index,...,aged_65_older,aged_70_older,gdp_per_capita,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
count,386444.0,386794.0,184817.0,38837.0,40377.0,75403.0,79959.0,76865.0,52377.0,197292.0,...,303593.0,311246.0,308321.0,308888.0,324862.0,231892.0,228740.0,272868.0,366406.0,299646.0
mean,129.532618,0.806203,0.911495,15.767063,126.813391,3.272466,53.204774,48.368835,36.023039,42.791492,...,8.703831,5.500781,18961.090464,264.360275,8.560196,10.795876,32.9109,3.097428,73.710804,0.722574
std,1526.641288,7.211076,0.399925,22.829584,151.347732,9.033843,29.480966,29.139236,30.397027,24.86726,...,6.097274,4.138575,19875.492581,120.886118,4.940448,10.781817,13.572257,2.548117,7.394839,0.149003
min,0.0,0.0,-0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.144,0.526,661.24,79.37,0.99,0.1,7.7,0.1,53.28,0.394
25%,0.0,0.0,0.72,2.378,32.016,0.286,27.26,20.86,5.76,22.22,...,3.526,2.085,3823.194,175.695,5.35,1.9,22.6,1.3,69.59,0.602
50%,0.0,0.0,0.95,6.527,74.968,0.971,63.99,57.08,35.31,42.59,...,6.378,3.871,12294.876,245.465,7.2,6.3,33.1,2.5,75.05,0.74
75%,0.0,0.0,1.14,18.973,160.568,2.914,77.745,73.6,57.7,62.04,...,13.928,8.643,27216.445,333.436,10.79,19.3,41.3,4.2,79.46,0.829
max,240325.866,906.413,5.87,180.675,1526.846,531.062,129.07,126.89,150.47,100.0,...,27.049,18.493,116935.6,724.417,30.53,44.0,78.1,13.8,86.75,0.957


Overall there are many null values, especially for vaccinations. We need to check whether that is because some countries do not report vaccinations at all, because the missing values fall in the period before vaccinations started, or because there are gaps in the reports. The plan is to fill with 0 up to the first reported vaccination and to carry the last observation forward after that.
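
One way to run that check is to classify each missing vaccination value by where it sits relative to the first report of its location. This is a sketch on invented data, assuming one row per location and day; the frame `toy` and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: "A" starts reporting on day 3 and has a gap on day 4, "B" never reports.
col = "people_vaccinated_per_hundred"
toy = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2021-01-01", periods=5)) * 2,
    col: [np.nan, np.nan, 1.0, np.nan, 2.5] + [np.nan] * 5,
})
toy = toy.sort_values(["location", "date"])

# Locations without a single reported value (count() ignores NaN).
never_reported = toy.groupby("location")[col].count().eq(0)

# Date of the first reported value per location; absent for non-reporters.
first_report = toy.dropna(subset=[col]).groupby("location")["date"].min()

# Classify each missing value: before the first report, or a gap after it.
first_for_row = toy["location"].map(first_report)  # NaT where never reported
is_missing = toy[col].isna()
before_first = is_missing & (toy["date"] < first_for_row)
gap_after = is_missing & (toy["date"] > first_for_row)

print(never_reported)
print(int(before_first.sum()), int(gap_after.sum()))
```

Values flagged by `before_first` are the candidates for zero-filling, values flagged by `gap_after` are the candidates for carrying the last observation forward, and locations in `never_reported` need a separate decision.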

In [79]:
df_subset['continent'].unique()

array(['Asia', nan, 'Europe', 'Africa', 'Oceania', 'North America',
       'South America'], dtype=object)

In [80]:
# drop rows without a continent, then fill missing values of the daily indicators with 0
df_cleaned = df_subset.dropna(subset=['continent']).copy()
zero_fill_columns = ['new_cases_per_million', 'new_deaths_per_million', 'reproduction_rate',
                     'icu_patients_per_million', 'hosp_patients_per_million',
                     'new_tests_per_thousand', 'stringency_index']
df_cleaned[zero_fill_columns] = df_cleaned[zero_fill_columns].fillna(0)

df_cleaned

Unnamed: 0,continent,location,date,new_cases_per_million,new_deaths_per_million,reproduction_rate,icu_patients_per_million,hosp_patients_per_million,new_tests_per_thousand,people_vaccinated_per_hundred,...,aged_65_older,aged_70_older,gdp_per_capita,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
0,Asia,Afghanistan,2020-01-05,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
1,Asia,Afghanistan,2020-01-06,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
2,Asia,Afghanistan,2020-01-07,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
3,Asia,Afghanistan,2020-01-08,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
4,Asia,Afghanistan,2020-01-09,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397672,Africa,Zimbabwe,2024-04-24,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397673,Africa,Zimbabwe,2024-04-25,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397674,Africa,Zimbabwe,2024-04-26,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397675,Africa,Zimbabwe,2024-04-27,0.0,0.0,0.0,0.0,0.0,0.0,,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571


In [81]:
#sort by 'location' and 'date'
df_cleaned = df_cleaned.sort_values(by=['location', 'date'])

# Use groupby on 'location' and apply forward fill
df_cleaned['people_vaccinated_per_hundred'] = df_cleaned.groupby('location')['people_vaccinated_per_hundred'].ffill()
df_cleaned['people_fully_vaccinated_per_hundred'] = df_cleaned.groupby('location')['people_fully_vaccinated_per_hundred'].ffill()
df_cleaned['total_boosters_per_hundred'] = df_cleaned.groupby('location')['total_boosters_per_hundred'].ffill()

#fill the rest of values with 0 (those are values before the first value, therefore they are 0)
df_cleaned.loc[:, 'people_vaccinated_per_hundred']= df_cleaned['people_vaccinated_per_hundred'].fillna(0)
df_cleaned.loc[:, 'people_fully_vaccinated_per_hundred']= df_cleaned['people_fully_vaccinated_per_hundred'].fillna(0)
df_cleaned.loc[:, 'total_boosters_per_hundred']= df_cleaned['total_boosters_per_hundred'].fillna(0)

df_cleaned

Unnamed: 0,continent,location,date,new_cases_per_million,new_deaths_per_million,reproduction_rate,icu_patients_per_million,hosp_patients_per_million,new_tests_per_thousand,people_vaccinated_per_hundred,...,aged_65_older,aged_70_older,gdp_per_capita,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
0,Asia,Afghanistan,2020-01-05,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
1,Asia,Afghanistan,2020-01-06,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
2,Asia,Afghanistan,2020-01-07,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
3,Asia,Afghanistan,2020-01-08,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
4,Asia,Afghanistan,2020-01-09,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,,,0.5,64.83,0.511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397672,Africa,Zimbabwe,2024-04-24,0.0,0.0,0.0,0.0,0.0,0.0,39.45,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397673,Africa,Zimbabwe,2024-04-25,0.0,0.0,0.0,0.0,0.0,0.0,39.45,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397674,Africa,Zimbabwe,2024-04-26,0.0,0.0,0.0,0.0,0.0,0.0,39.45,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397675,Africa,Zimbabwe,2024-04-27,0.0,0.0,0.0,0.0,0.0,0.0,39.45,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571


In [82]:
#df_cleaned[df_cleaned['location'] == 'Austria']
# locations with missing population_density are mostly small territories and sub-national regions
# (e.g. England, Scotland), plus a few countries such as Syria, Taiwan and South Sudan; we drop them all
print(df_cleaned[df_cleaned['population_density'].isna()]['location'].unique())
df_cleaned.dropna(subset=['population_density'],inplace=True)


['Anguilla' 'Bonaire Sint Eustatius and Saba' 'Cook Islands' 'England'
 'Falkland Islands' 'French Guiana' 'Guadeloupe' 'Guernsey' 'Jersey'
 'Martinique' 'Mayotte' 'Montserrat' 'Niue' 'Northern Cyprus'
 'Northern Ireland' 'Pitcairn' 'Reunion' 'Saint Barthelemy' 'Saint Helena'
 'Saint Pierre and Miquelon' 'Scotland' 'South Sudan' 'Syria' 'Taiwan'
 'Tokelau' 'Vatican' 'Wales' 'Wallis and Futuna' 'Western Sahara']


In [83]:
# same for median_age: the locations with missing values are mostly small states and island territories, so we drop them
print(df_cleaned[df_cleaned['median_age'].isna()]['location'].unique())
df_cleaned.dropna(subset=['median_age'],inplace=True)

['American Samoa' 'Andorra' 'Bermuda' 'British Virgin Islands'
 'Cayman Islands' 'Dominica' 'Faeroe Islands' 'Gibraltar' 'Greenland'
 'Isle of Man' 'Kosovo' 'Liechtenstein' 'Marshall Islands' 'Monaco'
 'Nauru' 'Northern Mariana Islands' 'Palau' 'Saint Kitts and Nevis'
 'Saint Martin (French part)' 'San Marino' 'Sint Maarten (Dutch part)'
 'Turks and Caicos Islands' 'Tuvalu']


In [84]:
# locations with missing gdp_per_capita are territories or countries with limited information, like North Korea, Cuba or Somalia, so we drop them
print(df_cleaned[df_cleaned['gdp_per_capita'].isna()]['location'].unique())
df_cleaned.dropna(subset=['gdp_per_capita'],inplace=True)

['Cuba' 'Curacao' 'French Polynesia' 'Guam' 'New Caledonia' 'North Korea'
 'Somalia' 'United States Virgin Islands']


In [85]:
# Aruba, Macao and Puerto Rico have no human development index, so we drop them
print(df_cleaned[df_cleaned['human_development_index'].isna()]['location'].unique())
df_cleaned.dropna(subset=['human_development_index'],inplace=True)

['Aruba' 'Macao' 'Puerto Rico']


In [86]:
# missing values for aged_70_older are assumed to be 0
df_cleaned['aged_70_older'] = df_cleaned['aged_70_older'].fillna(0)
# rows with missing cardiovasc_death_rate are dropped
df_cleaned = df_cleaned.dropna(subset=['cardiovasc_death_rate'])
# smokers are filled with the median value (the affected locations are mainly small low-income countries, so maybe they should be dropped instead)
# Calculate the median of 'female_smokers' and 'male_smokers'
female_smokers_median = df_cleaned['female_smokers'].median()
male_smokers_median = df_cleaned['male_smokers'].median()

# Fill the missing values with the calculated median
# (assignment instead of chained inplace fillna, which is deprecated in recent pandas versions)
df_cleaned['female_smokers'] = df_cleaned['female_smokers'].fillna(female_smokers_median)
df_cleaned['male_smokers'] = df_cleaned['male_smokers'].fillna(male_smokers_median)

# missing hospital beds also come from small or poor countries, so we assume 0
df_cleaned['hospital_beds_per_thousand'] = df_cleaned['hospital_beds_per_thousand'].fillna(0)
# rows without human_development_index were already dropped in the previous cell; kept as a safeguard
df_cleaned = df_cleaned.dropna(subset=['human_development_index'])

#df_cleaned.to_csv('resources/covid_data_cleaned.csv')
df_cleaned

Unnamed: 0,continent,location,date,new_cases_per_million,new_deaths_per_million,reproduction_rate,icu_patients_per_million,hosp_patients_per_million,new_tests_per_thousand,people_vaccinated_per_hundred,...,aged_65_older,aged_70_older,gdp_per_capita,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
0,Asia,Afghanistan,2020-01-05,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,6.2,31.4,0.5,64.83,0.511
1,Asia,Afghanistan,2020-01-06,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,6.2,31.4,0.5,64.83,0.511
2,Asia,Afghanistan,2020-01-07,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,6.2,31.4,0.5,64.83,0.511
3,Asia,Afghanistan,2020-01-08,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,6.2,31.4,0.5,64.83,0.511
4,Asia,Afghanistan,2020-01-09,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,2.581,1.337,1803.987,597.029,9.59,6.2,31.4,0.5,64.83,0.511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397672,Africa,Zimbabwe,2024-04-24,0.0,0.0,0.0,0.0,0.0,0.0,39.45,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397673,Africa,Zimbabwe,2024-04-25,0.0,0.0,0.0,0.0,0.0,0.0,39.45,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397674,Africa,Zimbabwe,2024-04-26,0.0,0.0,0.0,0.0,0.0,0.0,39.45,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571
397675,Africa,Zimbabwe,2024-04-27,0.0,0.0,0.0,0.0,0.0,0.0,39.45,...,2.822,1.882,1899.775,307.846,1.82,1.6,30.7,1.7,61.49,0.571


In [87]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 282307 entries, 0 to 397676
Data columns (total 25 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   continent                            282307 non-null  object 
 1   location                             282307 non-null  object 
 2   date                                 282307 non-null  object 
 3   new_cases_per_million                282307 non-null  float64
 4   new_deaths_per_million               282307 non-null  float64
 5   reproduction_rate                    282307 non-null  float64
 6   icu_patients_per_million             282307 non-null  float64
 7   hosp_patients_per_million            282307 non-null  float64
 8   new_tests_per_thousand               282307 non-null  float64
 9   people_vaccinated_per_hundred        282307 non-null  float64
 10  people_fully_vaccinated_per_hundred  282307 non-null  float64
 11  total_boosters_per

convert date to datetime

In [88]:
df_cleaned['date'] = pd.to_datetime(df_cleaned['date'])

In [89]:
df_cleaned.to_csv('resources/covid_data_cleaned.csv', index=False)

Confounder selection:
Based on the pretreatment criterion, the following variables should be controlled for:
- population_density
- median_age
- gdp_per_capita
- life_expectancy
- human_development_index
- cardiovasc_death_rate
- diabetes_prevalence
- aged_65_older
- aged_70_older
- female_smokers
- male_smokers

Based on the disjunctive cause criterion, the following variable might also be controlled for:
- new_deaths_per_million

Proxies for unmeasured confounders:
- possibly the season, as a proxy for unmeasured confounders such as weather or holidays
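
The season proxy mentioned in the list above could be derived from the `date` column. The sketch below uses a hypothetical helper `month_to_season` and assumes Northern Hemisphere meteorological seasons for every row; for Southern Hemisphere locations the labels would have to be flipped, which this sketch does not do.

```python
import pandas as pd

def month_to_season(month: int) -> str:
    """Hypothetical helper: Northern Hemisphere meteorological season for a month number."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "autumn"

# Toy dates, one per season; not taken from the dataset.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-15", "2021-04-01", "2021-07-20", "2021-10-05"])
})
toy["season"] = toy["date"].dt.month.map(month_to_season)

# One-hot encode so the season can enter a model as a set of control variables.
season_dummies = pd.get_dummies(toy["season"], prefix="season")
print(toy)
```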

Using DoubleML to estimate the causal effect of the vaccination rate and stringency on the number of new cases per million

TODO for each continent 

This is an attempt at implementing double machine learning manually

The estimate of theta is 1.578 for the custom implementation and 1.618 for the doubleml implementation, so the results are quite similar, especially when looking at the 2.5% and 97.5% quantiles of 1.517 and 1.720, which bracket both values.
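
For orientation, one common way to do double machine learning by hand is the cross-fitted partialling-out estimator for a partially linear model: predict the outcome and the treatment from the controls on held-out folds, then regress the outcome residuals on the treatment residuals. The sketch below runs on synthetic data with a known effect, not on the Covid data, and uses plain least squares as a stand-in for the ML nuisance learners so that it only needs NumPy. It is not necessarily what the custom implementation referred to above does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known effect (assumption for illustration only).
n, p, true_theta = 2000, 5, 1.5
X = rng.normal(size=(n, p))
D = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(size=n)
Y = true_theta * D + X @ np.array([1.0, 0.5, -0.5, 0.2, 0.0]) + rng.normal(size=n)

def ols_fit_predict(X_train, y_train, X_test):
    """Least-squares nuisance learner with intercept (stand-in for an ML model)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

# Cross-fitting: predictions for each fold come from models fit on the other folds.
n_folds = 5
folds = np.array_split(rng.permutation(n), n_folds)
y_res = np.empty(n)
d_res = np.empty(n)
for test_idx in folds:
    train_mask = np.ones(n, dtype=bool)
    train_mask[test_idx] = False
    y_res[test_idx] = Y[test_idx] - ols_fit_predict(X[train_mask], Y[train_mask], X[test_idx])
    d_res[test_idx] = D[test_idx] - ols_fit_predict(X[train_mask], D[train_mask], X[test_idx])

# Partialling-out estimate: regress outcome residuals on treatment residuals.
theta_hat = np.sum(d_res * y_res) / np.sum(d_res * d_res)

# Standard error from the score function and a normal-approximation 95% interval.
psi = (y_res - theta_hat * d_res) * d_res
se = np.sqrt(np.mean(psi ** 2) / np.mean(d_res ** 2) ** 2 / n)
ci_low, ci_high = theta_hat - 1.96 * se, theta_hat + 1.96 * se
print(theta_hat, ci_low, ci_high)
```

With real data, `Y` would be the outcome, `D` the treatment and `X` the selected confounders, and the least-squares helper would be replaced by whatever learner is used for the nuisance functions.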

Trying out the econml package

Analysis with time shift
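
A time shift can be built by lagging a variable within each location before modelling. The sketch below is only an illustration on invented data: the lag of 2 days, the frame `toy` and the column name `stringency_index_lag` are assumptions, and it presumes one row per location and day without gaps.

```python
import pandas as pd

# Toy panel with two locations and four consecutive days each.
toy = pd.DataFrame({
    "location": ["A"] * 4 + ["B"] * 4,
    "date": list(pd.date_range("2021-01-01", periods=4)) * 2,
    "stringency_index": [10.0, 20.0, 30.0, 40.0, 5.0, 5.0, 15.0, 25.0],
})
toy = toy.sort_values(["location", "date"])

# Hypothetical lag of 2 days; shifting within each location keeps values
# from leaking from one location into the next.
lag_days = 2
toy["stringency_index_lag"] = toy.groupby("location")["stringency_index"].shift(lag_days)
print(toy)
```

The first `lag_days` rows of every location have no lagged value and would need to be dropped or filled before fitting a model.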