In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler

### Q3 What are some factors that contribute most to a country having more COVID cases?

In [6]:
df = pd.read_csv('../data/CovidData.csv')
df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,reproduction_rate,icu_patients,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
0,AFG,Asia,Afghanistan,2/24/20,1.0,1.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511
1,AFG,Asia,Afghanistan,2/25/20,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511
2,AFG,Asia,Afghanistan,2/26/20,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511
3,AFG,Asia,Afghanistan,2/27/20,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511
4,AFG,Asia,Afghanistan,2/28/20,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511


In [3]:
df.drop(df[df.continent.isnull()].index, inplace=True) # drop rows with no continent value (the continent-level aggregate rows)
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year

In [5]:
m = smf.ols("total_cases ~ total_tests + median_age + population_density + population + cardiovasc_death_rate + handwashing_facilities + life_expectancy + human_development_index"
            ,data=df).fit()
m.summary()

Dep. Variable:,total_cases,R-squared:,0.949
Model:,OLS,Adj. R-squared:,0.949
Method:,Least Squares,F-statistic:,37550.0
Date:,"Sat, 08 Jul 2023",Prob (F-statistic):,0.0
Time:,14:37:15,Log-Likelihood:,-226710.0
No. Observations:,16204,AIC:,453400.0
Df Residuals:,16195,BIC:,453500.0
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-4.371e+05,4.66e+04,-9.389,0.000,-5.28e+05,-3.46e+05
total_tests,0.0539,0.000,354.734,0.000,0.054,0.054
median_age,-8662.2677,641.122,-13.511,0.000,-9918.938,-7405.597
population_density,-60.9175,8.579,-7.101,0.000,-77.734,-44.101
population,0.0006,1.54e-05,39.603,0.000,0.001,0.001
cardiovasc_death_rate,-459.0328,27.671,-16.589,0.000,-513.272,-404.794
handwashing_facilities,-1492.7441,141.907,-10.519,0.000,-1770.897,-1214.591
life_expectancy,794.8825,831.194,0.956,0.339,-834.349,2424.114
human_development_index,1.338e+06,5.03e+04,26.619,0.000,1.24e+06,1.44e+06

Omnibus:,9286.013,Durbin-Watson:,0.022
Prob(Omnibus):,0.0,Jarque-Bera (JB):,187838.981
Skew:,2.347,Prob(JB):,0.0
Kurtosis:,19.005,Cond. No.,5150000000.0


#### Interpreting Results
We built a multiple regression model with total cases as our dependent variable and total tests, median age, population density, population, cardiovascular death rate, handwashing facilities, life expectancy and human development index (HDI) as our independent variables.

Our initial multiple regression model looks promising. The R^2 value of 0.949 is close to 1, which tells us that the variables we chose explain a large proportion of the variance in total cases, and the F-statistic is large (with Prob (F-statistic) reported as 0.0), which indicates that the model as a whole is statistically significant. More cases are statistically associated with a larger population and with more testing, as common sense would suggest. Locations with more cases also tend to have fewer handwashing facilities. The health and development indicators are more mixed: HDI has a large positive coefficient, so countries with a higher HDI report more cases, but the median age coefficient is negative and life expectancy is not statistically significant (p = 0.339). The positive HDI association can most probably be explained by underreporting or undertesting in less developed countries, either voluntarily or because of the global shortage of COVID-19 tests.
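Because the raw coefficients are on very different scales (total_tests is a count of tests, while human_development_index lies between 0 and 1), their sizes cannot be compared directly when asking which factor contributes more. One possible way to make them comparable is to z-score each predictor before fitting. The sketch below is only an illustration under that assumption: the helper name `standardized_ols` is hypothetical and not part of the original notebook, and it standardizes with pandas rather than the `StandardScaler` imported above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def standardized_ols(data, response, predictors):
    """Fit OLS on z-scored predictors so coefficient sizes are comparable.

    Hypothetical helper for illustration; not part of the original analysis.
    """
    # Keep only the model columns and drop incomplete rows, mirroring the
    # listwise deletion that the formula API applies to missing values.
    cols = [response] + list(predictors)
    clean = data[cols].dropna().copy()

    # z-score each predictor: subtract its mean, divide by its standard deviation.
    for name in predictors:
        clean[name] = (clean[name] - clean[name].mean()) / clean[name].std()

    formula = response + " ~ " + " + ".join(predictors)
    return smf.ols(formula, data=clean).fit()
```

With the notebook's data this could be called as `standardized_ols(df, "total_cases", ["total_tests", "median_age", "population"])`, after which `fit.params.drop("Intercept").abs().sort_values(ascending=False)` ranks the predictors by the change in total cases per one standard deviation of each predictor.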

In [None]:
m = smf.ols("total_deaths ~ total_tests + median_age + population_density + population + cardiovasc_death_rate + handwashing_facilities + life_expectancy + human_development_index"
            ,data=df).fit()
m.summary()

Dep. Variable:,total_deaths,R-squared:,0.528
Model:,OLS,Adj. R-squared:,0.528
Method:,Least Squares,F-statistic:,2136.0
Date:,"Thu, 06 Jul 2023",Prob (F-statistic):,0.0
Time:,17:55:29,Log-Likelihood:,-170980.0
No. Observations:,15254,AIC:,342000.0
Df Residuals:,15245,BIC:,342000.0
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.395e+04,2927.500,-4.764,0.000,-1.97e+04,-8207.761
total_tests,0.0007,9.42e-06,69.262,0.000,0.001,0.001
median_age,-732.6809,40.183,-18.234,0.000,-811.444,-653.918
population_density,-6.4975,0.548,-11.854,0.000,-7.572,-5.423
population,2.435e-05,9.56e-07,25.483,0.000,2.25e-05,2.62e-05
cardiovasc_death_rate,-38.9279,1.757,-22.160,0.000,-42.371,-35.485
handwashing_facilities,0.5174,9.042,0.057,0.954,-17.207,18.242
life_expectancy,-54.5708,52.475,-1.040,0.298,-157.427,48.286
human_development_index,7.824e+04,3233.935,24.194,0.000,7.19e+04,8.46e+04

Omnibus:,18270.407,Durbin-Watson:,0.011
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2063851.992
Skew:,6.514,Prob(JB):,0.0
Kurtosis:,58.475,Cond. No.,5340000000.0


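Examining total deaths as our dependent variable with the same independent variables gives broadly the same picture as the total cases model: the significant coefficients keep the same signs. The fit is weaker, however (R^2 = 0.528 compared with 0.949), and handwashing facilities is no longer statistically significant (p = 0.954).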
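Comparing the two summaries by eye is error-prone, so it may help to put the p-values of both models side by side. The following is a minimal sketch of that idea; the helper name `compare_responses` is hypothetical and was not used in the original notebook.

```python
import pandas as pd
import statsmodels.formula.api as smf


def compare_responses(data, responses, predictors):
    """Fit one OLS model per response and tabulate the results side by side.

    Hypothetical helper for illustration; not part of the original analysis.
    """
    rhs = " + ".join(predictors)
    pvalues = {}
    r_squared = {}
    for response in responses:
        fit = smf.ols(f"{response} ~ {rhs}", data=data).fit()
        pvalues[response] = fit.pvalues      # one p-value per model term
        r_squared[response] = fit.rsquared   # overall fit of this model

    # Rows are model terms, columns are the dependent variables.
    table = pd.DataFrame(pvalues)
    return table, pd.Series(r_squared, name="r_squared")
```

For the two models above, `compare_responses(df, ["total_cases", "total_deaths"], predictors)` with `predictors` set to the eight independent variables would return one p-value table with a column per dependent variable, plus the R^2 of each model.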