**Comparing actual Covid-19 cases (per million) in Pakistan to those predicted by Machine Learning**

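In this notebook, I use country-level Covid-19 data from Our World in Data (https://ourworldindata.org/), which reports multiple variables for each country, to compare Pakistan's actual cases per million with the value predicted by a model trained on the other countries.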

In [51]:
#Importing from the required libraries
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline

In [52]:
#The dataset has been downloaded from Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data)
#The codebook is available at https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv
df=pd.read_csv("test.csv")


In [53]:
df.head()


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,...,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
0,AFG,Asia,Afghanistan,2019-12-31,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83
1,AFG,Asia,Afghanistan,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83
2,AFG,Asia,Afghanistan,2020-01-02,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83
3,AFG,Asia,Afghanistan,2020-01-03,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83
4,AFG,Asia,Afghanistan,2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83


In [54]:
#For simplicity, every NaN is replaced with zero
df1=df.fillna(0)
df1.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,...,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
0,AFG,Asia,Afghanistan,2019-12-31,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
1,AFG,Asia,Afghanistan,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
2,AFG,Asia,Afghanistan,2020-01-02,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
3,AFG,Asia,Afghanistan,2020-01-03,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
4,AFG,Asia,Afghanistan,2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
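
A small sketch, on made-up numbers rather than the OWID data, of what zero-filling does to any average taken later: a missing value that becomes 0 pulls the mean down, whereas pandas skips NaN by default. The values below are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative numbers, not OWID data)
toy = pd.DataFrame({"female_smokers": [7.1, np.nan, 5.3]})

# pandas ignores NaN when averaging
mean_skipping_nan = toy["female_smokers"].mean()
# After fillna(0) the missing entry is counted as a real zero
mean_after_fill = toy.fillna(0)["female_smokers"].mean()

print(mean_skipping_nan, mean_after_fill)
```

This matters most for indicators such as `extreme_poverty` or `female_smokers`, which are missing for some countries in the table above.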


In [55]:
#Drop the columns that are not needed for the analysis
df1.drop(columns=['continent', 'location'], inplace=True)

In [56]:
df1.head()

Unnamed: 0,iso_code,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,...,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
0,AFG,2019-12-31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
1,AFG,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
2,AFG,2020-01-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
3,AFG,2020-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
4,AFG,2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83


In [57]:
#NaNs were already filled above, so this call changes nothing; it only displays the full dataframe
df1.fillna(0)

Unnamed: 0,iso_code,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,...,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
0,AFG,2019-12-31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
1,AFG,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
2,AFG,2020-01-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
3,AFG,2020-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
4,AFG,2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33622,0,2020-02-28,705.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.000,0.000,0.0,0.000,0.00,0.0,0.0,0.000,0.0,0.00
33623,0,2020-02-29,705.0,0.0,6.0,2.0,0.0,0.0,0.0,0.0,...,0.000,0.000,0.0,0.000,0.00,0.0,0.0,0.000,0.0,0.00
33624,0,2020-03-01,705.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,...,0.000,0.000,0.0,0.000,0.00,0.0,0.0,0.000,0.0,0.00
33625,0,2020-03-02,705.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,...,0.000,0.000,0.0,0.000,0.00,0.0,0.0,0.000,0.0,0.00


In [58]:
#The cleaned dataframe is grouped by country, taking the mean of each numeric variable.
#numeric_only=True skips the text 'date' column, which newer pandas versions refuse to average.
df2=df1.groupby('iso_code').mean(numeric_only=True)

In [59]:
df2.head()

iso_code,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,new_tests,total_tests,...,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
0,183.390625,10.875,0.796875,0.109375,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ABW,94.046154,0.915385,2.261538,0.023077,880.862408,8.573746,21.182369,0.216138,0.0,0.0,...,7.452,35973.781,0.0,0.0,11.62,0.0,0.0,0.0,0.0,76.29
AFG,9903.497537,179.660099,255.098522,6.261084,254.40331,4.615167,6.55303,0.160837,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
AGO,185.396947,7.633588,8.770992,0.358779,5.640931,0.23226,0.266947,0.010863,0.0,0.0,...,1.362,5819.495,0.0,276.045,3.94,0.0,0.0,26.664,0.0,61.15
AIA,2.944444,0.02381,0.0,0.0,196.269833,1.587095,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,81.88
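
The grouping step can be illustrated on a toy table. This is a minimal sketch with made-up country codes and values, not the OWID data; it only shows how daily rows collapse into one row of means per country.

```python
import pandas as pd

# Toy daily records for two hypothetical country codes (made-up values)
toy = pd.DataFrame({
    "iso_code": ["AAA", "AAA", "BBB", "BBB"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"],
    "total_cases_per_million": [10.0, 30.0, 0.0, 4.0],
})

# One row per country, holding the mean of every numeric column;
# numeric_only=True skips the text 'date' column
toy_means = toy.groupby("iso_code").mean(numeric_only=True)
print(toy_means)
```

Because `total_cases_per_million` is cumulative, its mean over time depends on how many days each country has been reporting, which is worth keeping in mind when comparing countries.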


In [60]:
#Rows with a missing iso_code were filled with 0 in In [54], so they form a group labelled 0; drop it
df2.drop(index=0, inplace=True)


In [61]:
#This shows the cleaned dataframe
df2.head()

iso_code,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,new_tests,total_tests,...,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
ABW,94.046154,0.915385,2.261538,0.023077,880.862408,8.573746,21.182369,0.216138,0.0,0.0,...,7.452,35973.781,0.0,0.0,11.62,0.0,0.0,0.0,0.0,76.29
AFG,9903.497537,179.660099,255.098522,6.261084,254.40331,4.615167,6.55303,0.160837,0.0,0.0,...,1.337,1803.987,0.0,597.029,9.59,0.0,0.0,37.746,0.5,64.83
AGO,185.396947,7.633588,8.770992,0.358779,5.640931,0.23226,0.266947,0.010863,0.0,0.0,...,1.362,5819.495,0.0,276.045,3.94,0.0,0.0,26.664,0.0,61.15
AIA,2.944444,0.02381,0.0,0.0,196.269833,1.587095,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,81.88
ALB,1435.944444,35.451389,41.736111,1.041667,498.973,12.318875,14.502743,0.36184,0.0,0.0,...,8.643,11803.431,1.1,304.195,10.08,7.1,51.2,0.0,2.89,78.57


In [62]:
#df3 will be the training set used to train the model
#A copy is taken so that later in-place changes to df3 do not also modify df2
df3=df2.copy()

In [63]:
df3.reset_index(inplace=True)
df3.head(10)
df3.shape

(211, 30)

In [64]:
#Isolate Pakistan's row of mean values; this is the test case
df5=df3[df3['iso_code']=='PAK'].copy()

In [65]:
#Pakistan's row must be removed from the training set
#It is selected by its index label in df5 rather than by a hard-coded position
df3.drop(index=df5.index, inplace=True)

The model is trained to predict 'total_cases_per_million' from the following 20 variables: 'new_cases_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed_per_thousand', 'stringency_index', 'population_density', 'median_age', 'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers', 'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand', and 'life_expectancy'.
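
The split between the training countries and the held-out country can be sketched as follows. This is a minimal illustration on a made-up table with only two of the features; the `features` and `target` names are introduced here for the sketch and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical per-country means (made-up values, two features only)
toy = pd.DataFrame({
    "iso_code": ["AAA", "BBB", "PAK", "CCC"],
    "median_age": [30.0, 41.0, 23.5, 35.0],
    "gdp_per_capita": [9000.0, 30000.0, 5000.0, 15000.0],
    "total_cases_per_million": [120.0, 800.0, 290.0, 400.0],
})

features = ["median_age", "gdp_per_capita"]  # listed once, reused below
target = "total_cases_per_million"

# Select the held-out country by its code, not by row position
is_held_out = toy["iso_code"] == "PAK"
train = toy[~is_held_out]
test = toy[is_held_out]

x_train, y_train = train[features].to_numpy(), train[target].to_numpy()
x_test, y_test = test[features].to_numpy(), test[target].to_numpy()
print(x_train.shape, x_test.shape)
```

Keeping the feature names in a single list avoids repeating the long column list for the training and test arrays.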

In [67]:
#Linear Regression is used as the Machine Learning algorithm
#The data from all the countries, except for Pakistan, is used to train the model.

regr = linear_model.LinearRegression()
x_train1 = np.asanyarray(df3[['new_cases_per_million','total_deaths_per_million', 'new_deaths_per_million', 'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed_per_thousand','stringency_index', 'population_density', 'median_age','aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers','male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
       'life_expectancy']])
y_train1 = np.asanyarray(df3[['total_cases_per_million']])
regr.fit(x_train1, y_train1)
# The coefficients
print ('Coefficients: ', regr.coef_)
regr.intercept_, regr.coef_

Coefficients:  [[ 6.65134966e+01  1.57977684e+01 -1.32467153e+03  1.55410717e+01
  -5.07522743e+02 -3.11620779e+02 -1.82580456e+00 -5.05604427e-03
  -2.36989014e+01  4.23888119e+01 -1.11210448e+01 -3.52787243e-03
  -1.81079874e+00 -2.72888726e-01 -1.67786982e+01  7.75267918e-01
   6.31229019e-01 -9.99952875e-01 -3.68338447e+00  4.61955341e+00]]


(array([373.11437332]),
 array([[ 6.65134966e+01,  1.57977684e+01, -1.32467153e+03,
          1.55410717e+01, -5.07522743e+02, -3.11620779e+02,
         -1.82580456e+00, -5.05604427e-03, -2.36989014e+01,
          4.23888119e+01, -1.11210448e+01, -3.52787243e-03,
         -1.81079874e+00, -2.72888726e-01, -1.67786982e+01,
          7.75267918e-01,  6.31229019e-01, -9.99952875e-01,
         -3.68338447e+00,  4.61955341e+00]]))
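
To make the printed intercept and coefficients concrete: a fitted linear regression predicts the intercept plus the sum of each coefficient times its feature value. The sketch below checks this on made-up data with two features; it is an illustration of the formula, not a re-run of the model above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 5 samples, 2 features, target is exactly 1 + 2*x0 - 3*x1
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0], [4.0, 3.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)

x_new = np.array([[5.0, 2.0]])
# Prediction by hand: intercept + sum(coefficient * feature value)
by_hand = model.intercept_ + x_new @ model.coef_
by_predict = model.predict(x_new)
print(by_hand, by_predict)
```

Since the features are on very different scales (for example `gdp_per_capita` versus `new_deaths_per_million`), the raw coefficient sizes above do not by themselves indicate which variable matters most.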

In [73]:
#The test data is Pakistan's mean value for each variable.

x_test1 = np.asanyarray(df5[['new_cases_per_million','total_deaths_per_million', 'new_deaths_per_million', 'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed_per_thousand','stringency_index', 'population_density', 'median_age','aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers','male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
       'life_expectancy']])
y_test1 = np.asanyarray(df5[['total_cases_per_million']])
#Predict from the same array format the model was trained on
y_hat1 = regr.predict(x_test1)
#With a single test row, this mean squared error is simply the squared error
print("Mean squared error: %.2f"
      % np.mean((y_hat1 - y_test1) ** 2))
#R^2 is not defined for a single test sample, so this prints nan
print("Variance score: %.2f" % regr.score(x_test1, y_test1))



Mean squared error: 9452.88
Variance score: nan
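
Why the variance score is `nan`: R² compares the residual sum of squares with the spread of the actual values around their mean, and a single test sample has no spread. A short sketch using the actual and predicted values reported for Pakistan in this notebook:

```python
import numpy as np

# Actual and predicted values reported for Pakistan in this notebook
y_true = np.array([291.25709615])
y_pred = np.array([194.03119047])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

# R^2 = 1 - ss_res / ss_tot; with one sample ss_tot is 0, so R^2 is undefined
print(round(float(ss_res), 2), ss_tot)
```

A meaningful score would need several held-out countries, for example by repeating the hold-out for each country in turn.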




In [74]:
#Comparing the actual mean total cases per million to the predicted total cases per million
print("Actual mean total cases per million",y_test1)
print("Predicted total cases per million",y_hat1)

Actual mean total cases per million [[291.25709615]]
Predicted total cases per million [[194.03119047]]


A very rudimentary ML analysis suggests that, given the indicators for each of the 200+ countries, Pakistan's actual mean total cases per million (about 291) were roughly 50% higher than the model's prediction (about 194).
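
The percentage can be checked directly from the two printed values. Note that the figure depends on the baseline: the gap is measured here relative to the prediction, and separately relative to the actual value.

```python
actual = 291.25709615     # actual mean total cases per million (printed above)
predicted = 194.03119047  # model prediction (printed above)

gap = actual - predicted
pct_above_prediction = 100 * gap / predicted  # relative to the prediction
pct_below_actual = 100 * gap / actual         # relative to the actual value

print(f"actual is {pct_above_prediction:.1f}% above the prediction")
print(f"prediction is {pct_below_actual:.1f}% below the actual value")
```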

There are a few limitations to this approach. First, and most obviously, I have used a very simple linear regression model. A model that includes polynomial terms of the variables may be more accurate, particularly where the dependent variable (here, total cases per million) appears to follow a more quadratic trajectory. Second, the training set had only about 200 data points, one per country, because each country's time series was collapsed to its mean. A more advanced analysis could use the day-by-day variation in cases to predict how a country's (here, Pakistan's) Covid-19 cases should vary over time.
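
As a rough sketch of the polynomial idea, the snippet below fits a degree-2 model with scikit-learn's `PolynomialFeatures` on made-up, noise-free quadratic data and compares it with a straight-line fit. The data and the choice of degree are assumptions for illustration, not results from the OWID dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up, noise-free quadratic data: y = 2 + 3x + 0.5x^2
x = np.arange(0.0, 10.0).reshape(-1, 1)
y = 2.0 + 3.0 * x[:, 0] + 0.5 * x[:, 0] ** 2

# Expand x into [x, x^2], then fit an ordinary linear regression on top
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x, y)

# Straight-line fit on the same data, for comparison
straight_line = LinearRegression().fit(x, y)

x_new = np.array([[12.0]])
print(poly_model.predict(x_new), straight_line.predict(x_new))
```

With the 20 variables used here, a full degree-2 expansion would produce 230 terms, more than the roughly 210 training countries, so some feature selection or regularization would likely be needed before applying this to the real data.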