topic: do education and healthcare spending affect quality of life (measured here by life expectancy)?

### dataset links

life expectancy dataset: https://data.worldbank.org/indicator/SP.DYN.LE00.IN

government expenditure on education (% of GDP): https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS

Government health spending as a share of GDP: https://ourworldindata.org/grapher/public-health-expenditure-share-gdp?country=SWE~FRA~DEU~JPN~GBR~ESP~AUS~NZL~CAN~USA~AUT~FIN~SVN~NOR~NLD~BEL~CHE~DNK~ISL~CZE~SVK~PRT~ARG~ITA~POL~CYP~COL~IND~IDN~CHN~MEX~TUR~THA~PER~ZAF~UKR~BRA~LVA~ROU~CRI~HUN~BGR~GRC~KOR~LUX~LTU~EST~HRV~MLT~CHL~IRL~ISR

### code

#### datasets

In [1]:
import pandas as pd

In [2]:
eduDf = pd.read_csv("API_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_11528.csv")
lifeDf = pd.read_csv("API_SP.DYN.LE00.IN_DS2_en_csv_v2_11567.csv")
healthDf = pd.read_csv("public-health-expenditure-share-gdp.csv")
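
A note on loading: an unmodified World Bank CSV download usually begins with a few preamble lines (data source, last-updated date) before the real header, and each row ends with a trailing comma that pandas turns into an `Unnamed: N` column. The files read above appear to load cleanly, so they may already have been trimmed. If they had not been, a minimal sketch of the usual workaround could look like this; the `raw` text is a made-up miniature of that layout, not the real file.

```python
import io
import pandas as pd

# Made-up miniature of the raw World Bank layout: preamble lines, then the header.
raw = (
    '"Data Source","World Development Indicators",\n'
    '\n'
    '"Last Updated Date","2025-01-01",\n'
    '\n'
    '"Country Name","Country Code","2019",\n'
    '"Aruba","ABW","4.435037",\n'
    '"Angola","AGO","2.073064",\n'
)

# Skip the four preamble lines so the fifth line becomes the header.
sample = pd.read_csv(io.StringIO(raw), skiprows=4)

# Drop the empty column created by the trailing comma on every row.
sample = sample.loc[:, ~sample.columns.str.startswith("Unnamed")]
print(sample)
```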

In [3]:
eduDf.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,ABW,"Government expenditure on education, total (% ...",SE.XPD.TOTL.GD.ZS,,,,,,,...,5.49136,4.45582,4.548764,4.435037,,3.618558,,,,
1,Africa Eastern and Southern,AFE,"Government expenditure on education, total (% ...",SE.XPD.TOTL.GD.ZS,,,,,,,...,4.692,4.43051,4.73975,4.51141,4.090565,4.368379,3.697668,3.962293,,
2,Afghanistan,AFG,"Government expenditure on education, total (% ...",SE.XPD.TOTL.GD.ZS,,,,,,,...,4.54397,4.34319,,,,,,,,
3,Africa Western and Central,AFW,"Government expenditure on education, total (% ...",SE.XPD.TOTL.GD.ZS,,,,,,,...,2.615035,3.29663,3.051252,3.047399,3.398741,3.096926,2.891687,3.21562,,
4,Angola,AGO,"Government expenditure on education, total (% ...",SE.XPD.TOTL.GD.ZS,,,,,,,...,2.754937,2.466879,2.183513,2.073064,2.667447,2.297197,2.385359,2.512737,,


In [4]:
lifeDf.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,64.049,64.215,64.602,64.944,65.303,65.615,...,75.54,75.62,75.88,76.019,75.406,73.655,76.226,76.353,,
1,Africa Eastern and Southern,AFE,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,44.169658,44.468838,44.87789,45.160583,45.535695,45.770723,...,62.167981,62.591275,63.330691,63.857261,63.766484,62.979999,64.48702,65.146154,,
2,Afghanistan,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,32.799,33.291,33.757,34.201,34.673,35.124,...,62.646,62.406,62.443,62.941,61.454,60.417,65.617,66.035,,
3,Africa Western and Central,AFW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,37.779636,38.058956,38.681792,38.936918,39.19458,39.479784,...,56.392452,56.626439,57.036976,57.149847,57.364425,57.362572,57.987813,58.855722,,
4,Angola,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,37.933,36.902,37.168,37.419,37.704,37.968,...,61.619,62.122,62.622,63.051,63.116,62.958,64.246,64.617,,


In [5]:
healthDf.head()

Unnamed: 0,Entity,Code,Year,Public health expenditure as a share of GDP
0,Argentina,ARG,1880,0.0
1,Argentina,ARG,1890,0.0
2,Argentina,ARG,1900,0.0
3,Argentina,ARG,1910,0.0
4,Argentina,ARG,1920,0.0


In [6]:
eduDf_2019 = eduDf[["Country Name", "2019"]].copy()
eduDf_2019 = eduDf_2019.rename(columns={
    "Country Name": "Country",
    "2019": "EducationSpending"
})
eduDf_2019.head()

Unnamed: 0,Country,EducationSpending
0,Aruba,4.435037
1,Africa Eastern and Southern,4.51141
2,Afghanistan,
3,Africa Western and Central,3.047399
4,Angola,2.073064


In [7]:
lifeDf_2019 = lifeDf[["Country Name", "2019"]].copy()
lifeDf_2019 = lifeDf_2019.rename(columns={
    "Country Name": "Country",
    "2019": "LifeExpectancy"
})
lifeDf_2019.head()

Unnamed: 0,Country,LifeExpectancy
0,Aruba,76.019
1,Africa Eastern and Southern,63.857261
2,Afghanistan,62.941
3,Africa Western and Central,57.149847
4,Angola,63.051


In [8]:
healthDf_2019 = healthDf[healthDf["Year"] == 2019][
    ["Entity", "Public health expenditure as a share of GDP"]
].copy()

healthDf_2019 = healthDf_2019.rename(columns={
    "Entity": "Country",
    "Public health expenditure as a share of GDP": "HealthSpending"
})
healthDf_2019.head()

Unnamed: 0,Country,HealthSpending
25,Argentina,6.245
95,Australia,7.349
166,Austria,7.906
205,Belgium,8.076
236,Brazil,3.93


In [9]:
# Inner-join the three 2019 tables on country name; drop rows with missing values.
df = (
    eduDf_2019.merge(lifeDf_2019, on="Country")
    .merge(healthDf_2019, on="Country")
    .dropna()
)

In [10]:
df.head()

Unnamed: 0,Country,EducationSpending,LifeExpectancy,HealthSpending
0,Argentina,4.77165,76.847,6.245
2,Austria,5.23678,81.895122,7.906
3,Belgium,6.32381,81.995122,8.076
4,Bulgaria,4.21429,75.112195,4.313
5,Brazil,5.96347,75.809,3.93
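
The merge in cell 9 joins on country names, which come from two different sources (World Bank and Our World in Data) and may not be spelled the same way in both. An inner join silently drops any name that does not match. One possible alternative, sketched below, is to join on the ISO codes both files carry (`Country Code` and `Code`) and to use `indicator=True` to list the rows that failed to match. The toy frames and their values are illustrative only and are not taken from the datasets.

```python
import pandas as pd

# Toy stand-ins for the two sources; names and values are made up for illustration.
wb = pd.DataFrame({
    "Country Name": ["France", "Korea, Rep.", "Turkiye"],
    "Country Code": ["FRA", "KOR", "TUR"],
    "LifeExpectancy": [82.0, 83.0, 78.0],
})
owid = pd.DataFrame({
    "Entity": ["France", "South Korea", "Turkey"],
    "Code": ["FRA", "KOR", "TUR"],
    "HealthSpending": [8.0, 5.0, 3.0],
})

# Joining on names keeps only rows whose spelling matches exactly.
by_name = wb.merge(owid, left_on="Country Name", right_on="Entity")

# Joining on ISO codes keeps every country present in both sources.
by_code = wb.merge(owid, left_on="Country Code", right_on="Code")

# An outer join with indicator=True shows which rows failed to match by name.
audit = wb.merge(owid, left_on="Country Name", right_on="Entity",
                 how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(by_name), len(by_code), len(unmatched))
```

Applying this to the real data would mean keeping `Country Code` in cells 6 and 7 and `Code` in cell 8 instead of dropping them.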


#### regression

In [11]:
X = df[["EducationSpending", "HealthSpending"]]
y = df["LifeExpectancy"]

In [12]:
from sklearn.linear_model import LinearRegression

In [13]:
model = LinearRegression()
model.fit(X, y)

In [14]:
coef_df = pd.DataFrame({
    "Variable": X.columns,
    "Coefficient": model.coef_
})

intercept = model.intercept_

coef_df, intercept

(            Variable  Coefficient
 0  EducationSpending     0.140736
 1     HealthSpending     1.006350,
 np.float64(72.75616047331856))

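$$\text{LifeExpectancy}_i = \beta_0 + \beta_1 \times \text{EducationSpending}_i + \beta_2 \times \text{HealthSpending}_i + \varepsilon_i$$

$$\widehat{\text{LifeExpectancy}} = 72.8 + 0.14 \times \text{EducationSpending} + 1.006 \times \text{HealthSpending}$$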

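The intercept (about 72.8 years) is the predicted life expectancy when both spending variables equal zero. Zero spending lies outside the values seen in the tables above, so the intercept is an extrapolation rather than a realistic scenario. The slopes read as follows: holding the other variable fixed, one additional percentage point of GDP spent on health is associated with about 1.0 more year of life expectancy, and one additional percentage point spent on education with about 0.14 more years.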
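
To see what the fitted equation implies for a single country, the coefficients from cell 14 can be applied by hand. This is a minimal sketch using Austria's 2019 values from the merged table above; the function name `predict_life_expectancy` is made up for illustration.

```python
# Coefficients copied from the regression output in cell 14.
INTERCEPT = 72.75616047331856
B_EDUCATION = 0.140736
B_HEALTH = 1.006350


def predict_life_expectancy(education_spending, health_spending):
    """Apply the fitted linear equation (both inputs are % of GDP)."""
    return INTERCEPT + B_EDUCATION * education_spending + B_HEALTH * health_spending


# Austria, 2019 (row from df.head()): education 5.23678, health 7.906
predicted = predict_life_expectancy(5.23678, 7.906)
observed = 81.895122
residual = observed - predicted

print(f"predicted={predicted:.2f}, observed={observed:.2f}, residual={residual:.2f}")
```

For Austria the model predicts roughly 81.4 years, a little under half a year below the observed value.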

In [15]:
r2 = model.score(X, y)
r2

0.37832216777273364

The coefficient of determination R² is about 0.38: the two spending variables account for roughly 38% of the cross-country variation in 2019 life expectancy in this sample, leaving about 62% to factors outside the model. This is an in-sample measure of fit and does not by itself show that spending causes longer lives.
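
For reference, the value returned by `model.score` is the usual R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers (not the real data) to show the two reference points: a perfect fit gives 1, and always predicting the mean gives 0. The helper name `r_squared` is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Made-up life expectancies, for illustration only.
y_true = [70.0, 75.0, 80.0, 85.0]

perfect = r_squared(y_true, y_true)                    # predictions equal the data
baseline = r_squared(y_true, [77.5] * 4)               # always predict the mean
partial = r_squared(y_true, [72.0, 76.0, 79.0, 83.0])  # imperfect predictions

print(perfect, baseline, partial)
```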

#### correlation

Multicollinearity refers to a situation where independent variables in a regression model are highly correlated with each other, which can make coefficient estimates unstable and difficult to interpret.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[["EducationSpending", "HealthSpending"]].dropna()

vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif

Unnamed: 0,Variable,VIF
0,EducationSpending,8.98379
1,HealthSpending,8.98379


In [18]:
X = df[["EducationSpending", "HealthSpending"]]

X.corr()

Unnamed: 0,EducationSpending,HealthSpending
EducationSpending,1.0,0.452715
HealthSpending,0.452715,1.0
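
The VIF of 8.98 and the correlation of 0.45 look inconsistent, and the likely reason is the missing intercept. As far as I know, statsmodels' `variance_inflation_factor` does not add a constant column itself, so passing only the two spending columns regresses each one on the other through the origin. With an intercept, the VIF for two regressors is 1 / (1 - r²), which for r = 0.452715 is about 1.26, well below the usual concern thresholds. The sketch below illustrates the difference in plain Python on made-up numbers; the helper names are hypothetical.

```python
def centered_vif(x1, x2):
    """VIF when the auxiliary regression includes an intercept: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    r2 = sxy ** 2 / (sxx * syy)
    return 1 / (1 - r2)


def uncentered_vif(x1, x2):
    """VIF when x1 is regressed on x2 through the origin (no intercept)."""
    sxy = sum(a * b for a, b in zip(x1, x2))
    r2 = sxy ** 2 / (sum(a * a for a in x1) * sum(b * b for b in x2))
    return 1 / (1 - r2)


# Made-up spending shares (% of GDP), not the real data.
edu = [4.0, 5.0, 6.0, 5.5, 4.5]
health = [6.0, 8.0, 7.0, 9.0, 5.0]

vif_c = centered_vif(edu, health)
vif_u = uncentered_vif(edu, health)

# Centered VIF implied by the correlation reported in the table above.
vif_from_r = 1 / (1 - 0.452715 ** 2)

print(round(vif_c, 4), round(vif_u, 2), round(vif_from_r, 4))
```

In the notebook itself, the equivalent fix would be to add a constant column to `X` (for example with `statsmodels.api.add_constant`) before computing the VIFs and to ignore the VIF reported for the constant.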
