## Dataset collection: World Health Organization
In this note, I collected a selection of data from the World Health Organization (WHO). I used the pandas library for data manipulation and stored the final dataset in a CSV file.

Loading the pandas library and the URL handling module.

In [1]:
import pandas as pd
import urllib.request  # the request submodule must be imported explicitly

Defined a function that downloads a particular indicator (identified by its codename 'code') and converts it into a pandas DataFrame. The GHO OData API provides WHO data as JSON, which we parse and convert into the DataFrame.

In [2]:
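import json

def getDF(code):
    # Download one indicator from the GHO OData API.
    url="https://ghoapi.azureedge.net/api/"+code
    with urllib.request.urlopen(url) as resp:
        payload=json.loads(resp.read().decode("utf-8"))
    # OData responses carry the list of records under the "value" key.
    return pd.DataFrame(payload["value"])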
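To make the parsing step concrete without touching the network, here is a minimal offline sketch. It assumes the response is a JSON object whose records sit in a list under the key `value` (the usual OData collection layout). The payload, its values, and the helper name `records_to_df` are made up for illustration and are not real WHO figures.

```python
import json
import pandas as pd

# Hand-written sample imitating an OData collection response (not real WHO data).
sample_bytes = b'''{
  "@odata.context": "https://example.invalid/$metadata#SAMPLE",
  "value": [
    {"SpatialDim": "AAA", "TimeDim": 2018, "Dim1": "BTSX", "NumericValue": 70.5},
    {"SpatialDim": "AAA", "TimeDim": 2019, "Dim1": "BTSX", "NumericValue": 71.0},
    {"SpatialDim": "BBB", "TimeDim": 2019, "Dim1": "MLE", "NumericValue": null}
  ]
}'''

def records_to_df(raw):
    """Decode a JSON payload and return the list under "value" as a DataFrame."""
    payload = json.loads(raw.decode("utf-8"))
    return pd.DataFrame(payload["value"])

sample_df = records_to_df(sample_bytes)
print(sample_df)
```

Each record becomes one row, and a JSON `null` in `NumericValue` shows up as a missing value in the DataFrame.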

Collecting the data of interest. Here I chose life expectancy, infant mortality per 1000 live births, adult (between ages 15 and 60) mortality per 1000 population, total expenditure on health as a percentage of GDP, and the percentage of the population below the poverty line ($1.25 a day). 'code_list' is the list of codes for these indicators. To make the dataset easier to read, I also wrote human-readable column names in 'name_list'. For each indicator, I kept the values reported for both sexes and, for each country, only the most recent year available. After processing each indicator, I printed sample rows and a short exploratory summary containing the mean, standard deviation, minimum/maximum, and quartiles.

In [3]:
code_list=["WHOSIS_000001","EQ_INFANTMORT","WHOSIS_000004",
           "WHS7_143","CCO_1"]
name_list=["life_expectancy","infant_mortality","adult_mortality",
           "health_expenditure","poverty"]

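# One DataFrame per indicator, in the same order as code_list.
df=[]
for c,code in enumerate(code_list):
    da=getDF(code)
    # Keep country-level rows for both sexes (or with no sex breakdown).
    # Columns are selected with a list; indexing with a set is not supported by recent pandas.
    da=da[(da["SpatialDimType"]=="COUNTRY")
                &((da["Dim1"]=="BTSX")|pd.isna(da["Dim1"]))
               ][["SpatialDim","NumericValue","TimeDim"]]
    # Keep only the most recent year available for each country.
    da=da[da.groupby("SpatialDim")["TimeDim"]
             .transform("max")==da["TimeDim"]]
    df.append(da.rename(columns={
        "SpatialDim":"country","TimeDim":"yr"+str(c),
        "NumericValue":"val"+str(c)
    }))
    print("===== "+code+" =====")
    print(df[c].head())
    print(df[c].describe())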

===== WHOSIS_000001 =====
   country      val0   yr0
12     AFG  63.20990  2019
24     AGO  63.06044  2019
36     ALB  78.00018  2019
48     ARE  76.07952  2019
60     ARG  76.57514  2019
             val0     yr0
count  183.000000   183.0
mean    72.540583  2019.0
std      7.130252     0.0
min     50.748810  2019.0
25%     66.550295  2019.0
50%     73.741260  2019.0
75%     77.730825  2019.0
max     84.261380  2019.0
===== EQ_INFANTMORT =====
   country  val1   yr1
7      ALB  18.0  2008
19     AFG  65.9  2010
31     ARM  13.0  2010
39     AZE  43.0  2006
87     BGD  43.0  2011
             val1          yr1
count   94.000000    94.000000
mean    45.914894  2008.638298
std     24.362067     3.964496
min      7.900000  1990.000000
25%     26.775000  2006.000000
50%     45.000000  2010.000000
75%     60.200000  2011.000000
max    129.600000  2013.000000
===== WHOSIS_000004 =====
    country       val2   yr2
356     AFG  245.22490  2016
399     AGO  237.96940  2016
458     ALB   96.40514
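
The two selection steps in the loop (country rows for both sexes, then the latest year per country) can be tried on a small synthetic table. This is only a sketch: the rows below are invented, and only the column names match those used in the loop.

```python
import pandas as pd

# Synthetic rows (made-up values) with the columns used in the loop above.
toy = pd.DataFrame({
    "SpatialDimType": ["COUNTRY"] * 5 + ["REGION"],
    "SpatialDim": ["AAA", "AAA", "AAA", "BBB", "BBB", "EUR"],
    "TimeDim": [2018, 2019, 2019, 2015, 2017, 2019],
    "Dim1": ["BTSX", "BTSX", "MLE", None, None, "BTSX"],
    "NumericValue": [70.0, 71.0, 69.0, 40.0, 38.0, 75.0],
})

# Step 1: country rows that are either both-sexes or have no sex breakdown.
mask = (toy["SpatialDimType"] == "COUNTRY") & (
    (toy["Dim1"] == "BTSX") | toy["Dim1"].isna()
)
kept = toy.loc[mask, ["SpatialDim", "NumericValue", "TimeDim"]]

# Step 2: within each country, keep rows whose year equals that country's maximum year.
latest_year = kept.groupby("SpatialDim")["TimeDim"].transform("max")
latest = kept[kept["TimeDim"] == latest_year]
print(latest)
```

`transform("max")` returns a Series aligned with the original rows, which is why it can be compared directly against `TimeDim` to build the row filter. If a country had several rows for its latest year, all of them would be kept.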

Merging the indicators into a single dataset with an outer join on the country code, so that a country missing from one indicator is still kept. I also renamed the columns using the names in 'name_list'. After merging, I printed a sample of the dataset.

In [4]:
df_merged=df[0][["country","val0"]]
df_merged=df_merged.rename(columns={"val0":name_list[0]})
for c in range(1,len(code_list)):
    da=df[c][["country","val"+str(c)]]
    df_merged=pd.merge(df_merged,da,how="outer",on="country")
    df_merged=df_merged.rename(columns={"val"+str(c):name_list[c]})
df_merged.head(10)

   country  life_expectancy  infant_mortality  adult_mortality  health_expenditure  poverty
0      AFG          63.2099              65.9         245.2249             8.18227      NaN
1      AGO         63.06044               NaN         237.9694             3.30698     43.4
2      ALB         78.00018              18.0         96.40514             5.88311      0.6
3      ARE         76.07952               NaN         73.95345             3.64399      NaN
4      ARG         76.57514               NaN         111.4288             4.78592      0.9
5      ARM         76.02519              13.0         116.4358             4.48015      2.5
6      ATG         76.45393               NaN         119.8657             5.54168      NaN
7      AUS         83.04064               NaN         60.72528              9.4223      NaN
8      AUT         81.64519               NaN         61.88845            11.20547      NaN
9      AZE         71.42894              43.0         117.6489             6.03686      0.4
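
As a small illustration of why the outer join leaves gaps in some columns, here is a sketch with two invented tables that only partly share countries. The country codes and values are made up.

```python
import pandas as pd

# Two made-up indicator tables that overlap on only one country.
left = pd.DataFrame({"country": ["AAA", "BBB"], "life_expectancy": [71.0, 65.0]})
right = pd.DataFrame({"country": ["BBB", "CCC"], "infant_mortality": [38.0, 12.0]})

# An outer join keeps every country found in either table;
# values missing on one side are filled with NaN.
merged = pd.merge(left, right, how="outer", on="country")
print(merged)
```

With `how="inner"` only `BBB` would survive, which is why the outer join suits indicators with uneven country coverage.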


For later use, I saved this dataset to a CSV file.

In [5]:
df_merged.to_csv("dataset_WHO.csv")
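
By default `to_csv` also writes the row index as an unnamed first column, so a plain `read_csv` brings it back as a column called `Unnamed: 0`. The sketch below shows one way to read such a file back, using a made-up two-row table and a temporary file rather than the real dataset.

```python
import os
import tempfile
import pandas as pd

# Made-up table standing in for the merged dataset.
toy = pd.DataFrame({"country": ["AAA", "BBB"], "life_expectancy": [71.0, 65.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    toy.to_csv(path)                            # row index is written as the first column
    plain = pd.read_csv(path)                   # index returns as an "Unnamed: 0" column
    restored = pd.read_csv(path, index_col=0)   # first column is used as the index again

print(list(plain.columns))
print(list(restored.columns))
```

Alternatively, passing `index=False` to `to_csv` leaves the index out of the file altogether.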