## Collection of data from WHO sources on relevant maternal health indicators

This notebook gives an example of future work that could collate data similar to the Bangladesh data, but for all countries. All sources were downloaded from the WHO website and are publicly available via the following links:

NMR: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/neonatal-mortality-rate-(per-1000-live-births)

Caesarean Sections: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/births-by-caesarean-section-(-)

Antenatal Care Visits: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/antenatal-care-coverage-at-least-four-visits

Hospital density: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/total-density-per-100-000-population-hospitals

Tetanus Toxoid Vaccination Coverage: https://immunizationdata.who.int/pages/coverage/tt2plus.html?GROUP=Countries&ANTIGEN=TT2PLUS&YEAR=&CODE=

Following collection and cleaning of the data, a short explanation of how machine learning could be applied is provided.

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Function for reading a csv into a dataframe, using the first column as the index

def get_csv(x):
    y = pd.read_csv(x, index_col=0)
    return y

In [3]:
#Load in Neonatal Mortality Rate global data
nmr_df = get_csv(r"C:\Users\wtaylor\Downloads\NMR_global.csv")

#Load in Caesarean Section global data
csec_df = get_csv(r"C:\Users\wtaylor\Downloads\caesarean_section_global.csv")
               
#Load in Antenatal Care (4 visits+) data
anc_df = get_csv(r"C:\Users\wtaylor\Downloads\ANC_4_global.csv")

#Load in Hospital Density data
hosp_df = get_csv(r"C:\Users\wtaylor\Downloads\hospital_density_global.csv")

#Load in Tetanus Vaccine global data. Note this source is an xlsx file, so it is read with pd.read_excel rather than get_csv
tvac_df = pd.read_excel(r"C:\Users\wtaylor\Downloads\Protection at birth (PAB) against neonatal tetanus and Tetanus toxoid-containing vaccine (TT2+_Td2+) vaccination coverage.xlsx")

In [4]:
#First need to ensure data only contains NMR values for both sexes
nmr_df = nmr_df.loc[(nmr_df['Indicator'] == 'Neonatal mortality rate (per 1000 live births)') & (nmr_df['Dim1'] == 'Both sexes')]

#Obtain NMR values and their respective country
nmr = nmr_df[['Location', 'Period','FactValueNumeric']]

#We only want data from 2000-2019
nmr = nmr.loc[(nmr['Period'] >= 2000) & (nmr['Period'] <= 2019)]

#Change names of columns to make them more understandable for reader
nmr = nmr.rename({'Period': 'Year', 'FactValueNumeric': 'NMR'}, axis=1)

In [5]:
#Obtain caesarean section values and their respective country
csec = csec_df[['Location', 'Period','FactValueNumeric']]

#Change names of columns to make them more understandable for reader
csec = csec.rename({'Period': 'Year', 'FactValueNumeric': 'Percentage of C-sections'}, axis=1)

#Now get the average C-section percentage for each country
csec = csec.groupby('Location').mean().reset_index()

In [6]:
#Obtain antenatal care (at least four visits) values and their respective country
anc = anc_df[['Location', 'Period','FactValueNumeric']]

#Change names of columns to make them more understandable for reader
anc = anc.rename({'Period': 'Year', 'FactValueNumeric': 'Percentage of >4 ANC visits'}, axis=1)

#Now get the average ANC coverage for each country
anc = anc.groupby('Location').mean().reset_index()

In [7]:
#First need to ensure Official Coverage is recorded
tvac_df = tvac_df.loc[tvac_df['COVERAGE_CATEGORY'] == 'OFFICIAL']

#Obtain vaccination coverages and their respective country
tvac = tvac_df[['NAME', 'YEAR','COVERAGE']]

#Change names of columns to make them more understandable for reader
tvac = tvac.rename({'NAME': 'Location','YEAR': 'Year', 'COVERAGE': 'Vaccination Coverage'}, axis=1)

#Ensure only data from 2000-2019 is present
tvac = tvac.loc[(tvac['Year'] >= 2000) & (tvac['Year'] <= 2019)]


#Now do a sense check of the data. Some vaccination coverage values exceed 100, which is not a valid percentage.
#Keep only rows with coverage of at most 100. Rows with missing coverage are dropped too, because NaN <= 100 evaluates to False
tvac = tvac[tvac['Vaccination Coverage'] <= 100.0]

In [8]:
#Obtain hospital density for each country
hosp_no = hosp_df[['Location','Value']]

#Change names of columns to make them more understandable for reader
hosp_no = hosp_no.rename({'Value': 'Hospital Density/100,000'}, axis=1)

#Get average of hospital density for each country
hosp_no = hosp_no.groupby('Location').mean().reset_index()

In [9]:
##Now merge datasets into one

new_df = pd.merge(tvac, nmr, on=['Location', 'Year'], how='left')

##Merge on C-section data
new_df2 = pd.merge(new_df,csec[['Percentage of C-sections','Location']],on='Location', how='left')

#Now merge on ANC
new_df3 = pd.merge(new_df2,anc[['Percentage of >4 ANC visits','Location']],on='Location', how='left')

#Now merge on hospital density
new_df4 = pd.merge(new_df3,hosp_no[['Hospital Density/100,000','Location']],on = 'Location', how = 'left')
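Because each left merge silently leaves NaN where a country is missing from the right-hand table, it can help to check which countries failed to match. The following is a minimal sketch on made-up toy tables (`tvac_toy` and `csec_toy` are hypothetical stand-ins, not the WHO data), using the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (values are invented for illustration)
tvac_toy = pd.DataFrame({
    "Location": ["A", "A", "B"],
    "Year": [2018, 2019, 2019],
    "Vaccination Coverage": [90.0, 92.0, 75.0],
})
csec_toy = pd.DataFrame({
    "Location": ["A", "C"],
    "Percentage of C-sections": [20.0, 35.0],
})

# indicator=True adds a "_merge" column recording where each row was found
checked = pd.merge(tvac_toy, csec_toy, on="Location", how="left", indicator=True)

# Countries present in the left table but absent from the right table
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "Location"].unique())
print(unmatched)
```

Countries listed in `unmatched` are the ones that would later be removed when rows with missing values are dropped.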

In [10]:
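#Fill empty values with the mean for that country. Note the C-section, ANC and hospital density columns already hold one per-country mean, so only NMR can gain values here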

new_df4['Percentage of C-sections'] = new_df4['Percentage of C-sections'].fillna(new_df4.groupby('Location')['Percentage of C-sections'].transform('mean'))
new_df4['Percentage of >4 ANC visits'] = new_df4['Percentage of >4 ANC visits'].fillna(new_df4.groupby('Location')['Percentage of >4 ANC visits'].transform('mean'))
new_df4['NMR'] = new_df4['NMR'].fillna(new_df4.groupby('Location')['NMR'].transform('mean'))
new_df4['Hospital Density/100,000'] = new_df4['Hospital Density/100,000'].fillna(new_df4.groupby('Location')['Hospital Density/100,000'].transform('mean'))
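To illustrate how the per-country fill behaves, here is a small sketch on an invented frame (`toy` and its values are assumptions for illustration only). `transform('mean')` returns one value per row, so it lines up with the original column for `fillna`; a country with no observed values at all stays empty:

```python
import numpy as np
import pandas as pd

# Invented example: country A has one gap, country B has no observations
toy = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B"],
    "NMR": [10.0, np.nan, 14.0, np.nan, np.nan],
})

# One value per row: the mean of that row's country, ignoring missing values
group_mean = toy.groupby("Location")["NMR"].transform("mean")

# Only the missing entries are replaced; observed values are untouched
toy["NMR_filled"] = toy["NMR"].fillna(group_mean)
print(toy)
```

This is why a later `dropna` is still needed: the fill cannot invent values for countries that have none.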

In [11]:
#Now drop rows that still have empty values. This can happen because the source datasets do not all cover the same countries, so some columns are still missing data
new_df4.dropna(subset=['NMR','Percentage of C-sections','Percentage of >4 ANC visits','Hospital Density/100,000'], inplace=True)

Now we have a dataset similar to the one that was web-scraped from the Bangladesh Public Health Bulletins. Although this dataset has fewer health indicators, it is a lot larger.

In [12]:
new_df4.describe()

| | Year | Vaccination Coverage | NMR | Percentage of C-sections | Percentage of >4 ANC visits | Hospital Density/100,000 |
|---|---|---|---|---|---|---|
| count | 1213.0 | 1213.0 | 1213.0 | 1213.0 | 1213.0 | 1213.0 |
| mean | 2009.725474 | 66.080495 | 23.862968 | 15.007131 | 61.245257 | 2.208545 |
| std | 5.733463 | 24.078557 | 11.935757 | 13.508374 | 21.647323 | 7.024349 |
| min | 2000.0 | 0.0 | 1.35 | 1.4 | 16.9 | 0.0 |
| 25% | 2005.0 | 50.0 | 13.21 | 4.5 | 45.583333 | 0.41 |
| 50% | 2010.0 | 69.0 | 24.11 | 9.3 | 64.016667 | 0.7 |
| 75% | 2015.0 | 86.0 | 32.42 | 26.7 | 77.077778 | 1.81 |
| max | 2019.0 | 100.0 | 60.63 | 51.8 | 99.8 | 56.45 |


Let's inspect the Bangladesh data. Note that the values for 'Percentage of C-sections', 'Percentage of >4 ANC visits' and 'Hospital Density/100,000' are constant for every year. The WHO datasets have limited data for these indicators, so each was averaged per country before being merged on 'Location' alone.

In [13]:
new_df4.loc[new_df4['Location'] == 'Bangladesh']

| | Location | Year | Vaccination Coverage | NMR | Percentage of C-sections | Percentage of >4 ANC visits | Hospital Density/100,000 |
|---|---|---|---|---|---|---|---|
| 106 | Bangladesh | 2019 | 94.0 | 19.06 | 30.7 | 22.776923 | 0.17 |
| 107 | Bangladesh | 2018 | 97.0 | 19.83 | 30.7 | 22.776923 | 0.17 |
| 108 | Bangladesh | 2017 | 97.0 | 20.66 | 30.7 | 22.776923 | 0.17 |
| 109 | Bangladesh | 2016 | 96.3 | 21.56 | 30.7 | 22.776923 | 0.17 |
| 110 | Bangladesh | 2015 | 98.0 | 22.51 | 30.7 | 22.776923 | 0.17 |
| 111 | Bangladesh | 2014 | 97.5 | 23.52 | 30.7 | 22.776923 | 0.17 |
| 112 | Bangladesh | 2013 | 96.3 | 24.61 | 30.7 | 22.776923 | 0.17 |
| 113 | Bangladesh | 2012 | 96.0 | 25.76 | 30.7 | 22.776923 | 0.17 |
| 114 | Bangladesh | 2011 | 96.0 | 26.99 | 30.7 | 22.776923 | 0.17 |
| 115 | Bangladesh | 2010 | 95.0 | 28.28 | 30.7 | 22.776923 | 0.17 |
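The repeated values in the three indicator columns follow from the order of operations: each indicator is first collapsed to one mean per country and then merged on 'Location' alone. A toy sketch with invented numbers (`csec_toy` and `yearly_toy` are hypothetical) reproduces the effect:

```python
import pandas as pd

# Invented survey values: the indicator is measured in only a few years
csec_toy = pd.DataFrame({
    "Location": ["A", "A", "B"],
    "Year": [2004, 2014, 2010],
    "Percentage of C-sections": [10.0, 30.0, 5.0],
})
# Invented yearly table, standing in for the vaccination/NMR data
yearly_toy = pd.DataFrame({
    "Location": ["A", "A", "A", "B"],
    "Year": [2017, 2018, 2019, 2019],
    "NMR": [21.0, 20.0, 19.0, 30.0],
})

# Collapse the indicator to a single mean per country
csec_mean = (
    csec_toy.groupby("Location")["Percentage of C-sections"].mean().reset_index()
)

# Merging on Location alone repeats that mean on every yearly row
merged = pd.merge(yearly_toy, csec_mean, on="Location", how="left")
print(merged)
```

Any model trained on the full dataset would therefore see these three columns as country-level constants rather than yearly measurements.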


Following this, classification methods could be implemented using the scikit-learn (sklearn) library. An example of the imports that could be used is given below.

In [14]:
# Packages for machine learning using sklearn
# To avoid overfitting, keep a training set and a separate testing set. The three imports below handle splitting and cross-validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score #Run model several different times, gives idea how model works on average
from sklearn.model_selection import StratifiedKFold 

#The candidate classifiers to compare are all below
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

The dataset would then be split so that the variable of interest (in this dataset, the NMR) is separated from the features and is not among the inputs to the model. The train_test_split function could then be run to obtain the training and validation sets.
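As a rough sketch of this step, the example below uses randomly generated data in place of the merged WHO table. Two things are assumptions made purely for illustration: the target is taken to be the NMR, and it is turned into a binary class (above or below the median) so that classifiers can be applied.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged dataset (not real WHO values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Vaccination Coverage": rng.uniform(0, 100, n),
    "Percentage of C-sections": rng.uniform(1, 50, n),
    "Percentage of >4 ANC visits": rng.uniform(15, 100, n),
    "Hospital Density/100,000": rng.uniform(0, 5, n),
})
toy["NMR"] = 40 - 0.25 * toy["Percentage of >4 ANC visits"] + rng.normal(0, 3, n)

# Assumption: label each row 1 if its NMR is above the median, else 0
y = (toy["NMR"] > toy["NMR"].median()).astype(int)

# The target column is removed from the features
X = toy.drop(columns=["NMR"])

# Hold out 20% for validation, keeping the class balance in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
print(X_train.shape, X_val.shape)
```

Other ways of defining the classes (for example, thresholds tied to a policy target) would work equally well with the same split.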

These features can then be run through the different classifiers imported above to see which algorithm works best for this data.

The model with the highest cross-validation score could then be used to make predictions on the validation dataset. 
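The comparison and final prediction could look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the WHO table, compares only three of the classifiers for brevity, and the fold count and random seeds are arbitrary choices, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# A subset of the candidate classifiers
models = {
    "LR": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

# Mean accuracy over stratified 10-fold cross-validation on the training set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

# Refit the best-scoring model on all training data, then score the held-out set
best_name = max(scores, key=scores.get)
best_model = models[best_name].fit(X_train, y_train)
val_accuracy = best_model.score(X_val, y_val)
print(scores, best_name, val_accuracy)
```

The validation set is touched only once, after model selection, so `val_accuracy` gives a less optimistic estimate than the cross-validation scores used to pick the model.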

Note that this is just one approach to classification; others exist (such as neural networks).