In [42]:
import pandas as pd

Read in the data directly from an archived version of UN IGME's website.

In [58]:
df = pd.read_csv('https://childmortality.org/wp-content/uploads/2021/09/UNIGME-2021.csv')
df.to_csv('data/input/un_igme_youth_mortality_input_data.csv', index = False)



A list of the available indicators:

In [59]:
df.Indicator.drop_duplicates()

0                                 Neonatal mortality rate
69                             Mortality rate 1-59 months
100                                 Infant mortality rate
450                             Under-five mortality rate
989                              Mortality rate age 10-14
1020                             Mortality rate age 10-19
1051                             Mortality rate age 15-19
1095                             Mortality rate age 15-24
1139                         Child Mortality rate age 1-4
1319                             Mortality rate age 20-24
1350                              Mortality rate age 5-14
1395                              Mortality rate age 5-24
1426                               Mortality rate age 5-9
1471      Progress towards SDG in neonatal mortality rate
1472    Progress towards SDG in under-five mortality rate
1473                                          Stillbirths
1493                                      Stillbirth rate
1525          

In [45]:
df.head()

Unnamed: 0,Geographic area,Indicator,Sex,Wealth Quintile,Series Name,Series Year,Regional group,TIME_PERIOD,OBS_VALUE,COUNTRY_NOTES,...,Age Group of Women,Time Since First Birth,DEFINITION,INTERVAL,Series Method,LOWER_BOUND,UPPER_BOUND,STATUS,YEAR_TO_ACHIEVE,Model Used
0,Afghanistan,Neonatal mortality rate,Total,Total,Multiple Indicator Cluster Survey 2003 (Direct),2003,,1981-01,36.0,,...,,,,5.0,Survey/Census with Full Birth Histories,,,,,
1,Afghanistan,Neonatal mortality rate,Total,Total,Multiple Indicator Cluster Survey 2003 (Direct),2003,,1986-01,25.0,,...,,,,5.0,Survey/Census with Full Birth Histories,,,,,
2,Afghanistan,Neonatal mortality rate,Total,Total,Multiple Indicator Cluster Survey 2003 (Direct),2003,,1991-01,18.9,,...,,,,5.0,Survey/Census with Full Birth Histories,,,,,
3,Afghanistan,Neonatal mortality rate,Total,Total,Multiple Indicator Cluster Survey 2003 (Direct),2003,,1996-01,19.1,,...,,,,5.0,Survey/Census with Full Birth Histories,,,,,
4,Afghanistan,Neonatal mortality rate,Total,Total,Multiple Indicator Cluster Survey 2003 (Direct),2003,,2001-01,20.7,,...,,,,5.0,Survey/Census with Full Birth Histories,,,,,


We keep only the rows that:
* Are UN IGME estimates
* Cover both sexes (Total)
* Cover all wealth quintiles (Total)
* Report one of the following indicators: Neonatal mortality rate, Infant mortality rate, Under-five mortality rate, Mortality rate age 5-9, Mortality rate age 5-14, Mortality rate age 15-19, Mortality rate age 15-24, Mortality rate age 5-24

In [60]:
# Indicators to keep
indicators = ['Neonatal mortality rate', 'Infant mortality rate', 'Under-five mortality rate',
              'Mortality rate age 5-9', 'Mortality rate age 5-14', 'Mortality rate age 15-19',
              'Mortality rate age 15-24', 'Mortality rate age 5-24']

df_sel = df[
    (df['Series Name'] == 'UN IGME estimate')
    & (df['Sex'] == 'Total')
    & (df['Wealth Quintile'] == 'Total')
    & (df['Indicator'].isin(indicators))
]

Append the regional group as a suffix to the Geographic area variable where necessary, to distinguish between institutional definitions of regions that share a name, e.g. Sub-Saharan Africa (UNICEF) and Sub-Saharan Africa (UN SDG).

In [47]:

# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
df_sel = df_sel.copy()
df_sel['Regional group'] = df_sel['Regional group'].replace({'UNICEF': ' (UNICEF)', 'SDG': ' (UN SDG)', 'World bank': ' (WB)'})
# Rows without a regional group get an empty suffix
df_sel['Geographic area'] = df_sel['Geographic area'] + df_sel['Regional group'].fillna('')


Select only the columns we're interested in and convert the time period to an integer year. The time period is currently given as June of the year the data refer to.


In [48]:
# Take an explicit copy so adding and dropping columns does not raise SettingWithCopyWarning
df_fil = df_sel[['Geographic area', 'Indicator', 'TIME_PERIOD', 'OBS_VALUE']].copy()
# The first four characters of TIME_PERIOD are the year
df_fil['year'] = df_fil['TIME_PERIOD'].str[:4].astype(int)
df_fil = df_fil.drop(columns=['TIME_PERIOD'])


Pivot the data so that there is one column per indicator.

In [49]:
df_piv = df_fil.pivot(index =['Geographic area','year'] , columns= 'Indicator' , values= 'OBS_VALUE').reset_index()

df_piv.head()

Indicator,Geographic area,year,Infant mortality rate,Mortality rate age 15-19,Mortality rate age 15-24,Mortality rate age 5-14,Mortality rate age 5-24,Mortality rate age 5-9,Neonatal mortality rate,Under-five mortality rate
0,Afghanistan,1957,,,,,,,,377.841228
1,Afghanistan,1958,,,,,,,,370.901556
2,Afghanistan,1959,,,,,,,,364.407994
3,Afghanistan,1960,,,,,,,,358.205145
4,Afghanistan,1961,237.460139,,,,,,,352.211995


Adjust the mortality rates so that age groups can be combined.

For example, to calculate the mortality rate of under-tens we need to combine the under-five mortality rate with the rate for the 5-9 age group. The 5-9 rate is expressed per 1000 children who reach age five, whereas the under-five rate is expressed per 1000 live births. If there are 100 deaths per 1000 under-fives, only 900 of every 1000 children enter the 5-9 age group, so the 5-9 rate has to be scaled by 900/1000 before the two rates are added.

In [50]:
df_piv['Adjusted rate age 5-9'] = ((1000 - df_piv['Under-five mortality rate'])/1000) * df_piv['Mortality rate age 5-9']
df_piv['Under-ten mortality rate'] = df_piv['Under-five mortality rate'] + df_piv['Adjusted rate age 5-9']

df_piv['Adjusted rate age 5-14'] = ((1000 - df_piv['Under-five mortality rate'])/1000 * df_piv['Mortality rate age 5-14'])
df_piv['Under-fifteen mortality rate'] = df_piv['Under-five mortality rate'] + df_piv['Adjusted rate age 5-14']

df_piv['Adjusted rate age 15-19'] = ((1000 - df_piv['Under-fifteen mortality rate'])/1000 * df_piv['Mortality rate age 15-19'])
df_piv['Under-twenty mortality rate'] = df_piv['Under-fifteen mortality rate'] + df_piv['Adjusted rate age 15-19']

df_piv['Adjusted rate age 5-24'] = ((1000 - df_piv['Under-five mortality rate'])/1000 * df_piv['Mortality rate age 5-24'])
df_piv['Under-twenty-five mortality rate'] = df_piv['Under-five mortality rate'] + df_piv['Adjusted rate age 5-24']
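
To see the adjustment in isolation, here is a small worked example with made-up numbers (the two input rates are hypothetical, chosen only to keep the arithmetic easy). It also checks that the additive form used above matches the equivalent formulation as one minus the product of survival probabilities.

```python
# Hypothetical rates, both in deaths per 1000.
under_five = 100.0  # deaths before age 5 per 1000 live births
age_5_9 = 10.0      # deaths between ages 5 and 10 per 1000 children alive at age 5

# Only 900 of every 1000 newborns reach age 5, so rescale the 5-9 rate
# to a per-1000-live-births denominator before adding.
adjusted_5_9 = (1000 - under_five) / 1000 * age_5_9  # about 9 per 1000 births
under_ten = under_five + adjusted_5_9                # about 109 per 1000 births

# Equivalent view: probability of dying before 10 is one minus the
# probability of surviving both age intervals.
via_survival = 1000 * (1 - (1 - under_five / 1000) * (1 - age_5_9 / 1000))

print(round(under_ten, 6), round(via_survival, 6))
```

The same pattern chains: the under-fifteen rate becomes the starting point for the 15-19 adjustment, as in the cell above.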

Standardising the country names

In [51]:
df_piv.rename(columns = {'Geographic area':'Country'}, inplace = True)
countries = df_piv['Country'].drop_duplicates()
countries.to_csv('data/input/youth_mortality_countries_to_standardise.csv')

In [52]:
country_stan = pd.read_csv('data/input/youth_mortality_countries_to_standardise_country_standardized.csv')
country_stan = country_stan[['Country', 'Our World In Data Name']]

In [53]:
df_piv_stan = df_piv.merge(country_stan, on = 'Country')
df_piv_stan.drop(columns = ['Country'], inplace = True)
df_piv_stan.rename(columns = {'Our World In Data Name':'Country'}, inplace = True)
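
Because `merge` defaults to an inner join, any name missing from the standardisation file is dropped without warning. The sketch below shows one way to check for this using `how='left'` and `indicator=True`; the data frames and the unmatched name `Atlantis` are hypothetical and stand in for `df_piv` and `country_stan`.

```python
import pandas as pd

# Hypothetical stand-ins for df_piv and country_stan; 'Atlantis' has no mapping.
rates = pd.DataFrame({'Country': ['Afghanistan', 'Atlantis'], 'year': [2019, 2019]})
mapping = pd.DataFrame({'Country': ['Afghanistan'],
                        'Our World In Data Name': ['Afghanistan']})

# A left join keeps every row; the _merge column records whether it matched.
checked = rates.merge(mapping, on='Country', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country'].unique().tolist()

# Names listed here would have been dropped by the default inner join.
print(unmatched)
```

An empty list means every name was matched and the inner join loses nothing.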

Moving the new standardised country column into the first column

In [54]:
cols = df_piv_stan.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_piv_stan = df_piv_stan[cols]

Convert the mortality rates listed below from deaths per 1000 to percentages, rounding to two decimal places at the same time.

In [55]:
# Columns to convert from per 1000 to percent.
# Note: 'Infant mortality rate' is not in this list, so it stays per 1000.
rate_cols = ['Neonatal mortality rate', 'Under-five mortality rate', 'Under-ten mortality rate',
             'Under-fifteen mortality rate', 'Under-twenty mortality rate', 'Under-twenty-five mortality rate']
df_piv_stan[rate_cols] = (df_piv_stan[rate_cols] / 10).round(2)

df_piv_stan = df_piv_stan[['Country', 'year','Neonatal mortality rate','Infant mortality rate', 'Under-five mortality rate', 'Under-ten mortality rate', 'Under-fifteen mortality rate', 'Under-twenty mortality rate', 'Under-twenty-five mortality rate']]

In [56]:
df_piv_stan

Unnamed: 0,Country,year,Neonatal mortality rate,Under-five mortality rate,Under-ten mortality rate,Under-fifteen mortality rate,Under-twenty mortality rate,Under-twenty-five mortality rate
0,Afghanistan,1957,,37.78,,,,
1,Afghanistan,1958,,37.09,,,,
2,Afghanistan,1959,,36.44,,,,
3,Afghanistan,1960,,35.82,,,,
4,Afghanistan,1961,,35.22,,,,
...,...,...,...,...,...,...,...,...
13120,Zimbabwe,2016,2.74,5.87,6.36,7.19,8.16,9.62
13121,Zimbabwe,2017,2.70,5.70,6.16,7.01,7.97,9.40
13122,Zimbabwe,2018,2.66,5.48,5.91,6.76,7.72,9.12
13123,Zimbabwe,2019,2.62,5.42,5.83,6.68,7.61,8.99


In [57]:

df_piv_stan.to_csv('data/output/un_igme_youth_mortality_out.csv', index = False)