# Contents

- [Load libraries](#Load-libraries)
- [Data Cleaning](#Data-Cleaning)
- [Combine dataset](Combine-dataset)
- [Export Normalised Data to SqlLite3](#Export-Normalised-Data-to-SqlLite3)
- [Basic Understanding of metrics ](#Basic-Understanding-of-metrics )

# Load libraries

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [3]:
# load csv

adult_mortality = pd.read_csv('../data/Adult mortality.csv')
maternal_mortality = pd.read_csv('../data/Maternal mortality.csv')
num_death = pd.read_csv('../data/Number of deaths (thousands).csv')
prob_dying = pd.read_csv('../data/Probability of dying per 1000 live births.csv')

# Data Cleaning

## Check contents of dataframes 


In [4]:
adult_mortality.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Adult mortality rate (probability of dying between 15 and 60 years per 1000 population),Adult mortality rate (probability of dying between 15 and 60 years per 1000 population).1,Adult mortality rate (probability of dying between 15 and 60 years per 1000 population).2
0,Country,Year,Both sexes,Male,Female
1,Afghanistan,2016,245,272,216
2,Afghanistan,2015,233,254,210
3,Afghanistan,2014,234,254,213
4,Afghanistan,2013,235,254,215


In [5]:
adult_mortality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3112 entries, 0 to 3111
Data columns (total 5 columns):
 #   Column                                                                                     Non-Null Count  Dtype 
---  ------                                                                                     --------------  ----- 
 0   Unnamed: 0                                                                                 3112 non-null   object
 1   Unnamed: 1                                                                                 3112 non-null   object
 2   Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)    3112 non-null   object
 3   Adult mortality rate (probability of dying between 15 and 60 years per 1000 population).1  3112 non-null   object
 4   Adult mortality rate (probability of dying between 15 and 60 years per 1000 population).2  3112 non-null   object
dtypes: object(5)
memory usage: 121.7+ KB


Columns # 1,2,3,4 contains object likely because of the Row 0 as it contains gender as a `sub-header`

Also dtype for columns 2,3,4 are in Object. after cleaning, should be converted to integers


In [6]:
maternal_mortality

Unnamed: 0,Country,Year,Maternal mortality ratio (per 100 000 live births),Number of maternal deaths
0,Afghanistan,2017,638 [ 427 - 1 010 ],7 700 [ 5 100 - 12 000 ]
1,Afghanistan,2016,673 [ 457 - 1 040 ],8 100 [ 5 500 - 12 000 ]
2,Afghanistan,2015,701 [ 501 - 1 020 ],8 400 [ 6 000 - 12 000 ]
3,Afghanistan,2014,786 [ 592 - 1 080 ],9 300 [ 7 000 - 13 000 ]
4,Afghanistan,2013,810 [ 617 - 1 080 ],9 600 [ 7 300 - 13 000 ]
...,...,...,...,...
3289,Zimbabwe,2004,686 [ 597 - 784 ],2 800 [ 2 400 - 3 100 ]
3290,Zimbabwe,2003,680 [ 590 - 779 ],2 700 [ 2 300 - 3 100 ]
3291,Zimbabwe,2002,666 [ 577 - 766 ],2 600 [ 2 200 - 3 000 ]
3292,Zimbabwe,2001,629 [ 544 - 723 ],2 400 [ 2 100 - 2 800 ]


If needed, we can establish range for maternal mortality ratio. 

For now we can take in the first value outside the square brackets 

There is no sex column to wide long 

In [7]:
num_death.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Number of under-five deaths (thousands),Number of under-five deaths (thousands).1,Number of under-five deaths (thousands).2,Number of infant deaths (thousands),Number of infant deaths (thousands).1,Number of infant deaths (thousands).2,Number of neonatal deaths (thousands)
0,Country,Year,Both sexes,Male,Female,Both sexes,Male,Female,Both sexes
1,Afghanistan,2018,74278,40312,33966,57182,31394,25788,44725
2,Afghanistan,2017,76877,41631,35246,58846,32244,26602,45771
3,Afghanistan,2016,79770,43134,36636,60673,33222,27451,46963
4,Afghanistan,2015,82918,44733,38185,62652,34257,28395,48237


In [8]:
prob_dying.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Infant mortality rate (probability of dying between birth and age 1 per 1000 live births),Infant mortality rate (probability of dying between birth and age 1 per 1000 live births).1,Infant mortality rate (probability of dying between birth and age 1 per 1000 live births).2,Neonatal mortality rate (per 1000 live births),Under-five mortality rate (probability of dying by age 5 per 1000 live births),Under-five mortality rate (probability of dying by age 5 per 1000 live births).1,Under-five mortality rate (probability of dying by age 5 per 1000 live births).2
0,Country,Year,Both sexes,Male,Female,Both sexes,Both sexes,Male,Female
1,Afghanistan,2018,47.9,51.1,44.5,37.1,62.3,65.7,58.7
2,Afghanistan,2017,49.5,52.7,46,38.1,64.7,68.1,61.1
3,Afghanistan,2016,51.2,54.5,47.7,39.3,67.5,70.9,63.7
4,Afghanistan,2015,53.1,56.5,49.6,40.5,70.4,73.8,66.7


## Check for missing values

In [9]:
adult_mortality.isnull().sum()

Unnamed: 0                                                                                   0
Unnamed: 1                                                                                   0
Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)      0
Adult mortality rate (probability of dying between 15 and 60 years per 1000 population).1    0
Adult mortality rate (probability of dying between 15 and 60 years per 1000 population).2    0
dtype: int64

In [10]:
maternal_mortality.isnull().sum()

Country                                               0
Year                                                  0
Maternal mortality ratio (per 100 000 live births)    0
Number of maternal deaths                             0
dtype: int64

In [11]:
num_death.isnull().sum()

Unnamed: 0                                   0
Unnamed: 1                                   0
Number of under-five deaths (thousands)      0
Number of under-five deaths (thousands).1    0
Number of under-five deaths (thousands).2    0
Number of infant deaths (thousands)          0
Number of infant deaths (thousands).1        0
Number of infant deaths (thousands).2        0
Number of neonatal deaths (thousands)        0
dtype: int64

In [12]:
prob_dying.isnull().sum()

Unnamed: 0                                                                                     0
Unnamed: 1                                                                                     0
Infant mortality rate (probability of dying between birth and age 1 per 1000 live births)      0
Infant mortality rate (probability of dying between birth and age 1 per 1000 live births).1    0
Infant mortality rate (probability of dying between birth and age 1 per 1000 live births).2    0
Neonatal mortality rate (per 1000 live births)                                                 0
Under-five mortality rate (probability of dying by age 5 per 1000 live births)                 0
Under-five mortality rate (probability of dying by age 5 per 1000 live births).1               0
Under-five mortality rate (probability of dying by age 5 per 1000 live births).2               0
dtype: int64

**No missing values**

## Compress data (Wide to Long)

We can see that the column of interest is `Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)` split into 3 categories based on gender 

Primary keys identified are `Country`, `Year`, and `Sex`

In [13]:
# define a function that creates compressed table 
# assuming column 2 is the only main column of interest (already filtered by only One topic)

def compress_df(df, col_of_interest):
    # copy out original
    df_compress = df.copy()

    # make first row the header, and reset index
    df_compress.columns = df.iloc[0,:]
    df_compress.drop(0, inplace = True )
    df_compress.reset_index(inplace = True, drop = True)
    
    # use pd.melt to to make wide to long format
    df_compress = pd.melt(df_compress, id_vars=['Country','Year'], value_vars=['Both sexes','Male','Female'], var_name='sex', value_name=col_of_interest)
    
    # convert all to lower caps for columns 
    df_compress.columns = [x.lower() for x in df_compress.columns.tolist()]
    
    return df_compress

In [14]:
# define function that converts year to datetime.year and converts values to int/float

def convert_values(df,val_type):
    df['year'] = pd.to_datetime(df['year'])
    df['year'] = df['year'].dt.year  
    df.iloc[:,3] = df.iloc[:,3].astype(val_type)
    print(df.info())

### Compress Adult Mortality

In [15]:
adult_mort_comp = compress_df(adult_mortality, 'adult_mortality')
print(adult_mort_comp.info())
adult_mort_comp

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9333 entries, 0 to 9332
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          9333 non-null   object
 1   year             9333 non-null   object
 2   sex              9333 non-null   object
 3   adult_mortality  9333 non-null   object
dtypes: object(4)
memory usage: 291.8+ KB
None


Unnamed: 0,country,year,sex,adult_mortality
0,Afghanistan,2016,Both sexes,245
1,Afghanistan,2015,Both sexes,233
2,Afghanistan,2014,Both sexes,234
3,Afghanistan,2013,Both sexes,235
4,Afghanistan,2012,Both sexes,242
...,...,...,...,...
9328,Zimbabwe,2004,Female,670
9329,Zimbabwe,2003,Female,671
9330,Zimbabwe,2002,Female,667
9331,Zimbabwe,2001,Female,656


In [16]:
# convert year to datetime.year
# convert values to int

convert_values(adult_mort_comp,int)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9333 entries, 0 to 9332
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          9333 non-null   object
 1   year             9333 non-null   int64 
 2   sex              9333 non-null   object
 3   adult_mortality  9333 non-null   int32 
dtypes: int32(1), int64(1), object(2)
memory usage: 255.3+ KB
None


### Compress Number of Death under 5 years old

In [17]:
num_underfivedeath_df = compress_df(num_death[['Unnamed: 0', 'Unnamed: 1', 'Number of under-five deaths (thousands)',
       'Number of under-five deaths (thousands).1',
       'Number of under-five deaths (thousands).2']], 'no_underfivedeath')

print(num_underfivedeath_df.info())
num_underfivedeath_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3492 entries, 0 to 3491
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   country            3492 non-null   object
 1   year               3492 non-null   object
 2   sex                3492 non-null   object
 3   no_underfivedeath  3492 non-null   object
dtypes: object(4)
memory usage: 109.2+ KB
None


Unnamed: 0,country,year,sex,no_underfivedeath
0,Afghanistan,2018,Both sexes,74278
1,Afghanistan,2017,Both sexes,76877
2,Afghanistan,2016,Both sexes,79770
3,Afghanistan,2015,Both sexes,82918
4,Afghanistan,2014,Both sexes,86378
...,...,...,...,...
3487,Zimbabwe,2017,Female,10100
3488,Zimbabwe,2016,Female,10459
3489,Zimbabwe,2015,Female,11432
3490,Zimbabwe,2014,Female,12192


In [18]:
# convert year to datetime.year
# convert values to int

convert_values(num_underfivedeath_df,int)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3492 entries, 0 to 3491
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   country            3492 non-null   object
 1   year               3492 non-null   int64 
 2   sex                3492 non-null   object
 3   no_underfivedeath  3492 non-null   int32 
dtypes: int32(1), int64(1), object(2)
memory usage: 95.6+ KB
None


### Compress Number of Infant Death

In [19]:
num_infantDeath_df = compress_df(num_death[['Unnamed: 0', 'Unnamed: 1','Number of infant deaths (thousands)',
       'Number of infant deaths (thousands).1',
       'Number of infant deaths (thousands).2']], 'no_infantdeath')
print(num_infantDeath_df.info())
num_infantDeath_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3492 entries, 0 to 3491
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   country         3492 non-null   object
 1   year            3492 non-null   object
 2   sex             3492 non-null   object
 3   no_infantdeath  3492 non-null   object
dtypes: object(4)
memory usage: 109.2+ KB
None


Unnamed: 0,country,year,sex,no_infantdeath
0,Afghanistan,2018,Both sexes,57182
1,Afghanistan,2017,Both sexes,58846
2,Afghanistan,2016,Both sexes,60673
3,Afghanistan,2015,Both sexes,62652
4,Afghanistan,2014,Both sexes,64808
...,...,...,...,...
3487,Zimbabwe,2017,Female,7005
3488,Zimbabwe,2016,Female,7297
3489,Zimbabwe,2015,Female,7885
3490,Zimbabwe,2014,Female,8344


In [20]:
# convert year to datetime.year
# convert values to int

convert_values(num_infantDeath_df,int)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3492 entries, 0 to 3491
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   country         3492 non-null   object
 1   year            3492 non-null   int64 
 2   sex             3492 non-null   object
 3   no_infantdeath  3492 non-null   int32 
dtypes: int32(1), int64(1), object(2)
memory usage: 95.6+ KB
None


### Compress Number of Neonatal Death

In [21]:
# create df for neonatal death
num_neonatal_df = num_death[['Unnamed: 0', 'Unnamed: 1','Number of neonatal deaths (thousands)']]

# make first row the header, and reset index
num_neonatal_df.columns = num_neonatal_df.iloc[0,:]
num_neonatal_df.drop(0, inplace = True )
num_neonatal_df.reset_index(inplace = True, drop = True)

# use pd.melt to to make wide to long format
num_neonatal_df = pd.melt(num_neonatal_df, id_vars=['Country','Year'], value_vars=['Both sexes'], var_name='sex', value_name='no_neonataldeath')

# convert all to lower caps for columns 
num_neonatal_df.columns = [x.lower() for x in num_neonatal_df.columns.tolist()]

print(num_neonatal_df.info())
num_neonatal_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1164 entries, 0 to 1163
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   country           1164 non-null   object
 1   year              1164 non-null   object
 2   sex               1164 non-null   object
 3   no_neonataldeath  1164 non-null   object
dtypes: object(4)
memory usage: 36.5+ KB
None


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,country,year,sex,no_neonataldeath
0,Afghanistan,2018,Both sexes,44725
1,Afghanistan,2017,Both sexes,45771
2,Afghanistan,2016,Both sexes,46963
3,Afghanistan,2015,Both sexes,48237
4,Afghanistan,2014,Both sexes,49715
...,...,...,...,...
1159,Zimbabwe,2017,Both sexes,9696
1160,Zimbabwe,2016,Both sexes,10235
1161,Zimbabwe,2015,Both sexes,10815
1162,Zimbabwe,2014,Both sexes,11447


In [22]:
# convert year to datetime.year
# convert values to int

convert_values(num_neonatal_df,int)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1164 entries, 0 to 1163
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   country           1164 non-null   object
 1   year              1164 non-null   int64 
 2   sex               1164 non-null   object
 3   no_neonataldeath  1164 non-null   int32 
dtypes: int32(1), int64(1), object(2)
memory usage: 32.0+ KB
None


### Compress Prob of Infant Mortality Rate

Found spaces in `Probability of dying per 1000 live births.csv`

In [23]:
# strip off spaces before and after words in columns 

prob_dying.iloc[0,:] = prob_dying.iloc[0,:].apply(lambda x: ' '.join(x.split()))

In [24]:
prob_infantDeath_df = compress_df(prob_dying[['Unnamed: 0', 'Unnamed: 1',
       'Infant mortality rate (probability of dying between birth and age 1 per 1000 live births)',
       'Infant mortality rate (probability of dying between birth and age 1 per 1000 live births).1',
       'Infant mortality rate (probability of dying between birth and age 1 per 1000 live births).2']], 'prob_infantdeath')

In [25]:
print(prob_infantDeath_df.info())
prob_infantDeath_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3492 entries, 0 to 3491
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   country           3492 non-null   object
 1   year              3492 non-null   object
 2   sex               3492 non-null   object
 3   prob_infantdeath  3492 non-null   object
dtypes: object(4)
memory usage: 109.2+ KB
None


Unnamed: 0,country,year,sex,prob_infantdeath
0,Afghanistan,2018,Both sexes,47.9
1,Afghanistan,2017,Both sexes,49.5
2,Afghanistan,2016,Both sexes,51.2
3,Afghanistan,2015,Both sexes,53.1
4,Afghanistan,2014,Both sexes,55.1
...,...,...,...,...
3487,Zimbabwe,2017,Female,31.2
3488,Zimbabwe,2016,Female,32
3489,Zimbabwe,2015,Female,34
3490,Zimbabwe,2014,Female,35.6


In [26]:
# convert year to datetime.year
# convert values to float

convert_values(prob_infantDeath_df,float)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3492 entries, 0 to 3491
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   country           3492 non-null   object 
 1   year              3492 non-null   int64  
 2   sex               3492 non-null   object 
 3   prob_infantdeath  3492 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 109.2+ KB
None


### Compress Prob of under 5 years old Mortality Rate

In [27]:
prob_underfivedeath_df = compress_df(prob_dying[['Unnamed: 0', 'Unnamed: 1',
       'Under-five mortality rate (probability of dying by age 5 per 1000 live births)',
       'Under-five mortality rate (probability of dying by age 5 per 1000 live births).1',
       'Under-five mortality rate (probability of dying by age 5 per 1000 live births).2']], 'prob_underfivedeath')

In [28]:
print(prob_underfivedeath_df.info())
prob_underfivedeath_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3492 entries, 0 to 3491
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   country              3492 non-null   object
 1   year                 3492 non-null   object
 2   sex                  3492 non-null   object
 3   prob_underfivedeath  3492 non-null   object
dtypes: object(4)
memory usage: 109.2+ KB
None


Unnamed: 0,country,year,sex,prob_underfivedeath
0,Afghanistan,2018,Both sexes,62.3
1,Afghanistan,2017,Both sexes,64.7
2,Afghanistan,2016,Both sexes,67.5
3,Afghanistan,2015,Both sexes,70.4
4,Afghanistan,2014,Both sexes,73.6
...,...,...,...,...
3487,Zimbabwe,2017,Female,44.5
3488,Zimbabwe,2016,Female,45.5
3489,Zimbabwe,2015,Female,49.2
3490,Zimbabwe,2014,Female,52.3


In [29]:
# convert year to datetime.year
# convert values to float

convert_values(prob_underfivedeath_df,float)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3492 entries, 0 to 3491
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   country              3492 non-null   object 
 1   year                 3492 non-null   int64  
 2   sex                  3492 non-null   object 
 3   prob_underfivedeath  3492 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 109.2+ KB
None


### Compress Prob of Neonatal Mortality Rate

In [30]:
# create df for neonatal death
prob_neonatal_df = prob_dying[['Unnamed: 0', 'Unnamed: 1','Neonatal mortality rate (per 1000 live births)']]

# make first row the header, and reset index
prob_neonatal_df.columns = prob_neonatal_df.iloc[0,:]
prob_neonatal_df.drop(0, inplace = True )
prob_neonatal_df.reset_index(inplace = True, drop = True)

# use pd.melt to to make wide to long format
prob_neonatal_df = pd.melt(prob_neonatal_df, id_vars=['Country','Year'], value_vars=['Both sexes'], var_name='sex', value_name='prob_neonataldeath')

# convert all to lower caps for columns 
prob_neonatal_df.columns = [x.lower() for x in prob_neonatal_df.columns.tolist()]

print(prob_neonatal_df.info())
prob_neonatal_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1164 entries, 0 to 1163
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   country             1164 non-null   object
 1   year                1164 non-null   object
 2   sex                 1164 non-null   object
 3   prob_neonataldeath  1164 non-null   object
dtypes: object(4)
memory usage: 36.5+ KB
None


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,country,year,sex,prob_neonataldeath
0,Afghanistan,2018,Both sexes,37.1
1,Afghanistan,2017,Both sexes,38.1
2,Afghanistan,2016,Both sexes,39.3
3,Afghanistan,2015,Both sexes,40.5
4,Afghanistan,2014,Both sexes,41.9
...,...,...,...,...
1159,Zimbabwe,2017,Both sexes,21.5
1160,Zimbabwe,2016,Both sexes,22.3
1161,Zimbabwe,2015,Both sexes,23.1
1162,Zimbabwe,2014,Both sexes,24.2


In [31]:
# convert year to datetime.year
# convert values to float

convert_values(prob_neonatal_df,float)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1164 entries, 0 to 1163
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             1164 non-null   object 
 1   year                1164 non-null   int64  
 2   sex                 1164 non-null   object 
 3   prob_neonataldeath  1164 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 36.5+ KB
None


## Clean Maternal Mortality

In [32]:
# how many unique years 
maternal_mortality.Year.unique()

array([2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007,
       2006, 2005, 2004, 2003, 2002, 2001, 2000], dtype=int64)

In [33]:
maternal_mortality.sort_values(by = 'Number of maternal deaths')

Unnamed: 0,Country,Year,Maternal mortality ratio (per 100 000 live births),Number of maternal deaths
1348,Iceland,2001,6 [ 4 - 9 ],[ - ]
1865,Malta,2006,8 [ 5 - 13 ],[ - ]
1349,Iceland,2000,6 [ 4 - 9 ],[ - ]
1863,Malta,2008,8 [ 5 - 12 ],[ - ]
1862,Malta,2009,7 [ 5 - 12 ],[ - ]
...,...,...,...,...
2449,Rwanda,2016,260 [ 194 - 357 ],990 [ 740 - 1 400 ]
681,Congo,2002,769 [ 604 - 974 ],990 [ 770 - 1 200 ]
903,Egypt,2014,39 [ 31 - 47 ],990 [ 790 - 1 200 ]
906,Egypt,2011,42 [ 35 - 49 ],990 [ 830 - 1 200 ]


In [34]:
maternal_mortality.groupby('Number of maternal deaths').count()

Unnamed: 0_level_0,Country,Year,Maternal mortality ratio (per 100 000 live births)
Number of maternal deaths,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
[ - ],32,32,32
[ - 1 ],30,30,30
1 000 [ 640 - 1 700 ],1,1,1
1 000 [ 700 - 1 400 ],2,2,2
1 000 [ 710 - 1 400 ],2,2,2
...,...,...,...
990 [ 740 - 1 400 ],1,1,1
990 [ 770 - 1 200 ],1,1,1
990 [ 790 - 1 200 ],1,1,1
990 [ 830 - 1 200 ],1,1,1


In [35]:
maternal_mortality.groupby('Number of maternal deaths').count().reset_index()['Number of maternal deaths'][0:2]

0      [  -  ]
1     [  - 1 ]
Name: Number of maternal deaths, dtype: object

We can look into the 32 countries that has no number of maternal deaths and see how to impute them 

In [36]:
maternal_mortality[maternal_mortality['Number of maternal deaths'] == ' [  -  ]']

Unnamed: 0,Country,Year,Maternal mortality ratio (per 100 000 live births),Number of maternal deaths
1332,Iceland,2017,4 [ 2 - 6 ],[ - ]
1333,Iceland,2016,4 [ 2 - 7 ],[ - ]
1334,Iceland,2015,4 [ 2 - 6 ],[ - ]
1335,Iceland,2014,5 [ 3 - 8 ],[ - ]
1336,Iceland,2013,4 [ 3 - 7 ],[ - ]
1337,Iceland,2012,4 [ 2 - 6 ],[ - ]
1338,Iceland,2011,5 [ 3 - 7 ],[ - ]
1339,Iceland,2010,5 [ 3 - 7 ],[ - ]
1340,Iceland,2009,5 [ 3 - 8 ],[ - ]
1341,Iceland,2008,5 [ 3 - 7 ],[ - ]


In [37]:
maternal_mortality[maternal_mortality['Number of maternal deaths'] == ' [  - 1 ]']

Unnamed: 0,Country,Year,Maternal mortality ratio (per 100 000 live births),Number of maternal deaths
1188,Grenada,2017,25 [ 15 - 39 ],[ - 1 ]
1189,Grenada,2016,25 [ 16 - 38 ],[ - 1 ]
1190,Grenada,2015,25 [ 16 - 38 ],[ - 1 ]
1191,Grenada,2014,26 [ 16 - 39 ],[ - 1 ]
1192,Grenada,2013,27 [ 17 - 40 ],[ - 1 ]
1747,Luxembourg,2016,5 [ 3 - 9 ],[ - 1 ]
1749,Luxembourg,2014,6 [ 4 - 9 ],[ - 1 ]
1750,Luxembourg,2013,6 [ 4 - 10 ],[ - 1 ]
1751,Luxembourg,2012,7 [ 5 - 11 ],[ - 1 ]
1752,Luxembourg,2011,8 [ 5 - 12 ],[ - 1 ]


In [38]:
maternal_mortality[maternal_mortality['Country'] == 'Montenegro']

Unnamed: 0,Country,Year,Maternal mortality ratio (per 100 000 live births),Number of maternal deaths
1962,Montenegro,2017,6 [ 3 - 10 ],[ - 1 ]
1963,Montenegro,2016,6 [ 3 - 10 ],[ - 1 ]
1964,Montenegro,2015,6 [ 3 - 10 ],[ - 1 ]
1965,Montenegro,2014,6 [ 3 - 11 ],[ - 1 ]
1966,Montenegro,2013,6 [ 4 - 11 ],[ - 1 ]
1967,Montenegro,2012,7 [ 4 - 11 ],1 [ - 1 ]
1968,Montenegro,2011,7 [ 4 - 12 ],1 [ - 1 ]
1969,Montenegro,2010,7 [ 4 - 12 ],1 [ - 1 ]
1970,Montenegro,2009,8 [ 4 - 13 ],1 [ - 1 ]
1971,Montenegro,2008,8 [ 4 - 13 ],1 [ - 1 ]


We can impute the number of maternal death as 0 since mortality ratio super low indicating that the risk of dying by after giving birth is low. 

In [39]:
# rename columns to lowercase
maternal_mortality.columns = ['country','year','maternal_ratio','num_maternaldeath']

In [40]:
# change maternal ratio to integers taking only numbers outside of square brackets 
maternal_mortality['maternal_ratio'] = maternal_mortality['maternal_ratio'].apply(lambda x: int(x.split()[0]))

In [41]:
# change maternal death taking only numbers outside of square brackets 
maternal_mortality['num_maternaldeath'] = maternal_mortality['num_maternaldeath'].apply(lambda x:  \
        # if two elements after split are digits, returns first 2 numbers, combine and convert to integer  
        int(''.join(x.split()[:2])) if (x.split()[1].isdigit() and x.split()[0].isdigit())\

        # impute 0 if first element starts with '['
        else 0 if (x.split()[0] == '[')\

        # returns first number if first two elements after split is NOT digits (this applies for numbers below 1000)
        else int(x.split()[0]))

In [42]:
# Check if has anymore [-]
maternal_mortality.groupby('num_maternaldeath').count()

Unnamed: 0_level_0,country,year,maternal_ratio
num_maternaldeath,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,62,62,62
1,123,123,123
2,135,135,135
3,155,155,155
4,125,125,125
...,...,...,...
84000,1,1,1
89000,1,1,1
94000,1,1,1
99000,1,1,1


In [43]:
# check if values imputed correctly 
maternal_mortality[maternal_mortality.country == 'Montenegro']

Unnamed: 0,country,year,maternal_ratio,num_maternaldeath
1962,Montenegro,2017,6,0
1963,Montenegro,2016,6,0
1964,Montenegro,2015,6,0
1965,Montenegro,2014,6,0
1966,Montenegro,2013,6,0
1967,Montenegro,2012,7,1
1968,Montenegro,2011,7,1
1969,Montenegro,2010,7,1
1970,Montenegro,2009,8,1
1971,Montenegro,2008,8,1


In [44]:
maternal_mortality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3294 entries, 0 to 3293
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   country            3294 non-null   object
 1   year               3294 non-null   int64 
 2   maternal_ratio     3294 non-null   int64 
 3   num_maternaldeath  3294 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 103.1+ KB


In [45]:
maternal_mortality

Unnamed: 0,country,year,maternal_ratio,num_maternaldeath
0,Afghanistan,2017,638,7700
1,Afghanistan,2016,673,8100
2,Afghanistan,2015,701,8400
3,Afghanistan,2014,786,9300
4,Afghanistan,2013,810,9600
...,...,...,...,...
3289,Zimbabwe,2004,686,2800
3290,Zimbabwe,2003,680,2700
3291,Zimbabwe,2002,666,2600
3292,Zimbabwe,2001,629,2400


## Check duplicates

In [46]:
data_list = [adult_mort_comp, num_infantDeath_df, num_underfivedeath_df, prob_infantDeath_df, prob_underfivedeath_df]

In [47]:
adult_mort_comp.index[-1] + 1

# checking for data_list
print('Checking duplicates from data_list...')
for data in data_list:
    checker = data.index[-1] + 1
    last_row_checker = data.drop_duplicates(subset=['country', 'year','sex']).index[-1] + 1
    if last_row_checker - checker == 0:
        print('No duplicates')
    else:
        print('Have duplicates')

print('\nChecking for Maternal Mortality...')
#checking for maternal mortality
checker = maternal_mortality.index[-1] + 1
last_row_checker = maternal_mortality.drop_duplicates(subset=['country', 'year']).index[-1] + 1
if last_row_checker - checker == 0:
        print('No duplicates')
else:
    print('Have duplicates')

Checking duplicates from data_list...
No duplicates
No duplicates
No duplicates
No duplicates
No duplicates

Checking for Maternal Mortality...
No duplicates


# Combine dataset

This portion onwards will consider composite primary keys and primary keys 

I choose to combine everything together via `outer join` in order to sieve out missingness and include all into database.

Once a master combined file is ready, I will proceed to normalise the tables

In [52]:
# using indicator to check missingness from both ends 
df_temp = pd.merge(adult_mort_comp,num_underfivedeath_df, how='outer', left_on=['country', 'year', 'sex'], right_on=['country', 'year', 'sex'], indicator = True)
df_temp

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,_merge
0,Afghanistan,2016,Both sexes,245.0,79770.0,both
1,Afghanistan,2015,Both sexes,233.0,82918.0,both
2,Afghanistan,2014,Both sexes,234.0,86378.0,both
3,Afghanistan,2013,Both sexes,235.0,90103.0,both
4,Afghanistan,2012,Both sexes,242.0,,left_only
...,...,...,...,...,...,...
10624,Yemen,2017,Female,,21194.0,right_only
10625,Zambia,2018,Female,,16012.0,right_only
10626,Zambia,2017,Female,,16232.0,right_only
10627,Zimbabwe,2018,Female,,9281.0,right_only


In [53]:
df_temp[df_temp._merge == 'right_only'].head(5)

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,_merge
9333,Afghanistan,2018,Both sexes,,74278.0,right_only
9334,Afghanistan,2017,Both sexes,,76877.0,right_only
9335,Albania,2018,Both sexes,,302.0,right_only
9336,Albania,2017,Both sexes,,313.0,right_only
9337,Algeria,2018,Both sexes,,23950.0,right_only


From the column: `adult_mortality`, we can see that there are missing values despite having alot of data initially. 

Hence there are some missing composite primary keys `(country, year, sex)`. I will be continue using outer join for all dataframe and insert into database

In [54]:
outer_join_df = pd.merge(adult_mort_comp, num_underfivedeath_df, on = ['country', 'year', 'sex'], how = 'outer')
outer_join_df

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath
0,Afghanistan,2016,Both sexes,245.0,79770.0
1,Afghanistan,2015,Both sexes,233.0,82918.0
2,Afghanistan,2014,Both sexes,234.0,86378.0
3,Afghanistan,2013,Both sexes,235.0,90103.0
4,Afghanistan,2012,Both sexes,242.0,
...,...,...,...,...,...
10624,Yemen,2017,Female,,21194.0
10625,Zambia,2018,Female,,16012.0
10626,Zambia,2017,Female,,16232.0
10627,Zimbabwe,2018,Female,,9281.0


In [55]:
outer_join_df = pd.merge(outer_join_df, num_infantDeath_df, on = ['country', 'year', 'sex'], how = 'outer')
outer_join_df

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,no_infantdeath
0,Afghanistan,2016,Both sexes,245.0,79770.0,60673.0
1,Afghanistan,2015,Both sexes,233.0,82918.0,62652.0
2,Afghanistan,2014,Both sexes,234.0,86378.0,64808.0
3,Afghanistan,2013,Both sexes,235.0,90103.0,67154.0
4,Afghanistan,2012,Both sexes,242.0,,
...,...,...,...,...,...,...
10624,Yemen,2017,Female,,21194.0,16244.0
10625,Zambia,2018,Female,,16012.0,11170.0
10626,Zambia,2017,Female,,16232.0,11293.0
10627,Zimbabwe,2018,Female,,9281.0,6574.0


In [56]:
outer_join_df = pd.merge(outer_join_df, num_neonatal_df, on = ['country', 'year', 'sex'], how = 'outer')
outer_join_df

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,no_infantdeath,no_neonataldeath
0,Afghanistan,2016,Both sexes,245.0,79770.0,60673.0,46963.0
1,Afghanistan,2015,Both sexes,233.0,82918.0,62652.0,48237.0
2,Afghanistan,2014,Both sexes,234.0,86378.0,64808.0,49715.0
3,Afghanistan,2013,Both sexes,235.0,90103.0,67154.0,51219.0
4,Afghanistan,2012,Both sexes,242.0,,,
...,...,...,...,...,...,...,...
10624,Yemen,2017,Female,,21194.0,16244.0,
10625,Zambia,2018,Female,,16012.0,11170.0,
10626,Zambia,2017,Female,,16232.0,11293.0,
10627,Zimbabwe,2018,Female,,9281.0,6574.0,


In [57]:
outer_join_df = pd.merge(outer_join_df, prob_infantDeath_df, on = ['country', 'year', 'sex'], how = 'outer')
outer_join_df

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,no_infantdeath,no_neonataldeath,prob_infantdeath
0,Afghanistan,2016,Both sexes,245.0,79770.0,60673.0,46963.0,51.2
1,Afghanistan,2015,Both sexes,233.0,82918.0,62652.0,48237.0,53.1
2,Afghanistan,2014,Both sexes,234.0,86378.0,64808.0,49715.0,55.1
3,Afghanistan,2013,Both sexes,235.0,90103.0,67154.0,51219.0,57.3
4,Afghanistan,2012,Both sexes,242.0,,,,
...,...,...,...,...,...,...,...,...
10624,Yemen,2017,Female,,21194.0,16244.0,,39.1
10625,Zambia,2018,Female,,16012.0,11170.0,,36.6
10626,Zambia,2017,Female,,16232.0,11293.0,,37.5
10627,Zimbabwe,2018,Female,,9281.0,6574.0,,29.9


In [58]:
outer_join_df = pd.merge(outer_join_df, prob_neonatal_df, on = ['country', 'year', 'sex'], how = 'outer')
outer_join_df

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,no_infantdeath,no_neonataldeath,prob_infantdeath,prob_neonataldeath
0,Afghanistan,2016,Both sexes,245.0,79770.0,60673.0,46963.0,51.2,39.3
1,Afghanistan,2015,Both sexes,233.0,82918.0,62652.0,48237.0,53.1,40.5
2,Afghanistan,2014,Both sexes,234.0,86378.0,64808.0,49715.0,55.1,41.9
3,Afghanistan,2013,Both sexes,235.0,90103.0,67154.0,51219.0,57.3,43.3
4,Afghanistan,2012,Both sexes,242.0,,,,,
...,...,...,...,...,...,...,...,...,...
10624,Yemen,2017,Female,,21194.0,16244.0,,39.1,
10625,Zambia,2018,Female,,16012.0,11170.0,,36.6,
10626,Zambia,2017,Female,,16232.0,11293.0,,37.5,
10627,Zimbabwe,2018,Female,,9281.0,6574.0,,29.9,


In [59]:
outer_join_df = pd.merge(outer_join_df, prob_underfivedeath_df, on = ['country', 'year', 'sex'], how = 'outer')
outer_join_df

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,no_infantdeath,no_neonataldeath,prob_infantdeath,prob_neonataldeath,prob_underfivedeath
0,Afghanistan,2016,Both sexes,245.0,79770.0,60673.0,46963.0,51.2,39.3,67.5
1,Afghanistan,2015,Both sexes,233.0,82918.0,62652.0,48237.0,53.1,40.5,70.4
2,Afghanistan,2014,Both sexes,234.0,86378.0,64808.0,49715.0,55.1,41.9,73.6
3,Afghanistan,2013,Both sexes,235.0,90103.0,67154.0,51219.0,57.3,43.3,76.9
4,Afghanistan,2012,Both sexes,242.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...
10624,Yemen,2017,Female,,21194.0,16244.0,,39.1,,51.2
10625,Zambia,2018,Female,,16012.0,11170.0,,36.6,,52.9
10626,Zambia,2017,Female,,16232.0,11293.0,,37.5,,54.4
10627,Zimbabwe,2018,Female,,9281.0,6574.0,,29.9,,41.7


In [60]:
outer_join_df = pd.merge(outer_join_df, maternal_mortality, on = ['country', 'year'], how = 'outer')
outer_join_df

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,no_infantdeath,no_neonataldeath,prob_infantdeath,prob_neonataldeath,prob_underfivedeath,maternal_ratio,num_maternaldeath
0,Afghanistan,2016,Both sexes,245.0,79770.0,60673.0,46963.0,51.2,39.3,67.5,673.0,8100.0
1,Afghanistan,2016,Male,272.0,43134.0,33222.0,,54.5,,70.9,673.0,8100.0
2,Afghanistan,2016,Female,216.0,36636.0,27451.0,,47.7,,63.7,673.0,8100.0
3,Afghanistan,2015,Both sexes,233.0,82918.0,62652.0,48237.0,53.1,40.5,70.4,701.0,8400.0
4,Afghanistan,2015,Male,254.0,44733.0,34257.0,,56.5,,73.8,701.0,8400.0
...,...,...,...,...,...,...,...,...,...,...,...,...
10624,Zimbabwe,2018,Male,,11475.0,8469.0,,37.8,,50.6,,
10625,Zimbabwe,2018,Female,,9281.0,6574.0,,29.9,,41.7,,
10626,Zimbabwe,2017,Both sexes,,22519.0,16015.0,9696.0,35.4,21.5,49.3,458.0,2100.0
10627,Zimbabwe,2017,Male,,12419.0,9010.0,,39.4,,53.8,458.0,2100.0


## Add column for continents 

In [61]:
continents_df = pd.read_csv('../data/country_continents.csv')
continents_df.columns = ['continents', 'country']
continents_df

Unnamed: 0,continents,country
0,Africa,Algeria
1,Africa,Angola
2,Africa,Benin
3,Africa,Botswana
4,Africa,Burkina
...,...,...
214,South America,Suriname
215,South America,Uruguay
216,South America,Venezuela
217,South America,Bolivia (Plurinational State of)


In [62]:
outer_join_df = pd.merge(outer_join_df, continents_df, on = ['country'], how = 'left')
outer_join_df

Unnamed: 0,country,year,sex,adult_mortality,no_underfivedeath,no_infantdeath,no_neonataldeath,prob_infantdeath,prob_neonataldeath,prob_underfivedeath,maternal_ratio,num_maternaldeath,continents
0,Afghanistan,2016,Both sexes,245.0,79770.0,60673.0,46963.0,51.2,39.3,67.5,673.0,8100.0,Asia
1,Afghanistan,2016,Male,272.0,43134.0,33222.0,,54.5,,70.9,673.0,8100.0,Asia
2,Afghanistan,2016,Female,216.0,36636.0,27451.0,,47.7,,63.7,673.0,8100.0,Asia
3,Afghanistan,2015,Both sexes,233.0,82918.0,62652.0,48237.0,53.1,40.5,70.4,701.0,8400.0,Asia
4,Afghanistan,2015,Male,254.0,44733.0,34257.0,,56.5,,73.8,701.0,8400.0,Asia
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10624,Zimbabwe,2018,Male,,11475.0,8469.0,,37.8,,50.6,,,Africa
10625,Zimbabwe,2018,Female,,9281.0,6574.0,,29.9,,41.7,,,Africa
10626,Zimbabwe,2017,Both sexes,,22519.0,16015.0,9696.0,35.4,21.5,49.3,458.0,2100.0,Africa
10627,Zimbabwe,2017,Male,,12419.0,9010.0,,39.4,,53.8,458.0,2100.0,Africa


This can be our main database. 

No additional rows when outer joining the rest of the tables together. 

# Export Normalised Data to SqlLite3

We will be splitting by tables and save as new csv

In [63]:
# define function that creates dataframe only for each column from main database and save csv file 
# this function only deals with tables that needs country, continents, sex 

def auto_save_percolumn(df, list_columns, subset_col, name_file, drop_dup = None):
    # create separate dataframe and drop na based on main value/key
    df_new = df[list_columns].dropna(subset = [subset_col])
    # if need to drop duplicates;
    if drop_dup != None:
        df_new.drop_duplicates(subset = drop_dup, inplace = True)
    # sort based on subset_col
    df_new = df_new.sort_values(by = df_new.columns[0]).reset_index(drop = True)
    # create new string 
    string_csv = '../new_csv/' + name_file + '.csv'
    # save to csv
    df_new.to_csv(string_csv, index = False)

    return df_new

## Table #1 : ***`country`***

Columns
1) country - PRIMARY KEY, `TEXT`
2) continents - PRIMARY KEY, `TEXT`

In [64]:
auto_save_percolumn(outer_join_df, list_columns = ['country','continents'], subset_col = 'country', name_file = 'country', drop_dup = 'country')

Unnamed: 0,country,continents
0,Afghanistan,Asia
1,Albania,Europe
2,Algeria,Africa
3,Andorra,Europe
4,Angola,Africa
...,...,...
189,Venezuela (Bolivarian Republic of),South America
190,Viet Nam,Asia
191,Yemen,Asia
192,Zambia,Africa


## Table #2 : ***`year`***

Columns:
1) year - PRIMARY KEY, `INT`

In [65]:
auto_save_percolumn(outer_join_df, list_columns= ['year'], subset_col='year', name_file= 'year', drop_dup = 'year')

Unnamed: 0,year
0,2000
1,2001
2,2002
3,2003
4,2004
5,2005
6,2006
7,2007
8,2008
9,2009


## Table #3 : ***`sex`***

Columns:
1) sex - PRIMARY KEY, `TEXT`

In [66]:
auto_save_percolumn(outer_join_df, list_columns= ['sex'], subset_col='sex', name_file= 'sex', drop_dup = 'sex')

Unnamed: 0,sex
0,Both sexes
1,Female
2,Male


## Table #4 : ***`adult_mortality`***

Contents
1) country - `composite key`, `TEXT`
2) continents - `composite key`, `TEXT`
3) sex - `composite key`, `TEXT`
4) adult_mortality - `FLOAT`

In [67]:
auto_save_percolumn(outer_join_df, list_columns= ['country','year','sex','adult_mortality'], subset_col='adult_mortality', name_file= 'adult_mortality')

Unnamed: 0,country,year,sex,adult_mortality
0,Afghanistan,2016,Both sexes,245.0
1,Afghanistan,2007,Male,294.0
2,Afghanistan,2007,Female,250.0
3,Afghanistan,2006,Both sexes,276.0
4,Afghanistan,2006,Male,296.0
...,...,...,...,...
9328,Zimbabwe,2010,Female,474.0
9329,Zimbabwe,2009,Both sexes,554.0
9330,Zimbabwe,2009,Male,586.0
9331,Zimbabwe,2008,Both sexes,596.0


## Table #5 : ***`no_infantdeath`***

Contents
1) country - `composite key`, `TEXT`
2) continents - `composite key`, `TEXT`
3) sex - `composite key`, `TEXT`
4) no_infantdeath - `FLOAT`

In [68]:
auto_save_percolumn(outer_join_df, list_columns= ['country','year','sex','no_infantdeath'], subset_col='no_infantdeath', name_file= 'no_infantdeath')

Unnamed: 0,country,year,sex,no_infantdeath
0,Afghanistan,2016,Both sexes,60673.0
1,Afghanistan,2018,Female,25788.0
2,Afghanistan,2017,Both sexes,58846.0
3,Afghanistan,2017,Male,32244.0
4,Afghanistan,2017,Female,26602.0
...,...,...,...,...
3487,Zimbabwe,2014,Male,10707.0
3488,Zimbabwe,2014,Female,8344.0
3489,Zimbabwe,2013,Both sexes,20265.0
3490,Zimbabwe,2017,Male,9010.0


## Table #6 : ***`no_neonataldeath`***

Contents
1) country - `composite key`, `TEXT`
2) continents - `composite key`, `TEXT`
3) sex - `composite key`, `TEXT`
4) no_neonataldeath - `FLOAT`

In [69]:
auto_save_percolumn(outer_join_df, list_columns= ['country','year','sex','no_neonataldeath'], subset_col='no_neonataldeath', name_file= 'no_neonataldeath')

Unnamed: 0,country,year,sex,no_neonataldeath
0,Afghanistan,2016,Both sexes,46963.0
1,Afghanistan,2017,Both sexes,45771.0
2,Afghanistan,2013,Both sexes,51219.0
3,Afghanistan,2018,Both sexes,44725.0
4,Afghanistan,2015,Both sexes,48237.0
...,...,...,...,...
1159,Zimbabwe,2013,Both sexes,12063.0
1160,Zimbabwe,2016,Both sexes,10235.0
1161,Zimbabwe,2015,Both sexes,10815.0
1162,Zimbabwe,2018,Both sexes,9241.0


## Table #7 : ***`no_underfivedeath`***

Contents
1) country - `composite key`, `TEXT`
2) continents - `composite key`, `TEXT`
3) sex - `composite key`, `TEXT`
4) no_underfivedeath - `FLOAT`

In [70]:
auto_save_percolumn(outer_join_df, list_columns= ['country','year','sex','no_underfivedeath'], subset_col='no_underfivedeath', name_file= 'no_underfivedeath')

Unnamed: 0,country,year,sex,no_underfivedeath
0,Afghanistan,2016,Both sexes,79770.0
1,Afghanistan,2018,Female,33966.0
2,Afghanistan,2017,Both sexes,76877.0
3,Afghanistan,2017,Male,41631.0
4,Afghanistan,2017,Female,35246.0
...,...,...,...,...
3487,Zimbabwe,2014,Male,14866.0
3488,Zimbabwe,2014,Female,12192.0
3489,Zimbabwe,2013,Both sexes,29200.0
3490,Zimbabwe,2017,Male,12419.0


## Table #8 : ***`prob_infantdeath`***

Contents
1) country - `composite key`, `TEXT`
2) continents - `composite key`, `TEXT`
3) sex - `composite key`, `TEXT`
4) prob_infantdeath - `FLOAT`

In [71]:
auto_save_percolumn(outer_join_df, list_columns= ['country','year','sex','prob_infantdeath'], subset_col='prob_infantdeath', name_file= 'prob_infantdeath')

Unnamed: 0,country,year,sex,prob_infantdeath
0,Afghanistan,2016,Both sexes,51.2
1,Afghanistan,2018,Female,44.5
2,Afghanistan,2017,Both sexes,49.5
3,Afghanistan,2017,Male,52.7
4,Afghanistan,2017,Female,46.0
...,...,...,...,...
3487,Zimbabwe,2014,Male,44.9
3488,Zimbabwe,2014,Female,35.6
3489,Zimbabwe,2013,Both sexes,42.8
3490,Zimbabwe,2017,Male,39.4


## Table #9 : ***`prob_neonataldeath`***

Contents
1) country - `composite key`, `TEXT`
2) continents - `composite key`, `TEXT`
3) sex - `composite key`, `TEXT`
4) prob_neonataldeath - `FLOAT`

In [72]:
auto_save_percolumn(outer_join_df, list_columns= ['country','year','sex','prob_neonataldeath'], subset_col='prob_neonataldeath', name_file= 'prob_neonataldeath')

Unnamed: 0,country,year,sex,prob_neonataldeath
0,Afghanistan,2016,Both sexes,39.3
1,Afghanistan,2017,Both sexes,38.1
2,Afghanistan,2013,Both sexes,43.3
3,Afghanistan,2018,Both sexes,37.1
4,Afghanistan,2015,Both sexes,40.5
...,...,...,...,...
1159,Zimbabwe,2013,Both sexes,25.3
1160,Zimbabwe,2016,Both sexes,22.3
1161,Zimbabwe,2015,Both sexes,23.1
1162,Zimbabwe,2018,Both sexes,20.9


## Table #10 : ***`prob_underfivedeath`***

Contents
1) country - `composite key`, `TEXT`
2) continents - `composite key`, `TEXT`
3) sex - `composite key`, `TEXT`
4) prob_underfivedeath - `FLOAT`

In [73]:
auto_save_percolumn(outer_join_df, list_columns= ['country','year','sex','prob_underfivedeath'], subset_col='prob_underfivedeath', name_file= 'prob_underfivedeath')

Unnamed: 0,country,year,sex,prob_underfivedeath
0,Afghanistan,2016,Both sexes,67.5
1,Afghanistan,2018,Female,58.7
2,Afghanistan,2017,Both sexes,64.7
3,Afghanistan,2017,Male,68.1
4,Afghanistan,2017,Female,61.1
...,...,...,...,...
3487,Zimbabwe,2014,Male,62.5
3488,Zimbabwe,2014,Female,52.3
3489,Zimbabwe,2013,Both sexes,62.3
3490,Zimbabwe,2017,Male,53.8


## Table #11 : ***`maternal_mortality`***

Contents
1) country - `composite key`, `TEXT`
2) continents - `composite key`, `TEXT`
3) maternal_ratio - `FLOAT`
4) num_maternaldeath - `FLOAT`

In [74]:
auto_save_percolumn(outer_join_df, list_columns= ['country','year','maternal_ratio', 'num_maternaldeath'], subset_col='num_maternaldeath', name_file= 'maternal_mortality', drop_dup=['country','year'])

Unnamed: 0,country,year,maternal_ratio,num_maternaldeath
0,Afghanistan,2016,673.0,8100.0
1,Afghanistan,2000,1.0,15000.0
2,Afghanistan,2001,1.0,15000.0
3,Afghanistan,2002,1.0,14000.0
4,Afghanistan,2003,1.0,14000.0
...,...,...,...,...
3289,Zimbabwe,2013,509.0,2400.0
3290,Zimbabwe,2014,494.0,2300.0
3291,Zimbabwe,2015,480.0,2200.0
3292,Zimbabwe,2009,632.0,2900.0


# Basic Understanding of metrics 

**Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)**

- Important indicator for checking developing countries due to ageing and health transitions
- ***Definition:***
Probability that a 15 year old person will die before reaching his/her 60th birthday. The probability of dying between the ages of 15 and 60 years (per 1 000 population) per year among a hypothetical cohort of 100 000 people that would experience the age-specific mortality rate of the reporting year ([WHO](https://www.who.int/data/gho/indicator-metadata-registry/imr-details/64))

- ***Age-specific mortality rate:*** Mortality rate limited to a particular age group. The numerator is the number of deaths in that age group; the denominator is the number of persons in that age group in the population ([CDC](https://www.cdc.gov/csels/dsepd/ss1978/lesson3/section3.html#:~:text=An%20age%2Dspecific%20mortality%20rate,age%20group%20in%20the%20population.))


**Maternal Mortality Ratio**

***Definition:***
The maternal mortality ratio (MMR) is defined as the number of maternal deaths during a given time period per 100,000 live births during the same time period. It depicts the risk of maternal death relative to the number of live births and essentially captures the risk of death in a single pregnancy or a single live birth.

For the purpose of international reporting of maternal mortality, only those maternal deaths occurring before the end of the 42-day reference period should be included in the calculation of the various ratios and rates.
([WHO](https://www.who.int/data/gho/indicator-metadata-registry/imr-details/64))


**Child Mortality**

- This metric is key on ending all forms of malnutrition, as malnutrition is a frequent contributing cause of death for under-5 children.

***Background***: 
Globally, infectious diseases, including pneumonia, diarrhoea and malaria, along with pre-term birth complications, birth asphyxia and trauma and congenital anomalies remain the leading causes of death for children under 5 years

Access to basic lifesaving interventions such as skilled delivery at birth, postnatal care, breastfeeding and adequate nutrition, vaccinations and treatment for common childhood diseases can save many young lives ([WHO](https://www.who.int/data/gho/indicator-metadata-registry/imr-details/64))

**Under-five mortality rate (probability of dying by age 5 per 1000 live births)**

- Essentially it measures the child survival, reflecting social, economic and environmental conditions. 
