# Global Life Expectancy Study

**What this notebook shows**
- Interactive visualization with Plotly
- Exploratory visualization with Seaborn
- Modeling with scikit-learn

**Data**
- Local files: data/Life_Expectancy_Data.csv


##### Focus Area 0: Import Libraries and Read Data

In [5]:
# import libraries
import pandas as pd
import seaborn as sns

In [6]:
# read the data file and create a Pandas dataframe
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')

##### Focus Area 1: Get Simple Statistics of the Dataset

**Exploration.** Get the size of the dataframe and simple statistics of the dataset, such as count, max, min, std, and mean.

In [7]:
print(df.shape)
print(df.size)
df.describe(include='all')

(2938, 21)
61698


Unnamed: 0,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,2938.0,2938,2928.0,2928.0,2938.0,2744.0,2938.0,2385.0,2938.0,2904.0,...,2919.0,2712.0,2919.0,2938.0,2490.0,2286.0,2904.0,2904.0,2771.0,2775.0
unique,,2,,,,,,,,,...,,,,,,,,,,
top,,Developing,,,,,,,,,...,,,,,,,,,,
freq,,2426,,,,,,,,,...,,,,,,,,,,
mean,2007.51872,,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,...,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,,9.523867,124.292079,117.926501,4.052413,1987.914858,25.070016,11467.272489,20.044034,...,23.428046,2.49832,23.716912,5.077785,14270.169342,61012100.0,4.420195,4.508882,0.210904,3.35892
min,2000.0,,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,...,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,,63.1,74.0,0.0,0.8775,4.685343,77.0,0.0,19.3,...,78.0,4.26,78.0,0.1,463.935626,195793.2,1.6,1.5,0.493,10.1
50%,2008.0,,72.1,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,...,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3
75%,2012.0,,75.7,228.0,22.0,7.7025,441.534144,97.0,360.25,56.2,...,97.0,7.4925,97.0,0.8,5910.806335,7420359.0,7.2,7.2,0.779,14.3


**Exploration.** Get the first 8 rows of the dataset

In [8]:
df.head(8)

Unnamed: 0,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,19.1,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,18.6,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,18.1,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,17.6,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,17.2,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
5,2010,Developing,58.8,279.0,74,0.01,79.679367,66.0,1989,16.7,...,66.0,9.2,66.0,0.1,553.32894,2883167.0,18.4,18.4,0.448,9.2
6,2009,Developing,58.6,281.0,77,0.01,56.762217,63.0,2861,16.2,...,63.0,9.42,63.0,0.1,445.893298,284331.0,18.6,18.7,0.434,8.9
7,2008,Developing,58.1,287.0,80,0.03,25.873925,64.0,1599,15.7,...,64.0,8.33,64.0,0.1,373.361116,2729431.0,18.8,18.9,0.433,8.7


**Exploration.** Get the last 6 rows of the dataset

In [9]:
df.tail(6)

Unnamed: 0,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
2932,2005,Developing,44.6,717.0,28,4.14,8.717409,65.0,420,27.5,...,69.0,6.44,68.0,30.3,444.76575,129432.0,9.0,9.0,0.406,9.3
2933,2004,Developing,44.3,723.0,27,4.36,0.0,68.0,31,27.1,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,2003,Developing,44.5,715.0,26,4.06,0.0,7.0,998,26.7,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,2002,Developing,44.8,73.0,25,4.43,0.0,73.0,304,26.3,...,73.0,6.53,71.0,39.8,57.34834,125525.0,1.2,1.3,0.427,10.0
2936,2001,Developing,45.3,686.0,25,1.72,0.0,76.0,529,25.9,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8
2937,2000,Developing,46.0,665.0,24,1.68,0.0,79.0,1483,25.5,...,78.0,7.1,78.0,43.5,547.358878,12222251.0,11.0,11.2,0.434,9.8


**Exploration.** Get the columns of the dataset and their types

In [10]:
df.columns, df.dtypes

(Index(['Year', 'Status', 'Life expectancy', 'Adult Mortality', 'infant deaths',
        'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ',
        'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ',
        ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years',
        ' thinness 5-9 years', 'Income composition of resources', 'Schooling'],
       dtype='object'),
 Year                                 int64
 Status                              object
 Life expectancy                    float64
 Adult Mortality                    float64
 infant deaths                        int64
 Alcohol                            float64
 percentage expenditure             float64
 Hepatitis B                        float64
 Measles                              int64
  BMI                               float64
 under-five deaths                    int64
 Polio                              float64
 Total expenditure                  float64
 Diphtheria         
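The column labels printed above carry stray leading and trailing spaces (for example `' BMI '` and `'Measles '`), so every later lookup has to reproduce them exactly. One possible cleanup is sketched below on a toy frame. The names `toy`, `clean_names`, and `toy_clean` are illustrative and not part of the notebook, which keeps the original labels.

```python
import pandas as pd

# Toy frame whose headers mimic the padded names shown in the output above
toy = pd.DataFrame({
    ' BMI ': [19.1],
    'Measles ': [1154],
    ' thinness  1-19 years': [17.2],
})

# Trim outer whitespace, then collapse runs of inner whitespace to one space
clean_names = toy.columns.str.strip().str.replace(r'\s+', ' ', regex=True)

# set_axis returns a relabelled copy and leaves `toy` untouched
toy_clean = toy.set_axis(clean_names, axis=1)

print(list(toy_clean.columns))
```

If this were applied to the real dataframe, any later cell that refers to a padded name such as `' BMI '` would need the trimmed label instead.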

##### Focus Area 2: Work with Missing Data

**Exploration.** Locate rows that have Null values by running df.isnull()


In [11]:
df.isnull()
#df[df.isnull().any(axis=1)]

Unnamed: 0,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2934,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2935,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2936,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


**Exploration.** Find the total number of missing elements per column

In [12]:
df.isnull().sum()

Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64
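Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (`toy` is an illustrative name, not the notebook's data) to show that `isnull().mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Made-up values, only to illustrate the calculation
toy = pd.DataFrame({
    'GDP': [584.3, np.nan, 631.7, np.nan],
    'Schooling': [10.1, 10.0, np.nan, 9.8],
    'Year': [2015, 2014, 2013, 2012],
})

# The mean of a boolean mask is the fraction of True values, so this is
# the percentage of missing entries in each column, largest first
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```

The same expression applied to `df` would rank the columns listed above by how incomplete they are.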

**Exploration.** Make a heatmap of the missing values using plotly (https://plotly.com/python-api-reference/generated/plotly.express.imshow.html). Make sure to add labels for the x and y axis.

In [13]:
import plotly.graph_objects as go

missing_mask = df.isnull().astype(int)

# graph the heatmap
fig = go.Figure(
    data=go.Heatmap(
        z=missing_mask.values,
        x=missing_mask.columns,
        y=missing_mask.index,
        showscale=False
    )
)

# label the axes appropriately
fig.update_layout(
    title="Missing Values Heatmap",
    xaxis_title="Columns",
    yaxis_title="Rows"
)

# hide the color axes (given to you below)
fig.update_coloraxes(showscale=False)

# show the graph
fig.show()

**Exploration.** Pick two columns in the dataset which have correlating missing values (i.e. values which are missing in the same row for different columns). Why do you think correlated missing values occurred for these columns in particular?

Two columns in the dataset with correlated missing values are ' BMI ' and ' thinness  1-19 years'; their missing-value indicators have a correlation of 1.0. I think correlated missing values happened for these columns in particular because they were collected together in a single survey, such as a health survey, so if a certain country doesn't run the survey, both fields will be missing.

In [14]:
null_corr = df.isnull().astype(int).corr()
pairs = (null_corr.unstack().rename('corr').reset_index()
         .rename(columns={'level_0':'col1','level_1':'col2'}))
pairs = pairs[pairs.col1 < pairs.col2].sort_values('corr', ascending=False)
pairs.head(10)

Unnamed: 0,col1,col2,corr
65,Adult Mortality,Life expectancy,1.0
206,BMI,thinness 1-19 years,1.0
207,BMI,thinness 5-9 years,1.0
375,thinness 1-19 years,thinness 5-9 years,1.0
284,Diphtheria,Polio,1.0
419,Income composition of resources,Schooling,0.987239
117,Alcohol,Total expenditure,0.895367
331,GDP,Population,0.744116
335,GDP,Schooling,0.558969
334,GDP,Income composition of resources,0.554228
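A correlation of 1.0 between two missing-value indicators means the two columns are missing in exactly the same rows. The toy example below (made-up values, illustrative names `toy` and `null_corr_toy`) shows this for columns `a` and `b`, with a third column `c` that is missing in a different row.

```python
import numpy as np
import pandas as pd

# 'a' and 'b' are missing in the same rows (1 and 3); 'c' is missing in row 0
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan, 5.0],
    'b': [9.0, np.nan, 7.0, np.nan, 5.0],
    'c': [np.nan, 2.0, 2.0, 2.0, 2.0],
})

# Pearson correlation between the 0/1 missing-value indicators
null_corr_toy = toy.isnull().astype(int).corr()

print(null_corr_toy)
```

Here `a` and `b` correlate perfectly, while `a` and `c` correlate negatively because they are never missing together.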


**Exploration.** Wellness can be defined in practical terms as a holistic integration of physical, mental, and spiritual well-being. Let's say you're trying to make predictions of the wellness of a person in this dataset who has some missing datapoints. Do you think the missing values in this dataset affect the quality of your wellness predictions? Why or why not?

Yes, I think the missing values affect the quality of wellness predictions. Values often go missing together across several important health features, so if those rows are excluded, the data will be biased toward well-reported areas, which lowers accuracy and predictive strength.

**Exploration.** Drop any row that contains a Null value. Reset the index after the drop.

In [15]:
df_clean = df.dropna().reset_index(drop=True)

Dropping any row that has a single missing value is extreme. Let us say that we were going to study the relationship between 'Life expectancy' and 'GDP' and 'Schooling'. Our expectation is that there will be a higher life expectancy with higher GDP and higher schooling.

In [None]:
# reread the raw data again
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')


**Exploration.** Drop rows that are missing any of 'Life expectancy', 'GDP', or 'Schooling'. These data are essential to our study.

In [None]:
cols = ['Life expectancy', 'GDP', 'Schooling']
df_study = df.dropna(subset=cols).reset_index(drop=True)
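To see why restricting `dropna` to the study columns is less drastic, the sketch below counts the surviving rows both ways on a made-up frame. `toy`, `rows_all`, and `rows_subset` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up values; 'Population' is not needed for the study
toy = pd.DataFrame({
    'Life expectancy': [65.0, 59.9, np.nan, 59.5],
    'GDP': [584.3, np.nan, 631.7, 670.0],
    'Schooling': [10.1, 10.0, 9.9, 9.8],
    'Population': [np.nan, 327582.0, 31731688.0, 3696958.0],
})

study_cols = ['Life expectancy', 'GDP', 'Schooling']

# Drop a row if ANY column is missing
rows_all = len(toy.dropna())

# Drop a row only if one of the study columns is missing
rows_subset = len(toy.dropna(subset=study_cols))

print(rows_all, rows_subset)
```

The first row survives only the subset version, because its single gap is in a column the study does not use.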

**Exploration.** Check that there are no missing values for the columns 'Life expectancy', 'GDP', and 'Schooling'.

In [None]:
cols = ['Life expectancy', 'GDP', 'Schooling']
df_study[cols].isna().sum()  # should all be 0

Life expectancy    0
GDP                0
Schooling          0
dtype: int64


An alternative to dropping the missing values would be to replace the missing values with the median for that column since the median is not that sensitive to outliers.
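The robustness claim can be checked on a few numbers. In the sketch below (made-up values, illustrative names), one very large entry drags the mean far from the typical value while the median stays put, and the median is then used to fill the gap.

```python
import numpy as np
import pandas as pd

# Three ordinary values, one missing value, one extreme value
gdp = pd.Series([500.0, 600.0, 700.0, np.nan, 100000.0])

# Both statistics skip NaN by default
gdp_mean = gdp.mean()      # pulled upward by the extreme value
gdp_median = gdp.median()  # midpoint of the observed values

# Fill the missing entry with the median
gdp_filled = gdp.fillna(gdp_median)

print(gdp_mean, gdp_median)
print(gdp_filled.tolist())
```

Filling with the mean would place the imputed entry far above three of the four observed values, which is the distortion the median avoids.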

In [None]:
# reread the raw data again
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')

**Exploration.** Replace the missing values in columns 'Life expectancy', 'GDP', and 'Schooling' with the median values for these columns

In [None]:
cols = ['Life expectancy', 'GDP', 'Schooling']
df[cols] = df[cols].fillna(df[cols].median())

**Exploration.** Check that the replacements took place

In [None]:
cols = ['Life expectancy', 'GDP', 'Schooling']
df[cols].isna().sum()      # expect all zeros
df[cols].isna().any()      # expect all False

Life expectancy    False
GDP                False
Schooling          False
dtype: bool


##### Focus Area 3: One-Hot Encoding
One-Hot Encoding is a process of transforming categorical data into numerical data. Research how this is done in Pandas.

In [None]:
# reread the raw data again
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')

The column 'Status' has categorical data. That data classifies countries as either 'Developing' or 'Developed'.

In [None]:
# check the actual categorical values used
df['Status'].unique()

array(['Developing', 'Developed'], dtype=object)

In [None]:
# research the function get_dummies()
Status_Encoded = pd.get_dummies(df['Status'])
print(Status_Encoded)

      Developed  Developing
0         False        True
1         False        True
2         False        True
3         False        True
4         False        True
...         ...         ...
2933      False        True
2934      False        True
2935      False        True
2936      False        True
2937      False        True

[2938 rows x 2 columns]


**Exploration.** If the output of the previous cell shows 'True' and 'False' values, convert them to 1 and 0 respectively. Add the two columns 'Developed' and 'Developing' (which should have values of 1s or 0s) after 'Status.' Check that the columns were added correctly.

In [None]:
enc = pd.get_dummies(df['Status']).astype(int)

i = df.columns.get_loc('Status') + 1
for c in ['Developed', 'Developing']:
    df.insert(i, c, enc[c].values)
    i += 1

df[['Status', 'Developed', 'Developing']].head(), \
df[['Developed','Developing']].isin([0,1]).all().all(), \
(df['Developed'] + df['Developing']).eq(1).all()

(       Status  Developed  Developing
 0  Developing          0           1
 1  Developing          0           1
 2  Developing          0           1
 3  Developing          0           1
 4  Developing          0           1,
 np.True_,
 np.True_)
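A shorter route to the same 0/1 columns is to ask `get_dummies` for integers directly and attach the result with `pd.concat`. This is a sketch on a toy frame with illustrative names. Unlike the `insert` loop above, it appends the new columns at the end rather than right after 'Status'.

```python
import pandas as pd

# Made-up rows, only to illustrate the calls
toy = pd.DataFrame({
    'Status': ['Developing', 'Developed', 'Developing'],
    'Year': [2015, 2015, 2014],
})

# dtype=int yields 0/1 directly, so no boolean-to-int conversion is needed
dummies = pd.get_dummies(toy['Status'], dtype=int)

# Keep the original 'Status' column and append the indicator columns
toy_encoded = pd.concat([toy, dummies], axis=1)

print(toy_encoded)
```

Each row has exactly one indicator set to 1, which is the same sanity check the cell above performs with `.eq(1).all()`.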

**Exploration.** Print the number of 'Developed' and 'Developing' countries

In [None]:
df['Status'].value_counts()

Status
Developing    2426
Developed      512
Name: count, dtype: int64


**Exploration.** Print the mean 'Life expectancy' of 'Developed' countries and the mean 'Life expectancy' of 'Developing' countries

In [None]:
df.groupby('Status')['Life expectancy'].mean()

df.loc[df['Developed'] == 1, 'Life expectancy'].mean(), \
df.loc[df['Developing'] == 1, 'Life expectancy'].mean()

(np.float64(79.1978515625), np.float64(67.11146523178809))

##### Focus Area 4: Normalize Data

Normalization is performed to rescale data values to the range from 0 to 1.

x_normal = (x_raw - x_min) / (x_max - x_min)

Use MinMaxScaler in scikit-learn to perform the transformation
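As a quick check of the formula, the sketch below applies it by hand to four made-up values and compares the result with `MinMaxScaler`. The array `x_raw` and the other names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values standing in for one numeric column
x_raw = np.array([44.0, 59.0, 65.0, 89.0])

# Formula from the text: (x_raw - x_min) / (x_max - x_min)
x_manual = (x_raw - x_raw.min()) / (x_raw.max() - x_raw.min())

# MinMaxScaler expects a 2-D array with one column per feature
x_sklearn = MinMaxScaler().fit_transform(x_raw.reshape(-1, 1)).ravel()

print(x_manual)
print(np.allclose(x_manual, x_sklearn))
```

The smallest value maps to 0 and the largest to 1; everything else lands in between in proportion to its distance from the minimum.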

In [None]:
# reread the raw data again
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')

In [None]:
# show the values
df['Life expectancy'].values

array([65. , 59.9, 59.9, ..., 44.8, 45.3, 46. ])

In [None]:
# normalization makes the transformed values range from 0 to 1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['Life expectancy'] = scaler.fit_transform(df['Life expectancy'].values.reshape(-1,1))

In [None]:
# show the values
df['Life expectancy']

Unnamed: 0,Life expectancy
0,0.544592
1,0.447818
2,0.447818
3,0.440228
4,0.434535
...,...
2933,0.151803
2934,0.155598
2935,0.161290
2936,0.170778


In [None]:
# redo the statistics
df['Life expectancy'].describe()

Unnamed: 0,Life expectancy
count,2928.0
mean,0.624762
std,0.180719
min,0.0
25%,0.508539
50%,0.679317
75%,0.747628
max,1.0


In [None]:
# reread the raw data again
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')

**Exploration.** Normalize the data for 'Life expectancy', 'GDP', and 'Schooling'. Instead of overwriting the raw values, create three columns, 'life_expectancy_normal', 'gdp_normal', and 'schooling_normal', and fill them with the normalized data.


In [None]:
from sklearn.preprocessing import MinMaxScaler

cols = ['Life expectancy', 'GDP', 'Schooling']
new_cols = ['life_expectancy_normal', 'gdp_normal', 'schooling_normal']

for c, nc in zip(cols, new_cols):
    s = df[[c]]
    m = s[c].notna()
    df.loc[m, nc] = MinMaxScaler().fit_transform(s.loc[m])

**Exploration.** Check that these three columns were created with normalized data


In [None]:
cols = ['life_expectancy_normal', 'gdp_normal', 'schooling_normal']

all(c in df.columns for c in cols), \
df[cols].describe(), \
((df[cols] >= 0) & (df[cols] <= 1)).all()

(True,
        life_expectancy_normal   gdp_normal  schooling_normal
 count             2928.000000  2490.000000       2775.000000
 mean                 0.624762     0.062779          0.579362
 std                  0.180719     0.119745          0.162267
 min                  0.000000     0.000000          0.000000
 25%                  0.508539     0.003879          0.487923
 50%                  0.679317     0.014813          0.594203
 75%                  0.747628     0.049585          0.690821
 max                  1.000000     1.000000          1.000000,
 life_expectancy_normal    False
 gdp_normal                False
 schooling_normal          False
 dtype: bool)
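The final range check above reports `False` for every column even though `describe()` shows a minimum of 0 and a maximum of 1. The likely cause is the rows left as NaN: any comparison with NaN evaluates to False, so those rows fail both bounds. A NaN-aware version of the check is sketched below on a toy column (illustrative names).

```python
import numpy as np
import pandas as pd

# A normalized column with one value left missing
toy = pd.DataFrame({'gdp_normal': [0.0, 0.5, np.nan, 1.0]})

# NaN >= 0 and NaN <= 1 are both False, so the NaN row fails this test
in_range = (toy >= 0) & (toy <= 1)
naive_ok = bool(in_range['gdp_normal'].all())

# Treat missing entries as acceptable and test only the observed values
nan_aware_ok = bool((in_range | toy.isna())['gdp_normal'].all())

print(naive_ok, nan_aware_ok)
```

An equivalent alternative is to run the original check on each column after `dropna()`.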

##### Focus Area 5: Perform Standardization

Standardization is performed to transform data to have a mean of zero  
and standard deviation of 1.

The standardized value is also called the z-score.

x_z = (x_raw - x_mean) / x_std

Use StandardScaler in scikit-learn to perform the standardization.

In [None]:
# reread the raw data again
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')

In [None]:
# get the life expectancy values
df['Life expectancy'].values

array([65. , 59.9, 59.9, ..., 44.8, 45.3, 46. ])

In [None]:
# Standardization transforms values to have a mean of 0 and standard
# deviation of 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['Life expectancy'] = scaler.fit_transform(df['Life expectancy'].values.reshape(-1,1))

In [None]:
# check the standardized values
df['Life expectancy']

Unnamed: 0,Life expectancy
0,-0.443691
1,-0.979279
2,-0.979279
3,-1.021286
4,-1.052791
...,...
2933,-2.617549
2934,-2.596545
2935,-2.565040
2936,-2.512532


In [None]:
# run simple statistics on that column
df['Life expectancy'].describe()

Unnamed: 0,Life expectancy
count,2928.0
mean,-6.891876e-16
std,1.000171
min,-3.457687
25%,-0.6432238
50%,0.3019319
75%,0.6799942
max,2.076724
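The `std` of 1.000171 reported above, rather than exactly 1, comes from two different conventions. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` and `Series.std()` default to the sample standard deviation (`ddof=1`), which is larger by a factor of sqrt(n / (n - 1)). The sketch below checks that factor for the non-null count shown above and repeats the effect on four made-up values (illustrative names).

```python
import numpy as np
import pandas as pd

# Non-null count reported by describe() above
n = 2928
ratio = np.sqrt(n / (n - 1))  # sample std divided by population std
print(round(ratio, 6))

# Same effect on a tiny made-up series, standardized the StandardScaler way
x = pd.Series([44.0, 59.0, 65.0, 89.0])
z = (x - x.mean()) / x.std(ddof=0)

std_population = z.std(ddof=0)  # convention used by StandardScaler
std_sample = z.std(ddof=1)      # pandas default, used by describe()
print(std_population, std_sample)
```

With only four values the gap is large; with 2928 values it shrinks to the small excess seen in the `describe()` output.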


In [None]:
# reread the raw data again
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')

**Exploration.** Standardize the data for 'Life expectancy', 'GDP', and 'Schooling'. Instead of overwriting the raw values, create three columns, 'life_expectancy_z', 'gdp_z', and 'schooling_z', and fill them with the standardized data.

In [None]:
from sklearn.preprocessing import StandardScaler

cols = ['Life expectancy', 'GDP', 'Schooling']
new_cols = ['life_expectancy_z', 'gdp_z', 'schooling_z']

for c, nc in zip(cols, new_cols):
    m = df[c].notna()
    df.loc[m, nc] = StandardScaler().fit_transform(df.loc[m, [c]])

**Exploration.** Check that these three columns were created with standardized data


In [None]:
cols = ['life_expectancy_z', 'gdp_z', 'schooling_z']

all(c in df.columns for c in cols)

df[cols].mean().round(3), df[cols].std(ddof=0).round(3), df[cols].describe()

(life_expectancy_z   -0.0
 gdp_z               -0.0
 schooling_z         -0.0
 dtype: float64,
 life_expectancy_z    1.0
 gdp_z                1.0
 schooling_z          1.0
 dtype: float64,
        life_expectancy_z         gdp_z   schooling_z
 count       2.928000e+03  2.490000e+03  2.775000e+03
 mean       -6.891876e-16 -1.997510e-17 -2.048411e-17
 std         1.000171e+00  1.000201e+00  1.000180e+00
 min        -3.457687e+00 -5.243792e-01 -3.571075e+00
 25%        -6.432238e-01 -4.919796e-01 -5.636139e-01
 50%         3.019319e-01 -4.006511e-01  9.147661e-02
 75%         6.799942e-01 -1.102067e-01  6.870135e-01
 max         2.076724e+00  7.828360e+00  2.592731e+00)

##### Focus Area 6: Applying Custom Function to Transform Data

In [17]:
# reread the raw data again
df = pd.read_csv('../../data/Life_Expectancy_Data.csv')

In [None]:
# assume that percentage expenditure has increased by 5
# define a function that adds a fixed value of 5 to every element (for simplicity's sake)
def percentage_expenditure_update(balance):
    return balance + 5

In [None]:
# apply that function to the DataFrame
df['percentage expenditure'] = df['percentage expenditure'].apply(percentage_expenditure_update)

**Exploration.** Check that the values in 'percentage expenditure' did change

In [None]:
import numpy as np
orig = pd.read_csv('../../data/Life_Expectancy_Data.csv')['percentage expenditure']
changed = df['percentage expenditure']

print(orig.head())
print(changed.head())

np.allclose((changed - orig).dropna().values, 5.0, atol=1e-12)

0    71.279624
1    73.523582
2    73.219243
3    78.184215
4     7.097109
Name: percentage expenditure, dtype: float64
0    76.279624
1    78.523582
2    78.219243
3    83.184215
4    12.097109
Name: percentage expenditure, dtype: float64


True
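The function applied above adds a flat 5 to every value, which is what the check confirms. If a relative increase of 5% were intended instead, the update would be a multiplication. The sketch below contrasts the two on made-up values; `spend`, `flat`, and `relative` are illustrative names.

```python
import pandas as pd

# Made-up expenditure values, including a zero
spend = pd.Series([71.279624, 73.523582, 0.0])

# Flat shift: every value moves by the same amount, zeros included
flat = spend + 5

# Relative increase: each value grows by 5% of itself, so zero stays zero
relative = spend * 1.05

print(flat.tolist())
print(relative.tolist())
```

Both operations are vectorized, so they work on a whole column without `apply`.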

**Exploration.** Create your own function to perform standardization

In [None]:
# z-score: subtract the mean, then divide by the standard deviation

def standardize(mean, std, value):
    return (value - mean) / std

**Exploration.** Apply the function standardize() to the GDP values.
Store the results in a new column called z_gdp


In [None]:
mu  = df['GDP'].mean()
sig = df['GDP'].std(ddof=0)
df['z_gdp'] = standardize(mu, sig, df['GDP'])

**Exploration.** Compare your standardized values with those obtained from scikit-learn

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

m = df['GDP'].notna()
sk = StandardScaler().fit_transform(df.loc[m, ['GDP']]).ravel()
mine = df.loc[m, 'z_gdp'].values
# agreement check and largest absolute difference
np.allclose(mine, sk), np.max(np.abs(mine - sk))

(True, np.float64(7.105427357601002e-15))