# Global Life Expectancy Study

**What this notebook shows**
- Interactive visualization with Plotly
- Exploratory visualization with Seaborn
- Modeling with scikit-learn

**Data**
- Local files: data/Life_Expectancy_Data.csv


##### Focus Area 0: Import Libraries and Read Data

In [None]:
# import libraries
import pandas as pd
import seaborn as sns

In [None]:
# read the data file and create a Pandas dataframe
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

##### Focus Area 1: Get Simple Statistics of the Dataset

**Exploration.** Get the size of the dataframe and simple statistics of the dataset, such as count, max, min, std, and mean

In [None]:
print(df.shape)
print(df.size)
df.describe(include='all')

**Exploration.** Get the first 8 rows of the dataset

In [None]:
df.head(8)

**Exploration.** Get the last 6 rows of the dataset

In [None]:
df.tail(6)

**Exploration.** Get the columns of the dataset and their types

In [None]:
df.columns, df.dtypes

##### Focus Area 2: Work with Missing Data

**Exploration.** Locate the Null values by running df.isnull(), which returns a Boolean mask that is True wherever a value is missing


In [None]:
df.isnull()
#df[df.isnull().any(axis=1)]
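
`df.isnull()` returns a mask with the same shape as the dataframe rather than the affected rows themselves. The commented-out line above reduces that mask to rows. Here is a minimal sketch of the same idea on a small made-up frame (the `toy` dataframe and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP": [1200.0, np.nan, 560.0, 8900.0],
    "Schooling": [10.1, 9.5, np.nan, 15.2],
})

# True for every row in which at least one column is missing
row_has_null = toy.isnull().any(axis=1)

# Boolean indexing keeps only the rows flagged above
rows_with_nulls = toy[row_has_null]
print(rows_with_nulls)
```

In this toy frame only rows 1 and 2 are returned, because each has one missing value.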

**Exploration.** Find the total number of missing elements per column

In [None]:
df.isnull().sum()

**Exploration.** Make a heatmap of the missing values using plotly (https://plotly.com/python-api-reference/generated/plotly.express.imshow.html). Make sure to add labels for the x and y axis.

In [None]:
import plotly.graph_objects as go

missing_mask = df.isnull().astype(int)

# graph the heatmap
fig = go.Figure(
    data=go.Heatmap(
        z=missing_mask.values,
        x=missing_mask.columns,
        y=missing_mask.index,
        showscale=False
    )
)

# label the axes appropriately
fig.update_layout(
    title="Missing Values Heatmap",
    xaxis_title="Columns",
    yaxis_title="Rows"
)

# hide the color axes (given to you below)
fig.update_coloraxes(showscale=False)

# show the graph
fig.show()

**Exploration.** Pick two columns in the dataset which have correlating missing values (i.e. values which are missing in the same row for different columns). Why do you think correlated missing values occurred for these columns in particular?

Two columns in the dataset with correlated missing values are ' BMI ' and ' thinness 1-19 years', whose missing-value indicators have a correlation of 1.0. I think the missing values are correlated for these columns because both were probably collected together in a single instrument, such as a health survey. If a country does not run the survey, both fields are missing.

In [None]:
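# correlate the 0/1 missingness indicators of every pair of columns
null_corr = df.isnull().astype(int).corr()
# flatten the matrix into (col1, col2, corr) rows
pairs = (null_corr.unstack().rename('corr').reset_index()
         .rename(columns={'level_0':'col1','level_1':'col2'}))
# keep each unordered pair once, strongest correlations first
pairs = pairs[pairs.col1 < pairs.col2].sort_values('corr', ascending=False)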
pairs.head(10)
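
To see why two columns can reach a missingness correlation of 1.0, it may help to run the same calculation on a tiny frame. This is a sketch only: the `toy` frame, its column names, and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: BMI and thinness are missing in exactly the same rows
toy = pd.DataFrame({
    "BMI": [20.1, np.nan, 25.3, np.nan, 22.0],
    "thinness": [1.1, np.nan, 0.9, np.nan, 2.3],
    "GDP": [np.nan, 500.0, 700.0, 900.0, np.nan],
})

# 1 where a value is missing, 0 otherwise
indicators = toy.isnull().astype(int)

# Pearson correlation between the indicator columns
null_corr = indicators.corr()
print(null_corr.round(3))
```

Identical indicator columns correlate perfectly, while GDP, which is missing in different rows, correlates negatively with the other two.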

**Exploration.** Wellness can be defined in practical terms as a holistic integration of physical, mental, and spiritual well-being. Let's say you're trying to make predictions of the wellness of a person in this dataset who has some missing datapoints. Do you think the missing values in this dataset affect the quality of your wellness predictions? Why or why not?

Yes, I think the missing values affect the quality of wellness predictions. Values often go missing together across several important health features, so if those rows are dropped, the data will be biased toward well-reported areas, which lowers accuracy and predictive strength.

**Exploration.** Drop any row that contains a Null value. Reset the index after the drop.

In [None]:
df_clean = df.dropna().reset_index(drop=True)

Dropping any row that has a single missing value is extreme. Let us say that we were going to study the relationship between 'Life expectancy' and 'GDP' and 'Schooling'. Our expectation is that there will be a higher life expectancy with higher GDP and higher schooling.
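
A quick way to see how much data each strategy keeps is to compare row counts. The sketch below uses an invented frame (the `toy` values and the 'Alcohol' column are hypothetical stand-ins), not the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both study and non-study columns
toy = pd.DataFrame({
    "Life expectancy": [65.0, 72.3, np.nan, 80.1, 58.4],
    "GDP": [500.0, np.nan, 3000.0, 42000.0, 300.0],
    "Schooling": [9.0, 12.0, 11.0, 16.0, 7.5],
    "Alcohol": [np.nan, 4.2, 1.0, np.nan, 0.5],
})
study_cols = ["Life expectancy", "GDP", "Schooling"]

# Rows that survive when ANY column is missing a value
kept_any = len(toy.dropna())

# Rows that survive when only the study columns are checked
kept_subset = len(toy.dropna(subset=study_cols))

print(f"dropna(): {kept_any} rows, dropna(subset=...): {kept_subset} rows")
```

Here the blanket drop keeps 1 of 5 rows, while restricting the check to the study columns keeps 3.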

In [None]:
# re-read the raw data
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

**Exploration.** Drop rows that are missing any of 'Life expectancy', 'GDP', or 'Schooling'. These data are essential to our study.

In [None]:
cols = ['Life expectancy', 'GDP', 'Schooling']
df_study = df.dropna(subset=cols).reset_index(drop=True)

**Exploration.** Check that there are no missing values for the columns 'Life expectancy', 'GDP', and 'Schooling'.

In [None]:
cols = ['Life expectancy', 'GDP', 'Schooling']
df_study[cols].isna().sum()  # should all be 0

An alternative to dropping the missing values would be to replace the missing values with the median for that column since the median is not that sensitive to outliers.
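
The difference between the two fill values shows up clearly on a skewed column. This is a minimal sketch with made-up numbers (the `gdp` series is hypothetical), in which one very large value pulls the mean far above the typical value:

```python
import numpy as np
import pandas as pd

# Hypothetical GDP-like values with one extreme outlier and one gap
gdp = pd.Series([400.0, 550.0, 600.0, 700.0, np.nan, 90000.0])

# mean() and median() both skip NaN by default
mean_fill = gdp.fillna(gdp.mean())
median_fill = gdp.fillna(gdp.median())

print("filled with mean:  ", mean_fill[4])
print("filled with median:", median_fill[4])
```

The mean-based fill is 18450, which is larger than four of the five observed values, whereas the median-based fill is 600.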

In [None]:
# re-read the raw data
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

**Exploration.** Replace the missing values in columns 'Life expectancy', 'GDP', and 'Schooling' with the median values for these columns

In [None]:
cols = ['Life expectancy', 'GDP', 'Schooling']
df[cols] = df[cols].fillna(df[cols].median())

**Exploration.** Check that the replacements took place

In [None]:
cols = ['Life expectancy', 'GDP', 'Schooling']
# a cell only displays its last expression, so print both checks
print(df[cols].isna().sum())   # expect all zeros
print(df[cols].isna().any())   # expect all False

##### Focus Area 3: One-Hot Encoding
One-Hot Encoding is a process of transforming categorical data into numerical data. Research how this is done in Pandas.
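
In practice, each category becomes its own column holding 1 where the row belongs to that category and 0 elsewhere. Here is a small sketch using `pd.get_dummies` on an invented three-row frame (the `toy` data is hypothetical):

```python
import pandas as pd

# Hypothetical toy column with two categories
toy = pd.DataFrame({"Status": ["Developing", "Developed", "Developing"]})

# One column per category; dtype=int yields 1/0 instead of True/False
one_hot = pd.get_dummies(toy["Status"], dtype=int)
print(one_hot)

# Exactly one indicator is set in every row
print((one_hot.sum(axis=1) == 1).all())
```

The resulting columns are ordered alphabetically ('Developed', then 'Developing'), and every row sums to 1.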

In [None]:
# re-read the raw data
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

The column 'Status' has categorical data. That data classifies countries as either 'Developing' or 'Developed'.

In [None]:
# check the actual categorical values used
df['Status'].unique()

In [None]:
# research the function get_dummies()
Status_Encoded = pd.get_dummies(df['Status'])
print(Status_Encoded)

**Exploration.** If the output of the previous cell shows 'True' and 'False' values, convert them to 1 and 0 respectively. Add the two columns 'Developed' and 'Developing' (which should contain only 1s and 0s) immediately after 'Status'. Check that the columns were added correctly.

In [None]:
enc = pd.get_dummies(df['Status']).astype(int)

i = df.columns.get_loc('Status') + 1
for c in ['Developed', 'Developing']:
    df.insert(i, c, enc[c].values)
    i += 1

df[['Status', 'Developed', 'Developing']].head(), \
df[['Developed','Developing']].isin([0,1]).all().all(), \
(df['Developed'] + df['Developing']).eq(1).all()

**Exploration.** Print the number of 'Developed' and 'Developing' countries

In [None]:
df['Status'].value_counts()

**Exploration.** Print the mean 'Life expectancy' of 'Developed' countries and the mean 'Life expectancy' of 'Developing' countries

In [None]:
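# mean life expectancy per status (printed, since a cell only displays its last expression)
print(df.groupby('Status')['Life expectancy'].mean())

# cross-check using the one-hot columns
df.loc[df['Developed'] == 1, 'Life expectancy'].mean(), \
df.loc[df['Developing'] == 1, 'Life expectancy'].mean()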

##### Focus Area 4: Normalize Data

Normalization is performed to rescale data values to the range from 0 to 1.

x_normal = (x_raw - x_min) / (x_max - x_min)

Use MinMaxScaler in scikit-learn to perform the transformation
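
As a sanity check on the formula, the hand computation and `MinMaxScaler` should agree. The sketch below uses a short made-up array (`x_raw` is hypothetical, not data from the file):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw values
x_raw = np.array([50.0, 60.0, 75.0, 90.0])

# Formula applied directly: (x_raw - x_min) / (x_max - x_min)
x_manual = (x_raw - x_raw.min()) / (x_raw.max() - x_raw.min())

# scikit-learn expects a 2-D array of shape (n_samples, n_features)
x_sklearn = MinMaxScaler().fit_transform(x_raw.reshape(-1, 1)).ravel()

print(x_manual)
print(np.allclose(x_manual, x_sklearn))
```

The minimum maps to 0, the maximum to 1, and the values in between keep their relative spacing (0.25 and 0.625 here).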

In [None]:
# re-read the raw data
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

In [None]:
# show the values
df['Life expectancy'].values

In [None]:
# normalization makes the transformed values range from 0 to 1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit_transform returns a 2-D array, so flatten it before assigning it to the column
df['Life expectancy'] = scaler.fit_transform(df['Life expectancy'].values.reshape(-1,1)).ravel()

In [None]:
# show the values
df['Life expectancy']

In [None]:
# redo the statistics
df['Life expectancy'].describe()

In [None]:
# re-read the raw data
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

**Exploration.** Normalize the data for 'Life expectancy', 'GDP', and 'Schooling'. Instead of overwriting the raw values, create three columns ('life_expectancy_normal', 'gdp_normal', and 'schooling_normal') and fill them with the normalized data


In [None]:
from sklearn.preprocessing import MinMaxScaler

cols = ['Life expectancy', 'GDP', 'Schooling']
new_cols = ['life_expectancy_normal', 'gdp_normal', 'schooling_normal']

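for c, nc in zip(cols, new_cols):
    m = df[c].notna()  # scale only the rows where the value is present
    # flatten the (n, 1) result so it aligns with the selected rows
    df.loc[m, nc] = MinMaxScaler().fit_transform(df.loc[m, [c]]).ravel()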

**Exploration.** Check that these three columns were created with normalized data


In [None]:
cols = ['life_expectancy_normal', 'gdp_normal', 'schooling_normal']

all(c in df.columns for c in cols), \
df[cols].describe(), \
((df[cols] >= 0) & (df[cols] <= 1)).all()

##### Focus Area 5: Perform Standardization

Standardization is performed to transform data to have a mean of zero  
and standard deviation of 1.

The standardized value is also called the z-score.

x_z = (x_raw - x_mean) / x_std

Use StandardScaler in scikit-learn to perform the standardization.
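
The z-score formula can be checked against `StandardScaler` in the same way. This is a sketch on a made-up array (`x_raw` is hypothetical). It assumes the population standard deviation (`ddof=0`), which is NumPy's default and what `StandardScaler` uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw values
x_raw = np.array([50.0, 60.0, 75.0, 90.0])

# Formula applied directly; np.std defaults to ddof=0 (population std)
x_manual = (x_raw - x_raw.mean()) / x_raw.std()

# scikit-learn expects a 2-D array of shape (n_samples, n_features)
x_sklearn = StandardScaler().fit_transform(x_raw.reshape(-1, 1)).ravel()

print(np.allclose(x_manual, x_sklearn))
print(x_manual.mean().round(6), x_manual.std().round(6))
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so a hand computation with pandas needs `std(ddof=0)` to match scikit-learn exactly.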

In [None]:
# re-read the raw data
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

In [None]:
# get the life expectancy values
df['Life expectancy'].values

In [None]:
# Standardization transforms values to have a mean of 0 and standard
# deviation of 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform returns a 2-D array, so flatten it before assigning it to the column
df['Life expectancy'] = scaler.fit_transform(df['Life expectancy'].values.reshape(-1,1)).ravel()

In [None]:
# check the standardized values
df['Life expectancy']

In [None]:
# run simple statistics on that column
df['Life expectancy'].describe()

In [None]:
# re-read the raw data
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

**Exploration.** Standardize the data for 'Life expectancy', 'GDP', and 'Schooling'. Instead of overwriting the raw values, create three columns ('life_expectancy_z', 'gdp_z', and 'schooling_z') and fill them with the standardized data

In [None]:
from sklearn.preprocessing import StandardScaler

cols = ['Life expectancy', 'GDP', 'Schooling']
new_cols = ['life_expectancy_z', 'gdp_z', 'schooling_z']

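for c, nc in zip(cols, new_cols):
    m = df[c].notna()  # standardize only the rows where the value is present
    # flatten the (n, 1) result so it aligns with the selected rows
    df.loc[m, nc] = StandardScaler().fit_transform(df.loc[m, [c]]).ravel()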

**Exploration.** Check that these three columns were created with standardized data


In [None]:
cols = ['life_expectancy_z', 'gdp_z', 'schooling_z']

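# printed, since a cell only displays its last expression
print(all(c in df.columns for c in cols))  # expect True

# means should be about 0 and population standard deviations about 1
df[cols].mean().round(3), df[cols].std(ddof=0).round(3), df[cols].describe()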

##### Focus Area 6: Applying Custom Function to Transform Data

In [None]:
# re-read the raw data
df = pd.read_csv('../data/Life_Expectancy_Data.csv')

In [None]:
# assume that percentage expenditure has increased by 5 percentage points
# define a function that adds a fixed value of 5 to every element (for simplicity's sake)
def percentage_expenditure_update(value):
    return value + 5

In [None]:
# apply that function to the DataFrame
df['percentage expenditure'] = df['percentage expenditure'].apply(percentage_expenditure_update)
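
For a simple arithmetic update like this one, `apply` and plain vectorized arithmetic should give the same result. The sketch below compares them on an invented series (the `spend` values and the `add_five` helper are hypothetical):

```python
import pandas as pd

# Hypothetical values standing in for the real column
spend = pd.Series([71.3, 73.5, 0.0, 12.8], name="percentage expenditure")

def add_five(value):
    """Add a fixed 5 to a single element."""
    return value + 5

# Element-by-element call of the Python function
via_apply = spend.apply(add_five)

# The same update expressed as vectorized arithmetic on the whole column
via_vectorized = spend + 5

print(via_apply.equals(via_vectorized))
```

`apply` is the more general tool, since it accepts any Python function, while the vectorized form is usually faster when the operation is plain arithmetic.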

**Exploration.** Check that the values in 'percentage expenditure' did change

In [None]:
import numpy as np
orig = pd.read_csv('../data/Life_Expectancy_Data.csv')['percentage expenditure']
changed = df['percentage expenditure']

print(orig.head())
print(changed.head())

np.allclose((changed - orig).dropna().values, 5.0, atol=1e-12)

**Exploration.** Create your own function to perform standardization

In [None]:
def standardize(mean, std, value):
    # z-score: distance from the mean in units of standard deviation
    return (value - mean) / std

**Exploration.** Apply the function standardize() to the GDP values.
Store the results in a new column called z_gdp


In [None]:
mu  = df['GDP'].mean()
sig = df['GDP'].std(ddof=0)
df['z_gdp'] = standardize(mu, sig, df['GDP'])

**Exploration.** Compare your standardized values with those obtained from scikit-learn

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

m = df['GDP'].notna()
sk = StandardScaler().fit_transform(df.loc[m, ['GDP']]).ravel()
mine = df.loc[m, 'z_gdp'].values
# expect True and a maximum absolute difference close to 0
np.allclose(mine, sk), np.max(np.abs(mine - sk))