# Life Expectancy Analysis & Modeling Using WHO & UN Data

## Initial Data Analysis

To:&nbsp;&nbsp;&nbsp;&nbsp; [Magnimind](https://magnimindacademy.com/)

From: Matt Curcio, matt.curcio.us@gmail.com

Date: 2023-01-29

Re:&nbsp;&nbsp;&nbsp; Initial Data Analysis

---

## Executive Summary

### TO DO

- Add where the data comes from (which website?).
- Add any background.


Purpose: This notebook checks the dataset for missing values, drops the rows and columns that are not suitable for imputation, and imputes the remaining missing values with the column mean.

Input: `Life_Expectancy_Data.csv`

Output: `Clean_LE_Data_w_Means_1.csv`

---

In [2]:
# Common Python Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Load data

- Rename columns for clarity

In [3]:
path = '../data/raw/'
filename = 'Life_Expectancy_Data.csv'

column_names = ['Country','Year','Status','LifeExpectancy','AdultMort',
                'InfD','EtOH','PercExpen','HepB','Measles',
                'BMI','lt5yD','Polio','TotalExpen','DTP','HIV',
                'GDP','Population','Thin1_19y','Thin5_9y','Income',
                'Education']

df = pd.read_csv(path+filename, names=column_names, header=0)
df.shape

(2938, 22)

## Check for null values

In [5]:
print('='*38)
print('Shape:',df.shape)
print('='*38)
print('Column              No. missing values')
print('='*38)

df.isnull().sum()

Shape: (2938, 22)
Column              No. missing values


Country             0
Year                0
Status              0
LifeExpectancy     10
AdultMort          10
InfD                0
EtOH              194
PercExpen           0
HepB              553
Measles             0
BMI                34
lt5yD               0
Polio              19
TotalExpen        226
DTP                19
HIV                 0
GDP               448
Population        652
Thin1_19y          34
Thin5_9y           34
Income            167
Education         163
dtype: int64

### NOTE 1: 
- The feature `LifeExpectancy` has 10 missing values.

- Because `LifeExpectancy` is the dependent variable, the 10 observations without a label are dropped rather than imputed.

- The missing `LifeExpectancy` values appear to be **Missing Completely at Random (MCAR)**. Under MCAR the analysis remains unbiased: the lost observations do not affect the estimates of other parameters in the model, only the sample size.
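
One informal way to probe the MCAR assumption is to look at where the missing labels fall. The sketch below uses a small made-up frame (`toy`, with invented values) in place of the real data; it lists the rows with a missing label and compares the missing rate across `Status` groups. It is an illustration only, not a formal test of MCAR.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Year": [2013, 2013, 2014, 2014, 2015, 2015],
    "Status": ["Developing", "Developed", "Developing",
               "Developed", "Developing", "Developing"],
    "LifeExpectancy": [70.1, np.nan, 65.3, 80.2, np.nan, 59.9],
})

# Flag the rows whose label is missing.
label_missing = toy["LifeExpectancy"].isna()

# List the affected rows to see whether they cluster in particular
# countries or years.
missing_rows = toy.loc[label_missing, ["Country", "Year", "Status"]]
print(missing_rows)

# Share of missing labels within each Status group. Large differences
# between groups would argue against MCAR.
missing_rate_by_status = label_missing.groupby(toy["Status"]).mean()
print(missing_rate_by_status)
```

If the missing rows were concentrated in one group, year, or set of countries, the MCAR assumption would deserve a closer look before dropping them.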

In [7]:
df.dropna(subset=['LifeExpectancy'], inplace=True) # 10 rows deleted

In [14]:
print('='*38)
print('Shape:',df.shape)
print('='*38)

Shape: (2928, 22)


### NOTE 2: 
- The three features with the highest percentage of missing values are:

| Rank | Feature | Missing / Total | % Missing |
|--|:--------|---------------:|----------:|
|1 | Population | 644/2928 | 22.0% |
|2 | HepB | 553/2928 |  18.9% |
|3 | GDP | 448/2928 | 15.3% |
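
A ranking like the one in this table can be computed directly rather than by hand. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration); on the real data the same lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names mirror the notebook's.
toy = pd.DataFrame({
    "Population": [1.0, np.nan, np.nan, 4.0],
    "HepB": [np.nan, 90.0, 85.0, 70.0],
    "BMI": [20.0, 21.0, 22.0, 23.0],
})

# Count missing values per column and express them as a percentage of rows.
n_missing = toy.isna().sum()
pct_missing = (100 * n_missing / len(toy)).round(1)

# Combine both into one table, sorted with the worst columns first.
summary = (
    pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    .sort_values("pct_missing", ascending=False)
)
print(summary)
```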

### NOTE 3: 
- Drop the feature columns `Population`, `HepB`, and `GDP`, each of which has more than 15% missing values.


- More data scraping or gathering is needed in at least 5 areas:
   1. Country population
   2. Hepatitis B vaccination rates
   3. Gross Domestic Product
   4. Total expenditure of country funds: health related
   5. Ethanol consumption per capita

In [6]:
df.drop(['Population', 'HepB', 'GDP'], axis=1, inplace=True)
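
The columns to drop could also be derived from the 15% rule instead of being listed by hand. This is a sketch under assumptions: `toy` is a made-up frame and `threshold` is a hypothetical name for the cutoff stated in NOTE 3.

```python
import numpy as np
import pandas as pd

threshold = 0.15  # drop columns with more than 15% missing values

# Toy frame with 10 rows; values are invented for illustration.
toy = pd.DataFrame({
    "Population": [1.0, np.nan, 3.0, np.nan, 5.0,
                   6.0, np.nan, 8.0, 9.0, 10.0],   # 30% missing
    "EtOH": [0.5, 1.2, np.nan, 2.1, 4.0,
             3.3, 2.8, 1.9, 0.7, 5.1],             # 10% missing
    "BMI": [20.0, 21.0, 22.0, 23.0, 24.0,
            25.0, 26.0, 27.0, 28.0, 29.0],         # 0% missing
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()

# Keep only the column names whose missing share exceeds the cutoff.
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

# Return a reduced copy rather than modifying the frame in place.
reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(reduced.columns.tolist())
```

Deriving the list this way keeps the dropped columns consistent with the stated cutoff if the data is refreshed.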

## Imputation using column means

In [7]:
# Columns to impute with their own mean. DataFrame.mean skips NaN, just as
# np.mean on a Series does, so the fill values are the same as before.
impute_cols = ['InfD', 'EtOH', 'PercExpen', 'Measles', 'BMI', 'Polio',
               'TotalExpen', 'DTP', 'Thin1_19y', 'Thin5_9y', 'Income',
               'Education']

# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which recent pandas versions warn may not modify df.
df[impute_cols] = df[impute_cols].fillna(df[impute_cols].mean())
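
To make the effect of mean imputation concrete, here is a small worked example on a made-up column (`toy` and its values are assumptions for illustration). It shows that filling with the column mean leaves the mean unchanged but shrinks the spread, which is worth keeping in mind when the imputed features are later used for modeling.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values; the numbers are invented.
toy = pd.DataFrame({"BMI": [20.0, np.nan, 30.0, np.nan, 40.0]})

# Fill each missing value with the mean of the observed values.
imputed = toy.fillna(toy.mean())

# The mean is preserved by construction.
mean_before = toy["BMI"].mean()
mean_after = imputed["BMI"].mean()

# The standard deviation drops, because the filled values sit exactly
# at the center of the distribution.
std_before = toy["BMI"].std()
std_after = imputed["BMI"].std()

print(imputed["BMI"].tolist())
print(mean_before, mean_after)
print(std_before, std_after)
```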

## Re-Check null data points

In [8]:
print('\nShape of Cleaned and Imputed dataframe:', df.shape)
      
df.isnull().sum()


Shape of Cleaned and Imputed dataframe: (2928, 19)


Country           0
Year              0
Status            0
LifeExpectancy    0
AdultMort         0
InfD              0
EtOH              0
PercExpen         0
Measles           0
BMI               0
lt5yD             0
Polio             0
TotalExpen        0
DTP               0
HIV               0
Thin1_19y         0
Thin5_9y          0
Income            0
Education         0
dtype: int64
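
Instead of reading the counts by eye, the same check can be made to fail loudly. The helper below, `assert_no_missing`, is a hypothetical name introduced for this sketch and is demonstrated on made-up frames; in the notebook it would be called on `df` before saving.

```python
import numpy as np
import pandas as pd

def assert_no_missing(frame: pd.DataFrame) -> None:
    """Raise ValueError naming every column that still contains NaN."""
    remaining = frame.isna().sum()
    offenders = remaining[remaining > 0]
    if not offenders.empty:
        raise ValueError(f"Columns with missing values: {offenders.to_dict()}")

# A fully populated toy frame passes silently.
clean = pd.DataFrame({"BMI": [20.0, 30.0], "Polio": [90.0, 95.0]})
assert_no_missing(clean)

# A toy frame with a gap raises, and the message names the column.
dirty = pd.DataFrame({"BMI": [20.0, np.nan], "Polio": [90.0, 95.0]})
try:
    assert_no_missing(dirty)
except ValueError as err:
    message = str(err)
    print(message)
```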

## Save intermediate dataframe

In [9]:
path = '../data/processed/'
filename = 'Clean_LE_Data_w_Means_1.csv'

df.to_csv(path+filename, index=False)