#### Challenge: Factors that affect life expectancy

data source: https://www.kaggle.com/kumarajarshi/life-expectancy-who/home

##### Challenge:
Your goal in this challenge is to find the factors that affect life expectancy. Specifically, you need to find out which factors increase life expectancy across countries and which factors decrease it.

1. Detect problems with the data, such as missing values and outliers. Are there any nonsensical values that seem to stem from data collection? For the missing values, discuss which technique would be most suitable for filling them in. Regarding the outliers, discuss their potential effects on your analysis and select an appropriate method to deal with them.

2. Explore the data using univariate and multivariate exploration techniques. You should pay special attention to your target variable. In this regard, your focus should be on finding the relevant variables that may affect life expectancy.

3. In the feature engineering step, you need to select a suite of variables that you think would be ideal in the modeling phase. More concretely, you may discard variables that are highly correlated with other variables or that you think are irrelevant to life expectancy.

4. Summarize your findings. One of the most important skills of a data scientist is to convey ideas and findings to nontechnical people using understandable language. In this regard, one of the most effective ways to communicate your ideas is to do it using effective visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scipy.stats as stats
# matplotlib.mlab.PCA was removed in matplotlib 3.1; sklearn's PCA is imported below
from scipy.stats import zscore
from scipy.stats.mstats import winsorize
from scipy.stats import jarque_bera
from scipy.stats import normaltest
from scipy.stats import boxcox
from scipy.stats import ttest_ind  # scipy.stats.stats is a deprecated import path
from sklearn.preprocessing import normalize
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sqlalchemy import create_engine
import warnings


warnings.filterwarnings('ignore')
sns.set_style("whitegrid")

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'lifeexpectancy'
table_name = 'lifeexpectancy'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

lifeexpectancy_df = pd.read_sql_query('select * from lifeexpectancy',con=engine)


engine.dispose()

##### question 1
Detect problems with the data, such as missing values and outliers. Are there any nonsensical values that seem to stem from data collection? For the missing values, discuss which technique would be most suitable for filling them in. Regarding the outliers, discuss their potential effects on your analysis and select an appropriate method to deal with them.

In [None]:
lifeexpectancy_df.head()

In [None]:
lifeexpectancy_df.info()

In [None]:
lifeexpectancy_df.describe()

###### missing values

In [None]:
lifeexpectancy_df.isnull().sum()

In [None]:
lifeexpectancy_df.nunique()

In [None]:
lifeexpectancy_df.columns.unique()

In [None]:
#fix trailing/leading white space
lifeexpectancy_df.columns = lifeexpectancy_df.columns.str.strip()

In [None]:
lifeexpectancy_df.columns.unique()

In [None]:
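#fill in missing values: interpolate within each Country so values are not borrowed from a neighbouring country
fill_list = ['Life expectancy', 'Adult Mortality', 'Alcohol', 'Hepatitis B',
             'BMI', 'Polio', 'Total expenditure', 'Diphtheria', 'GDP',
             'Population', 'thinness  1-19 years', 'thinness 5-9 years',
             'Income composition of resources', 'Schooling']

# groupby + transform applies the interpolation separately to each country's rows
lifeexpectancy_df[fill_list] = (
    lifeexpectancy_df.groupby('Country')[fill_list]
    .transform(lambda s: s.interpolate(limit_direction='both'))
)

# a country with no observed values for a column is still NaN at this point;
# fall back to column-wide interpolation for those remaining gaps
lifeexpectancy_df[fill_list] = lifeexpectancy_df[fill_list].interpolate()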
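
The comment in the cell above aims for per-country interpolation. As a minimal sketch of why that matters, the toy frame below (hypothetical values, not the WHO data) compares column-wide interpolation with a `groupby('Country')` version. It assumes the rows are already ordered by year within each country.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the WHO data): two countries, one gap each.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2000, 2001, 2002, 2000, 2001, 2002],
    'GDP': [100.0, np.nan, 300.0, np.nan, 20.0, 30.0],
})

# Column-wide interpolation: B's first year is filled using A's last year.
column_wide = toy['GDP'].interpolate()

# Per-country interpolation: each group is filled only from its own rows.
# limit_direction='both' also fills a gap at the start of a group.
per_country = toy.groupby('Country')['GDP'].transform(
    lambda s: s.interpolate(limit_direction='both')
)

print(pd.DataFrame({
    'Country': toy['Country'],
    'column_wide': column_wide,
    'per_country': per_country,
}))
```

In the column-wide result, country B's first value is pulled toward country A's GDP, which is the kind of leakage the per-country version avoids.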

In [None]:
lifeexpectancy_df.isnull().sum()

In [None]:
lifeexpectancy_df.describe()

###### outliers
Will run both z-score and IQR checks for comparison.

In [None]:
#zscore
columns = lifeexpectancy_df.select_dtypes(exclude=['object']).columns.tolist()
print(columns)

for i in columns:
    z_scores = zscore(lifeexpectancy_df[i])

    for threshold in range(3,5):
        print(i)
        print('The score threshold is: {}'.format(threshold))
        # use the absolute z-score so that low-side outliers are counted too
        print('Number of outliers is: {}'.format(len((np.where(np.abs(z_scores) > threshold)[0]))))
    

In [None]:
#iqr
columns = lifeexpectancy_df.select_dtypes(exclude=['object']).columns.tolist()
print(columns)

for i in columns:
    threshold = 3 
    q75, q25 = np.percentile(lifeexpectancy_df[i], [75 ,25])
    iqr = q75 - q25

    for threshold in np.arange(3,5,1):
        min_val = q25 - (iqr*threshold)
        max_val = q75 + (iqr*threshold)
        print(i)
        print("The score threshold is: {}".format(threshold))
        print("Number of outliers is: {}".format(
            len((np.where((lifeexpectancy_df[i] > max_val) 
                      | (lifeexpectancy_df[i] < min_val))[0]))
        ))
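
The two checks above can also be wrapped in small helpers, which makes it easier to compare them on the same data. The sketch below is illustrative only: the function names and the sample are hypothetical, and it counts outliers on both sides of the distribution by taking the absolute z-score.

```python
import numpy as np
from scipy.stats import zscore


def count_outliers_zscore(values, threshold=3):
    """Count points whose absolute z-score exceeds the threshold."""
    z = zscore(values)
    return int((np.abs(z) > threshold).sum())


def count_outliers_iqr(values, threshold=3):
    """Count points outside [q25 - threshold*IQR, q75 + threshold*IQR]."""
    q75, q25 = np.percentile(values, [75, 25])
    iqr = q75 - q25
    low, high = q25 - threshold * iqr, q75 + threshold * iqr
    return int(((values < low) | (values > high)).sum())


# Hypothetical sample: 98 evenly spread points plus one low and one high extreme.
sample = np.concatenate([np.linspace(40, 60, 98), [-40.0, 140.0]])

two_sided = count_outliers_zscore(sample)
one_sided = int((zscore(sample) > 3).sum())  # misses the low extreme
by_iqr = count_outliers_iqr(sample)

print(two_sided, one_sided, by_iqr)
```

A one-sided comparison such as `z_scores > threshold` only sees the high extreme, which is why the absolute value is used in the helper.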

##### transform outliers

In [None]:
outliers = ['Adult Mortality', 'infant deaths', 'Alcohol', 
            'percentage expenditure', 'Measles', 'under-five deaths', 
            'Polio', 'Total expenditure', 'Diphtheria', 'HIV/AIDS', 
            'GDP', 'Population', 'thinness  1-19 years', 
            'thinness 5-9 years']

In [None]:
for outlier in outliers:
    lifeexpectancy_df['winsorized_'+ outlier] = winsorize(lifeexpectancy_df[outlier], (0, 0.05))
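
The `limits` tuple passed to `winsorize` is (fraction capped at the low end, fraction capped at the high end). The sketch below uses hypothetical data (the numbers 1 to 100) to show how the two settings used in this notebook differ; it is an illustration, not a recommendation for specific limits.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical data: the numbers 1..100 as floats.
data = np.arange(1, 101, dtype=float)

# (0, 0.05): leave the low end alone and cap the highest 5% of values.
upper_only = winsorize(data, limits=(0, 0.05))

# (0.10, 0.10): cap the lowest 10% and the highest 10% of values.
both_tails = winsorize(data, limits=(0.10, 0.10))

print("upper only:", upper_only.min(), upper_only.max())
print("both tails:", both_tails.min(), both_tails.max())
```

Capped values are replaced by the nearest retained value, so the number of rows stays the same and only the extremes are pulled in.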

In [None]:
# one box plot per winsorized column, laid out on a 4x4 grid
winsorized_columns = ['winsorized_' + name for name in [
    'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure',
    'Measles', 'under-five deaths', 'Polio', 'Total expenditure',
    'Diphtheria', 'HIV/AIDS', 'GDP', 'Population',
    'thinness  1-19 years', 'thinness 5-9 years']]

plt.figure(figsize=(20,20))

for position, column in enumerate(winsorized_columns, start=1):
    plt.subplot(4,4,position)
    plt.boxplot(lifeexpectancy_df[column], whis=3)
    plt.title("Box plot of {} (whis=3)".format(column))

plt.tight_layout()
plt.show()

In [None]:
#adjustments for variables with remaining outliers

outliers = ['infant deaths', 'percentage expenditure', 
            'Measles', 'under-five deaths', 'Polio', 
            'Diphtheria', 'HIV/AIDS', 
            'GDP', 'Population']

for outlier in outliers:
    lifeexpectancy_df['winsorized_'+ outlier] = winsorize(lifeexpectancy_df[outlier], (0.10, 0.10))
    

In [None]:
# redraw the same 4x4 grid of box plots after the stronger winsorization
winsorized_columns = ['winsorized_' + name for name in [
    'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure',
    'Measles', 'under-five deaths', 'Polio', 'Total expenditure',
    'Diphtheria', 'HIV/AIDS', 'GDP', 'Population',
    'thinness  1-19 years', 'thinness 5-9 years']]

plt.figure(figsize=(20,20))

for position, column in enumerate(winsorized_columns, start=1):
    plt.subplot(4,4,position)
    plt.boxplot(lifeexpectancy_df[column], whis=3)
    plt.title("Box plot of {} (whis=3)".format(column))

plt.tight_layout()
plt.show()

Recheck the z-score and IQR counts for outliers.

In [None]:
columns = lifeexpectancy_df.select_dtypes(exclude=['object']).columns.tolist()
print(columns)

for i in columns:
    z_scores = zscore(lifeexpectancy_df[i])

    for threshold in range(3,5):
        print(i)
        print('The score threshold is: {}'.format(threshold))
        # use the absolute z-score so that low-side outliers are counted too
        print('Number of outliers is: {}'.format(len((np.where(np.abs(z_scores) > threshold)[0]))))

In [None]:
columns = lifeexpectancy_df.select_dtypes(exclude=['object']).columns.tolist()
print(columns)

for i in columns:
    threshold = 3 
    q75, q25 = np.percentile(lifeexpectancy_df[i], [75 ,25])
    iqr = q75 - q25

    for threshold in np.arange(3,5,1):
        min_val = q25 - (iqr*threshold)
        max_val = q75 + (iqr*threshold)
        print(i)
        print("The score threshold is: {}".format(threshold))
        print("Number of outliers is: {}".format(
            len((np.where((lifeexpectancy_df[i] > max_val) 
                      | (lifeexpectancy_df[i] < min_val))[0]))
        ))

After winsorization, 3 variables still have outliers. Will log-transform these variables.

In [None]:

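# note: np.log returns -inf for any zero entry, so columns that contain zeros
# end up with non-finite values after this step
lifeexpectancy_df['log_percentage expenditure'] = np.log(lifeexpectancy_df['percentage expenditure'])
lifeexpectancy_df['log_Measles'] = np.log(lifeexpectancy_df['Measles'])
lifeexpectancy_df['log_HIV/AIDS'] = np.log(lifeexpectancy_df['HIV/AIDS'])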


In [None]:
columns = lifeexpectancy_df.select_dtypes(exclude=['object']).columns.tolist()
print(columns)

for i in columns:
    threshold = 3 
    q75, q25 = np.percentile(lifeexpectancy_df[i], [75 ,25])
    iqr = q75 - q25

    for threshold in np.arange(3,5,1):
        min_val = q25 - (iqr*threshold)
        max_val = q75 + (iqr*threshold)
        print(i)
        print("The score threshold is: {}".format(threshold))
        print("Number of outliers is: {}".format(
            len((np.where((lifeexpectancy_df[i] > max_val) 
                      | (lifeexpectancy_df[i] < min_val))[0]))
        ))

###### Still have 1222 outliers for percentage expenditure after the log transform but only 304 after winsorizing; will use the winsorized version for this variable.
Question: is there a way to eliminate these outliers beyond winsorizing, or should another transformation be used?
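
One possible answer to the question above, offered as a suggestion rather than as the method used in this notebook: when a column contains zeros, `np.log` maps them to `-inf`, which distorts any quantile-based outlier count. `np.log1p` computes log(1 + x) and keeps zeros finite. The values below are hypothetical, and switching to `log1p` would change the outlier counts reported above.

```python
import numpy as np

# Hypothetical right-skewed values that include zeros.
counts = np.array([0.0, 0.0, 3.0, 12.0, 250.0, 18000.0])

# Plain log: zeros become -inf (the divide warning is silenced here on purpose).
with np.errstate(divide='ignore'):
    plain_log = np.log(counts)

# log1p: log(1 + x), so zeros map to 0.0 and everything stays finite.
shifted_log = np.log1p(counts)

print("non-finite values after np.log:  ", int(np.isinf(plain_log).sum()))
print("non-finite values after np.log1p:", int((~np.isfinite(shifted_log)).sum()))
```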

##### question 2 
Explore the data using univariate and multivariate exploration techniques. You should pay special attention to your target variable. In this regard, your focus should be on finding the relevant variables that may affect life expectancy.

In [None]:
lifeexpectancy_df.describe()

The log versions of percentage expenditure and Measles appear to be unusable for our purposes. Will use winsorized values for all variables, with the possible exception of HIV/AIDS.

In [None]:
plt.figure(figsize=(20,20))

plt.subplot(5,4,1)
plt.hist(lifeexpectancy_df['Life expectancy'])
plt.title("Histogram of Life expectancy")

plt.subplot(5,4,2)
plt.hist(lifeexpectancy_df['winsorized_Adult Mortality'])
plt.title("Histogram of winsorized_Adult Mortality")

plt.subplot(5,4,3)
plt.hist(lifeexpectancy_df['winsorized_infant deaths'])
plt.title("Histogram of winsorized_infant deaths")

plt.subplot(5,4,4)
plt.hist(lifeexpectancy_df['winsorized_Alcohol'])
plt.title("Histogram of winsorized_Alcohol")

plt.subplot(5,4,5)
plt.hist(lifeexpectancy_df['winsorized_percentage expenditure'])
plt.title("Histogram of winsorized_percentage expenditure")

plt.subplot(5,4,6)
plt.hist(lifeexpectancy_df['Hepatitis B'])
plt.title("Histogram of Hepatitis B")

plt.subplot(5,4,7)
plt.hist(lifeexpectancy_df['winsorized_Measles'])
plt.title("Histogram of winsorized_Measles")

plt.subplot(5,4,8)
plt.hist(lifeexpectancy_df['BMI'])
plt.title("Histogram of BMI")

plt.subplot(5,4,9)
plt.hist(lifeexpectancy_df['winsorized_under-five deaths'])
plt.title("Histogram of winsorized_under-five deaths")

plt.subplot(5,4,10)
plt.hist(lifeexpectancy_df['winsorized_Polio'])
plt.title("Histogram of winsorized_Polio")

plt.subplot(5,4,11)
plt.hist(lifeexpectancy_df['winsorized_Total expenditure'])
plt.title("Histogram of winsorized_Total expenditure")

plt.subplot(5,4,12)
plt.hist(lifeexpectancy_df['winsorized_Diphtheria'])
plt.title("Histogram of winsorized_Diphtheria")

plt.subplot(5,4,13)
plt.hist(lifeexpectancy_df['winsorized_HIV/AIDS'])
plt.title("Histogram of winsorized_HIV/AIDS")

plt.subplot(5,4,14)
plt.hist(lifeexpectancy_df['winsorized_GDP'])
plt.title("Histogram of winsorized_GDP")

plt.subplot(5,4,15)
plt.hist(lifeexpectancy_df['winsorized_Population'])
plt.title("Histogram of winsorized_Population")

plt.subplot(5,4,16)
plt.hist(lifeexpectancy_df['winsorized_thinness  1-19 years'])
plt.title("Histogram of winsorized_thinness  1-19 years")

plt.subplot(5,4,17)
plt.hist(lifeexpectancy_df['winsorized_thinness 5-9 years'])
plt.title("Histogram of winsorized_thinness 5-9 years")

plt.subplot(5,4,18)
plt.hist(lifeexpectancy_df['Income composition of resources'])
plt.title("Histogram of Income composition of resources")

plt.subplot(5,4,19)
plt.hist(lifeexpectancy_df['Schooling'])
plt.title("Histogram of Schooling")


plt.tight_layout()
plt.show()

NONE of the variables appears to be normally distributed. Will log-transform all variables to compare.

In [None]:
outliers = ['Life expectancy', 'Adult Mortality', 'infant deaths', 'Alcohol', 
            'percentage expenditure', 'Hepatitis B', 'Measles', 'BMI',
            'under-five deaths', 'Polio', 'Total expenditure', 'Diphtheria', 
            'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years', 
            'thinness 5-9 years', 'Income composition of resources', 'Schooling']

for outlier in outliers:
    lifeexpectancy_df['log_'+ outlier] = np.log(lifeexpectancy_df[outlier])

In [None]:
lifeexpectancy_df.describe()

In [None]:
plt.figure(figsize=(20,20))

plt.subplot(5,5,1)
plt.hist(lifeexpectancy_df['log_Life expectancy'])
plt.title("Histogram of log_Life expectancy")

plt.subplot(5,5,2)
plt.hist(lifeexpectancy_df['log_Adult Mortality'])
plt.title("Histogram of log_adult_mortality")

plt.subplot(5,5,3)
plt.hist(lifeexpectancy_df['winsorized_infant deaths'])
plt.title("Histogram of winsorized_infant deaths")

plt.subplot(5,5,4)
plt.hist(lifeexpectancy_df['log_Alcohol'])
plt.title("Histogram of log_Alcohol")

plt.subplot(5,5,5)
plt.hist(lifeexpectancy_df['winsorized_percentage expenditure'])
plt.title("Histogram of winsorized_percentage expenditure")

plt.subplot(5,5,6)
plt.hist(lifeexpectancy_df['Life expectancy'])
plt.title("Histogram of Life expectancy")

plt.subplot(5,5,7)
plt.hist(lifeexpectancy_df['winsorized_Adult Mortality'])
plt.title("Histogram of winsorized_adult_mortality")

plt.subplot(5,5,8)
plt.hist(lifeexpectancy_df['winsorized_infant deaths'])
plt.title("Histogram of winsorized_infant deaths")

plt.subplot(5,5,9)
plt.hist(lifeexpectancy_df['winsorized_Alcohol'])
plt.title("Histogram of winsorized_Alcohol")

plt.subplot(5,5,10)
plt.hist(lifeexpectancy_df['winsorized_percentage expenditure'])
plt.title("Histogram of winsorized_percentage expenditure")

plt.subplot(5,5,11)
plt.hist(lifeexpectancy_df['log_Hepatitis B'])
plt.title("Histogram of log_Hepatitis B")

plt.subplot(5,5,12)
plt.hist(lifeexpectancy_df['winsorized_Measles'])
plt.title("Histogram of winsorized_Measles")


plt.subplot(5,5,13)
plt.hist(lifeexpectancy_df['log_BMI'])
plt.title("Histogram of log_BMI")

plt.subplot(5,5,14)
plt.hist(lifeexpectancy_df['winsorized_under-five deaths'])
plt.title("Histogram of winsorized_under-five deaths")

plt.subplot(5,5,15)
plt.hist(lifeexpectancy_df['log_Polio'])
plt.title("Histogram of log_Polio")

plt.subplot(5,5,16)
plt.hist(lifeexpectancy_df['Hepatitis B'])
plt.title("Histogram of Hepatitis B")

plt.subplot(5,5,17)
plt.hist(lifeexpectancy_df['winsorized_Measles'])
plt.title("Histogram of winsorized_Measles")

plt.subplot(5,5,18)
plt.hist(lifeexpectancy_df['BMI'])
plt.title("Histogram of BMI")

plt.subplot(5,5,19)
plt.hist(lifeexpectancy_df['winsorized_under-five deaths'])
plt.title("Histogram of winsorized_under-five deaths")

plt.subplot(5,5,20)
plt.hist(lifeexpectancy_df['winsorized_Polio'])
plt.title("Histogram of winsorized_Polio")

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(20,20))

plt.subplot(5,5,1)
plt.hist(lifeexpectancy_df['log_Total expenditure'])
plt.title("Histogram of log_Total expenditure")

plt.subplot(5,5,2)
plt.hist(lifeexpectancy_df['log_Diphtheria'])
plt.title("Histogram of log_Diphtheria")

plt.subplot(5,5,3)
plt.hist(lifeexpectancy_df['log_HIV/AIDS'])
plt.title("Histogram of log_HIV/AIDS")

plt.subplot(5,5,4)
plt.hist(lifeexpectancy_df['log_GDP'])
plt.title("Histogram of log_GDP")

plt.subplot(5,5,5)
plt.hist(lifeexpectancy_df['log_Population'])
plt.title("Histogram of log_Population")

plt.subplot(5,5,6)
plt.hist(lifeexpectancy_df['winsorized_Total expenditure'])
plt.title("Histogram of winsorized_Total expenditure")

plt.subplot(5,5,7)
plt.hist(lifeexpectancy_df['winsorized_Diphtheria'])
plt.title("Histogram of winsorized_Diphtheria")

plt.subplot(5,5,8)
plt.hist(lifeexpectancy_df['winsorized_HIV/AIDS'])
plt.title("Histogram of winsorized_HIV/AIDS")

plt.subplot(5,5,9)
plt.hist(lifeexpectancy_df['winsorized_GDP'])
plt.title("Histogram of winsorized_GDP")

plt.subplot(5,5,10)
plt.hist(lifeexpectancy_df['winsorized_Population'])
plt.title("Histogram of winsorized_Population")

plt.subplot(5,5,11)
plt.hist(lifeexpectancy_df['log_thinness  1-19 years'])
plt.title("Histogram of log_thinness  1-19 years")

plt.subplot(5,5,12)
plt.hist(lifeexpectancy_df['log_thinness 5-9 years'])
plt.title("Histogram of log_thinness 5-9 years")

plt.subplot(5,5,13)
plt.hist(lifeexpectancy_df['Income composition of resources'])
plt.title("Histogram of Income composition of resources")

plt.subplot(5,5,14)
plt.hist(lifeexpectancy_df['Schooling'])
plt.title("Histogram of Schooling")


plt.subplot(5,5,15)


plt.subplot(5,5,16)
plt.hist(lifeexpectancy_df['winsorized_thinness  1-19 years'])
plt.title("Histogram of winsorized_thinness  1-19 years")

plt.subplot(5,5,17)
plt.hist(lifeexpectancy_df['winsorized_thinness 5-9 years'])
plt.title("Histogram of winsorized_thinness 5-9 years")

plt.subplot(5,5,18)
plt.hist(lifeexpectancy_df['Income composition of resources'])
plt.title("Histogram of Income composition of resources")

plt.subplot(5,5,19)
plt.hist(lifeexpectancy_df['Schooling'])
plt.title("Histogram of Schooling")

plt.tight_layout()
plt.show()

In [None]:
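# restrict to numeric columns; recent pandas versions raise an error on text columns
lifeexpectancy_df.select_dtypes(include='number').corr()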

In [None]:
lifeexpectancy_df.columns.unique()

###### checking for relationships between continuous variables

In [None]:
corrmat1_lifeexpectancy = lifeexpectancy_df[['Life expectancy', 'Adult Mortality', 
                                            'infant deaths', 'Alcohol', 'percentage expenditure', 
                                            'Hepatitis B', 'Measles', 'BMI', 'under-five deaths', 
                                            'Polio', 'Total expenditure', 'Diphtheria', 'HIV/AIDS', 
                                            'GDP', 'Population', 'thinness  1-19 years', 
                                            'thinness 5-9 years', 'Income composition of resources', 
                                            'Schooling']].corr()

corrmat2_lifeexpectancy = lifeexpectancy_df[['Life expectancy','winsorized_Adult Mortality', 
                                             'winsorized_infant deaths', 'winsorized_Alcohol', 
                                             'winsorized_percentage expenditure', 'winsorized_Measles', 
                                             'winsorized_under-five deaths', 'winsorized_Polio', 
                                             'winsorized_Total expenditure', 'winsorized_Diphtheria', 
                                             'winsorized_HIV/AIDS', 'winsorized_GDP', 'winsorized_Population', 
                                             'winsorized_thinness  1-19 years','winsorized_thinness 5-9 years']].corr()

corrmat3_lifeexpectancy = lifeexpectancy_df[['Life expectancy','log_Life expectancy',
                                             'log_Adult Mortality', 'log_infant deaths', 'log_Alcohol',
                                             'log_percentage expenditure', 'log_Hepatitis B', 'log_Measles',
                                             'log_BMI', 'log_under-five deaths', 'log_Polio', 'log_Total expenditure', 
                                             'log_Diphtheria', 'log_HIV/AIDS', 'log_GDP', 'log_Population', 
                                             'log_thinness  1-19 years', 'log_thinness 5-9 years',
                                             'log_Income composition of resources', 'log_Schooling']].corr()


plt.figure(figsize=(30,30))
plt.subplot(3,1,1)
# draw the heatmap using seaborn.
sns.heatmap(corrmat1_lifeexpectancy, square=True, annot=True, linewidths=.5)
plt.title("correlation matrix life expectancy")
    
plt.subplot(3,1,2)
sns.heatmap(corrmat2_lifeexpectancy, square=True, annot=True, linewidths=.5)
plt.title("correlation matrix life expectancy")
    
plt.subplot(3,1,3)
sns.heatmap(corrmat3_lifeexpectancy, square=True, annot=True, linewidths=.5)
plt.title("correlation matrix life expectancy")


plt.tight_layout()
plt.show()

###### Correlation with life expectancy
'log_Income composition of resources': 0.84
'log_HIV/AIDS': -0.81
'log_Schooling': 0.71
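
Rather than reading the strongest correlations off the heatmap, the target column of the correlation matrix can be sorted by absolute value. The sketch below uses a synthetic frame with made-up relationships (the coefficients and the `unrelated` column are assumptions for illustration), so its numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic data with a known structure, for illustration only.
rng = np.random.default_rng(42)
n = 200
schooling = rng.uniform(0, 20, n)
hiv = rng.exponential(2.0, n)

toy = pd.DataFrame({
    'Life expectancy': 50 + 1.5 * schooling - 2.0 * hiv + rng.normal(0, 1, n),
    'Schooling': schooling,
    'HIV/AIDS': hiv,
    'unrelated': rng.normal(0, 1, n),
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = toy.corr()['Life expectancy'].drop('Life expectancy')

# Rank by strength (absolute value) while keeping the sign in the output.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.loc[order]

print(ranked)
```

Keeping the sign matters for this challenge, since it separates factors associated with higher life expectancy from those associated with lower life expectancy.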

###### checking relationships between continuous & categorical variables

In [None]:
lifeexpectancy_df.groupby('Country').mean(numeric_only=True)

In [None]:
avglifeexp_country = lifeexpectancy_df.groupby('Country')['Life expectancy'].mean()
print(avglifeexp_country)

There are too many countries (193) to run a t-test for each pair of countries; will group by 'Status' instead and look at its relationship to life expectancy.

In [None]:
lifeexpectancy_df.groupby('Status').mean(numeric_only=True)

In [None]:
status = lifeexpectancy_df['Status'].unique()
status_df = lifeexpectancy_df.groupby('Status')
for var in ['Life expectancy', 'log_Income composition of resources', 'log_HIV/AIDS', 'log_Schooling']:
    print("------------------------------------------------")
    print("Comparisons for variable: {}".format(var))
    print("------------------------------------------------")
    for i in range(0, len(status)):
        for j in range(i+1, len(status)):
            print("t-test between groups {0} and {1}:".format(status[i], status[j]))
            print(stats.ttest_ind(
                lifeexpectancy_df[lifeexpectancy_df['Status']==status[i]][var], 
                lifeexpectancy_df[lifeexpectancy_df['Status']==status[j]][var]
            ))
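
Two details can affect the t-tests above: missing values make `ttest_ind` return NaN under its default settings, and the two status groups may not share the same variance. The sketch below, on hypothetical values, drops NaNs first and uses Welch's t-test (`equal_var=False`). This is a suggested variant, not the test run in the cell above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical values for two groups, each with one missing entry.
toy = pd.DataFrame({
    'Status': ['Developed'] * 5 + ['Developing'] * 7,
    'Life expectancy': [79.0, 81.5, 80.2, np.nan, 82.1,
                        61.0, 65.5, 70.2, 58.9, np.nan, 67.3, 72.0],
})

developed_raw = toy.loc[toy['Status'] == 'Developed', 'Life expectancy']
developing_raw = toy.loc[toy['Status'] == 'Developing', 'Life expectancy']

# With NaNs present, the default nan_policy propagates them into the result.
raw_result = stats.ttest_ind(developed_raw, developing_raw)

# Drop NaNs, then run Welch's t-test, which does not assume equal variances.
welch_result = stats.ttest_ind(
    developed_raw.dropna(), developing_raw.dropna(), equal_var=False
)

print("with NaNs:", raw_result.statistic, raw_result.pvalue)
print("Welch, NaNs dropped:", welch_result.statistic, welch_result.pvalue)
```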

life expectancy by year? 

In [None]:
year_avg = lifeexpectancy_df.groupby('Year')['Life expectancy'].mean()
print(year_avg)

In [None]:
log_income = lifeexpectancy_df['log_Income composition of resources'].describe()
print(log_income)

In [None]:
org_income = lifeexpectancy_df['Income composition of resources'].describe()
print(org_income)

##### question 3
In the feature engineering step, you need to select a suite of variables that you think would be ideal in the modeling phase. More concretely, you may discard variables that are highly correlated with other variables or that you think are irrelevant to life expectancy.

##### suitable variables for modeling
'log_Income composition of resources', 'log_HIV/AIDS', 'log_Schooling', 'Status'
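
The prompt for this question also mentions discarding variables that are highly correlated with each other. One common way to do that is to scan the upper triangle of the absolute correlation matrix and drop the later column of each strongly correlated pair. The helper name, the 0.9 threshold, and the synthetic data below are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


# Synthetic features: the second column is almost a rescaled copy of the first.
rng = np.random.default_rng(1)
infant = rng.exponential(30.0, 300)
toy = pd.DataFrame({
    'infant deaths': infant,
    'under-five deaths': infant * 1.4 + rng.normal(0, 1, 300),
    'Schooling': rng.uniform(0, 20, 300),
})

reduced, dropped = drop_highly_correlated(toy)
print("dropped:", dropped)
print("kept:", reduced.columns.tolist())
```

Which column of a pair to keep is a judgment call; this sketch simply keeps whichever appears first.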

In [None]:
#verify statistically whether the log-transformed variables are normally distributed
for var in ['log_Income composition of resources', 'log_HIV/AIDS', 'log_Schooling']:
    # test the log columns directly; np.sqrt of a negative log value would give NaN.
    # keep only finite values, since np.log maps zeros to -inf
    values = lifeexpectancy_df[var]
    values = values[np.isfinite(values)]

    jb_stats = jarque_bera(values)
    norm_stats = normaltest(values)

    print(var)
    print("Jarque-Bera test statistics is {0} and p value is {1}".format(jb_stats[0], jb_stats[1]))
    print("Normality test statistics is {0} and p value is {1}".format(norm_stats[0], norm_stats[1]))
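
To help read the output of these tests: both use normality as the null hypothesis, so a small p-value is evidence against a normal distribution. The sketch below runs them on two synthetic samples (an assumption for illustration) to show how the statistics differ between a normal and a skewed sample.

```python
import numpy as np
from scipy.stats import jarque_bera, normaltest

# Synthetic samples: one drawn from a normal, one from a skewed distribution.
rng = np.random.default_rng(7)
samples = {
    'normal': rng.normal(0, 1, 2000),
    'skewed': rng.exponential(1.0, 2000),
}

results = {}
for name, sample in samples.items():
    jb_stat, jb_p = jarque_bera(sample)
    k2_stat, k2_p = normaltest(sample)
    results[name] = (jb_stat, jb_p, k2_stat, k2_p)
    print("{}: Jarque-Bera stat={:.2f}, p={:.4f}; normaltest stat={:.2f}, p={:.4f}".format(
        name, jb_stat, jb_p, k2_stat, k2_p))
```

With large samples these tests reject normality for even mild departures, so it is worth checking histograms alongside the p-values.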

In [None]:

# Box-Cox needs strictly positive input, so apply it to the positive values of the
# original columns rather than to the log columns (which are <= 0 for values <= 1)
income = lifeexpectancy_df['Income composition of resources']
hiv = lifeexpectancy_df['HIV/AIDS']
school = lifeexpectancy_df['Schooling']

income_boxcox,_ = boxcox(income[income > 0])
hiv_boxcox,_ = boxcox(hiv[hiv > 0])
school_boxcox,_ = boxcox(school[school > 0])

plt.figure(figsize=(18,5))

# histograms of the box-cox transformed data
plt.subplot(1, 3, 1)
plt.hist(income_boxcox)
plt.title("histogram of income (box-cox transformed)")

plt.subplot(1, 3, 2)
plt.hist(hiv_boxcox)
plt.title("histogram of hiv (box-cox transformed)")

plt.subplot(1, 3, 3)
plt.hist(school_boxcox)
plt.title("histogram of school (box-cox transformed)")

plt.show()
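
A short note on `boxcox`: it returns the transformed data together with the fitted lambda, and it raises an error for zero or negative input. The sketch below uses a synthetic lognormal sample (an assumption for illustration) to show both behaviours and how skewness changes after the transform.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic, strictly positive, right-skewed sample.
rng = np.random.default_rng(3)
positive = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox fits lambda by maximum likelihood and returns (transformed, lambda).
transformed, fitted_lambda = boxcox(positive)

print("skew before:", skew(positive))
print("skew after: ", skew(transformed))
print("fitted lambda:", fitted_lambda)

# Non-positive input is rejected, which is why zeros must be handled first.
rejected = False
try:
    boxcox(np.array([0.0, 1.0, 2.0]))
except ValueError as err:
    rejected = True
    print("boxcox rejected non-positive input:", err)
```

A fitted lambda close to 0 means the Box-Cox transform is behaving much like a plain log transform.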

In [None]:
np.sqrt(lifeexpectancy_df['log_Income composition of resources'])

##### question 4 
Summarize your findings. One of the most important skills of a data scientist is to convey ideas and findings to nontechnical people using understandable language. In this regard, one of the most effective ways to communicate your ideas is to do it using effective visualization.