# Challenge: Factors that affect life expectancy
Throughout this module, we dove deep into the topics of exploratory data analysis (EDA) and its substeps. Now it's time to apply your knowledge. In this challenge, you'll work with the Life Expectancy dataset from Kaggle. The data is collected from the websites of the World Health Organisation (WHO) and World Bank (WB). The dataset contains annual variables for many of the countries in the world which are divided into several broad categories: immunization related factors, mortality factors, economical factors, and social factors.

https://www.kaggle.com/kumarajarshi/life-expectancy-who

You should access the data from the Thinkful database. Here are the credentials you can use to connect to the database:

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'lifeexpectancy'
table_name = 'lifeexpectancy'

To complete this challenge, submit a Jupyter notebook containing your solutions to the following tasks. You can also submit a kernel to Kaggle containing your solutions.

- Your goal in this challenge is to find the factors that affect the life expectancy. Specifically, you need to find out which factors increase the expected life in the countries and which factors decrease it.

- First, load the dataset.

- Detect the problems with the data such as missing values and outliers. Are there any nonsense values that seem to stem from the data collection? For the missing values, discuss which technique would be the most suitable one for filling in these values. Regarding the outliers, discuss their potential effects on your analysis and select an appropriate method to deal with them.

- Explore the data using univariate and multivariate exploration techniques. You should pay special attention to your target variable. In this regard, your focus should be on finding the relevant variables that may affect life expectancy.

- In the feature engineering step, you need to select a suite of variables that you think would be ideal in the modeling phase. More concretely, you may discard some variables that are highly correlated with other ones or variables that you think are irrelevant to life expectancy.

- Summarize your findings. One of the most important skills of a data scientist is to convey ideas and findings to nontechnical people using understandable language. In this regard, one of the most effective ways to communicate your ideas is to do it using effective visualization.

# step 1

In [131]:
%matplotlib inline
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import warnings
import seaborn as sns
from matplotlib import pyplot as plt
import scipy.stats as stats

warnings.filterwarnings('ignore')

# step 2
First, load the dataset.

In [132]:
# First, load the dataset.
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'lifeexpectancy'
table_name = 'lifeexpectancy'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df_original = pd.read_sql_query('select * from '+ table_name,con=engine)
engine.dispose()

In [133]:
df_original.info()
df_original.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
Country                            2938 non-null object
Year                               2938 non-null int64
Status                             2938 non-null object
Life expectancy                    2928 non-null float64
Adult Mortality                    2928 non-null float64
infant deaths                      2938 non-null int64
Alcohol                            2744 non-null float64
percentage expenditure             2938 non-null float64
Hepatitis B                        2385 non-null float64
Measles                            2938 non-null int64
 BMI                               2904 non-null float64
under-five deaths                  2938 non-null int64
Polio                              2919 non-null float64
Total expenditure                  2712 non-null float64
Diphtheria                         2919 non-null float64
 HIV/AIDS                          2938 non-null

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


# step 3
Detect the problems with the data such as missing values and outliers. Are there any nonsense values that seem to stem from the data collection? For the missing values, discuss which technique would be the most suitable one for filling in these values. Regarding the outliers, discuss their potential effects on your analysis and select an appropriate method to deal with them.

### step 3.1 descriptive statistics
- use df.describe() to examine the data

### step 3.2 nonsense value check
- variables including 'Adult Mortality', 'infant deaths', 'Measles', and 'under-five deaths' are defined per 1000 population (for example, 'infant deaths' is the 'Number of Infant Deaths per 1000 population'), so values of 1000 or more make no sense for them. Nevertheless, we can spot 14 'infant deaths' records at or above 1000 (i.e. df[df['infant_deaths'] >=1000]). We change such values to NaN.
- the zero values in 'infant deaths', 'Measles', and similar columns are also questionable, but not total nonsense, so we keep them at this point.

### step 3.3 fill nan
- the data is a time series, so use interpolate() to fill the NaN values
- it makes more sense to fill the NaN values within each country, so that values from one country never leak into another
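
The idea of interpolating within each country can be sketched on toy data. This is a minimal illustration, not the notebook's actual cell: the values below are made up, and `Alcohol_filled` is a hypothetical column name. It also shows why a `dropna()` is still needed afterwards, since a leading NaN has no earlier value to interpolate from.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): two countries, three years each.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2000, 2001, 2002, 2000, 2001, 2002],
    'Alcohol': [1.0, np.nan, 3.0, np.nan, 4.0, 6.0],
})

# Sort so that interpolation follows the time order inside each country.
toy = toy.sort_values(['Country', 'Year'])

# Interpolate inside each country only; country A never influences country B.
toy['Alcohol_filled'] = toy.groupby('Country')['Alcohol'].transform(
    lambda s: s.interpolate())

print(toy)
```

Interpolating the whole column at once would instead fill country B's first year from country A's last value, which is exactly the leak that grouping avoids.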

### step 3.4 outlier handling
- covered further below, after the missing values have been filled

In [134]:
# make a copy of the original dataframe, df
df = df_original.copy()

# strip the white space around column names, and replace the spaces in between with '_'
df.columns = df.columns.str.strip().str.replace(' ', '_')

df.describe()

Unnamed: 0,Year,Life_expectancy,Adult_Mortality,infant_deaths,Alcohol,percentage_expenditure,Hepatitis_B,Measles,BMI,under-five_deaths,Polio,Total_expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness__1-19_years,thinness_5-9_years,Income_composition_of_resources,Schooling
count,2938.0,2928.0,2928.0,2938.0,2744.0,2938.0,2385.0,2938.0,2904.0,2938.0,2919.0,2712.0,2919.0,2938.0,2490.0,2286.0,2904.0,2904.0,2771.0,2775.0
mean,2007.51872,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,42.035739,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,9.523867,124.292079,117.926501,4.052413,1987.914858,25.070016,11467.272489,20.044034,160.445548,23.428046,2.49832,23.716912,5.077785,14270.169342,61012100.0,4.420195,4.508882,0.210904,3.35892
min,2000.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,63.1,74.0,0.0,0.8775,4.685343,77.0,0.0,19.3,0.0,78.0,4.26,78.0,0.1,463.935626,195793.2,1.6,1.5,0.493,10.1
50%,2008.0,72.1,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,4.0,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3
75%,2012.0,75.7,228.0,22.0,7.7025,441.534144,97.0,360.25,56.2,28.0,97.0,7.4925,97.0,0.8,5910.806335,7420359.0,7.2,7.2,0.779,14.3
max,2015.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7


In [None]:
# replace nonsense values (rates per 1000 that are >= 1000) with np.nan
for col in ['Adult_Mortality', 'infant_deaths', 'Measles', 'under-five_deaths']:
    df[col] = df[col].where(df[col] < 1000)

# fill nan within countries: interpolate every numeric column country by country
value_cols = [col for col in df.columns if col not in ['Country', 'Year', 'Status']]
df[value_cols] = df.groupby('Country')[value_cols].transform(lambda s: s.interpolate())

# then drop the rows that still contain nan
df = df.dropna()


### potential effects of outliers
- They skew the descriptive statistics of the data. For example, consider one of the most commonly used descriptive statistics—the mean. This value can be severely skewed by outliers.
- Some machine learning models are sensitive to extreme values. In order to get more accurate estimates, we need to eliminate those values from our dataset.
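
As a small illustration of the first point, the sketch below compares the mean and the median of a handful of made-up readings before and after one extreme value is appended. The numbers are invented for the example and are not taken from the dataset.

```python
import numpy as np

# Made-up values for illustration: five ordinary readings and one extreme one.
values = np.array([2.0, 3.0, 3.0, 4.0, 3.0])
with_outlier = np.append(values, 1800.0)

print("mean without outlier  :", values.mean())
print("mean with outlier     :", with_outlier.mean())
print("median without outlier:", np.median(values))
print("median with outlier   :", np.median(with_outlier))
```

The mean moves by two orders of magnitude while the median stays put, which is why outliers deserve attention before any statistic based on the mean is interpreted.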

### step 3.4 outlier handling
- use boxplots to visualize outliers, with whis=1.5
- use winsorization to cap the outliers (the extreme values are replaced by the nearest retained value rather than dropped)

##### Note: 
- I performed two trials of winsorization. The 1st trial uses a fixed threshold for all columns; the 2nd trial uses a flexible threshold for each column, based on the proportion of outliers in that column.
- both trials get rid of all the outliers, but
- the result of the 2nd trial is preferable, since more data points keep their original values.
- it is also possible to transform the outliers, e.g. with a log transformation. I didn't do that at this stage, but as a reminder, I will perform a transformation later if desired.
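
The difference between the two trials can be sketched on a small made-up sample. The fixed limits (0.12, 0.16) are the ones used in trial 1; the sample itself and the variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up sample: 18 ordinary values plus two extreme values at the top.
sample = np.array(list(range(1, 19)) + [200, 300], dtype=float)

# Proportion of points outside the 1.5 * IQR fences.
q25, q75 = np.percentile(sample, [25, 75])
iqr = q75 - q25
lower_frac = np.mean(sample < q25 - 1.5 * iqr)
upper_frac = np.mean(sample > q75 + 1.5 * iqr)

# Trial 1 style: the same fixed limits regardless of the column.
fixed = np.asarray(winsorize(sample, limits=(0.12, 0.16)))
# Trial 2 style: limits equal to the observed outlier proportions.
flexible = np.asarray(winsorize(sample, limits=(lower_frac, upper_frac)))

print("changed by fixed limits   :", int(np.sum(fixed != sample)))
print("changed by flexible limits:", int(np.sum(flexible != sample)))
```

With the flexible limits only the two extreme values are capped, while the fixed limits also overwrite ordinary values at both ends.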

In [None]:
# work on outliers

# life_expectancy: our target
plt.boxplot(df['Life_expectancy'])
plt.title("{} (whis=1.5)".format('Life_expectancy'))
plt.show()

plt.figure(figsize=(15, 20))
for col,index in zip(df.columns.unique(), range(len(df.columns.unique()))):
    if col not in ['Country', 'Year', 'Status', 'Life_expectancy']:
        plt.subplot(6,3,index-3)
        plt.boxplot(df[col])
        plt.title("{} (whis=1.5)".format(col))
plt.show()

In [None]:
# working on outliers
# count outliers under the 1.5 * IQR threshold, expressed as a proportion of all
# records, for later use in winsorization
threshold = 1.5
record_num = len(df)

lower_list = []
upper_list = []
for col in df.columns:
    if col not in ['Country', 'Year', 'Status']:
        q75, q25 = np.percentile(df[col], [75, 25])
        iqr = q75 - q25
        min_val = q25 - (iqr * threshold)
        max_val = q75 + (iqr * threshold)
        # store the proportion of values below the lower bound and above the upper bound
        lower_frac = (df[col] < min_val).sum() / record_num
        upper_frac = (df[col] > max_val).sum() / record_num
        lower_list.append(lower_frac)
        upper_list.append(upper_frac)
        print("{:25s} : proportion of lower outliers is {:.4f},\t proportion of upper outliers is {:.4f}."
              .format(col, lower_frac, upper_frac))
     

In [None]:
# trial 1, perform winsorization, use fixed threshold (0.12, 0.16) for all columns

import scipy.stats.mstats

df_winsor = df.copy()
winsorized_list = []
for col,index in zip(df_winsor.columns.unique(), range(len(df_winsor.columns.unique()))):
    if col not in ['Country', 'Year', 'Status']:
        df_winsor['winsorized_'+ col] = scipy.stats.mstats.winsorize(df_winsor[col], (0.12, 0.16))
        winsorized_list.append('winsorized_'+ col)

        
plt.boxplot(df_winsor['winsorized_'+'Life_expectancy'])
plt.title("{} (whis=1.5)".format('winsorized_'+'Life_expectancy'))
plt.show()

plt.figure(figsize=(15, 20))
for col,index in zip(winsorized_list, range(len(winsorized_list))):
    if col not in ['Country', 'Year', 'Status', 'winsorized_'+'Life_expectancy']:
        plt.subplot(6,3,index)
        plt.boxplot(df_winsor[col])
        plt.title("{} (whis=1.5)".format(col))
plt.show()

df_trial1 = df_winsor.copy()

In [None]:
# trial 2, perform winsorization, use a flexible threshold for each column, based on the proportion of outliers

# import scipy.stats.mstats

df_winsor = df.copy()
winsorized_list = []
for col in df.columns.unique():
    if col not in ['Country', 'Year', 'Status']:
        winsorized_list.append( col)
    
    
for col,index in zip(winsorized_list, range(len(winsorized_list))):
#     if col not in ['Country', 'Year', 'Status']:
    df_winsor['winsorized_'+ col] = scipy.stats.mstats.winsorize(df_winsor[col], (lower_list[index], upper_list[index]))
#         winsorized_list.append('winsorized_'+ col)

        
plt.boxplot(df_winsor['winsorized_'+'Life_expectancy'])
plt.title("{} (whis=1.5)".format('winsorized_'+'Life_expectancy'))
plt.show()

plt.figure(figsize=(15, 20))
for col,index in zip(winsorized_list[1:], range(len(winsorized_list[1:]))):
    plt.subplot(6,3,index+1)
    plt.boxplot(df_winsor['winsorized_'+ col])
    plt.title("{} (whis=1.5)".format('winsorized_'+col))
plt.show()

df_trial2 = df_winsor.copy()

# step 4
Explore the data using univariate and multivariate exploration techniques. You should pay special attention to your target variable. In this regard, your focus should be on finding the relevant variables that may affect life expectancy.

### univariate exploration
- we can use descriptive statistics (i.e. .describe())
- or visualization (i.e. histogram or boxplot) for continuous variables, which is the case in this challenge

In [None]:
# univariate exploration
# visualization of the distributions using histograms

plt.hist(df_winsor['winsorized_'+'Life_expectancy'])
plt.title("{} ".format('winsorized_'+'Life_expectancy'))
plt.show()

plt.figure(figsize=(15, 20))
for col,index in zip(winsorized_list[1:], range(len(winsorized_list[1:]))):
    plt.subplot(6,3,index+1)
    plt.hist(df_winsor['winsorized_'+ col])
    plt.title("{} ".format('winsorized_'+col))
plt.show()


### step 4.1 multivariate exploration
we are focusing on finding the correlations, and the methods include
- scatter plot
- correlation matrix, i.e. .corr()
- heatmap

in this challenge, I will use the scatter plot and the heatmap, because the information in the correlation matrix is covered by the heatmap as well.

In [None]:
# scatter plot to show the correlation between each column and the target variable
# at first glance, Adult_Mortality, HIV/AIDS, Income_composition_of_resources, and Schooling
# have a high correlation with our target variable Life_expectancy

# work on the winsorized part
df_clean = df_winsor.iloc[:,22:].copy()

plt.figure(figsize=(15, 20))
for col,index in zip(df_clean.columns.unique()[1:], range(len(df_clean.columns.unique()[1:]))):
    plt.subplot(6,3,index+1)
    plt.scatter(df_winsor[col], df_winsor['winsorized_'+'Life_expectancy'])
    plt.title("{} - Life_expectancy (target)".format(col))
plt.show()


In [None]:
# calculate the correlation coefficient and illustrate using heatmap
plt.figure(figsize=(15,15))

# draw the heatmap using seaborn.
sns.heatmap(df_clean.corr(), square=True, annot=True, linewidths=.5)
# plt.title("correlation matrix")
plt.show()

### intermediate summary for step 4
- the relationship between the target variable and the other variables can be described by their correlation
- we can use visualization (scatter plot) or descriptive statistics (correlation coefficient) to illustrate the correlation
- in this case study, the variables most strongly correlated with the target variable are Income_composition_of_resources (cc=0.82), HIV/AIDS (cc=-0.79), and Schooling (cc=0.76)
- as a rule of thumb, a correlation coefficient with an absolute value above 0.7 indicates a strong correlation (either positive or negative). So the relevant variables that may affect life expectancy include "Income_composition_of_resources" (cc=0.82), "HIV/AIDS" (cc=-0.79), and "Schooling" (cc=0.76). Note that correlation alone does not establish that these variables cause the change in life expectancy.
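
The ranking described above can be sketched as follows. The data is synthetic and the column names are hypothetical; only the 0.7 rule of thumb comes from the summary.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
schooling = rng.normal(12, 3, n)
hiv = rng.exponential(1.0, n)
toy = pd.DataFrame({
    'life_expectancy': 50 + 2.0 * schooling - 3.0 * hiv + rng.normal(0, 2, n),
    'schooling': schooling,
    'hiv': hiv,
    'unrelated': rng.normal(0, 1, n),
})

# Correlation of every feature with the target, ordered by absolute strength.
cc = toy.corr()['life_expectancy'].drop('life_expectancy')
ranked = cc.reindex(cc.abs().sort_values(ascending=False).index)
print(ranked)

# Keep the features whose |cc| passes the 0.7 rule of thumb.
strong = ranked[ranked.abs() >= 0.7]
print(strong.index.tolist())
```

Sorting by absolute value keeps strong negative correlations (such as the one with HIV/AIDS in the real data) near the top instead of at the bottom of the list.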

# step 5
In the feature engineering step, you need to select a suite of variables that you think would be ideal in the modeling phase. More concretely, you may discard some variables that are highly correlated with other ones or variables that you think are irrelevant to life expectancy.

### ANS:
we can apply a mask on the heatmap to highlight very low and/or very high correlation coefficients
- High cc check. The highly correlated pairs include infant_deaths - under-five_deaths (cc=0.99), thinness__1-19_years - thinness_5-9_years (cc=0.93), percentage_expenditure - GDP (cc=0.92), Income_composition_of_resources - Schooling (cc=0.88), and Polio - Diphtheria (cc=0.84).
- Since highly correlated variables might cause instability in a model, we can get rid of one variable in each pair. I chose to drop the one with the weaker correlation with our target variable, i.e. drop infant_deaths, thinness_5-9_years, GDP, Schooling, and Polio.
- Low cc check. This time, we drop the columns that have a very low correlation with the target variable. The one we would drop is Population (cc=-0.023).
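
The pair-wise selection rule can be sketched as a small function. `drop_redundant` is a hypothetical helper written for this illustration, its 0.7 threshold is an assumption, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def drop_redundant(df, target, threshold=0.7):
    """Hypothetical helper: for each feature pair with |cc| >= threshold,
    drop the member that is less correlated with the target."""
    corr = df.corr().abs()
    features = [c for c in df.columns if c != target]
    dropped = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] >= threshold:
                weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
                dropped.add(weaker)
    return df.drop(columns=sorted(dropped)), sorted(dropped)

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
toy = pd.DataFrame({
    'target': 2 * x + rng.normal(scale=0.5, size=n),
    'x': x,
    'x_noisy_copy': x + rng.normal(scale=0.5, size=n),
    'independent': rng.normal(size=n),
})

reduced, dropped = drop_redundant(toy, 'target')
print(dropped)
print(list(reduced.columns))
```

In the notebook the same decision was made by reading the heatmap by hand. A helper like this only makes the rule explicit and repeatable.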


In [None]:
# calculate the correlation coefficients and illustrate them using heatmaps
plt.figure(figsize=(15,15))
df_corr_matrix = df_clean.corr()
# mask the weak correlations (|cc| < 0.7) among the non-target variables, so only
# the high ones are shown; the first row and column (the target) stay fully visible
weak_mask = np.array(np.abs(df_corr_matrix) < .7)
weak_mask[0, :] = False
weak_mask[:, 0] = False
# draw the heatmap using seaborn.
sns.heatmap(df_corr_matrix, mask=weak_mask, square=True, annot=True, linewidths=.5)
plt.show()


plt.figure(figsize=(15,15))
df_corr_matrix = df_clean.corr()
# mask everything except the low correlations (|cc| <= 0.1)
not_low_mask = np.abs(df_corr_matrix) > .1
# draw the heatmap using seaborn.
sns.heatmap(df_corr_matrix, mask=not_low_mask, square=True, annot=True, linewidths=.5)
plt.show()

In [None]:
# drop the columns and make a copy of new dataframe to work on.

col_list_new = ['winsorized_Life_expectancy', 'winsorized_Adult_Mortality',
       'winsorized_Alcohol', 'winsorized_percentage_expenditure',
       'winsorized_Hepatitis_B', 'winsorized_Measles', 'winsorized_BMI',
       'winsorized_under-five_deaths',
       'winsorized_Total_expenditure', 'winsorized_Diphtheria',
       'winsorized_HIV/AIDS', 
       'winsorized_thinness__1-19_years',
       'winsorized_Income_composition_of_resources']

df_trimmed = df_clean[col_list_new].copy()
df_trimmed

In [None]:
len(df_trimmed.columns.unique())

# step 6
Summarize your findings. One of the most important skills of a data scientist is to convey ideas and findings to nontechnical people using understandable language. In this regard, one of the most effective ways to communicate your ideas is to do it using effective visualization.

### step 6.1 data cleaning
- we filled the missing values using interpolation within countries, then dropped the remaining NaN rows
- then we capped the outliers (whis=1.5) using winsorization
- let's review the data using descriptive statistics and visualization (boxplot)


In [None]:
df_trimmed.describe()

In [None]:
# data visualization
        
plt.boxplot(df_trimmed['winsorized_'+'Life_expectancy'])
plt.title("{} (whis=1.5)".format('Life_expectancy'))
plt.show()

plt.figure(figsize=(15, 20))
for col,index in zip(df_trimmed.columns.unique()[1:], range(len(df_trimmed.columns.unique()[1:]))):
    plt.subplot(6,3,index+1)
    plt.boxplot(df_trimmed[col])
    plt.title("{} (whis=1.5)".format(col))
plt.show()

df_trial2 = df_winsor.copy()

### step 6.2 data exploration
- for univariate analysis, we can plot the distributions using histograms
- note that in some cases normality is required, so we can perform a Box-Cox transformation to make the distributions more normal

- for multivariate analysis, we care about the correlation between each variable and our target variable, and also about the correlation between each pair of non-target variables.
- we can use scatter plots to visualize the correlations
- we can calculate the correlation coefficient (CC) to quantify a correlation and use a heatmap to display it, which is helpful when illustrating the correlation between each pair of the non-target variables.
- we can spot some variables that are highly correlated with the target variable, i.e. "Income_composition_of_resources" (cc=0.82), "HIV/AIDS" (cc=-0.79), "Schooling" (cc=0.76). (please see the analysis in step 4)
- for cases including PCA, we might want to get rid of variables that are highly correlated with other non-target variables, as well as variables that have a very low correlation with the target variable.
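
The effect of a Box-Cox transformation on a skewed variable can be sketched like this. The data is synthetic (log-normal), chosen only because it is strictly positive and right-skewed; it is not drawn from the dataset.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic right-skewed, strictly positive data for illustration only.
rng = np.random.default_rng(2)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox returns the transformed values and the fitted lambda.
transformed, fitted_lambda = boxcox(skewed)

print("skewness before:", skew(skewed))
print("skewness after :", skew(transformed))
print("fitted lambda  :", fitted_lambda)
```

A skewness close to zero after the transformation suggests a more symmetric distribution, although it does not by itself prove normality. The Jarque-Bera and normality tests used later in this notebook are the stricter check.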

In [None]:
# univariate exploration
# visualization of the distributions using histograms

plt.hist(df_trimmed['winsorized_'+'Life_expectancy'])
plt.title("{} ".format('Life_expectancy'))
plt.show()

plt.figure(figsize=(15, 20))
for col,index in zip(df_trimmed.columns.unique()[1:], range(len(df_trimmed.columns.unique()[1:]))):
    plt.subplot(6,3,index+1)
    plt.hist(df_trimmed[col])
    plt.title("{} ".format(col))
plt.show()

In [None]:
# perform normality check
from scipy.stats import jarque_bera
from scipy.stats import normaltest

for col,index in zip(df_trimmed.columns.unique(), range(len(df_trimmed.columns.unique()))):
    jb_stats = jarque_bera(df_trimmed[col])
    norm_stats = normaltest(df_trimmed[col])
    print(col)
    print("Jarque-Bera test statistics is {0} and p value is {1}".format(jb_stats[0], jb_stats[1]))
    print("Normality test statistics is {0} and p value is {1}".format(norm_stats[0], norm_stats[1]))
    print('-------------')



In [None]:
from scipy.stats import boxcox
    
# perform boxcox on a copy of the trimmed data
df_boxcox = df_trimmed.copy()

# boxcox requires all data points to be larger than zero, so set the values that
# are (close to) zero to nan and drop those rows
df_boxcox = df_boxcox.mask(np.abs(df_boxcox) < .001)
df_boxcox = df_boxcox.dropna()

for col,index in zip(df_boxcox.columns.unique(), range(len(df_boxcox.columns.unique()))):
    df_boxcox[col], _ = boxcox(df_boxcox[col])

# display the distribution

plt.hist(df_boxcox['winsorized_'+'Life_expectancy'])
plt.title("{} ".format('Life_expectancy'))
plt.show()

plt.figure(figsize=(15, 20))
for col,index in zip(df_boxcox.columns.unique()[1:], range(len(df_boxcox.columns.unique()[1:]))):
    plt.subplot(6,3,index+1)
    plt.hist(df_boxcox[col])
    plt.title("{} ".format(col))
plt.show()

In [None]:
# multivariate exploration of the variables after boxcox: check the correlation coefficients once again

# calculate the correlation coefficients and illustrate them using heatmaps
plt.figure(figsize=(15,15))
df_corr_matrix = df_boxcox.corr()
# mask the weak correlations (|cc| < 0.7) among the non-target variables, so only
# the high ones are shown; the first row and column (the target) stay fully visible
weak_mask = np.array(np.abs(df_corr_matrix) < .7)
weak_mask[0, :] = False
weak_mask[:, 0] = False
# draw the heatmap using seaborn.
sns.heatmap(df_corr_matrix, mask=weak_mask, square=True, annot=True, linewidths=.5)
plt.show()


plt.figure(figsize=(15,15))
df_corr_matrix = df_boxcox.corr()
# mask everything except the low correlations (|cc| <= 0.1)
not_low_mask = np.abs(df_corr_matrix) > .1
# draw the heatmap using seaborn.
sns.heatmap(df_corr_matrix, mask=not_low_mask, square=True, annot=True, linewidths=.5)
plt.show()

### step 6.3 feature engineering - PCA
- the previous steps ensure that the prerequisites for performing PCA are satisfied
- we have twelve variables before performing PCA, and we can choose the first 8 principal components.
- The choice is based on the eigenvalues and their explained variance ratios. Note that these first 8 principal components together explain 90% of the variance.
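
Choosing the number of components from the cumulative explained variance can be sketched with NumPy alone. The data below is synthetic, and the 90% target mirrors the figure quoted above; this is an illustration of the selection rule, not the notebook's PCA cell.

```python
import numpy as np

# Synthetic data for illustration only: 6 features driven by 2 hidden factors.
rng = np.random.default_rng(3)
factors = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + 0.1 * rng.normal(size=(300, 6))

# Standardize each column to mean 0 and standard deviation 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix, largest first.
eig_vals = np.linalg.eigvalsh(np.cov(X.T))[::-1]
explained_ratio = eig_vals / eig_vals.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 90%.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3))
print(n_components)
```

`np.linalg.eigvalsh` is used here because a covariance matrix is symmetric, so its eigenvalues come back real and sorted.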

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# perform PCA (the prerequisites for PCA are satisfied)

# Normalize the data so that all variables have a mean of 0 and standard deviation
# of 1.
X = StandardScaler().fit_transform(df_boxcox.iloc[:,1:])
# np.shape(X)

sklearn_pca = PCA(n_components=12)
Y_sklearn = sklearn_pca.fit_transform(X)

print(
    'The percentage of total variance in the dataset explained by each',
    'component from Sklearn PCA.\n',
    sklearn_pca.explained_variance_ratio_
)



In [None]:
# Calculating eigenvalues and eigenvectors.
Cx = np.cov(X.T)
eig_val_cov, eig_vec_cov = np.linalg.eig(Cx)
# sort
idx = eig_val_cov.argsort()[::-1]   
eig_val_cov = eig_val_cov[idx]
eig_vec_cov = eig_vec_cov[:,idx]

# Inspecting the eigenvalues and eigenvectors.
# for i in range(len(eig_val_cov)):
#     eigvec_cov = eig_vec_cov[:, i].T
#     print('Eigenvector {}: \n{}'.format(i + 1, eigvec_cov))
#     print('Eigenvalue {}: {}'.format(i + 1, eig_val_cov[i]))
#     print(40 * '-')

print(
    'The percentage of total variance in the dataset explained by each',
    'component calculated by hand.\n',
    eig_val_cov / sum(eig_val_cov)
)

plt.figure(figsize=(8,8))
# top panel: eigenvalues in descending order (scree plot)
plt.subplot(2,1,1)
plt.plot(eig_val_cov)
# bottom panel: cumulative explained variance ratio, plotted once
plt.subplot(2,1,2)
eigen_sum_list = np.cumsum(sklearn_pca.explained_variance_ratio_)
plt.plot(eigen_sum_list)

plt.show()

### step 6.4 one more point
the Box-Cox transformation was performed after we handled the outliers, but note that a Box-Cox transformation changes the distribution, and therefore which points count as outliers. If we had performed the Box-Cox transformation before the outlier handling rather than after, the result would be different.
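
A rough way to see this effect is to count the outliers under both orders on synthetic data. The log-normal sample, the `count_outliers` helper, and the 5% winsorization limits are all assumptions made for this sketch.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.stats.mstats import winsorize

def count_outliers(values, whis=1.5):
    """Count points outside the whis * IQR fences (hypothetical helper)."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return int(np.sum((values < q25 - whis * iqr) | (values > q75 + whis * iqr)))

# Synthetic right-skewed, strictly positive data for illustration only.
rng = np.random.default_rng(4)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Order used in this notebook: winsorize first, then Box-Cox.
capped = np.asarray(winsorize(raw, limits=(0.05, 0.05)))
winsor_then_boxcox, _ = boxcox(capped)

# Alternative order: Box-Cox first, then see what is still an outlier.
boxcox_first, _ = boxcox(raw)

print("outliers in raw data         :", count_outliers(raw))
print("outliers after Box-Cox alone :", count_outliers(boxcox_first))
print("outliers, winsorize + Box-Cox:", count_outliers(winsor_then_boxcox))
```

On skewed data like this, the transformation alone typically removes most of the points flagged by the IQR rule, so transforming first would leave far fewer values to winsorize.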