# Part 2: Life Expectancy vs World Happiness Level

## Load in Data
For this section of the project, we will compare average life expectancy with World Happiness Index scores for each continent to see whether they correlate. Let's begin by loading the CSV files.

In [32]:
!pip install pandas
!pip install plotly
!pip install -q folium mapclassify
!pip install pycountry-convert
%matplotlib inline

import geopandas as gpd
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import pycountry_convert as pc

# read csvs for world happiness and life expectancy
world_happiness = pd.read_csv("world_happiness.csv")
life_expectancy = pd.read_csv("life_expectancy.csv")



Great! Now let's display the world happiness and life expectancy datasets to see what we need to fix.

In [33]:
# drop the Rank column as it is not needed for analysis
world_happiness = world_happiness.drop(columns='Rank')

# show dataframe
world_happiness.head()

Unnamed: 0,Country,Year,Index
0,Afghanistan,2013,4.04
1,Afghanistan,2015,3.575
2,Afghanistan,2016,3.36
3,Afghanistan,2017,3.794
4,Afghanistan,2018,3.632


In [34]:
life_expectancy.head()

Unnamed: 0,Entity,Code,Year,Period life expectancy at birth - Sex: all - Age: 0
0,Afghanistan,AFG,1950,27.7275
1,Afghanistan,AFG,1951,27.9634
2,Afghanistan,AFG,1952,28.4456
3,Afghanistan,AFG,1953,28.9304
4,Afghanistan,AFG,1954,29.2258


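The "Rank" column was already dropped from the world happiness data above. The life expectancy data still has a "Code" column that we don't need, so we will drop it too. We will also filter the life expectancy data to only include years 2013 and later.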

In [35]:
# filter the life expectancy data for only years 2013 and beyond
filtered_life_expectancy = life_expectancy.loc[life_expectancy['Year']>=2013]

# drop code column as it is not needed
filtered_life_expectancy = filtered_life_expectancy.drop(columns='Code')
filtered_life_expectancy.head()

Unnamed: 0,Entity,Year,Period life expectancy at birth - Sex: all - Age: 0
63,Afghanistan,2013,62.4167
64,Afghanistan,2014,62.5451
65,Afghanistan,2015,62.6587
66,Afghanistan,2016,63.1361
67,Afghanistan,2017,63.016


## Merge Life Expectancy and World Happiness Datasets

Now that all our data is cleaned up, let's merge the two datasets together. We will use the pandas .merge() function to combine the datasets. They will be merged on Country and Year to ensure the life expectancies line up with the country for a given year.

In [36]:
# merge both datasets
merged_data = pd.merge(world_happiness, filtered_life_expectancy, left_on=['Country', 'Year'], right_on=['Entity','Year'])

# drop entity column as we have country
merged_data = merged_data.drop(columns='Entity')
merged_data.head()

Unnamed: 0,Country,Year,Index,Period life expectancy at birth - Sex: all - Age: 0
0,Afghanistan,2013,4.04,62.4167
1,Afghanistan,2015,3.575,62.6587
2,Afghanistan,2016,3.36,63.1361
3,Afghanistan,2017,3.794,63.016
4,Afghanistan,2018,3.632,63.081
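pd.merge() performs an inner join by default, so any country whose name is spelled differently in the two files is silently dropped. The sketch below shows one way to list such countries with the `indicator` option. The toy frames, their values, and the shortened "Life expectancy" column name are made up for illustration and are not taken from the project's CSV files.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are illustrative only)
happiness = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Czechia"],
    "Year": [2013, 2015, 2015],
    "Index": [4.04, 3.575, 6.5],
})
life = pd.DataFrame({
    "Entity": ["Afghanistan", "Afghanistan", "Czech Republic"],
    "Year": [2013, 2015, 2015],
    "Life expectancy": [62.4, 62.7, 78.6],
})

# A left join with indicator=True keeps every happiness row and adds a
# "_merge" column that says whether a life expectancy match was found
check = pd.merge(
    happiness,
    life,
    how="left",
    left_on=["Country", "Year"],
    right_on=["Entity", "Year"],
    indicator=True,
)

# Countries that an inner join would have dropped
unmatched = check.loc[check["_merge"] == "left_only", "Country"].unique().tolist()
print(unmatched)
```

If the list is not empty, the names can be aligned (for example with `.replace()`) before the real merge.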


Next, we convert the countries to continents for easier analysis. This code was written with help from this thread: https://stackoverflow.com/questions/55910004/get-continent-name-from-country-using-pycountry.

The country_to_continent function below originally didn't recognize Kosovo, which appears in the CSV. To work around this, the function returns Europe directly when the country is Kosovo.

After creating the function, we will use this on our merged data using .apply() to create a new column with the Continent name for the Country.

In [38]:
# converts countries to their continent
def country_to_continent(country):
    # pycountry_convert does not recognize Kosovo, so handle it explicitly
    if country == "Kosovo":
        return "Europe"
    try:
        # country name -> ISO alpha-2 code -> continent code -> continent name
        country_code = pc.country_name_to_country_alpha2(country)
        continent_code = pc.country_alpha2_to_continent_code(country_code)
        return pc.convert_continent_code_to_continent_name(continent_code)
    except KeyError:
        # pycountry_convert raises KeyError for names or codes it cannot map
        return None
    
# use this on merged dataset
merged_data['Continent'] = merged_data['Country'].apply(country_to_continent)
merged_data.head()

Unnamed: 0,Country,Year,Index,Period life expectancy at birth - Sex: all - Age: 0,Continent
0,Afghanistan,2013,4.04,62.4167,Asia
1,Afghanistan,2015,3.575,62.6587,Asia
2,Afghanistan,2016,3.36,63.1361,Asia
3,Afghanistan,2017,3.794,63.016,Asia
4,Afghanistan,2018,3.632,63.081,Asia


## Testing

We checked the function with assertion tests: if every assertion passes, the cell runs without raising an error. The expected continents are based on the geographical location of each country.

In [22]:
# assertion tests
assert country_to_continent("Germany") == "Europe"
assert country_to_continent("United States") == "North America"
assert country_to_continent("Kosovo") == "Europe"
assert country_to_continent("Brazil") == "South America"
assert country_to_continent("Australia") == "Oceania"
assert country_to_continent("Afghanistan") == "Asia"
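Beyond spot checks on a few countries, it can help to confirm that every country in the data received a continent. The sketch below assumes a lookup that returns None for names it cannot map. `TOY_LOOKUP` and `toy_country_to_continent` are hypothetical stand-ins so the example runs without pycountry_convert; they are not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for country_to_continent (not the real function)
TOY_LOOKUP = {"Germany": "Europe", "Brazil": "South America", "Kosovo": "Europe"}

def toy_country_to_continent(country):
    # dict.get returns None when the country is missing from the lookup
    return TOY_LOOKUP.get(country)

# "Atlantis" is a deliberately unmappable name for the demonstration
toy = pd.DataFrame({"Country": ["Germany", "Brazil", "Kosovo", "Atlantis"]})
toy["Continent"] = toy["Country"].apply(toy_country_to_continent)

# Collect every country whose continent is missing
unmapped = toy.loc[toy["Continent"].isna(), "Country"].unique().tolist()
print(unmapped)
```

On the real data, the same `isna()` filter on merged_data['Continent'] would show whether any rows would be left out of the continent averages.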

## Creating the World Happiness vs. Life Expectancy Visualizations

Now we will average the life expectancies for each continent in each year. Because merged_data has one row per country per year, we need to collapse it to a single value per continent for each year from 2013 to 2021. Drop the Country column, group the data by Continent and Year, and use .mean() to find the average life expectancy for each group.

In [41]:
# average all life expectancies for a year for a continent
# errors='ignore' keeps this cell safe to re-run after Country has already been dropped
merged_data = merged_data.drop(columns='Country', errors='ignore')
avg_life_exp = merged_data.groupby(['Continent', 'Year'])['Period life expectancy at birth - Sex: all - Age: 0'].mean()
avg_life_exp.head()

Repeat for the world happiness indexes.

In [27]:
# average all happiness indexes for a year for a continent
avg_happiness = merged_data.groupby(['Continent', 'Year'])['Index'].mean()
avg_happiness.head()

Continent  Year
Africa     2013    4.420450
           2015    4.283300
           2016    4.262333
           2017    4.238707
           2018    4.274561
Name: Index, dtype: float64

Now we will merge the averaged life expectancy and world happiness index data on Continent and Year so their values sit side by side for the visualizations that follow. Resetting the index turns Continent and Year back into regular columns, which the plotting code refers to by name.

In [29]:
# merge two averaged datasets on continent and year
new_merged_data = pd.merge(avg_life_exp, avg_happiness, on=['Continent', 'Year'])

# reset index for proper formatting
new_merged_data = new_merged_data.reset_index()
#new_merged_data.to_csv('happiness_vs_life_expectancy.csv')
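As an alternative to two separate groupby calls followed by a merge, both averages can be computed in one step. This is only a sketch on made-up numbers; `LIFE_COL` is a shortened, assumed name standing in for the long life expectancy column.

```python
import pandas as pd

LIFE_COL = "Life expectancy"  # assumed short name for the long column

# Toy data: one row per country per year (values are illustrative only)
toy = pd.DataFrame({
    "Continent": ["Africa", "Africa", "Africa", "Asia"],
    "Year": [2013, 2013, 2015, 2013],
    "Index": [4.0, 5.0, 4.5, 6.0],
    LIFE_COL: [60.0, 62.0, 61.0, 70.0],
})

# as_index=False keeps Continent and Year as regular columns,
# so no merge or reset_index is needed afterwards
averages = toy.groupby(["Continent", "Year"], as_index=False)[[LIFE_COL, "Index"]].mean()
print(averages)
```

The result has one row per continent and year with both averages next to each other, which is the same shape as new_merged_data.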

Using Plotly, we will first graph the average life expectancy for each continent for each year from 2013 to 2021. The x-axis is the year, the y-axis is the average life expectancy for that year, and the color is the continent, so every continent's line appears on one graph for an initial comparison. We show the plot by writing it to an HTML file (.show() did not work in the Jupyter notebook).

In [30]:
# create visualization plotting year with life expectancy for all continents with plot.ly
fig_life = px.line(new_merged_data, x = 'Year' ,
              y = 'Period life expectancy at birth - Sex: all - Age: 0',
              color = 'Continent',
              title = 'Life Expectancy Trends for Each Continent from 2013 - 2021')

# show plot by writing it to html
fig_life.write_html('life_plot.html')

Repeat the plot, but replace the y variable with the world happiness index. We want to see the index trends for each continent so that we can compare life expectancy and world happiness index in the statistical analysis section that follows.

In [13]:
# plot year with happiness index for all continents
fig_life2 = px.line(new_merged_data, x = 'Year' ,
              y = 'Index',
              color = 'Continent',
              title = 'World Happiness Index Trends for Each Continent from 2013 - 2021')

fig_life2.write_html('life_plot2.html')

## Pearson Correlation Coefficient Statistical Analysis

One of our challenge goals was testing result validity. We chose to do this in the form of a correlation analysis.

Now let's analyze the relationship between world happiness index and life expectancy for each continent overall. We picked the Pearson correlation coefficient because it measures the strength of the linear relationship between two variables. A positive value means the two variables tend to increase together, while a negative value means one tends to decrease as the other increases. The coefficient ranges from -1 to 1, and a stronger correlation will be closer to 1 or -1.
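For reference, the coefficient can be computed by hand as the covariance of the two variables divided by the product of their standard deviations. The plain-Python sketch below illustrates this on made-up numbers; the helper name `pearson_r` is illustrative and does not appear elsewhere in this notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Two variables that rise together, and two that move in opposite directions
rising = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])
falling = pearson_r([1, 2, 3, 4], [40, 30, 20, 10])
print(rising, falling)
```

The pandas call `.corr(method='pearson')` computes the same quantity for every pair of columns in a dataframe.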

In [14]:
# statistical analysis - pearson correlation coefficient for all continents

# africa
# filter new_merged_data with only rows where Continent is Africa
africa_life = new_merged_data[new_merged_data['Continent']=='Africa']

# drop columns Continent and Year for calculating pearson correlation coefficient and rename table
africa_corr = africa_life.drop(columns=['Continent','Year'])
africa_corr.corr(method='pearson')

Unnamed: 0,Period life expectancy at birth - Sex: all - Age: 0,Index
Period life expectancy at birth - Sex: all - Age: 0,1.0,-0.185191
Index,-0.185191,1.0


In [15]:
# asia
asia_life = new_merged_data[new_merged_data['Continent']=='Asia']
asia_corr = asia_life.drop(columns=['Continent','Year'])
asia_corr.corr(method='pearson')

Unnamed: 0,Period life expectancy at birth - Sex: all - Age: 0,Index
Period life expectancy at birth - Sex: all - Age: 0,1.0,-0.644533
Index,-0.644533,1.0


In [16]:
# europe
europe_life = new_merged_data[new_merged_data['Continent']=='Europe']
europe_corr = europe_life.drop(columns=['Continent','Year'])
europe_corr.corr(method='pearson')

Unnamed: 0,Period life expectancy at birth - Sex: all - Age: 0,Index
Period life expectancy at birth - Sex: all - Age: 0,1.0,-0.176369
Index,-0.176369,1.0


In [17]:
# north america
north_life = new_merged_data[new_merged_data['Continent']=='North America']
north_corr = north_life.drop(columns=['Continent','Year'])
north_corr.corr(method='pearson')

Unnamed: 0,Period life expectancy at birth - Sex: all - Age: 0,Index
Period life expectancy at birth - Sex: all - Age: 0,1.0,-0.168439
Index,-0.168439,1.0


In [18]:
# oceania
oceania_life = new_merged_data[new_merged_data['Continent']=='Oceania']
oceania_corr = oceania_life.drop(columns=['Continent','Year'])
oceania_corr.corr(method='pearson')

Unnamed: 0,Period life expectancy at birth - Sex: all - Age: 0,Index
Period life expectancy at birth - Sex: all - Age: 0,1.0,-0.695444
Index,-0.695444,1.0


In [19]:
# south america
south_life = new_merged_data[new_merged_data['Continent']=='South America']
south_corr = south_life.drop(columns=['Continent','Year'])
south_corr.corr(method='pearson')

Unnamed: 0,Period life expectancy at birth - Sex: all - Age: 0,Index
Period life expectancy at birth - Sex: all - Age: 0,1.0,0.254417
Index,0.254417,1.0
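The six cells above repeat the same steps for each continent. A grouped version could produce all coefficients in one table. This is a sketch on toy data, with `LIFE_COL` as an assumed short name for the life expectancy column.

```python
import pandas as pd

LIFE_COL = "Life expectancy"  # assumed short name for the long column

# Toy continent-level averages (values are illustrative only)
toy = pd.DataFrame({
    "Continent": ["Africa"] * 3 + ["Asia"] * 3,
    "Year": [2013, 2015, 2016] * 2,
    LIFE_COL: [60.0, 61.0, 62.0, 70.0, 71.0, 72.0],
    "Index": [4.0, 5.0, 6.0, 6.0, 5.0, 4.0],
})

# For each continent, correlate life expectancy with the happiness index.
# Series.corr uses the Pearson method by default.
by_continent = (
    toy.groupby("Continent")[[LIFE_COL, "Index"]]
       .apply(lambda g: g[LIFE_COL].corr(g["Index"]))
)
print(by_continent)
```

Each coefficient here rests on only a few yearly points per continent, so small changes in the data can move it noticeably.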


Plot one continent's data more closely to see how its trends relate to its Pearson correlation coefficient.

In [20]:
# plot one data set more closely to see pearson correlation coefficient
# this is life expectancy trends for the continent of africa
fig_africa_life = px.line(africa_life, x = 'Year' ,
              y = 'Period life expectancy at birth - Sex: all - Age: 0',
              color = 'Continent',
              title = 'Life Expectancy Trends for Africa from 2013 - 2021')

fig_africa_life.write_html('africa_life_plot.html')

In [21]:
# plot one data set more closely to see pearson correlation coefficient
# this is world happiness index trends for the continent of africa
fig_africa_life2 = px.line(africa_life, x = 'Year' ,
              y = 'Index',
              color = 'Continent',
              title = 'World Happiness Index Trends for Africa from 2013 - 2021')

fig_africa_life2.write_html('africa_life_plot2.html')

## Result

## Impact and Limitations

## Plan Evaluation for Part 2