# What Makes Us Smile: World Happiness Data at a Glance

What is the factor that most influences the level of happiness?

  - **Hypothesis**: GDP is positively related to happiness level.
  
  - **Null hypothesis**: GDP is not related to the happiness level.

*Background: The happiness score ranking uses data from the Gallup World Poll. The scores are based on answers to questions about different factors, including GDP, freedom, family, social support, and others. The dataset covers over 150 countries for the years 2015 to 2020. For our analysis, we use the year 2020.*

source: https://www.kaggle.com/mathurinache/world-happiness-report?select=2015.csv

### 1. Cleaning process

In [None]:
# Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import os
import csv
import seaborn as sns

In [None]:
#The path to the CSV file
file_2020 = "Resources/2020.csv"

In [None]:
# Read the file
df_2020 = pd.read_csv(file_2020)

In [None]:
# Preview the first rows of the data
df_2020.head()

In [None]:
# Rename the ladder score and logged GDP columns
clean_df_2020 = df_2020.rename(columns={"Ladder score":"Happiness Score", "Logged GDP per capita":"GDP per capita"})

# Drop unnecessary columns
clean_df_2020 = clean_df_2020[["Country name", "Regional indicator", "Happiness Score", "GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity"]]

In [None]:
clean_df_2020

In [None]:
# Narrow down the range of countries to the five regions analyzed below
regions = ["Western Europe", "Latin America and Caribbean", "Middle East and North Africa",
           "Central and Eastern Europe", "North America and ANZ"]
region_df_2020 = clean_df_2020.loc[clean_df_2020["Regional indicator"].isin(regions)]

# Reset the index
region_df_2020 = region_df_2020.reset_index(drop=True)

In [None]:
# Western Europe df
we_df = region_df_2020.loc[(region_df_2020["Regional indicator"]=="Western Europe")]

In [None]:
# Latin America and Caribbean (taken from region_df_2020, like the other regional frames)
lac_df = region_df_2020.loc[(region_df_2020["Regional indicator"]=="Latin America and Caribbean")]

In [None]:
# Middle East and North Africa
mena_df = region_df_2020.loc[(region_df_2020["Regional indicator"]=="Middle East and North Africa")]

In [None]:
# Central and Eastern Europe
cee_df = region_df_2020.loc[(region_df_2020["Regional indicator"]=="Central and Eastern Europe")]

In [None]:
# North America and ANZ
na_df = region_df_2020.loc[(region_df_2020["Regional indicator"]=="North America and ANZ")]

In [None]:
# Look for null values
region_df_2020.count()

### 2. Data visualization

In [None]:
#Distribution of the data

happiness_rate = region_df_2020["Happiness Score"]
plt.hist(happiness_rate, color="gold")
plt.xlabel('Happiness Score')
plt.show()

In [None]:
# Describe the data
region_df_2020.describe()

In [None]:
# Western Europe
we_df.describe()

In [None]:
# Latin American and Caribbean
lac_df.describe()

In [None]:
# Middle East and North Africa
mena_df.describe()

In [None]:
# Central and Eastern Europe
cee_df.describe()

In [None]:
# North America and ANZ
na_df.describe()

### 1. Do the countries of the same region have the same level of happiness?

Using bar graphs for each region, we compared the happiness scores of the countries within it. Based on the data, happiness level varies between countries within every region. We did, however, find that some regions vary more than others: Latin America and Caribbean, as well as Central and Eastern Europe, have multiple countries with similar happiness levels. By contrast, Western Europe and Middle East and North Africa each have very few countries with similar levels of happiness.
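The visual comparison above can be backed by a number. The sketch below is only an illustration: it computes the mean and sample standard deviation of the happiness score per region in plain Python. The region names and scores are hypothetical placeholders, not values from the 2020 file, and `spread_by_region` is a helper name introduced here.

```python
from statistics import mean, stdev

# Hypothetical (region, happiness score) pairs, NOT values from the 2020 file
toy_scores = [
    ("Region A", 7.1), ("Region A", 7.0), ("Region A", 6.9),
    ("Region B", 7.5), ("Region B", 6.0), ("Region B", 4.5),
]

def spread_by_region(rows):
    """Return {region: (mean, sample standard deviation)} of the scores."""
    grouped = {}
    for region, score in rows:
        grouped.setdefault(region, []).append(score)
    return {region: (mean(vals), stdev(vals)) for region, vals in grouped.items()}

summary = spread_by_region(toy_scores)
for region, (avg, sd) in summary.items():
    # A larger standard deviation means the countries in the region differ more
    print(f"{region}: mean={avg:.2f}, sd={sd:.2f}")
```

With the notebook's data, the same numbers should be available from `region_df_2020.groupby("Regional indicator")["Happiness Score"].agg(["mean", "std"])`.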

In [None]:
# Create the variables for the plot

# WE Region
rate_we = we_df["Happiness Score"].values
country_name_we = we_df["Country name"].values
# LAC Region
rate_lac = lac_df["Happiness Score"].values
country_name_lac = lac_df["Country name"].values
# MENA Region
rate_mena = mena_df["Happiness Score"].values
country_name_mena = mena_df["Country name"].values
# CEE Region
rate_cee = cee_df["Happiness Score"].values
country_name_cee = cee_df["Country name"].values
# NA REGION
rate_na = na_df["Happiness Score"].values
country_name_na = na_df["Country name"].values

In [None]:
# Western Europe vs Latin American and Caribbean
plt.subplot(1, 2, 1)
happiness_we = plt.barh(country_name_we,rate_we ,color="steelblue", linewidth=1)
plt.title("Western Europe")
plt.subplot(1, 2, 2)
happiness_lac = plt.barh(country_name_lac,rate_lac ,color="darkred", linewidth=1)
plt.title("Latin America and Caribbean")
plt.tight_layout()
plt.savefig("Images/we_vs_lac.png")

The happiness score is different among the countries in each region.

In [None]:
# Middle East and North Africa vs Central and Eastern Europe
plt.subplot(1, 2, 1)
happiness_mena = plt.barh(country_name_mena,rate_mena ,color="teal", linewidth=1)
plt.title("Middle East and North Africa")
plt.subplot(1, 2, 2)
happiness_cee = plt.barh(country_name_cee,rate_cee ,color="salmon", linewidth=1)
plt.title("Central and Eastern Europe")
plt.tight_layout()
plt.savefig("Images/mena_vs_cee.png")

In [None]:
# Western Europe
happiness_we = plt.barh(country_name_we,rate_we ,color="steelblue", linewidth=5)
plt.title("Western Europe")
plt.xlabel("Happiness Score")
plt.savefig("Images/we_happiness_score.png")

In [None]:
# Latin American and Caribbean
happiness_lac = plt.barh(country_name_lac,rate_lac ,color="darkred", linewidth=1)
plt.title("Latin America and Caribbean")
plt.xlabel("Happiness Score")
plt.savefig("Images/lac_happiness_score.png")

In [None]:
# Middle East and North Africa
happiness_mena = plt.barh(country_name_mena,rate_mena ,color="teal", linewidth=1)
plt.title("Middle East and North Africa")
plt.xlabel("Happiness Score")
plt.savefig("Images/mena_happiness_score.png")

In [None]:
# Central and Eastern Europe
happiness_cee = plt.barh(country_name_cee,rate_cee ,color="salmon", linewidth=1)
plt.title("Central and Eastern Europe")
plt.xlabel("Happiness Score")
plt.savefig("Images/cee_happiness_score.png")

In [None]:
# North America and ANZ
happiness_na = plt.barh(country_name_na,rate_na ,color="mediumseagreen", linewidth=1)
plt.title("North America and ANZ")
plt.xlabel("Happiness Score")
plt.savefig("Images/na_happiness_score.png")

### 2. Do all of the variables have an equal effect on the happiness level of regions?

We used heat maps to show the correlation between the variables (GDP, social support, healthy life expectancy, freedom to make life choices, and generosity) and the overall happiness level. The first heat map, which covers all regions together, showed a positive and broadly similar correlation between happiness level and four of the variables. Generosity, however, showed a very weak correlation with happiness level.


Breaking the data down region by region shows interesting differences. In two of our regions (Latin America and Caribbean, and Central and Eastern Europe) we found a negative correlation between generosity and happiness level. These two regions also had the lowest GDP correlations of the regions we measured. Finally, North America and ANZ showed a strong negative correlation between happiness level and GDP.
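Each cell of the heat maps is a Pearson correlation coefficient. As a reminder of what that number measures, here is a minimal plain-Python sketch of the formula. The input values are hypothetical and chosen so that the result is easy to check by eye, and `pearson_r` is a helper name introduced for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for illustration only
gdp = [9.0, 9.5, 10.0, 10.5]
happiness_up = [5.0, 5.5, 6.0, 6.5]    # rises in step with GDP
happiness_down = [6.5, 6.0, 5.5, 5.0]  # falls in step with GDP

print(pearson_r(gdp, happiness_up))    # close to +1
print(pearson_r(gdp, happiness_down))  # close to -1
```

A coefficient near +1 or -1 indicates a strong linear relationship, and a value near 0 a weak one. Coefficients computed from only a handful of countries should be read with caution.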

In [None]:
# All regions
corr_world = region_df_2020.select_dtypes(include="number").corr(method="pearson", min_periods=80)
sns.heatmap(corr_world, annot=True, cmap ='RdBu', vmin=-1, vmax=1)
plt.title("Variables vs Happiness Rate (All Regions)")
plt.savefig("Images/correlation_all_regions.png")

In [None]:
# Correlation for Western Europe
corr_we = we_df.select_dtypes(include="number").corr()
sns.heatmap(corr_we, annot=True, cmap ='RdBu', vmin=-1, vmax=1)
plt.title("Variables vs Happiness Rate (WE)")
plt.savefig("Images/correlation_we.png")

In [None]:
# Correlation for Latin American and Caribbean data frame
corr_lac = lac_df.select_dtypes(include="number").corr(method="pearson", min_periods=21)
sns.heatmap(corr_lac, annot=True, cmap ='RdBu', vmin=-1, vmax=1)
plt.title("Variables vs Happiness Rate (LAC)")
plt.savefig("Images/correlation_lac.png");

In [None]:
# Correlation for Middle East and North Africa
corr_mena = mena_df.select_dtypes(include="number").corr(method="pearson", min_periods=17)
sns.heatmap(corr_mena, annot=True, cmap ='RdBu', vmin=-1, vmax=1)
plt.title("Variables vs Happiness Rate (MENA)")
plt.savefig("Images/correlation_mena.png");

In [None]:
# Correlation for Central and Eastern Europe
corr_cee = cee_df.select_dtypes(include="number").corr(method="pearson", min_periods=17)
sns.heatmap(corr_cee, annot=True, cmap ='RdBu', vmin=-1, vmax=1)
plt.title("Variables vs Happiness Rate (CEE)")
plt.savefig("Images/correlation_cee.png");

In [None]:
# Correlation for North America and ANZ
corr_na = na_df.select_dtypes(include="number").corr(method="pearson", min_periods=4)
sns.heatmap(corr_na, annot=True, cmap ='RdBu', vmin=-1, vmax=1)
plt.title("Variables vs Happiness Rate (NA)")
plt.savefig("Images/correlation_na.png");

### 3. How do the countries with the highest GDP compare in terms of happiness level versus the countries with the lowest GDP?

We used a box plot to compare the happiness level of higher-GDP countries to the happiness level of lower-GDP countries. We found that countries with the highest GDP levels have a significantly higher level of happiness than countries with the lowest GDP levels. The difference was so significant that approximately 75% of countries with the lowest GDPs had happiness levels lower than what would be considered an outlier on the low side of the highest-GDP countries.
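The outlier comparison above relies on the conventional 1.5 × IQR rule, which is also matplotlib's default whisker setting. The sketch below shows, under that assumption, how the lower fence of one group can be computed and how the share of another group falling below it can be measured. All scores are hypothetical, and `lower_fence` and `share_below` are helper names introduced here.

```python
from statistics import quantiles

def lower_fence(values):
    """Lower outlier fence: Q1 - 1.5 * IQR, with linearly interpolated quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1 - 1.5 * (q3 - q1)

def share_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)

# Hypothetical happiness scores, NOT taken from the dataset
high_gdp_scores = [7.0, 7.1, 7.2, 7.3, 7.4]
low_gdp_scores = [5.0, 5.5, 6.0, 7.0]

fence = lower_fence(high_gdp_scores)
print(f"Lower fence of the high-GDP group: {fence:.2f}")
print(f"Share of low-GDP scores below it: {share_below(low_gdp_scores, fence):.0%}")
```

In the notebook, the two inputs would be `highest_gdp["Happiness Score"]` and `lowest_gdp["Happiness Score"]`.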

In [None]:
# Select ten countries with the highest GDP in all regions 
highest_gdp = region_df_2020.sort_values("GDP per capita", ascending=False)
highest_gdp = highest_gdp.head(10)
highest_gdp

In [None]:
# Select ten countries with the lowest GDP in all regions 
lowest_gdp = region_df_2020.sort_values("GDP per capita")
lowest_gdp = lowest_gdp.head(10)
lowest_gdp

In [None]:
# Plot the results in the same chart
f,ax = plt.subplots(1, 2, sharey=True, figsize=(5, 5))
ax[0].boxplot(highest_gdp["Happiness Score"], labels=["Countries with High GDP"])
ax[1].boxplot(lowest_gdp["Happiness Score"], labels=["Countries with Low GDP"])
ax[0].grid()
ax[1].grid()
f.suptitle("High GDP vs Low GDP")
plt.savefig("Images/high_gdp_low_gdp.png");

### 4. What portion of high-GDP countries have low happiness levels?

We used a pie chart to visualize the percentage of high-GDP countries with low happiness levels. We found that, out of all of the countries listed, only 10% had both high GDP and low happiness. Out of the high-GDP countries, only 18% had low happiness levels. Only eight countries with high GDP are considered low-happiness countries.


Note: 
- Countries with a GDP under the average are considered low-GDP countries; all others are considered high-GDP countries.
- Countries with a happiness score under the average are considered low-happiness countries; all others are considered high-happiness countries.
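The two rules in the note can be sketched as a small classifier. This is an illustration with hypothetical countries and values, not the notebook's data. `classify` is a helper name introduced here, and values equal to the mean are treated as "high", matching the `>=` comparison used in the filters.

```python
from statistics import mean

# Hypothetical (country, GDP per capita, happiness score) rows
toy = [("A", 11.0, 7.5), ("B", 10.5, 5.0), ("C", 9.0, 6.8), ("D", 8.5, 4.7)]

def classify(rows):
    """Label each country as high/low GDP and high/low happiness relative to the means."""
    mean_gdp = mean(r[1] for r in rows)
    mean_happy = mean(r[2] for r in rows)
    labels = {}
    for name, gdp, happy in rows:
        gdp_label = "high-GDP" if gdp >= mean_gdp else "low-GDP"
        happy_label = "high-happiness" if happy >= mean_happy else "low-happiness"
        labels[name] = (gdp_label, happy_label)
    return labels

labels = classify(toy)
high_gdp = [c for c, (g, _) in labels.items() if g == "high-GDP"]
high_gdp_unhappy = [c for c in high_gdp if labels[c][1] == "low-happiness"]
print(labels)
print(f"{len(high_gdp_unhappy) / len(high_gdp):.0%} of high-GDP countries have low happiness")
```

Because both thresholds are means of the filtered data, the labels change whenever the set of regions changes.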

In [None]:
# Establish a dynamic mean for GDP
mean_gdp = region_df_2020['GDP per capita'].mean()

In [None]:
# Establish a dynamic mean for happiness rate
mean_happiness = region_df_2020['Happiness Score'].mean()

In [None]:
# Filter the countries which have a high GDP but low Happiness Score
question_4 = region_df_2020.loc[(region_df_2020['GDP per capita']>=mean_gdp) & (region_df_2020['Happiness Score']<mean_happiness)]
question_4

In [None]:
# High GDP countries with low happiness levels

ru_region_group = question_4.groupby("Regional indicator")
ru_region_distribution = ru_region_group["Regional indicator"].count()
colors=('khaki', 'darkkhaki', 'beige')
reach_unhappy_countries_pie = ru_region_distribution.plot(kind="pie",autopct='%1.1f%%', colors=colors, figsize=(7, 7))
reach_unhappy_countries_pie.set_title('High GDP Countries with Low Happiness Levels', fontsize=20)
reach_unhappy_countries_pie.set_ylabel("")
plt.savefig("Images/question_4.png");

In [None]:
# Calculate the % of High GDP countries with low happiness levels

proportion_ru_countries = (question_4["Country name"].count() / region_df_2020["Country name"].count() ) * 100
print(f"{proportion_ru_countries}% of the total countries have high GDP and have low happiness levels")

In [None]:
# Countries with high GDP
question_4_rc = region_df_2020.loc[(region_df_2020['GDP per capita']>=mean_gdp)]

In [None]:
# What percentage of the total high-GDP countries have low happiness levels?
proportion_countries_rc = round(((question_4["Country name"].count() / question_4_rc["Country name"].count() ) * 100),2)
print(f"{proportion_countries_rc}% of the countries with high GDP have low happiness levels")


### 5.  What portion of low-GDP countries have high happiness levels?

We decided to use a pie chart to visualize the percentage of low-GDP countries with high happiness levels. We found that 25% of the total countries with low GDP have high happiness levels. We also found that out of all countries listed, only 11% fit the criteria of being low-GDP countries with high happiness levels.

In [None]:
# Filter the countries which have a low GDP but high Happiness Score

question_5 = region_df_2020.loc[(region_df_2020['GDP per capita']<mean_gdp) & (region_df_2020['Happiness Score']>=mean_happiness)]
question_5

In [None]:
# Countries with low GDP
question_5_pc = region_df_2020.loc[(region_df_2020['GDP per capita']<mean_gdp)]

In [None]:
# What percentage of the total low-GDP countries have high happiness levels?
proportion_countries_pc = (question_5["Country name"].count() / question_5_pc["Country name"].count() ) * 100
print(f"{proportion_countries_pc}% of the total countries with low GDP have high happiness levels")

In [None]:
# Low-GDP countries with high happiness levels
ph_region_group = question_5.groupby("Regional indicator")
ph_region_distribution = ph_region_group["Regional indicator"].count()
colors=('gold', 'orange')
poor_happy_countries_pie = ph_region_distribution.plot(kind="pie", autopct='%1.1f%%', colors=colors, figsize=(7, 7))
poor_happy_countries_pie.set_title('Low-GDP Countries with High Happiness Levels', fontsize=20)
poor_happy_countries_pie.set_ylabel("")
plt.savefig("Images/question_5.png");

In [None]:
# Calculate the % of all countries that have low GDP and high happiness levels
proportion_ph_countries = (question_5["Country name"].count() / region_df_2020["Country name"].count() ) * 100
print(f"{proportion_ph_countries}% of the total countries have low GDP and high happiness levels.")

### GDP Per Capita

To understand the differences in GDP between regions, we started by running two independent t-tests.

Our first t-test compares Western Europe with Latin America and Caribbean. According to our results, the 21 countries in Western Europe (M = 10.688, SD = 0.306) had significantly higher logged GDP per capita than the 21 countries in Latin America and Caribbean (M = 9.303, SD = 0.671), t(40) = 8.607, p < 0.001.

Our second t-test compares Middle East and North Africa with Central and Eastern Europe. According to the data, the 17 countries in Middle East and North Africa (M = 9.714, SD = 0.927) did not differ significantly in logged GDP per capita from the 17 countries in Central and Eastern Europe (M = 9.998, SD = 0.397), t(32) = 1.0712, p = 0.292.
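For readers who want to see where the t statistic and its degrees of freedom come from, the following plain-Python sketch works through the pooled-variance (equal-variance) two-sample t-test, which is the default behaviour of `stats.ttest_ind`. The samples are hypothetical, and `pooled_t` is a helper name introduced here. It does not compute the p value.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for an equal-variance two-sample t-test."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical samples for illustration only
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]

t_stat, dof = pooled_t(group_a, group_b)
print(f"t({dof}) = {t_stat:.3f}")
```

The degrees of freedom depend only on the two sample sizes: two groups of 21 give 40, and two groups of 17 give 32.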

A one-way ANOVA was conducted to compare GDP Per Capita in Western Europe, Latin America and Caribbean, Middle East and North Africa, Central and Eastern Europe, and North America and ANZ. We found a significant difference in mean GDP (F(4,86)=15.763, p < 0.001) between the regions.
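The F statistic reported by a one-way ANOVA is the ratio of between-group to within-group mean squares. The sketch below computes it by hand on hypothetical groups, so it illustrates the formula rather than reproducing the notebook's result. `one_way_anova` is a helper name introduced here.

```python
def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical groups for illustration only
f_stat, df_b, df_w = one_way_anova([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f"F({df_b},{df_w}) = {f_stat:.3f}")
```

The two degrees of freedom are the number of groups minus one and the total number of observations minus the number of groups.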

#### **Independent T-test**

In [None]:
# Independent t-test - GDP (WE vs LAC)
g1 = region_df_2020[region_df_2020['Regional indicator'] == 'Western Europe']["GDP per capita"]
g2 = region_df_2020[region_df_2020['Regional indicator'] == 'Latin America and Caribbean']["GDP per capita"]
stats.ttest_ind(g1,g2)

In [None]:
# Independent t-test - GDP (CEE vs MENA)
g3 = region_df_2020[region_df_2020['Regional indicator'] == 'Central and Eastern Europe']["GDP per capita"]
g4 = region_df_2020[region_df_2020['Regional indicator'] == 'Middle East and North Africa']["GDP per capita"]
stats.ttest_ind(g3,g4)

#### **ANOVA**

In [None]:
# Create a boxplot to compare means - GDP 
boxprops = dict(linestyle='-', linewidth=3)
flierprops = dict(marker='o', markerfacecolor='silver', markersize=15,
                  linestyle='none')
medianprops = dict(linestyle='-', linewidth=2.5)
whiskerprops=dict(linestyle='-', linewidth=2)
capprops=dict(linestyle='-', linewidth=2)
color=dict(boxes='black', whiskers='black', medians='lime', caps='black')
region_GDP = region_df_2020.boxplot("GDP per capita", by="Regional indicator", figsize=(20, 10), fontsize=13,
                                             boxprops=boxprops, flierprops=flierprops, medianprops=medianprops, whiskerprops=whiskerprops,
                                            capprops=capprops, color=color)
region_GDP.set_title('GDP per Capita', fontsize=20)
plt.savefig("Images/gdp_by_region.png");

In [None]:
# Extract individual groups - GDP per capita (this section analyzes GDP)
group1 = region_df_2020[region_df_2020['Regional indicator'] == 'Western Europe']["GDP per capita"]
group2 = region_df_2020[region_df_2020['Regional indicator'] == 'Latin America and Caribbean']["GDP per capita"]
group3 = region_df_2020[region_df_2020['Regional indicator'] == 'Central and Eastern Europe']["GDP per capita"]
group4 = region_df_2020[region_df_2020['Regional indicator'] == 'Middle East and North Africa']["GDP per capita"]
group5 = region_df_2020[region_df_2020['Regional indicator'] == 'North America and ANZ']["GDP per capita"]

In [None]:
# ANOVA
stats.f_oneway(group1, group2, group3,group4,group5)

### Happiness Score

Similarly, to understand the differences in happiness score for different regions, we first conducted two independent t-tests. 

The first t-test is between Western Europe versus Latin America and Caribbean. According to our results, the 21 countries in Western Europe (M =6.899, SD =0.683) demonstrated significantly higher happiness scores, t(40)= 4.425, p = 0.0000724 (<0.05) when compared to 21 countries in Latin America and Caribbean (M = 5.982, SD = 0.660).

Our second t-test compares Middle East and North Africa with Central and Eastern Europe. According to the output, the 17 countries in Middle East and North Africa (M = 5.227, SD = 0.988) demonstrated significantly lower happiness scores, t(32) = 2.421, p = .0213 (<0.05), than the 17 countries in Central and Eastern Europe (M = 5.884, SD = 0.523).

Our t-tests gave us some insight into the systematic differences between pairs of regions, but it is also useful to test whether happiness scores differ significantly across all five regions. A one-way ANOVA was conducted to compare happiness scores in Western Europe, Latin America and Caribbean, Middle East and North Africa, Central and Eastern Europe, and North America and ANZ. We found a significant difference in mean happiness scores based on region membership, F(4,86) = 15.7634, p = 0.000000002095 (<.05). Because ANOVA only tests whether all group means are the same, we conducted a post hoc test to find out where the differences are. We used pairwise t-tests with a Holm adjustment to account for multiple comparisons, available in the package “scikit_posthocs”. As shown in the table of p values below, there are group differences where the p value is less than 0.05.
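To make the multiple-testing adjustment concrete, here is a minimal sketch of the Holm step-down procedure in plain Python. It is an illustration of the general method with hypothetical p values, not the scikit_posthocs implementation, and `holm_adjust` is a helper name introduced here.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p values in the same order as the input."""
    m = len(p_values)
    # Indices of the p values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p values must not decrease along the sorted order
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p values from three pairwise comparisons
raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

A comparison is then called significant when its adjusted p value is below the chosen threshold, here 0.05.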

#### **Independent T-test**

In [None]:
#Independent t-test - Happiness (WE vs LAC)
g1 = region_df_2020[region_df_2020['Regional indicator'] == 'Western Europe']["Happiness Score"]
g2 = region_df_2020[region_df_2020['Regional indicator'] == 'Latin America and Caribbean']["Happiness Score"]
stats.ttest_ind(g1,g2)

In [None]:
#Independent t-test - Happiness (CEE vs MENA)
g3 = region_df_2020[region_df_2020['Regional indicator'] == 'Central and Eastern Europe']["Happiness Score"]
g4 = region_df_2020[region_df_2020['Regional indicator'] == 'Middle East and North Africa']["Happiness Score"]
stats.ttest_ind(g3,g4)

In [None]:
# Create a boxplot to compare means
boxprops = dict(linestyle='-', linewidth=3)
flierprops = dict(marker='o', markerfacecolor='mediumblue', markersize=15,
                  linestyle='none')
medianprops = dict(linestyle='-', linewidth=2.5)
whiskerprops=dict(linestyle='-', linewidth=2)
capprops=dict(linestyle='-', linewidth=2)
color=dict(boxes='black', whiskers='black', medians='orange', caps='black')
region_happinness = region_df_2020.boxplot("Happiness Score", by="Regional indicator", figsize=(20, 10), fontsize=13,
                                             boxprops=boxprops, flierprops=flierprops, medianprops=medianprops, whiskerprops=whiskerprops,
                                            capprops=capprops, color=color)
region_happinness.set_title('Happiness Score', fontsize=20)
plt.savefig("Images/happiness_score_by_region.png");

#### **ANOVA**

In [None]:
# Extract individual groups - Happiness Score (this section analyzes happiness)
group1 = region_df_2020[region_df_2020['Regional indicator'] == 'Western Europe']["Happiness Score"]
group2 = region_df_2020[region_df_2020['Regional indicator'] == 'Latin America and Caribbean']["Happiness Score"]
group3 = region_df_2020[region_df_2020['Regional indicator'] == 'Central and Eastern Europe']["Happiness Score"]
group4 = region_df_2020[region_df_2020['Regional indicator'] == 'Middle East and North Africa']["Happiness Score"]
group5 = region_df_2020[region_df_2020['Regional indicator'] == 'North America and ANZ']["Happiness Score"]

In [None]:
stats.f_oneway(group1, group2, group3,group4, group5)

### Posthocs

In [None]:
#Dependencies
import statsmodels.api as sa
import statsmodels.formula.api as sfa
import scikit_posthocs as sp

In [None]:
region_df_2020.columns

In [None]:
# Posthocs Happiness (columns are renamed so they can be used in the model formula)
region_df_2020.columns=['Country', 'Region', 'Happiness', 'GDP', 'SSupport',
       'Health', 'Freedom', 'Generosity']
lm = sfa.ols('Happiness ~ Region',data=region_df_2020).fit()
anova = sa.stats.anova_lm(lm)
print(anova)
sp.posthoc_ttest(region_df_2020, val_col='Happiness', group_col='Region', p_adjust='holm')

In [None]:
# Posthocs GDP
region_df_2020.columns=['Country', 'Region', 'Happiness', 'GDP', 'SSupport',
       'Health', 'Freedom', 'Generosity']
lm = sfa.ols('GDP ~ Region',data=region_df_2020).fit()
anova = sa.stats.anova_lm(lm)
print(anova)
sp.posthoc_ttest(region_df_2020, val_col='GDP', group_col='Region', p_adjust='holm')

In [None]:
# What is the correlation level between the variables and the happiness score?
def happiness(x_values, y_values):
    """Scatter-plot y against x, overlay the least-squares line, and print the fit."""
    (slope, intercept, rvalue, pvalue, stderr) = stats.linregress(x_values, y_values)
    regress_values = x_values * slope + intercept
    line_eq = f"y = {round(slope, 2)}x + {round(intercept, 2)}"
    plt.scatter(x_values, y_values, c="royalblue", edgecolor="black")
    plt.plot(x_values, regress_values, "springgreen")
    # Label each axis with the name of the Series plotted on it
    plt.xlabel(x_values.name)
    plt.ylabel(y_values.name)
    plt.show()
    print(line_eq)
    print(f"The r-value is: {rvalue}")

In [None]:
region_df_2020.columns

In [None]:
happiness(na_df["Happiness Score"], na_df["GDP per capita"])