# ***World Happiness Report - EDA***
---

In this workshop we explore the Happiness Score and Happiness Rank of countries around the world, and how these relate to the other variables in the datasets. With that information and the context of the problem in hand, we will close this notebook by training a Machine Learning model.

## **Setting the notebook**

First, we set the working directory to the project root so that the packages and modules used below can be found.

In [1]:
import os

try:
    os.chdir("../../etl-workshop-3")
except FileNotFoundError:
    print("You are already in the correct directory.")

We proceed to import the following for this notebook:

### **Dependencies**

* **Pandas** ➜ Used for data manipulation and analysis.
* **Seaborn** ➜ Used for data visualization based on matplotlib.
* **Matplotlib** ➜ Used for creating static, animated, and interactive visualizations in Python.

### **Modules**

* **utils.dataframe_utils** ➜ Custom utility functions for dataframe operations.

In [2]:
# Data Manipulation
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Python modules
from utils.dataframe_utils import *

## **Reading the data**

### ***2015***

In [3]:
df_2015 = pd.read_csv("./data/2015.csv")
df_2015.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204


In [4]:
df_2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
memory usage: 1

### ***2016***

In [5]:
df_2016 = pd.read_csv("./data/2016.csv")
df_2016.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137


In [6]:
df_2016.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        157 non-null    object 
 1   Region                         157 non-null    object 
 2   Happiness Rank                 157 non-null    int64  
 3   Happiness Score                157 non-null    float64
 4   Lower Confidence Interval      157 non-null    float64
 5   Upper Confidence Interval      157 non-null    float64
 6   Economy (GDP per Capita)       157 non-null    float64
 7   Family                         157 non-null    float64
 8   Health (Life Expectancy)       157 non-null    float64
 9   Freedom                        157 non-null    float64
 10  Trust (Government Corruption)  157 non-null    float64
 11  Generosity                     157 non-null    float64
 12  Dystopia Residual              157 non-null    flo

### ***2017***

In [7]:
df_2017 = pd.read_csv("./data/2017.csv")
df_2017.head(3)

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715


In [8]:
df_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        155 non-null    object 
 1   Happiness.Rank                 155 non-null    int64  
 2   Happiness.Score                155 non-null    float64
 3   Whisker.high                   155 non-null    float64
 4   Whisker.low                    155 non-null    float64
 5   Economy..GDP.per.Capita.       155 non-null    float64
 6   Family                         155 non-null    float64
 7   Health..Life.Expectancy.       155 non-null    float64
 8   Freedom                        155 non-null    float64
 9   Generosity                     155 non-null    float64
 10  Trust..Government.Corruption.  155 non-null    float64
 11  Dystopia.Residual              155 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 

### ***2018***

In [9]:
df_2018 = pd.read_csv("./data/2018.csv")
df_2018.head(3)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408


In [10]:
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     155 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


### ***2019***

In [11]:
df_2019 = pd.read_csv("./data/2019.csv")
df_2019.head(3)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341


In [12]:
df_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     156 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


## ***Comparing the data***

The dataframes above contain columns that carry the same information under different names (for example `Happiness Score`, `Happiness.Score` and `Score`). These names must be normalized before the datasets can be merged.

To do so, we first examine the differences in detail and then build the merged dataset.

In [13]:
happiness_dataframes = {
    "2015": df_2015,
    "2016": df_2016,
    "2017": df_2017,
    "2018": df_2018,
    "2019": df_2019
}

The summary below shows that both the number of rows and the number of columns differ between years. The 2018 dataset contains one null value, which we will handle later. None of the datasets contains duplicate rows.

In [14]:
briefing = dataframe_briefing(happiness_dataframes)
briefing

Unnamed: 0,Year,Number of rows,Number of columns,Number of null values,Number of duplicate values
0,2015,158,12,0,0
1,2016,157,13,0,0
2,2017,155,12,0,0
3,2018,156,9,1,0
4,2019,156,9,0,0
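
The implementation of `dataframe_briefing` lives in `utils.dataframe_utils` and is not shown in this notebook. As a rough sketch only, assuming it counts rows, columns, null cells and duplicated rows per year, a function producing the same columns could look like this (the name `briefing_sketch` and the toy data are hypothetical):

```python
import pandas as pd


def briefing_sketch(dataframes: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Summarize each dataframe: shape, total null cells, duplicated rows."""
    rows = []
    for year, frame in dataframes.items():
        rows.append({
            "Year": year,
            "Number of rows": frame.shape[0],
            "Number of columns": frame.shape[1],
            # Total count of missing cells across all columns
            "Number of null values": int(frame.isna().sum().sum()),
            # Rows that are exact copies of an earlier row
            "Number of duplicate values": int(frame.duplicated().sum()),
        })
    return pd.DataFrame(rows)


# Toy data standing in for the real CSV files
toy = {
    "2018": pd.DataFrame({"Country": ["A", "B"], "Score": [7.1, None]}),
    "2019": pd.DataFrame({"Country": ["A", "B"], "Score": [7.2, 6.9]}),
}
summary = briefing_sketch(toy)
```

The real helper may count nulls or duplicates differently (for instance per column), so treat this only as an illustration of the idea.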


To pin down the differences in the number and naming of columns, we check which column names are present in each year.

In [15]:
comparison_df = comparing_names(happiness_dataframes)
print(comparison_df)

                              2015 2016 2017 2018 2019
Column Name                                           
Whisker.low                      ✘    ✘    ✔    ✘    ✘
Trust (Government Corruption)    ✔    ✔    ✘    ✘    ✘
Perceptions of corruption        ✘    ✘    ✘    ✔    ✔
Happiness Rank                   ✔    ✔    ✘    ✘    ✘
GDP per capita                   ✘    ✘    ✘    ✔    ✔
Score                            ✘    ✘    ✘    ✔    ✔
Happiness.Score                  ✘    ✘    ✔    ✘    ✘
Happiness.Rank                   ✘    ✘    ✔    ✘    ✘
Happiness Score                  ✔    ✔    ✘    ✘    ✘
Freedom to make life choices     ✘    ✘    ✘    ✔    ✔
Lower Confidence Interval        ✘    ✔    ✘    ✘    ✘
Freedom                          ✔    ✔    ✔    ✘    ✘
Upper Confidence Interval        ✘    ✔    ✘    ✘    ✘
Country                          ✔    ✔    ✔    ✘    ✘
Overall rank                     ✘    ✘    ✘    ✔    ✔
Trust..Government.Corruption.    ✘    ✘    ✔    ✘    ✘
Social sup
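
`comparing_names` is also a custom helper whose code is not part of this notebook. A minimal sketch of how such a presence table might be built, assuming it only checks whether each column name exists in each year's dataframe (the function name `presence_table` and the toy frames are hypothetical):

```python
import pandas as pd


def presence_table(dataframes: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Build a column-name by year table of ✔ / ✘ marks."""
    # Union of every column name seen in any year, sorted for a stable order
    all_columns = sorted(
        set().union(*(frame.columns for frame in dataframes.values()))
    )
    return pd.DataFrame(
        {
            year: ["✔" if name in frame.columns else "✘" for name in all_columns]
            for year, frame in dataframes.items()
        },
        index=pd.Index(all_columns, name="Column Name"),
    )


# Toy frames with only column headers, mimicking two naming schemes
toy = {
    "2017": pd.DataFrame(columns=["Country", "Happiness.Score"]),
    "2018": pd.DataFrame(columns=["Country or region", "Score"]),
}
table = presence_table(toy)
```

Unlike the output above, this sketch sorts the column names alphabetically; the ordering used by the real helper is not known from this notebook.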

## ***Data preparation***

### **Normalizing the names of the columns**

Based on the previous table, we normalize and unify the column names under broader categories. `Generosity` already has the same name in every year, so it needs no mapping. The categories to use are:

* Country
* Happiness Score
* Happiness Rank
* Economy (GDP per Capita)
* Health (Life Expectancy)
* Social Support
* Freedom
* Perceptions of Corruption
* Dystopia Residual

In [16]:
# Define the mapping dictionary, leaving out the Standard Error and Confidence Interval columns
column_mapping = {    
    # Country
    'Country or region': 'Country',
    
    # Happiness Score
    'Happiness Score': 'Happiness Score',
    'Happiness.Score': 'Happiness Score',
    'Score': 'Happiness Score',
    
    # Happiness Rank
    'Happiness Rank': 'Happiness Rank',
    'Happiness.Rank': 'Happiness Rank',
    'Overall rank': 'Happiness Rank',
    
    # Economy (GDP per Capita)
    'Economy (GDP per Capita)': 'Economy (GDP per Capita)',
    'Economy..GDP.per.Capita.': 'Economy (GDP per Capita)',
    'GDP per capita': 'Economy (GDP per Capita)',
    
    # Health (Life Expectancy)
    'Health (Life Expectancy)': 'Health (Life Expectancy)',
    'Health..Life.Expectancy.': 'Health (Life Expectancy)',
    'Healthy life expectancy': 'Health (Life Expectancy)',
    
    # Social Support
    'Family': 'Social Support',
    'Social support': 'Social Support',
    
    # Freedom
    'Freedom': 'Freedom',
    'Freedom to make life choices': 'Freedom',
    
    # Perceptions of Corruption
    'Trust (Government Corruption)': 'Perceptions of Corruption',
    'Trust..Government.Corruption.': 'Perceptions of Corruption',
    'Perceptions of corruption': 'Perceptions of Corruption',
    
    # Dystopia Residual
    'Dystopia Residual': 'Dystopia Residual',
    'Dystopia.Residual': 'Dystopia Residual',
}

We iterate over the dataframes and apply the column mapping to each one. The previews that follow show the renamed columns.

In [17]:
normalized_datasets = {}

for year, df in happiness_dataframes.items():
    df = df.rename(columns=column_mapping)
    normalized_datasets[year] = df
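
A single mapping can be shared across all years because `DataFrame.rename` skips keys that are not present in a given frame. The sketch below illustrates this on toy frames with a small subset of the mapping, and adds an optional check that no two source columns collapsed into the same target name (the variable names and toy data are illustrative only):

```python
import pandas as pd

# Small subset of the mapping, for illustration only
mapping = {
    "Happiness.Score": "Happiness Score",
    "Score": "Happiness Score",
    "Country or region": "Country",
}

frame_2017 = pd.DataFrame({"Country": ["Norway"], "Happiness.Score": [7.537]})
frame_2018 = pd.DataFrame({"Country or region": ["Finland"], "Score": [7.632]})

# Keys missing from a frame are skipped (errors="ignore" is the default)
renamed_2017 = frame_2017.rename(columns=mapping)
renamed_2018 = frame_2018.rename(columns=mapping)

# Guard against two source columns ending up with the same name
has_duplicates = renamed_2018.columns.duplicated().any()
```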

In [18]:
normalized_datasets["2015"].head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Social Support,Health (Life Expectancy),Freedom,Perceptions of Corruption,Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204


In [19]:
normalized_datasets["2018"].head(3)

Unnamed: 0,Happiness Rank,Country,Happiness Score,Economy (GDP per Capita),Social Support,Health (Life Expectancy),Freedom,Generosity,Perceptions of Corruption
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408


### **Adding years column**

Before concatenating the DataFrames, we add a `Year` column that identifies which year each row comes from.

In [20]:
yearly_happiness_dataframes = {}

for year, df in normalized_datasets.items():
    df["Year"] = year
    yearly_happiness_dataframes[year] = df
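
Note that the dictionary keys are strings, so the `Year` column created above holds strings, and the assignment modifies the frames stored in `normalized_datasets` in place. If a numeric year and untouched inputs are preferred, one possible alternative (a sketch on toy data, not the approach used in this notebook) is `DataFrame.assign` with an explicit `int` conversion:

```python
import pandas as pd

frames = {
    "2018": pd.DataFrame({"Country": ["Finland"], "Happiness Score": [7.632]}),
    "2019": pd.DataFrame({"Country": ["Finland"], "Happiness Score": [7.769]}),
}

# assign returns a new frame, so the input frames are left untouched
with_year = {year: frame.assign(Year=int(year)) for year, frame in frames.items()}
```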

In [21]:
yearly_happiness_dataframes["2015"].head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Social Support,Health (Life Expectancy),Freedom,Perceptions of Corruption,Generosity,Dystopia Residual,Year
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,2015
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,2015
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,2015


In [22]:
yearly_happiness_dataframes["2019"].head(3)

Unnamed: 0,Happiness Rank,Country,Happiness Score,Economy (GDP per Capita),Social Support,Health (Life Expectancy),Freedom,Generosity,Perceptions of Corruption,Year
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393,2019
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41,2019
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341,2019


### **Concatenating all dataframes, keeping only the columns in common**

We concatenate the five dataframes contained in `yearly_happiness_dataframes`, retaining only the columns that are present in all of them. Columns that exist only in some years, such as `Region`, `Standard Error` and `Dystopia Residual`, are dropped.

In [23]:
common_columns = list(set.intersection(*(set(df.columns) for df in yearly_happiness_dataframes.values())))

filtered_dataframes = [df[common_columns] for df in yearly_happiness_dataframes.values()]

df = pd.concat(filtered_dataframes, ignore_index=True)

df.head()

Unnamed: 0,Happiness Score,Freedom,Country,Health (Life Expectancy),Year,Generosity,Social Support,Happiness Rank,Economy (GDP per Capita),Perceptions of Corruption
0,7.587,0.66557,Switzerland,0.94143,2015,0.29678,1.34951,1,1.39651,0.41978
1,7.561,0.62877,Iceland,0.94784,2015,0.4363,1.40223,2,1.30232,0.14145
2,7.527,0.64938,Denmark,0.87464,2015,0.34139,1.36058,3,1.32548,0.48357
3,7.522,0.66973,Norway,0.88521,2015,0.34699,1.33095,4,1.459,0.36503
4,7.427,0.63297,Canada,0.90563,2015,0.45811,1.32261,5,1.32629,0.32957
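
Because Python sets are unordered, the column order produced by `set.intersection` is arbitrary, which explains the shuffled columns in the output above. If a predictable order is wanted, one option is to filter the first frame's columns against the shared set. This is a sketch on toy frames, not the code used in this notebook:

```python
import pandas as pd

frames = [
    pd.DataFrame({"Country": ["A"], "Region": ["X"],
                  "Happiness Score": [7.0], "Year": ["2015"]}),
    pd.DataFrame({"Happiness Score": [7.1], "Country": ["A"], "Year": ["2018"]}),
]

# Columns shared by every frame
shared = set.intersection(*(set(frame.columns) for frame in frames))
# Keep them in the order in which they appear in the first frame
ordered_common = [name for name in frames[0].columns if name in shared]

merged = pd.concat([frame[ordered_common] for frame in frames], ignore_index=True)
```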


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 782 entries, 0 to 781
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Happiness Score            782 non-null    float64
 1   Freedom                    782 non-null    float64
 2   Country                    782 non-null    object 
 3   Health (Life Expectancy)   782 non-null    float64
 4   Year                       782 non-null    object 
 5   Generosity                 782 non-null    float64
 6   Social Support             782 non-null    float64
 7   Happiness Rank             782 non-null    int64  
 8   Economy (GDP per Capita)   782 non-null    float64
 9   Perceptions of Corruption  781 non-null    float64
dtypes: float64(7), int64(1), object(2)
memory usage: 61.2+ KB
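
The summary above reports 781 non-null values out of 782 for `Perceptions of Corruption`, which is consistent with the single null found in the 2018 dataset. A minimal sketch of how the affected row could be located, shown on toy data (the values below are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Year": ["2017", "2018", "2018"],
    "Perceptions of Corruption": [0.31, None, 0.12],
})

# Boolean mask of the rows where the column is missing
missing_mask = toy["Perceptions of Corruption"].isna()
# Show only the identifying columns of those rows
missing_rows = toy.loc[missing_mask, ["Country", "Year"]]
```

Applying the same mask to the merged `df` would show which country and year the missing value belongs to.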


## ***Data understanding***