# ***World Happiness Report - EDA***
---

In this workshop we explore the Happiness Score and Rank across countries and how they relate to the other variables in the datasets. With that context in hand, the notebook ends with the training of a Machine Learning model.

## **Setting the notebook**

First, we set the working directory to the project root so that the packages and modules we use are found correctly.

In [1]:
import os

try:
    os.chdir("../../etl-workshop-3")
except FileNotFoundError:
    print("You are already in the correct directory.")
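
The relative `os.chdir` call above depends on where the notebook is launched from. As a hedged alternative, the sketch below walks up from a starting directory until it finds a folder with the expected name. `find_project_root` is a hypothetical helper, not part of the project; the folder name is the one used in the path above.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, folder_name: str) -> Optional[Path]:
    """Return the closest ancestor of `start` (or `start` itself) named `folder_name`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if candidate.name == folder_name:
            return candidate
    return None


root = find_project_root(Path.cwd(), "etl-workshop-3")
if root is None:
    print("Project root not found; staying in", Path.cwd())
else:
    print("Project root:", root)
```

If `root` is found, passing it to `os.chdir` would have the same effect as the cell above, without relying on a fixed number of `..` steps.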

We proceed to import the following for this notebook:

### **Dependencies**

* **Pandas** ➜ Used for data manipulation and analysis.
* **Seaborn** ➜ Used for data visualization based on matplotlib.
* **Matplotlib** ➜ Used for creating static, animated, and interactive visualizations in Python.
* **country_converter** ➜ Used for converting country names to other classifications, such as continents.

### **Modules**

* **utils.dataframe_utils** ➜ Custom utility functions for dataframe operations.

In [2]:
# Data Manipulation
import pandas as pd
import country_converter as coco

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Python modules
from utils.dataframe_utils import *

## **Reading the data**

### ***2015***

In [3]:
df_2015 = pd.read_csv("./data/2015.csv")
df_2015.head(3)

,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204


In [4]:
df_2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
memory usage: 1

### ***2016***

In [5]:
df_2016 = pd.read_csv("./data/2016.csv")
df_2016.head(3)

,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137


In [6]:
df_2016.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        157 non-null    object 
 1   Region                         157 non-null    object 
 2   Happiness Rank                 157 non-null    int64  
 3   Happiness Score                157 non-null    float64
 4   Lower Confidence Interval      157 non-null    float64
 5   Upper Confidence Interval      157 non-null    float64
 6   Economy (GDP per Capita)       157 non-null    float64
 7   Family                         157 non-null    float64
 8   Health (Life Expectancy)       157 non-null    float64
 9   Freedom                        157 non-null    float64
 10  Trust (Government Corruption)  157 non-null    float64
 11  Generosity                     157 non-null    float64
 12  Dystopia Residual              157 non-null    float64

### ***2017***

In [7]:
df_2017 = pd.read_csv("./data/2017.csv")
df_2017.head(3)

,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715


In [8]:
df_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        155 non-null    object 
 1   Happiness.Rank                 155 non-null    int64  
 2   Happiness.Score                155 non-null    float64
 3   Whisker.high                   155 non-null    float64
 4   Whisker.low                    155 non-null    float64
 5   Economy..GDP.per.Capita.       155 non-null    float64
 6   Family                         155 non-null    float64
 7   Health..Life.Expectancy.       155 non-null    float64
 8   Freedom                        155 non-null    float64
 9   Generosity                     155 non-null    float64
 10  Trust..Government.Corruption.  155 non-null    float64
 11  Dystopia.Residual              155 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 

### ***2018***

In [9]:
df_2018 = pd.read_csv("./data/2018.csv")
df_2018.head(3)

,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408


In [10]:
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     155 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


### ***2019***

In [11]:
df_2019 = pd.read_csv("./data/2019.csv")
df_2019.head(3)

,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341


In [12]:
df_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     156 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


## ***Comparing the data***

Upon analyzing the previous dataframes, it can be observed that many of them contain columns that, although they provide similar information, differ in their naming. It is necessary to normalize these names in order to merge these datasets.

To achieve this, we will thoroughly analyze these differences to ultimately obtain the merged dataset.

In [13]:
happiness_dataframes = {
    "2015": df_2015,
    "2016": df_2016,
    "2017": df_2017,
    "2018": df_2018,
    "2019": df_2019
}

The summary below shows that both the number of rows and the number of columns differ between years. The 2018 dataset also contains one null value, which we will handle later. None of the datasets contains duplicated rows.
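
The `dataframe_briefing` helper lives in `utils.dataframe_utils` and its source is not shown here. As a rough idea of what such a summary could compute, here is a minimal sketch; `briefing_sketch` and the toy frames are assumptions for illustration, with output columns named after the table produced by the real helper.

```python
import pandas as pd


def briefing_sketch(dataframes: dict) -> pd.DataFrame:
    """Summarize shape, null count and duplicated-row count per year."""
    rows = []
    for year, frame in dataframes.items():
        rows.append({
            "Year": year,
            "Number of rows": frame.shape[0],
            "Number of columns": frame.shape[1],
            "Number of null values": int(frame.isna().sum().sum()),
            "Number of duplicated values": int(frame.duplicated().sum()),
        })
    return pd.DataFrame(rows)


# Toy data, not taken from the real CSV files.
toy = {
    "2018": pd.DataFrame({"Country": ["A", "B"], "Score": [7.1, None]}),
    "2019": pd.DataFrame({"Country": ["A", "B", "B"], "Score": [7.0, 6.5, 6.5]}),
}
summary = briefing_sketch(toy)
print(summary)
```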

In [14]:
briefing = dataframe_briefing(happiness_dataframes)
briefing

,Year,Number of rows,Number of columns,Number of null values,Number of duplicated values
0,2015,158,12,0,0
1,2016,157,13,0,0
2,2017,155,12,0,0
3,2018,156,9,1,0
4,2019,156,9,0,0


To determine these differences in the number and naming of columns, we will need to analyze how they change and are distributed over the years.
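
The `comparing_names` helper is also imported from `utils.dataframe_utils`, so its implementation is not visible in this notebook. A presence matrix of this kind could be built roughly as follows; `compare_columns_sketch` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def compare_columns_sketch(dataframes: dict) -> pd.DataFrame:
    """Build a presence matrix: one row per column name, one column per year."""
    all_names = sorted({name for frame in dataframes.values() for name in frame.columns})
    presence = {
        year: ["✔" if name in frame.columns else "✘" for name in all_names]
        for year, frame in dataframes.items()
    }
    return pd.DataFrame(presence, index=pd.Index(all_names, name="Column Name"))


# Empty toy frames: only the column names matter for this comparison.
toy = {
    "2017": pd.DataFrame(columns=["Country", "Happiness.Score"]),
    "2018": pd.DataFrame(columns=["Country or region", "Score"]),
}
matrix = compare_columns_sketch(toy)
print(matrix)
```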

In [15]:
comparison_df = comparing_names(happiness_dataframes)
print(comparison_df)

                              2015 2016 2017 2018 2019
Column Name                                           
Social support                   ✘    ✘    ✘    ✔    ✔
Overall rank                     ✘    ✘    ✘    ✔    ✔
Country                          ✔    ✔    ✔    ✘    ✘
Perceptions of corruption        ✘    ✘    ✘    ✔    ✔
Happiness Rank                   ✔    ✔    ✘    ✘    ✘
Whisker.high                     ✘    ✘    ✔    ✘    ✘
Happiness.Rank                   ✘    ✘    ✔    ✘    ✘
Whisker.low                      ✘    ✘    ✔    ✘    ✘
Freedom to make life choices     ✘    ✘    ✘    ✔    ✔
Healthy life expectancy          ✘    ✘    ✘    ✔    ✔
Happiness.Score                  ✘    ✘    ✔    ✘    ✘
Region                           ✔    ✔    ✘    ✘    ✘
Country or region                ✘    ✘    ✘    ✔    ✔
Dystopia Residual                ✔    ✔    ✘    ✘    ✘
Economy..GDP.per.Capita.         ✘    ✘    ✔    ✘    ✘
Generosity                       ✔    ✔    ✔    ✔    ✔
Score     

## ***Data preparation***

### **Normalizing the names of the columns**

From the previous table, we will normalize and unify the column names into broader categories. The unified categories are:

* Country

* Happiness Score

* Happiness Rank

* Economy (GDP per Capita)

* Health (Life Expectancy)

* Social Support

* Freedom

* Perceptions of Corruption

* Generosity

* Dystopia Residual

In [16]:
# Define the mapping dictionary, excluding the Standard Error and Confidence Interval columns
column_mapping = {    
    # Country
    'Country': 'country',
    'Country or region': 'country',
    
    # Happiness Score
    'Happiness Score': 'happiness_score',
    'Happiness.Score': 'happiness_score',
    'Score': 'happiness_score',
    
    # Happiness Rank
    'Happiness Rank': 'happiness_rank',
    'Happiness.Rank': 'happiness_rank',
    'Overall rank': 'happiness_rank',
    
    # Economy (GDP per Capita)
    'Economy (GDP per Capita)': 'economy',
    'Economy..GDP.per.Capita.': 'economy',
    'GDP per capita': 'economy',
    
    # Health (Life Expectancy)
    'Health (Life Expectancy)': 'health',
    'Health..Life.Expectancy.': 'health',
    'Healthy life expectancy': 'health',
    
    # Social Support
    'Family': 'social_support',
    'Social support': 'social_support',
    
    # Freedom
    'Freedom': 'freedom',
    'Freedom to make life choices': 'freedom',
    
    # Perceptions of Corruption
    'Trust (Government Corruption)': 'corruption_perception',
    'Trust..Government.Corruption.': 'corruption_perception',
    'Perceptions of corruption': 'corruption_perception',
    
    # Generosity
    'Generosity': 'generosity',
    
    # Dystopia Residual
    'Dystopia Residual': 'dystopia_residual',
    'Dystopia.Residual': 'dystopia_residual',
}

We iterate through the dictionary to apply the column mapping to all the available dataframes. Upon reviewing the results, we can observe the changes made.

In [17]:
normalized_datasets = {}

for year, df in happiness_dataframes.items():
    df = df.rename(columns=column_mapping)
    normalized_datasets[year] = df
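
`DataFrame.rename` leaves any column without an entry in the mapping unchanged, and by default it ignores mapping keys that are absent from the frame. This is why one dictionary can serve all five years. The sketch below, built on a toy mapping and a toy frame (both assumptions for illustration), shows one way to list the columns a mapping does not cover.

```python
import pandas as pd

# Toy mapping and frame; the real `column_mapping` is the dictionary defined above.
toy_mapping = {"Happiness.Score": "happiness_score", "Country": "country"}
toy_frame = pd.DataFrame(columns=["Country", "Happiness.Score", "Whisker.high"])

renamed = toy_frame.rename(columns=toy_mapping)

# Columns whose names were left untouched because the mapping has no entry for them.
unmapped = [name for name in toy_frame.columns if name not in toy_mapping]
print(list(renamed.columns))
print(unmapped)
```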

In [18]:
normalized_datasets["2015"].head(3)

,country,Region,happiness_rank,happiness_score,Standard Error,economy,social_support,health,freedom,corruption_perception,generosity,dystopia_residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204


In [19]:
normalized_datasets["2018"].head(3)

,happiness_rank,country,happiness_score,economy,social_support,health,freedom,generosity,corruption_perception
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408


### **Adding years column**
To concatenate the DataFrames, we need to add a year column that allows us to identify the temporal aspect of the data.
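
Assigning a column with `df["year"] = year` modifies the frame in place. If the input frames should stay unchanged, `DataFrame.assign` is one option, since it returns a new frame. The sketch below uses two toy frames (an assumption for illustration) and keeps the year as a string, as the dictionary keys in this notebook are strings.

```python
import pandas as pd

toy_frames = {
    "2018": pd.DataFrame({"country": ["Finland"], "happiness_score": [7.632]}),
    "2019": pd.DataFrame({"country": ["Finland"], "happiness_score": [7.769]}),
}

# `assign` returns a new DataFrame, so the input frames keep their original columns.
with_year = {year: frame.assign(year=year) for year, frame in toy_frames.items()}

print(with_year["2019"])
print("year" in toy_frames["2019"].columns)
```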

In [20]:
yearly_happiness_dataframes = {}

for year, df in normalized_datasets.items():
    df["year"] = year
    yearly_happiness_dataframes[year] = df

In [21]:
yearly_happiness_dataframes["2015"].head(3)

,country,Region,happiness_rank,happiness_score,Standard Error,economy,social_support,health,freedom,corruption_perception,generosity,dystopia_residual,year
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,2015
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,2015
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,2015


In [22]:
yearly_happiness_dataframes["2019"].head(3)

,happiness_rank,country,happiness_score,economy,social_support,health,freedom,generosity,corruption_perception,year
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393,2019
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41,2019
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341,2019


### **Concatenating all dataframes, keeping only the columns in common**

We perform the concatenation of the five dataframes contained in `yearly_happiness_dataframes`, retaining only the columns that are present in all of them. 

Let's also review which columns are missing from at least one year. These are the columns that will be dropped:

* Standard Error

* Lower Confidence Interval

* dystopia_residual

* Region

* Upper Confidence Interval

* Whisker.low

* Whisker.high
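
As a side note, `pd.concat` can compute the column intersection itself through its `join="inner"` argument. This is a sketch on toy frames (an assumption for illustration), offered as an alternative to building the set of common columns by hand, not as the method used in this notebook.

```python
import pandas as pd

frame_a = pd.DataFrame({"country": ["A"], "happiness_score": [7.0], "Region": ["X"]})
frame_b = pd.DataFrame({"country": ["B"], "happiness_score": [6.0], "Whisker.low": [5.9]})

# join="inner" keeps only the columns present in every input frame.
combined = pd.concat([frame_a, frame_b], join="inner", ignore_index=True)
print(combined)
```

Because a Python `set` has no guaranteed order, the explicit intersection can yield a different column order between runs, so letting `concat` do the intersection may give a more predictable layout.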

In [23]:
comparison_yearly_df = comparing_names(yearly_happiness_dataframes)
print(comparison_yearly_df)

                          2015 2016 2017 2018 2019
Column Name                                       
country                      ✔    ✔    ✔    ✔    ✔
Lower Confidence Interval    ✘    ✔    ✘    ✘    ✘
year                         ✔    ✔    ✔    ✔    ✔
Region                       ✔    ✔    ✘    ✘    ✘
dystopia_residual            ✔    ✔    ✔    ✘    ✘
corruption_perception        ✔    ✔    ✔    ✔    ✔
Whisker.high                 ✘    ✘    ✔    ✘    ✘
health                       ✔    ✔    ✔    ✔    ✔
social_support               ✔    ✔    ✔    ✔    ✔
happiness_rank               ✔    ✔    ✔    ✔    ✔
Upper Confidence Interval    ✘    ✔    ✘    ✘    ✘
freedom                      ✔    ✔    ✔    ✔    ✔
Whisker.low                  ✘    ✘    ✔    ✘    ✘
economy                      ✔    ✔    ✔    ✔    ✔
Standard Error               ✔    ✘    ✘    ✘    ✘
happiness_score              ✔    ✔    ✔    ✔    ✔
generosity                   ✔    ✔    ✔    ✔    ✔


---

In [24]:
common_columns = list(set.intersection(*(set(df.columns) for df in yearly_happiness_dataframes.values())))

filtered_dataframes = [df[common_columns] for df in yearly_happiness_dataframes.values()]

df = pd.concat(filtered_dataframes, ignore_index=True)

df.head()

,country,year,economy,corruption_perception,health,social_support,happiness_rank,freedom,happiness_score,generosity
0,Switzerland,2015,1.39651,0.41978,0.94143,1.34951,1,0.66557,7.587,0.29678
1,Iceland,2015,1.30232,0.14145,0.94784,1.40223,2,0.62877,7.561,0.4363
2,Denmark,2015,1.32548,0.48357,0.87464,1.36058,3,0.64938,7.527,0.34139
3,Norway,2015,1.459,0.36503,0.88521,1.33095,4,0.66973,7.522,0.34699
4,Canada,2015,1.32629,0.32957,0.90563,1.32261,5,0.63297,7.427,0.45811


### **Review of the final dataframe: filling N/A values**

Our final dataframe contains **782 rows** and **10 columns** that compile the data from all five years. Using the `info` method to view the dataframe details, we observe that `corruption_perception` (Perceptions of Corruption) contains a null value; we will examine this case in depth.

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 782 entries, 0 to 781
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                782 non-null    object 
 1   year                   782 non-null    object 
 2   economy                782 non-null    float64
 3   corruption_perception  781 non-null    float64
 4   health                 782 non-null    float64
 5   social_support         782 non-null    float64
 6   happiness_rank         782 non-null    int64  
 7   freedom                782 non-null    float64
 8   happiness_score        782 non-null    float64
 9   generosity             782 non-null    float64
dtypes: float64(7), int64(1), object(2)
memory usage: 61.2+ KB


In [26]:
df[df['corruption_perception'].isna()]

,country,year,economy,corruption_perception,health,social_support,happiness_rank,freedom,happiness_score,generosity
489,United Arab Emirates,2018,2.096,,0.67,0.776,20,0.284,6.774,0.186


Since only one value is missing, a simple and common approach is to **replace the null value with the mean** of the column, which is what we will do. Before that, we will use the `describe` method to see the value we will use, which is `0.125436`.
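
To make the imputation step concrete, here is a minimal sketch on toy data (the values are invented for illustration). It fills a missing value with the global column mean, as this notebook does, and also shows a per-year mean as a possible alternative that the notebook does not use.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": ["2017", "2017", "2018", "2018", "2018"],
    "corruption_perception": [0.40, 0.20, 0.10, None, 0.30],
})

# Global mean: Series.mean skips missing values by default.
global_mean = toy["corruption_perception"].mean()
filled_global = toy["corruption_perception"].fillna(global_mean)

# Alternative (not what the notebook does): use the mean of the same year.
year_mean = toy.groupby("year")["corruption_perception"].transform("mean")
filled_by_year = toy["corruption_perception"].fillna(year_mean)

print(filled_global.iloc[3], filled_by_year.iloc[3])
```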

Before moving on to the next step, I'd like you to consider this: check the minimum values in some columns. Doesn't it seem odd that ***there are values of 0***? We need to look into this more closely as well.

In [27]:
df.describe()

,economy,corruption_perception,health,social_support,happiness_rank,freedom,happiness_score,generosity
count,782.0,781.0,782.0,782.0,782.0,782.0,782.0,782.0
mean,0.916047,0.125436,0.612416,1.078392,78.69821,0.411091,5.379018,0.218576
std,0.40734,0.105816,0.248309,0.329548,45.182384,0.15288,1.127456,0.122321
min,0.0,0.0,0.0,0.0,1.0,0.0,2.693,0.0
25%,0.6065,0.054,0.440183,0.869363,40.0,0.309768,4.50975,0.13
50%,0.982205,0.091,0.64731,1.124735,79.0,0.431,5.322,0.201982
75%,1.236187,0.15603,0.808,1.32725,118.0,0.531,6.1895,0.278832
max,2.096,0.55191,1.141,1.644,158.0,0.724,7.769,0.838075


In [28]:
df["corruption_perception"] = (
                                df["corruption_perception"]
                                .fillna(df["corruption_perception"].mean())
                            )

In [29]:
df.query("country == 'United Arab Emirates' & year == '2018'")

,country,year,economy,corruption_perception,health,social_support,happiness_rank,freedom,happiness_score,generosity
489,United Arab Emirates,2018,2.096,0.125436,0.67,0.776,20,0.284,6.774,0.186


### **Review of the final dataframe: minimum values of 0**

Looking at the rows that contain zeros, some of these figures appear to make sense within their context (for example, Somalia, whose economy is listed as 0 in 2016). Transforming these values would therefore require a deeper investigation into the sources from which this data originates.
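
Before inspecting individual rows, it can help to count how many zeros each numeric column holds. The sketch below does this on a toy frame (an assumption for illustration) and then selects the rows with at least one zero, using the same `eq(0).any(axis=1)` idea as the next cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "economy": [0.0, 1.2, 0.9],
    "generosity": [0.3, 0.0, 0.0],
})

# Restrict the check to numeric columns so text columns are not compared with 0.
numeric = toy.select_dtypes(include="number")
zeros_per_column = numeric.eq(0).sum()
rows_with_zero = toy[numeric.eq(0).any(axis=1)]
print(zeros_per_column)
print(rows_with_zero)
```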

In [30]:
df[df.eq(0).any(axis=1)].head(10)

,country,year,economy,corruption_perception,health,social_support,happiness_rank,freedom,happiness_score,generosity
73,Indonesia,2015,0.82827,0.0,0.63793,1.08708,74,0.46611,5.399,0.51535
101,Greece,2015,1.15406,0.01397,0.88213,0.92933,102,0.07699,4.857,0.0
111,Iraq,2015,0.98549,0.13788,0.60237,0.81889,112,0.0,4.677,0.17922
119,Congo (Kinshasa),2015,0.0,0.07625,0.09806,1.0012,120,0.22605,4.517,0.24834
122,Sierra Leone,2015,0.33024,0.08786,0.0,0.95571,123,0.4084,4.507,0.21488
147,Central African Republic,2015,0.0785,0.08289,0.06699,0.0,148,0.48879,3.678,0.23835
233,Somalia,2016,0.0,0.3118,0.11466,0.33613,76,0.56778,5.44,0.27225
244,Bosnia and Herzegovina,2016,0.93383,0.0,0.70766,0.64367,87,0.09511,5.163,0.29889
256,Greece,2016,1.24886,0.04127,0.80029,0.75473,99,0.05822,5.033,0.0
268,Sierra Leone,2016,0.36485,0.08196,0.0,0.628,111,0.30685,4.635,0.23897


### **Review of the final dataframe: adding continent column**

A column indicating each country's continent will be added using `country_converter` to perform a more detailed geographical analysis.

In [31]:
cc = coco.CountryConverter()

def continent_conversion(country):
    """Return the continent for a country name, or None if the conversion fails."""
    try:
        return cc.convert(names=country, to='continent')
    except Exception:
        return None

In [32]:
df["continent"] = df["country"].apply(continent_conversion)
df[["continent", "country"]].head(10)

,continent,country
0,Europe,Switzerland
1,Europe,Iceland
2,Europe,Denmark
3,Europe,Norway
4,America,Canada
5,Europe,Finland
6,Europe,Netherlands
7,Europe,Sweden
8,Oceania,New Zealand
9,Oceania,Australia


For this case, I would prefer to separate America into North America, Central America and South America. Let's review the unique values and establish the mapping of the continents.
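
The override can be expressed with `Series.map` followed by `fillna`: countries listed in the mapping get the new label, and all others keep the continent they already had. A minimal sketch on a toy frame and a toy mapping (both assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Canada", "Brazil", "Norway"],
    "continent": ["America", "America", "Europe"],
})
toy_mapping = {"Canada": "North America", "Brazil": "South America"}

# Countries missing from the mapping become NaN after `map`, then fall back
# to the continent value that was already in the frame.
toy["continent"] = toy["country"].map(toy_mapping).fillna(toy["continent"])
print(toy)
```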

In [33]:
unique_countries = df.drop_duplicates(subset=['country'])
unique_countries['continent'].value_counts()

continent
Africa     50
Asia       50
Europe     41
America    27
Oceania     2
Name: count, dtype: int64

In [34]:
unique_countries.query("continent == 'America'")["country"].unique()

array(['Canada', 'Costa Rica', 'Mexico', 'United States', 'Brazil',
       'Venezuela', 'Panama', 'Chile', 'Argentina', 'Uruguay', 'Colombia',
       'Suriname', 'Trinidad and Tobago', 'El Salvador', 'Guatemala',
       'Ecuador', 'Bolivia', 'Paraguay', 'Nicaragua', 'Peru', 'Jamaica',
       'Dominican Republic', 'Honduras', 'Haiti', 'Puerto Rico', 'Belize',
       'Trinidad & Tobago'], dtype=object)

In [35]:
continent_mapping = {
    "Canada": "North America",
    "Costa Rica": "Central America",
    "Mexico": "North America",
    "United States": "North America",
    "Brazil": "South America",
    "Venezuela": "South America",
    "Panama": "Central America",
    "Chile": "South America",
    "Argentina": "South America",
    "Uruguay": "South America",
    "Colombia": "South America",
    "Suriname": "South America",
    "Trinidad and Tobago": "South America",
    "El Salvador": "Central America",
    "Guatemala": "Central America",
    "Ecuador": "South America",
    "Bolivia": "South America",
    "Paraguay": "South America",
    "Nicaragua": "Central America",
    "Peru": "South America",
    "Jamaica": "Central America",
    "Dominican Republic": "Central America",
    "Honduras": "Central America",
    "Haiti": "Central America",
    "Puerto Rico": "Central America",
    "Belize": "Central America",
    "Trinidad & Tobago": "South America"
}

In [36]:
df["continent"] = df["country"].map(continent_mapping).fillna(df["continent"])
df[["continent", "country"]].head(10)

,continent,country
0,Europe,Switzerland
1,Europe,Iceland
2,Europe,Denmark
3,Europe,Norway
4,North America,Canada
5,Europe,Finland
6,Europe,Netherlands
7,Europe,Sweden
8,Oceania,New Zealand
9,Oceania,Australia


### **Review of the final dataframe: reordering the columns**
To make the dataframe easier to read, we will reorder the columns.
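
Selecting columns with a list raises a `KeyError` for unknown names but silently drops any column that is not listed. A small check before reordering can guard against losing a column by accident. The frame and the order below are toy assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"happiness_score": [7.5], "country": ["A"], "year": ["2015"]})
desired_order = ["country", "year", "happiness_score"]

# Selecting with a list silently drops any column that is not listed,
# so compare the two sets of names before reordering.
missing = set(toy.columns) - set(desired_order)
assert not missing, f"Columns that would be dropped: {missing}"

toy = toy[desired_order]
print(list(toy.columns))
```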

In [41]:
new_order = [
    'country',
    'continent',
    'year',
    'economy',
    'health',
    'social_support',
    'freedom',
    'corruption_perception',
    'generosity',
    'happiness_rank',
    'happiness_score'
]

df = df[new_order]
df.head()

,country,continent,year,economy,health,social_support,freedom,corruption_perception,generosity,happiness_rank,happiness_score
0,Switzerland,Europe,2015,1.39651,0.94143,1.34951,0.66557,0.41978,0.29678,1,7.587
1,Iceland,Europe,2015,1.30232,0.94784,1.40223,0.62877,0.14145,0.4363,2,7.561
2,Denmark,Europe,2015,1.32548,0.87464,1.36058,0.64938,0.48357,0.34139,3,7.527
3,Norway,Europe,2015,1.459,0.88521,1.33095,0.66973,0.36503,0.34699,4,7.522
4,Canada,North America,2015,1.32629,0.90563,1.32261,0.63297,0.32957,0.45811,5,7.427


## ***Data understanding***