# ***World Happiness Report - Model Training and Metrics***
---
In this notebook, we train a Machine Learning model on the World Happiness Report dataset. The process covers three key steps: **Data Preprocessing**, **Dataset Splitting**, and **Model Selection and Training**. The results obtained here are interpreted in the next notebook.

## **Setting up the notebook**

First we will adjust the directory of our project in order to correctly detect the packages and modules that we are going to use.

In [1]:
import os

try:
    os.chdir("../../etl-workshop-3")
except FileNotFoundError:
    print("You are already in the correct directory.")

We proceed to import the following for this notebook:

### **Dependencies**

* **Pandas** ➜ Used for data manipulation and analysis.

* **scikit-learn** ➜ Used for machine learning, providing simple and efficient tools for data analysis.
    
    * *model_selection.train_test_split* ➜ Splits datasets into random train and test subsets for training and evaluating models.

    * *linear_model.LinearRegression* ➜ Implements ordinary least squares linear regression.

    * *ensemble.RandomForestRegressor* ➜ Implements a random forest regressor to improve predictive accuracy.

    * *ensemble.GradientBoostingRegressor* ➜ Implements gradient boosting regression to enhance accuracy by combining weak models.

    * *metrics.mean_squared_error* ➜ Calculates the mean squared error, a metric to evaluate the quality of predictions.

    * *metrics.r2_score* ➜ Calculates the R² coefficient of determination to measure model fit.

* **joblib** ➜ Library for serializing and deserializing Python objects, useful for saving and loading trained models.

In [2]:
# Data Manipulation
import pandas as pd

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from sklearn.metrics import mean_squared_error, r2_score
import joblib

## **Reading the data**

We load the CSV generated in the first notebook (*01-EDA*) after performing transformations and merging the 5 datasets from the World Happiness Report.

In [3]:
df = pd.read_csv("./data/world_happiness_report.csv")

In [4]:
df.head()

| | country | continent | year | economy | health | social_support | freedom | corruption_perception | generosity | happiness_rank | happiness_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Europe | 2015 | 1.39651 | 0.94143 | 1.34951 | 0.66557 | 0.41978 | 0.29678 | 1 | 7.587 |
| 1 | Iceland | Europe | 2015 | 1.30232 | 0.94784 | 1.40223 | 0.62877 | 0.14145 | 0.4363 | 2 | 7.561 |
| 2 | Denmark | Europe | 2015 | 1.32548 | 0.87464 | 1.36058 | 0.64938 | 0.48357 | 0.34139 | 3 | 7.527 |
| 3 | Norway | Europe | 2015 | 1.459 | 0.88521 | 1.33095 | 0.66973 | 0.36503 | 0.34699 | 4 | 7.522 |
| 4 | Canada | North America | 2015 | 1.32629 | 0.90563 | 1.32261 | 0.63297 | 0.32957 | 0.45811 | 5 | 7.427 |


## ***Data preprocessing and splitting***

In this section, we perform data preprocessing and split the data into training and testing sets using the functions `creating_dummy_variables` and `train_test_split`.

### **Getting dummy values for categorical columns**
First, we convert the `continent` column into dummy variables. This step is necessary because the models used here only accept numerical input: a column of text categories has to be translated into a numerical format before the model can learn from it.

In [5]:
def creating_dummy_variables(df):
    df = pd.get_dummies(df, columns=["continent"])
    
    columns_rename = {
        "continent_North America": "continent_North_America",
        "continent_Central America": "continent_Central_America",
        "continent_South America": "continent_South_America"
    }

    df = df.rename(columns=columns_rename)
    
    return df

In [6]:
df = creating_dummy_variables(df)
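To make the effect of this encoding concrete, here is a minimal sketch on a hypothetical three-row frame. Only the column names mirror the notebook; the rows are made up for illustration, and the space replacement is a generic alternative to the explicit rename dictionary used above.

```python
import pandas as pd

# Hypothetical three-row sample; only the column names mirror the notebook.
sample = pd.DataFrame({
    "country": ["Canada", "Chile", "Japan"],
    "continent": ["North America", "South America", "Asia"],
})

# One indicator column per category; the original text column is dropped.
encoded = pd.get_dummies(sample, columns=["continent"])

# Replace spaces so every new column name is a valid identifier.
encoded.columns = encoded.columns.str.replace(" ", "_")

print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one active indicator among the `continent_*` columns.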

### **Splitting data**

Now, we split the data into training and testing sets using the sklearn function `train_test_split`. First, we need to remove certain columns that would not benefit our model, which are:

* *happiness_score* ➜ This is the target variable.

* *happiness_rank* ➜ It is derived directly from the target (countries are ranked by their score each year), so keeping it would leak the target into the features.

* *country* ➜ It is a high-cardinality categorical column; one-hot encoding it would add one column per country and make the training process unnecessarily heavy.
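The relationship between `happiness_rank` and the target can be checked on the five rows printed by `df.head()`. The sketch below assumes the rank is simply the descending order of the scores within a year, which holds for these rows; the `head` frame is rebuilt by hand here for illustration.

```python
import pandas as pd

# The five 2015 rows shown by df.head(), re-typed by hand.
head = pd.DataFrame({
    "happiness_rank": [1, 2, 3, 4, 5],
    "happiness_score": [7.587, 7.561, 7.527, 7.522, 7.427],
})

# Rank the scores from highest to lowest and compare with the stored rank.
rebuilt_rank = head["happiness_score"].rank(ascending=False).astype(int)
rank_matches = bool((rebuilt_rank == head["happiness_rank"]).all())

print(rank_matches)
```

If the rank can be rebuilt from the score alone, a model given the rank is effectively given the answer, which is a reason to drop the column.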

In [7]:
X = df.drop(["happiness_score", "happiness_rank", "country"], axis = 1)
y = df["happiness_score"]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=200)

In [9]:
print("Train data shape: ", X_train.shape)
print("Test data shape: ", X_test.shape)

Train data shape:  (547, 14)
Test data shape:  (235, 14)
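The split sizes follow from the 782 rows in the dataset (547 + 235) and `test_size=0.3`. As a quick check, this sketch repeats the split on placeholder arrays of the same dimensions; `X_demo` and `y_demo` are illustrative names and contain no real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the notebook's dimensions (782 rows, 14 features).
X_demo = np.zeros((782, 14))
y_demo = np.zeros(782)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=200
)

# The test size is rounded up (0.3 * 782 = 234.6), the rest goes to training.
print("Train:", X_tr.shape, "Test:", X_te.shape)
```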


In [10]:
X_test.columns

Index(['year', 'economy', 'health', 'social_support', 'freedom',
       'corruption_perception', 'generosity', 'continent_Africa',
       'continent_Asia', 'continent_Central_America', 'continent_Europe',
       'continent_North_America', 'continent_Oceania',
       'continent_South_America'],
      dtype='object')

## ***Model Selection and Training***

In this section, we evaluate three different regression models to predict countries' happiness scores based on the available socioeconomic and continental variables. Each model offers different advantages and approaches to address our prediction problem.

### **Linear Regression**

We start with an ordinary least squares linear regression model, which assumes a linear relationship between the independent variables and the target variable.

**Results:**

- **Mean Squared Error (MSE)**: 0.2109
- **Coefficient of Determination (R²)**: 0.8333

The linear regression model explains approximately 83% of the variance in the happiness scores, indicating a good fit. However, there is room for improvement.

In [11]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)

mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("Mean Squared Error for Linear Regression: ", mse_lr)
print("R2 Score for Linear Regression: ", r2_lr)

Mean Squared Error for Linear Regression:  0.21087396980793913
R2 Score for Linear Regression:  0.8332893378421595


### **Random Forest Regressor**

Next, we use a Random Forest Regressor, which is an ensemble model that builds multiple decision trees and merges them to get more accurate and stable predictions.

**Results:**

- **Mean Squared Error (MSE)**: 0.1700
- **Coefficient of Determination (R²)**: 0.8656

The Random Forest model achieves a lower MSE and a higher R² score than the linear regression, explaining about 86% of the variance. This suggests that the model captures non-linear relationships that the linear model misses.

In [12]:
rf_model = RandomForestRegressor(n_estimators=50, random_state=200)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Mean Squared Error for Random Forest: ", mse_rf)
print("R2 Score for Random Forest: ", r2_rf)

Mean Squared Error for Random Forest:  0.17005002768093916
R2 Score for Random Forest:  0.865563527160472
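The "merging" step of a random forest regressor is an average: the forest's prediction is the mean of its individual trees' predictions. This sketch checks that on synthetic data from `make_regression`; the data and the names `forest`, `X_demo`, and `y_demo` are illustrative assumptions, not the notebook's model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, standing in for the real features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=200)
forest.fit(X_demo, y_demo)

# Collect every tree's prediction, then average across trees.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

matches_forest = bool(np.allclose(averaged, forest.predict(X_demo)))
print(matches_forest)
```

Averaging many trees trained on different bootstrap samples is what reduces the variance of any single tree.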


### **Gradient Boosting Regressor**

Finally, we implement a Gradient Boosting Regressor, an ensemble model that builds trees sequentially, each attempting to correct the errors of the previous one.

**Results:**

- **Mean Squared Error (MSE)**: 0.1714
- **Coefficient of Determination (R²)**: 0.8645

The Gradient Boosting model performs similarly to the Random Forest, with a slightly higher MSE and a slightly lower R² score. Because the model is created without a `random_state`, these figures may shift slightly between runs. Overall, the results suggest that both ensemble methods are effective in modeling the data.

In [13]:
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train, y_train)

y_pred_gb = gb_model.predict(X_test)

mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

print("Mean Squared Error for Gradient Boosting: ", mse_gb)
print("R2 Score for Gradient Boosting: ", r2_gb)

Mean Squared Error for Gradient Boosting:  0.17144932369572535
R2 Score for Gradient Boosting:  0.8644572855252797
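The idea of trees correcting the errors of their predecessors can be sketched by hand. With squared-error loss, each new tree is fitted to the current residuals and its scaled prediction is added to the running estimate. The loop below is a simplified illustration on synthetic data, not scikit-learn's actual implementation; `learning_rate`, the depth of 3, and the 50 rounds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, standing in for the real features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # Fit a shallow tree to what the current ensemble still gets wrong.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)

    # Add a damped version of the correction to the running prediction.
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after 50 rounds:", mse_history[-1])
```

The training error shrinks round after round; on unseen data the improvement eventually flattens, which is why the number of rounds and the learning rate are tuned together.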


### **Conclusions**

- Both ensemble models (Random Forest and Gradient Boosting) outperform the linear regression model on this test split, which suggests the presence of non-linear relationships in the data.

- The Random Forest Regressor slightly outperforms the Gradient Boosting Regressor in this case.

- The Random Forest (RF) model will be saved for future predictions due to its better performance.
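Since the gap between the two ensemble models is small and comes from a single train/test split, a k-fold comparison could make the ranking more robust. The sketch below shows one way to do that with `cross_val_score`; it runs on synthetic data, so its scores say nothing about the happiness dataset, and the fold count and model settings are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, standing in for X and y from the notebook.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=200),
    "Gradient Boosting": GradientBoostingRegressor(random_state=200),
}

# Shuffled 5-fold split, reused for every model so the folds are identical.
cv = KFold(n_splits=5, shuffle=True, random_state=200)

cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean and spread across folds shows whether a difference of about 0.001 in R² is larger than the fold-to-fold noise.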

## ***Saving the RF model***

In [14]:
joblib.dump(rf_model, "./model/rf_model.pkl")

['./model/rf_model.pkl']
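Two practical points about this step: `joblib.dump` does not create missing folders, so the target directory has to exist beforehand, and the saved file can be restored later with `joblib.load`. The sketch below shows a save-and-reload round trip in a temporary directory with a small stand-in model; the data and the names `model`, `restored`, and `model_path` are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on random data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=200)
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the folder first; dumping into a missing folder raises an error.
    model_dir = os.path.join(tmp_dir, "model")
    os.makedirs(model_dir, exist_ok=True)

    model_path = os.path.join(model_dir, "rf_model.pkl")
    joblib.dump(model, model_path)

    # Load the model back, as a later notebook or script would.
    restored = joblib.load(model_path)

same_predictions = bool(
    np.allclose(model.predict(X_demo), restored.predict(X_demo))
)
print(same_predictions)
```

A model saved this way should be loaded with a compatible scikit-learn version, and only from files you trust, since the format is based on pickle.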