# **Happiness Score Prediction - Machine Learning**

-------------------------------------------------------

## **Objective**
Train a **regression machine learning model** to predict the **happiness score** from **CSV files** of world happiness data covering 2015 to 2019. The workflow includes:

- **Extracting features** from raw datasets (ETL process).
- **Training a regression model** using an **80-20 data split** (80% training, 20% testing).
- **Streaming transformed data** to a consumer.
- **Using the trained model** in the consumer to predict happiness scores.
- **Storing predictions** with the corresponding features in a database.
- **Evaluating performance** using **testing data and predicted values**.

---

## **Workflow Overview**

**Feature Engineering**  
   - Normalize `happiness_score` to fit within **[0,10]**.  
   - Scale numerical features using **MinMaxScaler** or **StandardScaler**.  

**Model Training**  
   - Use an **80-20 train-test split** to train the model.  
   - Compare different regression models:
     - **Linear Regression**
     - **Ridge & Lasso Regression**
     - **Random Forest Regressor**
     - **XGBoost Regressor**
   - **Tune hyperparameters** to improve performance.
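
As an illustration of the tuning step, the sketch below runs a small grid search with cross-validation. The data is synthetic and the grid values are hypothetical; neither comes from this notebook, and a realistic search would use a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption): 200 rows, 5 features, noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=200)

# Hypothetical, deliberately small search grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_rmse = -search.best_score_  # scores are negated RMSE, so flip the sign
print(best_params, round(best_rmse, 3))
```

`search.best_estimator_` is refit on all the data passed to `fit`, so it can be evaluated directly on a held-out test set.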

**Data Streaming**  
   - Stream **transformed data** to a consumer.
   - Retrieve data from the consumer.
   - Use the **trained model** to make predictions.
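
The producer/consumer flow can be sketched in a single process. This is only a stand-in: it uses `queue.Queue` in place of a real message broker, and the two feature names, the synthetic data, and the linear model are all assumptions made for the example.

```python
import json
import queue

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a tiny stand-in model on synthetic data (assumption: two features).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1]
model = LinearRegression().fit(X_demo, y_demo)

# "Producer": serialize each transformed row as JSON and push it to the stream.
stream = queue.Queue()
for row in X_demo[:5]:
    stream.put(json.dumps({"economy": float(row[0]), "health": float(row[1])}))
stream.put(None)  # sentinel marking the end of the stream

# "Consumer": read messages one by one and predict with the trained model.
predictions = []
while (message := stream.get()) is not None:
    record = json.loads(message)
    features = np.array([[record["economy"], record["health"]]])
    record["predicted_happiness_score"] = float(model.predict(features)[0])
    predictions.append(record)

print(len(predictions))
```

The consumer must build the feature array in the same column order the model was trained on; with a real broker, only the transport changes.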

**Database Storage**  
   - Store **predictions along with input features** in a database.  
   - Ensure data **integrity and accessibility** for future analysis.
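
A minimal sketch of the storage step, assuming SQLite as the database. The table name, column names, and record values are hypothetical; the project may use a different database engine and schema.

```python
import sqlite3

# Hypothetical records: input features plus the model's prediction.
records = [
    {"economy": 0.32, "health": 0.30, "predicted_happiness_score": 3.61},
    {"economy": 1.30, "health": 0.90, "predicted_happiness_score": 7.12},
]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    """
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        economy REAL NOT NULL,
        health REAL NOT NULL,
        predicted_happiness_score REAL NOT NULL
    )
    """
)
# Named placeholders keep the insert parameterized.
conn.executemany(
    "INSERT INTO predictions (economy, health, predicted_happiness_score) "
    "VALUES (:economy, :health, :predicted_happiness_score)",
    records,
)
conn.commit()

stored_rows = conn.execute(
    "SELECT economy, health, predicted_happiness_score FROM predictions ORDER BY id"
).fetchall()
print(stored_rows)
conn.close()
```

The `NOT NULL` constraints are one simple way to enforce integrity: a record with a missing feature or prediction is rejected at insert time.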

**Model Evaluation**  
   - Compute **Root Mean Squared Error (RMSE)** and **R²**.  
   - Analyze **residual distributions** for normality.  
   - Validate against **real-world happiness scores**.
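
The evaluation steps can be sketched as follows. The arrays are synthetic stand-ins for test targets and predictions, and the Shapiro-Wilk test is only one possible normality check, chosen here as an assumption.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-ins for y_test and model predictions (assumption).
rng = np.random.default_rng(1)
y_true = rng.uniform(3.0, 8.0, size=150)
y_pred = y_true + rng.normal(scale=0.4, size=150)

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is in the same units as the happiness score
r2 = r2_score(y_true, y_pred)

# Residual normality: a small p-value suggests the residuals are not normal.
residuals = y_true - y_pred
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} Shapiro p={shapiro_p:.3f}")
```

A histogram or Q-Q plot of `residuals` is a useful visual complement to the test.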


## **Metadata**
- **Author:** Natalia López Gallego  
- **Python Version:** 3.12.10  

In [2]:
import pandas as pd
import numpy as np

# Model selection and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder


# Regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor


# Evaluation metrics
from sklearn.metrics import root_mean_squared_error, r2_score

# Statsmodels for statistical models
import statsmodels.api as sm
import statsmodels.othermod.betareg as betareg
from statsmodels.genmod.generalized_linear_model import GLM

import pickle

In [3]:
df = pd.read_csv("../data/interim/happiness_data_alternative.csv")

df.head()

Unnamed: 0,health_x_economy,freedom,family,health,economy_t-1_x_health_t-1,family_generosity_ratio,continent,happiness_score,trust,economy_health_ratio,health_x_country_economy_mean,economy,family_t-1_x_freedom_t-1,country_economy_mean
0,0.097017,0.23414,1.02951,0.30335,0.0,2.819726,Asia,3.575,0.09719,1.054259,0.10833,0.31982,0.0,0.357113
1,0.066301,0.1643,0.11037,0.17344,0.097017,0.352969,Asia,3.36,0.07112,2.20392,0.061938,0.38227,0.241049,0.357113
2,0.072566,0.10618,0.581543,0.180747,0.066301,1.864633,Asia,3.794,0.061158,2.221091,0.064547,0.401477,0.018134,0.357113
3,0.08466,0.085,0.537,0.255,0.072566,2.811371,Asia,3.632,0.036,1.30191,0.091064,0.332,0.061748,0.357113
4,0.12635,0.417,0.517,0.361,0.08466,3.271945,Asia,3.203,0.025,0.969502,0.128918,0.35,0.045645,0.357113


In [4]:
# Use 'sparse_output=False' instead of 'sparse=False'
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded_continent = encoder.fit_transform(df[["continent"]])

# Convert the encoded array to a DataFrame, reusing df's index so the join below aligns row by row
continent_df = pd.DataFrame(
    encoded_continent,
    columns=encoder.get_feature_names_out(["continent"]),
    index=df.index,
)

# Drop the original 'continent' column and join the new one-hot encoded columns
df = df.drop(columns=["continent"]).join(continent_df)

# Display the updated DataFrame
print(df.head())

   health_x_economy  freedom    family    health  economy_t-1_x_health_t-1  \
0          0.097017  0.23414  1.029510  0.303350                  0.000000   
1          0.066301  0.16430  0.110370  0.173440                  0.097017   
2          0.072566  0.10618  0.581543  0.180747                  0.066301   
3          0.084660  0.08500  0.537000  0.255000                  0.072566   
4          0.126350  0.41700  0.517000  0.361000                  0.084660   

   family_generosity_ratio  happiness_score     trust  economy_health_ratio  \
0                 2.819726            3.575  0.097190              1.054259   
1                 0.352969            3.360  0.071120              2.203920   
2                 1.864633            3.794  0.061158              2.221091   
3                 2.811371            3.632  0.036000              1.301910   
4                 3.271945            3.203  0.025000              0.969502   

   health_x_country_economy_mean   economy  family_t-1_x

In [5]:
df = df.rename(columns={
    'economy_t-1_x_health_t-1': 'economy_t1_x_health_t1',
    'family_t-1_x_freedom_t-1': 'family_t1_x_freedom_t1'
})


In [6]:
# Target variable and feature definition

X = df.drop(columns=["happiness_score"])
y = df["happiness_score"]


In [7]:
# Data split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
def evaluate_model(name: str, y_true: np.ndarray, y_pred: np.ndarray, results: dict) -> None:
    """
    Evaluates a regression model's performance and stores the results.

    Parameters:
    - name (str): The name of the model.
    - y_true (np.ndarray): Array of actual target values.
    - y_pred (np.ndarray): Array of predicted target values.
    - results (dict): Dictionary to store evaluation metrics.

    Returns:
    - None: The function updates the results dictionary in-place.
    """
    rmse = root_mean_squared_error(y_true, y_pred)  # requires scikit-learn >= 1.4
    r2 = r2_score(y_true, y_pred)
    results[name] = {'RMSE': rmse, 'R2': r2}

# Dictionary to store results
results = {}

In [9]:
# 1️⃣ Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
evaluate_model("LinearRegression", y_test, y_pred_lr, results)

In [10]:
# 2️⃣ GLM Gaussian (identity link); statsmodels needs an explicit intercept column
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)
glm_gauss = sm.GLM(y_train, X_train_sm, family=sm.families.Gaussian()).fit()
y_pred_glm = glm_gauss.predict(X_test_sm)
evaluate_model("GLM_Gaussian", y_test, y_pred_glm, results)

In [11]:
# 3️⃣ Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
evaluate_model("Ridge", y_test, y_pred_ridge, results)

In [12]:
# 4️⃣ Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
evaluate_model("Lasso", y_test, y_pred_lasso, results)


In [13]:
# 5️⃣ Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
evaluate_model("RandomForest", y_test, y_pred_rf, results)

In [14]:
# 6️⃣ XGBoost Regressor
xgb = XGBRegressor(n_estimators=100, random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
evaluate_model("XGBoost", y_test, y_pred_xgb, results)

In [15]:
# 7️⃣ Fractional logit GLM on the scaled target (stored as "BetaRegression_scaled")
# Note: this fits a Binomial-family GLM with a logit link, not statsmodels' BetaModel.
scaler_y = MinMaxScaler()  # maps y_train onto the closed interval [0, 1]

# Convert Series to NumPy arrays and reshape before scaling
y_train_beta = scaler_y.fit_transform(y_train.values.reshape(-1, 1)).flatten()
y_test_beta = scaler_y.transform(y_test.values.reshape(-1, 1)).flatten()

# Fit the GLM (Binomial family, logit link) on the scaled target
glm_beta = GLM(y_train_beta, X_train_sm, family=sm.families.Binomial(link=sm.families.links.logit()))
beta_model = glm_beta.fit()

# Make predictions (on the [0, 1] scale)
y_pred_beta = beta_model.predict(X_test_sm)

# Reverse scaling to get predictions in the original scale
y_pred_beta_rescaled = scaler_y.inverse_transform(y_pred_beta.to_numpy().reshape(-1, 1)).flatten()

# Evaluate the model against the unscaled test target
evaluate_model("BetaRegression_scaled", y_test, y_pred_beta_rescaled, results)



In [16]:
# 8️⃣ CatBoost Regressor
catboost = CatBoostRegressor(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    verbose=0,
    random_seed=42
)
catboost.fit(X_train, y_train)
y_pred_catboost = catboost.predict(X_test)
evaluate_model("CatBoost", y_test, y_pred_catboost, results)

In [17]:

# 9️⃣ LightGBM Regressor
lgbm = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.05,
    max_depth=6,
    random_state=42
)
lgbm.fit(X_train, y_train)
y_pred_lgbm = lgbm.predict(X_test)
evaluate_model("LightGBM", y_test, y_pred_lgbm, results)


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000791 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2358
[LightGBM] [Info] Number of data points in the train set: 625, number of used features: 17
[LightGBM] [Info] Start training from score 5.378979


In [18]:
# 🔟 HistGradientBoosting Regressor
hgb = HistGradientBoostingRegressor(
    max_iter=100,
    learning_rate=0.05,
    max_depth=6,
    random_state=42
)
hgb.fit(X_train, y_train)
y_pred_hgb = hgb.predict(X_test)
evaluate_model("HistGradientBoosting", y_test, y_pred_hgb, results)

In [19]:
# 📋 Show results
results_df = pd.DataFrame(results).T.sort_values(by="RMSE")
print("📊 Model Comparison:")
print(results_df)

📊 Model Comparison:
                           RMSE        R2
CatBoost               0.407501  0.863128
RandomForest           0.411677  0.860308
XGBoost                0.422940  0.852560
HistGradientBoosting   0.433288  0.845257
LightGBM               0.434757  0.844206
Ridge                  0.474136  0.814705
GLM_Gaussian           0.477822  0.811812
LinearRegression       0.477822  0.811812
BetaRegression_scaled  0.485072  0.806058
Lasso                  0.684785  0.613484


In [20]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

# 🔁 Define the cross-validation scheme (here, KFold with 5 splits)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# 📦 Models to evaluate (GLM and Beta excluded)
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=42),
}

# 📊 Evaluate each model with cross-validation
cv_results = {}

for name, model in models.items():
    # Pipeline that scales, then fits; Random Forest and XGBoost do not need scaling, but it does not hurt them
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='neg_root_mean_squared_error')  # Neg RMSE
    cv_results[name] = {
        "CV RMSE (mean)": -np.mean(scores),
        "CV RMSE (std)": np.std(scores)
    }

# 📋 Show results
cv_df = pd.DataFrame(cv_results).T.sort_values(by="CV RMSE (mean)")
print("🔁 Cross-Validation Results:")
print(cv_df)


🔁 Cross-Validation Results:
                  CV RMSE (mean)  CV RMSE (std)
RandomForest            0.417884       0.022192
XGBoost                 0.424532       0.021658
Ridge                   0.510341       0.028423
LinearRegression        0.510508       0.028562
Lasso                   0.558920       0.034985
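
The cross-validation above reports RMSE only. As a hedged sketch, `cross_validate` can collect several metrics in one pass, for instance RMSE and R² together. The data below is synthetic and the model choice is illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 120 rows, 4 features.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=120)

# The scaler sits inside the pipeline, so it is refit on each training fold only.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    pipeline,
    X_demo,
    y_demo,
    cv=cv,
    scoring={"rmse": "neg_root_mean_squared_error", "r2": "r2"},
)
cv_rmse = -scores["test_rmse"]  # flip the sign of the negated RMSE
cv_r2 = scores["test_r2"]
print(cv_rmse.mean(), cv_r2.mean())
```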


In [21]:
# Save the CatBoost model in pickle format
with open("../models/catboost_model.pkl", "wb") as f:
    pickle.dump(catboost, f)
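
On the consumer side, the saved model has to be loaded back before it can predict. The sketch below shows a pickle round trip with a small stand-in model in a temporary directory; the file name and the synthetic data are hypothetical, and the same pattern should apply to the CatBoost file saved above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data (assumption).
rng = np.random.default_rng(3)
X_demo = rng.uniform(size=(80, 3))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name
    # Save the fitted model.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    # Load it back, as a consumer process would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), loaded_model.predict(X_demo))
print(same_predictions)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code. The loading environment should also use library versions compatible with those used for saving.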