# **Happiness Score Prediction - Machine Learning**

-------------------------------------------------------

##  **Objective**
Train a **regression machine learning model** to predict the **happiness score** using **CSV files** of world happiness data from 2015 to 2019. The workflow includes:

- **Extracting features** from raw datasets (ETL process).
- **Training a regression model** using a **70-30 data split** (70% training, 30% testing).
- **Streaming transformed data** to a consumer.
- **Using the trained model** in the consumer to predict happiness scores.
- **Storing predictions** with the corresponding features in a database.
- **Evaluating performance** using **testing data and predicted values**.

---

## **Workflow Overview**

**Feature Engineering**  
   - Normalize `happiness_score` to fit within **[0,10]**.  
   - Scale numerical features using **MinMaxScaler** or **StandardScaler**.  
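
The scaling step can be sketched as follows. This is a minimal illustration on synthetic data (the array `X_demo` is an assumption, not the notebook's dataset); it shows the usual precaution of fitting the scaler on the training rows only, so that no information from the test rows leaks into the transformation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the numerical features (assumption: not the real data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0.0, 2.0, size=(100, 3))

# Same 70-30 split ratio as the rest of the workflow
X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=42)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Training columns span [0, 1]; test columns may fall slightly outside that range
print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

`StandardScaler` can be swapped in with the same `fit_transform` / `transform` pattern.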

**Model Training**  
   - Use a **70-30 train-test split** to train the model.  
   - Compare different regression models:
     - **Linear Regression**
     - **Ridge & Lasso Regression**
     - **Random Forest Regressor**
     - **XGBoost Regressor**
   - **Tune hyperparameters** to improve performance.
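
One possible way to tune hyperparameters is a cross-validated grid search. The sketch below tunes only the `alpha` of a Ridge model on synthetic data; the grid values and the `make_regression` data are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem (assumption: stands in for the happiness data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Candidate regularization strengths (illustrative values)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training split, scored by negative RMSE
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["alpha"]
# score() returns the negative RMSE of the refitted best model on held-out data
test_rmse = -search.score(X_te, y_te)
print(f"best alpha: {best_alpha}, test RMSE: {test_rmse:.3f}")
```

The test split is touched only once, after the search has finished, so the reported RMSE is not biased by the tuning.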

**Data Streaming**  
   - Stream **transformed data** to a consumer.
   - Retrieve data from the consumer.
   - Use the **trained model** to make predictions.
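
The consumer side can be sketched without committing to a particular streaming technology. The example below assumes the trained model is persisted with `joblib` and that each incoming message is a plain dictionary of features; the feature subset, the file name `model.joblib`, and the helper `predict_from_message` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative feature subset (assumption: the real model uses more columns)
FEATURES = ["economy", "health", "freedom"]

# Train a toy model on synthetic, noise-free data and persist it to disk
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.5, size=(50, len(FEATURES)))
y_demo = 3.0 + X_demo @ np.array([1.0, 0.8, 0.5])
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(LinearRegression().fit(X_demo, y_demo), model_path)


def predict_from_message(message: dict, model) -> float:
    """Hypothetical consumer helper: order the features and predict one score."""
    row = np.array([[message[name] for name in FEATURES]])
    return float(model.predict(row)[0])


# Consumer side: load the model once, then predict for each received message
loaded_model = joblib.load(model_path)
message = {"economy": 1.0, "health": 0.7, "freedom": 0.4}
prediction = predict_from_message(message, loaded_model)
print(round(prediction, 3))
```

Keeping the feature order in a single constant guards against messages whose keys arrive in a different order than the training columns.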

**Database Storage**  
   - Store **predictions along with input features** in a database.  
   - Ensure data **integrity and accessibility** for future analysis.
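
A minimal sketch of the storage step, assuming SQLite as the database. The table name `predictions`, its columns, and the helper `store_prediction` are hypothetical; the workflow does not specify a schema or an engine.

```python
import sqlite3

# In-memory database for illustration; a file path would be used in practice
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        economy REAL NOT NULL,
        health REAL NOT NULL,
        freedom REAL NOT NULL,
        predicted_score REAL NOT NULL
    )
    """
)


def store_prediction(conn: sqlite3.Connection, features: dict, predicted_score: float) -> None:
    """Hypothetical helper: insert one feature row together with its prediction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO predictions (economy, health, freedom, predicted_score) "
            "VALUES (:economy, :health, :freedom, :predicted_score)",
            {**features, "predicted_score": predicted_score},
        )


store_prediction(conn, {"economy": 0.32, "health": 0.30, "freedom": 0.23}, 3.6)
rows = conn.execute(
    "SELECT economy, health, freedom, predicted_score FROM predictions"
).fetchall()
print(rows)
```

The `NOT NULL` constraints and the transactional insert are one simple way to address the integrity requirement listed above.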

**Model Evaluation**  
   - Compute **Root Mean Squared Error (RMSE)** and **R²**.  
   - Analyze **residual distributions** for normality.  
   - Validate against **real-world happiness scores**.
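
The evaluation steps can be sketched as an error metric, R², and a normality check on the residuals. The data below are synthetic, and the Shapiro-Wilk test is one possible choice of normality test, not necessarily the one intended here.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic "true" scores and predictions with Gaussian noise (assumption)
rng = np.random.default_rng(42)
y_true = rng.uniform(3.0, 7.5, size=200)
y_pred = y_true + rng.normal(0.0, 0.4, size=200)

# Error metrics: RMSE is the square root of the mean squared error
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
r2 = r2_score(y_true, y_pred)

# Residual normality: a small p-value would suggest non-normal residuals
residuals = y_true - y_pred
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"RMSE={rmse:.3f}  R2={r2:.3f}  Shapiro-Wilk p={shapiro_p:.3f}")
```

A histogram or Q-Q plot of `residuals` is a useful visual complement to the test statistic.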


## **Metadata**
- **Author:** Natalia López Gallego  
- **Python Version:** 3.12.10  

In [21]:
!pip install xgboost
!pip install scikit-learn
!pip install statsmodels



In [9]:
import pandas as pd
import numpy as np

# Model selection and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Evaluation metrics
from sklearn.metrics import root_mean_squared_error, r2_score

# Statsmodels for statistical models
import statsmodels.api as sm
import statsmodels.othermod.betareg as betareg
from statsmodels.genmod.generalized_linear_model import GLM

In [10]:
df = pd.read_csv("../data/interim/happiness_data.csv")

df.head()

| | health_x_economy | economy | health | freedom | family | economy_t-1_x_health_t-1 | continent | happiness_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.097017 | 0.31982 | 0.30335 | 0.23414 | 1.02951 | 0.0 | Asia | 3.575 |
| 1 | 0.066301 | 0.38227 | 0.17344 | 0.1643 | 0.11037 | 0.097017 | Asia | 3.36 |
| 2 | 0.072566 | 0.401477 | 0.180747 | 0.10618 | 0.581543 | 0.066301 | Asia | 3.794 |
| 3 | 0.08466 | 0.332 | 0.255 | 0.085 | 0.537 | 0.072566 | Asia | 3.632 |
| 4 | 0.12635 | 0.35 | 0.361 | 0.417 | 0.517 | 0.08466 | Asia | 3.203 |


In [11]:
# One-hot encode 'continent', dropping the first category to avoid collinearity.
# 'sparse_output' replaces the older 'sparse' argument (scikit-learn >= 1.2).
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded_continent = encoder.fit_transform(df[["continent"]])

# Convert the encoded array to a DataFrame aligned with df's index
continent_df = pd.DataFrame(
    encoded_continent,
    columns=encoder.get_feature_names_out(["continent"]),
    index=df.index,
)

# Drop the original 'continent' column and join the new one-hot encoded columns
df = df.drop(columns=["continent"]).join(continent_df)

# Display the updated DataFrame
print(df.head())

   health_x_economy   economy    health  freedom    family  \
0          0.097017  0.319820  0.303350  0.23414  1.029510   
1          0.066301  0.382270  0.173440  0.16430  0.110370   
2          0.072566  0.401477  0.180747  0.10618  0.581543   
3          0.084660  0.332000  0.255000  0.08500  0.537000   
4          0.126350  0.350000  0.361000  0.41700  0.517000   

   economy_t-1_x_health_t-1  happiness_score  continent_Asia  \
0                  0.000000            3.575             1.0   
1                  0.097017            3.360             1.0   
2                  0.066301            3.794             1.0   
3                  0.072566            3.632             1.0   
4                  0.084660            3.203             1.0   

   continent_Europe  continent_North America  continent_Oceania  \
0               0.0                      0.0                0.0   
1               0.0                      0.0                0.0   
2               0.0                      

In [12]:
# Target variable and feature definition

X = df.drop(columns=["happiness_score"])
y = df["happiness_score"]


In [13]:
# 70-30 train-test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [14]:
def evaluate_model(name: str, y_true: np.ndarray, y_pred: np.ndarray, results: dict) -> None:
    """
    Evaluates a regression model's performance and stores the results.

    Parameters:
    - name (str): The name of the model.
    - y_true (np.ndarray): Array of actual target values.
    - y_pred (np.ndarray): Array of predicted target values.
    - results (dict): Dictionary to store evaluation metrics.

    Returns:
    - None: The function updates the results dictionary in-place.
    """
    rmse = root_mean_squared_error(y_true, y_pred)  # available in scikit-learn >= 1.4
    r2 = r2_score(y_true, y_pred)
    results[name] = {'RMSE': rmse, 'R2': r2}

# Dictionary to store results, keyed by model name
results = {}
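
To make the two metrics easier to interpret, here is a small self-contained sketch of the same helper applied to toy values. It recomputes RMSE with `np.sqrt(mean_squared_error(...))` so that it does not depend on the scikit-learn version; the arrays and the `toy_results` dictionary are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_model(name: str, y_true: np.ndarray, y_pred: np.ndarray, results: dict) -> None:
    """Store RMSE and R2 for one model in the shared results dictionary."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    results[name] = {"RMSE": rmse, "R2": r2_score(y_true, y_pred)}


toy_results = {}
y_true = np.array([3.0, 4.0, 5.0, 6.0])

# A perfect predictor: zero error, R2 of 1
evaluate_model("perfect", y_true, y_true, toy_results)

# Always predicting the mean: the baseline that R2 compares against, so R2 is 0
evaluate_model("mean_only", y_true, np.full(4, y_true.mean()), toy_results)

print(pd.DataFrame(toy_results).T)
```

Any model worth keeping should land between these two reference points, with an R² above 0 and an RMSE below that of the mean-only baseline.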

In [15]:
# 1️⃣ Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
evaluate_model("LinearRegression", y_test, y_pred_lr, results)

In [16]:
# 2️⃣ GLM Gaussian (identity link); equivalent to ordinary least squares
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)
glm_gauss = sm.GLM(y_train, X_train_sm, family=sm.families.Gaussian()).fit()
y_pred_glm = glm_gauss.predict(X_test_sm)
evaluate_model("GLM_Gaussian", y_test, y_pred_glm, results)

In [17]:
# 3️⃣ Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
evaluate_model("Ridge", y_test, y_pred_ridge, results)

In [18]:
# 4️⃣ Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
evaluate_model("Lasso", y_test, y_pred_lasso, results)

In [20]:
# 5️⃣ Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
evaluate_model("RandomForest", y_test, y_pred_rf, results)

In [21]:
# 6️⃣ XGBoost Regressor
xgb = XGBRegressor(n_estimators=100, random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
evaluate_model("XGBoost", y_test, y_pred_xgb, results)

In [22]:
# 7️⃣ "Beta" regression on a target scaled to [0, 1]
# Note: this fits a Binomial GLM with a logit link (a fractional-logit model),
# which stands in for a beta likelihood here.
scaler_y = MinMaxScaler()

# Convert Series to NumPy arrays and reshape before scaling
# (the scaler is fitted on the training target only)
y_train_beta = scaler_y.fit_transform(y_train.values.reshape(-1, 1)).flatten()
y_test_beta = scaler_y.transform(y_test.values.reshape(-1, 1)).flatten()

# Fit the fractional-logit GLM on the scaled target
glm_beta = GLM(y_train_beta, X_train_sm, family=sm.families.Binomial(link=sm.families.links.Logit()))
beta_model = glm_beta.fit()

# Make predictions
y_pred_beta = beta_model.predict(X_test_sm)

# Reverse scaling to get predictions in the original scale
y_pred_beta_rescaled = scaler_y.inverse_transform(y_pred_beta.to_numpy().reshape(-1, 1)).flatten()

# Evaluate the model
evaluate_model("BetaRegression_scaled", y_test, y_pred_beta_rescaled, results)
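
If a true beta likelihood (for example statsmodels' `BetaModel`) were used instead of the Binomial GLM in this cell, the target would have to lie strictly inside (0, 1), whereas `MinMaxScaler` maps the training minimum and maximum to exactly 0 and 1. One commonly used adjustment compresses the values slightly toward 0.5. The sketch below is an illustration of that adjustment only; the helper name is hypothetical and this step is not part of the notebook.

```python
import numpy as np


def squeeze_to_open_unit_interval(y01: np.ndarray) -> np.ndarray:
    """Map values in [0, 1] into (0, 1) using (y * (n - 1) + 0.5) / n."""
    n = y01.size
    return (y01 * (n - 1) + 0.5) / n


# Toy target already scaled to [0, 1], including both boundary values
y01 = np.array([0.0, 0.25, 0.5, 1.0])
y_open = squeeze_to_open_unit_interval(y01)
print(y_open)  # → [0.125  0.3125 0.5    0.875 ]
```

The shift shrinks as the sample size grows, so for a dataset of several hundred rows the adjusted values stay very close to the originals.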



In [23]:
# 📋 Show results
results_df = pd.DataFrame(results).T.sort_values(by="RMSE")
print("📊 Model Comparison:")
print(results_df)

📊 Model Comparison:
                           RMSE        R2
RandomForest           0.450144  0.830738
XGBoost                0.470869  0.814794
GLM_Gaussian           0.472672  0.813372
LinearRegression       0.472672  0.813372
Ridge                  0.475470  0.811157
BetaRegression_scaled  0.480273  0.807322
Lasso                  0.676568  0.617634


In [48]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

# 🔁 Define the cross-validation scheme (here, KFold with 5 splits)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# 📦 Models to evaluate (all except GLM and Beta)
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=42),
}

# 📊 Evaluate each model with cross-validation
cv_results = {}

for name, model in models.items():
    # Pipeline that scales, then fits (Random Forest/XGBoost do not need scaling, but it does not hurt them)
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='neg_root_mean_squared_error')  # Neg RMSE
    cv_results[name] = {
        "CV RMSE (mean)": -np.mean(scores),
        "CV RMSE (std)": np.std(scores)
    }

# 📋 Show results
cv_df = pd.DataFrame(cv_results).T.sort_values(by="CV RMSE (mean)")
print("🔁 Cross-Validation Results:")
print(cv_df)


🔁 Cross-Validation Results:
                  CV RMSE (mean)  CV RMSE (std)
RandomForest            0.594741       0.027343
LinearRegression        0.631722       0.019379
Ridge                   0.631722       0.019383
XGBoost                 0.644110       0.025131
Lasso                   0.644536       0.017037
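
To report R² alongside RMSE from the same folds, `cross_validate` accepts several scorers at once. The sketch below uses synthetic data and a single Ridge pipeline as an assumption; the same pattern would apply to each entry of the `models` dictionary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumption: stands in for X and y)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside the pipeline, so it is refitted on each training fold
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

scores = cross_validate(
    pipeline,
    X_demo,
    y_demo,
    cv=cv,
    scoring={"rmse": "neg_root_mean_squared_error", "r2": "r2"},
)

# scikit-learn negates error metrics so that "higher is better"; flip the sign back
cv_rmse = -scores["test_rmse"].mean()
cv_r2 = scores["test_r2"].mean()
print(f"CV RMSE: {cv_rmse:.3f}  CV R2: {cv_r2:.3f}")
```

Because every fold serves once as validation data, these averages are usually a steadier estimate than a single 70-30 split.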
