### Comparing Aggregate Models for Regression

This try-it focuses on using ensemble models in a regression setting. Just as you previously combined individual classification estimators into an ensemble, your goal here is to explore ensembles of regression models. As in the earlier assignment, you will use scikit-learn, building the ensemble with the `VotingRegressor`.


#### Dataset and Task

Below, a dataset containing census information on individuals and their hourly wage is loaded using the `fetch_openml` function. [OpenML](https://www.openml.org/) is another repository for datasets. Your task is to use ensemble methods to predict the `WAGE` column of the data. Your ensemble should, at the very least, consider the following models:

- `LinearRegression` -- perhaps you even want the `TransformedTargetRegressor` here.
- `KNeighborsRegressor`
- `DecisionTreeRegressor`
- `Ridge`
- `SVR`

Tune the `VotingRegressor` to try to optimize the prediction performance and determine if the wisdom of the crowd performed better in this setting than any of the individual models themselves.  Report back on your findings and discuss the interpretability of your findings.  Is there a way to determine what features mattered in predicting wages?
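The `TransformedTargetRegressor` suggestion above can be sketched as follows. This is a minimal illustration on synthetic, right-skewed data rather than the wage dataset; the names `X_demo`, `y_demo`, and `log_lr`, and the choice of `np.log1p`/`np.expm1` as the transform pair, are assumptions made for the example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, right-skewed target standing in for hourly wages (not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.exp(1.5 + 0.4 * X_demo[:, 0] + rng.normal(scale=0.2, size=200))

# The regressor is fit on log1p(y); predictions are mapped back with expm1
log_lr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_lr.fit(X_demo, y_demo)

# Predictions come back on the original (untransformed) scale
demo_pred = log_lr.predict(X_demo[:5])
print(demo_pred)
```

A wrapper like this can be used anywhere a plain `LinearRegression` would be, including as a member of a `VotingRegressor`.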

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml

In [12]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.inspection import permutation_importance

# Load dataset
survey = fetch_openml(data_id=534, as_frame=True).frame

In [13]:
# Display the first few rows to understand the data
survey.head()

| | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | no | female | 21 | not_member | 5.1 | 35 | Hispanic | Other | Manufacturing | Married |
| 1 | 9 | no | female | 42 | not_member | 4.95 | 57 | White | Other | Manufacturing | Married |
| 2 | 12 | no | male | 1 | not_member | 6.67 | 19 | White | Other | Manufacturing | Unmarried |
| 3 | 12 | no | male | 4 | not_member | 4.0 | 22 | White | Other | Other | Unmarried |
| 4 | 12 | no | male | 17 | not_member | 7.5 | 35 | White | Other | Other | Married |


In [16]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.inspection import permutation_importance

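# Preprocessing setup. The original cell is not shown, so this is an assumed
# reconstruction: the split settings and column groupings are guesses based on survey.head().
X = survey.drop(columns='WAGE')
y = survey['WAGE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the numeric columns and one-hot encode the categorical ones
numeric_features = ['EDUCATION', 'EXPERIENCE', 'AGE']
categorical_features = ['SOUTH', 'SEX', 'UNION', 'RACE', 'OCCUPATION', 'SECTOR', 'MARR']
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])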

# Create base models with parameters to tune
lr = Pipeline([('preprocessor', preprocessor), ('regressor', LinearRegression())])
knn = Pipeline([('preprocessor', preprocessor),
                ('regressor', KNeighborsRegressor())])
dt = Pipeline([('preprocessor', preprocessor),
               ('regressor', DecisionTreeRegressor(random_state=42))])
ridge = Pipeline([('preprocessor', preprocessor),
                 ('regressor', Ridge())])
svr = Pipeline([('preprocessor', preprocessor),
                ('regressor', SVR())])

# Create VotingRegressor
voting_reg = VotingRegressor([
    ('lr', lr),
    ('knn', knn),
    ('dt', dt),
    ('ridge', ridge),
    ('svr', svr)
])

# Define parameter grid for GridSearchCV
param_grid = {
    'weights': [[1,1,1,1,1], [2,1,1,2,1], [1,2,2,1,1], [3,1,1,3,1], [1,3,3,1,1]],
    'knn__regressor__n_neighbors': [3, 5, 7],
    'dt__regressor__max_depth': [3, 5, 7],
    'ridge__regressor__alpha': [0.1, 1.0, 10.0],
    'svr__regressor__C': [0.1, 1.0, 10.0]
}

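# Use GridSearchCV to tune the ensemble.
# The grid has 5 * 3 * 3 * 3 * 3 = 405 combinations, each fit on 5 folds, so this is slow;
# n_jobs=-1 runs the fits in parallel on all available cores.
grid_search = GridSearchCV(voting_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', n_jobs=-1)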
grid_search.fit(X_train, y_train)

# Get best model performance
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
ensemble_mse = mean_squared_error(y_test, y_pred)
ensemble_r2 = r2_score(y_test, y_pred)

# Compare with individual models (fit here with default hyperparameters, not the tuned ones)
individual_scores = {}
for name, model in [('lr', lr), ('knn', knn), ('dt', dt), ('ridge', ridge), ('svr', svr)]:
    model.fit(X_train, y_train)
    # Use a separate name so the ensemble's y_pred is not overwritten
    model_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, model_pred)
    r2 = r2_score(y_test, model_pred)
    individual_scores[name] = {'MSE': mse, 'R2': r2}

# Calculate permutation importance for feature interpretation
result = permutation_importance(best_model, X_test, y_test, n_repeats=10, random_state=42)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': result.importances_mean
}).sort_values('importance', ascending=False)

print("Best parameters:", grid_search.best_params_)
print("\nBest model performance:", {'MSE': ensemble_mse, 'R2': ensemble_r2})
print("\nIndividual model performance:", individual_scores)
print("\nTop 10 most important features:")
print(feature_importance.head(10))


The `VotingRegressor` ensemble was tuned with `GridSearchCV`. The best configuration used weights `[3,1,1,3,1]`, which triple the contribution of the two linear models (`LinearRegression` and `Ridge`). The ensemble performed slightly better (MSE=19.29, R²=0.39) than the best individual model (Ridge: MSE=19.49, R²=0.39). The decision tree performed poorly (R²=-0.83), while the other individual models performed similarly (R² around 0.38).
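To make the role of the weights concrete, the sketch below checks that a `VotingRegressor` prediction is the weighted average of its members' predictions. It uses synthetic data and a three-model ensemble chosen only for illustration; `X_demo`, `y_demo`, and the weights `[3, 1, 3]` are assumptions, not the tuned configuration reported above.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=120)

# Triple weight on the two linear members, single weight on the tree
weights = [3, 1, 3]
vote = VotingRegressor(
    [('lr', LinearRegression()),
     ('dt', DecisionTreeRegressor(max_depth=3, random_state=0)),
     ('ridge', Ridge(alpha=1.0))],
    weights=weights,
).fit(X_demo, y_demo)

# Stack each fitted member's predictions, then take the weighted mean by hand
member_preds = np.column_stack([m.predict(X_demo) for m in vote.estimators_])
manual = np.average(member_preds, axis=1, weights=weights)

# The hand-computed average should agree with the ensemble's own prediction
matches = np.allclose(manual, vote.predict(X_demo))
print(matches)
```

Because the ensemble is a weighted average, a poorly performing member (such as an overfit tree) still pulls predictions toward its own output unless its weight is reduced.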
Permutation importance, which measures how much the test score drops when a single column is shuffled, identified education as the strongest predictor of wages, followed by occupation and experience. Demographic factors such as sex, age, and race had moderate to low importance, and union membership had minimal impact.
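As a rough sketch of how permutation importance ranks features, the example below fits a `Ridge` model on synthetic data in which one column drives the target far more than the others. The column names `strong`, `weak`, and `noise` are hypothetical and chosen only to make the expected ranking obvious.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 'strong' dominates, 'weak' contributes a little, 'noise' not at all
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=['strong', 'weak', 'noise'])
y_demo = 5.0 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
model = Ridge().fit(X_tr, y_tr)

# Shuffle each column of the held-out data 10 times and record the mean score drop
demo_result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = pd.Series(demo_result.importances_mean, index=X_demo.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Importances computed this way describe how much the fitted model relies on each column, so strongly correlated columns (for example age and experience) can share or mask each other's importance.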
Although the ensemble gave only a marginal improvement in accuracy, the comparable performance of the simpler linear models suggests that the relationships between these features and wages are largely linear. Linear models are also easier to interpret, which, together with their competitive performance, makes them an attractive choice for this application. The feature importance rankings help explain wage determinants regardless of model choice.
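One way to read a linear model directly is to inspect its coefficients after standardizing the inputs, so that each coefficient is expressed per standard deviation of its feature. The sketch below assumes synthetic data with hypothetical `education` and `experience` columns; it is not fit on the wage dataset and does not reproduce the results above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two hypothetical numeric features
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    'education': rng.integers(8, 18, size=250),
    'experience': rng.integers(0, 40, size=250),
})
y_demo = (0.9 * X_demo['education'] + 0.1 * X_demo['experience']
          + rng.normal(scale=1.0, size=250))

# Standardize first so the coefficients are comparable across features
pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge(alpha=1.0))])
pipe.fit(X_demo, y_demo)

# Label each coefficient with its column and sort by absolute size
coefs = pd.Series(pipe.named_steps['ridge'].coef_, index=X_demo.columns)
print(coefs.sort_values(key=abs, ascending=False))
```

With one-hot encoded categorical columns, the same approach applies, but each category level gets its own coefficient.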


