### Comparing Aggregate Models for Regression

This try-it focuses on using ensemble models in a regression setting. Just as you previously combined individual classification estimators into an ensemble, your goal here is to explore ensembles of regression models. As in the earlier assignment, you will use scikit-learn, building the ensemble with the `VotingRegressor`.


#### Dataset and Task

Below, a dataset containing census information on individuals and their hourly wage is loaded using the `fetch_openml` function. [OpenML](https://www.openml.org/) is another repository for datasets. Your task is to use ensemble methods to predict the `WAGE` column of the data. Your ensemble should at the very least consider the following models:

- `LinearRegression`, optionally wrapped in a `TransformedTargetRegressor`
- `KNeighborsRegressor`
- `DecisionTreeRegressor`
- `Ridge`
- `SVR`

Tune the `VotingRegressor` to try to optimize the prediction performance and determine if the wisdom of the crowd performed better in this setting than any of the individual models themselves.  Report back on your findings and discuss the interpretability of your findings.  Is there a way to determine what features mattered in predicting wages?
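
The `TransformedTargetRegressor` suggested in the model list can be sketched as follows. This is a minimal illustration on synthetic data, not the wage dataset: the names `X_demo`, `y_demo`, `log_lr`, and `plain_lr`, and the choice of a log transform, are assumptions made for the example. It fits a linear model to the log of a right-skewed, strictly positive target and maps predictions back to the original scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Hypothetical right-skewed, strictly positive target (a stand-in for a wage-like variable).
y_demo = np.exp(
    1.0 + X_demo @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=300)
)

log_lr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting
    inverse_func=np.exp,  # applied to predictions, so they are on the original scale
)
log_lr.fit(X_demo, y_demo)

# Plain linear regression on the untransformed target, for comparison.
plain_lr = LinearRegression().fit(X_demo, y_demo)

preds = log_lr.predict(X_demo)
print("All predictions positive:", bool((preds > 0).all()))
print("R^2 with log target:", log_lr.score(X_demo, y_demo))
print("R^2 with raw target:", plain_lr.score(X_demo, y_demo))
```

Whether a log transform helps on the actual wage data is something to check empirically; the synthetic target here was constructed so that it does.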

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import uniform


from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.inspection import permutation_importance

In [2]:
df: pd.DataFrame = fetch_openml(data_id=534, as_frame=True).frame

In [3]:
df.head()

index,EDUCATION,SOUTH,SEX,EXPERIENCE,UNION,WAGE,AGE,RACE,OCCUPATION,SECTOR,MARR
0,8,no,female,21,not_member,5.1,35,Hispanic,Other,Manufacturing,Married
1,9,no,female,42,not_member,4.95,57,White,Other,Manufacturing,Married
2,12,no,male,1,not_member,6.67,19,White,Other,Manufacturing,Unmarried
3,12,no,male,4,not_member,4.0,22,White,Other,Other,Unmarried
4,12,no,male,17,not_member,7.5,35,White,Other,Other,Married


In [4]:
# Data Preparation
X = df.drop("WAGE", axis=1)
y = df["WAGE"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocessing for Numerical and Categorical Features
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["category"]).columns

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = Pipeline(
    steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    verbose_feature_names_out=False,
)

# Model Definitions
models = [
    (
        "lr",
        LinearRegression(),
    ),
    ("knn", KNeighborsRegressor()),
    ("dt", DecisionTreeRegressor(random_state=42)),
    ("ridge", Ridge()),
    ("svr", SVR()),
]

# Parameter Grids for GridSearchCV
param_grids = {
    "lr": {"lr__fit_intercept": [True, False]},
    "knn": {"knn__n_neighbors": [3, 5, 7, 9]},
    "dt": {"dt__max_depth": [None, 5, 10, 15]},
    "ridge": {"ridge__alpha": [0.1, 1, 10]},
    "svr": {"svr__kernel": ["linear", "rbf"], "svr__C": [0.1, 1, 10]},
}

In [5]:
# Grid Search for Individual Models
best_models = []
model_results = pd.DataFrame(columns=["Train MSE", "Test MSE"])
for name, model in models:
    pipe = Pipeline(steps=[("preprocessor", preprocessor), (name, model)])
    grid_search = GridSearchCV(
        pipe, param_grids[name], scoring="neg_mean_squared_error", cv=5
    )
    grid_search.fit(X_train, y_train)
    best_models.append((name, grid_search.best_estimator_))
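    # Note: best_score_ is the mean cross-validated score for the best parameters
    # (negated MSE here), not the MSE measured on the full training set.
    model_results.loc[name] = [
        -grid_search.best_score_,
        mean_squared_error(y_test, grid_search.predict(X_test)),
    ]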
    print(f"Best parameters for {name}: {grid_search.best_params_}")

Best parameters for lr: {'lr__fit_intercept': False}
Best parameters for knn: {'knn__n_neighbors': 9}
Best parameters for dt: {'dt__max_depth': 5}
Best parameters for ridge: {'ridge__alpha': 10}
Best parameters for svr: {'svr__C': 1, 'svr__kernel': 'linear'}
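
One detail of the grid-search cell is worth illustrating: `best_score_` is the mean cross-validated score, so the value stored under "Train MSE" for the individual models is a held-out estimate rather than an error on the training set itself. The sketch below shows the difference on synthetic data; `X_demo`, `y_demo`, and `demo_search` are hypothetical names, and the decision tree is chosen only because it makes the gap easy to see.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the wage dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=20.0, random_state=0
)

demo_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [3, 5, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Mean MSE over the five held-out folds for the best parameter setting.
cv_mse = -demo_search.best_score_
# MSE of the refit best estimator on the same data it was trained on.
train_mse = mean_squared_error(y_demo, demo_search.predict(X_demo))

print(f"CV MSE: {cv_mse:.1f}  |  training-set MSE: {train_mse:.1f}")
```

The training-set MSE is typically the more optimistic of the two, which matters when comparing rows of a results table that mix both kinds of numbers.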


In [6]:
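# Voting Regressor with Uniform Weights
# VotingRegressor fits clones of the tuned pipelines on the training data.
voting_reg_uniform = VotingRegressor(estimators=best_models)
voting_reg_uniform.fit(X_train, y_train)
# Here "Train MSE" is measured on the full training set, so it is not directly
# comparable to the cross-validated values recorded for the individual models.
model_results.loc["Voting (Uniform)"] = [
    mean_squared_error(y_train, voting_reg_uniform.predict(X_train)),
    mean_squared_error(y_test, voting_reg_uniform.predict(X_test)),
]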


In [10]:
# Randomized Search over Voting Regressor Weights
# Draw 200 candidate weight vectors and normalize each to sum to 1;
# RandomizedSearchCV then samples n_iter=100 of them.
weights_dist = [uniform(0, 1).rvs(len(best_models)) for _ in range(200)]
param_dist = {"weights": [x / x.sum() for x in weights_dist]}
random_search = RandomizedSearchCV(
    VotingRegressor(estimators=best_models),
    param_distributions=param_dist,
    n_iter=100,
    scoring="neg_mean_squared_error",
    cv=5,
)
random_search.fit(X_train, y_train)
best_voting_reg = random_search.best_estimator_
# As with the individual models, the first value is the mean cross-validated MSE.
model_results.loc["Voting (Random Search)"] = [
    -random_search.best_score_,
    mean_squared_error(y_test, random_search.predict(X_test)),
]
print(
    f"Best parameters for VotingRegressor (Random Search): {random_search.best_params_}"
)

Best parameters for VotingRegressor (Random Search): {'weights': array([0.21227255, 0.05216569, 0.19193823, 0.3822477 , 0.16137583])}
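
The weights reported above line up with the order of the estimators passed to the `VotingRegressor` (here `lr`, `knn`, `dt`, `ridge`, `svr`), and the ensemble prediction is the weighted average of the individual predictions. The following sketch checks this on synthetic data; the three-model ensemble and the values in `demo_weights` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the wage dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_weights = [0.5, 0.3, 0.2]  # hypothetical weights, one per estimator, in order
demo_ensemble = VotingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("dt", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ],
    weights=demo_weights,
)
demo_ensemble.fit(X_demo, y_demo)

# transform() returns each fitted estimator's predictions,
# with shape (n_samples, n_estimators).
per_model_preds = demo_ensemble.transform(X_demo)
manual_average = np.average(per_model_preds, axis=1, weights=demo_weights)

ensemble_preds = demo_ensemble.predict(X_demo)
print("Matches weighted average:", np.allclose(manual_average, ensemble_preds))
```

Read this way, a small weight means that model contributes little to the final prediction, which is one simple handle on interpreting the tuned ensemble.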


In [15]:
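# Permutation importance for each tuned pipeline, measured on the test set.
# Each model is a full pipeline, so columns of the raw X_test are permuted and
# importances are reported per original feature, not per one-hot column.
feature_importances = {}

for name, model in best_models:
    result = permutation_importance(
        model, X_test, y_test, n_repeats=200, random_state=42, n_jobs=-1
    )
    feature_importances[name] = result.importances_mean

# One-hot expanded names, kept for reference; not needed for the table below.
expanded_feature_names = best_voting_reg.named_estimators_['knn'].named_steps['preprocessor'].get_feature_names_out()
feature_importances_df = pd.DataFrame(feature_importances, index=X.columns).fillna(0)
# Clip negative importances to 0, then scale each model's column to sum to 1.
feature_importances_df = feature_importances_df.map(lambda x: 0 if x < 0 else x).apply(lambda x: x/x.sum())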

In [26]:
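# Average over all five models' columns, selected by name so the result does not
# depend on column order or on a "mean" column added by a later cell.
feature_importances_df[list(feature_importances)].mean(axis=1)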

EDUCATION     0.442210
SOUTH         0.023563
SEX           0.105666
EXPERIENCE    0.087952
UNION         0.008659
AGE           0.112843
RACE          0.009464
OCCUPATION    0.203899
SECTOR        0.004901
MARR          0.000844
dtype: float64

In [28]:
print("\nModel Results:")
display(model_results.drop_duplicates())
print("\nFeature Importances:")
feature_importances_df['mean'] = feature_importances_df[list(feature_importances)].mean(axis=1)
display(feature_importances_df.sort_values(by='mean', ascending=False).drop('mean', axis=1))


Model Results:


Model,Train MSE,Test MSE
lr,18.870104,19.502597
knn,21.266187,20.867631
dt,22.948619,21.35588
ridge,18.673956,19.679409
svr,19.246641,21.70128
Voting (Uniform),15.066684,19.499988
Voting (Random Search),18.606379,19.447568



Feature Importances:


Feature,lr,knn,dt,ridge,svr
EDUCATION,0.201684,0.584381,0.528106,0.522406,0.374473
OCCUPATION,0.068279,0.127563,0.134707,0.283919,0.405026
AGE,0.265797,0.039648,0.221155,0.033583,0.00403
SEX,0.023687,0.146328,0.086148,0.11914,0.153028
EXPERIENCE,0.43232,0.007438,0.0,0.0,0.0
SOUTH,0.002746,0.074009,0.007223,0.016095,0.01774
RACE,0.003687,0.0,0.0,0.016953,0.02668
UNION,0.0,0.020633,0.022661,0.0,0.0
SECTOR,0.00143,0.0,0.0,0.005379,0.017698
MARR,0.00037,0.0,0.0,0.002525,0.001325
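
The table above reports importances for each individual model. One possible complement, sketched here on synthetic data, is to run `permutation_importance` on the fitted ensemble itself, so the importances describe the combined predictor. Everything in this block is an assumption made for illustration: the columns `signal`, `noise`, and `group`, the helper `make_demo_pipeline`, and the two-model ensemble are hypothetical and are not taken from the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_rows = 400
demo_X = pd.DataFrame(
    {
        "signal": rng.normal(size=n_rows),           # strongly related to the target
        "noise": rng.normal(size=n_rows),            # unrelated to the target
        "group": rng.choice(["a", "b"], size=n_rows),  # weakly related categorical
    }
)
demo_y = (
    3.0 * demo_X["signal"]
    + 1.0 * (demo_X["group"] == "b")
    + rng.normal(scale=0.1, size=n_rows)
)
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.25, random_state=0)


def make_demo_pipeline(model):
    """Hypothetical helper: scale numeric columns, one-hot encode the categorical one."""
    preprocessor = ColumnTransformer(
        [
            ("num", StandardScaler(), ["signal", "noise"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["group"]),
        ]
    )
    return Pipeline([("preprocessor", preprocessor), ("model", model)])


demo_ensemble = VotingRegressor(
    [
        ("ridge", make_demo_pipeline(Ridge())),
        ("dt", make_demo_pipeline(DecisionTreeRegressor(max_depth=4, random_state=0))),
    ]
)
demo_ensemble.fit(X_tr, y_tr)

# Permute each raw column of the held-out data and record the drop in R^2.
result = permutation_importance(
    demo_ensemble, X_te, y_te, n_repeats=20, random_state=0
)
importances = pd.Series(result.importances_mean, index=X_te.columns).sort_values(
    ascending=False
)
print(importances)
```

In the notebook's setting, the analogous call would pass the tuned ensemble together with `X_test` and `y_test`; the resulting ranking may or may not agree with the per-model averages shown earlier.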
