### Comparing Aggregate Models for Regression

This try-it focuses on using ensemble models in a regression setting. Just as you previously combined individual classification estimators into an ensemble, your goal here is to explore ensembles of regression models. As with your earlier assignment, you will use scikit-learn to build the ensembles with the `VotingRegressor`.


#### Dataset and Task

Below, a dataset containing census information on individuals and their hourly wage is loaded using the `fetch_openml` function. [OpenML](https://www.openml.org/) is another repository for datasets. Your task is to use ensemble methods to explore predicting the `WAGE` column of the data. Your ensemble should at the very least consider the following models:

- `LinearRegression` -- perhaps you even want the `TransformedTargetRegressor` here.
- `KNeighborsRegressor`
- `DecisionTreeRegressor`
- `Ridge`
- `SVR`

Tune the `VotingRegressor` to try to optimize the prediction performance and determine if the wisdom of the crowd performed better in this setting than any of the individual models themselves.  Report back on your findings and discuss the interpretability of your findings.  Is there a way to determine what features mattered in predicting wages?
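The `TransformedTargetRegressor` suggested in the model list above wraps a regressor so that it is trained on a transformed target and its predictions are mapped back to the original scale. The following is a minimal sketch on synthetic data, assuming a log transform suits a strictly positive, right-skewed target; the names `X_demo`, `y_demo`, and `log_linear` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 3 features, strictly positive skewed target
X_demo = rng.normal(size=(200, 3))
y_demo = np.exp(1.0 + 0.5 * X_demo[:, 0] + rng.normal(0, 0.2, 200))

# Fit LinearRegression on log(y); predictions are mapped back with exp
log_linear = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
log_linear.fit(X_demo, y_demo)

# Predictions come back on the original (positive) scale
preds = log_linear.predict(X_demo)
print(preds[:3])
```

Because the wrapper behaves like any other regressor, it could be placed inside a `Pipeline` or a `VotingRegressor` in the same way as the other models.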

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml

In [2]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.inspection import permutation_importance

# Load dataset
survey = fetch_openml(data_id=534, as_frame=True).frame

In [3]:
# Display the first few rows to understand the data
survey.head()

Unnamed: 0,EDUCATION,SOUTH,SEX,EXPERIENCE,UNION,WAGE,AGE,RACE,OCCUPATION,SECTOR,MARR
0,8,no,female,21,not_member,5.1,35,Hispanic,Other,Manufacturing,Married
1,9,no,female,42,not_member,4.95,57,White,Other,Manufacturing,Married
2,12,no,male,1,not_member,6.67,19,White,Other,Manufacturing,Unmarried
3,12,no,male,4,not_member,4.0,22,White,Other,Other,Unmarried
4,12,no,male,17,not_member,7.5,35,White,Other,Other,Married


Here I inspect the first few rows and basic info about the dataset to understand the available features and confirm that the WAGE column is present and numeric.

In [8]:
survey.info()
survey.describe(include="all")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   EDUCATION   534 non-null    int64   
 1   SOUTH       534 non-null    category
 2   SEX         534 non-null    category
 3   EXPERIENCE  534 non-null    int64   
 4   UNION       534 non-null    category
 5   WAGE        534 non-null    float64 
 6   AGE         534 non-null    int64   
 7   RACE        534 non-null    category
 8   OCCUPATION  534 non-null    category
 9   SECTOR      534 non-null    category
 10  MARR        534 non-null    category
dtypes: category(7), float64(1), int64(3)
memory usage: 21.4 KB


Unnamed: 0,EDUCATION,SOUTH,SEX,EXPERIENCE,UNION,WAGE,AGE,RACE,OCCUPATION,SECTOR,MARR
count,534.0,534,534,534.0,534,534.0,534.0,534,534,534,534
unique,,2,2,,2,,,3,6,3,2
top,,no,male,,not_member,,,White,Other,Other,Married
freq,,378,289,,438,,,440,156,411,350
mean,13.018727,,,17.822097,,9.024064,36.833333,,,,
std,2.615373,,,12.37971,,5.139097,11.726573,,,,
min,2.0,,,0.0,,1.0,18.0,,,,
25%,12.0,,,8.0,,5.25,28.0,,,,
50%,12.0,,,15.0,,7.78,35.0,,,,
75%,15.0,,,26.0,,11.25,44.0,,,,


## Explanation of the results

The dataset has 534 rows and 11 columns, with no missing values (534 non-null everywhere).

- Numeric (int64 / float64): EDUCATION, EXPERIENCE, WAGE, AGE
- Categorical (category): SOUTH, SEX, UNION, RACE, OCCUPATION, SECTOR, MARR

`survey.describe(include="all")` confirms the number of categories in each categorical column:

- SOUTH has 2 categories (likely yes/no)
- SEX has 2
- UNION has 2
- RACE has 3
- OCCUPATION has 6
- SECTOR has 3
- MARR has 2

It also gives a numeric summary for:

- EDUCATION (mean ≈ 13 years, range 2–18)
- EXPERIENCE (mean ≈ 17.8 years)
- WAGE (mean ≈ 9.02, max 44.50)
- AGE (18–64)

In [10]:
# Define target and features
target_col = "WAGE"
X = survey.drop(columns=[target_col])
y = survey[target_col].astype(float)

# Convert non-numeric (categorical) columns into numeric indicator columns.
# scikit-learn estimators need numeric input, so pd.get_dummies() turns
# categories like "male/female" or "member/not_member" into True/False columns.
# drop_first=True drops one level per column, which becomes the baseline.
X_encoded = pd.get_dummies(X, drop_first=True)

X_encoded.head()

Unnamed: 0,EDUCATION,EXPERIENCE,AGE,SOUTH_yes,SEX_male,UNION_not_member,RACE_Other,RACE_White,OCCUPATION_Management,OCCUPATION_Other,OCCUPATION_Professional,OCCUPATION_Sales,OCCUPATION_Service,SECTOR_Manufacturing,SECTOR_Other,MARR_Unmarried
0,8,21,35,False,False,True,False,False,False,True,False,False,False,True,False,False
1,9,42,57,False,False,True,False,True,False,True,False,False,False,True,False,False
2,12,1,19,False,True,True,False,True,False,True,False,False,False,True,False,True
3,12,4,22,False,True,True,False,True,False,True,False,False,False,False,True,True
4,12,17,35,False,True,True,False,True,False,True,False,False,False,False,True,False


WAGE is the target: the value I want the model to predict.

Every other column (age, education, occupation, and so on) is a feature the model uses to make that prediction.

## Converting Categorical Data to Numeric

Several columns in the dataset (such as SEX, UNION, RACE, OCCUPATION, and SECTOR) contain text categories. 
scikit-learn estimators require numeric input, so I used `pd.get_dummies()` to convert these categorical 
columns into True/False (1/0) indicator columns. With `drop_first=True`, one level of each column is dropped and acts as the baseline.

For example:
- SEX ("male/female") becomes a single column, SEX_male
- SECTOR (three categories) becomes SECTOR_Manufacturing and SECTOR_Other, with the third category as the baseline
- MARR ("Married/Unmarried") becomes MARR_Unmarried

This transformation makes the entire feature matrix numeric and ready for the models.
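To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The values, including the "Construction" level, are hypothetical and chosen only to illustrate a two-level and a three-level column.

```python
import pandas as pd

# Toy frame with hypothetical values: one two-level and one three-level column
toy = pd.DataFrame({
    "SEX": ["male", "female", "male"],
    "SECTOR": ["Other", "Manufacturing", "Construction"],
})

# drop_first=True drops the first level (in sorted order) of each column;
# that level becomes the implicit baseline
encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

A row whose indicator columns for SECTOR are all False belongs to the dropped baseline level, so no information is lost.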


In [12]:
# At this point, X_encoded is already fully numeric (thanks to get_dummies).

preprocess = Pipeline([
    ("scaler", StandardScaler())
])


### Preprocessing

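Earlier, I used `pd.get_dummies()` to convert all categorical columns into numeric columns, so the feature matrix `X_encoded` is already fully numeric.

The only remaining preprocessing step is `StandardScaler`, which rescales every feature to zero mean and unit variance so that distance- and kernel-based models such as KNN and SVR are not dominated by variables with larger numeric ranges. Each model pipeline below includes its own `StandardScaler` step, so the scaling statistics are learned from the training split only.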
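As a rough illustration of what the scaler learns when it sits inside a `Pipeline`, the sketch below uses synthetic data with two features on very different scales. The arrays and the name `pipe` are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): one wide-range and one narrow-range feature
X = np.column_stack([rng.normal(40, 12, 200), rng.normal(0.5, 0.1, 200)])
y = 10 * X[:, 1] + rng.normal(0, 0.1, 200)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor()),
])
pipe.fit(X_train, y_train)

# The scaler's statistics come from the training rows only
scaler_mean = pipe.named_steps["scaler"].mean_
train_mean = X_train.mean(axis=0)
print(scaler_mean, train_mean)

# At predict time the same training statistics are applied to the test rows
test_preds = pipe.predict(X_test)
```

Keeping the scaler inside the pipeline means the test rows never influence the scaling parameters.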


In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)


In [22]:
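def evaluate_model(name, pipeline, X_train, y_train, X_test, y_test):
    """Fit the pipeline on the training split and return its test-set RMSE."""
    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_test)
    # RMSE is the square root of MSE, so it is in the same units as WAGE
    mse = mean_squared_error(y_test, preds)
    rmse = mse ** 0.5
    return {"model": name, "rmse": rmse}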


In [23]:
results = []

# Linear Regression
lin_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])

results.append(evaluate_model("LinearRegression", lin_pipe,
                              X_train, y_train, X_test, y_test))

# Ridge Regression
ridge_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge())
])

results.append(evaluate_model("Ridge", ridge_pipe,
                              X_train, y_train, X_test, y_test))

# KNN Regressor
knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor())
])

results.append(evaluate_model("KNeighborsRegressor", knn_pipe,
                              X_train, y_train, X_test, y_test))

# Decision Tree
tree_pipe = Pipeline([
    ("scaler", StandardScaler()),  # not strictly necessary, but harmless
    ("model", DecisionTreeRegressor(random_state=42))
])

results.append(evaluate_model("DecisionTreeRegressor", tree_pipe,
                              X_train, y_train, X_test, y_test))

# SVR
svr_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVR())
])

results.append(evaluate_model("SVR", svr_pipe,
                              X_train, y_train, X_test, y_test))

results_df = pd.DataFrame(results).sort_values("rmse")
results_df


Unnamed: 0,model,rmse
1,Ridge,4.415415
0,LinearRegression,4.416175
4,SVR,4.659351
2,KNeighborsRegressor,4.782876
3,DecisionTreeRegressor,7.650491


After evaluating all individual models, Ridge Regression emerged as the top performer with an RMSE of 4.415415, ahead of Linear Regression (4.416175) by less than 0.001, a margin too small to be meaningful on a single test split. This Ridge RMSE will now serve as the baseline for comparison as I test whether the VotingRegressor ensemble can achieve better predictive performance.
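Differences this small are easier to judge with cross-validation than with one train/test split. The sketch below shows the idea on synthetic data from `make_regression`; the data, the fold count, and the name `cv_rmse` are assumptions for illustration, not results from the wage dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=20.0, random_state=0
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for name, estimator in [("LinearRegression", LinearRegression()),
                        ("Ridge", Ridge())]:
    pipe = make_pipeline(StandardScaler(), estimator)
    # scikit-learn maximizes scores, so RMSE is reported as a negative number
    scores = cross_val_score(
        pipe, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = (-scores.mean(), scores.std())

for name, (mean_rmse, std_rmse) in cv_rmse.items():
    print(f"{name}: RMSE {mean_rmse:.3f} +/- {std_rmse:.3f}")
```

If the gap between two models is smaller than the fold-to-fold spread, the ranking between them is not reliable.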

In [24]:
voter = VotingRegressor([
    ("ridge", ridge_pipe),
    ("svr", svr_pipe),
    ("knn", knn_pipe),
    ("tree", tree_pipe),
])

voter.fit(X_train, y_train)
voter_preds = voter.predict(X_test)

voter_rmse = mean_squared_error(y_test, voter_preds) ** 0.5
voter_rmse


4.753006239182385

The results of this exercise were surprising. I initially expected the ‘wisdom of the crowd’ approach to outperform the individual models, especially Ridge Regression, which performed the strongest on its own. However, the unweighted VotingRegressor (RMSE ≈ 4.75) performed worse than Ridge, LinearRegression, and SVR on their own, and only narrowly beat KNeighborsRegressor (RMSE ≈ 4.78). This suggests that the weaker members of the ensemble, particularly DecisionTreeRegressor (RMSE ≈ 7.65) and KNeighborsRegressor, pulled down its overall accuracy. A logical next step is to remove these weaker learners and build a new ensemble using only the stronger models (Ridge, LinearRegression, and SVR) to see whether eliminating the poorest performers leads to a more accurate averaged prediction.
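The pull from weaker members follows from how an unweighted `VotingRegressor` combines its models: it averages their predictions, so every member contributes equally. A minimal sketch on synthetic data (the data and the `demo_` names are assumptions) checks this directly:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_voter = VotingRegressor([
    ("ridge", Ridge()),
    ("tree", DecisionTreeRegressor(random_state=0)),
])
demo_voter.fit(X_demo, y_demo)

# Collect each fitted member's predictions and average them by hand
member_preds = np.column_stack(
    [est.predict(X_demo) for est in demo_voter.estimators_]
)
manual_average = member_preds.mean(axis=1)
ensemble_preds = demo_voter.predict(X_demo)

print(np.allclose(manual_average, ensemble_preds))
```

Since the ensemble prediction is a plain mean, a member with large errors shifts every prediction unless it is removed or given a smaller weight.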

## VotingRegressor Without the Weak Models

This ensemble uses only the stronger models:

- Ridge
- LinearRegression
- SVR

and drops the weaker ones (DecisionTree, KNN).

In [25]:
# Create a new VotingRegressor without the weak models
voter_strong = VotingRegressor([
    ("ridge", ridge_pipe),
    ("linear", lin_pipe),
    ("svr", svr_pipe)
])

# Fit and predict
voter_strong.fit(X_train, y_train)
strong_preds = voter_strong.predict(X_test)

# Compute RMSE manually (compatible with all sklearn versions)
mse = mean_squared_error(y_test, strong_preds)
voter_strong_rmse = mse ** 0.5

voter_strong_rmse


4.455849543569436

For this dataset, the “wisdom of the crowd” helped a bit once I removed the weakest models, but it still did not outperform the strongest individual learner. Building a new VotingRegressor from only Ridge, LinearRegression, and SVR, without DecisionTreeRegressor and KNeighborsRegressor, improved the ensemble’s RMSE from about 4.75 to about 4.46. Ridge Regression alone was still slightly better (RMSE ≈ 4.42). In this case, the best single model remained more accurate than the averaged predictions of the group, showing that ensemble methods are not guaranteed to beat a strong, well-specified baseline model.
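Dropping models is one way to tune the ensemble; another is to keep them and search over the `weights` parameter of `VotingRegressor` with `GridSearchCV`. The sketch below runs on synthetic data, and the candidate weight lists are arbitrary assumptions chosen for illustration rather than recommended values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=15.0, random_state=0
)

demo_voter = VotingRegressor([
    ("ridge", make_pipeline(StandardScaler(), Ridge())),
    ("svr", make_pipeline(StandardScaler(), SVR())),
    ("knn", make_pipeline(StandardScaler(), KNeighborsRegressor())),
])

# Each candidate is one weight per member, in the order listed above
param_grid = {"weights": [[1, 1, 1], [2, 1, 1], [4, 1, 1], [1, 0, 0]]}

search = GridSearchCV(
    demo_voter, param_grid, scoring="neg_root_mean_squared_error", cv=5
)
search.fit(X_demo, y_demo)

best_weights = search.best_params_["weights"]
best_cv_rmse = -search.best_score_
print(best_weights, round(best_cv_rmse, 3))
```

The `[1, 0, 0]` candidate reduces the ensemble to its first member, which gives the search a way to report that a single model is the best choice.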

## Permutation Importance
I used scikit-learn’s `permutation_importance` to measure how much the model’s test score drops when each feature is randomly shuffled. With the default scorer for a regressor this score is R², so a larger drop means the model relied more on that feature. Permutation importance is model-agnostic, although strongly correlated features can share or mask each other’s importance, so the scores are best read as a guide rather than an exact ranking.

In [26]:
from sklearn.inspection import permutation_importance

# Fit the best model (Ridge) on the full training set
ridge_pipe.fit(X_train, y_train)

# Compute permutation importance
perm = permutation_importance(
    ridge_pipe, 
    X_test, 
    y_test, 
    n_repeats=20, 
    random_state=42
)

# Create a sorted importance DataFrame
importance_df = pd.DataFrame({
    "feature": X_train.columns,
    "importance": perm.importances_mean
}).sort_values("importance", ascending=False)

importance_df.head(15)


Unnamed: 0,feature,importance
0,EDUCATION,0.243794
8,OCCUPATION_Management,0.090776
10,OCCUPATION_Professional,0.07304
4,SEX_male,0.05816
12,OCCUPATION_Service,0.011117
7,RACE_White,0.008521
3,SOUTH_yes,0.006796
2,AGE,0.006191
11,OCCUPATION_Sales,0.004537
9,OCCUPATION_Other,0.001505


The permutation importance results reveal which features played the biggest role in the Ridge Regression model’s ability to predict wages. The feature with the highest importance by a wide margin was EDUCATION, indicating that years of education contributed the most to reducing prediction error. This aligns with real-world expectations, as higher educational attainment typically correlates with higher wages.

The next most influential features were OCCUPATION_Management and OCCUPATION_Professional, suggesting that these occupation indicators carry substantial information about wages in this dataset. SEX_male also appeared as a meaningful predictor, capturing wage differences between male and female workers.

Other features such as OCCUPATION_Service, RACE_White, SOUTH_yes, and AGE had modest but noticeable importance, indicating that regional factors, demographic attributes, and age contribute slightly to the model’s predictive power.

Toward the bottom of the list, features such as EXPERIENCE, RACE_Other, and the SECTOR categories had very low or even slightly negative importance scores. A negative permutation importance means that shuffling the feature did not lower the model’s score and may have raised it slightly by chance, which suggests these variables added little beyond what the other features already provide. For EXPERIENCE this may partly reflect redundancy: in the rows displayed earlier, AGE equals EDUCATION + EXPERIENCE + 6, so the three columns carry overlapping information.

Overall, the permutation analysis shows that education level and occupation type are the features the Ridge model relies on most when predicting wages in this dataset, while demographic and regional characteristics contribute less. The model therefore leans mainly on human capital and job role variables, though these scores describe what this model uses on this test split and do not by themselves establish what causes wage differences.
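One way to judge whether the small scores near the bottom of the list differ from zero is to look at the spread across repeats, which `permutation_importance` returns as `importances_std`. The sketch below uses synthetic data with two informative features; the data, the `x0`..`x4` names, and the two-standard-deviation rule of thumb are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 5 features, only 2 informative
X_demo, y_demo = make_regression(
    n_samples=400, n_features=5, n_informative=2, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = Ridge().fit(X_tr, y_tr)
perm_demo = permutation_importance(
    model, X_te, y_te, n_repeats=20, random_state=0
)

summary = pd.DataFrame({
    "feature": [f"x{i}" for i in range(X_demo.shape[1])],
    "mean": perm_demo.importances_mean,
    "std": perm_demo.importances_std,
})
# Rule of thumb (assumption): flag features whose mean is > 2 std above zero
summary["clearly_nonzero"] = summary["mean"] - 2 * summary["std"] > 0
print(summary.sort_values("mean", ascending=False))
```

Features that fail this check are hard to distinguish from noise, which is how the very low or negative scores above are best read.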