### Comparing Aggregate Models for Regression

This try-it focuses on using ensemble models in a regression setting. Just as you combined individual classification estimators into an ensemble, here your goal is to explore ensembles of regression models. As in your earlier assignment, you will use scikit-learn to build the ensemble, this time with the `VotingRegressor`.


#### Dataset and Task

Below, a dataset containing census information on individuals and their hourly wage is loaded using the `fetch_openml` function.  [OpenML](https://www.openml.org/) is another repository for datasets.  Your task is to use ensemble methods to explore predicting the `WAGE` column of the data.  Your ensemble should at the very least consider the following models:

- `LinearRegression` -- perhaps you even want the `TransformedTargetRegressor` here.
- `KNeighborsRegressor`
- `DecisionTreeRegressor`
- `Ridge`
- `SVR`

Tune the `VotingRegressor` to try to optimize the prediction performance and determine if the wisdom of the crowd performed better in this setting than any of the individual models themselves.  Report back on your findings and discuss the interpretability of your findings.  Is there a way to determine what features mattered in predicting wages?
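
As a rough sketch of the two ideas mentioned above (a log-transformed target for the linear model, and a weighted `VotingRegressor`), the block below uses synthetic data rather than the wage dataset. The names `X_demo`, `y_demo`, `log_linreg`, and `demo_voter` are made up for illustration, and the weights are arbitrary placeholders, not tuned values.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the wage data (assumption: a strictly positive, right-skewed target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.exp(0.5 * X_demo[:, 0] + 0.1 * rng.normal(size=200))

# Fit the linear model on log(y) and map its predictions back with exp
log_linreg = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)

# The weights below are illustrative only; they control each model's share of the average
demo_voter = VotingRegressor(
    estimators=[
        ("log_linreg", log_linreg),
        ("ridge", Ridge()),
        ("knn", KNeighborsRegressor()),
    ],
    weights=[2, 1, 1],
)
demo_voter.fit(X_demo, y_demo)
demo_pred = demo_voter.predict(X_demo)
print(demo_pred[:5])
```

The `weights` argument is one of the few knobs on the ensemble itself, so tuning a `VotingRegressor` usually comes down to choosing these weights and the hyperparameters of the member models.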

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error

In [2]:
survey = fetch_openml(data_id=534, as_frame=True).frame

In [3]:
survey.head()

Unnamed: 0,EDUCATION,SOUTH,SEX,EXPERIENCE,UNION,WAGE,AGE,RACE,OCCUPATION,SECTOR,MARR
0,8,no,female,21,not_member,5.1,35,Hispanic,Other,Manufacturing,Married
1,9,no,female,42,not_member,4.95,57,White,Other,Manufacturing,Married
2,12,no,male,1,not_member,6.67,19,White,Other,Manufacturing,Unmarried
3,12,no,male,4,not_member,4.0,22,White,Other,Other,Unmarried
4,12,no,male,17,not_member,7.5,35,White,Other,Other,Married


#### Data Prep

Make any necessary transformations before running models

In [4]:
survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   EDUCATION   534 non-null    int64   
 1   SOUTH       534 non-null    category
 2   SEX         534 non-null    category
 3   EXPERIENCE  534 non-null    int64   
 4   UNION       534 non-null    category
 5   WAGE        534 non-null    float64 
 6   AGE         534 non-null    int64   
 7   RACE        534 non-null    category
 8   OCCUPATION  534 non-null    category
 9   SECTOR      534 non-null    category
 10  MARR        534 non-null    category
dtypes: category(7), float64(1), int64(3)
memory usage: 21.4 KB


In [5]:
survey.select_dtypes(include = 'category').nunique()

SOUTH         2
SEX           2
UNION         2
RACE          3
OCCUPATION    6
SECTOR        3
MARR          2
dtype: int64

One-hot encoding (OHE) works for all of these. With one level dropped, OCCUPATION yields 5 columns, which is not too many.
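
To see where the 5 comes from, here is a small sketch that uses `pd.get_dummies` as a stand-in for `OneHotEncoder` (a swapped-in tool, chosen only because it is short). The `occupation` series is a hand-written copy of the six OCCUPATION levels, not the real column.

```python
import pandas as pd

# Six occupation levels, one row each (toy data, not the survey itself)
occupation = pd.Series(
    ["Other", "Management", "Sales", "Clerical", "Service", "Professional"],
    dtype="category",
    name="OCCUPATION",
)

# drop_first removes the first category in sorted order ("Clerical"),
# leaving one indicator column for each of the remaining five levels
dummies = pd.get_dummies(occupation, prefix="OCCUPATION", drop_first=True)
print(dummies.columns.tolist())
```

Dropping one level per categorical column avoids perfectly collinear indicator columns, which matters most for the unregularized linear model.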

In [6]:
category_column_names = survey.select_dtypes(include='category').columns
category_column_names

Index(['SOUTH', 'SEX', 'UNION', 'RACE', 'OCCUPATION', 'SECTOR', 'MARR'], dtype='object')

In [7]:
for temp_column in category_column_names:
    print(f"{temp_column} unique values: {survey[temp_column].unique()}")

SOUTH unique values: ['no', 'yes']
Categories (2, object): ['no', 'yes']
SEX unique values: ['female', 'male']
Categories (2, object): ['female', 'male']
UNION unique values: ['not_member', 'member']
Categories (2, object): ['member', 'not_member']
RACE unique values: ['Hispanic', 'White', 'Other']
Categories (3, object): ['Hispanic', 'Other', 'White']
OCCUPATION unique values: ['Other', 'Management', 'Sales', 'Clerical', 'Service', 'Professional']
Categories (6, object): ['Clerical', 'Management', 'Other', 'Professional', 'Sales', 'Service']
SECTOR unique values: ['Manufacturing', 'Other', 'Construction']
Categories (3, object): ['Construction', 'Manufacturing', 'Other']
MARR unique values: ['Married', 'Unmarried']
Categories (2, object): ['Married', 'Unmarried']


In [8]:
obj_cols = survey[category_column_names]

#ohe_transformer = OneHotEncoder(sparse_output = False, drop='first')
# Drop one explicit reference level per column; the order matches category_column_names
ohe_transformer = OneHotEncoder(sparse_output = False, drop=['no','female','not_member','White','Clerical','Other','Unmarried'])
encoded_categories = ohe_transformer.fit_transform(obj_cols)
encoded_category_columns = ohe_transformer.get_feature_names_out(category_column_names)

encoded_df = pd.DataFrame(encoded_categories, columns=encoded_category_columns)

X = pd.concat([survey.drop(category_column_names, axis=1), encoded_df], axis=1).drop('WAGE', axis = 1)
y = survey['WAGE']
pd.concat([X, y], axis = 1)

Unnamed: 0,EDUCATION,EXPERIENCE,AGE,SOUTH_yes,SEX_male,UNION_member,RACE_Hispanic,RACE_Other,OCCUPATION_Management,OCCUPATION_Other,OCCUPATION_Professional,OCCUPATION_Sales,OCCUPATION_Service,SECTOR_Construction,SECTOR_Manufacturing,MARR_Married,WAGE
0,8,21,35,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,5.10
1,9,42,57,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,4.95
2,12,1,19,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,6.67
3,12,4,22,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,4.00
4,12,17,35,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,7.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
529,18,5,29,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.36
530,12,33,51,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,6.10
531,17,25,48,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,23.25
532,12,13,31,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,19.88


#### Train Models

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

scaler = StandardScaler()

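X_train = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns = X.columns)
# Note: this refits the scaler on the test split, so the test set is scaled with its own
# mean and std; scaler.transform(X_test) would reuse the training statistics instead.
X_test = scaler.fit_transform(X_test)
X_test = pd.DataFrame(X_test, columns = X.columns)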
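
The cell above calls `fit_transform` on both splits. A common alternative is to fit the scaler on the training rows only and then apply `transform` to the test rows, so that no test-set statistics enter preprocessing. The sketch below shows that pattern on random toy data; `X_toy`, `X_tr`, `X_te`, and `toy_scaler` are hypothetical names, and the results reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (assumption: any numeric data would do here)
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(100, 2))

X_tr, X_te = train_test_split(X_toy, random_state=42)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learn mean and std from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0))  # close to 0 by construction
print(X_te_scaled.mean(axis=0))  # generally near 0, but not exactly 0
```

Wrapping the scaler and the model in a `Pipeline` (already imported in this notebook) gives the same behavior without managing the two calls by hand.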

In [10]:
X_test.head()

Unnamed: 0,EDUCATION,EXPERIENCE,AGE,SOUTH_yes,SEX_male,UNION_member,RACE_Hispanic,RACE_Other,OCCUPATION_Management,OCCUPATION_Other,OCCUPATION_Professional,OCCUPATION_Sales,OCCUPATION_Service,SECTOR_Construction,SECTOR_Manufacturing,MARR_Married
0,-0.365779,2.304224,2.379151,-0.699206,-1.077632,-0.443203,-0.216506,-0.355036,-0.268328,-0.699206,-0.502331,3.188521,-0.467099,-0.196875,-0.513956,0.722999
1,-0.365779,-1.118361,-1.276285,-0.699206,0.927961,-0.443203,-0.216506,-0.355036,-0.268328,1.430194,-0.502331,-0.313625,-0.467099,-0.196875,1.945691,-1.383128
2,0.023224,-0.7924,-0.841114,1.430194,0.927961,-0.443203,-0.216506,-0.355036,-0.268328,1.430194,-0.502331,-0.313625,-0.467099,-0.196875,1.945691,0.722999
3,-0.365779,0.266971,0.203296,-0.699206,0.927961,-0.443203,-0.216506,-0.355036,-0.268328,-0.699206,-0.502331,3.188521,-0.467099,-0.196875,-0.513956,0.722999
4,0.023224,-0.71091,-0.75408,-0.699206,0.927961,2.256304,-0.216506,2.816617,-0.268328,1.430194,-0.502331,-0.313625,-0.467099,-0.196875,1.945691,-1.383128


In [11]:
models = {
    'linreg': LinearRegression(),
    'ridge': Ridge(random_state = 42),
    'lasso': Lasso(random_state = 42),
    'knn': KNeighborsRegressor(),
    'dt': DecisionTreeRegressor(max_depth = 10, random_state = 42),
    'svr': SVR()
}

accuracy_dict = {}  # test-set RMSE per model (lower is better)

for model_name, model in models.items():
    model.fit(X_train, y_train)
    accuracy_dict[model_name] = root_mean_squared_error(y_test, model.predict(X_test))

accuracy_dict

{'linreg': np.float64(4.201096615067697),
 'ridge': np.float64(4.206633452836996),
 'lasso': np.float64(4.920733361979767),
 'knn': np.float64(4.5990852562254),
 'dt': np.float64(5.341195329519985),
 'svr': np.float64(4.361364839745163)}
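
For reference, the metric stored in `accuracy_dict` is the root mean squared error, in the same units as `WAGE`. The sketch below computes it by hand on three made-up values; `y_true_toy` and `y_pred_toy` are illustrative only.

```python
import numpy as np

# Made-up targets and predictions, purely for illustration
y_true_toy = np.array([3.0, 5.0, 7.0])
y_pred_toy = np.array([2.0, 5.0, 9.0])

# Square the residuals, average them, then take the square root
rmse_toy = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))
print(rmse_toy)  # sqrt(5/3), roughly 1.291
```

Because the residuals are squared, the order of the two arguments does not change the result.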

#### Ensemble

Create the voting regressor

In [12]:
voter = VotingRegressor(
    [('linreg', LinearRegression()), ('ridge', Ridge()), ('lasso', Lasso()),
     ('knn', KNeighborsRegressor()),
     ('dt', DecisionTreeRegressor(max_depth = 10)),
     ('svr', SVR())], # leave everything on default at first, except for DT
    n_jobs = -1 # run on all processors
)

voter.fit(X_train, y_train)

In [13]:
voter.score(X_test, y_test)

0.3356612105476823
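
Two things are worth spelling out about the number above. `score` on a regressor returns R², not RMSE, and an unweighted `VotingRegressor` predicts the plain average of its fitted members' predictions. The sketch below checks both on synthetic data; `X_syn`, `y_syn`, and `syn_voter` are hypothetical names, not objects from this notebook.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumption: linear signal plus noise)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(150, 4))
y_syn = X_syn @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=150)

syn_voter = VotingRegressor(
    [("linreg", LinearRegression()),
     ("dt", DecisionTreeRegressor(max_depth=3, random_state=0))]
)
syn_voter.fit(X_syn, y_syn)

# Average the fitted members' predictions by hand
manual_avg = np.mean([est.predict(X_syn) for est in syn_voter.estimators_], axis=0)
same_prediction = np.allclose(manual_avg, syn_voter.predict(X_syn))

# score() should match r2_score on the ensemble's own predictions
same_score = np.isclose(
    syn_voter.score(X_syn, y_syn), r2_score(y_syn, syn_voter.predict(X_syn))
)
print(same_prediction, same_score)
```

Since the ensemble is an average, a weak member pulls the combined prediction toward its own errors unless it is down-weighted or removed.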

In [14]:
accuracy_dict['ensemble'] = root_mean_squared_error(voter.predict(X_test), y_test)
accuracy_dict

{'linreg': np.float64(4.201096615067697),
 'ridge': np.float64(4.206633452836996),
 'lasso': np.float64(4.920733361979767),
 'knn': np.float64(4.5990852562254),
 'dt': np.float64(5.341195329519985),
 'svr': np.float64(4.361364839745163),
 'ensemble': np.float64(4.2916636173840725)}

In [15]:
accuracy_frame = pd.DataFrame({
    'Predictor': ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'KNR', 'Decision Tree Regressor', 'SVR', 'Ensemble'],
    'RMSE': accuracy_dict.values()
})

accuracy_frame

Unnamed: 0,Predictor,RMSE
0,Linear Regression,4.201097
1,Ridge Regression,4.206633
2,Lasso Regression,4.920733
3,KNR,4.599085
4,Decision Tree Regressor,5.341195
5,SVR,4.361365
6,Ensemble,4.291664


#### Get Feature Importance

In [16]:
# Dictionary to store each model's mean permutation importances
feature_importances = {}

# Compute permutation importance on the training data for every fitted model
for name, model in models.items():
    result = permutation_importance(model, X_train, y_train, n_repeats=50, random_state=42, n_jobs=-1)
    feature_importances[name] = result.importances_mean
    print(f"{name} permutation completed.")

# Convert the dictionary to a DataFrame for easier comparison
importance_df = pd.DataFrame(feature_importances, index=X_train.columns)

# Print the DataFrame
importance_df

linreg permutation completed.
ridge permutation completed.
lasso permutation completed.
knn permutation completed.
dt permutation completed.
svr permutation completed.


Unnamed: 0,linreg,ridge,lasso,knn,dt,svr
EDUCATION,0.383768,0.148127,0.100899,0.066767,1.084264,0.095688
EXPERIENCE,1.616035,0.026898,0.0,0.016775,0.425906,0.019495
AGE,0.833826,0.019827,0.010161,0.032415,1.232024,0.031386
SOUTH_yes,0.004027,0.004113,0.0,0.00621,0.030925,0.019921
SEX_male,0.071044,0.070396,0.001253,0.059088,0.429197,0.055764
UNION_member,0.055084,0.054871,0.000944,0.062424,0.307035,0.060176
RACE_Hispanic,0.002726,0.00276,0.0,0.039865,0.003472,0.019205
RACE_Other,0.005681,0.005657,0.0,0.018956,0.042648,0.020663
OCCUPATION_Management,0.074128,0.074087,0.0,0.141389,0.335072,0.05103
OCCUPATION_Other,-1.4e-05,-1.4e-05,0.0,0.021652,0.071947,0.021002
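
By default, `permutation_importance` measures how much the estimator's own `score` (R² for these regressors) drops when one column is shuffled, which is why values above 1 are possible. The sketch below illustrates this on synthetic data where only the first feature carries signal; `X_pi`, `y_pi`, and `pi_model` are made-up names.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends on column 0 only (assumption for illustration)
rng = np.random.default_rng(7)
X_pi = rng.normal(size=(300, 2))
y_pi = 3.0 * X_pi[:, 0] + rng.normal(scale=0.1, size=300)

pi_model = LinearRegression().fit(X_pi, y_pi)

# Shuffle each column 10 times and record the mean drop in R^2
pi_result = permutation_importance(pi_model, X_pi, y_pi, n_repeats=10, random_state=0)
print(pi_result.importances_mean)
```

The informative column should show a large drop and the noise column a drop near zero, which is the pattern to look for in the table above.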


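Min-max normalize the importance scores within each model, then weight each model's scores by 1 minus its min-max-normalized RMSE, so the best-scoring model gets weight 1 and the worst gets weight 0.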
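
The normalize-then-weight idea can be sketched on toy numbers before applying it to the real tables. Everything below (`toy_importance`, `toy_rmse`, the models `a`, `b`, `c`, and the features `f1` to `f3`) is invented for illustration.

```python
import pandas as pd

# Toy importances: rows are features, columns are models
toy_importance = pd.DataFrame(
    {"a": [0.1, 0.3, 0.5], "b": [0.0, 0.2, 0.1], "c": [0.4, 0.0, 0.2]},
    index=["f1", "f2", "f3"],
)
# Toy RMSE per model (lower is better)
toy_rmse = pd.Series({"a": 4.0, "b": 4.5, "c": 5.0})

# Min-max normalize within each model (column-wise)
toy_norm = (toy_importance - toy_importance.min()) / (
    toy_importance.max() - toy_importance.min()
)
# Best model gets weight 1, worst gets weight 0
toy_weights = 1 - (toy_rmse - toy_rmse.min()) / (toy_rmse.max() - toy_rmse.min())

# The Series index aligns with the DataFrame columns, scaling each model's column
toy_weighted = toy_norm * toy_weights
toy_mean = toy_weighted.mean(axis=1)
print(toy_mean)
```

One consequence of this scheme is that the worst model contributes nothing at all, however small its margin over the next-worst model.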

In [17]:
norm_importance_df = (importance_df - importance_df.min()) / (importance_df.max() - importance_df.min())
norm_importance_df

Unnamed: 0,linreg,ridge,lasso,knn,dt,svr
EDUCATION,0.237482,1.0,1.0,0.447975,0.879728,1.0
EXPERIENCE,1.0,0.181667,0.0,0.078158,0.343846,0.073934
AGE,0.515974,0.133935,0.100707,0.193852,1.0,0.218465
SOUTH_yes,0.0025,0.027855,0.0,0.0,0.022345,0.07911
SEX_male,0.04397,0.475293,0.012423,0.391172,0.346525,0.514749
UNION_member,0.034095,0.370488,0.009357,0.415848,0.247089,0.568383
RACE_Hispanic,0.001696,0.018723,0.0,0.248963,0.0,0.07041
RACE_Other,0.003524,0.038278,0.0,0.09429,0.031887,0.088136
OCCUPATION_Management,0.045879,0.500205,0.0,1.0,0.269911,0.457215
OCCUPATION_Other,0.0,0.0,0.0,0.114231,0.055736,0.09226


In [18]:
accuracy_df = pd.DataFrame(accuracy_dict, index = ['RMSE']).T.drop('ensemble')
accuracy_df = 1 - (accuracy_df - accuracy_df.min()) / (accuracy_df.max() - accuracy_df.min())
accuracy_df

Unnamed: 0,RMSE
linreg,1.0
ridge,0.995144
lasso,0.368794
knn,0.650917
dt,0.0
svr,0.859426


In [19]:
final_importance_df = norm_importance_df.copy()

# Scale each model's normalized importances by that model's weight
for model_name in norm_importance_df.columns:
    final_importance_df[model_name] = final_importance_df[model_name] * accuracy_df.loc[model_name, 'RMSE']

final_importance_df

Unnamed: 0,linreg,ridge,lasso,knn,dt,svr
EDUCATION,0.237482,0.995144,0.368794,0.291595,0.0,0.859426
EXPERIENCE,1.0,0.180785,0.0,0.050875,0.0,0.06354
AGE,0.515974,0.133284,0.03714,0.126182,0.0,0.187754
SOUTH_yes,0.0025,0.02772,0.0,0.0,0.0,0.067989
SEX_male,0.04397,0.472985,0.004582,0.25462,0.0,0.442389
UNION_member,0.034095,0.368689,0.003451,0.270683,0.0,0.488484
RACE_Hispanic,0.001696,0.018632,0.0,0.162054,0.0,0.060512
RACE_Other,0.003524,0.038092,0.0,0.061375,0.0,0.075746
OCCUPATION_Management,0.045879,0.497776,0.0,0.650917,0.0,0.392942
OCCUPATION_Other,0.0,0.0,0.0,0.074355,0.0,0.079291


In [20]:
# Average the weighted importances across models, one value per feature
importance_mean_df = (
    final_importance_df.mean(axis=1)
    .to_frame('Weighted Mean Permutation Importance')
    .sort_values('Weighted Mean Permutation Importance', ascending=False)
)
linreg_coefs = pd.DataFrame(models['linreg'].coef_, columns = ['LinReg Coefficient'], index = models['linreg'].feature_names_in_)
importance_df = pd.concat([importance_mean_df, linreg_coefs], axis = 1)

importance_df

Unnamed: 0,Weighted Mean Permutation Importance,LinReg Coefficient
EDUCATION,0.45874,2.210446
OCCUPATION_Management,0.264586,0.991424
EXPERIENCE,0.215867,4.563895
SEX_male,0.203091,0.936873
UNION_member,0.194233,0.850898
AGE,0.166723,-3.299029
OCCUPATION_Professional,0.123741,0.757706
SECTOR_Manufacturing,0.080219,0.459977
OCCUPATION_Service,0.079328,-0.299166
RACE_Hispanic,0.040483,-0.170437
