### Try-it 9.2: Predicting Wages

This activity summarizes your work with regularized regression models.  You will combine your earlier work on data preparation and pipelines with what you have learned about grid searches to determine an optimal model.  This example is also a good opportunity to use the `TransformedTargetRegressor` estimator in scikit-learn.

### The Data

This dataset is loaded from the OpenML repository.  Originally drawn from census data, it contains wage and demographic information on 534 individuals. From the dataset documentation [here](https://www.openml.org/d/534):

```
The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. 
```

In [73]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.datasets import fetch_openml

In [75]:
wages = fetch_openml(data_id=534, as_frame=True)

In [77]:
wages = wages.frame
wages.head()

| | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | no | female | 21 | not_member | 5.1 | 35 | Hispanic | Other | Manufacturing | Married |
| 1 | 9 | no | female | 42 | not_member | 4.95 | 57 | White | Other | Manufacturing | Married |
| 2 | 12 | no | male | 1 | not_member | 6.67 | 19 | White | Other | Manufacturing | Unmarried |
| 3 | 12 | no | male | 4 | not_member | 4.0 | 22 | White | Other | Other | Unmarried |
| 4 | 12 | no | male | 17 | not_member | 7.5 | 35 | White | Other | Other | Married |


#### Task

Build regression models to predict `WAGE`.  Incorporate the categorical features and transform the target using a logarithm.  Build `Ridge` models and consider some different amounts of regularization.  

After fitting your model, interpret it and try to understand which features are associated with higher wages.  Consider using `permutation_importance`, which you encountered in module 8.  Discuss your findings in the class forum.

For an in-depth discussion of the perils of interpreting coefficients, see the scikit-learn example [here](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html).
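
The task combines three pieces: a log-transformed target, a `Ridge` regressor, and a search over regularization strengths. The sketch below shows one way these could fit together using `TransformedTargetRegressor` inside a `Pipeline` and `GridSearchCV`. It runs on a small synthetic table, not the wage data; the column values, the 9-point `alpha` grid, and the names `demo`, `X_demo`, and `y_demo` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the wage data (hypothetical columns and values)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "EXPERIENCE": rng.integers(0, 40, size=n),
    "UNION": rng.choice(["member", "not_member"], size=n),
})
log_wage = (1.5 + 0.02 * demo["EXPERIENCE"]
            + 0.2 * (demo["UNION"] == "member")
            + rng.normal(0, 0.3, size=n))
demo["WAGE"] = np.exp(log_wage)

X_demo = demo.drop(columns="WAGE")
y_demo = demo["WAGE"]  # raw wages; the log is applied inside the estimator

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["EXPERIENCE"]),
    ("cat", OneHotEncoder(drop="first"), ["UNION"]),
])

# The regressor is fit on log(y); predictions are mapped back with exp
model = Pipeline([
    ("preprocessor", preprocess),
    ("regressor", TransformedTargetRegressor(
        regressor=Ridge(), func=np.log, inverse_func=np.exp)),
])

# Pipeline step "regressor" -> TransformedTargetRegressor.regressor -> Ridge.alpha
param_grid = {"regressor__regressor__alpha": np.logspace(-4, 4, 9)}
search = GridSearchCV(model, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["regressor__regressor__alpha"]
demo_preds = search.predict(X_demo)  # predictions are in wage units, not log units
print(best_alpha, demo_preds[:3])
```

Because the search uses cross-validation on the training data, the test set is not involved in choosing `alpha` and can be kept for a final check.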

In [80]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, TransformedTargetRegressor, ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.express as px
import pandas as pd

In [82]:
# Transforming WAGE using logarithm
wages['LOG_WAGE'] = np.log(wages['WAGE'])

# Splitting the data into features and target variable
X = wages.drop(columns=['WAGE', 'LOG_WAGE'])
y = wages['LOG_WAGE']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [84]:
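# Column groups for preprocessing
# Note: EDUCATION is a number of years, but it is one-hot encoded here,
# so each distinct value gets its own column
categorical_features = ['EDUCATION', 'SOUTH', 'SEX', 'UNION', 'RACE', 'OCCUPATION', 'SECTOR', 'MARR']
numeric_features = ['EXPERIENCE', 'AGE']

# Scale the numeric features and one-hot encode the categorical features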
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features)])

In [86]:
# Ridge regression model
ridge_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Ridge())])

# Fit the model with different regularization strengths
alphas = np.logspace(-4, 4, 50)

mses = []
min_mse = {'Test MSE': np.inf}

for alpha in alphas:
    ridge_model.set_params(regressor__alpha=alpha)
    ridge_model.fit(X_train, y_train)
    
    # Evaluate the model
    train_pred = ridge_model.predict(X_train)
    test_pred = ridge_model.predict(X_test)

    train_mse = mean_squared_error(y_train, train_pred)
    test_mse = mean_squared_error(y_test, test_pred)

    mse = {
        'Alpha': alpha,
        'Train MSE': train_mse,
        'Test MSE': test_mse,
        'Model': ridge_model,  # reference to the one shared pipeline, which is refit on every iteration
        'Train Preds': train_pred,
        'Test Preds': test_pred
    }

    mses.append(mse)

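    # Track the alpha with the lowest test MSE
    # (selecting on the test set makes this score optimistic)
    if test_mse < min_mse['Test MSE']:
        min_mse = mse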

print(min_mse['Alpha'], min_mse['Train MSE'], min_mse['Test MSE'])  

0.009102981779915217 0.16766016580718118 0.1903104867484097
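
One thing worth checking in the loop above: `set_params` and `fit` modify the pipeline in place, and each dictionary stores a reference to that same pipeline object. After the loop, every `'Model'` entry therefore points at the pipeline as last fitted, which may be why the coefficients reported later are so close to zero. The sketch below illustrates the effect on synthetic data and shows one possible remedy, `sklearn.base.clone` followed by a refit; the names `template`, `shared`, and `cloned` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

template = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

shared, cloned = {}, {}
for alpha in [0.01, 1e4]:
    template.set_params(regressor__alpha=alpha)
    template.fit(X_demo, y_demo)
    shared[alpha] = template  # the same object is stored on every iteration

    # clone() makes an unfitted copy with the current parameters; refit it
    cloned[alpha] = clone(template).fit(X_demo, y_demo)

# The "shared" entry for alpha=0.01 now reports the last alpha from the loop
print(shared[0.01].named_steps["regressor"].alpha)
# The cloned entry keeps the alpha it was fitted with
print(cloned[0.01].named_steps["regressor"].alpha)
```

Storing only the best `alpha` and refitting once after the loop would be another option.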


In [96]:
# Retrieve the coefficients from the linear model within the pipeline
coefficients = min_mse['Model'].named_steps['regressor'].coef_

# Get the encoded feature names from the preprocessor (the ColumnTransformer step)
feature_names = min_mse['Model'].named_steps['preprocessor'].get_feature_names_out()

# Create a DataFrame with feature names and their corresponding coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})

# Display the DataFrame
coef_df

| | Feature | Coefficient |
|---|---|---|
| 0 | num__EXPERIENCE | 0.003119 |
| 1 | num__AGE | 0.004703 |
| 2 | cat__EDUCATION_3 | -1.5e-05 |
| 3 | cat__EDUCATION_5 | 5.5e-05 |
| 4 | cat__EDUCATION_6 | -5.5e-05 |
| 5 | cat__EDUCATION_7 | -0.000101 |
| 6 | cat__EDUCATION_8 | -0.000196 |
| 7 | cat__EDUCATION_9 | -0.000164 |
| 8 | cat__EDUCATION_10 | -9.1e-05 |
| 9 | cat__EDUCATION_11 | -0.000266 |
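
Because the target is `log(WAGE)`, a coefficient β corresponds to multiplying the predicted wage by exp(β) for a one-unit change in that feature, holding the others fixed. For a one-hot column, one unit means switching from the reference category; for a feature passed through `StandardScaler`, one unit means one standard deviation. The helper below is a hypothetical convenience function, not part of scikit-learn, that converts a log-scale coefficient into an approximate percent change.

```python
import numpy as np

def pct_change_from_log_coef(beta):
    """Percent change in the original target implied by a coefficient
    from a model fit on log(target), per one-unit change in the feature."""
    return (np.exp(beta) - 1.0) * 100.0

# The num__AGE coefficient from the table above: under half a percent
# per standard deviation of AGE
print(pct_change_from_log_coef(0.004703))

# An illustrative larger coefficient: roughly a 10.5 percent change
print(pct_change_from_log_coef(0.10))
```

This reading only describes the fitted model; as the linked scikit-learn example discusses, it should not be taken as a causal effect.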


In [102]:
# Permutation importance
r = permutation_importance(min_mse['Model'], X_test, y_test, n_repeats=30, random_state=42)

# permutation_importance shuffles the raw input columns of the pipeline,
# so it returns one importance per column of X_test (not per encoded feature)
perm_importance = pd.DataFrame(r.importances_mean,
                               index=X_test.columns,
                               columns=['Importance'])

# Sort and display the importance
perm_importance.sort_values(by='Importance', ascending=False, inplace=True)

print(perm_importance)

            Importance
OCCUPATION    0.001930
AGE           0.001235
EDUCATION     0.001056
SEX           0.000899
SOUTH         0.000498
MARR          0.000223
RACE          0.000223
UNION         0.000223
SECTOR        0.000118
EXPERIENCE   -0.000459
