### Try-it 9.2: Predicting Wages

This activity summarizes your work with regularized regression models. You will combine your earlier work on data preparation and pipelines with what you have learned about grid searches to determine an optimal model. This example is also a good opportunity to use the `TransformedTargetRegressor` estimator in scikit-learn.

### The Data

This dataset is loaded from the OpenML repository. Originally drawn from census data, it contains wage and demographic information on 534 individuals. From the dataset documentation [here](https://www.openml.org/d/534):

```
The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. 
```

In [1]:
from sklearn.datasets import fetch_openml

In [2]:
wages = fetch_openml(data_id=534, as_frame=True)


In [3]:
wages.frame.head()

| | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | no | female | 21 | not_member | 5.1 | 35 | Hispanic | Other | Manufacturing | Married |
| 1 | 9 | no | female | 42 | not_member | 4.95 | 57 | White | Other | Manufacturing | Married |
| 2 | 12 | no | male | 1 | not_member | 6.67 | 19 | White | Other | Manufacturing | Unmarried |
| 3 | 12 | no | male | 4 | not_member | 4.0 | 22 | White | Other | Other | Unmarried |
| 4 | 12 | no | male | 17 | not_member | 7.5 | 35 | White | Other | Other | Married |


#### Task

Build regression models to predict `WAGE`.  Incorporate the categorical features and transform the target using a logarithm.  Build `Ridge` models and consider some different amounts of regularization.  
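One way the pieces of this task can fit together is sketched below. This is a minimal sketch on a synthetic stand-in for the wages data, not the activity's solution: the toy frame, its values, and the names `toy`, `preprocess`, and `model` are assumptions made for illustration. It one-hot encodes the categorical columns inside a column transformer and wraps `Ridge` in a `TransformedTargetRegressor` so that the model is fit on `log(WAGE)` while predictions come back in the original units.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor, make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the wages frame (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "EDUCATION": rng.integers(8, 18, size=n),
    "EXPERIENCE": rng.integers(0, 40, size=n),
    "SEX": rng.choice(["female", "male"], size=n),
    "OCCUPATION": rng.choice(["Clerical", "Management", "Other"], size=n),
})
# Strictly positive target, so the logarithm is well defined.
toy["WAGE"] = np.exp(
    0.1 * toy["EDUCATION"] + 0.01 * toy["EXPERIENCE"] + rng.normal(0, 0.3, size=n)
)

X_toy = toy.drop(columns="WAGE")
y_toy = toy["WAGE"]

# Encode categorical columns and scale numeric ones in a single step.
preprocess = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["SEX", "OCCUPATION"]),
    (StandardScaler(), ["EDUCATION", "EXPERIENCE"]),
)

# Ridge is fit on log(y); predict() applies exp() to return to wage units.
model = make_pipeline(
    preprocess,
    TransformedTargetRegressor(
        regressor=Ridge(alpha=1.0), func=np.log, inverse_func=np.exp
    ),
)
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
```

Keeping the encoding inside the pipeline means the same transformation is learned on the training folds only during cross-validation, which is one reason to prefer it over encoding the whole frame up front.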

After fitting your model, interpret it and try to understand which features lead to higher wages.  Consider using `permutation_importance`, which you encountered in Module 8.  Discuss your findings in the class forum.
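As a rough illustration of the permutation-importance idea, the sketch below fits a small pipeline on synthetic data and scores each raw input column on a held-out split. The data and the column `NOISE` are hypothetical and exist only for this example. Because the encoder lives inside the pipeline, each original feature receives a single score rather than one score per dummy column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: EDUCATION and UNION drive the target, NOISE does not.
rng = np.random.default_rng(1)
n = 300
X_toy = pd.DataFrame({
    "EDUCATION": rng.integers(8, 18, size=n).astype(float),
    "NOISE": rng.normal(size=n),
    "UNION": rng.choice(["member", "not_member"], size=n),
})
y_toy = (
    0.5 * X_toy["EDUCATION"]
    + 1.0 * (X_toy["UNION"] == "member")
    + rng.normal(0, 0.5, size=n)
)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(drop="if_binary"), ["UNION"]),
        (StandardScaler(), ["EDUCATION", "NOISE"]),
    ),
    Ridge(alpha=1.0),
)
model.fit(X_tr, y_tr)

# Shuffle one raw column at a time and record the drop in the test R^2.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = pd.Series(result.importances_mean, index=X_toy.columns).sort_values(
    ascending=False
)
print(ranking)
```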

For an in-depth discussion of the perils of interpreting coefficients, see the scikit-learn example [here](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html).

In [32]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SequentialFeatureSelector
import numpy as np
import plotly.express as px
import pandas as pd
import warnings
from sklearn.model_selection import train_test_split

In [23]:
# convert sklearn Bunch to DataFrame
df = wages.frame
# check for nulls
df.isnull().sum()
df.shape

(534, 11)

In [28]:
# convert & incorporate the categorical features
# SOUTH SEX UNION RACE OCCUPATION SECTOR MARR

df = pd.get_dummies(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   EDUCATION                534 non-null    int64  
 1   EXPERIENCE               534 non-null    int64  
 2   WAGE                     534 non-null    float64
 3   AGE                      534 non-null    int64  
 4   SOUTH_no                 534 non-null    uint8  
 5   SOUTH_yes                534 non-null    uint8  
 6   SEX_female               534 non-null    uint8  
 7   SEX_male                 534 non-null    uint8  
 8   UNION_member             534 non-null    uint8  
 9   UNION_not_member         534 non-null    uint8  
 10  RACE_Hispanic            534 non-null    uint8  
 11  RACE_Other               534 non-null    uint8  
 12  RACE_White               534 non-null    uint8  
 13  OCCUPATION_Clerical      534 non-null    uint8  
 14  OCCUPATION_Management    5
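The `info()` output above lists both halves of each binary category (for example `SEX_female` and `SEX_male`), and each pair is perfectly redundant because the two columns always sum to 1. A small sketch with a made-up three-row frame shows how `drop_first=True` in `pd.get_dummies` keeps one column per binary feature; whether to drop a level is a modeling choice, not something the activity prescribes.

```python
import pandas as pd

# Tiny made-up frame with two binary categorical columns.
toy = pd.DataFrame({
    "SEX": ["female", "male", "male"],
    "UNION": ["member", "not_member", "member"],
    "AGE": [35, 57, 19],
})

full = pd.get_dummies(toy)                      # one column per category level
reduced = pd.get_dummies(toy, drop_first=True)  # drops the first level of each

# The two SEX dummies carry the same information: they always sum to 1.
redundant = (full["SEX_female"].astype(int) + full["SEX_male"].astype(int) == 1).all()

print(list(full.columns))
print(list(reduced.columns))
```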

In [33]:
X = df.drop(columns='WAGE')
y = df[['WAGE']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [34]:
standard_scaler = StandardScaler()
linear_regression = LinearRegression()

pipe = Pipeline([
    ('scale', standard_scaler),
    ('lreg', linear_regression)
])
pipe.fit(X_train, y_train)

In [35]:
pipe.score(X_test, y_test)

0.38699631666233625

In [36]:
pipe.get_params()

{'memory': None,
 'steps': [('scale', StandardScaler()), ('lreg', LinearRegression())],
 'verbose': False,
 'scale': StandardScaler(),
 'lreg': LinearRegression(),
 'scale__copy': True,
 'scale__with_mean': True,
 'scale__with_std': True,
 'lreg__copy_X': True,
 'lreg__fit_intercept': True,
 'lreg__n_jobs': None,
 'lreg__positive': False}

In [37]:
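# Ridge regression on standardized features
ridge = Ridge()

pipe_ridge = Pipeline([
    ('scaler', standard_scaler),
    ('ridge', ridge)
])
# 100 candidate alphas, log-spaced from 1e-5 to 1e5
param_alpha_dict = {'ridge__alpha': np.logspace(-5, 5, 100)}
# cross-validated search over alpha (default 5-fold CV, scored with R^2)
model_finder = GridSearchCV(pipe_ridge, param_grid=param_alpha_dict)
model_finder.fit(X_train, y_train)

best_model = model_finder.best_estimator_
best_model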

In [39]:
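# alpha fixed at one of the values from the grid searched above
ridge = Ridge(alpha=73.90722033525775)
# no func/inverse_func given, so the target passes through unchanged (baseline)
ttr = TransformedTargetRegressor(regressor=ridge)
pipe_ttr = Pipeline([
    ('scaler', standard_scaler),
    ('ttr', ttr)
])
# empty param_grid: GridSearchCV only cross-validates this single configuration
param_alpha_dict = {}
model_finder = GridSearchCV(pipe_ttr, param_grid=param_alpha_dict)
model_finder.fit(X_train, y_train)

best_model = model_finder.best_estimator_
best_model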

In [41]:
model_finder.best_score_

0.2513691428009535

In [43]:
# same fixed alpha as before
ridge = Ridge(alpha=73.90722033525775)
# fit on log(WAGE); predictions are mapped back to wage units with exp
ttr = TransformedTargetRegressor(regressor=ridge, func=np.log, inverse_func=np.exp)
pipe_ttr = Pipeline([
    ('scaler', standard_scaler),
    ('ttr', ttr)
])
# empty param_grid: GridSearchCV only cross-validates this single configuration
param_alpha_dict = {}
model_finder = GridSearchCV(pipe_ttr, param_grid=param_alpha_dict)
model_finder.fit(X_train, y_train)

best_model = model_finder.best_estimator_
best_model

In [44]:
model_finder.best_score_

0.21597242359920615
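The two runs above hold `alpha` fixed at a value found before the target was transformed. A minimal sketch of tuning `alpha` with the log transform already in place is shown below on synthetic data; the arrays `X_toy` and `y_toy` and the grid bounds are assumptions chosen for illustration. The point is the nested parameter name: the step name `ttr`, then the `regressor` attribute of `TransformedTargetRegressor`, then `alpha`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, strictly positive target so that np.log is well defined.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(150, 5))
y_toy = np.exp(0.5 * X_toy[:, 0] - 0.3 * X_toy[:, 1] + rng.normal(0, 0.2, size=150))

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ttr", TransformedTargetRegressor(
        regressor=Ridge(), func=np.log, inverse_func=np.exp
    )),
])

# Double underscores walk down the nesting: step -> attribute -> parameter.
param_grid = {"ttr__regressor__alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["ttr__regressor__alpha"]
print(best_alpha, search.best_score_)
```

Because `predict` applies the inverse transform, the default score reported here is an R² computed in the original units of the target, which is worth keeping in mind when comparing transformed and untransformed models.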

In [48]:
# permutation importance on the held-out test set
perm_importance = permutation_importance(model_finder, X_test, y_test, random_state=13)
pd.DataFrame({
    'Features': X.columns,
    'Relevance Score': perm_importance.importances_mean
}).sort_values(by='Relevance Score', ascending=False)

| | Features | Relevance Score |
|---|---|---|
| 0 | EDUCATION | 0.19598 |
| 13 | OCCUPATION_Management | 0.067922 |
| 15 | OCCUPATION_Professional | 0.054427 |
| 17 | OCCUPATION_Service | 0.035785 |
| 6 | SEX_male | 0.026466 |
| 5 | SEX_female | 0.026466 |
| 14 | OCCUPATION_Other | 0.017011 |
| 2 | AGE | 0.016864 |
| 16 | OCCUPATION_Sales | 0.012166 |
| 3 | SOUTH_no | 0.005666 |
