### Try-it 9.2: Predicting Wages

This activity is meant to summarize your work with regularized regression models. You will combine your earlier work on data preparation and pipelines with what you've learned about grid searches to determine an optimal model. In addition to the prior strategies, this example is an excellent opportunity to use the `TransformedTargetRegressor` estimator in scikit-learn.

### The Data

This dataset is loaded from the OpenML resource library. Originally from census data, it contains wage and demographic information on 534 individuals. From the dataset documentation [here](https://www.openml.org/d/534):

```
The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. 
```

In [1]:
from sklearn.datasets import fetch_openml

In [2]:
wages = fetch_openml(data_id=534, as_frame=True)

In [3]:
wages.frame.head()

Unnamed: 0,EDUCATION,SOUTH,SEX,EXPERIENCE,UNION,WAGE,AGE,RACE,OCCUPATION,SECTOR,MARR
0,8,no,female,21,not_member,5.1,35,Hispanic,Other,Manufacturing,Married
1,9,no,female,42,not_member,4.95,57,White,Other,Manufacturing,Married
2,12,no,male,1,not_member,6.67,19,White,Other,Manufacturing,Unmarried
3,12,no,male,4,not_member,4.0,22,White,Other,Other,Unmarried
4,12,no,male,17,not_member,7.5,35,White,Other,Other,Married


#### Task

Build regression models to predict `WAGE`.  Incorporate the categorical features and transform the target using a logarithm.  Build `Ridge` models and consider some different amounts of regularization.  
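As a minimal sketch of the log-target idea, the cell below wraps a preprocessing-plus-`Ridge` pipeline in `TransformedTargetRegressor` so that the model is fit on `log(WAGE)` and predictions come back on the original scale. The data here is a small synthetic stand-in (the `toy` frame, its values, and `alpha=1.0` are assumptions for illustration, not the CPS sample or a tuned setting).

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor, make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the wage data (hypothetical values, not the CPS sample).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "EDUCATION": rng.integers(8, 19, size=n),
    "UNION": rng.choice(["member", "not_member"], size=n),
})
# Build a positive, right-skewed target by generating it on the log scale.
toy_wage = np.exp(
    0.1 * toy["EDUCATION"]
    + 0.2 * (toy["UNION"] == "member")
    + rng.normal(0, 0.3, size=n)
)

preprocess = make_column_transformer(
    (StandardScaler(), ["EDUCATION"]),
    (OneHotEncoder(handle_unknown="ignore"), ["UNION"]),
)

# The regressor is trained on np.log(y); predict() applies np.exp to undo it.
log_ridge = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, Ridge(alpha=1.0)),
    func=np.log,
    inverse_func=np.exp,
)
log_ridge.fit(toy, toy_wage)
toy_pred = log_ridge.predict(toy)
```

Because the inverse transform is `np.exp`, every prediction is positive, which is a sensible property for wages. When such a model sits inside a grid search, the Ridge penalty would be addressed through the nested name, which for this particular construction should be `regressor__ridge__alpha`.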

After fitting your model, interpret the model and try to understand what features led to higher wages.  Consider using `permutation_importance` that you encountered in module 8.  Discuss your findings in the class forum.

For an in-depth example discussing the perils of interpreting the coefficients, see the scikit-learn example [here](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html).
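The following is a small, hedged sketch of how `permutation_importance` might be applied to a fitted pipeline. It uses synthetic data in which `EDUCATION` drives the target and `SECTOR` is pure noise (both the data and the `alpha=1.0` setting are assumptions for illustration). Importance is measured on held-out rows and reported per original column, not per one-hot column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: EDUCATION is informative, SECTOR is random noise.
rng = np.random.default_rng(0)
n = 300
toy_X = pd.DataFrame({
    "EDUCATION": rng.integers(8, 19, size=n),
    "SECTOR": rng.choice(["Construction", "Manufacturing", "Other"], size=n),
})
toy_y = 2.0 * toy_X["EDUCATION"] + rng.normal(0, 1.0, size=n)

X_fit, X_val, y_fit, y_val = train_test_split(toy_X, toy_y, random_state=0)

model = make_pipeline(
    make_column_transformer(
        (StandardScaler(), ["EDUCATION"]),
        (OneHotEncoder(handle_unknown="ignore"), ["SECTOR"]),
    ),
    Ridge(alpha=1.0),
)
model.fit(X_fit, y_fit)

# Shuffle each raw column in turn and record the drop in the model's score.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X_val.columns)
importances = importances.sort_values(ascending=False)
```

Since the whole pipeline is passed in, the shuffling happens before encoding, so a categorical feature gets a single importance value even though it expands into several columns internally.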

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.express as px
import pandas as pd
import warnings

In [5]:
df = wages.frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   EDUCATION   534 non-null    int64   
 1   SOUTH       534 non-null    category
 2   SEX         534 non-null    category
 3   EXPERIENCE  534 non-null    int64   
 4   UNION       534 non-null    category
 5   WAGE        534 non-null    float64 
 6   AGE         534 non-null    int64   
 7   RACE        534 non-null    category
 8   OCCUPATION  534 non-null    category
 9   SECTOR      534 non-null    category
 10  MARR        534 non-null    category
dtypes: category(7), float64(1), int64(3)
memory usage: 21.4 KB


### Train_Test_Split

#### Split first, then fit encoders only on the training set.

Why? Data leakage. If you fit an encoder or scaler before the split, information from the test set (category levels, means, and variances) sneaks into training.

The right workflow:

1. Train-test split
2. Fit `OneHotEncoder` on the training data
3. Transform both train and test using that fitted encoder

In [6]:
X = df.drop('WAGE',axis = 1)
y = df['WAGE']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 42, test_size = 0.2)
X_train.head()

Unnamed: 0,EDUCATION,SOUTH,SEX,EXPERIENCE,UNION,AGE,RACE,OCCUPATION,SECTOR,MARR
489,16,no,female,16,not_member,38,White,Professional,Other,Married
299,12,no,male,19,not_member,37,White,Clerical,Other,Married
526,15,no,male,10,not_member,31,White,Professional,Other,Unmarried
513,12,no,male,39,member,57,White,Professional,Other,Married
312,13,yes,female,0,not_member,19,White,Clerical,Other,Unmarried


### Data Preprocessing

In [6]:
df_num = df.select_dtypes(include = 'number')
df_num

Unnamed: 0,EDUCATION,EXPERIENCE,WAGE,AGE
0,8,21,5.10,35
1,9,42,4.95,57
2,12,1,6.67,19
3,12,4,4.00,22
4,12,17,7.50,35
...,...,...,...,...
529,18,5,11.36,29
530,12,33,6.10,51
531,17,25,23.25,48
532,12,13,19.88,31


In [7]:
df_obj = df.select_dtypes(exclude = 'number')
df_obj

Unnamed: 0,SOUTH,SEX,UNION,RACE,OCCUPATION,SECTOR,MARR
0,no,female,not_member,Hispanic,Other,Manufacturing,Married
1,no,female,not_member,White,Other,Manufacturing,Married
2,no,male,not_member,White,Other,Manufacturing,Unmarried
3,no,male,not_member,White,Other,Other,Unmarried
4,no,male,not_member,White,Other,Other,Married
...,...,...,...,...,...,...,...
529,no,male,not_member,White,Professional,Other,Unmarried
530,no,female,not_member,Other,Professional,Other,Married
531,no,female,member,Other,Professional,Other,Married
532,yes,male,member,White,Professional,Other,Married


In [8]:
df_obj['SOUTH'].unique()

['no', 'yes']
Categories (2, object): ['no', 'yes']

In [9]:
for i in df_obj:
    print(i,df_obj[i].nunique())

SOUTH 2
SEX 2
UNION 2
RACE 3
OCCUPATION 6
SECTOR 3
MARR 2


In [10]:
df_binary = df_obj[['SOUTH','SEX','UNION','MARR']]
df_binary.head()

Unnamed: 0,SOUTH,SEX,UNION,MARR
0,no,female,not_member,Married
1,no,female,not_member,Married
2,no,male,not_member,Unmarried
3,no,male,not_member,Unmarried
4,no,male,not_member,Married


In [11]:
df_multi = df_obj[['RACE','OCCUPATION','SECTOR']]
df_multi.head()

Unnamed: 0,RACE,OCCUPATION,SECTOR
0,Hispanic,Other,Manufacturing
1,White,Other,Manufacturing
2,White,Other,Manufacturing
3,White,Other,Other
4,White,Other,Other


### OneHotEncoder()

In [18]:
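# Exploration only: fitting on the full frame to inspect the encoded columns.
# The pipelines further down refit the encoder on X_train.
encoder = OneHotEncoder(sparse_output = False)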
encoded = encoder.fit_transform(df_obj)
encoded

array([[1., 0., 1., ..., 0., 1., 0.],
       [1., 0., 1., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       ...,
       [1., 0., 1., ..., 1., 1., 0.],
       [0., 1., 0., ..., 1., 1., 0.],
       [1., 0., 0., ..., 0., 1., 0.]], shape=(534, 20))

In [19]:
df_encoded = pd.DataFrame(encoded, columns = encoder.get_feature_names_out(['SOUTH','SEX','UNION','RACE','OCCUPATION','SECTOR','MARR']))
df_encoded.head()

Unnamed: 0,SOUTH_no,SOUTH_yes,SEX_female,SEX_male,UNION_member,UNION_not_member,RACE_Hispanic,RACE_Other,RACE_White,OCCUPATION_Clerical,OCCUPATION_Management,OCCUPATION_Other,OCCUPATION_Professional,OCCUPATION_Sales,OCCUPATION_Service,SECTOR_Construction,SECTOR_Manufacturing,SECTOR_Other,MARR_Married,MARR_Unmarried
0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


### make_column_transformer

In [8]:
cat_col = ['EDUCATION','SOUTH','SEX','UNION','RACE','OCCUPATION','SECTOR','MARR']
num_col = ['EXPERIENCE','AGE']

In [12]:
transformers = make_column_transformer((StandardScaler(), num_col),
                                     (OneHotEncoder(handle_unknown = 'ignore', sparse_output = False), cat_col))


### Ridge and Lasso alpha values

In [36]:
# A bare list such as [0.01, 0.1, 1, 10, 100] does not work as a param_grid;
# GridSearchCV expects a dict keyed by '<step name>__<parameter name>'.
ridge_alpha = {'ridge__alpha': 10**np.linspace(-5,5,100)}
lasso_alpha = {'lasso__alpha': 10**np.linspace(-2,2,100)}
sfs_dict = {'selector__n_features_to_select':[2,3,4,5]}
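
The key format used in these dictionaries is `<step name>__<parameter name>`, where the step name is whatever label the step was given in the `Pipeline`. As a minimal sketch on synthetic data (the `make_regression` data, the `scale` step, and the five-value grid are assumptions for illustration), the selected value can be read from `best_params_` instead of being copied by hand from the printed estimator:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the wage data.
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# The step label 'ridge' is what the grid key must start with.
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
grid = {"ridge__alpha": np.logspace(-2, 2, 5)}

search = GridSearchCV(pipe, param_grid=grid, cv=5)
search.fit(X_toy, y_toy)

# The winning alpha, plus the refit pipeline that already uses it.
best_alpha = search.best_params_["ridge__alpha"]
best_pipe = search.best_estimator_
```

With the default `refit=True`, `best_estimator_` has already been retrained on all of the data passed to `fit`, so it can be scored on a held-out test set directly.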

### Pipelines

In [14]:
ridge_pipeline = Pipeline([('transformer',transformers),
                           ('ridge',Ridge())])

In [15]:
lasso_pipeline = Pipeline([('transformer',transformers),
                          ('lasso',Lasso())])

In [32]:
sfs_pipeline = Pipeline([('transformer', transformers),
                         ('selector', SequentialFeatureSelector(LinearRegression())),
                        ('linreg',LinearRegression())])

### GridSearchCV

#### Ridge

In [19]:
ridge_model = GridSearchCV(ridge_pipeline, param_grid = ridge_alpha)
ridge_model.fit(X_train, y_train)
best_ridge = ridge_model.best_estimator_
best_ridge

In [21]:
best_ridge.named_steps['ridge'].coef_

array([-1.08056708,  1.910245  , -0.10682812, -0.25068104,  0.4545629 ,
       -0.62670435, -0.63027507, -0.89453673, -1.41280314, -0.63805366,
       -0.31638068, -0.69016487, -0.06807872,  1.81624837,  0.51499042,
       -0.00837898,  1.80195977,  1.05512389,  0.28486165, -0.28486165,
       -0.79773634,  0.79773634,  1.01834252, -1.01834252, -0.53939057,
       -0.0971508 ,  0.63654138, -0.50773714,  2.22055482, -0.67502303,
        1.37258578, -0.93096863, -1.4794118 , -0.18781519,  0.61926254,
       -0.43144735,  0.16674292, -0.16674292])

In [22]:
ridge_pipeline = Pipeline([('transformer',transformers),
                           ('ridge',Ridge(alpha = 9.11))])
ridge_pipeline.fit(X_train,y_train)
ridge_pipeline.score(X_test,y_test)

0.3403930275898016

In [25]:
ridge_pred = ridge_pipeline.predict(X_test)
ridge_mse = mean_squared_error(y_test,ridge_pred)
ridge_mse

21.027288895417527

#### Lasso

In [26]:
lasso_model = GridSearchCV(lasso_pipeline, param_grid = lasso_alpha)
lasso_model.fit(X_train, y_train)
best_lasso = lasso_model.best_estimator_
best_lasso

In [27]:
best_lasso.named_steps['lasso'].coef_

array([-2.44748223e+00,  3.25204747e+00,  0.00000000e+00, -0.00000000e+00,
        0.00000000e+00, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
       -8.80239113e-01, -0.00000000e+00,  0.00000000e+00, -2.67591756e-01,
        0.00000000e+00,  2.07615596e+00,  0.00000000e+00, -0.00000000e+00,
        1.95807622e+00,  7.02817297e-01,  4.43375675e-01, -0.00000000e+00,
       -1.55733318e+00,  6.53318143e-14,  2.00789224e+00, -5.23494668e-15,
       -4.51878773e-02, -0.00000000e+00,  7.58323312e-01,  0.00000000e+00,
        2.99663440e+00, -0.00000000e+00,  2.02014921e+00, -2.21763059e-01,
       -8.96697031e-01,  0.00000000e+00,  7.39587890e-01, -2.48240630e-01,
        1.94920250e-01, -7.32465231e-17])

In [28]:
lasso_pipeline= Pipeline([('transformer',transformers),
                         ('lasso',Lasso(alpha = 0.0254))])
lasso_pipeline.fit(X_train,y_train)
lasso_pipeline.score(X_test,y_test)

0.35246155189555817

In [29]:
lasso_pred = lasso_pipeline.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)
lasso_mse

20.642562296498713

#### Sequential Feature Selection

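# Standalone selector, shown for reference. It is not fitted here because it
# needs numeric input; the pipeline applies it after the column transformer.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select = 5,
                                direction='forward',
                                scoring = 'neg_mean_squared_error')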

In [37]:
sfs_model = GridSearchCV(sfs_pipeline, param_grid = sfs_dict)
sfs_model.fit(X_train,y_train)

best_sfs = sfs_model.best_estimator_
best_sfs

In [39]:
best_sfs.named_steps['linreg'].coef_

array([ 0.86155809, -1.81665037, -2.15540527,  4.35658273,  3.65644202])

In [40]:
best_n_features = 5

In [41]:
best_sfs_pipeline = Pipeline([('transformer', transformers),
                              ('selector', SequentialFeatureSelector(LinearRegression(),
                                                                     n_features_to_select = best_n_features,
                                                                     scoring = 'neg_mean_squared_error')),
                              ('linreg', LinearRegression())])

In [42]:
best_sfs_pipeline.fit(X_train,y_train)
best_sfs_pipeline.score(X_test,y_test)

0.2830380511173628

In [43]:
sfs_pred = best_sfs_pipeline.predict(X_test)
sfs_mse = mean_squared_error(y_test, sfs_pred)
sfs_mse

22.855680210732253

In [44]:
# get_support() lives on the selector step, not on the Pipeline itself
best_sfs_pipeline.named_steps['selector'].get_support()

In [45]:
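# The selector sees the transformer's output, so index the transformed feature names
feature_names = best_sfs_pipeline.named_steps['transformer'].get_feature_names_out()
selected_feature = feature_names[best_sfs_pipeline.named_steps['selector'].get_support()]
selected_feature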