# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
generalization performance estimated on a training set.

Here again, we limit the size of the training set to make the computation
run faster. Feel free to increase the `train_size` value if your computer
is powerful enough.

In [19]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)

In this exercise, we will progressively define the classification pipeline
and later tune its hyperparameters.

Our pipeline should:
* preprocess the categorical columns using a `OneHotEncoder` and use a
  `StandardScaler` to normalize the numerical data.
* use a `LogisticRegression` as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied
on each group of columns.

In [20]:
data.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [21]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude='object')
numerical_columns = numerical_columns_selector(data)

categorical_columns_selector = selector(dtype_include='object')
categorical_columns = categorical_columns_selector(data)
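
To make the dtype-based selection above concrete, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its values are hypothetical and only mimic a few census columns; the text column is forced to `object` dtype so that the selectors behave the same way across pandas versions.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame (NOT the census data): two integer columns and
# one text column explicitly stored with the "object" dtype.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
    "hours-per-week": [40, 50, 40],
})

# Calling a selector on a frame returns the list of matching column names.
toy_numerical = selector(dtype_exclude="object")(toy)
toy_categorical = selector(dtype_include="object")(toy)

print(toy_numerical)
print(toy_categorical)
```

Note that `dtype_exclude='object'` treats every non-object column as numerical, which is a reasonable shortcut only when the dataset contains nothing but numbers and strings.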

In [29]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

numerical_preprocessor = StandardScaler()
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")

Subsequently, create a `ColumnTransformer` to redirect the specific columns
to a preprocessing pipeline.

In [30]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('cat_preprocessor', categorical_preprocessor, categorical_columns),
    ('num_preprocessor', numerical_preprocessor, numerical_columns),
])
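
As a rough illustration of what such a `ColumnTransformer` produces, the sketch below applies the same two kinds of preprocessors to a tiny made-up frame. The `toy` data and the `toy_preprocessor` name are assumptions for the example, not part of the exercise.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one numerical and one categorical column.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": pd.Series(
        ["Private", "Private", "Local-gov", "Private"], dtype=object
    ),
})

# Each tuple is (name, transformer, columns it applies to).
toy_preprocessor = ColumnTransformer([
    ("cat_preprocessor", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("num_preprocessor", StandardScaler(), ["age"]),
])

transformed = toy_preprocessor.fit_transform(toy)

# Two one-hot columns (one per category) plus one scaled numerical column.
print(transformed.shape)
```

The outputs of the individual transformers are concatenated column-wise, in the order in which the transformers are listed.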

Assemble the final pipeline by combining the above preprocessor
with a logistic regression classifier. Force the maximum number of
iterations to `10_000` to ensure that the model will converge.

In [31]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(preprocessor, LogisticRegression(max_iter=10_000))

Use `RandomizedSearchCV` with `n_iter=20` to find the best set of
hyperparameters by tuning the following parameters of the `model`:

- the parameter `C` of the `LogisticRegression` with values ranging from
  0.001 to 10. You can use a log-uniform distribution
  (i.e. `scipy.stats.loguniform`);
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values
  `True` or `False`.

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.
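
To see why a log-uniform distribution suits `C`, the following sketch draws values from `scipy.stats.loguniform` over the same range. The sample size and the seed are arbitrary choices made for the illustration.

```python
import numpy as np
from scipy.stats import loguniform

# Draw 1000 candidate values of C between 0.001 and 10 (arbitrary seed).
samples = loguniform(0.001, 10).rvs(size=1000, random_state=0)

# All draws stay within the requested bounds.
print(samples.min(), samples.max())

# The range spans four decades (1e-3 to 1e1). Roughly half of the draws
# should fall in the lower two decades, i.e. below 0.1.
fraction_below = np.mean(samples < 0.1)
print(fraction_below)
```

A plain uniform distribution over the same interval would put about 99% of the draws above 0.1, so small values of `C` would almost never be tried.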

In [25]:
[param for param in model.get_params() if 'C' in param or 'with_' in param]

['columntransformer__num_preprocessor__with_mean',
 'columntransformer__num_preprocessor__with_std',
 'logisticregression__C']
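
The names listed above follow the `<step>__<parameter>` convention, with one extra level for the transformers nested inside the `ColumnTransformer`. A minimal sketch, assuming a reduced pipeline named `sketch_model` (hypothetical, with a single numerical transformer), shows that the same names can be passed to `set_params`:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical reduced pipeline; make_pipeline names each step after its
# lowercased class name ("columntransformer", "logisticregression").
sketch_model = make_pipeline(
    ColumnTransformer([("num_preprocessor", StandardScaler(), ["age"])]),
    LogisticRegression(max_iter=10_000),
)

# Double underscores walk down the nesting: step, then transformer name,
# then the parameter of that transformer.
sketch_model.set_params(
    logisticregression__C=0.1,
    columntransformer__num_preprocessor__with_mean=False,
)

params = sketch_model.get_params()
print(params["logisticregression__C"])
print(params["columntransformer__num_preprocessor__with_mean"])
```

These are the keys a search object expects in its parameter grid or parameter distributions.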

In [32]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

param_distributions = {
    'logisticregression__C': loguniform(0.001, 10),
    'columntransformer__num_preprocessor__with_mean': [True, False],
    'columntransformer__num_preprocessor__with_std': [True, False],
}

model_random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    n_jobs=-1,
)

model_random_search.fit(data_train, target_train)

print(model_random_search.best_params_)

{'columntransformer__num_preprocessor__with_mean': False, 'columntransformer__num_preprocessor__with_std': False, 'logisticregression__C': 0.8360860766334274}
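
Beyond `best_params_`, a fitted search also exposes every sampled candidate in `cv_results_` and can be scored on held-out data. The sketch below is one possible way to inspect these; it uses synthetic data from `make_classification` and a simplified pipeline (both assumptions, so that it runs without the census file), and the names `sketch_search` and `cv_results` are hypothetical.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (numerical features only).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

sketch_search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000)),
    param_distributions={
        "logisticregression__C": loguniform(0.001, 10),
        "standardscaler__with_mean": [True, False],
    },
    n_iter=5,
    random_state=0,  # fixed seed so the sampled candidates are repeatable
)
sketch_search.fit(X_train, y_train)

# One row per sampled candidate, best cross-validated score first.
cv_results = pd.DataFrame(sketch_search.cv_results_).sort_values(
    "rank_test_score"
)
print(cv_results[
    ["param_logisticregression__C", "mean_test_score", "rank_test_score"]
])

# The search refits the best candidate on the full training set, so it
# can be scored directly on data it has never seen.
test_accuracy = sketch_search.score(X_test, y_test)
print(test_accuracy)
```

Passing `random_state` to `RandomizedSearchCV` makes the sampled candidates repeatable from run to run; without it, the best parameters reported can differ between runs.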
