# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
generalization performance of a model fitted on a training set.

Here again, we limit the size of the training set to make the computation
run faster. Feel free to increase the `train_size` value if your computer
is powerful enough.

In [1]:

import numpy as np
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)
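
To make the effect of `train_size=0.2` concrete, here is a minimal sketch on synthetic data. The arrays `X` and `y` are made up for illustration and are not the census data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# train_size=0.2 keeps 20% of the rows for training; the remaining 80%
# become the test set because test_size is left unspecified.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (20, 3) (80, 3)
```

Raising `train_size` gives the model more data to learn from, at the cost of longer fits during the search.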

In this exercise, we will progressively define the classification pipeline
and later tune its hyperparameters.

Our pipeline should:
* preprocess the categorical columns using a `OneHotEncoder` and use a
  `StandardScaler` to normalize the numerical data.
* use a `LogisticRegression` as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied
on each group of columns.

In [2]:
from sklearn.compose import make_column_selector

# Write your code here.
cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_include=int)
cat_columns = cat_selector(data_train)
num_columns = num_selector(data_train)
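
As a quick check of what the selectors return, the sketch below applies them to a small hypothetical frame. The column names and values are invented for illustration, and the dtypes are set explicitly so that the result does not depend on how pandas infers string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame with one categorical and two integer columns.
toy = pd.DataFrame({
    "age": pd.Series([25, 40, 31], dtype="int64"),
    "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
    "hours-per-week": pd.Series([40, 50, 38], dtype="int64"),
})

# Calling a selector on a dataframe returns a list of matching column names.
toy_cat_columns = make_column_selector(dtype_include=object)(toy)
toy_num_columns = make_column_selector(dtype_include=int)(toy)

print(toy_cat_columns)  # ['workclass']
print(toy_num_columns)  # ['age', 'hours-per-week']
```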


In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Write your code here.
cat_preprocessor = OneHotEncoder(
    handle_unknown='ignore')

scaler = StandardScaler()

Subsequently, create a `ColumnTransformer` to redirect each group of columns
to its preprocessing pipeline.

In [4]:
from sklearn.compose import ColumnTransformer

# Write your code here.
preprocessor = ColumnTransformer([
    ('ohe', cat_preprocessor, cat_columns),
    ('ss', scaler, num_columns),
], remainder='passthrough')
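
To see how a `ColumnTransformer` dispatches columns, the following sketch fits one on a tiny hypothetical frame. The columns `color` and `size` are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one categorical column, one numerical column.
toy = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red", "green"], dtype=object),
    "size": [1.0, 2.0, 3.0, 4.0],
})

toy_preprocessor = ColumnTransformer([
    ("ohe", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("ss", StandardScaler(), ["size"]),
])

# The three distinct colors become three one-hot columns; "size" stays a
# single scaled column, so the output has 3 + 1 = 4 columns.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)  # (4, 4)
```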

Assemble the final pipeline by combining the above preprocessor
with a logistic regression classifier. Force the maximum number of
iterations to `10_000` to ensure that the model will converge.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Write your code here.
model = Pipeline([('preprocessor', preprocessor),
                  ('log_reg', LogisticRegression(max_iter=10000))])

Use `RandomizedSearchCV` with `n_iter=20` to find the best set of
hyperparameters by tuning the following parameters of the `model`:

- the parameter `C` of the `LogisticRegression` with values ranging from
  0.001 to 10. You can use a log-uniform distribution
  (i.e. `scipy.stats.loguniform`);
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values
  `True` or `False`.
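
To illustrate why a log-uniform distribution suits `C`, the sketch below draws samples from the same range and checks how they spread. The sample size and seed are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy.stats import loguniform

# Draw candidate values for C from the range used in the exercise.
samples = loguniform(0.001, 10).rvs(size=1000, random_state=0)

# Every draw stays inside [0.001, 10].
print(samples.min(), samples.max())

# Draws are spread evenly across orders of magnitude, so roughly half of
# them fall below the geometric midpoint sqrt(0.001 * 10) = 0.1.
fraction_below_midpoint = np.mean(samples < 0.1)
print(fraction_below_midpoint)
```

A plain uniform distribution over the same range would place almost all draws above 0.1 and would rarely explore strong regularization.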

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.
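
Parameters of nested estimators are addressed by joining step names with a double underscore. One way to discover the exact names is `get_params()`. The sketch below assumes the step names `preprocessor`, `ohe`, `ss` and `log_reg`; the column lists are hypothetical placeholders.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Unfitted pipeline; the column names are placeholders for illustration.
toy_model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
        ("ss", StandardScaler(), ["age"]),
    ])),
    ("log_reg", LogisticRegression(max_iter=10_000)),
])

# get_params() lists every tunable parameter under its full nested name.
param_names = toy_model.get_params().keys()
for name in ("log_reg__C",
             "preprocessor__ss__with_mean",
             "preprocessor__ss__with_std"):
    print(name, name in param_names)
```

Each of the three names should be reported as present; a typo in a name passed to the search raises an error at fit time.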

In [6]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# Write your code here.
param_grid = {
    'log_reg__C': loguniform(0.001, 10),
    'preprocessor__ss__with_mean': [True, False],
    'preprocessor__ss__with_std': [True, False]
}

model_grid_search = RandomizedSearchCV(
    model, param_distributions=param_grid, n_iter=20, n_jobs=2)

model_grid_search.fit(data_train, target_train)

RandomizedSearchCV(estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('ohe',
                                                                               OneHotEncoder(handle_unknown='ignore'),
                                                                               ['workclass',
                                                                                'education',
                                                                                'marital-status',
                                                                                'occupation',
                                                                                'relationship',
                                                                                'race',
                                                                                's

In [7]:
model_grid_search.best_params_

{'log_reg__C': 0.715348825774979,
 'preprocessor__ss__with_mean': False,
 'preprocessor__ss__with_std': False}
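
Beyond `best_params_`, a fitted search also exposes `cv_results_` and can score held-out data with the refitted best model. The sketch below shows this on synthetic data from `make_classification`, with a bare `LogisticRegression` instead of the full pipeline; the sample sizes, `n_iter=5` and seeds are assumptions made to keep it short.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the census data.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=10_000),
    param_distributions={"C": loguniform(0.001, 10)},
    n_iter=5, random_state=0)
search.fit(X_train, y_train)

# One row per sampled candidate, best cross-validated score first.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["param_C", "mean_test_score", "std_test_score"]])

# score() uses the best candidate, refitted on the whole training set.
test_accuracy = search.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Comparing `mean_test_score` across rows shows whether the best candidate is clearly ahead or whether several values of `C` perform about equally well.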