# Tune hyperparameters using cross-validation

In this notebook, we will tune the hyperparameters of a simple text classification pipeline.

Starting from the raw text data, we will encode it using a bag-of-words representation (*hyperparameter 1*: number of words in the vocabulary), and then train a logistic regression classifier (*hyperparameter 2*: inverse regularization strength `C`). We will evaluate performance using (repeated) stratified cross-validation.
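
To make the first hyperparameter concrete, here is a minimal sketch of a bag-of-words encoding with a capped vocabulary. It is a simplification written for illustration, not how `CountVectorizer` is implemented: it splits on whitespace and ignores stop words, and the names `build_vocabulary`, `encode` and `toy_documents` are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents, max_features):
    """Keep the `max_features` most frequent tokens across all documents."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    # most_common orders by descending count; ties keep first-seen order.
    top_tokens = [token for token, _ in counts.most_common(max_features)]
    # Sort the kept tokens so that the column order is deterministic.
    return {token: index for index, token in enumerate(sorted(top_tokens))}


def encode(document, vocabulary):
    """Turn one document into a vector of token counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        if token in vocabulary:  # tokens outside the vocabulary are dropped
            vector[vocabulary[token]] += 1
    return vector


# Toy corpus, for illustration only.
toy_documents = ["the cat sat", "the cat ran", "the dog ran far"]
vocabulary = build_vocabulary(toy_documents, max_features=3)
vectors = [encode(doc, vocabulary) for doc in toy_documents]
```

With a cap of 3, only `the`, `cat` and `ran` survive, so rarer words such as `dog` no longer contribute any feature. A smaller vocabulary gives the classifier fewer, more frequent features; a larger one keeps rarer words at the cost of more dimensions.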

Metrics from each run will be stored with the **MLflow tracking API**. That is the output we want to version with **DVC**.
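
Each run stores, for every cross-validation result, a mean and a standard deviation across folds under the names `<metric>_avg` and `<metric>_std`. The sketch below reproduces that aggregation in plain Python, without MLflow or NumPy. The function name `summarize` and the scores are made up for illustration.

```python
from statistics import fmean, pstdev


def summarize(fold_results):
    """Collapse per-fold values into `<name>_avg` and `<name>_std` entries."""
    summary = {}
    for name, values in fold_results.items():
        summary[name + "_avg"] = fmean(values)
        # Population standard deviation (divide by n), like NumPy's default std().
        summary[name + "_std"] = pstdev(values)
    return summary


# Made-up scores for three folds, for illustration only.
toy_fold_results = {"test_accuracy": [0.80, 0.90, 0.85]}
summary = summarize(toy_fold_results)
```

Logging the spread alongside the mean helps judge whether the difference between two hyperparameter settings is larger than the fold-to-fold variation.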

In [None]:
# Parameters
"""
:param str input_csv_file: Path to input file
:param List[float] C_list: List of inverse regularization strengths (smaller values mean stronger regularization)
:param List[int] max_features_list: List of maximum vocabulary sizes (number of features)
:param str mlflow_output: MLflow metrics directory
:dvc-in input_csv_file: ./poc/data/data_train.csv
:dvc-out mlflow_output : ./poc/data/cross_valid_metrics
:dvc-extra: --C-list .1 1.0 --max-features-list 100 500 1000
"""
# Value of parameters for this Jupyter Notebook only
# the notebook is in ./poc/pipeline/notebooks
input_csv_file = "../../data/data_train.csv"
C_list = [0.1, 1.0]
max_features_list = [100, 500, 1000]
mlflow_output = "../../data/cross_valid_metrics"

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate, RepeatedStratifiedKFold
import mlflow
from itertools import product

In [None]:
df = pd.read_csv(input_csv_file).dropna()

In [None]:
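def log_results(d):
    # d is the dict returned by cross_validate: one array of per-fold values
    # per key (fit_time, score_time, and test_<scorer> for each scorer).
    for metric, values in d.items():
        # Log the mean and standard deviation across folds.
        mlflow.log_metric(metric + '_avg', values.mean())
        mlflow.log_metric(metric + '_std', values.std())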

In [None]:
mlflow.set_tracking_uri(mlflow_output)

In [None]:
for C, max_features in product(C_list, max_features_list):
    with mlflow.start_run():
        mlflow.log_param('C', C)
        mlflow.log_param('max_features', max_features)
        classifier = LogisticRegression(C=C,
                                        solver='lbfgs',
                                        multi_class='multinomial')
        vectorizer = CountVectorizer(max_features=max_features,
                                     stop_words='english')
        pipeline = Pipeline([('vectorizer', vectorizer),
                             (type(classifier).__name__, classifier)])
        d = cross_validate(pipeline,
                           X=df['data'],
                           y=df['target'],
                           scoring=['accuracy', 'precision_macro', 'f1_micro', 'f1_macro'],
                           cv=RepeatedStratifiedKFold(n_splits=3, n_repeats=1, random_state=0))
        log_results(d)
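
As a quick check on the cost of the search above, the following sketch counts the MLflow runs and model fits implied by the notebook's parameter values. The values are copied from the parameters cell; the variable names `grid`, `n_runs` and `n_fits` are introduced here for illustration.

```python
from itertools import product

# Values copied from the notebook's parameters cell.
C_list = [0.1, 1.0]
max_features_list = [100, 500, 1000]
n_splits, n_repeats = 3, 1

# One MLflow run per (C, max_features) combination.
grid = list(product(C_list, max_features_list))
n_runs = len(grid)

# Each run fits the pipeline once per cross-validation split.
n_fits = n_runs * n_splits * n_repeats
```

Here that gives 6 runs and 18 fits. Raising `n_repeats` or adding values to either list multiplies the total accordingly.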
