# Hyperparameter Optimization

This topic is about extracting the best possible performance from a model.

Almost every model can be improved. A good model often gets better when we fine-tune its hyperparameters, the settings that control the training/fitting process of the model. The task is therefore to find the combination of values that yields the best model result.

This field is still under active research, for instance:
- On Linear Identifiability of Learned Representations
- Linear Mode Connectivity and the Lottery Ticket Hypothesis
- Deep Ensembles: A Loss Landscape Perspective

In general, we have two techniques:
- **grid search:** exhaustive, so it has high time consumption
- **random search:** evaluates only a sample of combinations, so it has lower time consumption

With **grid search**, we provide the possible values that each hyperparameter can take, and the technique runs the model for every combination of those values. This quickly becomes expensive: the number of combinations is the product of the number of values per hyperparameter, and every combination has to be trained and validated, so a large grid or a large dataset can easily take a lot of computation time. For these reasons, **grid search is not very popular.**
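To get a feel for the cost, here is a minimal sketch using only the standard library. The grid values mirror the `param_grid` used in the random forest script below; `cv_folds` is an assumed name for the number of cross-validation folds. It only counts the work involved and does not train anything.

```python
from itertools import product

# Candidate values for each hyperparameter (same grid as the random forest script).
param_grid = {
    "n_estimators": [100, 200, 250, 300, 400, 500],
    "max_depth": [1, 2, 5, 7, 11, 15],
    "criterion": ["gini", "entropy"],
}

# Grid search evaluates the Cartesian product of all value lists.
names = sorted(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

cv_folds = 5  # assumed number of cross-validation folds
n_combinations = len(combinations)
n_fits = n_combinations * cv_folds

print(f"{n_combinations} combinations x {cv_folds} folds = {n_fits} model fits")
```

Adding one more hyperparameter with, say, four candidate values would multiply the number of fits by four, which is why grids grow out of hand so quickly.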

Here we are going to use this dataset:
- https://www.kaggle.com/iabhishekofficial/mobile-price-classification

Video explanation from the author:
https://www.youtube.com/watch?v=5nYqK-HaoKY&list=PLjMBCjnfVRHQZGxbCcpd41Fm4nfBPVnCa&index=28

In [None]:
# https://www.kaggle.com/iabhishekofficial/mobile-price-classification
# rf_grid_search.py
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection

if __name__ == "__main__":
    # read the training data
    df = pd.read_csv("input/mobile_train.csv")
    
    # features are all columns without price_range
    # note that there is no id column in this dataset
    # here we have training features
    x = df.drop("price_range", axis=1).values
    
    # and the targets
    y = df.price_range.values
    
    # define the model here
    # random forest with n_jobs=-1
    # n_jobs=-1 => use all cores
    classifier = ensemble.RandomForestClassifier(n_jobs=-1)

    # define a grid of parameters
    # this can be a dictionary or a list of
    # dictionaries
    param_grid = {
        "n_estimators": [100, 200, 250, 300, 400, 500],
        "max_depth": [1, 2, 5, 7, 11, 15],
        "criterion": ["gini", "entropy"]}
    
    # initialize grid search
    # estimator is the model that we have defined
    # param_grid is the grid of parameters
    # we use accuracy as our metric. you can define your own
    # higher value of verbose implies a lot of details are printed
    # cv=5 means 5-fold cv; for a classifier, an integer cv
    # makes scikit-learn use stratified folds
    model = model_selection.GridSearchCV(estimator=classifier,
                                         param_grid=param_grid,
                                         scoring="accuracy",
                                         verbose=10,
                                         n_jobs=1,
                                         cv=5)
    # fit the model and extract best score
    model.fit(x, y)
    
    print(f"Best score: {model.best_score_}")
    print("Best parameters set:")
    
    best_parameters = model.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print(f"\t{param_name}: {best_parameters[param_name]}")

In **random search**, a set of parameter combinations is chosen at random and the cross-validation score is calculated for each of them. Here, we choose how many combinations we want to evaluate, and this number determines the time consumption.
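The idea can be sketched with the standard library alone. This is an illustration of sampling without replacement from a finite grid, not how `RandomizedSearchCV` is implemented internally; the value ranges mirror the script below, and the seed is an arbitrary choice made only for reproducibility.

```python
import random
from itertools import product

# Same ranges as the random search script, written as plain lists.
param_grid = {
    "n_estimators": list(range(100, 1500, 100)),  # 14 values
    "max_depth": list(range(1, 31)),              # 30 values
    "criterion": ["gini", "entropy"],             # 2 values
}

# Enumerate every possible combination (fine here, the grid is small).
names = sorted(param_grid)
all_combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_iter = 20              # how many combinations we are willing to evaluate
rng = random.Random(42)  # arbitrary seed, only for reproducibility
sampled = rng.sample(all_combinations, n_iter)  # sampling without replacement

print(f"{len(all_combinations)} possible combinations, {len(sampled)} evaluated")
```

With 5-fold cross-validation this means 100 model fits instead of the 4200 that an exhaustive search over the same ranges would need.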

In [None]:
# rf_random_search.py
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection

if __name__ == "__main__":
    
    # read the training data
    df = pd.read_csv("input/mobile_train.csv")
    
    
    # features are all columns without price_range
    # note that there is no id column in this dataset
    # here we have training features
    X = df.drop("price_range", axis=1).values
    
    # and the targets
    y = df.price_range.values
    
    # define the model here
    # random forest with n_jobs=-1
    # n_jobs=-1 => use all cores
    classifier = ensemble.RandomForestClassifier(n_jobs=-1)

    # define a grid of parameters
    # this can be a dictionary or a list of
    # dictionaries
    param_grid = {"n_estimators": np.arange(100, 1500, 100),
                  "max_depth": np.arange(1, 31),
                  "criterion": ["gini", "entropy"]
    }
    
    # initialize random search
    # estimator is the model that we have defined
    # param_distributions is the grid/distribution of parameters
    # we use accuracy as our metric. you can define your own
    # higher value of verbose implies a lot of details are printed
    # cv=5 means 5-fold cv; for a classifier, an integer cv
    # makes scikit-learn use stratified folds
    # n_iter is the number of iterations we want
    # if param_distributions has all the values as list,
    # random search will be done by sampling without replacement
    # if any of the parameters come from a distribution,
    # random search uses sampling with replacement
    model = model_selection.RandomizedSearchCV(estimator=classifier,
                                               param_distributions=param_grid,
                                               n_iter=20,
                                               scoring="accuracy",
                                               verbose=10,
                                               n_jobs=1,
                                               cv=5)
    
    # fit the model and extract best score
    model.fit(X, y)
    
    print(f"Best score: {model.best_score_}")
    print("Best parameters set:")
    
    best_parameters = model.best_estimator_.get_params()
    
    for param_name in sorted(param_grid.keys()):
        print(f"\t{param_name}: {best_parameters[param_name]}")

The result obtained was:
- Best score: 0.888
- Best parameters set:
    - criterion: entropy
    - max_depth: 25
    - n_estimators: 1100

Assume you have two text columns and need to predict a class. We can build the following strategy:
- apply tf-idf in a semi-supervised manner
- after that, use Singular Value Decomposition (SVD) with an SVM

For this strategy to work, we need to:
- select the number of SVD components
- tune the parameters of the SVM.

In [None]:
# pipeline_search.py
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import model_selection
from sklearn import pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def quadratic_weighted_kappa(y_true, y_pred):
    """
    Create a wrapper for cohen's kappa
    with quadratic weights
    """
    return metrics.cohen_kappa_score(y_true,y_pred,
                                     weights="quadratic")

if __name__ == '__main__':
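    # Load the training and test files
    # NOTE: the test path and the "target" column name below are
    # assumptions; adjust them to your own dataset
    train = pd.read_csv('../input/train.csv')
    test = pd.read_csv('../input/test.csv')

    # we don't need ID columns
    idx = test.id.values.astype(int)
    train = train.drop('id', axis=1)
    test = test.drop('id', axis=1)

    # create labels (assumed column name)
    y = train.target.values

    # join the two text columns into one string per row
    train_data = list(train.apply(lambda x: '%s %s' % (x['text1'], x['text2']), axis=1))
    test_data = list(test.apply(lambda x: '%s %s' % (x['text1'], x['text2']), axis=1))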
    
    # tfidf vectorizer
    tfv = TfidfVectorizer(min_df=3,
                          max_features=None,
                          strip_accents='unicode',
                          analyzer='word',
                          token_pattern=r'\w{1,}',
                          ngram_range=(1, 3),
                          use_idf=True,
                          smooth_idf=True,
                          sublinear_tf=True,
                          stop_words='english')
    
    # Fit TFIDF on the training text
    tfv.fit(train_data)

    # Transform both training and test text with the fitted vectorizer
    X = tfv.transform(train_data)
    X_test = tfv.transform(test_data)

    # Initialize SVD
    svd = TruncatedSVD()

    # Initialize the standard scaler
    scl = StandardScaler()
    
    # We will use SVM here..
    svm_model = SVC()
    
    # Create the pipeline
    clf = pipeline.Pipeline([('svd', svd),
                             ('scl', scl),
                             ('svm', svm_model)])
    
    # Create a parameter grid to search for
    # best parameters for everything in the pipeline
    param_grid = {'svd__n_components' : [200, 300],
                  'svm__C': [10, 12]}

    # Kappa Scorer
    kappa_scorer = metrics.make_scorer(quadratic_weighted_kappa,
                                       greater_is_better=True)
    
    # Initialize Grid Search Model
    model = model_selection.GridSearchCV(estimator=clf,
                                         param_grid=param_grid,
                                         scoring=kappa_scorer,
                                         verbose=10,
                                         n_jobs=-1,
                                         refit=True,
                                         cv=5)
    
    # Fit Grid Search Model
    model.fit(X, y)

    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:")

    best_parameters = model.best_estimator_.get_params()
    
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

    # Get best model
    best_model = model.best_estimator_
    # Fit model with best parameters optimized for QWK
    best_model.fit(X, y)
    preds = best_model.predict(...)
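The scorer in the script above wraps `metrics.cohen_kappa_score` with quadratic weights. To make clear what that metric measures, here is a minimal pure-Python sketch of quadratic weighted kappa. The function name and the `n_classes` argument are hypothetical, and the sketch assumes integer labels in `range(n_classes)` with at least two distinct classes present; in practice the scikit-learn function is the one to use.

```python
def quadratic_weighted_kappa_sketch(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for integer labels in range(n_classes)."""
    n = len(y_true)

    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal counts of true and predicted labels.
    true_counts = [sum(row) for row in observed]
    pred_counts = [sum(observed[i][j] for i in range(n_classes))
                   for j in range(n_classes)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreements are penalized by the squared label distance.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Count expected under chance agreement, given the marginals.
            expected = true_counts[i] * pred_counts[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected

    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa_sketch([0, 1, 2], [0, 1, 2], n_classes=3))
print(quadratic_weighted_kappa_sketch([0, 0, 1, 1], [0, 1, 0, 1], n_classes=2))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. Because larger values are better, the scorer is created with `greater_is_better=True`.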

## Hyperparameter Optimization Using Gaussian Processes

The `gp_minimize` function from scikit-optimize allows us to use Bayesian optimization with Gaussian processes.

**Remember:** `gp_minimize` minimizes its objective, and we do not want to minimize accuracy. Instead, we multiply accuracy by -1: by minimizing the negative of accuracy, we are in fact maximizing accuracy.
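A tiny sketch of this trick, using only the standard library. The accuracy values are made-up numbers for illustration, not results from the dataset. Picking the setting that minimizes the negated accuracy gives the same answer as picking the one that maximizes accuracy.

```python
# Made-up cross-validated accuracies for three candidate settings (illustration only).
accuracies = {"max_depth=3": 0.81, "max_depth=7": 0.88, "max_depth=15": 0.86}


def objective(setting):
    """Return negative accuracy so that a minimizer prefers accurate models."""
    return -1 * accuracies[setting]


best_by_min = min(accuracies, key=objective)       # what a minimizer would find
best_by_max = max(accuracies, key=accuracies.get)  # what we actually want

print(best_by_min, best_by_max)
```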

In [None]:
# rf_gp_minimize.py
import numpy as np
import pandas as pd

from functools import partial

from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection

from skopt import gp_minimize
from skopt import space

def optimize(params, param_names, x, y):
    """
    The main optimization function.
    This function takes all the arguments from the search space
    and training features and targets. It then initializes
    the models by setting the chosen parameters and runs
    cross-validation and returns a negative accuracy score
    :param params: list of params from gp_minimize
    :param param_names: list of param names. order is important!
    :param x: training data
    :param y: labels/targets
    :return: negative accuracy after 5 folds
    """
    
    # convert params to dictionary
    params = dict(zip(param_names, params))

    # initialize model with current parameters
    model = ensemble.RandomForestClassifier(**params)

    # initialize stratified k-fold
    kf = model_selection.StratifiedKFold(n_splits=5)
    
    # initialize accuracy list
    accuracies = []
    
    for idx in kf.split(X=x, y=y):
        train_idx, test_idx = idx[0], idx[1]
        
        xtrain = x[train_idx]
        ytrain = y[train_idx]

        xtest = x[test_idx]
        ytest = y[test_idx]
        
        # fit model for current fold
        model.fit(xtrain, ytrain)

        #create predictions
        preds = model.predict(xtest)

        # calculate and append accuracy
        fold_accuracy = metrics.accuracy_score(ytest,preds)
        accuracies.append(fold_accuracy)
    
    # return negative accuracy
    return -1 * np.mean(accuracies)


if __name__ == "__main__":
    # read the training data
    df = pd.read_csv("../input/mobile_train.csv")

    #features are all columns without price_range
    #note that there is no id column in this dataset
    #here we have training features
    X= df.drop("price_range", axis=1).values
    #and the targets
    y= df.price_range.values

    # define a parameter space
    param_space = [
        # max_depth is an integer between 3 and 15
        space.Integer(3, 15, name="max_depth"),
        # n_estimators is an integer between 100 and 1500
        space.Integer(100, 1500, name="n_estimators"),
        # criterion is a category. here we define list of categories
        space.Categorical(["gini", "entropy"], name="criterion"),
        # you can also have Real numbered space and define a
        # distribution you want to pick it from
        space.Real(0.01, 1, prior="uniform", name="max_features")]
    
    param_names= ["max_depth",
                  "n_estimators",
                  "criterion",
                  "max_features"]
    
    # by using functools partial, 
    # new function is created which has same parameters as the
    # optimize function except for the fact that
    # only one param, i.e. the "params" parameter is
    # required. this is how gp_minimize expects the
    # optimization function to be. you can get rid of this
    # by reading data inside the optimize function or by
    # defining the optimize function here.
    optimization_function = partial(optimize,
                                    param_names=param_names,
                                    x=X, y=y)
    
    # now we call gp_minimize from scikit-optimize
    # gp_minimize uses bayesian optimization for
    # minimization of the optimization function.
    # we need a space of parameters, the function itself,
    # the number of calls/iterations we want to have
    result = gp_minimize(optimization_function,
                         dimensions=param_space,
                         n_calls=15,
                         n_random_starts=10,
                         verbose=10)
    
    # create best params dict and print it
    best_params = dict(zip(param_names,result.x))
    
    print(best_params)

## Hyperopt

The `hyperopt` library uses the Tree-structured Parzen Estimator (TPE) to find the best parameters.

In [None]:
# rf_hyperopt.py
import numpy as np
import pandas as pd
from functools import partial
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope

def optimize(params, x, y):
    """
    The main optimization function.
    This function takes all the arguments from the search space
    and training features and targets. It then initializes
    the models by setting the chosen parameters and runs
    cross-validation and returns a negative accuracy score
    :param params: dict of params from hyperopt
    :param x: training data
    :param y: labels/targets
    :return: negative accuracy after 5 folds
    """
    
    # initialize model with current parameters
    model = ensemble.RandomForestClassifier(**params)

    # initialize stratified k-fold
    kf = model_selection.StratifiedKFold(n_splits=5)
    
    # initialize accuracy list
    accuracies = []
    
    for idx in kf.split(X=x, y=y):
        train_idx, test_idx = idx[0], idx[1]
        
        xtrain = x[train_idx]
        ytrain = y[train_idx]

        xtest = x[test_idx]
        ytest = y[test_idx]
        
        # fit model for current fold
        model.fit(xtrain, ytrain)

        #create predictions
        preds = model.predict(xtest)

        # calculate and append accuracy
        fold_accuracy = metrics.accuracy_score(ytest,preds)
        accuracies.append(fold_accuracy)
    
    # return negative accuracy
    return -1 * np.mean(accuracies)

    
if __name__ == "__main__":
    # read the training data
    df = pd.read_csv("../input/mobile_train.csv")
    
    # features are all columns without price_range
    # note that there is no id column in this dataset
    # here we have training features
    X = df.drop("price_range", axis=1).values
    # and the targets
    y = df.price_range.values
    
    # define a parameter space
    # now we use hyperopt
    param_space = {
        # quniform gives round(uniform(low, high) / q) * q
        # we want int values for depth and estimators
        "max_depth": scope.int(hp.quniform("max_depth", 1, 15, 1)),
        "n_estimators": scope.int(hp.quniform("n_estimators", 100, 1500, 1)),
        
        # choice chooses from a list of values
        "criterion": hp.choice("criterion", ["gini", "entropy"]),
        
        # uniform chooses a value between two values
        # start slightly above 0, since max_features must be positive
        "max_features": hp.uniform("max_features", 0.01, 1)
    }
    # partial function
    optimization_function = partial(optimize,
                                    x=X,
                                    y=y)
    
    # initialize trials to keep logging information
    trials = Trials()
    
    # run hyperopt
    hopt = fmin(fn=optimization_function,
                space=param_space,
                algo=tpe.suggest,
                max_evals=15,
                trials=trials)
    
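    # note: for hp.choice parameters (criterion), fmin reports the
    # index of the chosen option rather than the option itself
    print(hopt)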
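The comment in the search space above describes `quniform` as `round(uniform(low, high) / q) * q`. A minimal standard-library sketch of that rule follows. `quniform_sketch` is a hypothetical helper written for illustration, not hyperopt's own implementation, and the seed is arbitrary.

```python
import random


def quniform_sketch(rng, low, high, q):
    """Draw uniformly from [low, high] and snap the value to a multiple of q."""
    return round(rng.uniform(low, high) / q) * q


rng = random.Random(0)  # arbitrary seed, only for reproducibility

# With q=1 the draws are whole numbers, suitable for max_depth.
depths = [int(quniform_sketch(rng, 1, 15, 1)) for _ in range(1000)]

# With q=100 the draws are multiples of 100, a coarser grid for n_estimators.
estimators = [int(quniform_sketch(rng, 100, 1500, 100)) for _ in range(1000)]

print(min(depths), max(depths))
print(sorted(set(estimators)))
```

A larger `q` shrinks the number of distinct values the optimizer can propose, which can be a reasonable trade-off for parameters such as `n_estimators` where small differences rarely matter.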

These are the most common ways of tuning hyperparameters, and they can be used with:
- linear regression
- logistic regression
- tree-based methods
- gradient boosting models
    - xgboost
    - lightgbm
- and even neural networks