# **Hyperparameter**

Machine learning models have two different types of parameters:

*    **Hyperparameters** are the parameters that the user sets before training starts (e.g. the number of estimators in a Random Forest).
*    **Model parameters** are instead learned during model training (e.g. the weights in Neural Networks or Linear Regression).

The model parameters define how to use input data to get the desired output and are learned at training time. Hyperparameters, by contrast, determine how our model is structured in the first place.

Tuning a machine learning model is a type of optimization problem. We have a set of hyperparameters, and we aim to find the combination of their values that minimizes (e.g. loss) or maximizes (e.g. accuracy) a function.

**Definition**

Wikipedia states that “hyperparameter tuning is choosing a set of optimal hyperparameters for a learning algorithm”. So what is a hyperparameter?

``A hyperparameter is a parameter whose value is set before the learning process begins.``

Some examples of hyperparameters include penalty in logistic regression and loss in stochastic gradient descent.

In sklearn, hyperparameters are passed in as arguments to the constructor of the model classes.
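
As a minimal sketch of this distinction, the snippet below sets hyperparameters in the constructor and then reads back the learned model parameters after fitting. The synthetic dataset from make_classification and the values C=0.5 and max_iter=200 are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the cancer dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by the user and passed to the constructor
clf = LogisticRegression(C=0.5, max_iter=200)
hyperparams = clf.get_params()
print(hyperparams["C"], hyperparams["max_iter"])

# Model parameters: learned from the data by fit()
clf.fit(X, y)
print(clf.coef_.shape)       # one weight per feature
print(clf.intercept_.shape)  # one intercept for the binary problem
```

The hyperparameters exist before fit() is called, while coef_ and intercept_ only appear afterwards.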

<img src = 'a_img.png'>

## **Tuning Strategies**

### **1.    Manual Search**
The advantage of manual tuning is:
-    You learn how the hyperparameters behave and can reuse that knowledge in other projects. Therefore, I would recommend tuning the major models manually at least once.

The disadvantages are:
-    It requires manual work.
-    You may read too much into an unexpected change in the score without running enough trials to check whether the change generalizes.
    
### **2.    Random Search**
The advantage of random search is:
-    You can keep the run time under control, because you choose the number of parameter combinations to try.

The disadvantages are:
-    You have to accept that the selected hyperparameter set might not be the true best within the ranges you searched.
-    Depending on the number of searches and the size of the parameter space, some parameters might not be explored enough.
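
To make this coverage trade-off concrete, here is a small worked calculation. It assumes the search space is a finite grid and that random search draws distinct combinations uniformly at random. The numbers (100 combinations, 10 draws, top 5) and the function names are illustrative assumptions.

```python
from math import comb

def prob_best_sampled(total, n_iter):
    """Chance that the single best combination is among n_iter distinct draws."""
    return n_iter / total

def prob_any_top_k_sampled(total, n_iter, k):
    """Chance that at least one of the k best combinations is drawn."""
    # Complement: all n_iter draws come from the (total - k) other combinations
    return 1 - comb(total - k, n_iter) / comb(total, n_iter)

p_best = prob_best_sampled(total=100, n_iter=10)
p_top5 = prob_any_top_k_sampled(total=100, n_iter=10, k=5)
print(f"best combination sampled: {p_best:.2f}")
print(f"at least one of the top 5 sampled: {p_top5:.2f}")
```

Under these assumptions, the only way to raise the odds is to raise the number of draws, which costs run time.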
    
### **3.    Grid Search**
The advantage of this approach is:
-    You cover every candidate set of parameters in the grid. No matter how strongly you believe one set is the most viable, a neighboring set could do better, and grid search does not miss that possibility.

The disadvantage is:
-    A single run for one hyperparameter set takes a while, so the run time over all parameter sets can be huge. The number of parameters you can explore therefore has practical limits.
    
### **4.    Automated Hyperparameter Tuning (Bayesian Optimization, Genetic Algorithms)**
### **5.    Artificial Neural Networks (ANNs) Tuning**

<hr>

## 1. **Random Search**
In Random Search, we create a grid of hyperparameters and train/test our model on just some random combinations of these hyperparameters. In this example, I additionally decided to perform Cross-Validation on the training set.

When performing Machine Learning tasks, we generally divide our dataset into training and test sets. This is done so that we can test our model after having trained it (in this way we can check its performance on unseen data). ``When using Cross-Validation, we divide our training set into N other partitions to make sure our model is not overfitting our data.``

One of the most commonly used Cross-Validation methods is K-Fold Validation. In K-Fold, we divide our training set into N partitions and then iteratively train our model on N-1 partitions and test it on the left-over partition (at each iteration we change the left-over partition). Once the model has been trained N times, we average the results obtained in each iteration to obtain our overall training performance (Figure 3).
<img src = 'b_img.png'>
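
The K-Fold procedure described above can be sketched in plain Python. The helpers k_fold_indices and cross_val_mean are hypothetical names written for this illustration, not scikit-learn functions, and the folds are assumed to be contiguous and unshuffled.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cross_val_mean(scores):
    """Average the per-fold scores into one overall estimate."""
    return sum(scores) / len(scores)

folds = list(k_fold_indices(n_samples=10, n_folds=4))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample lands in the validation partition exactly once, and the per-fold scores would then be averaged with cross_val_mean.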

Using Cross-Validation when optimizing hyperparameters can be really important. In this way, we can avoid choosing hyperparameters that work really well on the training data but not so well on the test data.

We can now start implementing Random Search by first defining a grid of hyperparameters which will be randomly sampled when calling **RandomizedSearchCV()**.

For example, setting **cv = 4** divides the training set into 4 folds, and setting **n_iter = 80** samples 80 combinations. Using the scikit-learn best_estimator_ attribute, we can then retrieve the model refit with the hyperparameters that performed best during cross-validation and evaluate it on the test set.
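
Before turning to scikit-learn, the idea behind Random Search can be sketched in a few lines of plain Python. The function random_search, the grid, and toy_score are all hypothetical and exist only for this illustration. In practice the score would come from cross-validation rather than a formula.

```python
import itertools
import random

def random_search(grid, evaluate, n_iter, seed=0):
    """Evaluate n_iter randomly chosen combinations and return the best one."""
    names = list(grid)
    all_combos = [dict(zip(names, values))
                  for values in itertools.product(*(grid[n] for n in names))]
    rng = random.Random(seed)
    # Sample without replacement, never asking for more combinations than exist
    sampled = rng.sample(all_combos, min(n_iter, len(all_combos)))
    scored = [(evaluate(params), params) for params in sampled]
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    return best_params, best_score, len(sampled)

# Hypothetical grid and toy scoring function standing in for cross-validation
grid = {"max_depth": [2, 4, 6, 8], "n_estimators": [10, 50, 100]}

def toy_score(params):
    return -abs(params["max_depth"] - 6) - abs(params["n_estimators"] - 50) / 50

best_params, best_score, n_tried = random_search(grid, toy_score, n_iter=5)
print(best_params, best_score, n_tried)
```

Only 5 of the 12 combinations are scored here, so the result is the best of the sample and not necessarily the best of the grid.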


<hr>

## 2. **Grid Search**
In Grid Search, we set up a grid of hyperparameters and train/test our model on each of the possible combinations.

In order to choose the parameters to use in Grid Search, we can now look at which parameters worked best with Random Search and form a grid based on them to see if we can find a better combination.

Grid Search can be implemented in Python using scikit-learn's GridSearchCV() class. As with Random Search, the cv argument sets the number of folds (e.g. cv = 4).

``When using Grid Search, all the possible combinations of the parameters in the grid are tried. For a grid with 2 × 10 × 4 × 4 × 4 × 10 values, for example, 12800 combinations will be used during training.`` By contrast, Random Search with n_iter = 80 tries just 80 combinations.
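
The size of a grid is the product of the number of values per hyperparameter, and each candidate is fitted once per fold. The short calculation below works this out for the figures quoted above. The helper names are hypothetical, and the final refit on the full training set is left out of the count.

```python
from math import prod

def n_combinations(values_per_param):
    """Number of grid points: product of the number of values per hyperparameter."""
    return prod(values_per_param)

def n_fits(n_candidates, n_folds):
    """Model fits needed to cross-validate every candidate (final refit excluded)."""
    return n_candidates * n_folds

grid_candidates = n_combinations([2, 10, 4, 4, 4, 10])
print("grid search candidates:", grid_candidates)                    # 12800
print("grid search fits with 4 folds:", n_fits(grid_candidates, 4))  # 51200
print("random search fits, 80 candidates, 4 folds:", n_fits(80, 4))  # 320
```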

## **Cancer Dataset - Logistic Regression**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings("ignore")

In [2]:
# read the dataset
data = pd.read_csv("cancer.csv")

# drop the unused columns
data.drop(["Unnamed: 32","id"], axis=1, inplace=True)

# convert the labels: M (malignant) = 1 and B (benign) = 0
data.diagnosis = [1 if each == "M" else 0 for each in data.diagnosis]

# show a sample of the data
data.head(3)

|  | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 1 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |


### **Splitting Dataset**

In [3]:
x = data.drop(['diagnosis'], axis=1)
y = data['diagnosis']

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.10, shuffle=False)

### **Fitting Model**
**solver**: {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=‘lbfgs’

Algorithm to use in the optimization problem.

-    For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
-    For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
-    ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty.
-    ‘liblinear’ and ‘saga’ also handle L1 penalty.
-    ‘saga’ also supports ‘elasticnet’ penalty.
-    ‘liblinear’ does not support setting penalty='none'.
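
These compatibility rules matter when a search grid combines every solver with every penalty. As a rough sketch, the rules above can be written as a lookup table and used to filter out invalid pairs. The table SUPPORTED_PENALTIES and the helper valid_pairs are hypothetical names written for this illustration, and ‘liblinear’ is assumed to accept L2 as well as L1, as the word "also" in the notes implies.

```python
from itertools import product

# Penalties each solver accepts, transcribed from the notes above
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
    "liblinear": {"l1", "l2"},
}

def valid_pairs(solvers, penalties):
    """Keep only the (solver, penalty) pairs that the table allows."""
    return [(s, p) for s, p in product(solvers, penalties)
            if p in SUPPORTED_PENALTIES[s]]

solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
penalties = ["l1", "l2", "elasticnet", "none"]
pairs = valid_pairs(solvers, penalties)
print(len(pairs), "of", len(solvers) * len(penalties), "pairs are valid")
```

Under these rules, 12 of the 20 solver/penalty pairs are usable, so a full cross product spends part of its budget on combinations that cannot be fitted.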


In [5]:
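# fit a logistic regression with default hyperparameters ("Asli" = original)
model_LogReg_Asli = LogisticRegression()
model_LogReg_Asli.fit(x_train, y_train)
# learned model parameters: one coefficient per feature, plus the intercept
print(model_LogReg_Asli.coef_)
print(model_LogReg_Asli.intercept_)

# coefficient of the first feature and the intercept
m = model_LogReg_Asli.coef_[0][0]
c = model_LogReg_Asli.intercept_[0]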

[[-0.89361242 -0.20515277 -0.24553851  0.0074053   0.03321902  0.15998303
   0.21445026  0.09171891  0.04857537  0.01017327 -0.03335621 -0.40578448
  -0.10764678  0.10272549  0.00334861  0.03446109  0.04458414  0.01190454
   0.01255993  0.00318775 -0.91634081  0.27399961  0.23891563  0.017254
   0.0619081   0.49553281  0.59057666  0.17641094  0.1556528   0.0481576 ]]
[-0.16180843]


### **Predict**

In [6]:
# predict on the test set
y_pred = model_LogReg_Asli.predict(x_test)

### **Model Performance**

In [7]:
model_LogReg_Asli.score(x_train, y_train)

0.947265625

In [8]:
model_LogReg_Asli.score(x_test, y_test)

0.9649122807017544

### **Model Parameter**

In [9]:
# parameters used by the original model
model_LogReg_Asli.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [10]:
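# logistic regression hyperparameters to tune + their candidate values
# (the string 'none' is accepted by the scikit-learn version used here; newer releases expect penalty=None)

penalty = ['l1', 'l2', 'elasticnet', 'none']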
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
max_iter = [1, 10, 100, 1000, 10000]

param = {'penalty': penalty, 'solver': solver, 'max_iter': max_iter}
param

{'penalty': ['l1', 'l2', 'elasticnet', 'none'],
 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
 'max_iter': [1, 10, 100, 1000, 10000]}

**Hyper-parameter Tuning**:

- Randomized Search Cross Validation
- Grid Search Cross Validation = 4 (penalty choices) × 5 (solver choices) × 5 (max_iter choices) = 100 combinations

## **Randomized Search CV**

In [11]:
# search for the best parameters for the model: logistic regression

from sklearn.model_selection import RandomizedSearchCV
model_LR = LogisticRegression()
model_LR_RS = RandomizedSearchCV(
    estimator = model_LR, param_distributions= param, cv = 5
)

# model_LR2_GS = GridSearchCV( model_LR2, param, cv=5, error_score=0.0)

In [12]:
model_LR_RS.fit(x_train, y_train)
model_LR_RS.best_params_

{'solver': 'newton-cg', 'penalty': 'none', 'max_iter': 10000}

In [13]:
# test accuracy of the original (untuned) model, for comparison
model_LogReg_Asli.score(x_test, y_test)

0.9649122807017544

In [14]:
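# "Baru" = new. Note: these values differ from the best_params_ shown above.
# RandomizedSearchCV samples combinations at random, so results vary between runs unless random_state is set.
model_LogReg_Baru = LogisticRegression(solver='saga', penalty = 'l2', max_iter = 10000)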

model_LogReg_Baru.fit(x_train, y_train)
model_LogReg_Baru.score(x_test, y_test)

0.9473684210526315
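
As a hedged sketch of a more repeatable workflow, the snippet below fixes random_state and scores best_estimator_ directly instead of copying the parameters by hand. It runs on synthetic data from make_classification, not the cancer dataset, and the search space (C and solver values) is an assumption chosen so that every combination is valid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the cancer dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative search space: every combination here is valid
param_distributions = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "solver": ["lbfgs", "liblinear"],
}

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled combinations
    cv=4,            # 4-fold cross-validation
    random_state=0,  # makes the sampled combinations repeatable
)
search.fit(X_train, y_train)

print(search.best_params_)
# best_estimator_ is already refit on the whole training set (refit=True by default)
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(test_accuracy)
```

Scoring best_estimator_ guarantees that the evaluated model uses exactly the parameters the search selected.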

## **Grid Search CV**

In [15]:
from sklearn.model_selection import GridSearchCV
model_LR2 = LogisticRegression()
model_LR2_GS = GridSearchCV(
    model_LR2, param, cv = 5
)

In [16]:
model_LR2_GS.fit(x_train, y_train)
model_LR2_GS.best_params_

{'max_iter': 10000, 'penalty': 'none', 'solver': 'lbfgs'}

In [17]:
# test accuracy of the original (untuned) model, for comparison
model_LogReg_Asli.score(x_test, y_test)

0.9649122807017544

In [18]:
model_LogReg_Baru_2 = LogisticRegression(solver='lbfgs', penalty = 'none', max_iter = 10000)

model_LogReg_Baru_2.fit(x_train, y_train)
model_LogReg_Baru_2.score(x_test, y_test)

0.9649122807017544
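
One way to keep Grid Search from spending fits on incompatible solver/penalty pairs is to pass a list of sub-grids, one per solver. The sketch below is an illustration under assumptions: it uses synthetic data, a reduced set of solvers and penalties, and hypothetical C values. The penalty strings are written as the scikit-learn version used in this notebook accepts them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the cancer dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# One sub-grid per solver, so incompatible solver/penalty pairs are never tried
param_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1.0, 10.0]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1.0, 10.0]},
]

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)        # 3 + 6 combinations
print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best candidate
```

Here best_score_ is a cross-validation average on the training data, so it is not directly comparable to the test-set scores reported above.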

## **Reference**:
- Tara Boyle, "Hyperparameter Tuning", https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624
- Pier Paolo Ippolito, "Hyperparameters Optimization", https://towardsdatascience.com/hyperparameters-optimization-526348bb8e2d
- Jiahao Weng, "Hyperparameter Tuning: A Practical Guide and Template", https://towardsdatascience.com/hyperparameter-tuning-a-practical-guide-and-template-b3bf0504f095
- Moto DEI, "Hyperparameter Tuning Explained — Tuning Phases, Tuning Methods, Bayesian Optimization, and Sample Code!", https://towardsdatascience.com/hyperparameter-tuning-explained-d0ebb2ba1d35