## Hyperparameter tuning, model evaluation

Hyperparameter optimization, also known as hyperparameter tuning, is a crucial aspect of machine learning that involves selecting the optimal set of hyperparameters for a learning algorithm. Hyperparameters are parameters that control the learning process and are typically set before the training begins. Examples of hyperparameters include learning rate, regularization strength, the number of hidden layers in a neural network, and the number of trees in a random forest. The performance of a model can vary significantly depending on the values of these hyperparameters.

Hyperparameter tuning is challenging because the optimal values are problem-specific and depend on the available data. In practice, several strategies are used. One common approach is **grid search**, where the hyperparameter space is discretized and every combination on the resulting grid is evaluated. Another is **random search**, where hyperparameter values are sampled at random from specified distributions. **Bayesian optimization** is another popular method: it fits a probabilistic surrogate model of performance as a function of the hyperparameters and uses that model to choose which configuration to evaluate next.
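To make the difference between grid search and random search concrete, the following minimal sketch only generates candidate configurations without training anything. The hyperparameter names `C` and `tol`, their ranges, and the variable names are illustrative assumptions, not values prescribed by this practical.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: discretize each hyperparameter and enumerate every combination.
# (The values below are illustrative assumptions.)
grid_candidates = list(ParameterGrid({"C": [0.01, 0.1, 1.0, 10.0], "tol": [1e-4, 1e-2]}))
print(len(grid_candidates), "grid candidates, first:", grid_candidates[0])

# Random search: draw a fixed number of candidates from distributions.
sampler = ParameterSampler(
    {"C": loguniform(1e-3, 1e2), "tol": loguniform(1e-5, 1e-1)},
    n_iter=5,
    random_state=0,
)
random_candidates = list(sampler)
print(len(random_candidates), "random candidates, first:", random_candidates[0])
```

The grid grows multiplicatively with each added hyperparameter (here 4 x 2 candidates), whereas the number of random candidates is fixed by `n_iter` regardless of how many hyperparameters are sampled.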

In this practical, we will explore the grid search method of hyperparameter tuning. By the end of this practical, you will have a better understanding of how to systematically search for the hyperparameters of your learning algorithm to achieve optimal performance.

In [2]:
import time
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state
import pandas as pd
import seaborn as sns

#### 1. Load the dataset
This example uses the MNIST database of handwritten digits, which consists of 70,000 image samples belonging to 10 digit classes; only a subset is used for training and testing below. The digits have been size-normalized and centered in fixed-size images. The original black-and-white images from NIST were scaled to fit a 20x20 pixel box while preserving their aspect ratio. Because the normalization algorithm uses anti-aliasing, the resulting images contain gray levels. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so that this point lies at the center of the 28x28 field.

In [None]:
# Load data from https://www.openml.org/d/554
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=True)

#### 2. **Perform dataset inspection:**
Perform tasks such as previewing image samples from each class, exploring the dataset's dimensions and structure, and examining and visualizing the class distribution.
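One possible starting point for the inspection is sketched below. To keep it runnable without a download, it assumes scikit-learn's bundled `load_digits` dataset (8x8 images) as a stand-in for MNIST; the names `X_demo`, `y_demo`, `class_counts`, and `first_per_class` are hypothetical. The same calls should apply to the `X`, `y` loaded above.

```python
from sklearn.datasets import load_digits

# Assumption: load_digits (8x8 images, shipped with scikit-learn) stands in
# for the MNIST X, y so that this sketch runs offline.
X_demo, y_demo = load_digits(return_X_y=True, as_frame=True)

# Dimensions and structure.
print("Feature matrix shape:", X_demo.shape)
print("Label dtype:", y_demo.dtype)
print(X_demo.iloc[:3, :8])

# Class distribution: number of samples per digit.
class_counts = y_demo.value_counts().sort_index()
print(class_counts)

# One example image per class, reshaped from a flat row to a square array.
side = int(X_demo.shape[1] ** 0.5)
first_per_class = {
    label: X_demo[y_demo == label].iloc[0].to_numpy().reshape(side, side)
    for label in class_counts.index
}
print("Image shape per sample:", first_per_class[0].shape)
```

Each array in `first_per_class` can be displayed with `plt.imshow(image, cmap="gray")`, and `class_counts.plot(kind="bar")` gives a quick view of the class balance.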

In [None]:
# Generate train/test splits. Given the limited time and resources available for experimentation,
# use only 5000 samples for training and 1000 samples for testing.
train_samples = 5000
test_samples = 1000
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_samples, test_size=test_samples, random_state=400, stratify=y)

In [2]:
#Preprocess or standardize the train and test sets using the correct approach: use StandardScaler
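A minimal sketch of one common way to standardize is given here: the scaler's statistics are estimated on the training set only and then reused for the test set, so no information from the test set leaks into preprocessing. The stand-in dataset (`load_digits`), the reduced split sizes, and the `Xd_*` names are assumptions made to keep the sketch small.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: a small bundled dataset stands in for the MNIST split above.
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=500, test_size=100, random_state=400, stratify=y_demo
)

scaler = StandardScaler()
# Learn per-feature mean and standard deviation from the training set only.
Xd_train_scaled = scaler.fit_transform(Xd_train)
# Apply the training-set statistics to the test set (no refitting).
Xd_test_scaled = scaler.transform(Xd_test)

print("Largest absolute train feature mean:", abs(Xd_train_scaled.mean(axis=0)).max())
print("Scaled test shape:", Xd_test_scaled.shape)
```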


#### 3. **Train a multiclass logistic regression classifier**

Multiclass classification is a type of supervised learning problem where the goal is to assign an input sample to one of several classes. It involves classifying data into more than two mutually exclusive classes. 

Multinomial logistic regression is a well-known method for multiclass classification that provides a probabilistic framework for assigning class labels. Its decision boundaries are linear in the feature space. When the classes are not linearly separable it still yields a well-defined probabilistic model, but capturing nonlinear boundaries requires additional feature transformations. It is widely used as a baseline in applications such as image recognition, text categorization, and multi-class sentiment analysis.

**Train a multinomial logistic regression model on the preprocessed data (scikit-learn `LogisticRegression()`)**: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


The following are some important hyperparameters that can significantly influence the performance of the model:

**regularization type (penalty = {'l1', 'l2', 'elasticnet', None}, default='l2')** 

**regularization strength (C)**

**class_weight**

**solver algorithm (solver)** 

**maximum iterations (max_iter)**

Your task is to systematically tune these hyperparameters by conducting experiments and evaluating the model's performance with different parameter combinations.
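As a rough illustration of where these hyperparameters enter, the sketch below passes several of them to the `LogisticRegression` constructor and prints the predicted class probabilities. It assumes the bundled `load_digits` data as a stand-in for MNIST, leaves `penalty` at its default, and uses hypothetical names such as `demo_model`; the values shown are not recommendations.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: small stand-in data so the sketch runs quickly and offline.
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=500, test_size=100, random_state=400, stratify=y_demo
)
scaler = StandardScaler().fit(Xd_train)
Xd_train_s = scaler.transform(Xd_train)
Xd_test_s = scaler.transform(Xd_test)

# Hyperparameters are fixed at construction time, before fit() is called.
demo_model = LogisticRegression(
    C=1.0,              # inverse regularization strength (smaller = stronger)
    class_weight=None,  # or "balanced" to reweight classes by frequency
    solver="lbfgs",     # optimization algorithm
    max_iter=500,       # iteration cap for the solver
)
demo_model.fit(Xd_train_s, yd_train)

# The model is probabilistic: each row holds one probability per class.
proba = demo_model.predict_proba(Xd_test_s[:3])
print(np.round(proba, 3))
print("Row sums:", proba.sum(axis=1))
```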



In [14]:
# Train a multinomial logistic regression model on the preprocessed data using the default parameters.
# Scikit-learn page on how to train the model: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# Call this model default_model.




In [None]:
# Compute the accuracy (or error) of the default_model on the training set 

In [None]:
# Compute the accuracy (or error) on the test set


Overfitting: When a model becomes too complex and fits the training data exceptionally well, but struggles to generalize to new, unseen test data.

Underfitting: Occurs when a model is too simple or lacks complexity, resulting in poor performance on both the training and test data because it fails to capture the underlying patterns in the dataset.
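A simple way to look for either problem is to compare training and test accuracy side by side. The sketch below does this for three illustrative values of `C` on a stand-in dataset (`load_digits`); the data, the values of `C`, and the variable names are assumptions, and the exact numbers will differ on MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: small stand-in data so the sketch runs quickly and offline.
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=500, test_size=100, random_state=400, stratify=y_demo
)
scaler = StandardScaler().fit(Xd_train)
Xd_train_s = scaler.transform(Xd_train)
Xd_test_s = scaler.transform(Xd_test)

results = {}
for C in (1e-4, 1.0, 1e4):  # strong, moderate, and weak regularization
    model = LogisticRegression(C=C, max_iter=1000).fit(Xd_train_s, yd_train)
    train_acc = model.score(Xd_train_s, yd_train)
    test_acc = model.score(Xd_test_s, yd_test)
    results[C] = (train_acc, test_acc)
    print(f"C={C:g}: train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```

A large gap between training and test accuracy suggests overfitting, while low accuracy on both sets suggests underfitting.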

In [3]:
# Are there any issues of overfitting or underfitting?






#### 4. Train a different model with a different set of hyperparameters:
For example, use C = 50.0/train_samples, penalty="l1", solver="saga", tol=0.1

Compute the train and test set accuracy. Did the train and test set accuracy improve?
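The suggested configuration could be set up roughly as follows. This is a sketch on a stand-in dataset (`load_digits`) with a hypothetical `train_samples_demo` in place of `train_samples`; the sparsity measure is an extra diagnostic that is not required by the task. The way the penalty is specified has changed across scikit-learn releases, so check the documentation for your installed version.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: small stand-in data; train_samples_demo mirrors train_samples.
train_samples_demo = 500
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=train_samples_demo, test_size=100,
    random_state=400, stratify=y_demo,
)
scaler = StandardScaler().fit(Xd_train)
Xd_train_s = scaler.transform(Xd_train)
Xd_test_s = scaler.transform(Xd_test)

# L1 penalty with the saga solver and a loose tolerance, as suggested above.
alt_model = LogisticRegression(
    C=50.0 / train_samples_demo, penalty="l1", solver="saga", tol=0.1
)
alt_model.fit(Xd_train_s, yd_train)

alt_train_acc = alt_model.score(Xd_train_s, yd_train)
alt_test_acc = alt_model.score(Xd_test_s, yd_test)
# Percentage of coefficients that are exactly zero.
sparsity = np.mean(alt_model.coef_ == 0) * 100
print(f"train={alt_train_acc:.3f} test={alt_test_acc:.3f} sparsity={sparsity:.1f}%")
```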

#### 5. Question
Is it considered a valid and appropriate practice to tune the model parameters solely based on evaluating the performance on the train and test sets? Are there any limitations or potential drawbacks to this approach?

#### 6. Grid Search

As you may have noticed in the tasks above, manual hyperparameter tuning requires a lot of effort and many experiments. What alternative approaches can you suggest to reduce the time and effort involved?


Using **grid search within a cross-validation framework** is a more systematic and reliable way to tune an algorithm's hyperparameters. Instead of relying on a single train/test split, the test set is held out and the training data is split into k folds. Each candidate configuration is trained on k-1 folds and validated on the remaining fold, rotating so that every fold serves as the validation set once. The hyperparameters are then selected based on the average validation performance across folds. The test set is reserved solely for reporting the model's accuracy on unseen data.
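The cross-validation step on its own, for a single fixed configuration, might look like the sketch below. It assumes a stand-in dataset (`load_digits`), 5 folds, and a pipeline so that the scaler is refit inside each training fold; all names are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: small stand-in data so the sketch runs quickly and offline.
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=500, test_size=100, random_state=400, stratify=y_demo
)

# Scaling lives inside the pipeline, so each fold fits its own scaler
# on that fold's training portion only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=500))

# Only the training set is used here; the test set stays untouched.
fold_scores = cross_val_score(pipe, Xd_train, yd_train, cv=5, scoring="accuracy")
print("Validation accuracy per fold:", fold_scores.round(3))
print("Mean validation accuracy:", round(fold_scores.mean(), 3))
```

Grid search repeats this procedure for every candidate configuration and keeps the one with the best mean validation score.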

Scikit-learn provides functions for performing grid search: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV

**Perform the following task**: 

Perform grid search on the parameters to obtain your best performing model. Use a 10-fold cross-validation strategy. Use 'accuracy' as the scoring metric (maximize the accuracy). Evaluate and tune the parameters: C, penalty, tol, and solver to achieve the best results.
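As a starting point, a grid search with 10-fold cross-validation could be structured along these lines. The sketch assumes a stand-in dataset (`load_digits`), a deliberately tiny grid over `C`, `tol`, and `solver`, and hypothetical step names `scaler` and `clf`; `penalty` can be added to the grid in the same way, provided every solver in the grid supports the chosen penalties.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: small stand-in data so the sketch runs quickly and offline.
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=300, test_size=100, random_state=400, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=300)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "clf__C": [0.01, 1.0],
    "clf__tol": [1e-3, 1e-1],
    "clf__solver": ["lbfgs", "saga"],
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(Xd_train, yd_train)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
# By default the best configuration is refit on the full training set.
print("Test accuracy:", round(search.score(Xd_test, yd_test), 3))
```

The full table of results is available in `search.cv_results_`, which can be loaded with `pd.DataFrame(search.cv_results_)` to compare configurations.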




In [6]:
## Implement grid search here. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV



In [7]:
# Which set of hyperparameters did the grid search find in your experiment?


# What accuracy was achieved? Is it better or worse than with the manually set hyperparameters in the previous tasks?


#Analyze the results to gain a better understanding of the relationship between hyperparameters and model performance.



#Is accuracy alone a good metric for evaluating the performance?
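To look beyond a single accuracy number, per-class metrics and a confusion matrix can be computed as sketched below. The stand-in dataset (`load_digits`), the default model, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: small stand-in data so the sketch runs quickly and offline.
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=500, test_size=100, random_state=400, stratify=y_demo
)
scaler = StandardScaler().fit(Xd_train)
model = LogisticRegression(max_iter=500).fit(scaler.transform(Xd_train), yd_train)
yd_pred = model.predict(scaler.transform(Xd_test))

print("Accuracy:", round(accuracy_score(yd_test, yd_pred), 3))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(yd_test, yd_pred)
print(cm)

# Precision, recall, and F1 for each class separately.
print(classification_report(yd_test, yd_pred))
```

Off-diagonal entries of the confusion matrix show which digits are confused with each other, which overall accuracy does not reveal.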

#### 7. Questions
Grid search is known to be computationally intensive. What makes grid search a computationally expensive technique?

Can you suggest some approaches to make grid search more efficient in terms of computational resources and time?
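One option worth comparing against, among several, is a randomized search with a fixed budget of candidates. The sketch below assumes a stand-in dataset (`load_digits`), 3 folds, six sampled candidates, and hypothetical step names; it illustrates the mechanics rather than a recommended configuration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: small stand-in data so the sketch runs quickly and offline.
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=300, test_size=100, random_state=400, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=300)),
])

# Continuous distributions replace the fixed grid; n_iter caps the budget.
param_distributions = {
    "clf__C": loguniform(1e-3, 1e1),
    "clf__tol": loguniform(1e-4, 1e-1),
}
random_search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=6, cv=3, scoring="accuracy", random_state=0
)
random_search.fit(Xd_train, yd_train)

print("Candidates evaluated:", len(random_search.cv_results_["params"]))
print("Best parameters:", random_search.best_params_)
print("Best mean CV accuracy:", round(random_search.best_score_, 3))
```

The total number of model fits is `n_iter` times the number of folds (plus one final refit), so the cost is set directly by the budget rather than by the size of a grid. Both `GridSearchCV` and `RandomizedSearchCV` also accept `n_jobs` to run fits in parallel.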