## Introduction to Model Selection

#### Choice of Model Selection :
  - Trial and Error
  - Domain Knowledge : heavily influences all of the following
       - Model Selection
       - Model Evaluation
       - Model Comparison
       
The central issue in all of machine learning is **how do we extrapolate learnings from a finite amount of available data to all possible inputs ‘of the same kind’**? Training data is always finite, and yet, the model is supposed to learn all about the task at hand from it and perform well on unseen data.

### Occam's Razor 

**A predictive model should be as simple as possible; the simpler, the better.** Often referred to as **Occam’s Razor**, this is not just a convenience but a fundamental tenet of machine learning. Before we explore this further, let’s first build some intuition for what it means for a model to be ‘simple’. To measure simplicity, we often use its complementary notion: the complexity of a model. The more complex the model, the less simple it is. There is no universal definition of model complexity in machine learning. However, here are a few typical ways of looking at the complexity of a model.

  1. Number of parameters required to specify the model completely. For example, in a simple linear regression for the response attribute y on the explanatory attributes $x_{1}$, $x_{2}$, $x_{3}$, the model y = a$x_{1}$ + b$x_{2}$ is **‘simpler’** than the model y = a$x_{1}$ + b$x_{2}$ + c$x_{3}$; the latter requires 3 parameters compared to the 2 required for the first model.
  
  2. The **degree** of the function, if it is a **polynomial**. Considering regression again, the model y = a$x_{1}^{2}$ + b$x_{2}^{3}$ is more complex than a linear model because it is a polynomial of degree 3.
  
  3. Size of the best-possible representation of the model, for instance the number of bits in a binary encoding of the model. The more complex the coefficients in the model (messy, too many bits of precision, large numbers, etc.), the more complex the model is. For example, the expression (0.552984567 ∗ $x^{2}$ + 932.4710001276) could be considered more **‘complex’** than, say, (2x + 3$x^{2}$ + 1), though the latter has more terms in it.

  4. The depth or size of a decision tree.
  
Intuitively, the more complex the model, the more **‘assumptions’** it entails. Occam’s Razor is therefore a simple rule of thumb: given two models that show similar ‘performance’ on the finite training or test data, we should pick the one that makes fewer assumptions about the data that is yet to be seen. That essentially means we need to pick the ‘simpler’ of the two models. In general, among the ‘best performing’ models on the available data, we pick the one that makes the fewest assumptions, or equivalently the simplest among them.
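
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the candidate names, parameter counts, scores and the `tolerance` threshold are all made-up assumptions, and the parameter count is just one of the complexity measures listed earlier.

```python
# Hypothetical candidates: (name, number of parameters, validation score).
# All numbers are invented purely for illustration.
candidates = [
    ("linear, 2 features", 2, 0.81),
    ("linear, 3 features", 3, 0.82),
    ("degree-9 polynomial", 10, 0.83),
    ("constant only", 1, 0.40),
]


def pick_simplest(models, tolerance=0.03):
    """Among models scoring within `tolerance` of the best, return the simplest."""
    best_score = max(score for _, _, score in models)
    close_enough = [m for m in models if best_score - m[2] <= tolerance]
    # Simplicity is measured here by the parameter count (an assumption).
    return min(close_enough, key=lambda m: m[1])


chosen = pick_simplest(candidates)
print(chosen[0])
```

The constant-only model is the simplest overall, but it is not picked because its score is far from the best; simplicity only breaks ties among models of similar performance.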


**Ques 1** : The central issue in machine learning can be said to be the study of --------:

**Ans** : How to extrapolate learnings from a finite amount of data to explain or predict all possible inputs of the same type (Machine learning does not simply involve building models to fit the available data. The real challenge is to find patterns that can be used to explain the behaviour of similar unseen data.)

**Ques 2** : What does Occam's razor, a fundamental principle, suggest?

**Ans** : A model should be as simple as possible but robust (Occam’s razor does not suggest that a model should be unjustly simplified until no further simplification is possible. It suggests that when faced with a trade-off between a complex and a simple model, with all other things being roughly equal, you are better off choosing the simpler one.)

**Ques 3** : Which of the following is the simplest regression model (all lowercase alphabets are features)?

**Ans** : Y = x + 3w

**Note** : A **learning algorithm** learns from training data and produces a **model**.

**Ques 1** : Which of the following statements is true?

**Ans** : The learning algorithm is instructed what needs to be done; it figures out how it needs to be done and returns a model.


### Simplicity, Complexity and Overfitting

When choosing a model, some practical issues we have to deal with :
   - Type of data
   - Data quality in terms of missing values etc.
   - Dimensionality of data

Classes of models, in terms of what they can handle :
   1. High-dimensional data
   2. Noisy data
   3. Real-time data
   4. Large amounts of data
   5. Missing data

Every class of models has its own strengths and its own limitations. This raises two questions :

   - Which one do I pick ?
   - How do I make the decision ?
   
**Do not use the training data for Model Evaluation. When we compare two models, we have to test them on data that they have not encountered before, i.e. data they have not been trained on.**


#### Choice of Learning Algorithm

If we decide to choose a regression model, what kind of regression should it be :
   1. Straight line, or
   2. Complex curve
   
The error committed by a straight line is :
    $\sum_{i = 1}^{n} (Error_{i})^{2} > 0$

On noisy data this error is practically never 0, so we will not get a linear regression model that makes zero error. For linear regression, the **Hypothesis Class** is the set of all **straight lines**.

For a **curve shape**, the **Hypothesis Class** we consider is the set of **polynomials**.
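
A minimal sketch of the two hypothesis classes, assuming synthetic data (a noisy straight line) and NumPy's `Polynomial.fit` for the least-squares fit. The slope, intercept, noise level and the choice of degree 5 are illustrative assumptions, not values from these notes.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy straight line.
x = np.linspace(0.0, 1.0, 30)
y = 3.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)


def training_sse(degree):
    """Fit a polynomial of the given degree by least squares; return the SSE."""
    model = Polynomial.fit(x, y, deg=degree)
    residuals = y - model(x)
    return float(np.sum(residuals ** 2))


sse_line = training_sse(1)   # hypothesis class: straight lines
sse_curve = training_sse(5)  # hypothesis class: degree-5 polynomials

print(f"straight line SSE: {sse_line:.4f}")
print(f"degree-5 SSE:      {sse_curve:.4f}")
```

The straight line leaves a non-zero training error, and the richer polynomial class can only match or reduce that training error. Whether this helps on unseen data is a separate question.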


### Simple vs Complex Model :

   1. A simpler model is usually more generic than a complex model. This is important because generic models tend to perform better on unseen data sets.
   2. A simpler model requires fewer training data points. This becomes extremely important because in many cases, one has to work with limited data points.
   3. A simple model is more robust and does not change significantly if the training data points undergo small changes.
   4. A simple model may make more errors in the training phase but often outperforms complex models on new data. This happens because complex models are prone to overfitting.
   
   
**Ques 1** : Why are simpler models considered to be better than complex models? (Note: More than one option may be correct.)

**Ans** : 

   1. Simpler models are generic, i.e., they apply to a wider range of data.
   2. Complex models make assumptions about the data, which are likely to be wrong.
   3. Simpler models require less training data compared with complex models.
   4. Simpler models are more robust.
   
**Ques 1** : Which of the following is an extreme case of overfitting?

**Ans** : The first person has mugged up all the possible questions from numerous textbooks and preparation material.

**Ques 2** : Which of the following is a disadvantage that the first person is likely to face because of a complex model? (Note: More than one option may be correct.)

**Ans** : 

   1. They will need more training data to ‘learn’.
   2. Despite the training, they may not learn and perform poorly in the real world.
   
   
### Overfitting

Overfitting is a phenomenon wherein a model becomes highly specific to the data on which it is trained and fails to generalise to other unseen data points in a larger domain. A model that has become highly specific to a training data set has ‘learnt’ not only the hidden patterns in the data but also the noise and the inconsistencies in it. In a typical case of overfitting, a model performs quite well on the training data but fails miserably on the test data. 

When a model does extremely well on the training data but performs noticeably worse the moment we take it outside the training data, that is the typical sign of a model that has started to overfit.
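
The train/test gap described above can be reproduced with a small experiment. This is a sketch under stated assumptions: the data is synthetic (a noisy line), the training set is deliberately tiny, and `make_data` and `r_squared` are helper names introduced here for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)


def make_data(n):
    """Synthetic data (an assumption): y = 2x + 55 plus unit-variance noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + 55.0 + rng.normal(size=n)
    return x, y


def r_squared(y_true, y_pred):
    """Coefficient of determination; it can be negative on unseen data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # unseen data

scores = {}
for degree in (1, 12):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    scores[degree] = (
        r_squared(y_train, model(x_train)),
        r_squared(y_test, model(x_test)),
    )
    print(f"degree {degree:2d}: train R^2 = {scores[degree][0]:.2f}, "
          f"test R^2 = {scores[degree][1]:.2f}")
```

The degree-12 polynomial has almost as many coefficients as there are training points, so it can nearly memorise them, which shows up as a high training score and a much lower test score.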


**Ques 1** : Why does the possibility of overfitting exist primarily?

**Ans** : Models are trained on a set of training data, but their efficacy is determined by their ability to perform well on unseen (test) data. (It is possible to memorise the training data while failing to truly learn the underlying trends and patterns. On unseen data (read tricky but unseen exam questions), memorising is bound to fail.)

**Ques 2** : Which of the following is a clear sign of overfitting in linear regression?

**Ans** : The R-squared value is 0.90 and 0.30 on train and test data, respectively. (In overfitting, the model fits the training data quite well because it has memorised it.)

**Ques 3** : Suppose you have 1,00,000 Google images as training observations, and you are trying to build a neural network to classify these images in the following three classes: nature, cities and others. You use another 50,000 observations to test it, and the accuracy on the test set comes out to be 10%. Which of the following can you use to check whether the model has overfitted?

**Ans** : Accuracy on the training set (If the training accuracy is much higher than 10%, the model is likely to have overfitted. Neural networks, which you will learn about later, can be made extremely complex using a number of hyperparameters such as hidden layers and the number of neurons. However, the basic idea remains the same: memorisation of training data and failure on test data.)


### Bias-Variance Tradeoff

**Variance** : How sensitive the model is to changes in the input (training) data, i.e. how **consistent** the model is.

**Bias** : How much error the model is likely to make on test data; this is more about **correctness**.

For a complex model : Variance is **High** and Bias is **Low**

For a simple model : Variance is **Low** and Bias is **High**

Total Error = Variance + Bias (more precisely, for squared-error loss, expected error = Bias² + Variance + irreducible noise)

We considered the example of a model memorising the entire training data set. If you change the data set slightly, this model will also need to change drastically. The model is, therefore, **unstable and sensitive to changes in training data**, and this is called **high variance**.

The ‘variance’ of a model is the **variance in its output** on some test data with respect to the changes in the training data. In other words, variance here refers to the **degree of changes in the model itself** with respect to changes in the training data.

**Variance** refers to changes in the model as a whole when it is trained on a different data set. A high-degree polynomial that overfits the data will change drastically when the data changes.

**Bias** quantifies how **accurate the model is likely to be** on future (test) data. Extremely simple models are likely to fail in predicting complex real-world phenomena. Simplicity has its own disadvantages.

Ideally, we want to reduce both bias and variance because the expected total error of a model is the sum of the errors in bias and variance, as shown in the figure given below.

![image.png](attachment:image.png)

In practice, however, we often cannot have a model with a low bias and a low variance. As the model complexity increases, the bias reduces, whereas the variance increases and, hence, the trade-off.
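
The trade-off can be estimated numerically by refitting the same kind of model on many noisy training sets. The sketch below is a simulation under assumptions that are not from these notes: the ground truth is taken to be sin(2πx), the inputs are kept fixed while only the noise is redrawn, and degrees 1 and 9 stand in for a 'simple' and a 'complex' model.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_function(x):
    """Assumed ground truth for the simulation (not from the text)."""
    return np.sin(2.0 * np.pi * x)


x_train = np.linspace(0.0, 1.0, 20)   # fixed inputs; only the noise changes
x_test = np.linspace(0.1, 0.9, 25)    # points where predictions are compared
n_datasets = 200
noise_sd = 0.3


def bias_and_variance(degree):
    """Refit on many noisy training sets; return (bias^2, variance) averaged over x_test."""
    predictions = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        y_train = true_function(x_train) + rng.normal(scale=noise_sd, size=x_train.size)
        model = Polynomial.fit(x_train, y_train, deg=degree)
        predictions[i] = model(x_test)
    mean_prediction = predictions.mean(axis=0)
    bias_sq = float(np.mean((mean_prediction - true_function(x_test)) ** 2))
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_sq, variance


bias_simple, var_simple = bias_and_variance(1)
bias_complex, var_complex = bias_and_variance(9)

print(f"degree 1: bias^2 = {bias_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 9: bias^2 = {bias_complex:.4f}, variance = {var_complex:.4f}")
```

Here the variance is measured exactly as defined above: the spread of the model's output on fixed test points as the training data changes.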

Recall that in the competitive exam analogy, the first person learns using a much more complex mental model than the second one.


**Ques 1** : What do the mental models of the first and the second person, respectively, have? (Note: More than one option may be correct.)

**Ans** : 
   
   1. Student 1 = high variance, student 2 = high bias
   2. Student 1 = low bias, student 2 = low variance

**Ques 2** : Suppose Rohit builds two linear regression models to solve the car pricing problem. Model 1 has three features, and model 2 has 11 features. Which of these two models is likely to undergo a larger change when a new training data set is used?

**Ans** : Model 2


### Comprehension - Bias Variance Tradeoff

A data set was artificially generated in the form (x, 2x + 55 + e), where e is normally distributed noise with a mean of zero and a variance of 1. The following three regression models have been created to fit the data: linear, a degree-15 polynomial and a higher degree polynomial that passes through all the training points.

![image-2.png](attachment:image-2.png)

**Ques 1** : Which of the following is the correct order of bias in the three models?

**Ans** : Straight Line > Degree 15 > Polynomial

**Ques 2** : Why is the variance in the higher degree polynomial said to be higher than the other two models?

**Ans** : The model will change drastically from its current state if the current training data is altered


### Regularization

Having established that we need to find the **correct balance between model bias and variance**, or between **simplicity and complexity**, we need tools that can reduce or increase the complexity. Regularization methods are used to keep model complexity in check.

   1. Regularization prevents the model from becoming overly complex.
   2. It is part of the learning algorithm.
  
So most machine learning algorithms used in practice have some **regularization** step built in, which ensures that the output of the learning algorithm, i.e. the model, is not unnecessarily complex.

   1. **For Regression** this involves adding a regularization term to the cost that adds up the absolute values or the squares of the parameters of the model.
   2. **For decision trees** this could mean ’pruning’ the tree to control its depth and/or size.
   3. **For neural networks** a common strategy is dropout: dropping a few neurons and/or weights at random during training.


**Ques 1** : When is regularization typically performed?

**Ans** : While the learning algorithm uses the training data to produce a model (The learning algorithm performs the process while it is learning from the training data.)


### Regularization and Hyperparameters

**Regularization** discourages the model from becoming highly complex even if it explains the (training) observations better. In the previous session, you were introduced to this term that is used to find the optimal point between extreme **complexity** and **simplicity**.

**Hyperparameters** are parameters that we pass on to the learning algorithm to control the complexity of the final model. They are choices that the algorithm designer makes to ‘tune’ the behaviour of the learning algorithm. Therefore, the choice of hyperparameters has a lot of bearing on the final model produced by the learning algorithm.

**Hyperparameters** ----->>>>(pass) **Learning Algorithm** ----->>>>(Control the Complexity) **Final Model**

For Regression :

$\min_{a,b} \sum_{i = 1}^{n} (y_{i}-ax_{i}-b)^{2} + \lambda(a^{2}+b^{2})$

$(y_{i}-ax_{i}-b)^{2}$ = Error Term

$\lambda(a^{2}+b^{2})$ = Regularization Term

$\lambda$ = Hyperparameter

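When $\lambda$ is at its minimum ($\lambda = 0$), only the error term is minimised.

As $\lambda$ grows towards its maximum, the regularization term dominates and the parameters are pushed towards zero.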
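
For this particular objective the minimiser can be written in closed form, which makes the effect of $\lambda$ easy to inspect. The sketch below assumes NumPy; the function name `fit_regularized_line` and the data values are made up for illustration. Note that, following the formula above, the intercept b is penalised along with a.

```python
import numpy as np


def fit_regularized_line(x, y, lam):
    """Minimise sum((y - a*x - b)**2) + lam * (a**2 + b**2) in closed form."""
    X = np.column_stack([x, np.ones_like(x)])        # columns for a and b
    # Setting the gradient to zero gives (X^T X + lam * I) theta = X^T y.
    theta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    a, b = theta
    return float(a), float(b)


# Made-up data for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

for lam in (0.0, 1.0, 100.0):
    a, b = fit_regularized_line(x, y, lam)
    print(f"lambda = {lam:6.1f} -> a = {a:.3f}, b = {b:.3f}")
```

Here `lam` is the hyperparameter passed in, while `a` and `b` are the model parameters returned by the learning algorithm.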

Training data ====>>>> Validation data (used to tune the hyperparameters so that the model is optimally complex) ====>>>> Test data (unseen by the model)

Hyperparameters ====>>>>(Input) Learning Algorithm ====>>>>(Output) Model Parameters

**The concept of hyperparameters has been summarised below :**

   1. Hyperparameters are used to 'fine-tune' or regularize the model to keep it optimally complex.
   2. The learning algorithm is given the hyperparameters as the input, and it returns the model parameters as the output.
   3. Hyperparameters are not part of the final model output.


### Model Evaluation and Cross Validation

The key point to remember here is that a **model should never be evaluated on data that it has already seen before**. With that in mind, you will have either one of the following two cases: 
   1. The training data is abundant (huge)
   2. The training data is limited
   
In the first case, we can use as many observations as we like to train and test the model.

In the second case, we need some **‘hack’** so that the model can be evaluated on unseen data without eating into the data available for training. This hack is called **cross-validation**.

**Cross-validation** is a resampling procedure used to evaluate machine learning models on a limited data sample.

The general procedure is as follows:

   1. Shuffle the dataset randomly.
   2. Split the dataset into k groups.
   3. For each unique group:
      - Take the group as a hold-out or test data set.
      - Take the remaining groups as a training data set.
      - Fit a model on the training set and evaluate it on the test set.
      - Retain the evaluation score and discard the model.
   4. Summarize the skill of the model using the sample of model evaluation scores.
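
The procedure above can be sketched with the Python standard library alone. The 'model' here is a deliberately trivial placeholder that predicts the training mean, and the target values are made up; the helper names (`k_fold_indices`, `fit_mean_model`, `cross_validate`) are hypothetical and only serve to mirror the steps.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean_model(y_values):
    """Placeholder 'learning algorithm': always predict the training mean."""
    return mean(y_values)


def mean_squared_error(y_values, prediction):
    return mean((y - prediction) ** 2 for y in y_values)


def cross_validate(y, k):
    folds = k_fold_indices(len(y), k)
    scores = []
    for test_fold in folds:
        held_out = set(test_fold)
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        test_y = [y[i] for i in test_fold]
        model = fit_mean_model(train_y)                    # fit on the remaining folds
        scores.append(mean_squared_error(test_y, model))   # keep only the score
    return scores


# Made-up target values for illustration.
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.5, 6.5, 4.5, 7.5]
scores = cross_validate(y, k=5)
print("per-fold MSE:", [round(s, 3) for s in scores])
print("mean MSE:", round(mean(scores), 3))
```

Every observation lands in exactly one fold, so each one is used for testing once and for training in the other k-1 rounds.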

![image-3.png](attachment:image-3.png)


**Ques 1** : Why is it often not possible to use the validation set approach?

**Ans** : The data available for training and testing is limited.

**Note** : A **hyperparameter** for the **Linear Regression** model is the **number of features** being used for training.

**Note** : Fitting a polynomial to the data is known as **Polynomial Regression**.

For linear regression of degree 1, you fit the curve of the following form:

y = $\beta_{0} + \beta_{1}x_{1}$

For polynomial regression of degree n, you fit the curve of the following form:

y = $\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{1}^{2} + \beta_{3}x_{1}^{3} + \dots + \beta_{n}x_{1}^{n}$

![image-4.png](attachment:image-4.png)

As you can observe, the training score keeps increasing while the test score decreases as the degree of the polynomial increases. This is a clear sign of **overfitting**.


### Cross-Validation: Motivation

**Cross-validation** can be used to select the **number of features** for a linear regression model.

The number of input variables or features in a linear regression model is a **hyperparameter**. RFE (recursive feature elimination) selects the features once this number is fixed, but selecting the optimal number itself is a tough task: if the number of features is extremely large (for example, 150), it becomes impractical to do it manually.

In such a case, you have an option to automate it using **cross-validation**.

The approaches to hyperparameter tuning, and the problems associated with the manual ones, are as follows:
   1. **Split into Train and Test sets** : Tuning a hyperparameter against the test set makes the model **see** the test data. Also, the results depend on the specific train-test split.
   2. **Split into Train, Validation and Test sets** : The validation data eats into the training set.
   3. **Cross-Validation** : Split the data into train and test sets and train multiple models by sampling the train set. Finally, you use the test set only once, to test the model built with the chosen hyperparameter.
   
Specifically, we will perform **k-fold cross-validation**, wherein you divide the training data into **k folds**.

![image-5.png](attachment:image-5.png)

In **K-fold CV**, you divide the training data into K groups of samples. If K=4 (say), you use **K-1** (3) folds to build the model and test it on the remaining fold, repeating this so that each fold serves as the test fold once.

### Cross-Validation: Hyperparameter Tuning

You can use scikit-learn's `GridSearchCV` to tune the hyperparameters in Python. The grid is a matrix that is populated on each iteration. Please refer to the image given below.

![image-6.png](attachment:image-6.png)

You can train the model using a different value of the hyperparameter each time. The estimated score is then populated in the grid.

You can use the plot of **mean test** and **train scores** to choose the **optimal number for the hyperparameter**.
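
A minimal sketch of this workflow with scikit-learn, tuning the number of features kept by RFE around a linear regression. The data set is synthetic (generated with `make_regression`), and the grid range, the 5 folds and the R² scoring are illustrative assumptions rather than settings taken from these notes.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data (an assumption): 10 features, only 3 of them informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# The hyperparameter being tuned: how many features RFE keeps.
param_grid = {"n_features_to_select": list(range(1, 11))}

search = GridSearchCV(
    estimator=RFE(estimator=LinearRegression()),
    param_grid=param_grid,
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    return_train_score=True,   # needed to compare train and test curves
)
search.fit(X, y)

results = search.cv_results_
for n, train, test in zip(param_grid["n_features_to_select"],
                          results["mean_train_score"],
                          results["mean_test_score"]):
    print(f"{n:2d} features: mean train R^2 = {train:.3f}, mean test R^2 = {test:.3f}")

print("best number of features:", search.best_params_["n_features_to_select"])
```

The `mean_train_score` and `mean_test_score` arrays are the values one would plot against the number of features to pick the optimal setting.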

The various types of cross-validation are as follows:

   1. K-fold cross-validation
   2. Leave one out (LOO) cross-validation
   3. Leave P-out (LPO) cross-validation
   4. Stratified K-Fold cross-validation
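
As a rough illustration, scikit-learn provides a splitter class for each of these schemes in `sklearn.model_selection`. The sketch below uses six made-up samples and only counts how many train/test splits each scheme produces; the sample data and the choices k=3 and p=2 are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, StratifiedKFold

# Six made-up samples with two balanced classes.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])

splitters = {
    "K-fold (k=3)": KFold(n_splits=3),
    "Leave one out": LeaveOneOut(),
    "Leave P out (p=2)": LeavePOut(p=2),
    "Stratified K-fold (k=3)": StratifiedKFold(n_splits=3),
}

n_splits = {name: cv.get_n_splits(X, y) for name, cv in splitters.items()}
for name, count in n_splits.items():
    print(f"{name}: {count} train/test splits")

# Stratified folds keep the class proportions in every test fold.
stratified_test_labels = [
    sorted(y[test_idx].tolist())
    for _, test_idx in StratifiedKFold(n_splits=3).split(X, y)
]
print("labels in each stratified test fold:", stratified_test_labels)
```

Leave-one-out and leave-P-out generate far more splits than K-fold, which is why they become expensive on anything but small data sets.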
   
#### The following points regarding model evaluation are worth reiterating:

**Validation Data**:
   - Since hyperparameters need some unseen data to tune the model, the validation set is used.
   - It prevents the learning algorithm from 'peeking' at the test data while tuning the hyperparameters.
   - A severe and practically frequent limitation of this approach is that data is often not abundant.
   
**Cross-Validation**:
   - It is a statistical technique that enables you to make extremely efficient use of the available data.
   - It divides the data into several pieces or 'folds' and uses each piece as test data one at a time.


### Practice Questions

**Ques 1** : Why should we have disjoint training and test data sets? (This means that a model should not be tested with data on which it is trained.)

**Ans** : Any model needs to be tested on how well it would work in the proverbial ‘real’ world because once a model has seen the data, it can attempt to ‘memorise’ it, and once that is done, testing it on the same data set will not help in determining its performance on unseen data. In an ideal scenario wherein we have plenty of data, we should divide the data into three sets. The first one would be the training data on which we shall train the model. The second one would be the validation data on which we shall test the model and tune the hyperparameters. The third one would be the test data that we will use for assessing our model.

**Ques 1** : Which of the following is more likely to have a low variance?

**Ans** : Weak Learner (Weak learners create simpler models that have a lower variance. They are not able to model complex relationships and, hence, create a more generic model.)

**Ques 1** : Suppose two linear regression models are provided on the same data set with 100 attributes. Model A has 10 attributes, whereas model B has 90 attributes. What can you say about the two models?

**Ans** : 

Model B has tried to memorise the data, and when the training data changes slightly, the expected results will change.(Model B has many attributes and is probably overfitted.)

Model B will have a much higher variance than model A.

Model A will have a higher bias. (Model A has made numerous assumptions about the data to keep the model simple, leading to a high bias.)

**Ques 2** : Suppose data is generated via a polynomial equation of degree 4 (i.e., the said polynomial equation will perfectly fit the given data). Which of the following statements is true in this case?

**Ans** : 

Linear regression will have a high bias and a low variance. (Linear regression would create a degree-1 polynomial that would be less complex than the degree-4 polynomial and, hence, would have a higher bias. Since the model is less complex and will not overfit, it would have a low variance.)

A polynomial equation of degree 4 will have a low bias and a low variance.


### Graded Questions

**Ques 1** : How do you measure the variance of a model?

**Ans** : By measuring how much the estimates of the model on the test data change when the training data is changed. (Variance measures the extent of change in a model with respect to the training data.)

**Ques 2** : What is regularization?

**Ans** : It is a technique that is used to strike a balance between model complexity and model accuracy on training data. (Regularization does not improve accuracy; it improves the balance between accuracy and complexity.)

**Ques 1** : Which of the following is incorrect with respect to creating simpler models?

**Ans** : Simpler models will always have fewer test errors than a complex model.

**Ques 2** : How would you quantify the simplicity of a model?

**Ans** : 

   1. By the number of features used in the model
   2. By the number of nodes and depth of trees in case of a tree model

**Ques 1** : Which of the following statements is correct with respect to k-fold cross validation?

**Ans** : Both A and B (Training happens for k times, and a higher k would imply a higher run time for training with k-fold cross validation. Also, a higher k implies that the training set is bigger and is a better representation of the actual data each time.)

**Ques 2** : Which of the following about k-fold cross validation is not true?

**Ans** : 
    
   1. You repeat the process k-1 times. (With k-fold cross validation, training happens k times, and each sample is used as validation data exactly once.)
   2. A model trained with k-fold cross validation will never overfit. (Overfitting is possible with cross validation. Cross-validation does not prevent overfitting by itself, but it may help in identifying a case of overfitting.)
    