# Model Evaluation Exercise: Hyperparameter Tuning

In today's session, you learned how to evaluate a model to determine whether it performs well. In this exercise, you are going to apply what you have learned to **hyperparameter tuning**.   
  
Hyperparameters are values in our machine learning model that we have to set, but that are not learned from the data. For example, when we used a random forest model, we used the following code: 
  
```
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
``` 
  
Here, `n_estimators` and `random_state` are the hyperparameters. We choose their values ourselves, and the model changes depending on those values, but there is no way to *learn* them directly from the data. `n_estimators` is the number of decision trees in our random forest, and `random_state` is the "seed" used for random number generation, which makes our model reproducible.
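
To make the roles of these two hyperparameters concrete, here is a minimal sketch. It assumes a synthetic dataset from `make_classification` as a stand-in (it is not the HNSC data), and the names `X_demo`, `y_demo`, `rf_a` and `rf_b` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; this is not the HNSC dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Two forests built with identical hyperparameters, including the seed
rf_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# n_estimators sets how many decision trees the forest contains
n_trees = len(rf_a.estimators_)

# Same seed and same data give identical predicted probabilities
same_output = bool((rf_a.predict_proba(X_demo) == rf_b.predict_proba(X_demo)).all())

print(n_trees, same_output)
```

Changing `random_state` in one of the two models would generally give slightly different trees, which is why fixing it matters when you want to compare runs.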

In this notebook, you are going to vary the hyperparameters of the random forest model that you used to classify HNSC patients based on their HPV status. By carrying out several rounds of model evaluation, you are going to build the best random forest model that you can for this problem.  

If you get stuck at any point in this notebook, you can look at the `model_evaluation.ipynb` notebook that we went through in class for guidance, or you can raise your hand and one of the course instructors will come over to help you.

## 1) Build a random forest classifier for the HPV status  
  
This is exactly the same as your homework task: you can copy and paste over the code needed to do this.  
  

Remember you need to: 
- import libraries
- read in data 
- split into features and outcome
- train test split
- build a random forest model. 
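
As a reminder of how these steps fit together, here is a rough sketch on a synthetic table. Everything in it is an assumption made for illustration: the data comes from `make_classification`, the column names (`gene_0`, ..., `label`) are invented, and `test_size=0.2` with `stratify` is just one reasonable choice. Your own code should use the HNSC dataset and whatever settings you used in the homework.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real table (assumption): 300 rows, 10 features
X_arr, y_arr = make_classification(n_samples=300, n_features=10, random_state=0)
toy_df = pd.DataFrame(X_arr, columns=[f"gene_{i}" for i in range(10)])
toy_df["label"] = y_arr

# Split into features and outcome
y_toy = toy_df["label"]
X_toy = toy_df.drop(columns="label")

# Train/test split; test_size and stratify are illustrative choices
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Build and fit the random forest on the training portion only
toy_model = RandomForestClassifier(n_estimators=100, random_state=42)
toy_model.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
```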
  
In the cells below, complete the code to train the model.

In [None]:
import pandas as pd 
import numpy as np 

from sklearn...

In [4]:
df = pd.read_csv("dataset/hnsc_dataset_scaled.csv", index_col = 0)

y = df["HPV_Status"]
X = df.drop(columns = "HPV_Status")

In [None]:
X_train, X_test, y_train, y_test = train_test_split(...)

Patient ID,PRAME,ACTC1,MYBPC1,DES,MAGEA4,GSTM1,UGT1A7,TGM3,CRNN,MAGEA6,...,MT1L,UPK1A,CIDEA,SPOCK1,FABP6,PGLYRP4,ZNF681,TNNT2,FOXC2,GAD1
TCGA-4P-AA8J,1.209821,1.420797,0.866643,1.385693,1.533709,0.610027,-0.106147,-1.383249,-0.149420,0.855603,...,1.080776,1.004259,0.937387,1.457463,1.092352,-0.229496,-0.070576,1.880138,0.552338,0.633321
TCGA-BA-4074,-0.321485,0.445424,-0.100255,0.306600,-0.548650,1.038343,-0.809892,-1.474619,-1.492317,0.201494,...,1.052736,-0.338622,-1.213358,0.970665,2.406067,-0.735174,0.127294,-0.820846,0.027972,1.095503
TCGA-BA-4076,1.093123,-1.007401,-1.040044,-1.177666,1.424220,-1.008177,-0.128543,0.506885,0.546852,1.260065,...,-2.276450,1.103977,1.761758,0.378451,-0.207212,0.701801,1.123930,0.454497,0.385118,0.813089
TCGA-BA-4078,1.311364,-0.479122,-0.198360,-0.604511,-0.307156,1.650475,-0.407305,-1.277495,-0.191219,-0.565218,...,-1.127546,0.029151,-0.267558,1.186041,-2.263153,-0.969619,-0.821687,0.278721,-0.541685,0.303970
TCGA-BA-5149,-0.900534,0.657561,0.205038,0.513543,1.520438,-0.010684,-1.114986,-0.578728,-1.265923,-0.491914,...,0.518186,-0.486232,-0.827207,0.429422,-0.012086,-0.184506,0.483262,-0.485898,1.764313,-0.870219
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-UF-A7JT,-1.393825,0.234813,-0.120132,0.136731,-0.822515,0.412666,-1.114986,-0.005774,0.761446,-0.920868,...,1.345021,-0.152118,-0.751511,-0.758733,-1.195614,0.051202,-0.951983,0.612405,1.325086,0.363228
TCGA-UF-A7JV,-1.393825,-0.968131,-1.178289,-1.523572,-1.077263,-1.245706,-0.812928,-1.509589,-0.373040,-1.012838,...,3.525508,-0.991271,-1.213358,2.256465,-1.043933,-1.098901,0.363054,-1.215998,1.274656,0.132083
TCGA-UP-A6WW,-1.195943,-1.078513,-0.056759,-1.069811,-1.077263,1.233223,-0.566821,-1.714329,-1.602238,-1.020937,...,-0.506783,-0.627394,-0.821688,-0.212496,2.033128,0.360568,-1.106719,-0.646637,-0.932549,2.157792
TCGA-WA-A7GZ,1.149458,1.025579,0.912684,1.074526,-0.210438,0.999969,-0.907596,0.391393,0.261639,1.551346,...,-1.483303,0.900544,0.082143,-0.298082,-0.667375,-0.000756,1.005915,0.963146,1.189116,-1.298754


In [None]:
rf_model = 

## 2) Now evaluate that model. 
  
Some ways to evaluate the model that you might want to use are:
* accuracy 
* f1-score
* the ROC curve
* the area under the ROC curve
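
The four measures above can each be computed with `sklearn.metrics`. The sketch below is one way to do it, again on synthetic data (an assumption, not the HNSC dataset) with 0/1 labels. If your outcome column holds strings, `f1_score` would additionally need a `pos_label` argument.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 0/1 labels (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Hard class predictions are used for accuracy and F1
y_pred = model.predict(X_te)
# Predicted probability of the positive class is used for ROC and AUC
y_score = model.predict_proba(X_te)[:, 1]

acc = accuracy_score(y_te, y_pred)
f1 = f1_score(y_te, y_pred)
fpr, tpr, thresholds = roc_curve(y_te, y_score)
auc = roc_auc_score(y_te, y_score)

print(f"accuracy={acc:.3f}  f1={f1:.3f}  auc={auc:.3f}")
```

Plotting `tpr` against `fpr` (for example with matplotlib) draws the ROC curve itself.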

## 3) Now build a new model and evaluate it
  
Now, using the same data, try to build a new random forest model. You can change the hyperparameters any way that you want. The table below shows some of the main hyperparameters and example values that they can take.  
  
Once you've built the model, carry out model evaluation (as you did above) to see whether the model you have built is better or worse than the base model.

| Hyperparameter      | What It Controls                                               | Example Values      |
|---------------------|----------------------------------------------------------------|---------------------|
| `n_estimators`      | Number of trees in the forest                                  | 100, 200, 500, 1000 |
| `max_depth`         | Maximum depth of each tree (limits how complex a tree can be)  | None, 5, 10, 20     |
| `min_samples_split` | Minimum number of samples required to split an internal node   | 2, 5, 10            |
| `min_samples_leaf`  | Minimum number of samples required at a leaf node              | 1, 2, 4             |
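
One possible way to compare a base model with a re-tuned one is sketched below. The data is synthetic (an assumption), `fit_and_score` is a hypothetical helper written for this example, and the particular hyperparameter values are arbitrary picks from the table rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption; not the HNSC dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

def fit_and_score(**params):
    """Hypothetical helper: fit a forest with the given hyperparameters and score it."""
    model = RandomForestClassifier(random_state=42, **params).fit(X_tr, y_tr)
    return {
        "f1": f1_score(y_ev, model.predict(X_ev)),
        "auc": roc_auc_score(y_ev, model.predict_proba(X_ev)[:, 1]),
    }

# Base model versus one alternative set of hyperparameters
base_scores = fit_and_score(n_estimators=100)
new_scores = fit_and_score(
    n_estimators=500, max_depth=5, min_samples_split=5, min_samples_leaf=2
)

print("base:", base_scores)
print("new: ", new_scores)
```

Keeping `random_state` fixed across both models means any difference in the scores comes from the hyperparameters you changed, not from a different random seed.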

## 4) What about another set of hyperparameters?  
  
If you've got this far, how about trying to come up with another set of hyperparameters? How does that affect the model?  
  
In reality, hyperparameter optimisation for machine learning is an extremely long and tedious process! You try out hundreds of different sets of hyperparameters to build the best possible model. We are giving you a flavour of what that is like here. 
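
Trying many sets by hand quickly becomes repetitive, so in practice the loop is usually automated. The sketch below is one way this might look, not the method used in this course: it enumerates a small grid with `ParameterGrid` and scores each candidate with 3-fold `cross_val_score`. The data is synthetic, and the grid values, `cv=3` and the `roc_auc` scoring are all assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic stand-in data (assumption; not the HNSC dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# A deliberately tiny grid: 2 x 2 x 2 = 8 candidate settings
grid = ParameterGrid({
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 4],
})

results = []
for params in grid:
    model = RandomForestClassifier(random_state=42, **params)
    # Mean ROC AUC over 3 cross-validation folds for this candidate
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="roc_auc")
    results.append((scores.mean(), params))

# Pick the candidate with the highest mean score
best_score, best_params = max(results, key=lambda r: r[0])
print(round(best_score, 3), best_params)
```

Even this toy grid fits 24 forests (8 candidates times 3 folds), which gives a sense of how quickly the cost grows with more hyperparameters and more values.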

### Bonus
  
There is an error in what we have asked you to do in this notebook. Can you think what it might be?  
It relates to the train-test splitting of the data. Remember that the test set must stay independent and unseen. What have we done wrong here?