<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

# Homework Task 1: HPV Status and Head and Neck Squamous Cell Carcinoma

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## Clinical Background: Identifying HPV Status from RNA Data in Head and Neck Cancer

Human papillomavirus (HPV) is a key etiological factor in a subset of head and neck squamous cell carcinomas (HNSCC), particularly those arising in the oropharynx[1]. HPV-positive HNSCC is clinically and biologically distinct from HPV-negative disease[2]. Patients with HPV-positive tumors generally have a better prognosis and respond more favorably to treatment[3], which has led to the development of de-escalation strategies in clinical trials[4]. Therefore, accurate determination of HPV status is critical for diagnosis, prognosis, and treatment planning.

</div>

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### Traditional Detection Methods

Clinically, HPV status is typically determined through p16 immunohistochemistry (as a surrogate marker), in situ hybridization, or PCR-based methods for detecting viral DNA or RNA[5]. However, with the increasing availability of high-throughput sequencing data, computational approaches now offer alternative ways to infer HPV status using RNA expression profiles[6].

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### Why Use RNA-seq Data?

RNA sequencing (RNA-seq) provides a comprehensive snapshot of gene expression levels across the entire transcriptome. Tumors driven by HPV often exhibit distinct transcriptional signatures, not only due to the presence of viral transcripts, but also because HPV-related oncogenesis affects host gene expression patterns[7]. This makes RNA-seq a valuable tool for distinguishing between HPV-positive and HPV-negative cancers using machine learning techniques[8].

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### About the Dataset: TCGA and Pan-Cancer Atlas

The data used in this task comes from **The Cancer Genome Atlas (TCGA)**, specifically the **Pan-Cancer Atlas** collection[9]. TCGA is a landmark cancer genomics program that has molecularly characterized over 20,000 primary cancer and matched normal samples across 33 cancer types. The Pan-Cancer Atlas is an integrated dataset that combines genomic, transcriptomic, and clinical information across multiple cancer types, including head and neck squamous cell carcinoma (HNSC)[10].

For this assignment, you'll be working with **RNA-seq data** from **head and neck cancer biopsies** in TCGA, where the **HPV status** of each patient has been annotated. The goal is to use these RNA profiles to build a model that can **classify patients as HPV-positive or HPV-negative** based on their gene expression patterns.  
  
We have pre-processed the data to make the task more enjoyable. Whilst RNA-seq can pick up over 24,000 genes, we've taken only 1000 of them. We've also scaled the gene expression levels so that every gene is on a comparable scale (the values are centred around 0, so you will see both positive and negative numbers). Many machine learning methods work better when all of the features (genes in this case) have values that are on the same scale as each other, rather than some being much larger or smaller than others.
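
To make the idea of a "common scale" concrete, here is a minimal sketch of one common approach: standardising each gene so that it has mean 0 and standard deviation 1. The numbers and variable names are made up for illustration, and this is not necessarily the exact pre-processing that was applied to the course dataset.

```python
import numpy as np

# Hypothetical raw expression values: 4 patients (rows) x 2 genes (columns).
# The second gene is on a much larger scale than the first.
raw = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 2000.0],
    [4.0, 4000.0],
])

# Standardise each gene (column): subtract its mean, divide by its standard deviation
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.round(2))
```

After this step, both genes vary over a similar range, so neither dominates simply because of its units.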

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## The Task

The assignment here is to use the RNA profiles provided to build a machine learning model that will be able to classify patients as HPV-positive or HPV-negative based on their gene expression patterns.  
  
We will use a **random forest algorithm** to do this. Random forests are a great, general-purpose machine learning algorithm when you are dealing with **tabular data**. This is data that can be conveniently written down in a table (like a spreadsheet).  
  
More complicated machine learning methods, such as neural networks, often do not work as well on tabular data like this. With only a few hundred patients, they tend to fit noise in the data rather than overall trends, which can make them perform worse than random forests.

### Checklist
The tasks are as follows: 

1) Import the libraries you need for your work
2) Load in the dataset that you have been provided
3) Perform train-test splitting to prepare the data for analysis
4) Build a random forest model to do the classification task
  
Wherever you see a cell with "(Help)", you can ignore it if you want! It is just to provide support/help if you are stuck.

## If you get stuck
  
A key part of coding is getting stuck and working out the problem. If you get stuck, the following things may help you: 
* read the error message that is being produced 
* google the error that you are getting: maybe someone else has had it before
* use a chatbot like ChatGPT or DeepSeek to try to fix your problem
  

We **strongly urge** that you try to fix your problems using other methods before getting ChatGPT (or other chatbot) to fix them. Whilst it will probably work, it will not help you learn as much. If you try to solve the problem yourself, you will be better at fixing it if it ever happens again.

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## 1) Import the libraries you need for your work  

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

The first task is *always* to import the libraries you need for your work. This is the same as `library(dplyr)` in `R`. Libraries are the fundamental unit of reproducible code. They prevent you from having to redo something that you've done before, and allow you to share solutions with other people.  
  
Remember that in `Python`, there are two ways that you can import things:
1) `import numpy as np`  
    * This imports the `numpy` library, so if you want any function from inside `numpy`, you use `np.function()`  
    * This is useful if you are going to use a lot of functions that all come from the same library.  
    * The name after `as` is the alias that we are going to use for the library in our code. It's convenient to use a short version. For `numpy`, we normally use `np`. For `pandas`, we use `pd`. You can use anything you want, though.
  
2) `from numpy import sum`  
    * This imports the `sum` function from inside the `numpy` library.  
    * This is useful if you are using functions that come from different parts of the same library (e.g. in `sklearn`)
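
The two import styles above can be compared side by side in a short sketch. We use `mean` here rather than `sum` purely for illustration, since Python already has a built-in function called `sum`.

```python
# Style 1: import the whole library under a short alias
import numpy as np

# Style 2: import one specific function from the library
from numpy import mean

values = [1, 2, 3, 4]

# Both lines call the same NumPy function, just reached in different ways
print(np.mean(values))
print(mean(values))
```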

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

For this task, you are going to need the following libraries: 
* the `matplotlib.pyplot` library 
* the `pandas` library
* the `numpy` library 
* the `RandomForestClassifier` class from the `sklearn.ensemble` module
* the `train_test_split` function from the `sklearn.model_selection` module
  
Write Python code in the cell below to import these libraries

In [1]:
### Write your Python code in this cell
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd 

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## 2) Load in the dataset that we are going to classify

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

Now we need to load in the dataset that we are going to be working with. The file that we are going to load is in the `dataset` folder and is called `hnsc_dataset_scaled.csv`.  
  
A `.csv` file is a common file format used by data scientists. It is a "comma separated values" file. If you opened this file in a text editor, you would see all of the data separated by commas ",". Excel can open and save this file format as well. 
  
To read in a `.csv` file, we use the `read_csv()` function from the `pandas` library.  
  
Remember, if you used `import pandas` in the previous section, then you will need to change this from `pd.read_csv` to `pandas.read_csv`
  
In the cell below, use `read_csv()` to read in the dataset. Inside the brackets, put the path to the file in speech marks. The "path" tells the code how to get from the `.ipynb` notebook where it is currently running to where the data that it is going to read is.
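
If you want to see how `read_csv()` behaves before pointing it at the real file, the sketch below reads a tiny made-up CSV from a string. The patient IDs, gene names and values are invented for illustration; with a real file you would pass the path in speech marks instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# A tiny made-up CSV; io.StringIO lets us treat a string as if it were a file
csv_text = """Patient ID,GENE_A,GENE_B,HPV_Status
P001,0.5,-1.2,Positive
P002,-0.3,0.8,Negative
"""

toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```

The first line of the CSV becomes the column names, and each following line becomes one row.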

In [2]:
### Put in the path to the file here

df = pd.read_csv("../../course_content/medicine/session_2/dataset/hnsc_dataset_scaled.csv")

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

It is good practice to check that the data you have is what you expected. Remember that you can use `df.head()` to show the top 5 rows of the dataset.  
  
Use that here, check that the data seems to be read in sensibly. 

In [3]:
### View the first 5 rows of your dataset
df.head()

Unnamed: 0,Patient ID,PRAME,ACTC1,MYBPC1,DES,MAGEA4,GSTM1,UGT1A7,TGM3,CRNN,...,UPK1A,CIDEA,SPOCK1,FABP6,PGLYRP4,ZNF681,TNNT2,FOXC2,GAD1,HPV_Status
0,TCGA-4P-AA8J,1.209821,1.420797,0.866643,1.385693,1.533709,0.610027,-0.106147,-1.383249,-0.14942,...,1.004259,0.937387,1.457463,1.092352,-0.229496,-0.070576,1.880138,0.552338,0.633321,Positive
1,TCGA-BA-4074,-0.321485,0.445424,-0.100255,0.3066,-0.54865,1.038343,-0.809892,-1.474619,-1.492317,...,-0.338622,-1.213358,0.970665,2.406067,-0.735174,0.127294,-0.820846,0.027972,1.095503,Positive
2,TCGA-BA-4076,1.093123,-1.007401,-1.040044,-1.177666,1.42422,-1.008177,-0.128543,0.506885,0.546852,...,1.103977,1.761758,0.378451,-0.207212,0.701801,1.12393,0.454497,0.385118,0.813089,Positive
3,TCGA-BA-4078,1.311364,-0.479122,-0.19836,-0.604511,-0.307156,1.650475,-0.407305,-1.277495,-0.191219,...,0.029151,-0.267558,1.186041,-2.263153,-0.969619,-0.821687,0.278721,-0.541685,0.30397,Positive
4,TCGA-BA-5149,-0.900534,0.657561,0.205038,0.513543,1.520438,-0.010684,-1.114986,-0.578728,-1.265923,...,-0.486232,-0.827207,0.429422,-0.012086,-0.184506,0.483262,-0.485898,1.764313,-0.870219,Positive


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

* Does the data have a sensible number of rows and columns?  
  
* We are expecting 1000 columns for the 1000 genes, 1 column for the HPV-status (positive/negative) and 1 column with the Patient-ID 
  
* How many patients are there?
  
* To view the shape of the dataset (number of rows and columns), you can use `df.shape` 
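
As a small illustration of these checks, the sketch below inspects a made-up table with three patients. The column names mirror the ones in this notebook, but the values are invented. `value_counts()` is an extra pandas method (not required for this task) that counts how many patients fall into each class.

```python
import pandas as pd

# Hypothetical miniature dataset: 3 patients, 1 gene
toy_df = pd.DataFrame({
    "Patient ID": ["P001", "P002", "P003"],
    "GENE_A": [0.5, -0.3, 1.1],
    "HPV_Status": ["Positive", "Negative", "Negative"],
})

# .shape is a (rows, columns) tuple, so we can unpack it
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# Count how many patients are in each HPV class
class_counts = toy_df["HPV_Status"].value_counts()
print(class_counts)
```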

In [4]:
### Print out the shape of the dataset here. Is it what you expected? How many patients are there?
print(df.shape)

# 487 patients

(487, 1002)


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

We need to separate the outcome that we are trying to predict (HPV-status) from the features (the gene expression levels) that we are going to use to predict it. 
  
In the cell below, create new variables `X` and `y`.  
* `y` should be the `HPV_Status` column from `df`
* `X` should have only the columns that include gene expression information
  
Remember, in pandas, if you have a dataset called `data`, then you can get the column called "`outcome`", by using `data["outcome"]`. To get all columns that are *not* called "outcome" and "id", you could use `data.drop(["outcome", "id"], axis = 1)`.
  
Remember to adapt this code for our situation (where our "outcome" and "id" columns are called different things)
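
Here is the generic pattern from above run on a made-up table, so you can see what each line produces. The column names `id`, `gene_1`, `gene_2` and `outcome` are placeholders, not the names in the HNSC dataset.

```python
import pandas as pd

# Hypothetical dataset with an ID column, two features and an outcome
data = pd.DataFrame({
    "id": ["a", "b", "c"],
    "gene_1": [0.1, 0.2, 0.3],
    "gene_2": [1.5, -0.5, 0.0],
    "outcome": ["yes", "no", "yes"],
})

# Select the single column we want to predict
outcome = data["outcome"]

# Keep everything except the outcome and the ID (axis = 1 means "drop columns")
features = data.drop(["outcome", "id"], axis=1)

print(features.columns.tolist())
```

Note that `drop()` returns a new table and leaves `data` itself unchanged.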

In [5]:
y = df["HPV_Status"]
X = df.drop(["HPV_Status", "Patient ID"], axis = 1)

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## 3) Train-Test Splitting  
  
As was explained in the lecture, train-test splitting is a crucial part of building machine learning models.  
The core task in machine learning is to build a model/algorithm that will predict the quantity of interest on new samples that it hasn't seen before.  
In our case, we want to build a machine learning model that can predict an HNSC patient's HPV status based on their gene expression levels.  
  
We use a **training set** to build the model. However, because the model has seen this data before, the performance on this training data is a *biased* estimate of how the model would perform on new data. We also need a **testing set** that we *only* use right at the end to evaluate the model. We will discuss model evaluation in the next session.  
  
For the time being, we will just do train-test splitting here. The scikit-learn library in `Python` provides tools for doing this.
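
To see why performance on the training set is a biased estimate, consider the sketch below. It trains a random forest on pure noise, where the labels have nothing to do with the features, so there is no real pattern to learn. The data are randomly generated and all variable names are our own; `score()` reports accuracy, which we will cover properly in the next session.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up noise: 200 "patients", 20 "genes", and labels assigned at random
rng = np.random.default_rng(0)
noise_X = rng.normal(size=(200, 20))
noise_y = rng.integers(0, 2, size=200)

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    noise_X, noise_y, test_size=0.2, random_state=0
)

noise_model = RandomForestClassifier(n_estimators=100, random_state=0)
noise_model.fit(Xn_train, yn_train)

# Accuracy on data the model has already seen vs. data it has not
train_acc = noise_model.score(Xn_train, yn_train)
test_acc = noise_model.score(Xn_test, yn_test)
print("training accuracy:", train_acc)
print("testing accuracy:", test_acc)
```

The model looks excellent on the training set because it has memorised it, while the testing set reveals that it has learned nothing useful.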

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

************************************************************************************************************************************************************************************************************   
  
(HELP)  
  
Let's say that you had a dataset called `data` and a quantity you were trying to predict called `outcome`. You would do train-test splitting using `scikit-learn` with the following code: 
```
X_train, X_test, y_train, y_test = train_test_split(data, outcome, test_size = 0.01, random_state = 50)
```  
Here, I've used a `test_size` of 0.01 (meaning 1% of my data is used for the testing set, and 99% is used for the training set) and a `random_state` of 50.  
  
The `random_state` is a number that allows my splitting to be reproduced by other people. Every time this code is run, no matter the computer, the train-test splitting will be identical as long as `random_state` is not changed.
************************************************************************************************************************************************************************************************************
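
As a quick check of the reproducibility claim, the sketch below splits the same made-up data twice with the same `random_state` and confirms that the two testing sets are identical. The arrays `toy_data` and `toy_outcome` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and alternating labels
toy_data = np.arange(20).reshape(10, 2)
toy_outcome = np.array([0, 1] * 5)

# Two separate calls with the same random_state
split_a = train_test_split(toy_data, toy_outcome, test_size=0.2, random_state=50)
split_b = train_test_split(toy_data, toy_outcome, test_size=0.2, random_state=50)

# Each result is (X_train, X_test, y_train, y_test); index 1 is the testing features
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)
```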

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">
In the following cell, do train-test splitting of the HNSC dataset. Use a test_size of 0.2 and a random_state of 42

In [6]:
### Do train-test splitting here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

In the "Markdown" cell below, answer the following questions. You can type `1)` and then write your answer to have them appear

1) Why is it important to do train-test splitting?
2) What should you do with the test set once you have produced it?
3) The `random_state` variable in the `train_test_split()` function ensures reproducibility: we get the same train-test split every single time we run the code. Why might this be important?
4) My model isn't working very well, what if I used the testing set for more training, and then used it again to test? Would this be incorrect?

#### Your Answers  
1)  To get an unbiased estimate of the model's performance on data it hasn't seen yet
  
2)  Do not use it to train the model. Keep it separate until the very end. Only use it for final evaluation.
   
3)  To allow other people to reproduce the results, to systematically improve the model without random chance affecting comparisons, and to ensure that we don't get values that change from run to run
   
4)  Not correct! We would be using the test set to train the model --> not allowed


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## 4) Building a Random Forest Model

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

Finally, we get to actually building our machine learning model.  
  
As discussed before, we are going to use a "random forest". This is a great, general-purpose machine learning model that should be good for most purposes. 
  
The actual algorithm was discussed in the lecture. If you want some more information, feel free to check out some of the following resources: 
* StatQuest with Josh Starmer: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ
* Stemplicity: https://www.youtube.com/watch?v=Y85XV45x0VU
* Victor Zhou: https://victorzhou.com/blog/intro-to-random-forests/ 

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

We are *not* going to ask you to implement a random forest yourself.  
  
In general, unless you are inventing new machine learning algorithms, it is almost always a bad idea to try to implement an algorithm completely from scratch yourself.  
  
For personalised medicine, we will always use versions of these algorithms that someone has already created and saved as a library (like `scikit-learn`).  
This is for a few reasons:   
* the library version has been checked and validated by other people, so it is more likely to be error-free
* the library version has been optimised and written efficiently, so it will be quicker than any version we can write
* the library version is *safe*. The code is saved open-source online, so we know exactly what it is doing


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

For random forest, we are going to use the `RandomForestClassifier` class from the `sklearn.ensemble` module. You should have imported that earlier in this notebook.  
  
The next cell will use the python `help` function to show the help page for RandomForestClassifier. You will see that it contains a lot of information! The same information is summarised here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html  
  
*You do not need to read all this information*  
  
We are including it to show how many parameters there are that go into the `RandomForestClassifier` algorithm
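
If you would like a shorter view than the full help page, one option is to list the hyperparameters directly. The sketch below uses `get_params()`, which scikit-learn models provide, to print each hyperparameter and its default value.

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns a dictionary of every hyperparameter and its current value
default_params = RandomForestClassifier().get_params()

print("Number of hyperparameters:", len(default_params))
for name in sorted(default_params):
    print(name, "=", default_params[name])

# For the full help page, you could instead run: help(RandomForestClassifier)
```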

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

Most machine learning algorithms have **hyperparameters**. These are parameters that affect how the algorithm behaves, and can change how accurately it can classify data. However, they are not learned directly from the training data. They are parameters that you (the user) need to set manually.  
  
**Hyperparameter Optimisation** is the task of picking these hyperparameters so that our machine learning algorithm performs as well as it possibly can. It is an important part of using machine learning models. However, we are not going to handle it in this course. If you are interested, you can check out this link (https://www.geeksforgeeks.org/random-forest-hyperparameter-tuning-in-python/).  
  
For random forests, the most important hyperparameters are: 
* `random_state`: As with train-test splitting, a number that ensures our model is reproducible every time we run the code 
* `n_estimators`: The number of decision trees in our random forest. Normally, 100 is a sensible number. 
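* `max_depth`: The maximum depth of each decision tree in the forest, i.e. how many successive splits a tree may make before it has to stop. Limiting the depth helps to stop individual trees from fitting noise. Here, 10 is a sensible number. 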
  
There are other hyperparameters too, but we will not adjust these
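
To see what `n_estimators` and `max_depth` actually control, the sketch below fits a deliberately small forest to made-up data and then inspects the trees inside it. The data and the variable names are invented for illustration; `estimators_` and `get_depth()` are scikit-learn's way of looking at the individual fitted trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data: 100 samples, 5 features; the label depends (noisily) on the first feature
rng = np.random.default_rng(1)
toy_X = rng.normal(size=(100, 5))
toy_y = (toy_X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# A small forest: 10 trees, each allowed at most 3 levels of splits
small_forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
small_forest.fit(toy_X, toy_y)

# After fitting, the individual trees are stored in .estimators_
n_trees = len(small_forest.estimators_)
deepest = max(tree.get_depth() for tree in small_forest.estimators_)
print("number of trees:", n_trees)
print("deepest tree:", deepest)
```

The number of trees matches `n_estimators`, and no tree is deeper than `max_depth`.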

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

In the cell below, build a random forest model using `RandomForestClassifier`. For its hyperparameters, use: 
* `random_state = 42`
* `n_estimators = 100`
* `max_depth = 10`


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">  

********************************************************************************************************* 
(HELP)  
If I was trying to build a `Forest` model with a hyperparameter called `Size` with value 10, then I would use this code: 
```
model = Forest(Size = 10)
```
********************************************************************************************************* 

In [7]:
### Build your model here
model = RandomForestClassifier(n_estimators = 100, max_depth = 10, random_state = 42)

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

Now it is time to train your model. This involves providing examples to the model that it can use to adjust its parameters and produce the best algorithm for classifying samples. 
  
For this process, we provide the model both:
* training data (i.e. the `X_train` from earlier that contains gene expression levels)
* labels which tell the model what the prediction should be for each of the training examples (i.e. the `y_train` from earlier)  


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

Machine learning models "learn" by seeing training examples and outcomes. They infer patterns in the training examples and associate these with the outcomes. When we give them new examples, they try to find those patterns and "predict" an outcome based on what they've seen before.  
  
The more training examples that a model is provided, the more reliably it can pick out these patterns and, in general, the better the model will perform.  
  
This is why you will often hear that machine learning models "need more data". It is about providing more training examples to reinforce those patterns.

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

In the cell below, train your machine learning model  
  
You can do this by calling `model.fit(a, b)`. `a` should be your training examples (e.g. `X_train` from above) and `b` should be your outcomes for each of the training examples (e.g. `y_train` from above)
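
The sketch below shows the same `fit` pattern on a made-up dataset, followed by a prediction for two new samples. The data are randomly generated and the variable names (`toy_X_train`, `new_patients` and so on) are our own; `predict()` is covered properly in the next session and is included here only to show what a trained model can do.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: 60 patients, 2 genes.
# In this toy example, a patient is "Positive" whenever the first gene is above 0.
rng = np.random.default_rng(2)
toy_X_train = rng.normal(size=(60, 2))
toy_y_train = np.where(toy_X_train[:, 0] > 0, "Positive", "Negative")

toy_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Training: the model sees the examples together with their labels
toy_model.fit(toy_X_train, toy_y_train)

# Two new, unseen patients: one with a high first gene, one with a low first gene
new_patients = np.array([[2.0, 0.0], [-2.0, 0.0]])
predictions = toy_model.predict(new_patients)
print(predictions)
```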


In [8]:
### Train your machine learning model here
model.fit(X_train, y_train)

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

# Summary and Next Steps
  
In this homework, you have learned how to: 
* import python libraries for data science
* load a dataset and prepare it for model building with train-test splitting 
* build and train a random forest machine learning model 
  
This workflow is the same for most machine learning applications, so these are important stages to understand.  
  
## Next Session 
  
A course instructor will go through this completed workbook with you, offering feedback and checking your work.  
  
We have only built the model, not tested how well it works. This is the process of "model evaluation". We will discuss this in detail in the next session. 



<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## What improvements could be made? 
  
The workflow that we have shown (data load -> train-test split -> build model) will be the same for nearly all machine learning applications, but there are many more advanced things that we might do along the way. Here, we talk about some of those.

### 1) Data Cleaning  
  
We have deliberately prepared a dataset for this task that is clean. In reality, this is almost never the case. Data is messy, with errors, missing values and bias everywhere. A large part of the job of a data scientist is "data cleaning". This is the process of cleaning up the dataset by whatever means you can, so that it can be sensibly used in a machine learning application. This might include things like: 
* removing clear errors in the data 
* replacing missing values with some suitable alternative (maybe an average, or dropping them altogether)
* visualising and exploring the data to check that it looks sensible 
* obtaining as much data as possible, to build better models
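
As a small illustration of handling missing values, the sketch below shows two simple options in `pandas` on a made-up table: filling each gap with the column average, or dropping incomplete rows. Which option is appropriate depends on the dataset; this is only a sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical table with two missing values (written as np.nan)
messy = pd.DataFrame({
    "gene_1": [0.2, np.nan, 0.4],
    "gene_2": [1.0, 2.0, np.nan],
})

# Option 1: replace each missing value with the mean of its column
filled = messy.fillna(messy.mean())

# Option 2: drop every row that contains a missing value
dropped = messy.dropna()

print(filled)
print(dropped)
```

Filling keeps all three patients but invents plausible values, whereas dropping keeps only real measurements but loses patients.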

### 2) Cross-Validation

We mentioned earlier how important train-test splitting is for producing an unbiased estimate of how a model performs. It is *always* used. 
  
However, in many cases, a single split isn't really enough! We want to get confidence intervals on model performance, and understand how the model might perform if we used a different train-test split.  
We do this with cross-validation. This is a way of training the model multiple times, using different splits of the data. Rather than getting one number for how a model performs, we get a distribution of values. We can use these to estimate things like the mean model performance, or how its performance might vary in different datasets. 
  
In practice, we almost always use cross-validation when training models. It produces better estimates of model performance, especially when we only have a limited amount of data. 
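
A minimal sketch of cross-validation with scikit-learn's `cross_val_score` is shown below, using made-up data. With `cv=5`, the data are split into 5 parts, and the model is trained 5 times, each time tested on a different part. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Made-up data: 100 samples, 5 features; the label depends on the first feature
rng = np.random.default_rng(3)
cv_X = rng.normal(size=(100, 5))
cv_y = (cv_X[:, 0] > 0).astype(int)

cv_model = RandomForestClassifier(n_estimators=50, random_state=42)

# One accuracy score per fold, instead of a single number
scores = cross_val_score(cv_model, cv_X, cv_y, cv=5)

print("score for each fold:", scores)
print("mean:", scores.mean(), "spread:", scores.std())
```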
  

### 3) Hyperparameter Tuning
  
Earlier, we mentioned that machine learning algorithms have hyperparameters. These are parameters that the user sets that are not learned from the data (e.g. the number of trees in the random forest). Choosing sensible values of these is a critical part of building a machine learning model, and there are many different ways of doing it. We will not discuss these much here, but this article summarises some of the more important ones: https://www.geeksforgeeks.org/hyperparameter-tuning/
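
One widely used approach is a grid search: try every combination from a small list of candidate values and keep the combination that scores best under cross-validation. The sketch below uses scikit-learn's `GridSearchCV` on made-up data. The candidate values in `param_grid` are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Made-up data: 90 samples, 5 features; the label depends on the first feature
rng = np.random.default_rng(4)
tune_X = rng.normal(size=(90, 5))
tune_y = (tune_X[:, 0] > 0).astype(int)

# Candidate values to try for each hyperparameter (2 x 2 = 4 combinations)
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 5]}

# Each combination is scored with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(tune_X, tune_y)

print("best combination:", search.best_params_)
print("its cross-validated accuracy:", round(search.best_score_, 3))
```

In a real project, the search would be run on the training set only, so that the testing set stays untouched until the final evaluation.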

### 4) Trying other models 
  
We've talked about a random forest model, and for good reason: it is often considered "good enough" for most use cases in machine learning on tabular data. 
  
  
*However, it is not the only model!*
  
  
There are many, many different machine learning models that are all built on slightly different algorithms. A good practice for data science is to build different machine learning models to test the same problem. Other models may perform better.  
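
As a sketch of what comparing models might look like, the code below scores a random forest and a logistic regression on the same made-up data using cross-validation. Logistic regression is chosen here only as an example of a different model; the data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up data: 100 samples, 5 features; the label depends on the first two features
rng = np.random.default_rng(5)
cmp_X = rng.normal(size=(100, 5))
cmp_y = (cmp_X[:, 0] + cmp_X[:, 1] > 0).astype(int)

candidates = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Score every candidate in the same way so that the comparison is fair
mean_accuracy = {}
for name, candidate in candidates.items():
    mean_accuracy[name] = cross_val_score(candidate, cmp_X, cmp_y, cv=5).mean()
    print(name, round(mean_accuracy[name], 3))
```

Because scikit-learn models share the same `fit`/`predict` interface, swapping one model for another usually takes only a line or two.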

### Closing Thoughts
  
The most important thing to remember is that data science is an *iterative* process. It involves progressively improving models, trying small changes and different techniques to try to produce the best model at the task you are dealing with. We hope in this course to give you an introduction to these ideas, but a lot of work is needed to use them in clinical practice.

## References 

[1]: Gillison, M. L., et al. (2000). Evidence for a causal association between human papillomavirus and a subset of head and neck cancers. *Journal of the National Cancer Institute*, 92(9), 709–720.  
[2]: Ang, K. K., et al. (2010). Human papillomavirus and survival of patients with oropharyngeal cancer. *New England Journal of Medicine*, 363(1), 24–35.  
[3]: Fakhry, C., et al. (2008). Improved survival of patients with human papillomavirus–positive head and neck squamous cell carcinoma in a prospective clinical trial. *Journal of the National Cancer Institute*, 100(4), 261–269.  
[4]: Marur, S., et al. (2016). De-intensification of therapy in HPV-positive oropharyngeal cancer: ongoing clinical trials and future directions. *Oral Oncology*, 62, 50–56.  
[5]: Lewis Jr, J. S., et al. (2012). Human papillomavirus testing in head and neck carcinomas: guideline from the College of American Pathologists. *Archives of Pathology & Laboratory Medicine*, 136(11), 1267–1277.  
[6]: Tang, J., et al. (2013). A novel approach for classification of HPV-positive and HPV-negative head and neck squamous cell carcinomas based on RNA-seq data. *Bioinformatics*, 29(3), 275–281.  
[7]: Seiwert, T. Y., et al. (2015). Integrative and comparative genomic analysis of HPV-positive and HPV-negative head and neck squamous cell carcinomas. *Clinical Cancer Research*, 21(3), 632–641.  
[8]: Zhang, Y., et al. (2020). Machine learning algorithms for predicting HPV status from gene expression data. *BMC Bioinformatics*, 21(1), 1–13.  
[9]: The Cancer Genome Atlas Network. (2015). Comprehensive genomic characterization of head and neck squamous cell carcinomas. *Nature*, 517(7536), 576–582.  
[10]: Hoadley, K. A., et al. (2018). Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. *Cell*, 173(2), 291–304.  