# Machine Learning for Biomarker Discovery: Practical Exercises

## Objectives 

These exercises are designed to help you work through some of the concepts from the lecture, and to give you a practical idea of how machine learning can help in biomarker discovery.  
  
This is by no means a comprehensive notebook! It's intended to give you a flavour of some of the tools that are out there for this problem. Ultimately, biomarker discovery is a research area in its own right, and machine learning takes much longer than a couple of hours to learn!  

The main goals for this practical exercise are:
* to understand how Jupyter notebooks can be used as a Python development environment
* to appreciate how to use documentation and LLMs to aid you in writing code
* to learn how to use Pandas to read in and manipulate a dataset
* to perform exploratory data analysis using Pandas on a dataset
* to use UMAP for dimensionality reduction and visualisation
* to understand how to build an ML-based classifier for biomarkers
* to use clustering on a dataset

## Tips
  
* The only way to learn how to code is to try, and fail: it's okay to make mistakes 
* More of your time coding is spent fixing problems ("debugging") than it is writing new things
* Ask for help from your peers and your instructors
* You can use ChatGPT or other AI tools for help, and they will be very good at it, but **try to write the code yourself first**. This is a learning exercise and it is not assessed, so you will learn more if you attempt the problem yourself first


### 1) Writing Python in a Jupyter Notebook

Python files come in two varieties.  

Files that end in `.py` are *scripts*. A script contains code that is executed once, from top to bottom, when you run the file. You can run one from your command line, using something like ``python file.py``. You can also use an integrated development environment (IDE), such as `Spyder`, `PyCharm`, or `IDLE`.  
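
As a rough illustration of what such a script might contain, here is a minimal sketch. The file name `file.py` comes from the command above; the function `mean_expression` and the example values are hypothetical and are there only to give the script something to do.

```python
# file.py: a minimal, hypothetical script


def mean_expression(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # This block runs only when the file is executed directly,
    # e.g. with `python file.py`, and not when it is imported.
    example = [2.0, 4.0, 6.0]
    print(mean_expression(example))
```

Running `python file.py` executes the file from top to bottom and prints `4.0`.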

When developing a larger program, or a reusable collection of code (often called a library), you use several of these files to break up your workflow into small pieces. This separates concerns: you can test each file separately, which makes errors easier to track down. This is the workflow most commonly used in practice.  
  
For learning, experimentation, or data exploration, people like to use *Jupyter Notebooks*. This file is one of those. They have the file ending `.ipynb`. These show the intermediate results of the code as you run it, in plain text. This is great for communicating or summarising results, and quick experimentation. We're going to use them exclusively.  

It's worth saying: if you were trying to do machine learning as part of a project, then you should eventually transition to using Python scripts. But it is common to start off with `.ipynb` notebooks until you work out what you are trying to do.

Jupyter notebooks divide the project into cells, which can be run separately. There are two types of cells, `Code` and `Markdown`.  
`Code` cells contain Python code that is to be run. `Markdown` cells are plain text, and they are useful for writing commentary on what you are doing so that you can come back to it later. At the top of 
  
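It is important to remember that Jupyter notebooks contain persistent memory: a variable defined when one cell is run remains available to every other cell, whatever their order on the page, until the kernel is restarted.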
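
A small sketch may help make this concrete. The comments below mark what would be three separate `Code` cells; the variable name `counter` is hypothetical.

```python
# --- Cell 1: define a variable ---
counter = 0

# --- Cell 2: modify it (each run of this cell adds 1 again) ---
counter += 1

# --- Cell 3: the value shown depends on which cells were run, and how often ---
print(counter)
```

If each cell is run once, in order, the last cell prints `1`. If you run Cell 2 a second time without rerunning Cell 1, the last cell prints `2`, even though the code on the page has not changed. When results look inconsistent, restarting the kernel and running the cells from the top returns the notebook to a clean state.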