# Ingham Medical Physics Coding Challenge - September 2020

This Jupyter notebook describes the coding challenge for the Radiotherapy Computer Scientist position at the Ingham Institute Medical Physics Group (September 2020 hiring round). The goal of this challenge is to train a model to predict outcomes for cancer patients and present the results.

## Data
This task makes use of data obtained from The Cancer Imaging Archive: Head-Neck-Radiomics-HN1 (https://wiki.cancerimagingarchive.net/display/Public/Head-Neck-Radiomics-HN1), which is available under the Attribution-NonCommercial 3.0 Unported licence. This dataset includes clinical data and computed tomography (CT) scans from 137 head and neck squamous cell carcinoma (HNSCC) patients treated with radiotherapy. Structures within the CT images have also been manually delineated by an experienced radiation oncologist.

Two CSV files are provided alongside this notebook in the **data** directory:

#### HN_ClinicalData.csv
This sheet contains the clinical data of the patients included within the Head-Neck-Radiomics-HN1 dataset. It provides information such as the patient's age, stage of disease and various outcomes. The patients have also been randomly split into a **train** and a **test** set (see the *dataset* column).

#### HN_Radiomics.csv
Radiomic features have been generated from the patients' image data available in the Head-Neck-Radiomics-HN1 dataset. The **pyradiomics** library was used to extract first-order and shape features from each patient's CT scan. Features are computed per structure (region of interest), so each row describes one structure for one patient.

A structure of particular significance for radiotherapy is the Gross Tumour Volume (GTV). This describes the position and extent of the tissue identified as tumour (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1434601/ for more information). Note that a patient may have more than one GTV. These are named GTV-1, GTV-2, ..., GTV-*n*, where *n* is the number of tumour volumes for that patient.

## Task

Using the data found in the two CSV files, train a model which can predict an outcome for a patient. A common outcome to predict is the patient's overall survival (found in the column *overall_survival_in_days* within the clinical data). Other outcomes are also available within the clinical data, such as *recurrence_metastatic_free_survival_in_days*, *local_recurrence_in_days* and *distant_metastases_in_days*.

Make use of the clinical data and radiomic features to attempt to predict these outcomes. Hint: the GTV will probably be the most useful structure for this, since it describes the cancerous tissue. Since multiple GTVs are available for many patients, you will need to think about a good way to combine these rows for those patients. There are also many radiomic features available, so think about selecting a subset of them that you expect to be useful for predicting a particular outcome.
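As a rough illustration of selecting a subset of features, the sketch below ranks candidate features with scikit-learn's `SelectKBest` and `f_regression` and keeps the top two. This is only one of many possible approaches, not a recommended method. The data is randomly generated, and the column names `Sphericity` and `Energy` are assumed here for illustration (they are pyradiomics feature names, but may not match the columns in *HN_Radiomics.csv*).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic training features (assumption: random values, not real patient data).
rng = np.random.default_rng(42)
n = 80
X_train = pd.DataFrame({
    "VoxelVolume": rng.normal(size=n),
    "SurfaceArea": rng.normal(size=n),
    "Sphericity": rng.normal(size=n),
    "Energy": rng.normal(size=n),
})

# Synthetic outcome that depends strongly on VoxelVolume only.
y_train = -5.0 * X_train["VoxelVolume"] + rng.normal(scale=0.5, size=n)

# Score each feature with a univariate F-test and keep the two best.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_train, y_train)
selected = X_train.columns[selector.get_support()].tolist()
print(selected)
```

Fitting the selector on the training patients only, as done here, keeps the test set untouched until evaluation.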

Train the model using the patients in the **train** dataset (dataset column in the clinical data). Then test your model using the patients in the **test** dataset.

Finally, generate one or more plots which show how well your model is performing to predict a certain outcome.

## Note

The aim of this challenge is not to build a model with excellent results, so don't worry if your model isn't performing all that well. This is a cutting-edge topic of active research and is not easy to solve. What we want to see is how you approach a problem like this, how you present your results and your overall coding style.

## Submission

In this Jupyter notebook some Python code is provided to get you started with the challenge. The libraries you'll need are defined in the accompanying *requirements.txt* file. To complete the challenge, you can extend this notebook with your code. If you prefer, you can provide your solution in a separate file (or files) as well.

If you would prefer to complete this task in a different programming language, no problem! Feel free to use R, MATLAB or anything else you feel is appropriate.

The suggested way to submit your result is to fork this GitHub repository and commit your results to your fork. Once complete, just send us a link (phillip.chlap@unsw.edu.au) to your forked repository. Note that this will make your submission publicly visible. If you would prefer to keep your submission private, this is also no problem. You will just need to duplicate this repository (https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/duplicating-a-repository), then add **@pchlap** as a user to your private repository so that we can see your results.

**Due Date:** September 30th @ 11.59pm AEST.

If you have any questions, post them as an issue on this GitHub repository or directly email phillip.chlap@unsw.edu.au.

## Resources

 - **pyradiomics** features: https://pyradiomics.readthedocs.io/en/latest/features.html
 - **pandas**: https://pandas.pydata.org/docs/
 - **scikit-learn**: https://scikit-learn.org/stable/user_guide.html
 - **seaborn**: https://seaborn.pydata.org/index.html
 
### Good luck!

In [None]:
from pathlib import Path

# Define paths to our data
data_path = Path("data")
radiomics_path = data_path.joinpath("HN_Radiomics.csv")
clinical_data_path = data_path.joinpath("HN_ClinicalData.csv")

In [None]:
import pandas as pd

# Load the data
df_clinical_data = pd.read_csv(clinical_data_path)
df_radiomics = pd.read_csv(radiomics_path)

### Extract and combine specific features

This cell demonstrates how you might extract radiomic features (VoxelVolume and SurfaceArea) for all GTVs. Since there can be multiple GTVs per patient, these are combined by summing the values for each patient here.

You'll probably want to extend this to extract more features. Think about how you would combine other features: in some cases computing the mean value might be more appropriate, or perhaps you don't want to combine them at all.

Also, take a look at what else is available in the clinical data. Perhaps you'd like to use some of these features as well (for example, patient age or cancer stage).
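A minimal sketch of combining GTV rows with a different aggregation per feature, using pandas named aggregation. The toy table stands in for *HN_Radiomics.csv*: its values are made up, and the `Mean` column and the non-GTV structure names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for HN_Radiomics.csv (assumption: values and the "Mean"
# column are invented for this example).
df_radiomics = pd.DataFrame({
    "id": ["P1", "P1", "P1", "P2", "P2"],
    "Structure": ["GTV-1", "GTV-2", "Spinal-Cord", "GTV-1", "Parotid-L"],
    "VoxelVolume": [1000.0, 500.0, 9000.0, 2000.0, 7000.0],
    "SurfaceArea": [600.0, 300.0, 4000.0, 900.0, 3000.0],
    "Mean": [40.0, 60.0, 30.0, 50.0, 20.0],
})

# Keep only the GTV rows.
df_gtv = df_radiomics[df_radiomics["Structure"].str.startswith("GTV")]

# Sum size-like features, average the intensity feature, and count GTVs.
df_gtv_features = df_gtv.groupby("id").agg(
    VoxelVolume=("VoxelVolume", "sum"),
    SurfaceArea=("SurfaceArea", "sum"),
    MeanIntensity=("Mean", "mean"),
    NumGTVs=("Structure", "count"),
)
print(df_gtv_features)
```

Note that the plain mean treats a small and a large GTV equally. A volume-weighted mean is one alternative worth considering for intensity features.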

In [None]:
df_gtv_radiomics = df_radiomics[df_radiomics["Structure"].str.startswith("GTV")]
df_gtv_radiomics = df_gtv_radiomics.groupby("id")[["VoxelVolume", "SurfaceArea"]].sum()


# TODO: Extract more/different features







### Merge feature(s) with clinical data

This cell merges the extracted features with the clinical data into a single DataFrame, matching rows on the patient *id*.
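By default `DataFrame.merge` performs an inner join, so any patient without a matching feature row is silently dropped. The sketch below, on made-up data, shows one way to check for such patients using the `how`, `validate` and `indicator` parameters. The patient ids and values are hypothetical.

```python
import pandas as pd

# Toy clinical and feature tables (assumption: invented ids and values).
df_clinical = pd.DataFrame({
    "id": ["P1", "P2", "P3"],
    "dataset": ["train", "train", "test"],
    "overall_survival_in_days": [400, 1200, 800],
})
df_features = pd.DataFrame({
    "id": ["P1", "P2"],
    "VoxelVolume": [1500.0, 2000.0],
})

# how="left" keeps every clinical row, validate guards against duplicate ids,
# and indicator=True adds a "_merge" column recording where each row came from.
merged = df_clinical.merge(
    df_features, on="id", how="left", validate="one_to_one", indicator=True
)

# Patients present in the clinical data but without any feature row.
missing = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(missing)
```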

In [None]:
df = df_clinical_data.merge(df_gtv_radiomics, on="id")

### Plot our data

Here we plot the features we just extracted against the patient outcome (overall survival in days).

In [None]:
import seaborn as sns

# One scatter plot per feature, coloured by train/test split
pair_grid = sns.PairGrid(
    df,
    y_vars=["overall_survival_in_days"],
    x_vars=["VoxelVolume", "SurfaceArea"],
    height=6,
    hue="dataset",
)
pair_grid.map(sns.scatterplot)
pair_grid.add_legend()

### Fit your model

Using the data you have prepared above, fit a model to see if you can predict the outcome of the patients. If you're not sure where to start, try using a linear regression...

Regression not working well? Try turning this into a classification problem and see if you can instead predict a "good" or a "bad" outcome.
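One possible way to frame the classification variant is sketched below: survival is split into "good" and "bad" at the training-set median, and a logistic regression is fitted on standardised features. The data is randomly generated, and both the median threshold and the choice of model are arbitrary assumptions made for illustration, so the resulting accuracy carries no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged DataFrame (assumption: random values).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "dataset": np.where(np.arange(n) < 70, "train", "test"),
    "VoxelVolume": rng.uniform(1_000, 50_000, n),
    "SurfaceArea": rng.uniform(500, 10_000, n),
    "overall_survival_in_days": rng.integers(100, 3_000, n),
})

features = ["VoxelVolume", "SurfaceArea"]
train = df[df["dataset"] == "train"]
test = df[df["dataset"] == "test"]

# Label survival above the training-set median as a "good" outcome (1).
# The threshold comes from the training patients only.
threshold = train["overall_survival_in_days"].median()
y_train = (train["overall_survival_in_days"] > threshold).astype(int)
y_test = (test["overall_survival_in_days"] > threshold).astype(int)

# Standardise the features, then fit a logistic regression classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(train[features], y_train)

accuracy = clf.score(test[features], y_test)
print(f"Test accuracy: {accuracy:.2f}")
```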

In [None]:
from sklearn.linear_model import LinearRegression

X_train = df[df["dataset"]=="train"][["VoxelVolume", "SurfaceArea"]]
X_test = df[df["dataset"]=="test"][["VoxelVolume", "SurfaceArea"]]

y_train = df[df["dataset"]=="train"]["overall_survival_in_days"]
y_test = df[df["dataset"]=="test"]["overall_survival_in_days"]


# TODO: Fit model...









### Plot Results

Visualize the performance of your model with some plots. Try to be creative and think about some unique ways to allow others to explore your results.
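As a starting point, a predicted-versus-observed scatter plot is a common way to inspect a regression model. The sketch below uses matplotlib directly. The helper name `plot_predicted_vs_actual` is hypothetical and the plotted values are made up.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_predicted_vs_actual(y_true, y_pred, ax=None):
    """Scatter predictions against observed values with a y = x reference line."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    ax.scatter(y_true, y_pred, alpha=0.7)

    # Points on the dashed diagonal would be perfect predictions.
    low = min(y_true.min(), y_pred.min())
    high = max(y_true.max(), y_pred.max())
    ax.plot([low, high], [low, high], linestyle="--", color="grey",
            label="perfect prediction")

    # Summarise the error with the mean absolute error in days.
    mae = np.mean(np.abs(y_true - y_pred))
    ax.set_xlabel("Observed overall survival (days)")
    ax.set_ylabel("Predicted overall survival (days)")
    ax.set_title(f"Test set, MAE = {mae:.0f} days")
    ax.legend()
    return ax


# Made-up observed and predicted values, for illustration only.
y_test = [300, 900, 1500, 2100]
y_pred = [500, 800, 1300, 1900]
ax = plot_predicted_vs_actual(y_test, y_pred)
```

In a notebook the figure renders automatically at the end of the cell. In a plain script, `plt.show()` or `ax.figure.savefig(...)` would be needed.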

In [None]:
# TODO: Plot results...




