# Capstone Project: Neuroblastoma gene expression data analysis
## Advanced Python for Life Sciences @ Physalia courses (Summer 2025)
### Marco Chierici, Fondazione Bruno Kessler - Data Science For Health

## Objective

In this project, you will go through the analytical steps of a typical data science workflow in Python on a real-world dataset of biological relevance. The full dataset consists of gene expression values of a cohort of 498 neuroblastoma patients with associated clinical data: see the [publication](https://www.ncbi.nlm.nih.gov/pubmed/26109056) for more detailed information.

The dataset is publicly available through NCBI's Gene Expression Omnibus (GEO) with accession number [GSE49711](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49711).

Your main goal is to investigate the relationship between gene expression and the survival status of neuroblastoma patients.

<img src="zhang_cover.png" width="65%" />

## Overview

The clinical information includes a number of variables that can be used as targets ("endpoints") in order to study the relationship between gene expression and these outcomes. Such endpoints include the "overall survival" (whether the patient was alive at the end of the follow-up), an extreme disease outcome (favorable vs. unfavorable), and the neuroblastoma risk stratification by the Children’s Oncology Group (high vs. low).

Additional clinical information includes the genomic amplification status of the MYCN proto-oncogene and the tumor stage.

The staging is defined by the International Neuroblastoma Staging System and ranges from spontaneous regression (stage 4S) to gradual maturation (stages 1-2) to aggressive and often fatal disease (stages 3-4).

The following tables summarize the clinical information of the cohort.

<img src="zhang_table1.png" width="34%" />

<img src="zhang_table2.png" width="95%" />

## Data exploration

The log-normalized gene expression data is in the text file named `GSE49711_SEQC_NB_MAV_G_log2.20121127.txt.gz`, available on the study's GEO page. 

The clinical information for all patients ("phenotype data") is stored in the SOFT-formatted file available on GEO.

Download data and metadata from GEO using Python and import them into a Pandas dataframe.
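
A minimal sketch of the download step for the expression file. It assumes that GEO serves supplementary files under the usual `series/GSExxnnn/<accession>/suppl/` layout of its FTP mirror; the helper names `geo_suppl_url` and `load_expression` are hypothetical, and the URL should be checked against the link shown on the GEO page. For the SOFT file, a dedicated parser such as the third-party `GEOparse` package is one possible option.

```python
import re

import pandas as pd

GEO_FTP = "https://ftp.ncbi.nlm.nih.gov/geo/series"  # assumed mirror root


def geo_suppl_url(accession: str, filename: str) -> str:
    """Build the URL of a supplementary file (assumed GEO folder layout)."""
    # GEO groups series into folders where the last (up to) three digits
    # of the accession are masked, e.g. GSE49711 -> GSE49nnn
    stub = re.sub(r"\d{1,3}$", "nnn", accession)
    return f"{GEO_FTP}/{stub}/{accession}/suppl/{filename}"


def load_expression(url: str) -> pd.DataFrame:
    """Read a gzipped, tab-separated table straight from a URL."""
    return pd.read_csv(url, sep="\t", compression="gzip")


expr_url = geo_suppl_url(
    "GSE49711", "GSE49711_SEQC_NB_MAV_G_log2.20121127.txt.gz"
)
print(expr_url)
# expr = load_expression(expr_url)  # uncomment to actually download the file
```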

Inspect the expression dataframe. In particular, look at the gene symbols and IDs: for the downstream analysis, keep only the genes that have both an NCBI Gene ID and a RefSeq transcript ID.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you are curious about the genes with strange names without NCBI/RefSeq IDs, those are from AceView, a curated and comprehensive annotation database: more information on its <a href="https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/">website</a>.</p>
</div>
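
The ID filter can be sketched on a toy table as follows. The annotation column names (`NCBI_gene_id`, `RefSeq_transcript_ID`) and the use of `NaN` for a missing ID are assumptions: inspect the real file to see how its columns are named and how missing IDs are encoded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the expression table. Column names other than "#Gene"
# are hypothetical, and "unannotated_locus" is a made-up gene name.
expr = pd.DataFrame({
    "#Gene": ["MYCN", "CD274", "unannotated_locus", "CD3E"],
    "NCBI_gene_id": ["4613", "29126", np.nan, "916"],
    "RefSeq_transcript_ID": ["NM_005378", "NM_014143", np.nan, np.nan],
    "SEQC_NB001": [5.1, 2.3, 0.4, 3.3],
})

# Keep a gene only if BOTH identifiers are present
has_ids = expr["NCBI_gene_id"].notna() & expr["RefSeq_transcript_ID"].notna()
expr_annot = expr.loc[has_ids].reset_index(drop=True)

print(f"kept {len(expr_annot)} of {len(expr)} genes")
```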

Check for duplicated gene names (column `#Gene`) and remove them (hint: the method `.duplicated()` returns True/False on duplicated elements of a column).
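
A small sketch of the duplicate check on toy data. Keeping the first occurrence is only one possible policy; `keep=False` would flag every copy of a duplicated name instead.

```python
import pandas as pd

# Toy table with one repeated gene name
expr = pd.DataFrame({
    "#Gene": ["MYCN", "CD3E", "MYCN", "CD274"],
    "SEQC_NB001": [5.1, 3.3, 5.0, 2.3],
})

# True for the second and later occurrences of a gene name
dup_mask = expr["#Gene"].duplicated(keep="first")
print(f"{dup_mask.sum()} duplicated gene name(s)")

# Drop the flagged rows, keeping the first occurrence of each name
expr_unique = expr.loc[~dup_mask].reset_index(drop=True)
```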

Check for missing *numerical* values in any column of the expression dataframe (hint: Pandas `.isna()`) and, if there are any, handle them accordingly (hint: Pandas `.dropna()`).
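
One way to run this check, sketched on toy data. Dropping the affected genes is just one option; imputation may be preferable when many values are missing.

```python
import numpy as np
import pandas as pd

# Toy table with a single missing expression value
expr = pd.DataFrame({
    "#Gene": ["MYCN", "CD3E", "CD274"],
    "SEQC_NB001": [5.1, np.nan, 2.3],
    "SEQC_NB002": [4.8, 3.0, 2.1],
})

# Restrict the check to the numeric (expression) columns
numeric = expr.select_dtypes(include="number")
n_missing = numeric.isna().sum()      # missing values per column
print(n_missing[n_missing > 0])

# Drop every gene (row) with at least one missing expression value
expr_clean = expr.dropna(subset=list(numeric.columns))
```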

Inspect the phenotype dataframe. Retain only the columns `title` (with patient IDs) and those matching `characteristics_ch1*`. Optionally, rename them to shorter but informative names of your choice.
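
A sketch of the column selection and renaming. The exact column names below (for example `characteristics_ch1.0.dataset`) are hypothetical and only mimic the `characteristics_ch1*` pattern; the new name `sample_id` is likewise an arbitrary choice.

```python
import pandas as pd

# Toy phenotype table with hypothetical column names and values
pheno = pd.DataFrame({
    "title": ["SEQC_NB001", "SEQC_NB002"],
    "platform_id": ["platform_x", "platform_x"],
    "characteristics_ch1.0.dataset": ["1", "2"],
    "characteristics_ch1.1.Sex": ["M", "F"],
})

# Keep `title` plus every column starting with `characteristics_ch1`
pheno_small = pheno.filter(regex=r"^(title$|characteristics_ch1)")

# Strip the long prefix, then give `title` a more explicit name
pheno_small.columns = pheno_small.columns.str.replace(
    r"^characteristics_ch1\.\d+\.", "", regex=True
)
pheno_small = pheno_small.rename(columns={"title": "sample_id"})
print(pheno_small)
```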

Focus only on the subset of the 498 patients with the label "1" in the `dataset` column of the phenotype dataframe (there should be 249 of them). Check for missing *numerical* values in this new subset and, if there are any, handle them accordingly.

Finally, join the phenotype and expression dataframes. You can use a Pandas `merge` for this operation.
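
A sketch of the join on toy data. It assumes that the expression table has genes in rows and samples in columns, and that the sample column names match the patient IDs in the phenotype table, so the expression table is transposed first. The column name `sample_id` is a hypothetical choice.

```python
import pandas as pd

# Toy expression table: genes in rows, samples in columns
expr = pd.DataFrame({
    "#Gene": ["MYCN", "CD274"],
    "SEQC_NB001": [5.1, 2.3],
    "SEQC_NB002": [7.9, 1.8],
    "SEQC_NB003": [4.2, 2.9],
})
# Toy phenotype table for a subset of the samples
pheno = pd.DataFrame({
    "sample_id": ["SEQC_NB001", "SEQC_NB002"],
    "death_from_disease": ["0", "1"],
})

# Transpose so that each row is a sample and each column a gene
expr_t = expr.set_index("#Gene").T.rename_axis("sample_id").reset_index()

# Inner join keeps only samples present in both tables;
# validate= raises an error if sample IDs are not unique on either side
merged = pheno.merge(expr_t, on="sample_id", how="inner",
                     validate="one_to_one")
print(merged)
```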

Get some basic statistics out of the data. For example, the number or proportion of patients belonging to each level of the categorical clinical variables, such as "class label", "death from disease", and "mycn status" (hint: `.value_counts()`).

Create barplots of the patient counts for each level of the clinical variables to visually check the patient distribution.
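
The counting and plotting steps above could look like this on toy data. The column names are hypothetical short names for the clinical variables and the values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy clinical table (hypothetical column names, made-up values)
merged = pd.DataFrame({
    "class_label":        ["0", "1", "1", "0", "1", "0"],
    "death_from_disease": ["0", "0", "1", "0", "1", "0"],
    "mycn_status":        ["0", "0", "1", "0", "1", "0"],
})
clinical_vars = ["class_label", "death_from_disease", "mycn_status"]

fig, axes = plt.subplots(1, len(clinical_vars), figsize=(9, 3))
counts = {}
for ax, var in zip(axes, clinical_vars):
    # Number of patients per level; pass normalize=True for proportions
    counts[var] = merged[var].value_counts(dropna=False)
    counts[var].plot.bar(ax=ax, title=var, rot=0)
    ax.set_ylabel("patients")
fig.tight_layout()
```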

Plot the distribution (histogram, density plot) of a few genes, such as MYCN, CD3E, CD274 - globally and stratified by stage, sex, class label, death from disease, low/high risk.
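
A sketch of a global and a stratified histogram for a single gene, using simulated values rather than the real expression data. Sharing the bin edges between groups keeps the stratified histograms comparable.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated data: 30 patients per group, with a shifted mean in group "1"
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MYCN": np.concatenate([rng.normal(5, 1, 30), rng.normal(7, 1, 30)]),
    "death_from_disease": np.repeat(["0", "1"], 30),
})

# Common bin edges computed on all patients
bins = np.histogram_bin_edges(df["MYCN"], bins=20)

fig, (ax_all, ax_strat) = plt.subplots(1, 2, figsize=(9, 3), sharex=True)
ax_all.hist(df["MYCN"], bins=bins)
ax_all.set_title("MYCN, all patients")

# One semi-transparent histogram per level of the clinical variable
for level, values in df.groupby("death_from_disease")["MYCN"]:
    ax_strat.hist(values, bins=bins, alpha=0.5, label=str(level))
ax_strat.set_title("MYCN by death_from_disease")
ax_strat.legend(title="death_from_disease")
fig.tight_layout()
```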

Create a scatterplot matrix (e.g., `sns.pairplot()`) of genes such as GZMH, GZMK, CTLA4, TIGIT, LAMP3, BTLA, KLRK1, KLRC2, CD274. Among these, GZMK, CTLA4, and TIGIT are related to T-cell activity; LAMP3 and BTLA control the function of intratumoral dendritic cells (DCs); KLRK1 and KLRC2 control the function of NK cells; and CD274 encodes the immune checkpoint ligand PD-L1.

Create two versions of the scatterplot matrix: one highlighting the MYCN amplification status and one highlighting the favorable vs. unfavorable outcome.

You are also free to explore other genes or gene combinations.

## (optional) Gene pre-filtering

Filter the genes so as to retain only the most variable ones, i.e. those whose standard deviation across patients is above a predetermined threshold. Remember, choosing the right threshold is critical: you can pick a conservative threshold (keeping more genes) or a more stringent one (keeping fewer genes). You can check a histogram of the gene standard deviations to decide on a reasonable value. Compute the number of "surviving" genes for a few values of the threshold, then pick one.

Save the filtered dataframe to a new variable.
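
A sketch of the variability filter on simulated data with samples in rows and genes in columns. The candidate thresholds are arbitrary illustration values, not recommendations.

```python
import numpy as np
import pandas as pd

# Simulated matrix: 50 samples x 200 genes, each gene with its own spread
rng = np.random.default_rng(42)
n_samples, n_genes = 50, 200
scales = rng.uniform(0.1, 2.0, n_genes)
X = pd.DataFrame(
    rng.normal(0, scales, size=(n_samples, n_genes)),
    columns=[f"gene_{i}" for i in range(n_genes)],
)

# Standard deviation of each gene across samples
gene_sd = X.std(axis=0)

# Number of surviving genes for a few candidate thresholds
for thr in (0.25, 0.5, 1.0):
    print(f"threshold {thr}: {int((gene_sd > thr).sum())} genes kept")

threshold = 0.5  # arbitrary choice for this sketch
X_filtered = X.loc[:, gene_sd > threshold]
```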

## Statistical analysis

1. Conduct statistical tests to determine whether gene expression differs significantly between patient groups (grouped by `death_from_disease`).
2. Adjust the p-values for multiple testing
3. (optional) compute the (log) fold changes for each gene as `mean(gene expr on condition2) - mean(gene expr on condition1)` (gene expressions are already on the log scale)
4. Pick the top 10-ish genes with the lowest p-values
5. Make boxplots for them, breaking down by `death_from_disease` values
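
The testing, adjustment, fold-change, and ranking steps can be sketched on simulated data as follows. Welch's t-test and the Benjamini-Hochberg correction are one reasonable choice among several (a rank-based test would also work); `bh_adjust` is a hypothetical helper written out here for transparency, and library routines such as `multipletests` in `statsmodels` are an alternative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data: 30 patients per group, 100 genes, effect planted in 5 genes
rng = np.random.default_rng(0)
n_per_group, n_genes = 30, 100
genes = [f"gene_{i}" for i in range(n_genes)]
X = pd.DataFrame(rng.normal(5, 1, size=(2 * n_per_group, n_genes)),
                 columns=genes)
y = pd.Series(np.repeat(["0", "1"], n_per_group), name="death_from_disease")
X.loc[y == "1", genes[:5]] += 2.0


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted


g0, g1 = X[y == "0"], X[y == "1"]
# Welch's t-test, run on all genes (columns) at once
_, pvals = stats.ttest_ind(g1.to_numpy(), g0.to_numpy(),
                           axis=0, equal_var=False)

results = pd.DataFrame({
    "pvalue": pvals,
    "padj": bh_adjust(pvals),
    # Data are on the log scale, so a difference of means is a log fold change
    "logfc": (g1.mean() - g0.mean()).to_numpy(),
}, index=genes)

top = results.sort_values("pvalue").head(10)
print(top)
```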

## Machine learning analysis

1. Pick `death_from_disease` as the target variable
2. Split the data into training/test partitions, using for example 70/30% proportions
3. Perform a PCA on the training set with 10 components, print the explained variance ratios, and create a scatterplot of the first two principal components, colouring the points according to the target variable
4. Build one or more classification models to predict the target variable based on gene expressions
5. Evaluate the model(s) using appropriate metrics (e.g. accuracy, precision, recall, MCC)
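
The workflow above can be sketched end to end on synthetic data generated with `make_classification`, which stands in for the real expression matrix. The random forest and its settings are arbitrary illustration choices, not a recommended model.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 "patients", 50 "genes", binary target
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# 70/30 split, stratified so both partitions keep the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# PCA is fitted on the training partition only
pca = PCA(n_components=10).fit(X_train)
print(pca.explained_variance_ratio_.round(3))
pcs = pca.transform(X_train)

fig, ax = plt.subplots()
ax.scatter(pcs[:, 0], pcs[:, 1], c=y_train, cmap="coolwarm", s=15)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")

# Train on the training partition, evaluate on the held-out test partition
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "mcc": matthews_corrcoef(y_test, y_pred),
}
print(metrics)
```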

## Interpretation

1. If the classification model natively ranks the genes, get the top 10-ish ranked genes, otherwise get them from the results of your statistical analysis.
2. Make boxplots for them, breaking down by "death_from_disease" values
3. Use the top 100 ranked genes to conduct a pathway enrichment analysis.

Hints: trained `RandomForestClassifier` objects have a `.feature_importances_` attribute that you can use to rank the features; trained `SVC` objects (with `kernel="linear"` only) have a `.coef_` attribute with similar meaning.
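
A sketch of both ranking strategies on synthetic data, where the first four features are informative by construction. Note that linear SVM coefficients are only comparable across genes when the features are on similar scales, and that taking their absolute value is one common convention rather than the only option.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic stand-in; with shuffle=False the informative features come first
n_genes = 30
X_arr, y = make_classification(n_samples=200, n_features=n_genes,
                               n_informative=4, n_redundant=0,
                               shuffle=False, random_state=0)
genes = [f"gene_{i}" for i in range(n_genes)]
X = pd.DataFrame(X_arr, columns=genes)

# Random forest: impurity-based importances, one value per feature
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rf_rank = pd.Series(rf.feature_importances_, index=genes)
rf_rank = rf_rank.sort_values(ascending=False)

# Linear SVM: rank by absolute coefficient (coef_ has shape (1, n_features)
# for a binary problem)
svm = SVC(kernel="linear").fit(X, y)
svm_rank = pd.Series(np.abs(svm.coef_[0]), index=genes)
svm_rank = svm_rank.sort_values(ascending=False)

top_rf = rf_rank.head(10).index.tolist()
top_svm = svm_rank.head(10).index.tolist()
print(top_rf)
print(top_svm)
```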

## Wrap up

Finally, try to touch up your notebook as if it were a report. Jupyter notebooks are a useful tool for creating dynamic reports that combine text, code, and figures.

---