# Introduction to Machine Learning

**In this practical session you will:**

   - Learn the essential ideas behind Machine Learning, including several statistical concepts and the implementation steps from the point of view of the Data Science lifecycle.
   - Download, explore and implement the preliminary processing of a multi-omics cancer dataset that will be used throughout the course.


## Definition:

Machine learning is a subfield of artificial intelligence (AI) that focuses on developing algorithms and models that allow computers to learn patterns and make predictions or decisions without being explicitly programmed. In the context of data science and mathematical modeling, machine learning plays a crucial role in building models that represent real-world systems using mathematical concepts and language. A subfield of machine learning is deep learning, which uses a type of model called neural networks, inspired by the architecture of the human brain.

[![AI diagram](http://danieljhand.com/images/AI_ML_DL_circles.jpeg)](http://danieljhand.com/the-relationship-between-artificial-intelligence-ai-machine-learning-ml-and-deep-learning-dl.html)



## Characteristics of Machine Learning:

1. **Learning from Data:**
   - Machine learning systems learn from data rather than relying on explicit programming by using some statistical techniques.
   - Algorithms use available data to identify patterns, relationships, and trends.

<!-- Add an empty line here -->

2. **Model Development:**
   - Machine learning involves creating models that can generalize patterns from the training data to make predictions or decisions on new, unseen data.

<!-- Add an empty line here -->

3. **Adaptability:**
   - Machine learning models can adapt and evolve as new data becomes available, making them suitable for dynamic and changing environments.

<!-- Add an empty line here -->

## Data Science Lifecycle:

The data science lifecycle involves several key steps that machine learning implementations follow:

<!-- Add an empty line here -->

1. **Identification of the problem:**
   - Determines the choice of a suitable model and algorithm, and how the training and test datasets are defined.

<!-- Add an empty line here -->

2. **Data Collection:**
   - Gather relevant data from various sources, ensuring it is representative and suitable for the problem at hand.
   - It is highly important to perform exploratory analysis to evaluate the quality of the data and, if suitable, define the subsequent necessary processing steps.

<!-- Add an empty line here -->

3. **Data Processing:** This is probably the most relevant step: regardless of how successfully the previous steps were implemented, if the data does not contain the information relevant to solving the problem, or is presented in an inadequate state for the algorithm to learn from, the resulting model will be useless (garbage-in -> garbage-out). It mainly consists of two steps.

    3.1. **Data Pre-processing:**
    - Clean and preprocess the data to handle missing values, outliers, and format issues.

    <!-- Add an empty line here -->

    3.2. **Feature Engineering:**
    - Necessary in some cases but optional in others.
    - Select or create features that are relevant and informative for the machine learning model.
    - Common approaches are grouped into *Filter-based*, *Wrapper-based* and *Embedded-based* categories.
   
<!-- Add an empty line here -->

4. **Data modelling:** During this iterative process, each model's performance is assessed using different metrics depending on whether the algorithm works with categorical or continuous variables.

    <!-- Add an empty line here -->

    4.1. **Model Training:**
    - Use a learning algorithm to train the model on a labeled dataset, allowing it to learn patterns and relationships.

    <!-- Add an empty line here -->

    4.2. **Model Optimization:**
    - Adjust model parameters and features to improve performance, often involving techniques like hyperparameter tuning. Within this step it is important to avoid overfitting, so that the model can generalize to datasets beyond the training one.

    <!-- Add an empty line here -->

    4.3. **Model Testing:**
    - Validate the model on new, unseen data to ensure it generalizes well (without overfitting) and provides accurate predictions (without underfitting).

<!-- Add an empty line here -->

5. **Deployment:**
   - Deploy the model into a real-world environment, integrating it into decision-making processes.

<!-- Add an empty line here -->
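
The pre-processing and dataset-definition steps above can be sketched with pandas on a tiny made-up table. The column names, the values, the median imputation, the 1.5 × IQR outlier rule and the 25% test fraction are all assumptions chosen for illustration, not the procedure used later in the course.

```python
import numpy as np
import pandas as pd

# Toy dataset (hypothetical values) with a missing entry and an obvious outlier
toy_df = pd.DataFrame({
    "age": [34.0, 41.0, np.nan, 59.0, 61.0, 66.0, 73.0, 75.0],
    "mutations": [1200.0, 1500.0, 1800.0, 2100.0, 250000.0, 2600.0, 2900.0, 3100.0],
})

# Pre-processing: fill missing ages with the median of the observed ones
toy_df["age"] = toy_df["age"].fillna(toy_df["age"].median())

# Pre-processing: flag outliers with the 1.5 * IQR rule and drop them
q1, q3 = toy_df["mutations"].quantile([0.25, 0.75])
iqr = q3 - q1
within_range = toy_df["mutations"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean_df = toy_df[within_range].reset_index(drop=True)

# Dataset definition: shuffle once and keep 25% of the rows aside for testing
rng = np.random.default_rng(seed=0)
shuffled = rng.permutation(len(clean_df))
n_test = int(round(0.25 * len(clean_df)))
test_df = clean_df.iloc[shuffled[:n_test]]
train_df = clean_df.iloc[shuffled[n_test:]]

print(f"{len(train_df)} training rows, {len(test_df)} test rows")
```

Whether an extreme value is an error or a genuine biological signal (for example, a hypermutated tumor) depends on the problem, so dropping outliers automatically is a choice that should be justified case by case.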

[![Data Science Lifecycle](https://www.onlinemanipal.com/wp-content/uploads/2022/09/Data-Science-Life-cycle-768x767.png.webp)](https://www.onlinemanipal.com/blogs/data-science-lifecycle-explained)

## Types of Machine Learning

Machine learning is broadly categorized into several types, each serving different purposes and solving distinct problems. Here are the main types:

<!-- Add an empty line here -->

[![AI diagram](https://www.freecodecamp.org/news/content/images/2020/08/ml-1.png)](https://www.freecodecamp.org/news/machine-learning-for-managers-what-you-need-to-know/)

<!-- Add an empty line here -->

### Supervised Learning

In supervised learning, the algorithm is trained on a **labeled** dataset, where each input is paired with the corresponding output. The goal is to learn a mapping from inputs to outputs, and hence, **predict an output based on input**.

The usefulness of these models can be evaluated directly, since both the inputs and the corresponding correct outputs are provided in the testing dataset.


**a. Regression:**
   - **Objective:** Predict a continuous target variable.
   - **Examples:** Linear Regression, Polynomial Regression.

**b. Classification:**
   - **Objective:** Predict a discrete target variable (class labels).
   - **Examples:** Logistic Regression, Decision Trees or Random Forest and Support Vector Machines.

<!-- Add an empty line here -->
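
As a minimal sketch of supervised regression, the snippet below fits a straight line to labeled toy data with NumPy and evaluates it on held-out samples. The data are synthetic, and the linear relationship, noise level and 150/50 split are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Labeled toy data: each input x is paired with its output y
x = rng.uniform(20, 80, size=200)
y = 30.0 * x + 500.0 + rng.normal(0, 50, size=200)  # assumed linear trend plus noise

# Hold out the last 50 samples for testing
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Training: fit slope and intercept by ordinary least squares
design = np.column_stack([x_train, np.ones_like(x_train)])
(slope, intercept), *_ = np.linalg.lstsq(design, y_train, rcond=None)

# Testing: the correct outputs are known, so the error can be measured directly
y_pred = slope * x_test + intercept
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

print(f"slope={slope:.2f}, intercept={intercept:.1f}, test RMSE={rmse:.1f}")
```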

### Unsupervised Learning

Unsupervised learning involves training on **unlabeled** data, and the algorithm tries to **discover patterns or relationships in the data** without explicit guidance on the output.

Since the output is unknown in the training data, the usefulness is implicitly derived from the structure and relationships discovered in the data.

**a. Clustering:**
   - **Objective:** Group similar data points together.
   - **Examples:** K-Means Clustering, Hierarchical Clustering.

**b. Dimensionality Reduction:**
   - **Objective:** Reduce the number of input features while preserving important information. It is also commonly used as a pre-processing step for feature extraction.
   - **Examples:** Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE).

**c. Association Rule Learning:** It will not be covered in this course.
   - **Objective:** Discover interesting relationships between variables in large datasets.
   - **Examples:** Apriori Algorithm, Eclat Algorithm.
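
To illustrate clustering, here is a minimal K-Means sketch written with NumPy only. The function name `kmeans`, the fixed number of iterations and the two synthetic groups are assumptions for illustration; established implementations add better initialization and convergence checks.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal K-Means: alternate point assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated toy groups; the algorithm never sees which group a point came from
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

labels, centroids = kmeans(points, k=2)
print(centroids.round(2))
```

The cluster numbers themselves (0 or 1) are arbitrary: only the grouping carries information, which is why unsupervised results are judged by the structure they reveal rather than by comparison with known labels.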


### Reinforcement Learning

Reinforcement learning involves an **agent interacting with an environment**, learning to make decisions by receiving feedback in the form of rewards or penalties (mimicking human trial-and-error behaviour). Hence, the objective is to **learn a policy for making decisions that achieve the optimal result**.

In this type of algorithm, the performance depends on the feedback (reward or penalty) that the environment provides to the agent after each action, guiding it towards learning a successful policy (https://www.youtube.com/@aiwarehouse). It is usually employed for training AI for video games rather than for -omics data analysis, so it won't be covered in this course.

**a. Model-Based Reinforcement Learning:**

   - **Objective:** Build an explicit model of the environment to make decisions.
   - **Examples:** Monte Carlo Tree Search.

**b. Model-Free Reinforcement Learning:**

   - **Objective:** Learn to make decisions without an explicit model of the environment.
   - **Examples:** Q-Learning, Deep Q Network (DQN).

## Relationship with Statistical Concepts

1. **Pattern Recognition:**
   - Machine learning involves finding patterns in data, a concept deeply rooted in statistics.
   - Depending on the type of problem, and hence the algorithm employed, different kinds of patterns can be extracted from data.

<!-- Add an empty line here -->

[![Pattern types](https://www.researchgate.net/profile/Gordon-Elger/publication/352727978/figure/fig2/AS:1153327744192512@1651986170131/Machine-learning-tasks-most-relevant-for-PdM.png)](https://www.researchgate.net/figure/Machine-learning-tasks-most-relevant-for-PdM_fig2_352727978)

<!-- Add an empty line here -->

2. **Cross-Validation:** A key concept for supervised models when the available dataset is smaller than would be optimal for validation purposes.
   - To assess a supervised model's generalization ability, cross-validation techniques are used to evaluate performance on multiple subsets of the data.
   - There are multiple methodologies (https://www.turing.com/kb/different-types-of-cross-validations-in-machine-learning-and-their-explanations), although the most common is **K-fold cross-validation**, which involves partitioning the entire dataset into k random subsets (folds), where k-1 are used for training and 1 for testing. This is repeated k times, so that each fold is used once for testing, and the model is evaluated through the metrics obtained across iterations.

<!-- Add an empty line here -->

[![K-fold cross-validation](https://d2mk45aasx86xg.cloudfront.net/image5_11zon_af97fe4b03.webp)](https://www.turing.com/kb/different-types-of-cross-validations-in-machine-learning-and-their-explanations)

<!-- Add an empty line here -->
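
K-fold cross-validation can be sketched in a few lines of NumPy. The helper `k_fold_indices`, the synthetic data and the choice of a linear model with k = 5 are assumptions for illustration only.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Toy regression data (hypothetical): y depends linearly on x plus noise
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(x), k=5):
    # Train on k-1 folds
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    # Test on the held-out fold
    residuals = y[test_idx] - (slope * x[test_idx] + intercept)
    fold_errors.append(np.mean(residuals ** 2))

mean_mse = np.mean(fold_errors)  # summary across the k iterations
print(f"MSE per fold: {np.round(fold_errors, 2)}, mean: {mean_mse:.2f}")
```

Reporting the spread of the per-fold errors alongside their mean gives an idea of how much the evaluation depends on which samples end up in the test fold.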

3. **Bias-Variance Tradeoff:** Also a key concept when dealing with supervised models.
   - In statistics, the bias of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated. On the other hand, the variance of an estimator measures how much the estimates are likely to vary or spread out around their expected value through repeated sampling.

<!-- Add an empty line here -->

   [![Bias and Variance](https://nvsyashwanth.github.io/machinelearningmaster/assets/images/bias_variance.jpg)](https://nvsyashwanth.github.io/machinelearningmaster/bias-variance/)

<!-- Add an empty line here -->

   - If we consider machine learning predictions as estimations, these two concepts acquire the following meaning in this context.
       
       - **Variance** is the variability of the model predictions for a particular sample instance (for instance, when training the model multiple times on different subsets of the training dataset). In other words, it is the sensitivity of the model to the randomness of the training dataset.
       
       - In contrast, **Bias** can be seen as the average distance between the predictions and the correct values (the labels) if we rebuild the model multiple times with different training datasets. Therefore, it measures the systematic error that is not due to randomness in the training data.
             
   - These two concepts are intrinsically related, and therefore the bias-variance tradeoff is a fundamental concept in machine learning: there is an optimal model complexity that allows for good performance on the training data while keeping the ability to generalize to new data. In complex omics datasets, we often face the **Curse of Dimensionality**: the number of features far exceeds the number of samples, which makes overfitting more likely.

      * **High Bias (Underfitting):** The model is too simple (e.g., linear regression on non-linear biological data).
      * **High Variance (Overfitting):** The model captures noise (e.g., a decision tree with infinite depth memorizing patient IDs).
   
<!-- Add an empty line here -->

   [![Underfitting and overfitting](https://www.endtoend.ai/assets/blog/misc/bias-variance-tradeoff-in-reinforcement-learning/underfit_right_overfit.png)](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)

<!-- Add an empty line here -->
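
The effect of model complexity can be explored with a small experiment: polynomials of increasing degree are fitted to noisy samples of a sine curve, and the error is measured on both the training points and a separate test set. The sine function, the noise level and the degrees (1, 3 and 9) are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def true_function(x):
    return np.sin(2 * np.pi * x)

# Few noisy training points, many test points from the same process
x_train = np.linspace(0, 1, 15)
y_train = true_function(x_train) + rng.normal(0, 0.2, size=15)
x_test = np.linspace(0, 1, 200)
y_test = true_function(x_test) + rng.normal(0, 0.2, size=200)

results = {}
for degree in (1, 3, 9):
    coefficients = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coefficients, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefficients, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The training error can only decrease as the degree grows, because each polynomial contains the simpler ones as special cases. The straight line (degree 1) is too rigid for a sine wave, which is the high-bias situation, whereas a gap between a very low training error and a higher test error for the most flexible model is the usual symptom of high variance.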

4. **Statistical Metrics:**
   - Various statistical metrics are used to quantify the performance of machine learning models, mostly supervised ones but also unsupervised ones.
   - The type of metric depends on the type of problem and algorithm used.
   
   <!-- Add an empty line here -->
   
   [![Supervised metrics](https://www.kdnuggets.com/wp-content/uploads/anello_machine_learning_evaluation_metrics_theory_overview_11.png)](https://www.kdnuggets.com/machine-learning-evaluation-metrics-theory-and-overview)
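
A few common supervised metrics can be computed by hand to see what they measure. The labels and predictions below are made up for illustration.

```python
import numpy as np

# Classification: hypothetical true labels and predictions (1 = positive class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Regression: hypothetical continuous targets and predictions
t_true = np.array([3.0, 5.0, 7.0, 9.0])
t_pred = np.array([2.5, 5.5, 7.0, 10.0])

mse = np.mean((t_true - t_pred) ** 2)
r2 = 1 - np.sum((t_true - t_pred) ** 2) / np.sum((t_true - t_true.mean()) ** 2)

print(f"accuracy={accuracy}, precision={precision}, recall={recall}, f1={f1}")
print(f"MSE={mse}, R2={r2}")
```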

## Use case: Cancer genomics

Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. At the very core of the etiology of cancer are somatic mutations: permanent alterations in the genetic material (resulting either from spontaneous errors during DNA replication or from DNA damage) that originate throughout somatic development (from the very first mitotic divisions of the zygote to adult human tissues).

As sequencing technologies have advanced over the past decade, the number of available tumoral whole genomes has increased exponentially, revealing that different tumors accumulate mutations with a variability of up to three orders of magnitude.

<!-- Add an empty line here -->

![ICGC TMB](images/ICGC_muts.png)

<!-- Add an empty line here -->

Not only does the total number of mutations vary, but also their composition. The endogenous mutational processes active in a tissue, as well as the mutagens a person has been exposed to during their lifetime, e.g. ultraviolet (UV) light or tobacco smoking, define a set of probabilities for each nucleotide to mutate given its neighboring sequence. These probabilities can be inferred from the observed data and decomposed into several components that roughly reflect the individual mutational processes affecting the cell, the so-called ‘mutation signatures’, some of which are linked to specific mechanisms.
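
As a rough sketch of the starting point for such an analysis, the snippet below summarizes a handful of made-up substitutions by the base that changes and its two neighbors, and turns the counts into relative frequencies. The mutation list and the helper `mutation_type` are hypothetical. The decomposition into signatures itself is a separate step that is not shown here.

```python
from collections import Counter

# Hypothetical somatic substitutions: (trinucleotide context of the reference, mutated base).
# The middle letter of the context is the base that changes.
mutations = [
    ("TCC", "T"), ("TCC", "T"), ("CCC", "T"), ("TCA", "T"),
    ("ACG", "A"), ("CCG", "A"), ("TCC", "T"), ("GCA", "A"),
]

def mutation_type(context, alt):
    """Label a substitution as 5'[ref>alt]3', e.g. T[C>T]C."""
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"

counts = Counter(mutation_type(context, alt) for context, alt in mutations)
total = sum(counts.values())

# Relative frequency of each mutation type: the mutational profile of the sample
profile = {mutation: n / total for mutation, n in counts.items()}

for mutation, frequency in sorted(profile.items(), key=lambda item: -item[1]):
    print(f"{mutation}: {frequency:.3f}")
```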

<!-- Add an empty line here -->

<!-- Add an empty line here -->

**Tobacco-related signature of single base substitutions (SBS) 4**

<!-- Add an empty line here -->

[![Tobacco Signature](https://cog.sanger.ac.uk/cosmic-signatures-production/images/v2_signature_profile_4.original.png)](https://cancer.sanger.ac.uk/signatures/signatures_v2/)

<!-- Add an empty line here -->

<!-- Add an empty line here -->

**Ultraviolet light-related signature of single base substitutions (SBS) 7**

<!-- Add an empty line here -->

[![UV Signature](https://cog.sanger.ac.uk/cosmic-signatures-production/images/v2_signature_profile_7.original.png)](https://cancer.sanger.ac.uk/signatures/signatures_v2/)

<!-- Add an empty line here -->

<!-- Add an empty line here -->

Hence, the study of mutations within the Cancer Genomics field, integrated with other -omic data such as transcriptomics or epigenomics as well as clinical data, has paved the way for the latest advances in Cancer Research.

---

Several international consortia have generated multi-omic cancer datasets. One of them is the Pan-Cancer Analysis of Whole Genomes (PCAWG) initiative, framed within The Cancer Genome Atlas (TCGA). Publicly available data was stored at the International Cancer Genome Consortium (ICGC) database: https://dcc.icgc.org/releases/PCAWG

UPDATE: The PCAWG data that was available at the ICGC portal is no longer publicly accessible due to the portal's closure on June 24, 2024. The large files that cannot be stored directly on GitHub are available through the Google Drive link in the documentation.

Some files are particularly interesting for analysis with Machine Learning techniques:

* **Clinical:** `pcawg_donor_clinical_August2016_v9.xlsx`
* **Sample Sheet:** `pcawg_sample_sheet.tsv`
* **RNA-Seq:** `pcawg.rnaseq.transcript.expr.tpm.tsv.gz`
* **Mutational Signatures:** `SignatureAnalyzer_COMPOSITE.SBS.txt`

All files are available in the data folder already, with the exception of **pcawg.rnaseq.transcript.expr.tpm.tsv.gz** which needs to be downloaded from https://drive.google.com/file/d/1V2_deNxYowAG2mvb9OqMCg0WC2uPX3LM

### Exploratory analysis: Clinical data

In [1]:
import pandas as pd
import plotly.express as px

clinical_df = pd.read_excel('data/pcawg_donor_clinical_August2016_v9.xlsx')
clinical_df

Unnamed: 0,# donor_unique_id,project_code,icgc_donor_id,submitted_donor_id,tcga_donor_uuid,donor_sex,donor_vital_status,donor_diagnosis_icd10,first_therapy_type,first_therapy_response,donor_age_at_diagnosis,donor_survival_time,donor_interval_of_last_followup,tobacco_smoking_history_indicator,tobacco_smoking_intensity,alcohol_history,alcohol_history_intensity,donor_wgs_included_excluded
0,BRCA-UK::CGP_donor_1114930,BRCA-UK,DO1000,CGP_donor_1114930,,female,alive,,other therapy,,61.0,,,Smoking history not documented,,Don't know/Not sure,Not Documented,Included
1,BRCA-UK::CGP_donor_1069291,BRCA-UK,DO1001,CGP_donor_1069291,,female,,,other therapy,,41.0,,,Smoking history not documented,,Don't know/Not sure,Not Documented,Included
2,BRCA-UK::CGP_donor_1114881,BRCA-UK,DO1002,CGP_donor_1114881,,female,alive,,other therapy,unknown,39.0,,,Smoking history not documented,,Don't know/Not sure,Not Documented,Included
3,BRCA-UK::CGP_donor_1114929,BRCA-UK,DO1003,CGP_donor_1114929,,female,alive,C50.4,chemotherapy,unknown,34.0,,,Smoking history not documented,,Don't know/Not sure,Not Documented,Included
4,BRCA-UK::CGP_donor_1167078,BRCA-UK,DO1004,CGP_donor_1167078,,female,deceased,,other therapy,,59.0,,0.0,Smoking history not documented,,Don't know/Not sure,Not Documented,Included
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2829,COAD-US::b08b5f49-9434-4653-9772-097ec29b2ca3,COAD-US,DO9708,TCGA-D5-6540,b08b5f49-9434-4653-9772-097ec29b2ca3,male,alive,C18.0,,,66.0,,186.0,,,,,GrayList
2830,COAD-US::bfb07784-693b-4c25-874e-4ad6e04a5d46,COAD-US,DO9732,TCGA-AA-3529,bfb07784-693b-4c25-874e-4ad6e04a5d46,female,deceased,C18.7,,,78.0,0.0,,,,,,Included
2831,COAD-US::7d8eab0a-e6c8-4449-9ebf-50c41db94a06,COAD-US,DO9788,TCGA-A6-2681,7d8eab0a-e6c8-4449-9ebf-50c41db94a06,female,alive,C18.9,,,73.0,,552.0,,,,,Included
2832,COAD-US::e457344d-76fb-46bf-b362-61a6e811d131,COAD-US,DO9876,TCGA-AA-A00N,e457344d-76fb-46bf-b362-61a6e811d131,male,deceased,C18.0,,,75.0,122.0,,,,,,Included


Notice that, except for the sex and vital status features, the missing-value category (NaN, from NumPy) is the most common category across the patient information. This will complicate analyses that use these clinical phenotypic data.

In [2]:
# Calculate the percentage of missing values for each column
na_counts = clinical_df.isna().sum() / len(clinical_df) * 100
na_df = na_counts.reset_index()
na_df.columns = ['Column', 'Missing_Percentage']
na_df = na_df.sort_values(by='Missing_Percentage', ascending=True)

# Create a horizontal bar chart
fig = px.bar(
    na_df,
    x='Missing_Percentage',
    y='Column',
    orientation='h',
    title="<b>Missing Data (NA) Prevalence</b>",
    labels={'Missing_Percentage': 'Percentage of Missing Values (%)', 'Column': 'Feature'},
    text_auto='.1f',  # Show values on bars
    template='plotly_white',
    color='Missing_Percentage',
    color_continuous_scale='Reds' # Red highlights the "danger" of missing data
)

fig.update_layout(height=600)
fig.show()

We will use some clinical data as predictors for regression (predicting mutations from age at diagnosis), so we need to check whether age is normally distributed.

In [3]:
fig_age = px.histogram(
    clinical_df, 
    x="donor_age_at_diagnosis", 
    color="donor_vital_status",
    nbins=50,
    title="<b>Distribution of Donor Age</b>",
    labels={"donor_age_at_diagnosis": "Age at Diagnosis", "count": "Number of Patients"},
    opacity=0.7,
    template="plotly_white"
)

fig_age.show()
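
The histogram gives a visual impression, and it can be complemented with simple numeric summaries. The sketch below uses a synthetic stand-in for the age column (an assumption, not the real PCAWG values) and computes the skewness and excess kurtosis with pandas, both of which are close to 0 for a normal distribution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `donor_age_at_diagnosis` (assumption: not the real PCAWG values)
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=60, scale=12, size=2000)).clip(lower=0)
ages.iloc[::50] = np.nan  # mimic missing entries

observed = ages.dropna()
skewness = observed.skew()         # close to 0 for a symmetric distribution
excess_kurtosis = observed.kurt()  # close to 0 for a normal distribution

print(f"n={len(observed)}, skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```

With the real data, the same two calls could be applied to `clinical_df['donor_age_at_diagnosis']`. A formal normality test is another option, although with thousands of donors such tests tend to flag even small, harmless deviations.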

### Exploratory analysis: samples

Now we can check the file that relates the specimens and samples extracted from each donor. It is just a file that helps connect other files by their IDs, so let's have a quick look.

In [4]:
## It is a tab-separated file, so we read it with the read_csv function, specifying the tab character \t as the separator
## Moreover, we indicate that the file has a header that should be inferred as the column names.
sample_df = pd.read_csv('data/pcawg_sample_sheet.tsv', sep='\t', header='infer')
sample_df

Unnamed: 0,donor_unique_id,donor_wgs_exclusion_white_gray,submitter_donor_id,icgc_donor_id,dcc_project_code,aliquot_id,submitter_specimen_id,icgc_specimen_id,submitter_sample_id,icgc_sample_id,dcc_specimen_type,library_strategy
0,BLCA-US::096b4f32-10c1-4737-a0dd-cae04c54ee33,Whitelist,096b4f32-10c1-4737-a0dd-cae04c54ee33,DO804,BLCA-US,e0fccaf5-925a-41f9-b87c-cd5ee4aecb59,27461a27-26eb-4c2c-9c54-e16fbd32c615,SP1682,e0fccaf5-925a-41f9-b87c-cd5ee4aecb59,SA5237,Normal - solid tissue,WGS
1,BLCA-US::096b4f32-10c1-4737-a0dd-cae04c54ee33,Whitelist,096b4f32-10c1-4737-a0dd-cae04c54ee33,DO804,BLCA-US,301d6ce3-4099-4c1d-8e50-c04b7ce91450,52f538ef-b05d-4c76-9976-ce6d49158016,SP1677,301d6ce3-4099-4c1d-8e50-c04b7ce91450,SA5195,Primary tumour - solid tissue,WGS
2,BLCA-US::096b4f32-10c1-4737-a0dd-cae04c54ee33,Whitelist,096b4f32-10c1-4737-a0dd-cae04c54ee33,DO804,BLCA-US,22e154de-0e3b-443b-8420-48d68d6c1ce4,52f538ef-b05d-4c76-9976-ce6d49158016,SP1677,22e154de-0e3b-443b-8420-48d68d6c1ce4,SA5213,Primary tumour - solid tissue,RNA-Seq
3,BLCA-US::178b28cd-99c3-48dc-8d09-1ef71b4cee80,Whitelist,178b28cd-99c3-48dc-8d09-1ef71b4cee80,DO555,BLCA-US,c1da8eed-4919-4ba5-a735-3fba476c18a7,cd3cfb26-e66f-408e-81f6-3b61c247c976,SP1135,c1da8eed-4919-4ba5-a735-3fba476c18a7,SA1598,Normal - blood derived,WGS
4,BLCA-US::178b28cd-99c3-48dc-8d09-1ef71b4cee80,Whitelist,178b28cd-99c3-48dc-8d09-1ef71b4cee80,DO555,BLCA-US,4838b5a9-968c-4178-bffb-3fafe1f6dc09,59d6683f-5eb7-493d-8e8e-78b88be2cd70,SP1132,4838b5a9-968c-4178-bffb-3fafe1f6dc09,SA1556,Primary tumour - solid tissue,WGS
...,...,...,...,...,...,...,...,...,...,...,...,...
7250,UCEC-US::fba80122-d8b2-4d8d-a032-9767e8160f9f,Whitelist,fba80122-d8b2-4d8d-a032-9767e8160f9f,DO42544,UCEC-US,e54b7e44-82a3-4016-bc32-129799097b4c,ddd2f9e0-0aa3-425b-9d74-74fcc638cb08,SP92947,e54b7e44-82a3-4016-bc32-129799097b4c,SA462448,Primary tumour - solid tissue,RNA-Seq
7251,UCEC-US::fba80122-d8b2-4d8d-a032-9767e8160f9f,Whitelist,fba80122-d8b2-4d8d-a032-9767e8160f9f,DO42544,UCEC-US,ce5b0ba0-2777-4c92-ac50-483174cc5dca,cad6c89d-f722-470b-93bc-e8b24c033f0f,SP92955,ce5b0ba0-2777-4c92-ac50-483174cc5dca,SA462509,Normal - tissue adjacent to primary,RNA-Seq
7252,UCEC-US::ffaa98a0-2b69-46dc-aee5-c5c3f2abbc38,Whitelist,ffaa98a0-2b69-46dc-aee5-c5c3f2abbc38,DO42432,UCEC-US,47f826a1-96ed-4f4d-94e0-49f4460ef44f,2729ed97-f971-4d98-8baa-f99404dd2b9f,SP92731,47f826a1-96ed-4f4d-94e0-49f4460ef44f,SA461078,Normal - blood derived,WGS
7253,UCEC-US::ffaa98a0-2b69-46dc-aee5-c5c3f2abbc38,Whitelist,ffaa98a0-2b69-46dc-aee5-c5c3f2abbc38,DO42432,UCEC-US,712ba532-fb1a-43fa-a356-b446b509ceb7,8bb3a057-8958-4f62-af81-976da2e92df7,SP92723,712ba532-fb1a-43fa-a356-b446b509ceb7,SA461016,Primary tumour - solid tissue,WGS


Note that several specimens are extracted from the same donor, usually one from normal tissue and another from a primary tumor. To find somatic mutations through WGS, it is necessary to remove germline variants, which are present in both normal and tumoral tissue; that is why the variants found in normal tissue are usually subtracted from the tumor mutation calls.
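
The idea behind this subtraction can be illustrated with Python sets. The variant calls below are entirely made up, and real somatic callers rely on statistical models of both samples rather than a plain set difference.

```python
# Hypothetical variant calls written as (chromosome, position, reference, alternative)
normal_calls = {
    ("chr1", 1_000_100, "A", "G"),  # germline variant
    ("chr2", 5_500_000, "C", "T"),  # germline variant
}
tumour_calls = {
    ("chr1", 1_000_100, "A", "G"),   # germline, also seen in the tumour
    ("chr2", 5_500_000, "C", "T"),   # germline, also seen in the tumour
    ("chr7", 20_000_000, "A", "T"),  # only in the tumour: candidate somatic mutation
    ("chr17", 3_000_000, "C", "T"),  # only in the tumour: candidate somatic mutation
}

# Set difference: keep what is in the tumour but absent from the matched normal
somatic_calls = tumour_calls - normal_calls

for call in sorted(somatic_calls):
    print(call)
```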

In [5]:
donor_specimen_counts = sample_df.groupby('donor_unique_id')['icgc_specimen_id'].nunique().reset_index()
donor_specimen_counts.columns = ['Donor_ID', 'Specimen_Count']

fig_count = px.histogram(
    donor_specimen_counts,
    x="Specimen_Count",
    title="<b>Distribution of Specimens per Donor</b>",
    labels={'Specimen_Count': 'Number of Specimens Collected', 'count': 'Number of Donors'},
    text_auto=True,  # Show the exact count on top of the bars
    template="plotly_white",
    color_discrete_sequence=['teal']
)

# Force the X-axis to show integers (1, 2, 3...) rather than a range
fig_count.update_layout(xaxis=dict(dtick=1))
fig_count.show()

In some projects, some donors have more than the normal-tumoral pair because additional specimens were collected. Moreover, notice that several samples might be taken from the same tumoral specimen to obtain information with different techniques: in this case, WGS or RNA-Seq. This will be relevant for mapping multi-omic information across samples from the same tumoral specimen of a given donor.
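
One way to find the specimens that were profiled with both techniques is to pivot the sample sheet so that each technique becomes a column. The sketch below uses a toy table with the same column names as `pcawg_sample_sheet.tsv` but made-up values. It assumes at most one aliquot per specimen and technique, which may not hold in the real file.

```python
import pandas as pd

# Toy sample sheet using the column names of `pcawg_sample_sheet.tsv` (values are made up)
toy_samples = pd.DataFrame({
    "icgc_specimen_id": ["SP1", "SP2", "SP2", "SP3", "SP4", "SP4"],
    "aliquot_id": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "dcc_specimen_type": [
        "Normal - solid tissue",
        "Primary tumour - solid tissue", "Primary tumour - solid tissue",
        "Primary tumour - solid tissue",
        "Primary tumour - solid tissue", "Primary tumour - solid tissue",
    ],
    "library_strategy": ["WGS", "WGS", "RNA-Seq", "WGS", "WGS", "RNA-Seq"],
})

# One row per specimen, one column per technique, each cell holding the aliquot ID
aliquots_by_strategy = toy_samples.pivot(
    index="icgc_specimen_id", columns="library_strategy", values="aliquot_id"
)

# Specimens profiled with both techniques can link mutations to expression
paired_specimens = aliquots_by_strategy.dropna(subset=["WGS", "RNA-Seq"])
print(paired_specimens)
```

If a specimen had several aliquots for the same technique, `pivot` would raise an error, and an aggregation such as `pivot_table` with an explicit rule for choosing the aliquot would be needed instead.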

In [6]:
# Group by Project -> Specimen Type -> Library Strategy
# This effectively tracks the "Sample" level (e.g., is this sample WGS or RNA-Seq?)
hierarchy_counts_detailed = sample_df.groupby(
    ['dcc_project_code', 'dcc_specimen_type', 'library_strategy']
).size().reset_index(name='count')

# Create the 3-Level Sunburst Plot
fig = px.sunburst(
    hierarchy_counts_detailed,
    path=['dcc_project_code', 'dcc_specimen_type', 'library_strategy'], # Added 3rd level
    values='count',
    title="<b>Multi-Omics Data Hierarchy</b>: Project > Specimen > Strategy",
    color='count', 
    color_continuous_scale='ice', # 'ice' scale is distinct and clean
    height=800
)

fig.update_layout(margin=dict(t=50, l=0, r=0, b=0))
fig.show()

Finally, we add the primary location label for each project and save a sample dataframe that is ready for analysis.

In [7]:
from os import path

# Merge it with a file provided in the data folder to extract the primary site
project_info = pd.read_csv(path.join('data', 'projects_PCAWG_info.txt'), sep='\t', header='infer')

primary_location_dict = dict(zip(project_info.project, project_info.primary_location))

sample_df['primary_location'] = sample_df['dcc_project_code'].map(primary_location_dict)

# Save it on the data folder for later uses
sample_df.to_csv(path.join('data', 'sample_df.tsv'), sep='\t', index=False)

### Exploratory analysis: RNA-seq

Next, we can explore the file with the expression data. Remember you should **download** it at https://drive.google.com/file/d/1V2_deNxYowAG2mvb9OqMCg0WC2uPX3LM and **save it into the data folder**.

In [8]:
# Load the expression data
expression_df = pd.read_csv(path.join('data', 'pcawg.rnaseq.transcript.expr.tpm.tsv.gz'), sep='\t', header='infer', compression='gzip')
expression_df.head()

Unnamed: 0,Transcript,b337121c-9821-4644-820e-b8c477f6c38a,612ef912-5a28-4c11-8703-3376f51afef5,56a705f4-fd28-44ff-8a3c-85bc4300c760,28239a0e-9990-49ef-a159-ba63fb078c77,fbf2f1c0-fb91-4548-8653-68021e6541f9,01b7a29e-0d3a-4a18-8cb6-f8a329f9d6e6,4aa70762-1a0e-4b38-be77-a89db0955193,2de8f5dd-decd-47f4-856a-bda678ee6ab8,d8164b02-4b3c-454d-945b-2838edb1b5b1,...,df0df04e-bb65-4dd6-91db-753eaa86a0c3,ccdfd8a3-2bc7-474c-b80d-96e8741fe8bc,06dbb69c-390b-4174-859b-c1e20380b483,6a3488f6-b364-426c-bb80-77acc0da4ff8,52739598-8083-45b3-b89e-4d841cb15d7a,e80aa591-ed43-4486-a7cd-0ea654f10983,27b37399-0a77-400e-9dff-0c9c26455144,6d98d09b-cdec-4f71-afb3-2c53bee990c4,9734c685-9f30-4c0b-b9ba-06bec4b3ca64,88E25F76-1B1F-4AFB-AB08-215D4F7F08B5
0,ENST00000335137.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ENST00000423372.3,44.6,69.58,35.31,24.14,12.1,84.5,68.78,19.09,12.46,...,66.76,97.87,76.92,159.5,106.01,69.07,54.83,141.96,86.25,37.78
2,ENST00000426406.1,0.06,0.05,0.02,0.02,0.01,0.03,0.08,0.02,0.04,...,0.03,0.2,0.02,0.02,0.0,0.05,0.01,0.0,0.0,0.01
3,ENST00000332831.2,0.06,0.05,0.02,0.02,0.01,0.03,0.08,0.02,0.04,...,0.03,0.2,0.02,0.02,0.0,0.05,0.01,0.0,0.0,0.01
4,ENST00000599533.1,1.6,1.86,1.48,0.0,1.02,1.84,0.94,1.25,1.05,...,0.31,2.91,3.45,0.63,7.81,4.76,1.96,3.93,0.92,5.05


Here the first column shows the Ensembl transcript ID, and the remaining columns are named after the aliquot ID (present in the **pcawg_sample_sheet.tsv** file).

If we want to add gene IDs or symbols instead of transcript IDs, we will need to process the file that the PCAWG consortium uses for annotation. The file is already present in your data folder, but it needs some preprocessing.

We will perform the preprocessing with a bash script. In Jupyter notebooks, you can use interpreters other than Python. The script below processes the necessary file (if you don't have a Linux-based operating system, skip this step; the final processed file is already provided in the data folder).

In [9]:
%%bash
# The %% above indicates to the jupyter notebook to use bash as interpreter

# Change into the data directory
cd data

# Process the file using a bash code
zcat rnaseq.gc19_extNc.gtf.tar.gz | cut -f9 | cut -d';' -f2 | sed 's/.*gencode::\([^:]*\)::tc_\([^._]*\)[^:]*::\([^._]*\)[^:]*.*/\1\t\2\t\3/' | sort | uniq | tail -n +3 | gzip > gencode_transcript.tsv.gz
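
For readers who cannot run bash, the extraction performed by the `sed` expression can be reproduced with a Python regular expression. The attribute string below is a hypothetical example shaped to match the pattern; the real field layout inside the annotation file, and the meaning given to each captured field, are assumptions.

```python
import re

# Hypothetical attribute string shaped to match the pattern targeted by the sed command;
# the real layout of this field in the annotation file is an assumption.
attribute = 'transcript_id "gencode::GENE1::tc_ENST00000000001.1::ENSG00000000001.1"'

# Same three capture groups as the sed expression: the text after "gencode::",
# the ID after "tc_" without its version, and the following ID without its version
pattern = re.compile(r"gencode::([^:]*)::tc_([^._]*)[^:]*::([^._]*)")

match = pattern.search(attribute)
if match:
    name, transcript_id, gene_id = match.groups()  # field meanings are assumed
    parsed_line = "\t".join([name, transcript_id, gene_id])
    print(parsed_line)
```

Unlike the greedy `sed` expression, `search` stops at the first occurrence of `gencode::` in the string; both behave the same when the marker appears only once per line.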

To summarize, so far we have:

**pcawg.rnaseq.transcript.expr.tpm.tsv.gz**: A large file whose first column is the Ensembl transcript ID and whose remaining columns are each named after an aliquot ID.

**sample_df.tsv**: A file that contains the relationship between donors, specimens and samples. The aliquot ID is also a column of this file.

**gencode_transcript.tsv.gz**: A file that contains the transcript information.

For the analyses in the following sessions, we will need to process the expression data, specifically:

- A) Get only the expression for tumoral specimens (which will not cover all the tumoral specimens of PCAWG).

- B) Get the information at the gene level, rather than at the transcript level.
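
Step B can be sketched on a toy matrix: transcripts are mapped to their gene and the TPM values of all transcripts of the same gene are added up. The identifiers, the values and the `transcript_to_gene` lookup are hypothetical, and summing TPM across transcripts is a common convention assumed here, not necessarily the aggregation used later in the course.

```python
import pandas as pd

# Toy transcript-level TPM matrix (made-up values): rows are transcripts, columns are specimens
toy_expression = pd.DataFrame(
    {"SP1": [10.0, 5.0, 0.5], "SP2": [20.0, 0.0, 1.5]},
    index=pd.Index(["ENST_A1", "ENST_A2", "ENST_B1"], name="Transcript"),
)

# Hypothetical transcript-to-gene lookup, e.g. built from an annotation table
transcript_to_gene = {"ENST_A1": "GENE_A", "ENST_A2": "GENE_A", "ENST_B1": "GENE_B"}

# A dict passed to groupby maps each index label (transcript) to its group (gene)
gene_expression = toy_expression.groupby(transcript_to_gene).sum()
gene_expression.index.name = "Gene"

print(gene_expression)
```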

In [10]:
# Open expression matrix
expression_matrix = pd.read_csv(path.join('data', 'pcawg.rnaseq.transcript.expr.tpm.tsv.gz'),
                                                                    sep="\t", header='infer', compression='gzip')

# Remove version
expression_matrix['Transcript'] = (
    expression_matrix['Transcript']
    .str.extract(r'^(\w+)\.\w+$')
)

expression_matrix = expression_matrix.set_index('Transcript', drop=True)
expression_matrix

Transcript,b337121c-9821-4644-820e-b8c477f6c38a,612ef912-5a28-4c11-8703-3376f51afef5,56a705f4-fd28-44ff-8a3c-85bc4300c760,28239a0e-9990-49ef-a159-ba63fb078c77,fbf2f1c0-fb91-4548-8653-68021e6541f9,01b7a29e-0d3a-4a18-8cb6-f8a329f9d6e6,4aa70762-1a0e-4b38-be77-a89db0955193,2de8f5dd-decd-47f4-856a-bda678ee6ab8,d8164b02-4b3c-454d-945b-2838edb1b5b1,d29e5a15-6ad2-41f9-b4e8-5c4b0f3fda9e,...,df0df04e-bb65-4dd6-91db-753eaa86a0c3,ccdfd8a3-2bc7-474c-b80d-96e8741fe8bc,06dbb69c-390b-4174-859b-c1e20380b483,6a3488f6-b364-426c-bb80-77acc0da4ff8,52739598-8083-45b3-b89e-4d841cb15d7a,e80aa591-ed43-4486-a7cd-0ea654f10983,27b37399-0a77-400e-9dff-0c9c26455144,6d98d09b-cdec-4f71-afb3-2c53bee990c4,9734c685-9f30-4c0b-b9ba-06bec4b3ca64,88E25F76-1B1F-4AFB-AB08-215D4F7F08B5
ENST00000335137,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
ENST00000423372,44.60,69.58,35.31,24.14,12.10,84.50,68.78,19.09,12.46,125.44,...,66.76,97.87,76.92,159.50,106.01,69.07,54.83,141.96,86.25,37.78
ENST00000426406,0.06,0.05,0.02,0.02,0.01,0.03,0.08,0.02,0.04,0.10,...,0.03,0.20,0.02,0.02,0.00,0.05,0.01,0.00,0.00,0.01
ENST00000332831,0.06,0.05,0.02,0.02,0.01,0.03,0.08,0.02,0.04,0.10,...,0.03,0.20,0.02,0.02,0.00,0.05,0.01,0.00,0.00,0.01
ENST00000599533,1.60,1.86,1.48,0.00,1.02,1.84,0.94,1.25,1.05,3.45,...,0.31,2.91,3.45,0.63,7.81,4.76,1.96,3.93,0.92,5.05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENST00000361335,1059.78,3310.69,3029.49,3101.11,2106.18,6754.96,1718.41,4467.47,1933.69,6316.78,...,9080.20,13978.18,3134.45,3046.82,3017.72,14910.56,7331.67,14228.82,5543.91,5790.72
ENST00000361381,2018.75,7465.49,6086.27,6848.49,3890.93,11094.95,3257.50,9618.19,3734.32,16712.80,...,13746.50,30243.45,5984.89,4816.30,6001.62,17649.84,9925.30,27826.78,10794.55,19981.97
ENST00000361567,825.79,1820.51,1489.06,2321.55,1464.96,5158.48,1502.45,2551.32,1330.51,2830.86,...,2858.90,1511.51,1875.72,467.10,1661.27,2158.65,1120.06,1827.90,2308.25,4270.44
ENST00000361681,1307.77,3075.86,2137.05,4173.20,2494.56,6510.20,2674.45,4080.42,2935.40,3287.61,...,4437.29,1647.12,2205.86,435.41,2002.27,3156.38,1577.09,1775.71,3304.82,10701.42


#### A) Select tumoral specimens

Using **sample_df.tsv** for mapping

In [11]:
# Specimen information PCAWG
sample_df = pd.read_csv(path.join('data', 'sample_df.tsv'), sep="\t", header='infer')

# Get an aliquot to specimen ID dictionary
specimen_dict = dict(zip(sample_df.aliquot_id, sample_df.icgc_specimen_id))

# Let's translate the columns into the specimen ID
translated_columns = []
aliq_ID_not_found_on_files = []
for aliqID in expression_matrix.columns:
    try:
        translated_columns.append(specimen_dict[aliqID])
    except KeyError:
        # The aliquot ID is not present in sample_df
        aliq_ID_not_found_on_files.append(aliqID)
        
print('Total number of aliquots with expression data: ' + str(len(expression_matrix.columns)))
print('Aliquot that could be translated into specimenID: ' + str(len(translated_columns)))
print('Dropped samples because of unknown translation of IDs: ' + str(len(aliq_ID_not_found_on_files)))

# Extract the columns
print(expression_matrix.shape[1])
expression_matrix = expression_matrix.drop(aliq_ID_not_found_on_files, axis=1)
print(expression_matrix.shape[1])

expression_matrix.columns = translated_columns

Total number of aliquots with expression data: 1359
Aliquot that could be translated into specimenID: 1359
Dropped samples because of unknown translation of IDs: 0
1359
1359
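
The same check can also be written without an explicit loop. The following is a minimal sketch on toy data, not the notebook's own method: the names `toy_sample_df`, `toy_expression` and all IDs are made up for illustration, and only the column names `aliquot_id` and `icgc_specimen_id` come from the real table.

```python
import pandas as pd

# Toy stand-ins for sample_df and expression_matrix (all IDs are made up)
toy_sample_df = pd.DataFrame({
    'aliquot_id': ['aliq-1', 'aliq-2', 'aliq-3'],
    'icgc_specimen_id': ['SP1', 'SP2', 'SP3'],
})
toy_expression = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=pd.Index(['ENST_A', 'ENST_B'], name='Transcript'),
    columns=['aliq-1', 'aliq-3', 'aliq-unknown'],
)

# Aliquot -> specimen lookup stored as a Series
aliquot_to_specimen = toy_sample_df.set_index('aliquot_id')['icgc_specimen_id']

# Boolean mask: which expression columns have a known specimen ID?
known = toy_expression.columns.isin(aliquot_to_specimen.index)
unknown_aliquots = toy_expression.columns[~known].tolist()

# Keep the translatable columns and rename them to specimen IDs
toy_translated = (
    toy_expression
    .loc[:, known]
    .rename(columns=aliquot_to_specimen.to_dict())
)

print(unknown_aliquots)
print(toy_translated.columns.tolist())
```

Keeping the list of untranslated aliquots makes it easy to report how many samples were dropped, as the loop above does.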


Double-checking that all aliquot IDs can be translated into specimen IDs is key. However, how many of the RNA-sequenced specimens come from tumoral samples? We do not want to include non-tumoral tissues in the subsequent analysis.

In [12]:
# Of the specimen IDs obtained with the RNA-Seq library strategy, are all specimen types tumoral samples?
category_series = sample_df[(sample_df['icgc_specimen_id'].isin(translated_columns))&(sample_df['library_strategy']=='RNA-Seq')]['dcc_specimen_type'].value_counts()
category_series

dcc_specimen_type
Primary tumour - solid tissue                          900
Primary tumour - lymph node                             95
Normal - tissue adjacent to primary                     89
Primary tumour                                          75
Primary tumour - blood derived (peripheral blood)       68
Normal - solid tissue                                   61
Metastatic tumour - metastatsis to distant location     34
Recurrent tumour - other                                20
Recurrent tumour - solid tissue                          5
Metastatic tumour - metastasis to distant location       5
Primary tumour - other                                   4
Primary tumour - blood derived (bone marrow)             3
Name: count, dtype: int64

Most of the expression data come from primary solid tumors. Other specimens come from lymph nodes or blood (not solid primary tumors), or even from metastases. However, a non-negligible proportion (150 of the 1359 specimens, about 11%) comes from normal tissue, either adjacent to the primary tumor or otherwise healthy. **Hence, we need to remove them from the data.**

In [13]:
# Get specimens that do not come from normal healthy tissue (specimen type does not start with 'Normal')
clean_sample_df = sample_df[(sample_df['icgc_specimen_id'].isin(translated_columns))&(sample_df['library_strategy']=='RNA-Seq')&(~sample_df['dcc_specimen_type'].str.startswith('Normal'))].copy()
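
One way a table such as `clean_sample_df` could then be used to restrict the expression matrix to tumoral specimens is sketched below. This is an illustration on toy data under stated assumptions: the names `toy_clean_sample_df`, `toy_expression` and `toy_tumour_expression` and all values are hypothetical, and the sketch assumes the matrix columns are already specimen IDs.

```python
import pandas as pd

# Toy stand-ins (specimen IDs and values are made up)
toy_clean_sample_df = pd.DataFrame({
    'icgc_specimen_id': ['SP1', 'SP3'],
    'dcc_specimen_type': ['Primary tumour - solid tissue', 'Primary tumour'],
})
toy_expression = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=pd.Index(['ENST_A', 'ENST_B'], name='Transcript'),
    columns=['SP1', 'SP2', 'SP3'],  # SP2 plays the role of a normal sample
)

# Specimen IDs that passed the tumoral filter
tumour_ids = set(toy_clean_sample_df['icgc_specimen_id'])

# Keep only the expression columns that belong to tumoral specimens
tumour_mask = toy_expression.columns.isin(tumour_ids)
toy_tumour_expression = toy_expression.loc[:, tumour_mask]

print(toy_tumour_expression.columns.tolist())
```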

#### B) Get information at gene, not at transcript level

Finally, we need to process the matrix to get expression information at the gene level.

In [14]:
# The annotation file does not have a header, so the column names are specified
annotation_df = pd.read_csv(path.join('data', 'gencode_transcript.tsv.gz'), 
                                sep="\t", header=None, names=['Symbol', 'Gene', 'Transcript'], compression='gzip')
annotation_df

Unnamed: 0,Symbol,Gene,Transcript
0,5S_rRNA,ENSG00000201285,ENST00000364415
1,5S_rRNA,ENSG00000212595,ENST00000391293
2,5S_rRNA,ENSG00000261122,ENST00000516136
3,5S_rRNA,ENSG00000261122,ENST00000516234
4,5S_rRNA,ENSG00000261122,ENST00000516580
...,...,...,...
302866,ZZZ3,ENSG00000036549,ENST00000469944
302867,ZZZ3,ENSG00000036549,ENST00000474746
302868,ZZZ3,ENSG00000036549,ENST00000476195
302869,ZZZ3,ENSG00000036549,ENST00000476275


In [15]:
# To group the transcripts and sum their expression by gene ID, we follow these steps

## Merge the expression_matrix with annotation_df on the 'Transcript' column.
## An inner join keeps only the transcript IDs present in both dataframes
merged_df = pd.merge(expression_matrix.reset_index(), annotation_df, on='Transcript', how='inner')
print("Expression available for " + str(len(merged_df)) + " transcripts.")

## Drop the text columns first so that only expression values are summed,
## then group by 'Gene' and sum the values for each gene
collapsed_df = merged_df.drop(columns=['Transcript', 'Symbol']).groupby('Gene').sum().reset_index()

print("After merging, expression for " + str(len(collapsed_df)) + " genes.")

Expression available for 95309 transcripts.
After merging, expression for 20738 genes.
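
To see what the merge and the group-by do, here is a minimal sketch on toy data. All transcript IDs, gene IDs and values are made up, and the `toy_` names are hypothetical; only the column names `Symbol`, `Gene` and `Transcript` follow the annotation table above.

```python
import pandas as pd

# Toy expression matrix: transcripts as rows, specimens as columns
toy_expression = pd.DataFrame(
    {'SP1': [1.0, 2.0, 10.0, 7.0], 'SP2': [0.5, 0.5, 20.0, 3.0]},
    index=pd.Index(['ENST_1', 'ENST_2', 'ENST_3', 'ENST_9'], name='Transcript'),
)

# Toy annotation: ENST_1 and ENST_2 belong to the same gene, ENST_9 is not annotated
toy_annotation = pd.DataFrame({
    'Symbol': ['GENE_A', 'GENE_A', 'GENE_B'],
    'Gene': ['ENSG_A', 'ENSG_A', 'ENSG_B'],
    'Transcript': ['ENST_1', 'ENST_2', 'ENST_3'],
})

# The inner join drops transcripts missing from either table (here ENST_9)
toy_merged = toy_expression.reset_index().merge(toy_annotation, on='Transcript', how='inner')

# Sum the transcript-level values of each gene
toy_genes = (
    toy_merged
    .drop(columns=['Transcript', 'Symbol'])
    .groupby('Gene')
    .sum()
    .reset_index()
)
print(toy_genes)
```

In this toy case the two transcripts of `ENSG_A` are added together, so the gene-level table has one row per gene and one column per specimen.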


In [16]:
# Save it in the data folder for later use.
collapsed_df.to_csv(path.join('data', 'gene_expression.tsv.gz'), sep='\t', index=False, compression='gzip')

### Exploratory analysis: mutation signatures

Finally, we can explore the number of mutations attributed to each mutational signature in each specimen.

In [17]:
signatures_df = pd.read_csv('data/SA_COMPOSITE_SNV.activity.FULL_SET.031918.txt', sep='\t', header='infer')
signatures_df

Unnamed: 0.1,Unnamed: 0,Biliary_AdenoCA__SP117655,Biliary_AdenoCA__SP117556,Biliary_AdenoCA__SP117627,Biliary_AdenoCA__SP117775,Biliary_AdenoCA__SP117332,Biliary_AdenoCA__SP117712,Biliary_AdenoCA__SP117017,Biliary_AdenoCA__SP117031,Biliary_AdenoCA__SP117759,...,Skin_Melanoma__SP104056,Skin_Melanoma__SP83083,Skin_Melanoma__SP82433,Skin_Melanoma__SP82780,Skin_Melanoma__SP83019,Skin_Melanoma__SP83099,Skin_Melanoma__SP83146,Skin_Melanoma__SP103866,Skin_Melanoma__SP83844,Skin_Melanoma__SP83027
0,BI_COMPOSITE_SNV_SBS1_P,1589.616207,1147.469423,1657.394,2289.068493,594.241435,411.759445,1410.241,2239.91831,3033.648031,...,550.6757,621.6714,618.0735,235.4213,378.997636,484.0279,428.2321,467.6804,237.912064,908.9658
1,BI_COMPOSITE_SNV_SBS2_P,1158.772685,155.662937,690.1413,1300.529865,486.814576,212.539479,285.9621,692.541633,1283.465322,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,BI_COMPOSITE_SNV_SBS3_P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,BI_COMPOSITE_SNV_SBS4_P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,BI_COMPOSITE_SNV_SBS5_P,2125.787143,607.249619,8.62e-172,1071.107892,357.574985,2906.621828,1532.025,722.196641,987.217327,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,BI_COMPOSITE_SNV_SBS6_S,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,23.64613,7.471426,10.29017,7.22474,8.097868,4.408429,11.8513,11.34733,8.22552,5.0787
6,BI_COMPOSITE_SNV_SBS7a_S,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,155518.9,42984.87,33209.46,21486.44,66697.18415,10861.49,263146.9,56937.2,37084.37241,753.3443
7,BI_COMPOSITE_SNV_SBS7b_S,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,43467.71,14957.72,1160.563,4005.323,12957.94826,3525.343,24352.98,11897.13,7500.89208,73.36892
8,BI_COMPOSITE_SNV_SBS7c_S,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,26557.15,2613.461,5e-324,575.2728,2487.385123,174.7664,5080.666,3769.907,1623.242663,106.653
9,BI_COMPOSITE_SNV_SBS8_P,1690.964739,449.212599,823.8964,655.400506,151.918232,897.45274,764.674,808.104071,1401.975456,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# Sum the attributed mutations per specimen as a proxy for the tumor mutation burden (TMB)
TMB_proxy = signatures_df.iloc[:, 1:].sum(axis=0)
TMB_proxy

Biliary_AdenoCA__SP117655     14868.992600
Biliary_AdenoCA__SP117556      5072.446116
Biliary_AdenoCA__SP117627      5837.270352
Biliary_AdenoCA__SP117775     12426.775021
Biliary_AdenoCA__SP117332      3427.951291
                                 ...      
Skin_Melanoma__SP83099        22261.560781
Skin_Melanoma__SP83146       348647.808151
Skin_Melanoma__SP103866       94879.104988
Skin_Melanoma__SP83844        56958.008549
Skin_Melanoma__SP83027         6173.487353
Length: 2780, dtype: float64
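
The proxy is simply a column-wise sum: for each specimen, the mutations attributed to every signature are added up. A minimal sketch on toy data (the cohort labels, specimen IDs and counts below are made up) shows the operation:

```python
import pandas as pd

# Toy activity table: first column is the signature name, the rest are specimens
toy_signatures = pd.DataFrame({
    'signature': ['SBS1', 'SBS5', 'SBS7a'],
    'TypeA__SP1': [100.0, 50.0, 0.0],
    'TypeB__SP2': [10.0, 0.0, 990.0],
})

# Skip the name column and sum over signatures (rows) for each specimen (column)
toy_tmb = toy_signatures.iloc[:, 1:].sum(axis=0)
print(toy_tmb)
```

Note that this sum is a count of attributed mutations per specimen, not a rate normalized by genome size.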

In [19]:
# Process the first column to extract SBS code
signatures_df['Unnamed: 0'] = signatures_df['Unnamed: 0'].str.extract(r'_(SBS\w+)_')

# Change column names: the first is signature and the rest are the specimenID
signatures_df.columns = ['signature'] + [col.split('__')[-1] for col in signatures_df.columns[1:]]

# Save the information for later use
signatures_df.to_csv(path.join('data' , 'signatures.tsv.gz'), sep='\t', index=False, compression='gzip')
signatures_df

Unnamed: 0,signature,SP117655,SP117556,SP117627,SP117775,SP117332,SP117712,SP117017,SP117031,SP117759,...,SP104056,SP83083,SP82433,SP82780,SP83019,SP83099,SP83146,SP103866,SP83844,SP83027
0,SBS1,1589.616207,1147.469423,1657.394,2289.068493,594.241435,411.759445,1410.241,2239.91831,3033.648031,...,550.6757,621.6714,618.0735,235.4213,378.997636,484.0279,428.2321,467.6804,237.912064,908.9658
1,SBS2,1158.772685,155.662937,690.1413,1300.529865,486.814576,212.539479,285.9621,692.541633,1283.465322,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,SBS3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,SBS4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,SBS5,2125.787143,607.249619,8.62e-172,1071.107892,357.574985,2906.621828,1532.025,722.196641,987.217327,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,SBS6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,23.64613,7.471426,10.29017,7.22474,8.097868,4.408429,11.8513,11.34733,8.22552,5.0787
6,SBS7a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,155518.9,42984.87,33209.46,21486.44,66697.18415,10861.49,263146.9,56937.2,37084.37241,753.3443
7,SBS7b,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,43467.71,14957.72,1160.563,4005.323,12957.94826,3525.343,24352.98,11897.13,7500.89208,73.36892
8,SBS7c,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,26557.15,2613.461,5e-324,575.2728,2487.385123,174.7664,5080.666,3769.907,1623.242663,106.653
9,SBS8,1690.964739,449.212599,823.8964,655.400506,151.918232,897.45274,764.674,808.104071,1401.975456,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
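
The two string operations used in this cell can be tried in isolation. The sketch below applies them to a few of the labels shown in the tables above; the variable names are hypothetical.

```python
import pandas as pd

# Raw signature labels as they appear in the first column of the file
raw_names = pd.Series([
    'BI_COMPOSITE_SNV_SBS1_P',
    'BI_COMPOSITE_SNV_SBS7a_S',
    'BI_COMPOSITE_SNV_SBS8_P',
])

# Capture the 'SBS...' code that sits between two underscores;
# expand=False returns a Series instead of a one-column DataFrame
codes = raw_names.str.extract(r'_(SBS\w+)_', expand=False)
print(codes.tolist())

# Column labels have the form '<histological type>__<specimen ID>'
raw_columns = ['Biliary_AdenoCA__SP117655', 'Skin_Melanoma__SP83027']
specimen_ids = [col.split('__')[-1] for col in raw_columns]
hist_types = [col.split('__')[0] for col in raw_columns]
print(specimen_ids, hist_types)
```

Splitting on the double underscore matters because the histological type itself contains single underscores.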


In [20]:
# Now, going back to the TMB value
specimen_IDs = [col.split('__')[-1] for col in TMB_proxy.index]
Histological_type = [col.split('__')[0] for col in TMB_proxy.index]

# Generate de novo pandas dataframe with the info
TMB_df = pd.DataFrame({'specimenID': specimen_IDs, 'hist_type': Histological_type, 'TMB_proxy': TMB_proxy.values})
TMB_df

TMB_df.to_csv(path.join('data' , 'TMB.tsv.gz'), sep='\t', index=False, compression='gzip')

It might be interesting to explore the data with a plot. For that, we will generate a plot similar to the one shown when the use case was introduced: a complex plot with two panels, one showing the distribution of the total number of mutations for each histological class on a logarithmic scale, and one showing the proportion of mutations attributed to the different signatures across samples from the different histological classes.

In [21]:
import numpy as np

# First set the signature name as the index (row name)
signatures_df = signatures_df.set_index('signature')

# Normalize the values in each column to generate the proportions of each signature
signatures_df = signatures_df.div(signatures_df.sum(axis=0), axis=1)

# Transpose and reset the index so that each signature becomes a column (independent variable)
signatures_df = signatures_df.transpose().reset_index()

# Some signatures that were extracted at the start of the cancer genomics field were subdivided into more components
# (7 was subdivided into 7a, 7b and 7c while 17 into 17a and 17b). To simplify we will merge into one component.
# Create new columns by summing the specified columns
signatures_df['SBS7a'] = signatures_df[['SBS7a', 'SBS7b', 'SBS7c']].sum(axis=1)
signatures_df['SBS17a'] = signatures_df[['SBS17a', 'SBS17b']].sum(axis=1)
# Rename the columns ('index' column to 'specimenID' and the others)
signatures_df = signatures_df.rename(columns={'index': 'specimenID', 
                                              'SBS7a': 'SBS7', 
                                              'SBS17a': 'SBS17',
                                              'SBS10a': 'SBS10'})
# Drop the original columns
signatures_df = signatures_df.drop(['SBS7b', 'SBS7c', 'SBS17b'], axis=1)

# Drop signatures with no contribution across specimens 
sum_over = signatures_df[signatures_df.columns[1:]].sum(axis=0)
signatures_df = signatures_df.drop(columns=list(sum_over[sum_over==0].index))

# Convert TMB_proxy to logarithmic scale
TMB_df['log_TMB_proxy'] = np.log10(TMB_df['TMB_proxy'])

# Include the total number of elements in the hist_type label for the plots
TMB_df['hist_type'] = TMB_df['hist_type'] + ' (n=' + TMB_df.groupby('hist_type')['specimenID'].transform('count').astype(str) + ')'

# Merge the two dataframes
merged_df = pd.merge(signatures_df, TMB_df , left_on='specimenID', right_on='specimenID', how='inner')
merged_df

Unnamed: 0,specimenID,SBS1,SBS2,SBS3,SBS4,SBS5,SBS6,SBS7,SBS8,SBS9,...,SBS76,SBS77,SBS78,SBS79,SBS80,SBS81,SBS82,hist_type,TMB_proxy,log_TMB_proxy
0,SP117655,0.106908,0.077932,0.0,0.0,1.429678e-01,0.000000,0.000000,0.113724,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,Biliary_AdenoCA (n=35),14868.992600,4.172282
1,SP117556,0.226216,0.030688,0.0,0.0,1.197153e-01,0.000000,0.000000,0.088559,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,Biliary_AdenoCA (n=35),5072.446116,3.705217
2,SP117627,0.283933,0.118230,0.0,0.0,1.476718e-175,0.000000,0.000000,0.141144,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,Biliary_AdenoCA (n=35),5837.270352,3.766210
3,SP117775,0.184205,0.104655,0.0,0.0,8.619355e-02,0.000000,0.000000,0.052741,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,Biliary_AdenoCA (n=35),12426.775021,4.094358
4,SP117332,0.173352,0.142013,0.0,0.0,1.043116e-01,0.000000,0.000000,0.044318,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,Biliary_AdenoCA (n=35),3427.951291,3.535035
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2775,SP83099,0.021743,0.000000,0.0,0.0,0.000000e+00,0.000198,0.654114,0.000000,0.005957,...,0.0,0.0,0.0,0.0,0.0,8.501045e-05,0.0,Skin_Melanoma (n=107),22261.560781,4.347556
2776,SP83146,0.001228,0.000000,0.0,0.0,0.000000e+00,0.000034,0.839187,0.000000,0.000957,...,0.0,0.0,0.0,0.0,0.0,6.562287e-06,0.0,Skin_Melanoma (n=107),348647.808151,5.542387
2777,SP103866,0.004929,0.000000,0.0,0.0,0.000000e+00,0.000120,0.765229,0.000000,0.003378,...,0.0,0.0,0.0,0.0,0.0,5.891708e-25,0.0,Skin_Melanoma (n=107),94879.104988,4.977171
2778,SP83844,0.004177,0.000000,0.0,0.0,0.000000e+00,0.000144,0.811273,0.000000,0.004417,...,0.0,0.0,0.0,0.0,0.0,8.261201e-06,0.0,Skin_Melanoma (n=107),56958.008549,4.755555
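
The normalization and the merging of sub-signatures can be checked on a small example. The following sketch uses made-up counts for two hypothetical specimens; it mirrors the steps of the cell above but is not a replacement for it.

```python
import pandas as pd

# Toy counts: signatures as rows, specimens as columns (values are made up)
toy_counts = pd.DataFrame(
    {'SP1': [30.0, 0.0, 0.0, 70.0], 'SP2': [10.0, 60.0, 30.0, 0.0]},
    index=pd.Index(['SBS1', 'SBS7a', 'SBS7b', 'SBS5'], name='signature'),
)

# Divide every column by its total so each specimen sums to 1
toy_props = toy_counts.div(toy_counts.sum(axis=0), axis=1)

# Put specimens in rows and signatures in columns
toy_props = toy_props.transpose()

# Merge the sub-signatures of SBS7 into a single component
toy_props['SBS7'] = toy_props[['SBS7a', 'SBS7b']].sum(axis=1)
toy_props = toy_props.drop(columns=['SBS7a', 'SBS7b'])

print(toy_props)
print(toy_props.sum(axis=1))  # each row should still sum to 1
```

Because the sub-signatures are summed rather than dropped, the proportions of each specimen still add up to 1 after merging.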


In [22]:
import plotly.express as px

# Order cohorts by median log TMB
order = (
    merged_df
    .groupby('hist_type')['log_TMB_proxy']
    .median()
    .sort_values()
    .index
    .tolist()
)

# Median TMB per cohort (on raw scale)
medians = (
    merged_df
    .groupby('hist_type')['TMB_proxy']
    .median()
    .reindex(order)
)

fig = px.strip(
    merged_df,
    x='hist_type',
    y='TMB_proxy',
    category_orders={'hist_type': order},
    log_y=True, # log-scale
    stripmode='overlay'
)

fig.update_traces(
    jitter=0.6,
    marker=dict(
        size=4,
        opacity=0.6
    )
)

for i, cohort in enumerate(order):
    fig.add_shape(
        type='line',
        x0=i - 0.35,
        x1=i + 0.35,
        y0=medians.loc[cohort],
        y1=medians.loc[cohort],
        line=dict(
            color='red',
            width=2
        )
    )

fig.update_layout(
    xaxis_title='Tumor cohort',
    yaxis_title='Total attributed mutations (TMB proxy)',
    width=1600,
    height=500,
    template='simple_white'
)

fig.show()


Clearly, tumor types that differ histologically show different levels of mutations, although there is large variability within each histological type. For instance, it would be very difficult to distinguish a benign bone tumor sample from a myelodysplastic syndrome (a type of blood cancer) sample by the number of mutations alone. But what about the composition of these mutations?
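
The within-cohort variability mentioned above can also be put into numbers. The sketch below is one possible way to do it, using the median and the interquartile range (IQR) of the log-scaled values. It uses toy data with made-up cohorts `A` and `B`, and the name `quartiles` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: two cohorts with the same median but very different spread
toy_tmb = pd.DataFrame({
    'hist_type': ['A', 'A', 'A', 'B', 'B', 'B'],
    'TMB_proxy': [100.0, 1000.0, 10000.0, 900.0, 1000.0, 1100.0],
})
toy_tmb['log_TMB_proxy'] = np.log10(toy_tmb['TMB_proxy'])

# Quartiles of the log-scaled values per cohort
quartiles = (
    toy_tmb
    .groupby('hist_type')['log_TMB_proxy']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ['q25', 'median', 'q75']

# The interquartile range summarizes the spread within each cohort
quartiles['iqr'] = quartiles['q75'] - quartiles['q25']
print(quartiles)
```

Two cohorts with similar medians and overlapping ranges cannot be told apart by mutation counts alone, which is what the strip plot suggests for several tumor types.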

In [23]:
import plotly.graph_objects as go
import plotly.express as px

# Get only relevant columns
prop_df = merged_df[merged_df.columns[:-2]].set_index('specimenID')

palette = px.colors.qualitative.Dark24 + px.colors.qualitative.Light24

figures = {}

for hist_type in order:

    # Subset cohort
    sub_prop_df = prop_df[prop_df['hist_type'] == hist_type].copy()
    sub_prop_df = sub_prop_df.drop(columns=['hist_type'])

    # Order signatures by total contribution
    # (hist_type was already dropped, so every remaining column is a signature)
    contribution_columns = (
        sub_prop_df
        .sum()
        .sort_values(ascending=False)
        .index
    )

    # Sort specimens
    sorted_specimens = sub_prop_df.sort_values(
        by=list(contribution_columns),
        ascending=False
    )

    # Create figure (single cohort → single subplot)
    fig = go.Figure()

    for j, sig in enumerate(contribution_columns):
        fig.add_trace(
            go.Bar(
                x=sorted_specimens.index,
                y=sorted_specimens[sig].replace(0, np.nan),  # treat zero contributions as missing values
                name=sig,
                marker_color=palette[j % len(palette)]
            )
        )

    fig.update_layout(
        barmode='stack',
        bargap=0,
        bargroupgap=0,
        hovermode='x',
        template='simple_white',
        height=400,
        width=1000,
        title=f"Stacked Bar Plot – {hist_type}",
        xaxis=dict(showticklabels=False),
        yaxis=dict(title="Contribution"),
        legend=dict(
            orientation='v',
            yanchor='middle',
            y=0.5,
            xanchor='left',
            x=1.02
        )
    )

    fig.update_traces(marker_line_width=0)

    # Save figure
    outfile = path.join("plots", f"Barplot_signatures_{hist_type}.html")
    fig.write_html(outfile)

    figures[hist_type] = fig

figures["Skin_Melanoma (n=107)"].show()

There are clearly larger differences in the composition of the mutations, which might help identify tumor types using only the proportions of signatures in a given sample, although there is still high variability. This can guide the choice of data type if we want to build a model that uses genomic data to identify the histological type (or even the molecular subtype) of a tumor.
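
As a first intuition for such a model, the sketch below assigns a sample to the cohort whose average signature profile is closest (a nearest-centroid rule). It is only an illustration under strong assumptions: the proportions are made up, the function `predict_hist_type` is hypothetical, and this is not the model built in the course.

```python
import pandas as pd

# Toy signature proportions per specimen (values are made up)
train = pd.DataFrame({
    'SBS1': [0.30, 0.25, 0.02, 0.05],
    'SBS5': [0.60, 0.70, 0.08, 0.05],
    'SBS7': [0.10, 0.05, 0.90, 0.90],
    'hist_type': ['Biliary_AdenoCA', 'Biliary_AdenoCA', 'Skin_Melanoma', 'Skin_Melanoma'],
})

# Average signature profile (centroid) of each histological type
centroids = train.groupby('hist_type').mean()


def predict_hist_type(proportions: pd.Series) -> str:
    """Return the histological type whose centroid is closest to the sample."""
    # Squared Euclidean distance between the sample and every centroid
    distances = ((centroids - proportions) ** 2).sum(axis=1)
    return distances.idxmin()


new_sample = pd.Series({'SBS1': 0.03, 'SBS5': 0.07, 'SBS7': 0.90})
print(predict_hist_type(new_sample))
```

A real model would need many more samples, a held-out test set, and a way to handle the high within-cohort variability seen in the plots.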