# Introduction to Machine Learning

**In this practical session you will:**

   - Learn the essential ideas behind Machine Learning, including several statistical concepts and the implementation steps from the point of view of the Data Science lifecycle.
   - Download, explore and implement the preliminary processing of a multi-omics cancer dataset that will be used throughout the course.

## Definition:

Machine learning is a subfield of artificial intelligence (AI) that focuses on developing algorithms and models that allow computers to learn patterns and make predictions or decisions without being explicitly programmed. In the context of data science and mathematical modeling, machine learning plays a crucial role in building models that represent real-world systems using mathematical concepts and language. Deep learning is a subfield of machine learning that uses a type of model called a neural network, whose design is inspired by the architecture of the human brain.

[![AI diagram](http://danieljhand.com/images/AI_ML_DL_circles.jpeg)](http://danieljhand.com/the-relationship-between-artificial-intelligence-ai-machine-learning-ml-and-deep-learning-dl.html)



## Characteristics of Machine Learning:

1. **Learning from Data:**
   - Machine learning systems learn from data rather than relying on explicit programming by using some statistical techniques.
   - Algorithms use available data to identify patterns, relationships, and trends.

<!-- Add an empty line here -->

2. **Model Development:**
   - Machine learning involves creating models that can generalize patterns from the training data to make predictions or decisions on new, unseen data.

<!-- Add an empty line here -->

3. **Adaptability:**
   - Machine learning models can adapt and evolve as new data becomes available, making them suitable for dynamic and changing environments.

<!-- Add an empty line here -->

## Data Science Lifecycle:

The data science lifecycle involves several key steps that machine learning implementations follow:

<!-- Add an empty line here -->

1. **Identification of the problem:**
   - Defining the problem determines the suitable model and algorithm, as well as how the training and test datasets are defined.

<!-- Add an empty line here -->

2. **Data Collection:**
   - Gather relevant data from various sources, ensuring it is representative and suitable for the problem at hand.
   - It is highly important to perform an exploratory analysis to evaluate the quality of the data and, where needed, define the subsequent processing steps.

<!-- Add an empty line here -->

3. **Data Processing:** This is probably the most important step: regardless of how successfully the previous steps were implemented, if the data do not contain the information needed to solve the problem, or are in an inadequate state for the algorithm to learn from, the resulting model will be useless (garbage in, garbage out). It mainly consists of two steps.

    3.1. **Data Pre-processing:**
    - Clean and preprocess the data to handle missing values, outliers, and format issues.

    <!-- Add an empty line here -->

    3.2. **Feature Engineering:**
    - Necessary in some cases but optional in others.
    - Select or create features that are relevant and informative for the machine learning model.
    - Common approaches are grouped into *Filter-based*, *Wrapper-based* and *Embedded-based* categories.
   
<!-- Add an empty line here -->

4. **Data modelling:** During this iterative process, each model's performance is assessed using different metrics, depending on whether the algorithm works with categorical or continuous variables.

    <!-- Add an empty line here -->

    4.1. **Model Training:**
    - Use a learning algorithm to train the model on a labeled dataset, allowing it to learn patterns and relationships.

    <!-- Add an empty line here -->

    4.2. **Model Optimization:**
    - Adjust model parameters and features to improve performance, often involving techniques like hyperparameter tuning. Within this step it is important to avoid overfitting, so that the model can generalize to datasets beyond the training ones.

    <!-- Add an empty line here -->

    4.3. **Model Testing:**
    - Validate the model on new, unseen data to ensure it generalizes well (without overfitting) and provides accurate predictions (without underfitting).

<!-- Add an empty line here -->

5. **Deployment:**
   - Deploy the model into a real-world environment, integrating it into decision-making processes.

<!-- Add an empty line here -->

[![Data Science Lifecycle](https://www.onlinemanipal.com/wp-content/uploads/2022/09/Data-Science-Life-cycle-768x767.png.webp)](https://www.onlinemanipal.com/blogs/data-science-lifecycle-explained)

## Types of Machine Learning

Machine learning is broadly categorized into several types, each serving different purposes and solving distinct problems. Here are the main types:

<!-- Add an empty line here -->

[![AI diagram](https://www.freecodecamp.org/news/content/images/2020/08/ml-1.png)](https://www.freecodecamp.org/news/machine-learning-for-managers-what-you-need-to-know/)

<!-- Add an empty line here -->

### Supervised Learning

In supervised learning, the algorithm is trained on a **labeled** dataset, where each input is paired with the corresponding output. The goal is to learn a mapping from inputs to outputs, and hence, **predict an output based on input**.

The usefulness of these models can be evaluated directly, since both the inputs and the corresponding correct outputs are available in the testing dataset.


**a. Regression:**
   - **Objective:** Predict a continuous target variable.
   - **Examples:** Linear Regression, Polynomial Regression.

**b. Classification:**
   - **Objective:** Predict a discrete target variable (class labels).
   - **Examples:** Logistic Regression, Decision Trees or Random Forest and Support Vector Machines.
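To make the distinction between the two supervised tasks concrete, here is a minimal sketch in plain Python. It assumes a single input feature and toy data invented for illustration; the nearest-centroid classifier is a deliberately simple stand-in, not one of the algorithms listed above, and the helper names (`fit_line`, `fit_centroids`, `predict_class`) are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

def fit_centroids(xs, labels):
    """Nearest-centroid classifier: store the mean feature value of each class."""
    return {lab: mean(x for x, l in zip(xs, labels) if l == lab) for lab in set(labels)}

def predict_class(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda lab: abs(x - centroids[lab]))

# Regression: the labels are continuous values (toy data generated as y = 2x + 1)
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(train_x, train_y)
predicted_y = slope * 4.0 + intercept

# Classification: the labels are discrete classes
class_x = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
class_labels = ['low', 'low', 'low', 'high', 'high', 'high']
centroids = fit_centroids(class_x, class_labels)
predicted_label = predict_class(centroids, 7.0)

print(slope, intercept, predicted_y, predicted_label)
```

In both cases the model is fitted on labeled pairs and then queried with an unseen input; only the nature of the output (a number or a class) changes.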

<!-- Add an empty line here -->

### Unsupervised Learning

Unsupervised learning involves training on **unlabeled** data, and the algorithm tries to **discover patterns or relationships in the data** without explicit guidance on the output.

Since the output is unknown in the training data, the usefulness is implicitly derived from the structure and relationships discovered in the data.

**a. Clustering:**
   - **Objective:** Group similar data points together.
   - **Examples:** K-Means Clustering, Hierarchical Clustering.

**b. Dimensionality Reduction:**
   - **Objective:** Reduce the number of input features while preserving important information. It is also commonly used as a pre-processing step for feature extraction.
   - **Examples:** Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE).

**c. Association Rule Learning:** It will not be covered in this course.
   - **Objective:** Discover interesting relationships between variables in large datasets.
   - **Examples:** Apriori Algorithm, Eclat Algorithm.
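As an illustration of how a clustering algorithm discovers structure without labels, the following is a minimal sketch of K-Means (Lloyd's algorithm) on one-dimensional toy data. The data, the fixed starting centroids and the function name `k_means_1d` are assumptions made for the example; real implementations handle several dimensions and random initialization.

```python
def k_means_1d(points, centroids, n_iter=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
print(centroids)
print(clusters)
```

No labels are given at any point: the two groups emerge only from the distances between the data points.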


### Reinforcement Learning

Reinforcement learning involves an **agent interacting with an environment**, learning to make decisions by receiving feedback in the form of rewards or penalties (mimicking human trial-and-error behaviour). Hence, the objective is to **learn a policy that makes the decisions leading to the best possible result**.

In this type of algorithm, performance depends on the feedback (reward or penalty) that the environment returns to the agent after each action, which guides it towards learning a successful policy (https://www.youtube.com/@aiwarehouse). It is more commonly used for tasks such as training AI to play video games than for -omics data analysis, so it will not be covered in this course.

**a. Model-Based Reinforcement Learning:**

   - **Objective:** Build an explicit model of the environment to make decisions.
   - **Examples:** Monte Carlo Tree Search.

**b. Model-Free Reinforcement Learning:**

   - **Objective:** Learn to make decisions without an explicit model of the environment.
   - **Examples:** Q-Learning, Deep Q Network (DQN).

## Relationship with Statistical Concepts

1. **Pattern Recognition:**
   - Machine learning involves finding patterns in data, a concept deeply rooted in statistics.
   - Depending on the type of problem, and hence on the algorithm employed, different kinds of patterns can be extracted from data.

<!-- Add an empty line here -->

[![Pattern types](https://www.researchgate.net/profile/Gordon-Elger/publication/352727978/figure/fig2/AS:1153327744192512@1651986170131/Machine-learning-tasks-most-relevant-for-PdM.png)](https://www.researchgate.net/figure/Machine-learning-tasks-most-relevant-for-PdM_fig2_352727978)

<!-- Add an empty line here -->

2. **Cross-Validation:** A key concept for supervised models when the available dataset is too small to set aside a separate validation set.
   - To assess a supervised model's generalization ability, cross-validation techniques are used to evaluate performance on multiple subsets of the data.
   - There are multiple methodologies (https://www.turing.com/kb/different-types-of-cross-validations-in-machine-learning-and-their-explanations), although the most common is **K-fold cross-validation**, which involves partitioning the entire dataset into k random subsets (folds), where k-1 are used for training and 1 for testing. This is repeated k times, so that each fold is used once for testing, and the model is evaluated through the metrics obtained across iterations.

<!-- Add an empty line here -->

[![K-fold cross-validation](https://d2mk45aasx86xg.cloudfront.net/image5_11zon_af97fe4b03.webp)](https://www.turing.com/kb/different-types-of-cross-validations-in-machine-learning-and-their-explanations)
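The partitioning behind K-fold cross-validation can be sketched with the standard library alone. This is only an illustration of the index bookkeeping, under the assumption that samples are identified by their position; the function name `k_fold_indices` is hypothetical, and in practice a library routine would normally be used.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random partition of the dataset
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds of near-equal size
    for i in range(k):
        test_idx = folds[i]                     # one fold is held out for testing
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print('train:', sorted(train_idx), 'test:', sorted(test_idx))
```

Each sample appears in exactly one test fold, so every observation contributes once to the evaluation and k-1 times to training.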

<!-- Add an empty line here -->

3. **Bias-Variance Tradeoff:** Also a key concept when dealing with supervised models.
   - In statistics, the bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. The variance of an estimator, on the other hand, measures how much its estimates spread around their own expected value across repeated sampling.

<!-- Add an empty line here -->

   [![Bias and Variance](https://nvsyashwanth.github.io/machinelearningmaster/assets/images/bias_variance.jpg)](https://nvsyashwanth.github.io/machinelearningmaster/bias-variance/)

<!-- Add an empty line here -->

   - If we consider machine learning predictions as estimates, these two concepts acquire the following meaning in this context.
       
       - **Variance** measures how much the model predictions for a particular sample change when the model is rebuilt on different training datasets (for instance, on different subsets of the training data). In other words, it is the sensitivity of the model to the randomness of the training dataset.
       
       - In contrast, **Bias** can be seen as the distance between the average prediction and the correct value (the label) when the model is rebuilt multiple times with different training datasets. Therefore, it measures the systematic error that is not due to randomness in the training data.
             
   - These two concepts are intrinsically related, and the bias-variance tradeoff is therefore a fundamental concept in machine learning: there is an optimal model complexity that allows for good performance on the training data while keeping the ability to generalize to new data. Deviations from this optimum lead to either high bias (underfitting: the model is too simple or insufficiently trained, so there is still room to improve it) or high variance (overfitting: the model is fitted too closely to a specific training dataset and is unable to generalize to similar test datasets).
   
   <!-- Add an empty line here -->

   [![Optimal complexity](https://ejenner.com/post/bias-variance-tradeoff/tradeoff_huad58a1a719791584e96223cc1385b715_74447_1200x1200_fit_q75_h2_lanczos_3.webp)](https://ejenner.com/post/bias-variance-tradeoff/)
   
<!-- Add an empty line here -->

   [![Underfitting and overfitting](https://www.endtoend.ai/assets/blog/misc/bias-variance-tradeoff-in-reinforcement-learning/underfit_right_overfit.png)](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
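The effect of model complexity can be explored with a small numerical experiment. The sketch below assumes synthetic data drawn from a noisy sine curve and uses polynomial degree as the measure of complexity; the noise level, sample sizes and degrees are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(n):
    """Draw n samples from y = sin(2*pi*x) plus Gaussian noise."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, y_train = noisy_sine(20)
x_test, y_test = noisy_sine(200)

errors = {}
for degree in (1, 3, 9):
    # Fit a polynomial of the given degree by least squares on the training set
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The training error can only decrease as the degree grows, because each polynomial family contains the previous one. The test error is the quantity to watch: a model that is too simple tends to miss the curve (high bias), while a very flexible one tends to follow the noise of the particular training sample (high variance).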

<!-- Add an empty line here -->

4. **Statistical Metrics:**
   - Various statistical metrics are used to quantify the performance of machine learning models, mostly supervised ones but also unsupervised ones.
   - The type of metric used depends on the type of problem and algorithm.
   
   <!-- Add an empty line here -->
   
   [![Supervised metrics](https://www.kdnuggets.com/wp-content/uploads/anello_machine_learning_evaluation_metrics_theory_overview_11.png)](https://www.kdnuggets.com/machine-learning-evaluation-metrics-theory-and-overview)
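As a worked example of how the metric depends on the type of problem, the sketch below computes common classification metrics from the confusion-matrix counts and the mean squared error for a regression task. The labels and predictions are toy values invented for the example, and the function names are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)   # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)   # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)   # false negatives
    tn = len(pairs) - tp - fp - fn                                # true negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': precision,
        'recall': recall,
        'f1': 2 * precision * recall / (precision + recall),
    }

def mean_squared_error(y_true, y_pred):
    """Average squared difference between continuous labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Categorical target: 4 positives and 6 negatives, with one error of each kind
metrics = classification_metrics([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                                 [1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
# Continuous target
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(metrics, mse)
```

This sketch does not guard against divisions by zero (for instance, when no positive predictions are made), which a production implementation would need to handle.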

## Use case: Cancer genomics

Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. At the very core of the etiology of cancer are somatic mutations: permanent alterations in the genetic material (resulting either from spontaneous errors during DNA replication or from DNA damage) that arise throughout somatic development (from the very first mitotic divisions of the zygote to adult human tissues).

As sequencing technologies advanced in the past decade, the number of available tumor whole genomes has increased exponentially, revealing that the number of mutations accumulated by different tumors varies by up to three orders of magnitude.

<!-- Add an empty line here -->

![ICGC TMB](images/ICGC_muts.png)

<!-- Add an empty line here -->

Not only does the total number of mutations vary, but also their composition. The endogenous mutational processes active in a tissue, as well as the mutagens a person has been exposed to during their lifetime, e.g. ultraviolet (UV) light or tobacco smoke, define a set of probabilities for each nucleotide to mutate given its neighboring sequence. These probabilities can be inferred from the observed data and decomposed into several components that roughly reflect the individual mutational processes affecting the cell, the so-called 'mutation signatures', some of which are linked to specific mechanisms.
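To make "each nucleotide given its neighboring sequence" concrete: single base substitution signatures such as the ones shown in this section are conventionally defined over 96 mutation classes, obtained by combining the 6 substitution types (reported from the pyrimidine of the base pair) with the 4 possible bases immediately 5' and 3' of the mutated position. The sketch below simply enumerates them; the `A[C>T]G` label format is one common notation, used here as an assumption.

```python
from itertools import product

# Substitutions are reported from the pyrimidine (C or T) of the mutated base pair
SUBSTITUTIONS = ['C>A', 'C>G', 'C>T', 'T>A', 'T>C', 'T>G']
BASES = 'ACGT'

# One class per combination of substitution type, 5' neighbor and 3' neighbor
contexts = [f"{five}[{sub}]{three}"
            for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)]

print(len(contexts))
print(contexts[:4])
```

A signature is then a probability distribution over these classes, and the mutations observed in a tumor are modeled as a mixture of several such distributions.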

<!-- Add an empty line here -->

<!-- Add an empty line here -->

**Tobacco-related signature of single base substitutions (SBS) 4**

<!-- Add an empty line here -->

[![Tobacco Signature](https://cog.sanger.ac.uk/cosmic-signatures-production/images/v2_signature_profile_4.original.png)](https://cancer.sanger.ac.uk/signatures/signatures_v2/)

<!-- Add an empty line here -->

<!-- Add an empty line here -->

**Ultraviolet light-related signature of single base substitutions (SBS) 7**

<!-- Add an empty line here -->

[![UV Signature](https://cog.sanger.ac.uk/cosmic-signatures-production/images/v2_signature_profile_7.original.png)](https://cancer.sanger.ac.uk/signatures/signatures_v2/)

<!-- Add an empty line here -->

<!-- Add an empty line here -->

Hence, the study of mutations within the Cancer Genomics field, integrated with other -omic data such as transcriptomics or epigenomics as well as clinical data, has paved the way for the latest advances in Cancer Research.

Several international consortia have generated multi-omic cancer datasets. One of them, framed within The Cancer Genome Atlas (TCGA), is the Pan-Cancer Analysis of Whole Genomes (PCAWG) initiative. Publicly available data are stored at the International Cancer Genome Consortium (ICGC) database: https://dcc.icgc.org/releases/PCAWG

Some files are particularly interesting for analysis with Machine Learning techniques:

- Clinical (phenotypic) information for each donor belonging to a given project, contained in the **pcawg_donor_clinical_August2016_v9.xlsx** file. It includes relevant information such as the donor's sex, vital status, treatment, age at diagnosis and history of smoking and alcohol habits.
- The relationship between donor, specimen and sample identifiers, in the **pcawg_sample_sheet.tsv** file. A donor is the individual with cancer, from whom several specimens (biopsies of tumor or healthy tissue) can be collected. Moreover, more than one sample can be taken from each specimen to extract omics information (WGS, RNA-seq, ...).
- A matrix with the expression, in transcripts per million (TPM), of multiple transcripts across several samples. This information is contained in **pcawg.rnaseq.transcript.expr.tpm.tsv.gz**.
<!-- - A list of known detected driver mutations on samples, contained at **TableS3_panorama_driver_mutations_ICGC_samples.public.tsv.gz**. -->
- A matrix with the number of mutations attributed to each mutation signature across specimens by SignatureAnalyzer. The data is contained in the **SignatureAnalyzer_COMPOSITE.SBS.txt** file.

In [None]:
# We can start by downloading some files, for that we will need the pandas package
import pandas as pd

# We will also need numpy for some operations
import numpy as np

# os is part of the Python standard library. Its path utilities are useful to work with local files
from os import path

# To explore the datasets it is always useful to use some plotting packages
import matplotlib.pyplot as plt
import seaborn as sns

# Some dependencies on the seaborn package will generate warnings due to the version. Just ignore them
import warnings
warnings.filterwarnings('ignore')

In [None]:
# The clinical information of the donors
## It is an excel file, so we use the function read_excel from pandas
clinical_df = pd.read_excel('https://dcc.icgc.org/api/v1/download?fn=/PCAWG/clinical_and_histology/pcawg_donor_clinical_August2016_v9.xlsx')
clinical_df

In [None]:
# For starters we can explore this file.
## The clinical information for donors contains mostly categorical variables, but others, such as the patient age at diagnosis, are continuous
## Moreover, it seems there is a lot of missing data. Let's explore it.

categorical_columns = ['project_code', 'donor_sex', 'donor_vital_status', 'first_therapy_type', 'first_therapy_response',
                        'tobacco_smoking_history_indicator', 'alcohol_history', 'alcohol_history_intensity']

continuous_columns = ['donor_age_at_diagnosis', 'tobacco_smoking_intensity', 'donor_survival_time', 'donor_interval_of_last_followup']

# Create a list of tuples indicating whether each column is categorical or continuous
column_types = [(col, 'categorical') if col in categorical_columns else (col, 'continuous') for col in categorical_columns + continuous_columns]

n_rows = 4
n_cols = (len(categorical_columns)+len(continuous_columns))//n_rows

# Create a figure with multiple subplots (make a grid, 4 rows, 3 columns)
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(13, 16))

for i, tupl in enumerate(column_types):
    col = tupl[0]
    cat = tupl[1]
    if cat == 'categorical':
        clinical_df[col].value_counts().plot.pie(autopct='%1.1f%%', ax=axes[i//n_cols, i%n_cols], startangle=90)
        axes[i//n_cols, i%n_cols].set_title(f'Pie Chart - {col}')
        axes[i//n_cols, i%n_cols].set_ylabel('')
    elif cat == 'continuous':
        sns.histplot(clinical_df[col], bins=20, kde=True, ax=axes[i//n_cols, i%n_cols])
        axes[i//n_cols, i%n_cols].set_title(f'Histogram - {col}')
        axes[i//n_cols, i%n_cols].set_xlabel(col)
        axes[i//n_cols, i%n_cols].set_ylabel('Frequency')

plt.show()

In [None]:
# Notice that the categorical plots above ignore the non-available (NA) data. Let's plot it with dropna=False in the value_counts() function
# Create a figure with multiple subplots (make a grid, 4 rows, 3 columns)
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(13, 16))

for i, tupl in enumerate(column_types):
    col = tupl[0]
    cat = tupl[1]
    if cat == 'categorical':
        clinical_df[col].value_counts(dropna=False).plot.pie(autopct='%1.1f%%', ax=axes[i//n_cols, i%n_cols], startangle=90)
        axes[i//n_cols, i%n_cols].set_title(f'Pie Chart - {col}')
        axes[i//n_cols, i%n_cols].set_ylabel('')
    elif cat == 'continuous':
        sns.histplot(clinical_df[col], bins=20, kde=True, ax=axes[i//n_cols, i%n_cols])
        axes[i//n_cols, i%n_cols].set_title(f'Histogram - {col}')
        axes[i//n_cols, i%n_cols].set_xlabel(col)
        axes[i//n_cols, i%n_cols].set_ylabel('Frequency')

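# Save the figure for future uses (the 'plots' folder is created if it does not exist yet)
import os
os.makedirs('plots', exist_ok=True)
plt.savefig(path.join('plots', 'Clinical_withNA.png'))
plt.show()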

Notice that now, except for the sex and vital status categories, missing values (shown as nan, the numpy not-a-number marker) are the most common category across the patient information. This is going to complicate any analysis that uses these clinical, phenotypic data.
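The amount of missing data can also be quantified directly instead of being read from the pie charts. The sketch below uses a small toy table with made-up values (the column names are borrowed from the clinical file, the content is not); the same two pandas calls could be applied to `clinical_df`.

```python
import numpy as np
import pandas as pd

# Toy clinical table with invented values
toy_clinical = pd.DataFrame({
    'donor_sex': ['male', 'female', 'female', 'male'],
    'alcohol_history': [np.nan, 'yes', np.nan, np.nan],
    'donor_age_at_diagnosis': [61, np.nan, 47, 70],
})

# isna() flags missing cells; the column-wise mean of the flags is the missing fraction
missing_fraction = toy_clinical.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Sorting the fractions gives a quick ranking of which variables are too sparse to be used as features or labels.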

In [None]:
## Save it on the data folder for later uses.
clinical_df.to_csv(path.join('data', 'clinical_df.tsv'), sep='\t', index=False)

Now we can download the file that relates each donor to the specimens and samples extracted from them. It is just a file that helps connect other files through their IDs, so let's have a quick look.

In [None]:
## It is a tab-separated file, so we read it with the read_csv function, specifying the tab character \t as the separator
## Moreover, we indicate that the file has a header that should be inferred as the column names.
sample_df = pd.read_csv('https://dcc.icgc.org/api/v1/download?fn=/PCAWG/donors_and_biospecimens/pcawg_sample_sheet.tsv', sep='\t', header='infer')
sample_df

Note that several specimens are extracted from the same donor: usually, one from a normal tissue and another from a primary tumor. To find somatic mutations through WGS it is necessary to remove the germline variants that are present in both normal and tumoral tissue, which is why the variants found in the normal tissue are usually subtracted from the tumor mutation calls.

Moreover, notice that several samples might be taken from the same tumoral specimen to obtain information with different techniques: in this case, WGS or RNA-seq. This will be relevant to map multi-omic information across samples from the same tumoral specimen of a given donor.
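The donor, specimen and sample hierarchy can be illustrated with a toy sample sheet. All identifiers and specimen type values below are made up, and `icgc_donor_id` is an assumed column name; the other column names are the ones used elsewhere in this notebook.

```python
import pandas as pd

# Toy sample sheet: one row per aliquot (sample); every value is invented
toy_sheet = pd.DataFrame({
    'icgc_donor_id':     ['DO1', 'DO1', 'DO1', 'DO2', 'DO2'],
    'icgc_specimen_id':  ['SP1', 'SP2', 'SP2', 'SP3', 'SP4'],
    'aliquot_id':        ['a1', 'a2', 'a3', 'a4', 'a5'],
    'dcc_specimen_type': ['Normal', 'Primary tumour', 'Primary tumour', 'Normal', 'Primary tumour'],
    'library_strategy':  ['WGS', 'WGS', 'RNA-Seq', 'WGS', 'WGS'],
})

# One row per specimen and one column per technique, counting aliquots
strategy_table = pd.crosstab(toy_sheet['icgc_specimen_id'], toy_sheet['library_strategy'])
print(strategy_table)

# Specimens profiled with both techniques are the ones usable for multi-omic analyses
both = strategy_table[(strategy_table['WGS'] > 0) & (strategy_table['RNA-Seq'] > 0)].index.tolist()
print(both)
```

In this toy example only one tumoral specimen has both a WGS and an RNA-Seq aliquot, which is the situation to expect for part of the real data as well.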

In [None]:
# Merge it with a file provided in the data folder to extract the primary site
project_info = pd.read_csv(path.join('data', 'projects_PCAWG_info.txt'), sep='\t', header='infer')

primary_location_dict = dict(zip(project_info.project, project_info.primary_location))

sample_df['primary_location'] = sample_df['dcc_project_code'].map(primary_location_dict)

# Save it on the data folder for later uses
sample_df.to_csv(path.join('data', 'sample_df.tsv'), sep='\t', index=False)

Next, the expression data is downloaded. This is a very large file, and downloading it with pandas will take time. As a simple alternative, you can download it directly with your browser into the data folder: https://dcc.icgc.org/api/v1/download?fn=/PCAWG/transcriptome/transcript_expression/pcawg.rnaseq.transcript.expr.tpm.tsv.gz
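When a file is too large to load comfortably, pandas can read only the first rows or stream it in chunks. The sketch below demonstrates both options on a tiny gzip-compressed stand-in file written to a temporary folder; the file name, column names and values are invented for the example, and the chunk size is arbitrary.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed, tab-separated stand-in for a large expression file
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_expression.tsv.gz')
with gzip.open(toy_path, 'wt') as handle:
    handle.write('Transcript\taliquot_1\taliquot_2\n')
    for i in range(10):
        handle.write(f'ENST{i:011d}.1\t{i}.0\t{2 * i}.0\n')

# Option 1: peek at the first rows only, without loading the whole file
preview = pd.read_csv(toy_path, sep='\t', nrows=3)

# Option 2: stream the file in chunks and accumulate a per-column total
totals = None
n_rows = 0
for chunk in pd.read_csv(toy_path, sep='\t', chunksize=4):
    n_rows += len(chunk)
    chunk_sum = chunk.drop(columns='Transcript').sum()
    totals = chunk_sum if totals is None else totals + chunk_sum

print(preview)
print(n_rows, totals.to_dict())
```

Peeking with `nrows` is a cheap way to check the column layout before committing to a full load.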

In [None]:
# Load the expression data
expression_df = pd.read_csv(path.join('data', 'pcawg.rnaseq.transcript.expr.tpm.tsv.gz'), sep='\t', header='infer', compression='gzip')
expression_df

Here the first column contains the Ensembl Transcript ID, and each of the remaining columns is named after an aliquot ID (present in the **pcawg_sample_sheet.tsv** file).

If we want to add gene IDs or Symbols instead of transcript IDs, we will need to process the file that the PCAWG consortium uses for annotation, located at: https://dcc.icgc.org/api/v1/download?fn=/PCAWG/drivers/expression/rnaseq.gc19_extNc.gtf.tar.gz

An alternative to using the browser is a bash script. In Jupyter notebooks, you can use interpreters other than Python. The script below downloads and processes the necessary file (if you don't have a Linux-based operating system, skip this step; the file is already provided in the data folder).

In [None]:
%%bash
# The %% above indicates to the jupyter notebook to use bash as interpreter

# Change into the data directory
cd data

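# Download the file from the ICGC (the URL is quoted so the shell does not interpret the '?' character)
wget -O rnaseq.gc19_extNc.gtf.tar.gz 'https://dcc.icgc.org/api/v1/download?fn=/PCAWG/drivers/expression/rnaseq.gc19_extNc.gtf.tar.gz'

# Extract the gene symbol, gene ID and transcript ID from the annotation attributes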
zcat rnaseq.gc19_extNc.gtf.tar.gz | cut -f9 | cut -d';' -f2 | sed 's/.*gencode::\([^:]*\)::tc_\([^._]*\)[^:]*::\([^._]*\)[^:]*.*/\1\t\2\t\3/' | sort | uniq | tail -n +3 | gzip > gencode_transcript.tsv.gz

To summarize, so far we have:

**pcawg.rnaseq.transcript.expr.tpm.tsv.gz**: A large file with the first column being the ensembl Transcript ID and the rest of the columns with an aliquot ID.

**sample_df.tsv**: A file that contains the relationship between donors, specimens and samples. The aliquotID is also a column of this file.

**gencode_transcript.tsv.gz**: A file that contains the annotation (gene symbol, gene ID and transcript ID) of each transcript.
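For readers who cannot run the bash cell, the substitution performed by its `sed` expression can be approximated in Python with the `re` module. The sketch below translates the same pattern; the attribute string used to try it is hypothetical (the identifiers are invented and the real annotation lines may look different), and the function name `parse_attribute` is an assumption.

```python
import re

# Python translation of the sed pattern used in the bash cell:
# gencode::<symbol>::tc_<gene ID>.<version>::<transcript ID>.<version>
PATTERN = re.compile(r'.*gencode::([^:]*)::tc_([^._]*)[^:]*::([^._]*)')

def parse_attribute(text):
    """Return (symbol, gene ID, transcript ID) or None when the pattern is absent.

    Unlike sed, which leaves non-matching lines unchanged, this returns None.
    """
    match = PATTERN.match(text)
    return match.groups() if match else None

# Hypothetical attribute string with invented identifiers
example = 'transcript_id "gencode::GENE1::tc_ENSG00000000001.1::ENST00000000001.2"'
print(parse_attribute(example))
print(parse_attribute('a line without the expected pattern'))
```

The version suffixes after the dot are discarded, which is why the gene and transcript IDs in the annotation file carry no version number.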

For the analysis in the following sessions we will need to process the expression data, specifically:

- Get the information at the gene level instead of the transcript level.

- Keep only the expression of tumoral specimens (which will not cover all the tumoral specimens of the PCAWG).

In [None]:
# Open expression matrix
expression_matrix = pd.read_csv(path.join('data', 'pcawg.rnaseq.transcript.expr.tpm.tsv.gz'),
                                                                    sep="\t", header='infer', compression='gzip')

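# Remove the version suffix from the transcript IDs (a raw string avoids invalid-escape warnings)
expression_matrix['Transcript'] = expression_matrix['Transcript'].str.extract(r'^(\w+)\.\w+$', expand=False)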

expression_matrix = expression_matrix.set_index('Transcript', drop=True)


# Specimen information PCAWG
sample_df = pd.read_csv(path.join('data', 'sample_df.tsv'), sep="\t", header='infer')

# Get an aliquot to specimen ID dictionary
specimen_dict = dict(zip(sample_df.aliquot_id, sample_df.icgc_specimen_id))


# Let's translate the columns into the specimen ID
translated_columns = []
aliq_ID_not_found_on_files = []
for aliqID in expression_matrix.columns:
    try:
        translated_columns.append(specimen_dict[aliqID])
    except KeyError:
        # The aliquot ID is not present in the sample sheet
        aliq_ID_not_found_on_files.append(aliqID)
        
print('Total number of aliquots with expression data: ' + str(len(expression_matrix.columns)))
print('Aliquot that could be translated into specimenID: ' + str(len(translated_columns)))
print('Dropped samples because of unknown translation of IDs: ' + str(len(aliq_ID_not_found_on_files)))

# Drop the columns whose aliquot ID could not be translated
print(expression_matrix.shape[1])
expression_matrix = expression_matrix.drop(aliq_ID_not_found_on_files, axis=1)
print(expression_matrix.shape[1])

expression_matrix.columns = translated_columns

Apparently, all the aliquot IDs can be translated into specimen IDs thanks to **sample_df.tsv**, so there was no loss of information. However, how many of the specimens that were RNA-sequenced come from tumoral samples? We do not want to include non-tumoral tissues in the next analyses.

In [None]:
# From the specimen IDs that we could obtain with the RNA-Seq library strategy, are all specimen types from tumoral samples?
category_series = sample_df[(sample_df['icgc_specimen_id'].isin(translated_columns))&(sample_df['library_strategy']=='RNA-Seq')]['dcc_specimen_type'].value_counts()
category_series

In [None]:
# We make a pie plot for the different categories
category_series.plot.pie(autopct='%1.1f%%', startangle=90)

# Add a title
plt.title('Distribution of Specimen Types for RNA-Seq')

# Show the plot
plt.show()

Most of the expression data come from specimens of primary solid tumors. Other specimens come from lymph nodes or blood (not solid primary tumors) or even from metastases. However, a non-negligible proportion comes from either normal tissue adjacent to the primary tumor or just regular healthy tissue. Hence, we need to remove them from the data.

In [None]:
# Get specimens that do not come from normal healthy tissues (their specimen type does not start with 'Normal')
# na=False treats a missing specimen type as "not normal" and keeps the mask strictly boolean
clean_sample_df = sample_df[(sample_df['icgc_specimen_id'].isin(translated_columns))&(sample_df['library_strategy']=='RNA-Seq')&(~sample_df['dcc_specimen_type'].str.startswith('Normal', na=False))].copy()
# Check that no healthy tissue derived specimens remain
print(clean_sample_df['dcc_specimen_type'].value_counts())

# Remove the columns on the expression_df with expression for healthy tissue specimens
print('Original specimens with expression: ' + str(len(expression_matrix.columns)))
print('Specimens that belong to a tumoral tissue: ' + str(len(clean_sample_df['icgc_specimen_id'])))
expression_matrix = expression_matrix[clean_sample_df['icgc_specimen_id']]

In [None]:
expression_matrix

Finally, we need to process the matrix to get expression information at the gene level.
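The collapsing step can be followed on a toy example before applying it to the full matrix. The sketch below assumes three invented transcripts belonging to two invented genes, with the same column names (`Symbol`, `Gene`, `Transcript`) used for the annotation file in this notebook.

```python
import pandas as pd

# Toy transcript-level expression: rows are transcripts, columns are specimens
toy_expression = pd.DataFrame(
    {'SP1': [1.0, 2.0, 5.0], 'SP2': [0.5, 0.5, 3.0]},
    index=pd.Index(['T1', 'T2', 'T3'], name='Transcript'),
)

# Toy annotation: T1 and T2 are transcripts of the same gene
toy_annotation = pd.DataFrame({
    'Symbol': ['GENE_A', 'GENE_A', 'GENE_B'],
    'Gene': ['G1', 'G1', 'G2'],
    'Transcript': ['T1', 'T2', 'T3'],
})

# Attach the gene ID to every transcript, then sum the transcripts of each gene
merged = toy_expression.reset_index().merge(toy_annotation, on='Transcript', how='inner')
gene_level = merged.drop(columns=['Transcript', 'Symbol']).groupby('Gene').sum()
print(gene_level)
```

Because TPM values are abundances on a common scale within a sample, summing the transcripts of a gene gives a gene-level TPM.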

In [None]:
# The annotation file does not have a header, so the column names are specified
annotation_df = pd.read_csv(path.join('data', 'gencode_transcript.tsv.gz'), 
                                sep="\t", header=None, names=['Symbol', 'Gene', 'Transcript'], compression='gzip')
annotation_df

In [None]:
# To group the transcripts and sum their expression by gene IDs we have to do the following steps

## Merge the expression_matrix with annotation_df on the 'Transcript' column.
## An Inner join is done to work with the Transcript IDs that are on both dataframes
merged_df = pd.merge(expression_matrix.reset_index(), annotation_df , left_on='Transcript', right_on='Transcript', how='inner')
print("Expression available for " + str(len(merged_df)) + " transcripts.")

## Drop the non-numeric annotation columns, then group by 'Gene' and sum the expression of its transcripts
collapsed_df = merged_df.drop(columns=['Transcript', 'Symbol']).groupby('Gene').sum().reset_index()

print("After merging, expression for " + str(len(collapsed_df)) + " genes.")

In [None]:
# Save it on the data folder for later uses.
collapsed_df.to_csv(path.join('data' , 'gene_expression.tsv.gz'), sep='\t', index=False, compression='gzip')

Finally, we can download the number of mutations attributed to each signature for each specimen.

In [None]:
signatures_df = pd.read_csv('https://dcc.icgc.org/api/v1/download?fn=/PCAWG/mutational_signatures/Signatures_in_Samples/SA_Signatures_in_Samples/SA_Full_PCAWG_Attributions/SA_COMPOSITE_SNV.activity.FULL_SET.031918.txt', sep='\t', header='infer')
signatures_df

In [None]:
# We can sum the mutations attributed to all signatures in each specimen as a proxy of the tumor mutation burden (TMB)
TMB_proxy = signatures_df.iloc[:, 1:].sum(axis=0)
TMB_proxy

In [None]:
# Process the first column to extract SBS code
signatures_df['Unnamed: 0'] = signatures_df['Unnamed: 0'].str.extract(r'_(SBS\w+)_')

# Change column names: the first is signature and the rest are the specimenID
signatures_df.columns = ['signature'] + [col.split('__')[-1] for col in signatures_df.columns[1:]]

# Save the information for later uses
signatures_df.to_csv(path.join('data' , 'signatures.tsv.gz'), sep='\t', index=False, compression='gzip')

In [None]:
# Now, going back to the TMB value
specimen_IDs = [col.split('__')[-1] for col in TMB_proxy.index]
Histological_type = [col.split('__')[0] for col in TMB_proxy.index]

# Generate a new pandas dataframe with the info
TMB_df = pd.DataFrame({'specimenID': specimen_IDs, 'hist_type': Histological_type, 'TMB_proxy': TMB_proxy.values})
TMB_df

TMB_df.to_csv(path.join('data' , 'TMB.tsv.gz'), sep='\t', index=False, compression='gzip')

It might be interesting to explore the data with a plot. For that, we will generate a plot similar to the one shown when the use case was introduced: a complex plot with two panels, one showing the distribution of the total number of mutations for each histological class on a logarithmic scale, and one showing the proportion of mutations attributed to the different signatures across samples of the different histological classes.
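Both panels rely on two simple transformations of the attribution matrix: the column totals (used as a proxy of the mutation burden) and the per-specimen proportions. The sketch below shows them on a toy matrix with invented counts, where rows are signatures and columns are specimens.

```python
import pandas as pd

# Toy attribution matrix: number of mutations per signature (rows) and specimen (columns)
toy_counts = pd.DataFrame(
    {'SP1': [30, 10, 60], 'SP2': [0, 50, 50]},
    index=pd.Index(['SBS1', 'SBS4', 'SBS5'], name='signature'),
)

# Total number of attributed mutations per specimen
burden = toy_counts.sum(axis=0)

# Divide every column by its own total to obtain proportions that sum to 1
proportions = toy_counts.div(burden, axis=1)
print(burden)
print(proportions)
```

Working with proportions makes specimens with very different mutation counts comparable, while the totals are kept separately for the logarithmic panel.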

In [None]:
# First set the signature name as the index (row name)
signatures_df = signatures_df.set_index('signature')

# Normalize the values in each column to generate the proportions of each signature
signatures_df = signatures_df.div(signatures_df.sum(axis=0), axis=1)

# Transpose and reset the index to have each signature as a column (independent variable)
signatures_df = signatures_df.transpose().reset_index()

# Some signatures that were extracted at the start of the cancer genomics field were subdivided into more components
# (7 was subdivided into 7a, 7b and 7c while 17 into 17a and 17b). To simplify we will merge into one component.
# Create new columns by summing the specified columns
signatures_df['SBS7a'] = signatures_df[['SBS7a', 'SBS7b', 'SBS7c']].sum(axis=1)
signatures_df['SBS17a'] = signatures_df[['SBS17a', 'SBS17b']].sum(axis=1)
# Rename the columns ('index' column to 'specimenID' and the others)
signatures_df = signatures_df.rename(columns={'index': 'specimenID', 
                                              'SBS7a': 'SBS7', 
                                              'SBS17a': 'SBS17',
                                              'SBS10a': 'SBS10'})
# Drop the original columns
signatures_df = signatures_df.drop(['SBS7b', 'SBS7c', 'SBS17b'], axis=1)

# Drop signatures with no contribution across specimens 
sum_over = signatures_df[signatures_df.columns[1:]].sum(axis=0)
signatures_df = signatures_df.drop(columns=list(sum_over[sum_over==0].index))

# Convert TMB_proxy to logarithmic scale
TMB_df['log_TMB_proxy'] = np.log10(TMB_df['TMB_proxy'])

# Include the total number of elements in the hist_type label for the plots
TMB_df['hist_type'] = TMB_df['hist_type'] + ' (n=' + TMB_df.groupby('hist_type').transform('count')['specimenID'].astype(str) + ')'

# Merge the two dataframes
merged_df = pd.merge(signatures_df, TMB_df , left_on='specimenID', right_on='specimenID', how='inner')
merged_df

In [None]:
# Get the plotting order of hist_type by increasing median in log TMB
order = TMB_df.groupby('hist_type')['log_TMB_proxy'].median().sort_values().index

# Set seaborn style
sns.set(style="whitegrid")

# Create a figure and axes
plt.figure(figsize=(10, 40))

# Create a violin plot with a boxplot inside
ax = sns.violinplot(y='hist_type', x='log_TMB_proxy', data=merged_df, order=order, inner='box')

# Set X-axis label
ax.set_xlabel('log(TMB_proxy)')

# Set Y-axis label
ax.set_ylabel('Hist Type')

# Save the figure for future uses
plt.savefig(path.join('plots', 'Violin.png'))
plt.show()

Definitely, tumor types that differ from the histological point of view show different levels of mutations, although there is a large variability within each histological type. For instance, it would be very difficult to distinguish, by the number of mutations alone, a sample of a benign bone tumor from a myelodysplastic syndrome type of blood cancer. But what about the composition of these mutations?

In [None]:
from matplotlib.colors import ListedColormap

# Get only relevant columns
prop_df = merged_df[merged_df.columns[:-2]].set_index('specimenID')

# Create subplots
fig, axes = plt.subplots(nrows=len(order), ncols=1, figsize=(5, 5 * len(order)))

# Iterate over hist_type
for i, hist_type in enumerate(order):    

    # Plot the stacked bar plot on the right side
    ax_bar = axes[i]
    sub_prop_df = prop_df[prop_df['hist_type']==hist_type].copy()
    sub_prop_df = sub_prop_df.drop(columns=['hist_type'])

    # Step 1: Drop signatures that do not contribute to the class or less than 1% mean across samples
    mean_over = sub_prop_df[sub_prop_df.columns].mean(axis=0)
    sub_prop_df = sub_prop_df.drop(columns=list(mean_over[mean_over<0.01].index))

    # Step 2: Identify and sort columns by contribution
    contribution_columns = sub_prop_df.sum().sort_values(ascending=False).index

    # Step 3: Sort rows (specimens) by total contribution of selected SBS columns
    sorted_specimens = sub_prop_df.sort_values(by=list(contribution_columns), ascending=False)

    # Automatically generate a ListedColormap with unique colors based on the number of labels
    num_colors = len(contribution_columns)
    color_map = plt.get_cmap('tab20', num_colors)

    # Step 4: Plot the stacked bar plot
    stacked_bar = sorted_specimens.plot(kind='bar', stacked=True, colormap=color_map, edgecolor='none', width=1, ax=ax_bar)
    ax_bar.set_xlabel('Specimen')
    ax_bar.set_ylabel('Contribution')
    # Remove X-axis tick labels
    ax_bar.set_xticklabels([])  
    ax_bar.set_title(f'Stacked Bar Plot - {hist_type}')
    
    # Move the legend to the right
    ax_bar.legend(loc='center left', bbox_to_anchor=(1, 0.5))

# Save the figure for future uses
plt.savefig(path.join('plots', 'Barplot_signatures.png'))
plt.show()

There are clearly larger differences in terms of the composition of the mutations, which might help with the identification of tumor types just by using the proportions of signatures in a given sample, although there is still high variability. This might help to decide which type of data to use if we consider building a model that looks at genomic data and aims to identify the histological type (or even the molecular subtype) of a tumor.

# Unsupervised methods

**In this practical session you will:**

   - Learn to use several common unsupervised methods (dimensionality reduction and clustering algorithms) used in multi-omics data analysis.
   - Explore part of the multi-omics dataset and discover the underlying structure of the transcriptomic data.

## Dimensionality reduction methods

Dimensionality reduction algorithms are techniques used to reduce the number of features (or dimensions) in a dataset while preserving its essential information: this is particularly useful for **visualization, meaningful compression and discovery of the underlying structure of the data**. Two popular dimensionality reduction algorithms are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).

### Principal Component Analysis (PCA):

PCA is a statistical technique that, given an n-dimensional matrix of values:

- Identifies the directions (specific axes in the feature space) along which the data varies the most when projected onto them: the principal components.

- Represents the data in a new coordinate system defined by these principal components.
    
Therefore, the key idea is to find a lower-dimensional representation of the data that captures the maximum amount of variance. Hence, the first principal component is the one that captures the largest amount of variance in the data, followed by the second principal component, and so on.

To achieve this, the algorithm follows these steps:

1. PCA starts by computing the **covariance matrix** of the original data, which represents the relationships between the different features.

2. **Eigenvectors and eigenvalues** are extracted from the covariance matrix. The **eigenvectors** are the principal components and the **eigenvalues** indicate the variance along each principal component.

3. The **eigenvectors** are sorted in descending order based on the **eigenvalues**.

4. The dimensionality of the data is reduced by **linearly** transforming it into a new coordinate system aligned with the first principal components, as many as the new dimensionality requires.
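
The four steps above can be sketched with NumPy on a small synthetic matrix. This is only a minimal illustration under stated assumptions: the helper name `pca_eig` and the toy data are made up for this example, and library implementations typically rely on a singular value decomposition rather than on an explicit covariance matrix.

```python
import numpy as np

def pca_eig(X, n_components):
    """Toy PCA via eigen-decomposition of the covariance matrix (hypothetical helper)."""
    # Center the data so that the covariance is computed around the mean
    X_centered = X - X.mean(axis=0)
    # Step 1: covariance matrix between features (columns)
    cov = np.cov(X_centered, rowvar=False)
    # Step 2: eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: sort in descending order of eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 4: linear projection onto the first n_components eigenvectors
    projected = X_centered @ eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return projected, explained_ratio

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features, the second one correlated with the first
x = rng.normal(size=200)
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.1, size=200),
                     rng.normal(scale=0.5, size=200)])

projected, explained_ratio = pca_eig(X, n_components=2)
print(projected.shape)
print(explained_ratio)
```

Because two of the three synthetic features are almost collinear, most of the variance should end up in the first component.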

<!-- Add an empty line here -->

[![PCA in a nutshell](https://pbs.twimg.com/media/F9XIOm1boAEhsL2?format=jpg&name=small)](https://twitter.com/akshay_pachaar/status/1717519050706952695)

<!-- Add an empty line here -->

PCA is probably the most widely used dimensionality reduction technique thanks to its multiple advantages, although it also has its own problems:

**Advantages:**
- Computationally efficient for linear dimensionality reduction.
- Preserves as much variance as possible.
- Clear interpretation of the principal components.

**Disadvantages:**
- Assumes linearity.
- May not capture complex nonlinear relationships.

### t-Distributed Stochastic Neighbor Embedding (t-SNE):

t-SNE is another common dimensionality reduction algorithm primarily used for visualizing high-dimensional data in a lower-dimensional space. However, unlike linear methods like PCA, t-SNE focuses on preserving local structures and capturing non-linear relationships between data points.

To this effect, the algorithm uses:

- **Measures of pairwise similarity** between data points since similar data points in the high-dimensional space are intended to remain close to each other in the low-dimensional space.

- Moreover, t-SNE constructs **probability distributions** for the pairwise similarities in both the high-dimensional and low-dimensional spaces to model the similarities using conditional probabilities.

- The algorithm minimizes the **divergence between the probability distributions** in the high-dimensional and low-dimensional spaces, so that the low-dimensional representation reflects the structure of the high-dimensional data.

<!-- Add an empty line here -->

Following these criteria and statistical tools, the steps of the algorithm are:

1. **Compute Pairwise Similarities:** For each pair of data points in the high-dimensional space the pairwise similarity is computed.

2. **Construct Probability Distributions:** The pairwise similarities are converted into probability distributions. In the high-dimensional space a Gaussian distribution is used to represent the similarities, while in the low-dimensional space a Student's t-distribution is used (this distribution has heavier tails than the Gaussian, making it more flexible and better suited for capturing local structures).

3. **Minimize the Divergence:** The Kullback-Leibler divergence is minimized between the two sets of probability distributions by iteratively adjusting the positions of data points in the low-dimensional space. Once the minimum is achieved, the final output is the resulting low-dimensional embedding of the data.
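
The quantities involved in these steps can be sketched with NumPy. This is a simplified illustration and not the full algorithm: it assumes a single fixed `sigma` for the Gaussian (the actual method tunes one bandwidth per point through the perplexity), it does not perform the iterative optimization, and all function names are hypothetical. It only compares the divergence obtained for two hand-made embeddings.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Gaussian similarities, symmetrized and normalized to sum to 1.
    Assumption: one fixed sigma instead of a perplexity-based search."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)    # conditional probabilities p(j|i)
    return (P + P.T) / (2 * X.shape[0])     # symmetrized joint probabilities

def low_dim_affinities(Y):
    """Student's t similarities (one degree of freedom), normalized to sum to 1."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Kullback-Leibler divergence KL(P || Q) over the entries where P > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
# Two well separated groups of 20 points in 10 dimensions
X = np.vstack([rng.normal(0, 1, size=(20, 10)), rng.normal(8, 1, size=(20, 10))])
P = high_dim_affinities(X, sigma=2.0)

# One embedding keeps the two groups apart, the other one shuffles the points
Y_good = np.vstack([rng.normal(-5, 1, size=(20, 2)), rng.normal(5, 1, size=(20, 2))])
Y_bad = rng.permutation(Y_good)

kl_good = kl_divergence(P, low_dim_affinities(Y_good))
kl_bad = kl_divergence(P, low_dim_affinities(Y_bad))
print(kl_good, kl_bad)
```

The embedding that keeps neighbours together should give the lower divergence, which is the quantity that the optimization tries to reduce.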

<!-- Add an empty line here -->

The advantages and disadvantages of t-SNE highlight how this technique complements PCA:

**Advantages:**
- Effective for preserving local structure and capturing non-linear relationships.
- Well-suited for visualization of high-dimensional data.

**Disadvantages:**
- Computationally expensive for large datasets.
- Optimizing t-SNE involves non-convex optimization, which may result in different solutions for different initializations.
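
The dependence on the initialization can be illustrated with a quick check on synthetic data. The sketch below assumes scikit-learn's `TSNE` with `init='random'`, so that the starting positions depend only on the seed; the data and the seed values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic data: three groups of 20 points in 10 dimensions
X = np.vstack([rng.normal(center, 1.0, size=(20, 10)) for center in (0, 5, 10)])

# Same data and parameters, only the random seed changes
emb_a = TSNE(n_components=2, perplexity=10, init='random', random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, perplexity=10, init='random', random_state=1).fit_transform(X)

print(emb_a.shape, emb_b.shape)
# Largest coordinate difference between the two embeddings
print(np.abs(emb_a - emb_b).max())
```

The groups may still be recognizable in both embeddings, but their coordinates and relative placement are not expected to match, which is one reason to fix `random_state` when a plot has to be reproducible.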

### PCA or t-SNE?

The different characteristics of these techniques are key to choosing the appropriate one based on the nature of the data and the problem at hand: PCA is often preferred for linear relationships and general dimensionality reduction, while t-SNE is powerful for visualizing complex, non-linear structures in high-dimensional data.

**Linearity vs. Non-Linearity:**
- PCA: Assumes linear relationships.
- t-SNE: Captures non-linear relationships.

**Preservation of Global vs. Local Structure:**
- PCA: Emphasizes preserving global variance.
- t-SNE: Focuses on preserving local structures and similarities.

**Interpretability:**
- PCA: Principal components have a clear interpretation.
- t-SNE: The mapping is more difficult to interpret, especially regarding how distances in the embedding relate to distances in the high-dimensional space.

**Computational Complexity:**
- PCA: Computationally efficient.
- t-SNE: Computationally expensive, especially for large datasets.

## Clustering algorithms

Clustering algorithms are used to group data points together into clusters based on their relation to surrounding data points. To do so, they use similarity or distance measures in the feature space in an effort to discover dense regions of data points (hence, it is good practice to scale the data prior to using clustering algorithms).

There are many types of clustering algorithms, but most share an iterative process in which the identified clusters are evaluated and fed back into the algorithm configuration until the desired or appropriate number of clusters is achieved.

Therefore, some clustering algorithms require the user to specify the number of clusters to discover in the data, while others require only some minimum distance between observations, a threshold at which data points might be considered as "close" or "connected".

### Hierarchical Clustering:

Hierarchical clustering generates a tree-like hierarchy of clusters, known as a dendrogram, through the iterative process. It does not require specifying the number of clusters beforehand, but the user has to define a posteriori, and subjectively, the number of clusters based on the dendrogram. The iterative steps are the following:

1. **Evaluate the distance between clusters**: The algorithm computes the pairwise distance between all the clusters at the current iteration (the algorithm starts by considering each data point as an individual cluster and ends when all data points are assigned to one cluster).

2. **Merge the closest clusters**: The two clusters with the lowest distance between them are merged together into a new cluster. This distance is recorded on the dendrogram as the length of the branch between the original two clusters and the new cluster (a new node in the dendrogram). The algorithm then enters the next iteration.
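
The two iterative steps can be sketched with a deliberately naive loop. The example assumes single linkage (closest pair of points) as the distance between clusters and uses a hypothetical helper name; it is far slower than library implementations and is only compared against SciPy's `linkage` for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def naive_single_linkage(points):
    """Merge distances of a naive agglomerative clustering (hypothetical helper)."""
    # Start with each data point as its own cluster
    clusters = [[i] for i in range(len(points))]
    merge_distances = []
    while len(clusters) > 1:
        best = (np.inf, None, None)
        # Step 1: evaluate the distance between every pair of clusters
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 2: merge the two closest clusters and record the distance
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        merge_distances.append(float(d))
    return merge_distances

points = np.array([[0.0], [1.0], [3.0], [7.0], [7.5]])
print(naive_single_linkage(points))   # [0.5, 1.0, 2.0, 4.0]

# The third column of SciPy's linkage matrix stores the same merge distances
print(linkage(points, method='single')[:, 2])
```

Each recorded distance corresponds to the height of one node of the dendrogram.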


<!-- Add an empty line here -->

[![Hierarchical clustering](https://cdn-dfnaj.nitrocdn.com/xxeFXDnBIOflfPsgwjDLywIQwPChAOzV/assets/images/optimized/rev-6132d4f/www.displayr.com/wp-content/uploads/2018/03/Hierarchical-clustering-3-1.png)](https://www.displayr.com/what-is-hierarchical-clustering/)

<!-- Add an empty line here -->

Once the dendrogram is generated, its shape can be interpreted to define the desired number of clusters. Visual inspection can be combined with performance metrics, such as the **Cophenetic Correlation Coefficient**, which measures how faithfully the hierarchical clustering preserves the pairwise distances between data points (values close to 1 indicate good clustering), or **Ward's method** (see below), to evaluate how successful the clustering has been.
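
As a small illustration of the Cophenetic Correlation Coefficient, the sketch below uses SciPy's `cophenet` on synthetic data and checks that the value matches the Pearson correlation between the original pairwise distances and the distances read from the dendrogram. The toy data and the set of linkage methods are arbitrary choices for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Synthetic data: two compact groups of 15 points in 2 dimensions
X = np.vstack([rng.normal(0, 0.5, size=(15, 2)), rng.normal(5, 0.5, size=(15, 2))])
original = pdist(X)   # condensed vector of pairwise distances

results = {}
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    # c: cophenetic correlation, coph: cophenetic distances (merge heights per pair)
    c, coph = cophenet(Z, original)
    # Same quantity computed by hand as a Pearson correlation
    manual = np.corrcoef(original, coph)[0, 1]
    results[method] = (c, manual)
    print(method, round(float(c), 3), round(float(manual), 3))
```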

<!-- Add an empty line here -->

[![Dendrogram](https://cdn-dfnaj.nitrocdn.com/xxeFXDnBIOflfPsgwjDLywIQwPChAOzV/assets/images/optimized/rev-6132d4f/www.displayr.com/wp-content/uploads/2018/03/Screen-Shot-2018-03-28-at-11.48.48-am.png)](https://www.displayr.com/what-is-hierarchical-clustering/)

<!-- Add an empty line here -->

There are various linkage methods, that is, methods to measure the distance between clusters. Each one has the advantage of proficiently detecting specific cluster shapes and the disadvantage of being misled by data of a different nature. Some examples are:

- **Single linkage**: The distance is computed as the shortest distance between two points such that one point lies in one cluster and the other point lies in the other. This method is able to separate non-elliptical shapes as long as the gap between the two clusters is not small; however, it performs poorly when there is noise between clusters.

- **Complete linkage**: The distance is computed as the longest distance between two points such that one point lies in one cluster and the other point lies in the other. In contrast, this method performs well when there is noise between clusters, but it is biased towards detecting globular clusters and tends to break up large clusters.

- **Average linkage**: The distance is computed as the average distance over all possible pairs of data points between the clusters. Similar to complete linkage, it performs well with noise between clusters but is biased towards globular ones.

- **Ward's method**: At each step it merges the two clusters whose union produces the smallest increase in the within-cluster sum of squared distances to the cluster centroid. This within-cluster sum of squares also serves as a performance metric, where low values within each cluster suggest better performance.
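
To make the differences between these criteria concrete, the sketch below computes them by hand for two tiny clusters `A` and `B` (made-up coordinates). For Ward's method it uses the increase in the within-cluster sum of squares caused by the merge, which is the usual formulation of that criterion; library implementations may report a rescaled version of this quantity.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny clusters on a line (hypothetical coordinates)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All distances between a point of A and a point of B
pairwise = cdist(A, B)

single = pairwise.min()      # closest pair
complete = pairwise.max()    # furthest pair
average = pairwise.mean()    # mean over all pairs

def sum_of_squares(C):
    """Sum of squared distances of the points of C to their centroid."""
    return ((C - C.mean(axis=0)) ** 2).sum()

# Increase in within-cluster sum of squares if A and B were merged
ward_increase = sum_of_squares(np.vstack([A, B])) - sum_of_squares(A) - sum_of_squares(B)

print(single, complete, average, ward_increase)
```

The same pair of clusters is therefore "closer" or "further" depending on the linkage chosen, which explains why the resulting dendrograms can differ.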

<!-- Add an empty line here -->

[![Linkage methods](https://miro.medium.com/v2/resize:fit:640/format:webp/0*s2KrCgCQIlEqcK_X)](https://medium.com/@u0808b100/unsupervised-learning-in-python-hierarchical-clustering-t-sne-41f17bbbd350)

<!-- Add an empty line here -->

With respect to other clustering algorithms, hierarchical clustering presents:

**Advantages:**
- No need to pre-specify the number of clusters.
- Provides a hierarchical structure.

**Disadvantages:**
- Computationally expensive, especially for large datasets.
- Difficult to determine the optimal number of clusters (highly subjective).

### K-Means Clustering:

K-Means clustering partitions the data into a predefined number k of clusters, where each cluster is defined by a centroid: a point calculated as the mean of all the data points in the cluster. There are different algorithms, but all of them use an iterative procedure until convergence is achieved. Roughly, they follow these two steps:

1. **Assignment**: Each data point is assigned to the nearest centroid, generating k clusters at the current iteration. At the first iteration, the k initial cluster centroids are chosen at random in the space.

2. **Update the centroids**: The k centroids are recalculated as the mean of the data points in each cluster. When the assignment no longer changes the data points in each cluster, the centroids also stay the same after the update: convergence has been achieved and the algorithm stops.
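
The assignment and update steps can be sketched in a few lines of NumPy. This is a toy version under explicit assumptions: the function name `kmeans` is hypothetical, the initial centroids are taken as k randomly chosen data points, and none of the refinements of library implementations (such as smarter initialization or several restarts) is included.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means (hypothetical helper, not a library implementation)."""
    rng = np.random.default_rng(seed)
    # Assumption: the initial centroids are k distinct data points picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Convergence: the assignment did not change, so neither would the centroids
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update: each centroid becomes the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
# Synthetic data: two groups of 50 points in 2 dimensions
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(10, 0.5, size=(50, 2))])

labels, centroids = kmeans(X, k=2)
print(np.bincount(labels))
print(centroids.round(2))
```

Calling the function with different `seed` values is a simple way to explore how the result depends on the initial centroids.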

<!-- Add an empty line here -->

<a href="https://stackoverflow.com/questions/60312401/when-using-the-k-means-clustering-algorithm-is-it-possible-to-have-a-set-of-dat" target="_blank">
  <img src="https://i.stack.imgur.com/ibYKU.png" alt="K-means clustering" width="800"/>
</a>

<!-- Add an empty line here -->

Similar to hierarchical clustering, there are some metrics that reflect performance, such as the **silhouette score**, which evaluates intra-cluster compactness and between-cluster separation, or **Ward's method**. Despite these being unsupervised methods, if there is any information about the "true" clusters in the data, one can compute the **Adjusted Rand Index (ARI)**, which measures the similarity between true and predicted clusters, adjusted for chance.
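
Both kinds of metric can be tried on synthetic data with scikit-learn's `silhouette_score` and `adjusted_rand_score`. In the sketch below the data and the three labelings (the correct one, the same partition with the cluster names swapped, and a random one) are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(0)
# Synthetic data: two well separated groups of 30 points
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)), rng.normal(5, 0.5, size=(30, 2))])
true_labels = np.repeat([0, 1], 30)

swapped = 1 - true_labels                      # same partition, names exchanged
random_labels = rng.integers(0, 2, size=60)    # labels unrelated to the structure

# Silhouette only needs the data and the predicted labels
sil_true = silhouette_score(X, true_labels)
sil_random = silhouette_score(X, random_labels)

# ARI compares two labelings and ignores how the clusters are named
ari_swapped = adjusted_rand_score(true_labels, swapped)
ari_random = adjusted_rand_score(true_labels, random_labels)

print(sil_true, sil_random)
print(ari_swapped, ari_random)
```

The swapped labeling still reaches the maximum ARI because only the grouping matters, while the random labeling should score close to zero on both metrics.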

With respect to advantages and disadvantages compared to other clustering methods:

**Advantages**:
- Efficient and works well with large datasets.
- Simple and easy to implement.

<!-- Add an empty line here -->

**Disadvantages**:
- Sensitive to initial centroid placement.
- Assumes clusters are spherical and equally sized.

### Hierarchical or K-Means Clustering?

As always, it depends on the nature of the data and the goals of the analysis. Hierarchical clustering does not require specifying the number of clusters in advance, and the decision can be made a posteriori by evaluating the hierarchy; however, it is computationally expensive. K-means clustering is more efficient, but requires a predefined number of clusters and is very sensitive to initialization.

## Practical session: visualization of transcriptomes and identification of cancer types by expression

The expression data across genes and specimens that we generated in the previous session is highly multidimensional, given that for each specimen we have the expression of more than 20,000 genes or features.

In order to make sense of this data, we can start by employing techniques such as PCA or t-SNE for a preliminary search for interesting patterns. For this purpose we can use the utilities available in the scikit-learn package.

### PCA

From a practical point of view, these are the steps we are going to implement:

1. Standardize the data: The objective of PCA is to maximize the variance. If the features (in this case >20000 gene expressions) have different scales or units, it is important to standardize the data by subtracting the mean and dividing by the standard deviation. This step ensures that all features are on a similar scale and prevents dominance by features with larger variances. For that, we will use the StandardScaler of sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

2. Compute the covariance matrix: To understand the relationships between pairs of gene expression in the data.

3. Perform the eigen-decomposition: The covariance matrix is decomposed into its eigenvectors and eigenvalues. The eigenvectors represent the principal components, and the eigenvalues indicate the amount of variance explained by each principal component.

4. Select the principal components: The principal components are ranked based on their corresponding eigenvalues, and the top components capturing the most variance are selected. Since we will do a 2D visualization, we need the two components that explain most of the variance in gene expression across samples.

5. Project the data onto the new coordinate system: The original data is transformed by projecting it onto the selected principal components. Each data point is represented by its new coordinates in the principal component space.

Steps 2, 3, 4 and 5 could be implemented easily with numpy through linear algebra operations (if anyone wants to, they can try the exercise; if you need help, see https://stackoverflow.com/questions/58666635/implementing-pca-with-numpy). However, this is a standard procedure that is already implemented in machine learning packages such as **scikit-learn** as the PCA module.
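
The effect of step 1 can be checked on a toy matrix before touching the real data. The sketch below assumes three independent synthetic "genes" with very different scales (made-up values) and compares the variance explained by PCA with and without `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy "expression" matrix: 100 samples x 3 genes on very different scales
X = np.column_stack([
    rng.normal(1000, 200, size=100),   # highly expressed gene, large variance
    rng.normal(5, 1, size=100),
    rng.normal(0.1, 0.02, size=100),
])

# PCA on the raw values
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after subtracting the mean and dividing by the standard deviation
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
print(raw_ratio)
print(scaled_ratio)
```

Without scaling, the first component is expected to be driven almost entirely by the gene with the largest variance; after scaling, the variance should be shared more evenly among the components.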

In [None]:
import pandas as pd
import numpy as np
from os import path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# To ignore some plot warnings
import warnings
warnings.filterwarnings('ignore')

n_components = 2

# We load the expression data
expression_df = pd.read_csv(path.join('data', 'gene_expression.tsv.gz'),
                            sep="\t", header='infer', index_col=0, compression='gzip')

# We preprocess the data with standardization
scaler = StandardScaler()
data = scaler.fit_transform(expression_df.T)

# We perform the PCA
pca2D = PCA(n_components=n_components)
proj_data = pca2D.fit_transform(data)

# We can check the amount of variance explained by the two principal components
print(pca2D.explained_variance_ratio_)

# We plot the data (the first two components are the first two columns of proj_data)
scatter = plt.scatter(proj_data[:,0], proj_data[:,1], s=2)  # Adjust the 's' parameter to control the size
plt.title('Transcriptome')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

In [None]:
# We can even use the third PC and do a 3D plot
n_components = 3
# We perform the PCA
pca3D = PCA(n_components=n_components)
proj_data3D = pca3D.fit_transform(data)

print(pca3D.explained_variance_ratio_)

fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(121, projection='3d')
ax.scatter(proj_data3D[:,0],
              proj_data3D[:,1],
              proj_data3D[:,2], s=2)
ax.set_title("Transcriptome")

A priori we cannot distinguish much, but we could try to colour each sample based on the primary tumor type and see if it is more informative.

In [None]:
from matplotlib.colors import ListedColormap
from matplotlib.lines import Line2D

# Get sample dataframe with the information
sample_df = pd.read_csv(path.join('data', 'sample_df.tsv'), sep="\t", header='infer')

# Generate a dictionary to translate the Specimen IDs to the tumor type code.
tumortype_dict = dict(zip(sample_df.icgc_specimen_id, sample_df.primary_location))

# Get the tumor type for each sample (each column)
labels = expression_df.columns.map(tumortype_dict)

# Automatically create a mapping from label categories to integers
unique_labels = np.unique(labels)
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}

# Convert labels to integers based on the automatic mapping
label_integers = np.array([label_mapping[label] for label in labels])

# Automatically generate a ListedColormap with unique colors based on the number of labels
num_colors = len(unique_labels)
color_map = plt.get_cmap('tab20', num_colors)  # Replace 'tab20' with any other colormap of your choice

# Create a ListedColormap with unique colors
listed_color_map = ListedColormap([color_map(idx) for idx in range(num_colors)])

# Plot with colours 
# We plot the data (the first two components are the first two columns of proj_data)
scatter = plt.scatter(proj_data[:,0], proj_data[:,1], s=2, c=label_integers, cmap=listed_color_map)

# Create legend handles and labels
legend_handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=color_map(idx), markersize=10, label=label)
                  for idx, label in enumerate(unique_labels)]

# Add legend
plt.legend(handles=legend_handles, title='Tumor Type', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.title('Transcriptome')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# Save image
plt.savefig(path.join('plots', 'PCA.png'))
plt.show()

Apparently the PCA can separate some blood cancer clusters, but it struggles to separate other samples from a wide variety of primary tumor locations. Definitely, this implementation of PCA does not solve the problem at hand.

### t-SNE

t-SNE does not assume linearity and might be a better proxy to separate the different types of tumors.

In [None]:
from sklearn.manifold import TSNE

# Transpose the matrix
transposed_matrix = expression_df.to_numpy().T

# Initialize TSNE with 2 components (for 2D visualization)
tsne = TSNE(n_components=2, random_state=42)

# Fit and transform the data
tsne_result = tsne.fit_transform(transposed_matrix)

# Create a DataFrame with the t-SNE results
tsne_df = pd.DataFrame(tsne_result, columns=['TSNE1', 'TSNE2'])

# Finally, plot the t-SNE results
plt.figure(figsize=(10, 8))
plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], s=5, c=label_integers, cmap=listed_color_map)

# Add legend
plt.legend(handles=legend_handles, title='Tumor Type', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.title('t-SNE Visualization')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

# Save the figure
plt.savefig(path.join('plots', 'tSNE.png'))
plt.show()

This non-linear transformation that t-SNE uses to reduce the dimensionality allows for a better differentiation of specific groups, namely pancreas, prostate, brain and some kidney tumors. Other tumor types are not so clearly differentiated, although they tend to group together. Within primary cancer types there is also variability; for instance, multiple types of blood cancers tend to show subclusters (as do colorectal tumors and others).

In [None]:
# Generate a binary category
binary_blood = pd.Series(labels).apply(lambda x: 'Blood' if x == 'Blood' else 'Others')

# Automatically create a mapping from label categories to integers
unique_labels = np.unique(binary_blood)
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}

# Convert labels to integers based on the automatic mapping
label_integers = np.array([label_mapping[label] for label in binary_blood])

color_map = ListedColormap(['red', 'lightgrey',])

# Finally, plot the t-SNE results
plt.figure(figsize=(10, 8))
plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], s=5, c=label_integers, cmap=color_map)

# Create legend handles and labels
legend_handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=color_map(idx), markersize=10, label=label)
                  for idx, label in enumerate(unique_labels)]

# Add legend
plt.legend(handles=legend_handles, title='Tumor Type', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.title('t-SNE Visualization')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

# Save the figure
plt.savefig(path.join('plots', 'tSNE_blood.png'))
plt.show()

In [None]:
# Generate a binary category
binary_colorectal = pd.Series(labels).apply(lambda x: 'Colorectal' if x == 'Colorectal' else 'Others')

# Automatically create a mapping from label categories to integers
unique_labels = np.unique(binary_colorectal)
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}

# Convert labels to integers based on the automatic mapping
label_integers = np.array([label_mapping[label] for label in binary_colorectal])

color_map = ListedColormap(['green', 'lightgrey',])

# Finally, plot the t-SNE results
plt.figure(figsize=(10, 8))
plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], s=5, c=label_integers, cmap=color_map)

# Create legend handles and labels
legend_handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=color_map(idx), markersize=10, label=label)
                  for idx, label in enumerate(unique_labels)]

# Add legend
plt.legend(handles=legend_handles, title='Tumor Type', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.title('t-SNE Visualization')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

# Save the figure
plt.savefig(path.join('plots', 'tSNE_colorectal.png'))
plt.show()

### Hierarchical clustering

A priori we do not have any preliminary information like the primary types, but we can try to extract clusters directly from the t-SNE output. This is not devoid of interpretation problems (https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne), so well-differentiated clusters might not show real biological features that differentiate them.

For starters let's assume that we do not know the underlying structure of primary types and try to extract 18 clusters.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, cophenet
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

# Apply hierarchical clustering
linkage_matrix = linkage(tsne_result, method='ward')

# Compute the cophenetic correlation coefficient
c, coph_dists = cophenet(linkage_matrix, pdist(tsne_result))
print(f"Cophenetic correlation coefficient: {c}")

# Set the number of clusters using maxclust
num_clusters = 18

# Assign cluster labels based on the maxclust criterion
clusters = fcluster(linkage_matrix, num_clusters, criterion='maxclust')

# Get the threshold distance used for clustering
threshold_distance = linkage_matrix[-(num_clusters - 1), 2]

# Compute silhouette score
silhouette_avg = silhouette_score(tsne_result, clusters)
print(f"Silhouette Score: {silhouette_avg}")

# Plot the dendrogram with a horizontal line at the threshold
plt.figure(figsize=(12, 8))
dendrogram(linkage_matrix, leaf_rotation=90., leaf_font_size=8., color_threshold=threshold_distance)
plt.xticks([]) # Remove x-axis labels
plt.axhline(y=threshold_distance, color='r', linestyle='--', label=f'Max Clusters ({num_clusters})')
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample point')
plt.ylabel('Distance')
plt.legend()

# Save the plot
plt.savefig(path.join('plots', 'HierarchClust_dendogram.png'))
plt.show()

# Plot the t-SNE results with cluster colors
plt.figure(figsize=(10, 8))
plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], c=clusters, cmap='tab20', s=5)
plt.title('t-SNE Visualization with Hierarchical Clustering')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

# Save the plot
plt.savefig(path.join('plots', 'tSNE_HierarchClust.png'))
plt.show()

The red dashed line on the dendogram marks the threshold of ward's distance that has been used to define 18 clusters. How good is the clustering then?

From a strictly geometric point of view, the **Cophenetic Correlation Coefficient (or Ward's method)**, which measures how faithfully the hierarchy preserves the pairwise distances between the original data points, shows a moderate level of fidelity (this metric ranges from -1 to 1, where higher values indicate better preservation). Moreover, the **Silhouette Score**, which measures cohesion of the clusters and also ranges from -1 to 1 with higher values indicating cohesed clusters, reflects a rather mid cohesion. However, these metrics do not have any value unless interpreted under the light of the nature of the data.  

Some clusters are clearly defined, and we know that reflect real expression pattern differences due to primary type location and even tumor subtypes (different blood cancers show different clusters that, in reality, reflect different types of tumors). However, on the center of the t-SNE plot there are some clusters that were extracted from groups of points with not so clear distinction between them. Here the expression patterns are not enough different to distinguish clear subdivisions and might not reflect any biological feature (at least it does not reflect the primary type location). We can make this interpretation thanks to external information about tumor type, but as an unsupervised method this is not implemented on its methodology.

We can use the dendrogram to define other numbers of clusters (independently of the value of 18 clusters that we obtained from external information), or use the external primary type information to compute the **Adjusted Rand Index (ARI)** as a proxy for the performance of the clustering algorithm.
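The ARI compares two partitions of the same samples: it equals 1 when both partitions are identical, stays close to 0 for random assignments, and does not depend on how the cluster ids are named. The toy example below illustrates this with made-up labels for six samples; the tumor names and the variables `true_demo`, `perfect` and `partial` are hypothetical and are not taken from the dataset.

```python
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import LabelEncoder

# Hypothetical tumor-type labels for six samples (made-up values)
tumor_types = ['blood', 'blood', 'lung', 'lung', 'skin', 'skin']

# Encode the string labels as integers, one integer per tumor type
true_demo = LabelEncoder().fit_transform(tumor_types)

# Same grouping with different cluster ids: only the partition matters
perfect = [2, 2, 0, 0, 1, 1]

# One 'lung' sample wrongly grouped with the 'skin' samples
partial = [2, 2, 0, 1, 1, 1]

ari_perfect = adjusted_rand_score(true_demo, perfect)
ari_partial = adjusted_rand_score(true_demo, partial)

print(f'ARI for the matching partition: {ari_perfect:.3f}')
print(f'ARI with one misplaced sample: {ari_partial:.3f}')
```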

In [None]:
from matplotlib.colors import to_hex
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import LabelEncoder

# Function to calculate and plot hierarchical clustering
def hierarchical_clustering_and_tsne(tsne_result, true_labels, num_clusters_list, method):
    
    # Get the colours for the threshold
    color_map = plt.get_cmap('tab10', len(num_clusters_list))
    
    # Apply hierarchical clustering
    linkage_matrix = linkage(tsne_result, method=method)
    
    # Plot dendrogram and t-SNE for different numbers of clusters
    plt.figure(figsize=(12, 8))
    dendrogram(linkage_matrix, leaf_rotation=90., leaf_font_size=8.)
    plt.title(f'Hierarchical Clustering Dendrogram ({method} linkage)')
    plt.xlabel('Sample Index')
    plt.ylabel('Distance')
    
    all_clusters = list()
    for i, num_clusters in enumerate(num_clusters_list):
        
        # Assign cluster labels based on the maxclust criterion
        clusters = fcluster(linkage_matrix, num_clusters, criterion='maxclust')
        all_clusters.append(clusters)
    
        # Height of the merge that reduces num_clusters clusters to num_clusters - 1;
        # cutting the tree just below this line yields num_clusters clusters
        threshold_distance = linkage_matrix[-(num_clusters - 1), 2]
        plt.axhline(y=threshold_distance, color=to_hex(color_map.colors[i]), linestyle='--', label=f'Max Clusters ({num_clusters})')
    
    plt.legend()
    plt.show()
    
    for i, clusters in enumerate(all_clusters):

        # Plot the t-SNE results with cluster colors
        plt.figure(figsize=(10, 8))
        plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], c=clusters, cmap='tab20', s=5)
        plt.title(f't-SNE Visualization (Clusters={num_clusters_list[i]})', color=to_hex(color_map.colors[i]))
        plt.xlabel('t-SNE Component 1')
        plt.ylabel('t-SNE Component 2')
        plt.show()

        # Calculate Adjusted Rand Index (ARI)
        ari = adjusted_rand_score(true_labels, clusters)
        print(f'Adjusted Rand Index (ARI) for Clusters={num_clusters_list[i]}: {ari:.4f}\n')


# List of different numbers of clusters to try
num_clusters_list = [4, 7, 10, 14, 18]

# We can also try other linkage methods by changing this parameter
method = 'ward'

# Call the function to perform hierarchical clustering and t-SNE for each number of clusters
hierarchical_clustering_and_tsne(tsne_result, label_integers, num_clusters_list, method)

As shown in the output, the Adjusted Rand Index is highest with 14 clusters, fewer than the 18 used before. Note that the code also lets you change the linkage method to one other than **Ward's method**. Feel free to experiment with other methods such as **single** linkage (minimum distance), **complete** linkage (maximum distance), and so on. The values that the `method` parameter of the `linkage` function can take are listed in the SciPy documentation (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html).
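A quick way to compare linkage methods side by side is to loop over them and collect the ARI of each one. The sketch below does this on synthetic data from `make_blobs`, so the numbers it prints say nothing about the cancer dataset. The variables `X_demo`, `y_demo` and `ari_by_method`, and the choice of 5 clusters, are assumptions for illustration only.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the t-SNE coordinates and the tumor labels (assumption)
X_demo, y_demo = make_blobs(n_samples=300, centers=5, cluster_std=1.5, random_state=1)

ari_by_method = {}
for method_name in ['ward', 'single', 'complete', 'average']:
    # Build the hierarchy with the current linkage method
    Z_demo = linkage(X_demo, method=method_name)

    # Cut the tree into (at most) 5 flat clusters
    labels = fcluster(Z_demo, 5, criterion='maxclust')

    # Compare the flat clusters with the known synthetic groups
    ari_by_method[method_name] = adjusted_rand_score(y_demo, labels)

for method_name, ari in ari_by_method.items():
    print(f'{method_name:>8}: ARI = {ari:.3f}')
```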

### K-means clustering

Similar to hierarchical clustering, we can use k-means clustering to extract clusters from the t-SNE results using the following code.

In [None]:
from sklearn.cluster import KMeans

# Function to calculate and plot k-means clustering
def kmeans_clustering_and_tsne(tsne_result, true_labels, num_clusters_list, random_seed=42):

    all_clusters = list()
    for num_clusters in num_clusters_list:
        
        # Apply k-means clustering
        kmeans = KMeans(n_clusters=num_clusters, random_state=random_seed, n_init='auto')
        clusters = kmeans.fit_predict(tsne_result)
        all_clusters.append(clusters)
    
        # Plot the t-SNE results with cluster colors
        plt.figure(figsize=(10, 8))
        plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], c=clusters, cmap='tab20', s=5)
        plt.title(f't-SNE Visualization (Clusters={num_clusters})')
        plt.xlabel('t-SNE Component 1')
        plt.ylabel('t-SNE Component 2')
        plt.show()

        # Calculate Adjusted Rand Index (ARI)
        ari = adjusted_rand_score(true_labels, clusters)
        print(f'Adjusted Rand Index (ARI) for Clusters={num_clusters}: {ari:.4f}\n')


# List of different numbers of clusters to try
num_clusters_list = [4, 7, 10, 14, 18]

# Change the random seed to see how much the algorithm depends on initialization conditions
random_seed = 123

# Call the function to perform k-means clustering and t-SNE for each number of clusters
kmeans_clustering_and_tsne(tsne_result, label_integers, num_clusters_list, random_seed)
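One possible way to quantify how much k-means depends on its initialization is to repeat the clustering with several seeds and summarize the spread of the ARI. The sketch below is only a starting point on synthetic data. The seed list, the variables `X_demo`, `y_demo` and `aris`, and the use of `n_init=1` (a single initialization per run, so the seed effect is not averaged out) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the t-SNE coordinates and the tumor labels (assumption)
X_demo, y_demo = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=2)

seeds = [0, 1, 2, 3, 4]
aris = []
for seed in seeds:
    # A single initialization per run makes the effect of the seed visible
    km = KMeans(n_clusters=5, random_state=seed, n_init=1)
    aris.append(adjusted_rand_score(y_demo, km.fit_predict(X_demo)))

# A small standard deviation suggests a weak dependence on the seed
print(f'ARI per seed: {[round(a, 3) for a in aris]}')
print(f'ARI mean = {np.mean(aris):.3f}, std = {np.std(aris):.3f}')
```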

## Exercises for this session

Using the code above, explore how the **Adjusted Rand Index (ARI)** (using the primary cancer types as a proxy for the "true" groups) changes with the number of clusters for the **K-means clustering** algorithm. How different is it compared to hierarchical clustering?

Try different seeds to initialize the clustering algorithm at random. How dependent are the results on the initialization for this algorithm?

Write code for the plots and the metrics. Finally, discuss your answer, based on the output, in a markdown cell (present your exercise in this Jupyter notebook format).