<a href="https://colab.research.google.com/github/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_hla.ipynb"> <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>
<a id="raw-url" href="https://raw.githubusercontent.com/perrin-isir/xomx-tutorials/main/tutorials/xomx_hla.ipynb" download> <img align="left" src="https://img.shields.io/badge/Github-Download%20(Right%20click%20%2B%20Save%20link%20as...)-blue" alt="Download (Right click + Save link as)" title="Download Notebook"></a>

# *xomx tutorial:* **tissue prediction based on HLA-presented peptides**

**Remark:** This notebook runs best using a GPU runtime (for the variational autoencoder training).  
In Colab: from the Colab menu, choose Runtime > Change Runtime Type, then select **'GPU'**.

In [None]:
# imports:
import os
import sys
import joblib
from IPython.display import clear_output
try:
    import xomx
except ImportError:
    !pip install xomx
    clear_output()
    import xomx
try:
    import scanpy as sc
except ImportError:
    !pip install scanpy
    clear_output()
    import scanpy as sc
try:
    import trimap
except ImportError:
    !pip install trimap
    clear_output()
    import trimap
import numpy as np
import pandas as pd

# Allow forcing a plotting extension (bokeh or matplotlib) when the code is run as a Python script:
if len(sys.argv) > 1 and sys.argv[1] in ["bokeh", "matplotlib"]:
    xomx.pl.force_extension(sys.argv[1])

In [None]:
save_dir = os.path.join(os.path.expanduser("~"), "results", "xomx-tutorials", "xomx_hla")  # the default directory in which results are stored
os.makedirs(save_dir, exist_ok=True)

The HLA Ligand Atlas is a resource of natural HLA ligands presented on benign tissues.  
We first gather four pandas dataframes from the HLA Ligand Atlas in a dict (`dfs`):
- `dfs["peptides"]`: the list of peptide sequences with their id,
- `dfs["donors"]`: the list of donors and their alleles,
- `dfs["sample_hits"]`: for all the peptide sequences, the donors and tissues in which they have been found, and their HLA class,
- `dfs["aggregated"]`: one row per peptide sequence, with the HLA class of the peptide, and the list of donor alleles and tissues associated with the peptide. 

In [None]:
base_url = "http://hla-ligand-atlas.org/rel/2020.12/"
filenames = ["peptides", "donors", "sample_hits", "aggregated"]
dfs = {}
for nm in filenames:
    if not os.path.isfile(os.path.join(save_dir, nm + ".joblib")):
        dfs[nm] = pd.read_csv(base_url + nm + ".tsv.gz", sep="\t")
        joblib.dump(dfs[nm], os.path.join(save_dir, nm + ".joblib"))
    else:
        dfs[nm] = joblib.load(os.path.join(save_dir, nm + ".joblib"))
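
The cell above follows a download-once pattern: each table is fetched only if no local copy exists yet. Here is a minimal, dependency-free sketch of the same idea. The helper `load_or_compute` and the `fake_download` callback are hypothetical names (not part of xomx or joblib), and `pickle` stands in for joblib:

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object cached at `path`, computing and caching it first if needed."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    value = compute()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value


# Usage with a throwaway directory and a stand-in for the download step.
calls = []


def fake_download():
    calls.append(1)  # record that the expensive step actually ran
    return {"rows": 3}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "peptides.pkl")
    first = load_or_compute(cache_path, fake_download)   # runs fake_download
    second = load_or_compute(cache_path, fake_download)  # read back from disk
```

The second call never reaches `fake_download`, which is why re-running the notebook does not download the tables again.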

We compute the set of all alleles present in the database:

In [None]:
alleles_ = sorted(list(set(np.concatenate([allele.split(",") for allele in dfs["aggregated"].donor_alleles]))))

Each allele in this list carries one of three prefixes, "n/", "w/" or "s/", which encodes the predicted binding of the peptide to that allele:  
- "n/": predicted non-binder donor allele
- "w/": predicted weak binder donor allele
- "s/": predicted strong binder donor allele

For example, the peptide with id 22 has been found in donors with the following alleles:

In [None]:
list(dfs["aggregated"][dfs["aggregated"].peptide_sequence_id == 22].donor_alleles)

The peptide is predicted to be a non-binder for all of these alleles, except for DRB5\*01:01, for which it is predicted to be a strong binder.  
Here is the list of alleles without the prefixes:

In [None]:
alleles = sorted(list(set([al[2:] for al in alleles_])))
alleles
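
To make the prefix convention concrete, here is a small sketch that splits a `donor_alleles` string into binding classes. The function name and the example string are made up for illustration; the string is not taken from the atlas:

```python
from collections import defaultdict

# Human-readable names for the three prefixes described above.
BINDING_CLASSES = {"n": "non-binder", "w": "weak binder", "s": "strong binder"}


def group_alleles_by_binding(donor_alleles):
    """Split a comma-separated `donor_alleles` string into {binding class: [alleles]}."""
    grouped = defaultdict(list)
    for entry in donor_alleles.split(","):
        prefix, allele = entry.split("/", 1)
        grouped[BINDING_CLASSES[prefix]].append(allele)
    return dict(grouped)


# Made-up example string (hypothetical, for illustration only).
example = "n/A*01:01,w/B*08:01,s/DRB5*01:01,n/C*07:01"
grouped = group_alleles_by_binding(example)
```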

We now filter the data to keep only peptides that are predicted to be weak or strong binders for the allele B\*08:01:

In [None]:
selected_allele = "B*08:01"
allele_filtered_df = dfs["aggregated"][dfs["aggregated"].donor_alleles.apply(lambda x: ("w/" + selected_allele in x) or ("s/" + selected_allele in x))]

Here is the set of tissues in the database:

In [None]:
tissues = set(np.concatenate([tissue.split(",") for tissue in dfs["aggregated"].tissues]))
tissues

We select a few of them, for example "Liver", "Lung" and "Ovary", and keep only the peptides that have been found in exactly one of the selected tissues:

In [None]:
selected_tissues = ["Liver", "Lung", "Ovary"]

def filter_peptides(x):
    return sum([tissue in x for tissue in selected_tissues]) == 1

tissue_filtered_df = allele_filtered_df[allele_filtered_df.tissues.apply(filter_peptides)]
print(f"{len(tissue_filtered_df)} peptides")
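
`filter_peptides` tests each selected tissue as a substring of the comma-separated `tissues` field. Below is a sketch of a variant that compares whole tissue names instead, which would avoid accidental matches if one tissue name were contained in another. The helper name and the example values are hypothetical:

```python
def found_in_exactly_one(tissues_field, selected):
    """True if exactly one of `selected` appears as a whole name in the comma-separated field."""
    present = set(tissues_field.split(","))
    return len(present & set(selected)) == 1


selected = ["Liver", "Lung", "Ovary"]
# Made-up `tissues` values for illustration.
rows = ["Liver,Heart", "Liver,Lung", "Heart,Kidney", "Ovary"]
kept = [row for row in rows if found_in_exactly_one(row, selected)]
```

Here "Liver,Lung" is dropped because it hits two selected tissues, and "Heart,Kidney" is dropped because it hits none.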

We create an AnnData object with one-hot encodings of the peptides. The label attributed to a peptide is the name of the unique tissue (among the ones selected) in which it has been found.

In [None]:
max_length_peptide = tissue_filtered_df.peptide_sequence.apply(len).max()
xd = sc.AnnData(shape=(tissue_filtered_df.shape[0], max_length_peptide * len(xomx.tl.aminoacids)))
xd.obs_names = np.array(tissue_filtered_df.peptide_sequence)
xd.X = np.empty((xd.n_obs, xd.n_vars))
for i in range(xd.n_obs):
    xd.X[i, :] = xomx.tl.onehot(xd.obs_names[i], max_length_peptide)
    
def compute_label(x):
    tissue_array = np.array([tissue if tissue in x else "" for tissue in selected_tissues])
    return "".join(tissue_array)

xd.obs['labels'] = np.array(tissue_filtered_df.tissues.apply(compute_label))
xd.uns['all_labels'] = xomx.tl.all_labels(xd.obs['labels'])
xd.uns['obs_indices_per_label'] = xomx.tl.indices_per_label(xd.obs['labels'])
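
To make the shape of `xd.X` concrete: each peptide becomes a flat vector of length `max_length_peptide * len(xomx.tl.aminoacids)`, with one block of entries per position. The sketch below is not `xomx.tl.onehot`. It assumes a 20-letter alphabet in alphabetical order and zero-padding for peptides shorter than the maximum length, both of which may differ from the choices made in xomx:

```python
# Assumed alphabet: the 20 standard amino acids, one-letter codes, alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def onehot_sketch(sequence, max_length):
    """Encode `sequence` as a flat 0/1 list of length max_length * len(AMINO_ACIDS)."""
    vector = [0.0] * (max_length * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence):
        # Set the entry for `residue` inside the block reserved for `position`.
        vector[position * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)] = 1.0
    # Positions beyond len(sequence) stay all-zero (padding assumption).
    return vector


encoded = onehot_sketch("ACD", max_length=4)
```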

Let's observe 2D embeddings of the data. First with the TRIMAP algorithm:

In [None]:
trim = trimap.TRIMAP()
xd.obsm["trimap"] = trim.fit_transform(xd.X)

In [None]:
xomx.pl.plot_2d_obsm(xd, "trimap")

Then with a variational autoencoder:

In [None]:
# first we check the backend for JAX:
import jax
print(jax.default_backend())

rng = np.random.RandomState(0)
vae = xomx.em.BetaVAE(xd, n_components=2, random_state=rng)

In [None]:
xd.obsm["vae"] = vae.fit_transform(iterations=10000)

In [None]:
xomx.pl.plot_2d_obsm(xd, "vae")

In [None]:
xomx.tl.train_and_test_indices(xd, "obs_indices_per_label", test_train_ratio=0.25, rng=rng)
classifier = {}

In [None]:
for tissue in selected_tissues:
    classifier[tissue] = xomx.cl.ExtraTrees(
        xd,
        tissue,
        n_estimators=450,
        random_state=rng,
    )
    confusion_matrix = classifier[tissue].train()
    print(tissue)
    print(confusion_matrix)
    print()

In [None]:
classifier["Ovary"].plot()

In [None]:
xomx.tl.matthews_coef(classifier["Ovary"].confusion_matrix)

**Remark:** the MCC score obtained is close to 0.5, which is clearly better than random predictions (MCC ~ 0). However, for other choices of alleles and tissues, we frequently obtain an MCC score close to 0, meaning that the classifier fails to generalize.  
Tissue prediction based on HLA-presented peptides is a hard problem, but there may be specific cases in which it is feasible.
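
For reference, the Matthews correlation coefficient of a binary classifier is computed from the four confusion-matrix counts as `(TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))`. Here is a small sketch of that formula, independent of `xomx.tl.matthews_coef` (no assumption is made about its input layout). The function name and the counts are made up:

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


perfect = mcc(tp=50, fp=0, fn=0, tn=50)    # every prediction correct
chance = mcc(tp=25, fp=25, fn=25, tn=25)   # no better than coin flipping
partial = mcc(tp=40, fp=10, fn=20, tn=30)  # somewhere in between
```

With these made-up counts, `perfect` is 1, `chance` is 0, and `partial` is roughly 0.41.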

In [None]:
sbm = xomx.cl.ScoreBasedMulticlass(xd, xd.uns["all_labels"], classifier)

In [None]:
sbm.plot()
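
One common way to combine one-vs-rest binary classifiers into a multiclass predictor is to score a sample with each classifier and return the label with the highest score. The sketch below illustrates that idea only. It is an assumption that `ScoreBasedMulticlass` works along these lines, and the scoring functions are toy stand-ins rather than trained classifiers:

```python
def predict_by_best_score(sample, scorers):
    """Return the label whose scoring function gives `sample` the highest score."""
    return max(scorers, key=lambda label: scorers[label](sample))


# Toy stand-ins for per-tissue classifiers: each one scores a sample
# (a dict of made-up features) by reading a single feature.
scorers = {
    "Liver": lambda s: s["liver_like"],
    "Lung": lambda s: s["lung_like"],
    "Ovary": lambda s: s["ovary_like"],
}
sample = {"liver_like": 0.2, "lung_like": 0.7, "ovary_like": 0.4}
predicted = predict_by_best_score(sample, scorers)
```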

In [None]:
# To avoid quitting at the end of the tutorial if the code is executed as a python script:
from IPython import embed

embed()