# *xomx tutorial:* **tissue prediction based on HLA-presented peptides**

In [None]:
# imports:
import os
import joblib
from IPython.display import clear_output
try:
    import xomx
except ImportError:
    !pip install git+https://github.com/perrin-isir/xomx.git
    clear_output()
    import xomx
try:
    import scanpy as sc
except ImportError:
    !pip install scanpy
    clear_output()
    import scanpy as sc
import numpy as np
import pandas as pd

In [None]:
save_dir = os.path.join(os.path.expanduser("~"), "results", "xomx-tutorials", "xomx_hla")  # the default directory in which results are stored
os.makedirs(save_dir, exist_ok=True)

The HLA Ligand Atlas is a resource of natural HLA ligands presented on benign tissues.  
We first gather four pandas dataframes from the HLA Ligand Atlas in a dict (`dfs`):
- `dfs["peptides"]`: the list of peptide sequences with their id,
- `dfs["donors"]`: the list of donors and their alleles,
- `dfs["sample_hits"]`: for all the peptide sequences, the donors and tissues in which they have been found, and their HLA class,
- `dfs["aggregated"]`: one row per peptide sequence, with the HLA class of the peptide, and the list of donor alleles and tissues associated with the peptide. 

In [None]:
base_url = "http://hla-ligand-atlas.org/rel/2020.12/"
filenames = ["peptides", "donors", "sample_hits", "aggregated"]
dfs = {}
for nm in filenames:
    if not os.path.isfile(os.path.join(save_dir, nm + ".joblib")):
        dfs[nm] = pd.read_csv(base_url + nm + ".tsv.gz", sep="\t")
        joblib.dump(dfs[nm], os.path.join(save_dir, nm + ".joblib"))
    else:
        dfs[nm] = joblib.load(os.path.join(save_dir, nm + ".joblib"))

We compute the set of all alleles present in the database:

In [None]:
alleles = sorted(list(set(np.concatenate([allele.split(",") for allele in dfs["aggregated"].donor_alleles]))))

In this list, the alleles start with one of the 3 prefixes "n/", "w/" and "s/", which characterize binding predictions of peptides:  
- "n/": predicted non-binder donor allele
- "w/": predicted weak binder donor allele
- "s/": predicted strong binder donor allele
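
To make the prefix convention concrete, here is a minimal sketch that splits a `donor_alleles`-style string into binding categories. The helper `group_by_binding` and the example string are hypothetical and only serve as an illustration; they are not part of xomx or of the HLA Ligand Atlas.

```python
from collections import defaultdict

# Hypothetical example string in the same format as `donor_alleles`:
# comma-separated alleles, each prefixed by "n/", "w/" or "s/".
example = "n/A*01:01,w/A*02:01,s/DRB5*01:01,n/B*07:02"

PREFIX_MEANING = {"n": "non-binder", "w": "weak binder", "s": "strong binder"}


def group_by_binding(donor_alleles: str) -> dict:
    """Group allele names by predicted binding category."""
    groups = defaultdict(list)
    for entry in donor_alleles.split(","):
        # Split on the first "/" only, so the allele name is kept intact.
        prefix, allele = entry.split("/", 1)
        groups[PREFIX_MEANING[prefix]].append(allele)
    return dict(groups)


grouped = group_by_binding(example)
print(grouped)
```

Applied to a real entry of `dfs["aggregated"].donor_alleles`, the same function would list, for one peptide, which donor alleles are predicted binders.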

For example, the peptide with id 22 has been found in donors with the following alleles:

In [None]:
list(dfs["aggregated"][dfs["aggregated"].peptide_sequence_id == 22].donor_alleles)

The peptide is predicted to be a non-binder for all of these alleles, except for DRB5\*01:01, for which it is predicted to be a strong binder.

We now filter the data to keep only peptides that are predicted to be weak or strong binders for the allele A\*02:01:

In [None]:
allele_filtered_df = dfs["aggregated"][dfs["aggregated"].donor_alleles.apply(lambda x: ("w/A*02:01" in x) or ("s/A*02:01" in x))]

Here is the set of tissues in the database:

In [None]:
tissues = set(np.concatenate([tissue.split(",") for tissue in dfs["aggregated"].tissues]))
tissues

We select two of them, for example "Thymus" and "Liver", and filter the data to keep only the peptides that have been found in either of these tissues:

In [None]:
tissue_1 = "Thymus"
tissue_2 = "Liver"
tissue_filtered_df = allele_filtered_df[allele_filtered_df.tissues.apply(lambda x: tissue_1 in x or tissue_2 in x)]
print(f"{len(tissue_filtered_df)} peptides")

In [None]:
# One-hot encode every peptide, using the length of the longest peptide as the
# common length: max_length_peptide * len(xomx.tl.aminoacids) features per peptide.
max_length_peptide = tissue_filtered_df.peptide_sequence.apply(len).max()
xd = sc.AnnData(shape=(tissue_filtered_df.shape[0], max_length_peptide * len(xomx.tl.aminoacids)))
xd.obs_names = np.array(tissue_filtered_df.peptide_sequence)
xd.X = np.empty((xd.n_obs, xd.n_vars))
for i in range(xd.n_obs):
    xd.X[i, :] = xomx.tl.onehot(xd.obs_names[i], max_length_peptide)
# Label each peptide with the tissue(s) in which it was found: tissue_1, tissue_2,
# or the concatenation of both names when it was found in both tissues.
xd.obs['labels'] = np.array(tissue_filtered_df.tissues.apply(lambda x: (tissue_1 if tissue_1 in x else "") + (tissue_2 if tissue_2 in x else "")))
xd.uns['all_labels'] = xomx.tl.all_labels(xd.obs['labels'])
xd.uns['obs_indices_per_label'] = xomx.tl.indices_per_label(xd.obs['labels'])

In [None]:
import umap  # from the umap-learn package; needed for umap.UMAP() below

rng = np.random.RandomState(0)
xomx.pl.plot_2d_embedding(xd, umap.UMAP())

In [None]:
xomx.tl.train_and_test_indices(xd, "obs_indices_per_label", test_train_ratio=0.25, rng=rng)
classifier = {}
classifier[tissue_1] = xomx.fs.RFEExtraTrees(
    xd,
    tissue_1,
    n_estimators=450,
    random_state=rng,
)

In [None]:
classifier[tissue_1].init()

In [None]:
classifier[tissue_1].plot()

In [None]:
xomx.tl.matthews_coef(classifier[tissue_1].confusion_matrix)

Remark: the MCC score obtained is close to 0.5, which is clearly better than random predictions (MCC ≈ 0). However, for other choices of alleles and tissues, we frequently obtain an MCC score close to 0, which shows that the classifier does not generalize at all in those cases.  
The problem of tissue prediction based on HLA-presented peptides is hard, but there may be specific cases for which it is possible.
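
For reference, the MCC of a binary classifier can be computed directly from the four entries of its confusion matrix. The sketch below applies the standard formula to two made-up 2×2 matrices; the helper `mcc_from_confusion` is hypothetical, and this is not necessarily how `xomx.tl.matthews_coef` is implemented.

```python
import math


def mcc_from_confusion(cm) -> float:
    """Matthews correlation coefficient of a 2x2 confusion matrix.

    Assumes rows are true classes and columns are predicted classes;
    the result is the same if the matrix is transposed.
    """
    (a, b), (c, d) = cm
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # By convention, return 0 when a row or column of the matrix is empty.
    return (a * d - b * c) / denom if denom > 0 else 0.0


# Made-up counts: 30 + 30 correct predictions, 10 + 10 errors.
mcc_good = mcc_from_confusion([[30, 10], [10, 30]])
# Made-up counts: predictions independent of the true class.
mcc_random = mcc_from_confusion([[20, 20], [20, 20]])
print(mcc_good, mcc_random)
```

With these illustrative counts, 75% accuracy on balanced classes corresponds to an MCC of 0.5, while predictions that are independent of the true class give an MCC of 0.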