# Ligandome database export guide

We do not distribute external databases in this repo. Instead, this notebook guides users through exporting the publicly available datasets and databases needed to run the ligandome workflow. **Some parts of this notebook require visiting sites and searching manually.** The licenses for each data source are under `./licenses`.

In [None]:
ALLELES =  [
            'A0101',
            'A0201',
            'A0207',
            'A0301',
            'A1101',
            'A2301',
            'A2402',
            'A3001',
            'A3303',
            'B0702',
            'B0801',
            'B1501',
            'B1502',
            'B1503',
            'B3501',
            'B4001',
            'B4006',
            'B4402',
            'B4403',
            'B4601',
            'B5101',
            'B5201',
            'B5301',
            'B5801',
            'C0102',
            'C0202',
            'C0304',
            'C0401',
            'C0602',
            'C0701',
            'C0702',
            'C0801',
            'C1502',
            'C1601'
            ]

In [None]:
import zipfile
import requests
import pandas as pd
from tqdm import tqdm
from pathlib import Path
from joblib import Parallel, delayed

In [None]:
TMP_DATA_DIR = Path('./tmp_data')
DATABASE_EXPORTS_PATH = Path('../ligandome/database_exports')

In [None]:
TMP_DATA_DIR.mkdir(exist_ok=True)
DATABASE_EXPORTS_PATH.mkdir(exist_ok=True, parents=True)

# 1. HLA Ligand ATLAS

In [None]:
HLA_TMP_DATA_DIR = TMP_DATA_DIR / 'HLA_LIGAND_ATLAS'
HLA_TMP_DATA_DIR.mkdir(exist_ok=True)

In [None]:
HLA_LIGAND_ATLAS_ALLELES = [allele for allele in ALLELES if allele not in [
    'B5301',
    'B1503',
    'A0207',
    'C1502',
    'B5201',
    'C0801',
    'B5101',
    'B4601',
    'A3303',
    'B4006',
    'B4001',
    'C0102',
    'B1502',
]]

Due to a small issue with the export function on HLA Ligand ATLAS's main site, we request the data for each allele of interest separately, then aggregate the results.

In [None]:
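def download_file(url: str, output_file: Path, extract: bool = False) -> None:
    """Download `url` to `output_file`, optionally extracting it as a zip archive."""
    file_grab = requests.get(url)
    # Raise explicitly rather than assert, so the check still runs under `python -O`
    if file_grab.status_code != 200:
        raise RuntimeError(f'Problem downloading file from {url} (HTTP {file_grab.status_code})')
    with open(output_file, 'wb') as outfile:
        outfile.write(file_grab.content)
    if extract:
        # Extract into a directory named after the archive, alongside it
        with zipfile.ZipFile(output_file, 'r') as zip_ref:
            zip_ref.extractall(output_file.parent / output_file.stem)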

In [None]:
hla_results = Parallel(n_jobs=4)(
    delayed(download_file)(
        f'https://hla-ligand-atlas.org/peptides/download?&h=sw/{allele[0]}*{allele[1:3]}:{allele[3:]}',
        HLA_TMP_DATA_DIR / f'HLA_{allele}.zip',
        True,
    )
    for allele in tqdm(HLA_LIGAND_ATLAS_ALLELES)
)
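
The download URL above rebuilds the standard HLA name from the compact codes in `ALLELES` by slicing the string. As a minimal sketch, the same conversion can be written as a small helper with a format check. The name `to_hla_nomenclature` and the validation rule (one gene letter followed by four digits, which holds for every entry in `ALLELES`) are assumptions for illustration and are not part of the workflow.

```python
def to_hla_nomenclature(allele: str) -> str:
    """Convert a compact code such as 'A0101' into the form 'A*01:01'.

    Hypothetical helper: assumes one gene letter (A, B or C) followed by
    exactly four digits, as in the ALLELES list of this notebook.
    """
    if len(allele) != 5 or allele[0] not in 'ABC' or not allele[1:].isdigit():
        raise ValueError(f'Unexpected allele code: {allele!r}')
    # Same slicing as the f-string used to build the download URL
    return f'{allele[0]}*{allele[1:3]}:{allele[3:]}'


converted = [to_hla_nomenclature(code) for code in ['A0101', 'B0702', 'C1601']]
print(converted)
```

Validating before building the URL means a mistyped code fails immediately, rather than as a download error for one allele partway through the parallel run.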

In [None]:
dfs = []
for allele in tqdm(HLA_LIGAND_ATLAS_ALLELES):
    df = pd.read_csv(HLA_TMP_DATA_DIR / f'HLA_{allele}' / 'hla_ligand_atlas' / 'HLA_aggregated.tsv', sep='\t')
    df['allele'] = allele
    dfs.append(df)

In [None]:
final_hla = pd.concat(dfs)
final_hla.to_csv(DATABASE_EXPORTS_PATH / 'HLA_ligand_atlas_aggregated.csv')

# 2. IEDB

In [None]:
IEDB_TMP_DATA_DIR = TMP_DATA_DIR / 'IEDB'
IEDB_TMP_DATA_DIR.mkdir(exist_ok=True)

We export multiple subsets of IEDB to ensure we can standardise the source of each peptide. This section requires you to manually visit the [IEDB site here](https://www.iedb.org/home_v3.php) and download several files.

## 2.1 Healthy peptides

In [None]:
IEDB_HEALTHY_DATA_DIR = IEDB_TMP_DATA_DIR / 'Healthy'
IEDB_HEALTHY_DATA_DIR.mkdir(exist_ok=True)

Select the following options in IEDB and click `Search`:

![](iedb_search_guide.png)

Then scroll to the `Disease` panel at the bottom and select `None (Healthy)`, then click `Search` again:

![](iedb_healthy.png)

Click the `Export Results` button in the top right of the page, and export the data in the following format:
- File Format: `.CSV`
- Header Row Format: `Single Header`
- Export Type: `Full, all data columns`

Leave `Columns to Include` at its default value, then click `Export` and save the data as `tmp_data/IEDB/Healthy/IEDB_healthy_export.csv`.

In [None]:
temp_df = pd.read_csv(IEDB_HEALTHY_DATA_DIR / 'IEDB_healthy_export.csv')
temp_df.to_csv(DATABASE_EXPORTS_PATH / 'IEDB_healthy_data.csv', index=False)
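
Because the export files in this section are saved by hand, a missing or misnamed file only shows up as a `pd.read_csv` error. The sketch below shows one way to check for a manual export first and fail with a clearer message. The helper `require_manual_export` is hypothetical and not part of the workflow; it could equally be used for the tumour, viral and VDJDB exports.

```python
from pathlib import Path


def require_manual_export(path: Path, source: str) -> Path:
    """Return `path` if a non-empty manual export exists there, else raise.

    Hypothetical helper; `source` is only used to build the error message.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f'Expected a manual {source} export at {path}. '
            'Follow the export steps above and save the file at that location.'
        )
    if path.stat().st_size == 0:
        raise ValueError(f'The {source} export at {path} is empty.')
    return path
```

For example, `pd.read_csv(require_manual_export(IEDB_HEALTHY_DATA_DIR / 'IEDB_healthy_export.csv', 'IEDB healthy'))` would stop early with the expected path if the file has not been saved yet.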

## 2.2 Tumour origin peptides

In [None]:
IEDB_TUMOUR_DATA_DIR = IEDB_TMP_DATA_DIR / 'Tumour'
IEDB_TUMOUR_DATA_DIR.mkdir(exist_ok=True)

Select the following options in IEDB and click `Search`:

![](iedb_search_guide.png)

Then scroll to the `Disease` panel at the bottom and select `Cancer`, then click `Search` again:

![](iedb_cancer.png)

Click the `Export Results` button in the top right of the page, and export the data in the following format:
- File Format: `.CSV`
- Header Row Format: `Single Header`
- Export Type: `Full, all data columns`

Leave `Columns to Include` at its default value, then click `Export` and save the data as `tmp_data/IEDB/Tumour/IEDB_tumour_export.csv`.

In [None]:
temp_df = pd.read_csv(IEDB_TUMOUR_DATA_DIR / 'IEDB_tumour_export.csv')
temp_df.to_csv(DATABASE_EXPORTS_PATH / 'IEDB_tumour_data.csv', index=False)

## 2.3 Viral origin peptides

In [None]:
IEDB_VIRAL_DATA_DIR = IEDB_TMP_DATA_DIR / 'Viral'
IEDB_VIRAL_DATA_DIR.mkdir(exist_ok=True)

We additionally include a small amount of viral data from IEDB (which we later supplement with VDJDB).

Select the following options in IEDB and click `Search`:

![](iedb_viral.png)

Click the `Export Results` button in the top right of the page, and export the data in the following format:
- File Format: `.CSV`
- Header Row Format: `Single Header`
- Export Type: `Full, all data columns`

Leave `Columns to Include` at its default value, then click `Export` and save the data as `tmp_data/IEDB/Viral/IEDB_viral_export.csv`.

In [None]:
temp_df = pd.read_csv(IEDB_VIRAL_DATA_DIR / 'IEDB_viral_export.csv')
temp_df.to_csv(DATABASE_EXPORTS_PATH / 'IEDB_viral_data.csv', index=False)

## 2.4 Assay mapping

Finally, for IEDB we need the data that maps assay IDs back to labelled MHC I alleles. Download the following large file and extract it so that `tmp_data/IEDB/mhc_ligand_full.csv` exists:

In [None]:
!wget "https://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip" -O {IEDB_TMP_DATA_DIR}/mhc_ligand_full.zip

In [None]:
!unzip {IEDB_TMP_DATA_DIR}/mhc_ligand_full.zip -d {IEDB_TMP_DATA_DIR}

We then post-process this file:

In [None]:
df = pd.read_csv(IEDB_TMP_DATA_DIR / 'mhc_ligand_full.csv', header=[0, 1])
df.columns = [f'{i}_{j}' for i, j in df.columns]
df = df.reset_index(drop=True)

In [None]:
df['Epitope IRI'] = df['Epitope_Epitope IRI'].apply(lambda x: x.split('/')[-1])
df['mhc_allele'] = df['MHC Restriction_Name']
df[['Epitope IRI','mhc_allele']].to_csv(DATABASE_EXPORTS_PATH / 'IEDB_ligand_full.csv', index=False)
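
To make the post-processing above concrete, here is a minimal sketch on a toy two-row-header CSV. The column names mirror the ones used above, but the single data row (including the IRI value) is invented for illustration and is not real IEDB data.

```python
import io

import pandas as pd

# Toy stand-in for mhc_ligand_full.csv: two header rows, one invented data row
csv_text = (
    "Epitope,MHC Restriction\n"
    "Epitope IRI,Name\n"
    "http://www.iedb.org/epitope/123,HLA-A*02:01\n"
)

toy = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Flatten the two header levels into single 'Top_Bottom' column names
toy.columns = [f'{i}_{j}' for i, j in toy.columns]

# Keep only the trailing identifier of the IRI
toy['Epitope IRI'] = toy['Epitope_Epitope IRI'].str.split('/').str[-1]
toy['mhc_allele'] = toy['MHC Restriction_Name']

print(toy[['Epitope IRI', 'mhc_allele']])
```

Here `.str.split('/').str[-1]` does the same job as the `apply` with a lambda used above, with the difference that missing values stay missing instead of raising an `AttributeError`.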

# 3. NetMHCPan 4.1 training data

In [None]:
NETMHCPAN_TMP_DATA_DIR = TMP_DATA_DIR / 'NETMHCPAN'
NETMHCPAN_TMP_DATA_DIR.mkdir(exist_ok=True)

Download the training data archive:

In [None]:
file_grab = requests.get('https://services.healthtech.dtu.dk/suppl/immunology/NAR_NetMHCpan_NetMHCIIpan/NetMHCpan_train.tar.gz')
file_grab.raise_for_status()  # stop here rather than writing an error page to disk
with open(NETMHCPAN_TMP_DATA_DIR / 'NetMHCpan_train.tar.gz', 'wb') as archive:
    archive.write(file_grab.content)

Extract the archive (make sure `tar` is installed):

In [None]:
!tar -xzvf {NETMHCPAN_TMP_DATA_DIR}/NetMHCpan_train.tar.gz -C {NETMHCPAN_TMP_DATA_DIR}

Then concatenate all the monoallelic data:

In [None]:
training_data_files = [f for f in (NETMHCPAN_TMP_DATA_DIR / 'NetMHCpan_train').iterdir() if '00' in f.stem]
netmhcpan_all = pd.concat([pd.read_csv(path, names=['peptide','Target','AlleleTemp'], header=None, sep=' ') for path in training_data_files])

In [None]:
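# A regex separator needs the python parser; passing engine='python' avoids a ParserWarning
allele_keys = pd.read_csv(NETMHCPAN_TMP_DATA_DIR / 'NetMHCpan_train' / 'allelelist', sep=r'\t| ', engine='python', header=None, names=['Experiment','Allele List'])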

In [None]:
netmhcpan_all['allele'] = netmhcpan_all['AlleleTemp'].map(dict(zip(allele_keys['Experiment'], allele_keys['Allele List'])))
netmhcpan_all['mono_allelic'] = (netmhcpan_all['allele'] == netmhcpan_all['AlleleTemp'])
netmhcpan_monoallelic = netmhcpan_all.loc[netmhcpan_all['mono_allelic']]
netmhcpan_monoallelic.to_csv(DATABASE_EXPORTS_PATH / 'NetMHCPan_monoallelic_training_data.csv', index=False)
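
The filter above keeps a row only when the experiment ID maps to an allele list that is identical to the ID itself, that is, when the experiment is a single allele. A minimal sketch of that logic on toy data follows. The peptides, the `CellLine1` experiment and its allele list are invented for illustration and do not come from the NetMHCpan archive.

```python
import pandas as pd

# Toy training rows: one from a single-allele experiment, one from a cell line
toy = pd.DataFrame({
    'peptide': ['AAAAAAAAA', 'CCCCCCCCC'],
    'Target': [1, 1],
    'AlleleTemp': ['HLA-A02:01', 'CellLine1'],
})

# Toy allelelist: experiment ID -> comma-separated alleles of that experiment
toy_keys = pd.DataFrame({
    'Experiment': ['HLA-A02:01', 'CellLine1'],
    'Allele List': ['HLA-A02:01', 'HLA-A02:01,HLA-B07:02'],
})

toy['allele'] = toy['AlleleTemp'].map(dict(zip(toy_keys['Experiment'], toy_keys['Allele List'])))

# Mono-allelic: the allele list is exactly the experiment ID
toy['mono_allelic'] = toy['allele'] == toy['AlleleTemp']
toy_monoallelic = toy.loc[toy['mono_allelic']]

print(toy_monoallelic)
```

Only the `HLA-A02:01` row survives. The `CellLine1` row maps to two alleles, so its allele list differs from its experiment ID and it is dropped.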

# 4. VDJDB viral data

In [None]:
VDJDB_TMP_DATA_DIR = TMP_DATA_DIR / 'VDJDB'
VDJDB_TMP_DATA_DIR.mkdir(exist_ok=True)

Unfortunately, there is no simple way to export all VDJDB viral data, so here we use the website and manually add every available viral species as a search filter.

First, visit the VDJDB database browser page: https://vdjdb.cdr3.net/search

Under the `MHC` heading, untick `MHCII`, then under the `Antigen` heading, add all the following viral origin options to the `Source species` filter (unfortunately this has to be done one by one):

![](vdjdb_selection.png)

Finally, click `Export as:`, then `TSV`, and save the file as `tmp_data/VDJDB/VDJDB_viral.tsv`.

In [None]:
temp_df = pd.read_csv(VDJDB_TMP_DATA_DIR / 'VDJDB_viral.tsv', sep='\t')
temp_df.to_csv(DATABASE_EXPORTS_PATH / 'VDJDB_viral.csv', index=False)

# 5. UniProt human proteome

Lastly, we need a local copy of the human proteome, which we obtain from UniProt:

In [None]:
def fetch_human_proteome_fasta(fasta_outfile: Path) -> None:
    """Fetch a fasta file of the entire human reference proteome.

    Args:
        fasta_outfile (Path): Path to save the fasta file.
    """
    # Check first, so an existing file does not trigger another large download
    if Path(fasta_outfile).exists():
        return

    canonical_proteome = requests.get('https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=%28reviewed%3Atrue%20AND%20proteome%3Aup000005640%29')
    canonical_proteome.raise_for_status()

    with open(fasta_outfile, 'w') as fasta:
        fasta.write(canonical_proteome.text)

In [None]:
fetch_human_proteome_fasta(DATABASE_EXPORTS_PATH / 'UniProtCanonicalProteome.fasta')
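
As an optional sanity check on the download, the number of sequences in the FASTA file can be counted from its header lines, since each FASTA record starts with a line beginning with `>`. The helper `count_fasta_records` below is a hypothetical sketch and not part of the workflow.

```python
from pathlib import Path


def count_fasta_records(fasta_path: Path) -> int:
    """Count records in a FASTA file by counting header lines (those starting with '>')."""
    with open(fasta_path) as fasta:
        return sum(1 for line in fasta if line.startswith('>'))
```

Calling `count_fasta_records(DATABASE_EXPORTS_PATH / 'UniProtCanonicalProteome.fasta')` returning zero would suggest the request returned something other than FASTA text.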