## `Human MHC set` creation

Create an out-of-distribution (OOD) dataset from human peptides sampled from the in-distribution (ID) Human TCR dataset.
Add randomly sampled human MHC sequences to create (peptide, MHC) pairs.
The pairs constitute the `Human MHC set`, which is an OOD set.

In [1]:
import pandas as pd
import os
login = os.getlogin()

#DATA_BASE = f"/home/{login}/Git/tcr/data/"
DATA_BASE = os.path.join('..', '..', 'data')

df_in = pd.concat([
    pd.read_csv(os.path.join(DATA_BASE,'alpha-beta-splits','alpha-beta.csv')),
    pd.read_csv(os.path.join(DATA_BASE,'alpha-beta-splits','beta.csv'))
    ])

df_mhc_seq = pd.read_csv(os.path.join(DATA_BASE,'mhc','pseudosequence.2016.all.X.dat'), sep='\t')

In [2]:
mhc = df_mhc_seq.sequence.unique()
peptides = df_in.peptide.unique()

In [3]:
from itertools import combinations, product

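def pairs(*lists):
    """For every 2-combination of the input lists, collect their Cartesian product."""
    result = []  # named `result` so it does not shadow the function name
    for list_a, list_b in combinations(lists, 2):
        result.extend(product(list_a, list_b))
    return result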

pep_mhc_pairs = pairs(peptides, mhc)
print(f"{len(pep_mhc_pairs)=}")

len(pep_mhc_pairs)=561808
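
As a sanity check on the count above, pairing every unique peptide with every unique MHC sequence should yield exactly `len(peptides) * len(mhc)` pairs. The sketch below checks this on toy data (the sequences are shortened and made up for illustration) and shows an equivalent pandas-only construction with `DataFrame.merge(how="cross")`, which assumes pandas 1.2 or newer.

```python
from itertools import product

import pandas as pd

# Toy stand-ins for the notebook's `peptides` and `mhc` arrays (hypothetical values).
toy_peptides = ["SSLENFRAYV", "GILGFVFTL", "NLVPMVATV"]
toy_mhc = ["QEFFIASGAA", "YFAMYGEKVA"]

# Same result as pairs(peptides, mhc) when exactly two lists are passed.
toy_pairs = list(product(toy_peptides, toy_mhc))
assert len(toy_pairs) == len(toy_peptides) * len(toy_mhc)

# Equivalent cross join done directly in pandas (requires pandas >= 1.2).
toy_df = pd.DataFrame({"peptide": toy_peptides}).merge(
    pd.DataFrame({"mhc": toy_mhc}), how="cross"
)
toy_df["sign"] = 1
```

The cross join avoids materialising a Python list of tuples first, which may matter at the scale of several hundred thousand pairs.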


In [4]:
df_out = pd.DataFrame({
    'peptide': [p[0] for p in pep_mhc_pairs],
    'mhc': [p[1] for p in pep_mhc_pairs],
    'sign': [1] * len(pep_mhc_pairs)
})

# Drop pairs in which either sequence contains the residue symbol "X".
df_out = df_out[~df_out.mhc.str.contains("X")]
df_out = df_out[~df_out.peptide.str.contains("X")]

In [5]:
df_out.head()

,peptide,mhc,sign
0,SSLENFRAYV,QEFFIASGAAVDAIMWLFLECYDLQRATYHVGFT,1
1,SSLENFRAYV,QEFFIASGAAVDAIMWLFLECYDLQRATYHAVFT,1
2,SSLENFRAYV,QEFFIASGAAVDAIMWLFLECYDIDEATYHVGFT,1
3,SSLENFRAYV,QEFFIASGAAVDAIMWLFLECYDLQRANYHVVFT,1
4,SSLENFRAYV,QEFFIASGAAVDAIMWLFLECYDLQAATYHVVFT,1


In [6]:
len(df_out)

463684
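
The drop from 561808 to 463684 rows comes from removing pairs whose peptide or MHC sequence contains `X`, a symbol that conventionally denotes an unknown amino acid. One possible variation, sketched below on made-up sequences, filters the unique sequences before pairing so that the discarded pairs are never built. On the toy data it gives the same rows as filtering afterwards.

```python
from itertools import product

import pandas as pd

# Hypothetical toy sequences; one peptide and one MHC contain "X".
toy_peptides = ["SSLENFRAYV", "GILXFVFTL", "NLVPMVATV"]
toy_mhc = ["QEFFIASGAA", "YFAMXGEKVA", "YYAMYGEKVA"]

# Approach used in the notebook: build all pairs, then drop rows containing "X".
all_pairs = pd.DataFrame(
    list(product(toy_peptides, toy_mhc)), columns=["peptide", "mhc"]
)
filtered_after = all_pairs[
    ~all_pairs.mhc.str.contains("X") & ~all_pairs.peptide.str.contains("X")
].reset_index(drop=True)

# Variation: drop sequences containing "X" first, then pair the remainder.
clean_peptides = [p for p in toy_peptides if "X" not in p]
clean_mhc = [m for m in toy_mhc if "X" not in m]
filtered_before = pd.DataFrame(
    list(product(clean_peptides, clean_mhc)), columns=["peptide", "mhc"]
)
```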

In [8]:
df_out.to_csv(os.path.join(DATA_BASE,'mhc','peptide-mhc.csv'), index=False)
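
Writing with `index=False` keeps the row index out of the file, so reading the CSV back does not produce an extra `Unnamed: 0` column. A minimal round-trip sketch, using an in-memory buffer and a made-up one-row frame instead of the real output path:

```python
import io

import pandas as pd

# Hypothetical one-row frame with the same columns as df_out.
toy_df = pd.DataFrame(
    {"peptide": ["SSLENFRAYV"], "mhc": ["QEFFIASGAA"], "sign": [1]}
)

# Write without the index, then read the same text back.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```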