*(This is from tcellmatch authors)*

This tutorial shows how to preprocess raw data files from IEDB and feed them into an attention-based feed-forward network for binary classification. We want to predict whether a TCR CDR3 sequence binds a given antigen, so we extract (CDR3, antigen) pairs as the training set; the target is a boolean vector that indicates whether each pair binds. An introduction to IEDB can be found here: https://www.iedb.org.

Download settings: linear epitope, MHC restriction HLA-A*02:01, organism restricted to human only.

In [24]:
import pandas as pd
import tensorflow as tf
import tcellmatch.api as tm
import os

# Raw data

Here, VDJdb.tsv is a downloaded file from the VDJdb website.

In [3]:
# Path of input directory; it must end with a path separator because file names are appended by string concatenation.
indir = YOUR_PATH
# Path to IEDB raw files.
fn_iedb = indir+"tcell_receptor_table_export_1558607498.csv"
# Path to vdjdb files.
fns_vdjdb = [indir + x for x in ["VDJdb.tsv"]]

## Heads of IEDB raw files:

In [5]:
iedb_raw = pd.read_csv(fn_iedb).fillna(value="None")
iedb_raw.head()

Unnamed: 0,Group Receptor ID,Receptor ID,Reference ID,Epitope ID,Description,Antigen,Organism,Response Type,Assay IDs,Reference Name,...,Chain 2 CDR1 Start Curated,Chain 2 CDR1 End Curated,Chain 2 CDR1 Start Calculated,Chain 2 CDR1 End Calculated,Chain 2 CDR2 Curated,Chain 2 CDR2 Calculated,Chain 2 CDR2 Start Curated,Chain 2 CDR2 End Curated,Chain 2 CDR2 Start Calculated,Chain 2 CDR2 End Calculated
0,8494,59,1013533,37257,LLFGYPVYV,transcriptional activator Tax,Human T-cell leukemia virus type I,T cell,"1775493, 1775496, 1779714, 1779715, 1779718",B7,...,,,27.0,31.0,,SVGAGI,,,49.0,54.0
1,8494,59,1017753,186691,LLFGFPVYV,,,T cell,"1975823, 1975824",B7,...,,,27.0,31.0,,SVGAGI,,,49.0,54.0
2,8494,59,1017753,37257,LLFGYPVYV,transcriptional activator Tax,Human T-cell leukemia virus type I,T cell,1975819,B7,...,,,27.0,31.0,,SVGAGI,,,49.0,54.0
3,18305,60,1016521,44920,NLVPMVATV,pp65,Human betaherpesvirus 5,T cell,"1678482, 1678554",RA14,...,,,,,,,,,,
4,8500,66,1032053,6435,CINGVCWTV,polyprotein,Hepacivirus C,T cell,"3468725, 3468748, 3468770, 3468771",NS3-1073,...,,,,,,,,,,


## Heads of VDJDB raw files:

In [9]:
vdjdb_raw = pd.read_table(fns_vdjdb[0]).fillna(value="None")
vdjdb_raw.head()

Unnamed: 0,complex.id,Gene,CDR3,V,J,Species,MHC A,MHC B,MHC class,Epitope,Epitope gene,Epitope species,Reference,Method,Meta,CDR3fix,Score
0,0,TRB,CASSSGQLTNTEAFF,TRBV9*01,TRBJ1-1*01,HomoSapiens,HLA-A*02:01,B2M,MHCI,GLCTLVAML,BMLF1,EBV,PMID:12504586,"{""frequency"": ""17/52"", ""identification"": ""anti...","{""cell.subset"": ""CD8"", ""clone.id"": """", ""donor....","{""cdr3"": ""CASSSGQLTNTEAFF"", ""cdr3_old"": ""CASSS...",1
1,0,TRB,CASSASARPEQFF,TRBV9*01,TRBJ2-1*01,HomoSapiens,HLA-A*02:01,B2M,MHCI,GLCTLVAML,BMLF1,EBV,PMID:12504586,"{""frequency"": ""4/52"", ""identification"": ""antig...","{""cell.subset"": ""CD8"", ""clone.id"": """", ""donor....","{""cdr3"": ""CASSASARPEQFF"", ""cdr3_old"": ""CASSASA...",0
2,0,TRB,CASSSGLLTADEQFF,TRBV9*01,TRBJ2-1*01,HomoSapiens,HLA-A*02:01,B2M,MHCI,GLCTLVAML,BMLF1,EBV,PMID:12504586,"{""frequency"": ""3/58"", ""identification"": ""antig...","{""cell.subset"": ""CD8"", ""clone.id"": """", ""donor....","{""cdr3"": ""CASSSGLLTADEQFF"", ""cdr3_old"": ""CASSS...",0
3,0,TRB,CASSSGQVSNTGELFF,TRBV9*01,TRBJ2-2*01,HomoSapiens,HLA-A*02:01,B2M,MHCI,GLCTLVAML,BMLF1,EBV,PMID:12504586,"{""frequency"": ""9/58"", ""identification"": ""antig...","{""cell.subset"": ""CD8"", ""clone.id"": """", ""donor....","{""cdr3"": ""CASSSGQVSNTGELFF"", ""cdr3_old"": ""CASS...",0
4,0,TRB,CSARDRTGNGYTF,TRBV20-1*01,TRBJ1-2*01,HomoSapiens,HLA-A*02:01,B2M,MHCI,GLCTLVAML,BMLF1,EBV,PMID:12504586,"{""frequency"": ""4/52"", ""identification"": ""antig...","{""cell.subset"": ""CD8"", ""clone.id"": """", ""donor....","{""cdr3"": ""CSARDRTGNGYTF"", ""cdr3_old"": ""CSARDRT...",2


## List of target antigens.
We load only observations that match these target antigen sequences.

In [10]:
iedb_categ_ids = [
    "GILGFVFTL",
    "NLVPMVATV",
    "GLCTLVAML",
    "LLWNGPMAV",
    "VLFGLGFAI"
]
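
To illustrate what matching against the target antigens means, here is a minimal pandas sketch. It assumes that the epitope sequence sits in the `Description` column, as in the IEDB head shown above; the toy table and the variable names `toy`, `target_antigens` and `matched` are hypothetical and are not part of tcellmatch.

```python
import pandas as pd

# Toy table reusing three epitope sequences from the IEDB head shown above.
toy = pd.DataFrame({
    "Description": ["LLFGYPVYV", "NLVPMVATV", "CINGVCWTV"],
    "Reference Name": ["B7", "RA14", "NS3-1073"],
})

target_antigens = ["GILGFVFTL", "NLVPMVATV", "GLCTLVAML", "LLWNGPMAV", "VLFGLGFAI"]

# Keep only the rows whose epitope sequence is one of the target antigens.
matched = toy[toy["Description"].isin(target_antigens)].reset_index(drop=True)
print(matched)
```

Only the `NLVPMVATV` row survives this filter. How `read_iedb` performs the matching internally is not shown in this tutorial.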

# Read data

## Create model object. 
The EstimatorBinary object bundles the reading, training and testing modules.

In [11]:
ffn = tm.models.EstimatorBinary()

## Read IEDB raw files, taking out TCR CDR3 and antigen pairs as training data

We one-hot encode the TCR CDR3 amino acid sequences (TRA and/or TRB) and the antigens. The encoded sequences have shape [num_samples, tra/trb, max_sequence_length, aa_onehot_dim]. For example, if we extract 4000 TRB sequences on their own, the maximal sequence length is 30 and the one-hot dimension is 26, the output has shape [4000, 1, 30, 26].

In [12]:
ffn.read_iedb(
    fns=fn_iedb,
    fn_blosum=None,
    antigen_ids=iedb_categ_ids,
    blosum_encoding=False,
    is_train=True,
    chains="trb"
)


Found 87 CDR3 observations with unkown amino acids out of 13478.
Found 0 antigen observations with unkown amino acids out of 13478.
Found 87 CDR3+antigen observations with unkown amino acids out of 13478, leaving 13391 observations.
Assembled 13391 single-chain observations into 13391 multiple chain observations.
Found 12778 observations that match target antigen sequences out of 13391.
Found 12778 observations and assigned to train data.
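
The one-hot layout described above can be sketched with NumPy as follows. This is only an illustration: the 26-symbol alphabet (the uppercase letters A to Z) is an assumption chosen so that the last dimension matches the shapes in this tutorial, and `one_hot_encode` is a hypothetical helper, not the tcellmatch implementation.

```python
import numpy as np

# Hypothetical 26-symbol alphabet (assumption): the uppercase letters A-Z.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_encode(seqs, max_len):
    """Encode sequences into a zero-padded [num_samples, 1, max_len, 26] array."""
    x = np.zeros((len(seqs), 1, max_len, len(ALPHABET)), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for pos, aa in enumerate(seq[:max_len]):
            x[i, 0, pos, AA_INDEX[aa]] = 1.0
    return x

# Two TRB CDR3 sequences taken from the VDJdb head shown above.
cdr3s = ["CASSSGQLTNTEAFF", "CASSASARPEQFF"]
x = one_hot_encode(cdr3s, max_len=30)
print(x.shape)  # (2, 1, 30, 26)
```

Each residue contributes exactly one non-zero entry, and positions beyond the end of a sequence stay all-zero.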


# Process data

## Create training datasets
The input consists of TCR CDR3 sequences and antigens, which we concatenate along the third dimension (the sequence axis). This is equivalent to concatenating the CDR3 and antigen amino acid sequences before one-hot encoding. We don't need covariates here, so all values in covariates_train are zero. The target is a boolean vector that indicates whether each (CDR3, antigen) pair binds.

In [14]:
print("Shape of (CDR3,antigen) sequences: ",ffn.x_train.shape)
# print("The head of TCR sequences: ",ffn.x_train[0])
print("Shape of covariates: ",ffn.covariates_train.shape)
# print("The head of covariates: ",ffn.covariates_train[0:5])
print("Shape of target set: ",ffn.y_train.shape)
# print("The head of target set: ",ffn.y_train[0:5])

Shape of (CDR3,antigen) sequences:  (12778, 1, 47, 26)
Shape of covariates:  (12778, 1)
Shape of target set:  (12778, 1)
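
The equivalence between concatenating along the sequence axis and concatenating the raw strings can be checked on a single unpadded pair. The sketch below again assumes a hypothetical 26-letter alphabet, and `one_hot` is an illustrative helper rather than a tcellmatch function.

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # hypothetical alphabet (assumption)

def one_hot(seq):
    """One-hot encode a single sequence into a [1, 1, len(seq), 26] array."""
    x = np.zeros((1, 1, len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        x[0, 0, pos, ALPHABET.index(aa)] = 1.0
    return x

# A CDR3 and its epitope, both taken from the VDJdb head shown above.
cdr3, antigen = "CASSSGQLTNTEAFF", "GLCTLVAML"

# Encode separately, then concatenate along the sequence axis (axis=2).
joined_after = np.concatenate([one_hot(cdr3), one_hot(antigen)], axis=2)
# Concatenate the strings first, then encode.
joined_before = one_hot(cdr3 + antigen)

print(joined_after.shape)                           # (1, 1, 24, 26)
print(np.array_equal(joined_after, joined_before))  # True
```

When the two blocks are zero-padded separately, the equivalence holds for the padded sequences, that is, with the CDR3 padding sitting between the CDR3 and the antigen.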


## Assign clonotype by Manhattan distance

In [15]:
ffn.assign_clonotype(flavor="manhatten")

Found 10291 clonotypes for 12778 observations.
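
As a reminder of what a Manhattan distance between encoded sequences measures, here is a small sketch. It only computes the distance; how `assign_clonotype` turns distances into clonotype groups is not described in this tutorial, so no grouping rule is shown. The alphabet and the helper names are assumptions.

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # hypothetical 26-symbol alphabet (assumption)

def one_hot(seq, max_len):
    """One-hot encode one sequence into a zero-padded [max_len, 26] array."""
    x = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(seq[:max_len]):
        x[pos, ALPHABET.index(aa)] = 1.0
    return x

def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute element-wise differences."""
    return float(np.abs(a - b).sum())

a = one_hot("CASSSGQLTNTEAFF", max_len=20)  # CDR3 from the VDJdb head above
b = one_hot("CASSSGQVTNTEAFF", max_len=20)  # made-up variant with one substitution (L -> V)

print(manhattan(a, a))  # 0.0
print(manhattan(a, b))  # 2.0
```

On one-hot data, every substituted position adds 2 to the distance, because one entry switches off and another switches on.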


## Downsample clonotypes to data stored in x_train
This keeps the training, evaluation and test sets from being too biased towards a subset of TCRs.

In [16]:
#max_obs: Maximum number of observations per clonotype.
ffn.downsample_clonotype(max_obs=10)

Downsampled 10291 clonotypes from 12778 cells to 11878 cells.
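
Capping the number of observations per clonotype can be sketched as below. This is a generic illustration under the assumption that surplus observations are dropped at random; `downsample_per_group` and its `seed` argument are hypothetical and not the tcellmatch API.

```python
import numpy as np

def downsample_per_group(group_ids, max_obs, seed=0):
    """Return sorted indices that keep at most max_obs observations per group."""
    rng = np.random.default_rng(seed)
    group_ids = np.asarray(group_ids)
    keep = []
    for g in np.unique(group_ids):
        idx = np.flatnonzero(group_ids == g)
        if len(idx) > max_obs:
            # Randomly keep max_obs members of an over-represented group.
            idx = rng.choice(idx, size=max_obs, replace=False)
        keep.extend(idx.tolist())
    return np.sort(np.array(keep, dtype=int))

# Toy clonotype labels: clonotype 0 has 5 cells, clonotype 1 has 2, clonotype 2 has 1.
clonotypes = [0] * 5 + [1] * 2 + [2] * 1
kept = downsample_per_group(clonotypes, max_obs=3)
print(len(kept))  # 6
```

Only clonotype 0 is trimmed here (from 5 to 3 cells); the smaller clonotypes are kept in full.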


In [19]:
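# Note: this cell was run after negative sampling (In [18]), so the count already includes the negative pairs.
print("Shape of (CDR3,antigen) sequences: ",ffn.x_train.shape)

Shape of (CDR3,antigen) sequences:  (23756, 1, 47, 26)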


## Create negative (CDR3, antigen) pairs
Since all pairs from the IEDB dataset are positive, we need to sample negative pairs to obtain a 50%/50% positive/negative ratio.

In [18]:
ffn.sample_negative_data(is_train=True)

Generated 11878 negative samples in train data, yielding 23756 total observations.
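
One common way to create negative pairs is to re-pair each CDR3 with an antigen it was not observed with. The sketch below shows that idea under stated assumptions: the sampling strategy used by `sample_negative_data` is not described here, and `sample_negative_pairs` is a hypothetical helper.

```python
import random

def sample_negative_pairs(positive_pairs, seed=0):
    """For each positive (cdr3, antigen) pair, draw one antigen not observed with that CDR3."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    antigens = sorted({ag for _, ag in positive_pairs})
    negatives = []
    for cdr3, _ in positive_pairs:
        candidates = [ag for ag in antigens if (cdr3, ag) not in positives]
        if candidates:
            negatives.append((cdr3, rng.choice(candidates)))
    return negatives

positives = [
    ("CASSSGQLTNTEAFF", "GLCTLVAML"),  # pair from the VDJdb head above
    ("CASSASARPEQFF", "GLCTLVAML"),    # pair from the VDJdb head above
    ("CASSXXXXXXF", "NLVPMVATV"),      # placeholder CDR3, not a real observation
]
negatives = sample_negative_pairs(positives)

# One negative per positive gives the 50%/50% balance; labels are 1 (binds) and 0 (does not).
labels = [1] * len(positives) + [0] * len(negatives)
print(len(positives), len(negatives))
```

Such negatives are only assumed not to bind, since an unobserved pair is not a measured non-binder.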


In [20]:
print("Shape of (CDR3,antigen) sequences: ",ffn.x_train.shape)

Shape of (CDR3,antigen) sequences:  (23756, 1, 47, 26)


In [26]:
# Zero-pad the TCR and antigen sequences so that the training and test sets have the same size.
ffn.pad_sequence(target_len=40, sequence="tcr")
ffn.pad_sequence(target_len=25, sequence="antigen")
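
Zero-padding along the sequence axis can be sketched with `np.pad`. The block lengths before padding (27 for the TCR and 20 for the antigen) are assumptions picked so that they sum to the 47 positions reported above; the tutorial does not state the actual split. `pad_to_length` is a hypothetical helper.

```python
import numpy as np

def pad_to_length(x, target_len):
    """Zero-pad axis 2 of a [n, chains, length, aa_dim] array up to target_len."""
    extra = target_len - x.shape[2]
    if extra < 0:
        raise ValueError("target_len is shorter than the current sequence length")
    return np.pad(x, ((0, 0), (0, 0), (0, extra), (0, 0)), mode="constant")

# Dummy encoded blocks; the lengths 27 and 20 are assumed for illustration.
tcr = np.ones((4, 1, 27, 26), dtype=np.float32)
antigen = np.ones((4, 1, 20, 26), dtype=np.float32)

x = np.concatenate([pad_to_length(tcr, 40), pad_to_length(antigen, 25)], axis=2)
print(x.shape)  # (4, 1, 65, 26)
```

With target lengths of 40 and 25, the combined sequence axis has 65 positions, which matches the shapes printed later in this tutorial.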

## Create test dataset
We can either split off part of the training set or use a separate database as the test set. Here we use the VDJdb dataset as the test set.

In [35]:
#Clear test set.
ffn.clear_test_data()
ffn.read_vdjdb(
    fns=fns_vdjdb,
    fn_blosum=None,
    blosum_encoding=False,
    is_train=False,
    chains="trb"
)
ffn.remove_overlapping_antigens(data="test")
#Assign clonotype by Manhattan distance.
ffn.assign_clonotype(flavor="manhatten", data="test")
#Downsample clonotypes to data stored in x_test.
ffn.downsample_clonotype(max_obs=10, data="test")
# Sample negative binding pairs for the test set.
ffn.sample_negative_data(is_train=False)

Found 0 CDR3 observations with unkown amino acids out of 3964.
Found 0 antigen observations with unkown amino acids out of 3964.
Found 0 CDR3+antigen observations with unkown amino acids out of 3964, leaving 3964 observations.
Assembled 3964 single-chain observations into 1422 multiple chain observations.
Found 1422 observations and assigned to test data.
Reduced 1422 cells to 142 cells in test data because of antigen overlap.
Found 119 clonotypes for 142 observations.
Downsampled 119 clonotypes from 142 cells to 142 cells.
Generated 142 negative samples in test data, yielding 284 total observations.


In [39]:
# Note: this cell was run after the downsampling step below (In [37]), hence 200 observations.
print("Shape of (CDR3,antigen) sequences for training: ",ffn.x_train.shape)
print("Shape of target set for training: ",ffn.y_train.shape)
print("Shape of (CDR3,antigen) sequences for test: ",ffn.x_test.shape)
print("Shape of target set for test: ",ffn.y_test.shape)

Shape of (CDR3,antigen) sequences for training:  (200, 1, 65, 26)
Shape of target set for training:  (200, 1)
Shape of (CDR3,antigen) sequences for test:  (200, 1, 65, 26)
Shape of target set for test:  (200, 1)


## Downsample data to given number of observations
To save time, we sample a small dataset for training. Never do this in practice.

In [37]:
ffn.downsample_data(n=200, data="train")
ffn.downsample_data(n=200, data="test")

Downsampled train data from 23756 cells to 200 cells.
Downsampled test data from 284 cells to 200 cells.


In [38]:
print("Shape of TCR CDR3 sequences for training: ",ffn.x_train.shape)
print("Shape of TCR CDR3 sequences for test: ",ffn.x_test.shape)

Shape of TCR CDR3 sequences for training:  (200, 1, 65, 26)
Shape of TCR CDR3 sequences for test:  (200, 1, 65, 26)


# Build an attention-based feed-forward model

In [40]:
ffn.build_self_attention(
    residual_connection=True,
    aa_embedding_dim=0,
    attention_size=[5, 5],
    attention_heads=[4, 4],
    optimizer='adam',
    lr=0.001,
    loss='bce',
    label_smoothing=0
)
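
For readers unfamiliar with self-attention, the following NumPy sketch shows a single scaled dot-product attention head applied to one encoded sequence. It is a generic textbook formulation, not the tcellmatch layer: the weight matrices are random, and mapping `attention_size=5` to the projection width is an assumption.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a [seq_len, dim] input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # [seq_len, seq_len]
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, aa_dim, attention_size = 65, 26, 5  # attention_size mapping is assumed
x = rng.normal(size=(seq_len, aa_dim))       # stand-in for one encoded (CDR3, antigen) pair
w_q, w_k, w_v = (rng.normal(size=(aa_dim, attention_size)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (65, 5) (65, 65)
```

Each output position is a weighted average over all positions of the sequence, which lets CDR3 positions attend to antigen positions and vice versa.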


# Train model
Train this model for 1 epoch.

In [41]:
ffn.train(
    epochs=1,
    steps_per_epoch=1,
    batch_size=8
)

Number of observations in evaluation data: 21
Number of observations in training data: 179
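
The model above was built with `loss='bce'` and `label_smoothing=0`. As a rough illustration of what training minimizes, here is binary cross-entropy with optional label smoothing, written in the convention used by Keras (targets are pulled towards 0.5). Whether tcellmatch follows exactly this convention is an assumption, and `binary_cross_entropy` is a hypothetical helper.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, label_smoothing=0.0, eps=1e-7):
    """Mean binary cross-entropy; smoothing moves the targets towards 0.5."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    y_smooth = y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing
    losses = -(y_smooth * np.log(y_prob) + (1.0 - y_smooth) * np.log(1.0 - y_prob))
    return float(losses.mean())

y_true = [1, 0, 1, 0]          # toy binding labels
y_prob = [0.9, 0.2, 0.6, 0.4]  # toy predicted binding probabilities

print(binary_cross_entropy(y_true, y_prob))
print(binary_cross_entropy(y_true, y_prob, label_smoothing=0.1))
```

With `label_smoothing=0` this reduces to the standard cross-entropy between the boolean target and the predicted binding probability.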


# Use model

## Evaluate on test set.

In [42]:
ffn.evaluate()



## Save the model.

In [43]:
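# Create the output directory; exist_ok avoids an error when re-running the cell.
os.makedirs('temp_iedb', exist_ok=True)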
fn_tmp = 'temp_iedb/temp'
ffn.save_model(fn_tmp)

## Print model summary.

In [44]:
print(ffn.model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
aa_embedding (LayerAaEmbeddi multiple                  702       
_________________________________________________________________
layer_self_attention (LayerS multiple                  3120      
_________________________________________________________________
layer_self_attention_1 (Laye multiple                  3120      
_________________________________________________________________
dense (Dense)                multiple                  1692      
Total params: 8,634
Trainable params: 8,634
Non-trainable params: 0
_________________________________________________________________
None


## Reproduce evaluation in a new instance of model that receives same weights.

In [45]:
ffn2 = tm.models.EstimatorFfn()
ffn2.load_model(fn_tmp)
ffn2.evaluate()
ffn2.predict()

Number of observations in evaluation data: 20
Number of observations in training data: 180
