# TDC 103: Datasets Part 2 - Biologics

[Kexin](https://twitter.com/KexinHuang5)

In this tutorial, we will continue the dataset exploration and now walk through various biologics datasets provided in TDC!

We assume you have familiarized yourself with installation, data loaders, and data functions. If not, please visit [TDC 101 Data Loaders](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_101_Data_Loader.ipynb) and [TDC 102 Data Functions](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_102_Data_Functions.ipynb) first!

TDC has 10 biologics datasets in the first release. We will go through them in order of discovery and development stage.

### Target Discovery

For target discovery, the gene-disease association prediction task that we discussed in the last tutorial also applies here. In addition, we include a miRNA-Target Interaction (MTI) prediction dataset. miRNAs (microRNAs) are small non-coding RNAs that control the expression of target genes at the post-transcriptional level. Recent research has shown that they play important roles in disease pathology and cellular processes. miRNA therapeutics involve inserting miRNA into cells, while in other cases a miRNA is treated as a biomarker for other drug products to target. Predicting miRNA-target interactions is therefore important because it identifies which miRNAs can regulate a target of interest. In TDC, we include the miRTarBase database, whose interactions were collected by manually surveying the literature and later validated experimentally. TDC also provides the mature miRNA sequence and the target amino acid sequence as features for the miRNA and the target gene. You can obtain it via:


In [1]:
from tdc.multi_pred import MTI
data = MTI(name = 'miRTarBase')
data.get_data().head(2)

Downloading...
100%|██████████| 338M/338M [00:17<00:00, 19.8MiB/s] 
Loading...
Done!


Unnamed: 0,miRNA_ID,miRNA,Target_ID,Target,Y
0,ath-miR398c-3p,UGUGUUCUCAGGUCACCCCUG,817365,MAATNTILAFSSPSRLLIPPSSNPSTLRSSFRGVSLNNNNLHRLQS...,1
1,ath-miR398b-3p,UGUGUUCUCAGGUCACCCCUG,817365,MAATNTILAFSSPSRLLIPPSSNPSTLRSSFRGVSLNNNNLHRLQS...,1
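
Since the `miRNA` column holds a plain nucleotide string, a common first step before modeling is to turn it into a numeric representation. The following is a minimal sketch of one-hot encoding, using the sequence from the first row above; `one_hot_rna` is a hypothetical helper written for illustration and is not part of TDC.

```python
# Minimal sketch: one-hot encode an RNA sequence over the alphabet A, C, G, U.
# `one_hot_rna` is a hypothetical helper, not a TDC function.
RNA_ALPHABET = "ACGU"


def one_hot_rna(seq):
    """Return one 4-element row per nucleotide; unknown symbols stay all-zero."""
    index = {nt: i for i, nt in enumerate(RNA_ALPHABET)}
    rows = []
    for nt in seq.upper():
        row = [0] * len(RNA_ALPHABET)
        if nt in index:
            row[index[nt]] = 1
        rows.append(row)
    return rows


mirna = "UGUGUUCUCAGGUCACCCCUG"  # miRNA column of the first row shown above
encoded = one_hot_rna(mirna)
print(len(encoded), encoded[0])
```

The same idea carries over to the `Target` column with a 20-letter amino acid alphabet, although learned embeddings are often preferred for long protein sequences.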


### Activity Screening

We include four therapeutics tasks to predict the activity of biologics. Immunotherapy is an important paradigm of therapeutics. It has gained a lot of interest in recent years because of its promise in treating various cancers with fewer side effects than small-molecule compounds. One big part of immunotherapy is monoclonal antibody therapy. An antibody binds to an antigen, and once bound, the pair serves as a marker for the human immune system to attack the marked cells or proteins.

In TDC, we include three tasks for predicting antibody-antigen binding. The first is paratope prediction, where we want to predict the binding region of the antibody. TDC includes a processed dataset from Parapred, which curates a dataset from SAbDab. The dataset contains both heavy and light chain sequences of the antibodies; each data point has a chain sequence and the position indices in that chain that correspond to the binding region.

In [1]:
from tdc.single_pred import Paratope
data = Paratope(name = 'SAbDab_Liberis')
data.get_data().head(2)

Downloading...
100%|██████████| 150k/150k [00:00<00:00, 855kiB/s] 
Loading...
Done!


Unnamed: 0,Antibody_ID,Antibody,Y
0,2hh0_H,LEQSGAELVKPGASVKLSCTASGFNIEDSYIHWVKQRPEQGLEWIG...,"[49, 80, 81, 82, 101]"
1,1u8q_B,ITLKESGPPLVKPTQTLTLTCSFSGFSLSDFGVGVGWIRQPPGKAL...,"[30, 31, 53, 83, 84, 85, 104, 105, 106, 107, 1..."
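
Because each label is a list of position indices, token-level models usually need it expanded into one binary label per residue. Here is a minimal sketch of that conversion. It assumes the indices are 0-based offsets into the chain sequence, which is not stated above and should be checked against the data; `positions_to_mask` and the toy indices are hypothetical.

```python
# Minimal sketch: expand a list of binding positions into a per-residue mask.
# Assumption: positions are 0-based indices into the sequence string.
def positions_to_mask(seq, positions):
    """Return a list with 1 at every listed position and 0 elsewhere."""
    pos = set(positions)
    out_of_range = sorted(p for p in pos if not 0 <= p < len(seq))
    if out_of_range:
        raise ValueError(f"positions out of range: {out_of_range}")
    return [1 if i in pos else 0 for i in range(len(seq))]


toy_seq = "LEQSGAELVK"  # first ten residues of row 0 above; the full chain is longer
toy_pos = [1, 4, 5]     # made-up indices for illustration only
mask = positions_to_mask(toy_seq, toy_pos)
print(mask)
```

If the `Y` column is stored as text rather than as a Python list, it would first need to be parsed, for example with `ast.literal_eval`.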


Similarly, we can also predict the active binding region in the antigen. TDC uses a dataset from Bepipred, which curates a dataset from IEDB. It collects B-cell epitopes and non-epitope amino acids determined from crystal structures. To load, type: 

In [3]:
from tdc.single_pred import Epitope
data = Epitope(name = 'IEDB_Jespersen')
data.get_data().head(2)

Downloading...
100%|██████████| 2.18M/2.18M [00:00<00:00, 4.12MiB/s]
Loading...
Done!


Unnamed: 0,Antigen_ID,Antigen,Y
0,Protein 1,MASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFF...,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ..."
1,Protein 2,MSDLTDIQEDITRHEQQLIVARQKLKDAERAVEVDPDDVNKNTLQA...,"[312, 313, 314, 315, 316, 317, 318, 319, 320, ..."
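
The epitope labels shown above are long runs of consecutive indices, so it can be easier to inspect them as contiguous segments. The sketch below groups a list of positions into inclusive `(start, end)` runs; `contiguous_segments` is a hypothetical helper and the input list is illustrative, not taken from the dataset.

```python
# Minimal sketch: collapse position indices into contiguous (start, end) runs.
def contiguous_segments(positions):
    """Group integer positions into inclusive runs of consecutive values."""
    segments = []
    for p in sorted(set(positions)):
        if segments and p == segments[-1][1] + 1:
            # Extend the current run by one position.
            segments[-1] = (segments[-1][0], p)
        else:
            # Start a new run.
            segments.append((p, p))
    return segments


toy_positions = [109, 110, 111, 112, 200, 201, 305]  # illustrative values
segments = contiguous_segments(toy_positions)
print(segments)
```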


While the previous two tasks are antigen/antibody agnostic and focus on identifying the active regions, we can also directly predict antigen-antibody binding affinities. TDC processes a dataset from SAbDab, where we only use protein/peptide antigens:

In [4]:
from tdc.multi_pred import AntibodyAff
data = AntibodyAff(name = 'Protein_SAbDab')
data.get_data().head(2)

Downloading...
100%|██████████| 330k/330k [00:00<00:00, 1.02MiB/s]
Loading...
Done!


Unnamed: 0,Antibody_ID,Antibody,Antigen_ID,Antigen,Y
0,1hh6,['QDQLQQSGAELVRPGASVKLSCKALGYIFTDYEIHWVKQTPVHG...,pep-4,DATPEDLGARL,1e-07
1,4i2x,['EVKLQQSGPELVKPGASVKISCKASGYSFTSYYIHWVKQRPGQG...,signal-regulatory protein gamma,EEELQMIQPEKLLLVTVGKTATLHCTVTSLLPVGPVLWFRGVGPGR...,1.2e-06
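
The affinity values above span several orders of magnitude, so regression models are often trained on a log scale instead. The following sketch converts a value to its negative base-10 logarithm. It assumes that `Y` is a dissociation-constant-like quantity in mol/L, which is not stated above; `to_p_scale` is a hypothetical helper.

```python
import math


# Minimal sketch: put affinities on a negative log10 scale.
# Assumption: the input is a positive affinity value expressed in mol/L.
def to_p_scale(affinity_molar):
    """Return -log10(affinity); larger values mean tighter binding."""
    if affinity_molar <= 0:
        raise ValueError("affinity must be positive")
    return -math.log10(affinity_molar)


for y in (1e-07, 1.2e-06):  # Y values from the two rows shown above
    print(y, round(to_p_scale(y), 2))
```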


Note that in practice, since proteins are three-dimensional, the ideal input would be 3D structural information. However, because obtaining a 3D structure is itself expensive for new data points, we include the more accessible sequence information here. We plan to add 3D datasets in a future release.

In addition to binding affinity, an antibody must also have good properties in order to be developable. Immunogenicity, instability, self-association, high viscosity, polyspecificity, or poor expression can all preclude an antibody from becoming a therapeutic. This can be formulated as an antibody developability prediction problem. TDC includes two datasets for it. The first is a small dataset provided by the Therapeutic Antibody Profiler (TAP), which proposes five metrics for measuring the developability of an antibody: CDR length, patches of surface hydrophobicity (PSH), patches of positive charge (PPC), patches of negative charge (PNC), and the structural Fv charge symmetry parameter (SFvCSP). For example, to retrieve the values of CDR length, you can type:

In [2]:
from tdc.single_pred import Develop
data = Develop(name = 'TAP', label_name = 'CDR Length')
data.get_data().head(2)

Found local copy...
Loading...
Done!


Unnamed: 0,Antibody_ID,Antibody,Y
0,Abagovomab,['QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQG...,46
1,Abituzumab,['QVQLQQSGGELAKPGASVKVSCKASGYTFSSFWMHWVRQAPGQG...,45


All the label names can also be accessed via:

In [4]:
from tdc.utils import retrieve_label_name_list
retrieve_label_name_list('TAP')

['CDR_Length', 'PSH', 'PPC', 'PNC', 'SFvCSP']

The label name also supports fuzzy search, so don't worry about typing a few characters wrong!
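
To give a feel for how this kind of fuzzy matching behaves, here is a minimal sketch using Python's standard `difflib` module. It is only an illustration of the idea and is not TDC's actual implementation; `resolve_label` and the `cutoff` value are assumptions made for this example.

```python
from difflib import get_close_matches

# Label names as returned by retrieve_label_name_list('TAP') above.
label_names = ['CDR_Length', 'PSH', 'PPC', 'PNC', 'SFvCSP']


# Minimal sketch of fuzzy label lookup (not TDC's own code).
def resolve_label(query, candidates, cutoff=0.6):
    """Return the candidate most similar to `query`, or raise if none is close."""
    matches = get_close_matches(query, candidates, n=1, cutoff=cutoff)
    if not matches:
        raise ValueError(f"no label close to {query!r}")
    return matches[0]


resolved = resolve_label('CDR Length', label_names)
print(resolved)
```

A stricter `cutoff` reduces the chance of silently picking the wrong label when two names are similar, such as `PPC` and `PNC`.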

TDC also includes another, larger developability dataset from Chen et al., processed from the SAbDab dataset. The binary developability label is generated via BIOVIA's pipeline. It contains 2,409 antibodies:

In [5]:
from tdc.single_pred import Develop
data = Develop(name = 'SAbDab_Chen')
data.get_data().head(2)

Downloading...
100%|██████████| 601k/601k [00:00<00:00, 1.48MiB/s]
Loading...
Done!


Unnamed: 0,Antibody_ID,Antibody,Y
0,12e8,['EVQLQQSGAEVVRSGASVKLSCTASGFNIKDYYIHWVKQRPEKG...,0
1,15c8,['EVQLQQSGAELVKPGASVKLSCTASGFNIKDTYMHWVKQKPEQG...,0
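
In the output above, the `Antibody` field begins with `['`, which suggests that the chain sequences are stored as a stringified list. Under that assumption, the sketch below recovers the individual chains with `ast.literal_eval`. The helper `parse_chains` and the short toy sequences are hypothetical, and the order of the chains in the list is not stated here, so check it before relying on it.

```python
import ast


# Minimal sketch: parse a stringified list of chain sequences.
# Assumption: the field looks like "['SEQ1', 'SEQ2']" when stored as text.
def parse_chains(field):
    """Return the chain sequences as a list of strings."""
    chains = ast.literal_eval(field) if isinstance(field, str) else list(field)
    if not all(isinstance(c, str) for c in chains):
        raise ValueError("expected a list of sequence strings")
    return chains


toy_field = "['EVQLQQSG', 'DIVMTQSP']"  # short toy chains, not real entries
chains = parse_chains(toy_field)
print(len(chains), [len(c) for c in chains])
```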


Last but not least, TDC also includes two peptide-MHC binding datasets. Similar to the mechanism of monoclonal antibody therapy, the major histocompatibility complex (MHC) can bind peptides and display them on the cell surface, where the human immune system (T cells) can recognize and eliminate them. There are several classes of MHC (MHC I, II, III) that differ structurally. Thus, it is important to predict MHC-peptide binding affinity. In TDC, we include two datasets. The first is from NetMHCpan for MHC class I binding, collected from the IEDB and IMGT/HLA databases. You can retrieve it via:

In [6]:
from tdc.multi_pred import PeptideMHC
data = PeptideMHC(name = 'MHC1_IEDB-IMGT_Nielsen')
data.get_data().head(2)

Downloading...
100%|██████████| 15.1M/15.1M [00:01<00:00, 11.2MiB/s]
Loading...
Done!


Unnamed: 0,Peptide_ID,Peptide,MHC_ID,MHC,Y
0,0,ARWLASTPL,BoLA-D18.4,YYSEYREISENVYESNLYIAYSDYTWEYLNYRWY,0.589395
1,1,ASYAAAAAY,BoLA-D18.4,YYSEYREISENVYESNLYIAYSDYTWEYLNYRWY,0.496594


The second dataset covers MHC class II binding. It originates from NetMHCIIpan and was also collected from the IEDB database. To retrieve this dataset, type:

In [7]:
from tdc.multi_pred import PeptideMHC
data = PeptideMHC(name = 'MHC2_IEDB_Jensen')
data.get_data().head(2)

Downloading...
100%|██████████| 11.8M/11.8M [00:01<00:00, 9.86MiB/s]
Loading...
Done!


Unnamed: 0,Peptide_ID,Peptide,MHC_ID,MHC,Y
0,0,PKYVKQNTLKLAT,HLA-DPA10103-DPB10201,YAFFMFSGGAILNTLFGQFEYFDIEEVRMHLGMT,0.0
1,1,DSDVGEFRAVTELG,HLA-DPA10103-DPB10201,YAFFMFSGGAILNTLFGQFEYFDIEEVRMHLGMT,0.047212
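
The `Y` values shown for the peptide-MHC datasets lie between 0 and 1. A widely used convention in the NetMHC family of tools rescales a measured IC50 (in nM) as `1 - log(IC50) / log(50000)`. Whether these datasets follow exactly that convention is an assumption here and should be verified before use; the sketch below simply shows how the transform and its inverse would work under that assumption, with hypothetical helper names.

```python
import math

MAX_IC50_NM = 50000.0  # assumed cap used by the NetMHC-style transform


# Minimal sketch, assuming score = 1 - log(IC50) / log(50000).
def ic50_to_score(ic50_nm):
    """Map an IC50 in nM onto the [0, 1] scale (higher means stronger binding)."""
    return 1.0 - math.log(ic50_nm) / math.log(MAX_IC50_NM)


def score_to_ic50(score):
    """Invert the transform and return an IC50 in nM."""
    return MAX_IC50_NM ** (1.0 - score)


for s in (0.0, 0.589395):  # example Y values from the tables above
    print(s, round(score_to_ic50(s), 1))
```

Under this convention a score of 0 corresponds to the 50,000 nM cap, and the commonly used 500 nM binder threshold corresponds to a score of roughly 0.426.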


That's it for this tutorial! TDC is at a very early stage, and we cover only the tip of the iceberg of all the biologics datasets. Thus, we are actively looking for contributions from domain scientists and ML researchers for new tasks and new datasets. Please [contact us](mailto:kexinhuang@hsph.harvard.edu) if you are interested in contributing!

In the next set of tutorials, we are going to cover the following topics:

* [TDC 104 ML Model Examples with DeepPurpose](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_104_ML_Model_DeepPurpose.ipynb)

* [TDC 105 Molecular Oracles](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_105_Oracles.ipynb)

See you there!