This tutorial loads several kinds of datasets to show what the input data looks like in each case.

### Case 1: One protein input, using the fluorescence dataset as an example

First we import the necessary libraries. If you run this notebook from the DEMO folder, add the parent directory to the system path so that DeepProtein can be imported.

In [1]:
import sys
import os
sys.path.append(os.path.abspath('..'))
from DeepProtein.dataset import FluorescenceDataset
import DeepProtein.utils as utils
from DeepProtein.dataset import *

The data is stored under DeepProtein/data.

In [2]:
train_fluo = FluorescenceDataset('../DeepProtein/data', 'train')

To train a CNN, graph stays False (the call below leaves it unset), so pass the dataset to collate_fn:

In [3]:
train_protein_processed, train_target, train_protein_idx = collate_fn(train_fluo)

Let's see what training data looks like:

In [4]:
#train_protein_processed

The input data is a list of sequences, each containing hundreds of characters. For example, the first sequence has 237 characters and its property value is 3.8237.

In [5]:
train_protein_processed[0]

'SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHKIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDERYK'

In [6]:
print(len(train_protein_processed[0]))

237


In [7]:
print(train_target[0])

tensor([3.8237])
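
The three return values can be pictured with a small stand-in. The sketch below is not DeepProtein's `collate_fn`; `toy_collate` is a hypothetical helper that only mimics the shape of the result for `graph=False`, assuming each dataset item is a `(sequence, target)` pair and that the third value is a positional index.

```python
# Hypothetical stand-in for illustration only; not DeepProtein's collate_fn.
def toy_collate(dataset):
    """Split (sequence, target) pairs into sequences, targets, and indices."""
    sequences = [seq for seq, _ in dataset]
    targets = [[float(y)] for _, y in dataset]  # one-element rows, like tensor([3.8237])
    indices = list(range(len(dataset)))         # assumed: one positional index per protein
    return sequences, targets, indices

# Toy data: a shortened first sequence and a made-up second entry.
toy_data = [("SKGEELFTGV", 3.8237), ("MKTAYIAKQR", 1.25)]
seqs, targets, idx = toy_collate(toy_data)
print(seqs[0], targets[0], idx)
```

The point is only that the sequences stay plain strings in this mode, with one target per sequence at the same index.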


To train a GNN instead, let's see what the data looks like when we set graph to True.

In [8]:
valid_fluo = FluorescenceDataset('../DeepProtein/data', 'valid')
valid_protein_processed, valid_target, valid_protein_idx = collate_fn(valid_fluo, graph=True)

100%|██████████| 5362/5362 [01:13<00:00, 72.58it/s]


Now the input is a SMILES string rather than an amino-acid sequence.

In [9]:
print(valid_protein_processed[0])

CC[C@H](C)[C@H](NC(=O)CNC(=O)[C@H](CCCCN)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@H](CC(=O)O)NC(=O)CNC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CCCCN)NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](C)NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@@H](NC(=O)[C@H](CCCCN)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@H](CC(N)=O)NC(=O)CNC(=O)[C@H](CC(=O)O)NC(=O)[C@H](CC(=O)O)NC(=O)[C@H](CCCCN)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)CNC(=O)[C@H](CCC(=O)O)NC(=O)[C@@H]1CCCN1C(=O)[C@H](CCSC)NC(=O)[C@H](C)NC(=O)[C@H](CO)NC(=O)[C@H](CCCCN)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(=O)O)NC(=O)[C@H](Cc1c[nH]cn1)NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CCCCN)NC(=O)[C@H](CCSC)NC(=O)[C@H](Cc1c[nH]cn1)NC(=O)[C@H](CC(=O)O)NC(=O)[C@@H]1CCCN1C(=O)[C@H](Cc1ccc(O)cc1)NC(=

To see how it is transformed into a graph, we apply smiles_to_bigraph from dgllife to this SMILES string and inspect the resulting graph:

In [10]:
from dgllife.utils import smiles_to_bigraph
g = smiles_to_bigraph(valid_protein_processed[0])

In [11]:
print(g)

Graph(num_nodes=1886, num_edges=3860,
      ndata_schemes={}
      edata_schemes={})


This graph has 1886 nodes and 3860 edges, which is large for a single protein. However, it carries no node or edge features, because we have not applied a node featurizer or an edge featurizer. To add them:

In [12]:
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer, CanonicalBondFeaturizer
node_feat = CanonicalAtomFeaturizer()
edge_feat = CanonicalBondFeaturizer()

In [13]:
g = smiles_to_bigraph(valid_protein_processed[0], 
                    node_featurizer=node_feat, 
                    edge_featurizer=edge_feat)

In [14]:
print(g)
print(g.ndata['h'].shape)

Graph(num_nodes=1886, num_edges=3860,
      ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
      edata_schemes={'e': Scheme(shape=(12,), dtype=torch.float32)})
torch.Size([1886, 74])


Now each node has a 74-dimensional feature vector, so the node feature matrix has size 1886 × 74, and each edge has a 12-dimensional feature vector. This example comes from the validation data.
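
The edge count is even because `smiles_to_bigraph` builds a bi-directed graph, so each bond contributes one edge per direction. The sketch below illustrates that with a toy bond list in plain Python (`to_bidirected` is a hypothetical helper, not a dgllife function) and then redoes the arithmetic for the graph printed above.

```python
def to_bidirected(bonds):
    """Turn undirected bonds (u, v) into directed edge lists covering both directions."""
    src, dst = [], []
    for u, v in bonds:
        src += [u, v]
        dst += [v, u]
    return src, dst

# Toy three-atom chain 0-1-2 (two bonds); not taken from the dataset.
src, dst = to_bidirected([(0, 1), (1, 2)])
print(list(zip(src, dst)))

# Arithmetic for the validation graph printed above.
num_nodes, num_edges = 1886, 3860
num_bonds = num_edges // 2              # two directed edges per bond
node_feature_entries = num_nodes * 74   # values stored in g.ndata['h']
edge_feature_entries = num_edges * 12   # values stored in g.edata['e']
print(num_bonds, node_feature_entries, edge_feature_entries)
```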


### Case 2: Two protein inputs, using PPI_Affinity as an example

This case arises in protein-protein interaction (PPI) tasks, where the inputs are two protein sequences.

In [15]:
train_ppi = PPI_Affinity('../DeepProtein/data', 'train')


In [16]:
train_protein_1, train_protein_2, train_target, train_protein_idx = collate_fn_ppi(train_ppi, graph=True, unsqueeze=False)

100%|██████████| 2421/2421 [00:48<00:00, 49.68it/s]


These are the two input proteins at index 0. Because graph=True, each is given as a SMILES string.

In [17]:
train_protein_1[0], train_protein_2[0] 

('CC[C@H](C)[C@H](NC(=O)[C@H](CCC(N)=O)NC(=O)CNC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCCNC(=N)N)NC(=O)CNC(=O)[C@H](CCSC)NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@H](CCC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(=O)O)NC(=O)[C@H](CCCCN)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(=O)O)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@@H](NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](CO)NC(=O)[C@H](CC(=O)O)NC(=O)[C@H](CO)NC(=O)[C@H](C)NC(=O)CNC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CO)NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](C)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H](CO)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)[C@H](CCCNC
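
For paired inputs, the two protein lists and the target list stay aligned by index. A minimal sketch with made-up data follows; the `(protein_1, protein_2, target)` record layout is an assumption for illustration, not DeepProtein's internal format.

```python
# Made-up records; the (protein_1, protein_2, target) layout is assumed.
pairs = [
    ("MKTAYIAK", "GSHMLEDP", 0.5),
    ("QVKLQESG", "DIELTQSP", 1.5),
]

# zip(*pairs) transposes the records into three parallel columns.
protein_1, protein_2, target = (list(col) for col in zip(*pairs))

# Index i in each list refers to the same interaction.
for i in range(len(target)):
    print(i, protein_1[i], protein_2[i], target[i])
```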

### Case 3: Load from a pickle file, using IEDB (epitope prediction) as an example

The input is one protein. This data can be passed to Token CNN, CNN_RNN, or Transformer.

In [18]:
from tdc.single_pred import Epitope

In [19]:
data_class, name, X = Epitope, 'IEDB_Jespersen', 'Antigen'

data = data_class(name=name)
split = data.get_split()
train_data, valid_data, test_data = split['train'], split['valid'], split['test']
vocab_set = set()

train_vocab, train_positive_ratio = data2vocab(train_data, train_data, X)
valid_vocab, valid_positive_ratio = data2vocab(valid_data, train_data, X)
test_vocab, test_positive_ratio = data2vocab(test_data, train_data, X)

vocab_set = train_vocab.union(valid_vocab)
vocab_set = vocab_set.union(test_vocab)
vocab_lst = list(vocab_set)

train_data = standardize_data(train_data, vocab_lst, X)
valid_data = standardize_data(valid_data, vocab_lst, X)
test_data = standardize_data(test_data, vocab_lst, X)

train_set = data_process_loader_Token_Protein_Prediction(train_data)
valid_set = data_process_loader_Token_Protein_Prediction(valid_data)
test_set = data_process_loader_Token_Protein_Prediction(test_data)



Found local copy...
Loading...
Done!


In [20]:
print(train_set.sequences[4])
print(train_set.sequences[4].shape)

tensor([[0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
torch.Size([300, 24])
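
The `[300, 24]` shape is consistent with one row per position, padded to a fixed length, and one column per vocabulary entry. The sketch below shows that idea in plain Python. `build_vocab`, `one_hot_pad`, and the toy vocabulary are hypothetical, and the real `standardize_data` may differ in details such as how padding is encoded.

```python
def build_vocab(sequences):
    """Collect the distinct characters seen across all sequences, in sorted order."""
    return sorted(set("".join(sequences)))

def one_hot_pad(seq, vocab, max_len):
    """One-hot encode seq, truncating or zero-padding to exactly max_len rows."""
    index = {ch: i for i, ch in enumerate(vocab)}
    rows = []
    for ch in seq[:max_len]:
        row = [0.0] * len(vocab)
        row[index[ch]] = 1.0
        rows.append(row)
    # Assumed padding scheme: all-zero rows after the end of the sequence.
    rows += [[0.0] * len(vocab) for _ in range(max_len - len(rows))]
    return rows

toy_sequences = ["ACD", "CDEA"]        # made-up sequences
vocab = build_vocab(toy_sequences)     # ['A', 'C', 'D', 'E']
encoded = one_hot_pad("ACD", vocab, max_len=5)
print(len(encoded), len(encoded[0]))   # rows x columns
```

With a vocabulary of 24 symbols and `max_len=300`, the same routine would yield a 300 × 24 matrix per sequence.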


Since this is token-level prediction, we can also print y to see its shape.

In [21]:
print(len(train_set.labels))
print(train_set.labels[4].shape)
print(train_set.labels[4])

2211
torch.Size([300])
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0

Therefore, for a sequence of length 300, each token has its own label (0 or 1).
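
A token-level label vector like this is often easier to read as spans. As a sketch, the hypothetical helper `positive_spans` below lists the half-open index ranges labelled 1. The toy vector is patterned on the visible prefix of the output above and assumes zeros for the truncated remainder.

```python
def positive_spans(labels):
    """Return [start, end) index ranges of consecutive positive labels."""
    spans, start = [], None
    for i, value in enumerate(labels):
        if value == 1 and start is None:
            start = i                     # a run of 1s begins
        elif value != 1 and start is not None:
            spans.append((start, i))      # the run ended just before i
            start = None
    if start is not None:                 # run reaches the end of the vector
        spans.append((start, len(labels)))
    return spans

# Toy vector: 27 zeros, 15 ones, then zeros up to length 300 (tail assumed).
toy_labels = [0.0] * 27 + [1.0] * 15 + [0.0] * 258
print(len(toy_labels), positive_spans(toy_labels))
```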

### Case 4: Load from a TAB file, using CRISPR as an example

The input is a single sequence (the GuideSeq column).

In [22]:
from tdc.utils import retrieve_label_name_list
from tdc.single_pred import Develop, CRISPROutcome

In [23]:
label_list = retrieve_label_name_list('Leenay')

data = CRISPROutcome(name='Leenay', label_name=label_list[0])
split = data.get_split()

train_GuideSeq, y_train = list(split['train']['GuideSeq']), list(split['train']['Y'])
val_GuideSeq, y_valid = list(split['valid']['GuideSeq']), list(split['valid']['Y'])
test_GuideSeq, y_test = list(split['test']['GuideSeq']), list(split['test']['Y'])

# print(y_train)
train_CRISPR = list(zip(train_GuideSeq, y_train))
valid_CRISPR = list(zip(val_GuideSeq, y_valid))
test_CRISPR = list(zip(test_GuideSeq, y_test))

Found local copy...
Loading...
Done!


In [24]:
for x, y in train_CRISPR:
    print(x, y)
    break


CTGCAGGGCTAGTTTCCTATAGG 0.0695715507140821
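
Each CRISPR record is a `(sequence, value)` pair, and the sequence here uses the nucleotide alphabet rather than amino acids. A quick sanity check on the record printed above, in plain Python with no TDC call involved:

```python
from collections import Counter

# The record printed above.
guide, y = "CTGCAGGGCTAGTTTCCTATAGG", 0.0695715507140821

# Confirm the sequence only uses the four DNA letters.
assert set(guide) <= set("ACGT")

print(len(guide))                       # sequence length
print(sorted(Counter(guide).items()))   # per-base counts
```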


### Case 5: Load from a TAB file with two protein inputs, using TAP as an example

Here the input consists of two proteins, so we treat it as PPI in this library.

In [25]:
from tdc.utils import retrieve_label_name_list
from tdc.single_pred import Develop

In [26]:
label_list = retrieve_label_name_list('TAP')

data = Develop(name='TAP', label_name=label_list[0])
split = data.get_split()

train_antibody_1, train_antibody_2 = to_two_seq(split, 'train', 'Antibody')

y_train = split['train']['Y']


Found local copy...
Loading...
Done!


In [27]:
y_train

0      46
1      45
2      45
3      49
4      51
       ..
164    45
165    47
166    44
167    42
168    51
Name: Y, Length: 169, dtype: int64

In [28]:
train_antibody_1[0], train_antibody_2[0]

('QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS',
 'DIELTQSPASLSASVGETVTITCQASENIYSYLAWHQQKQGKSPQLLVYNAKTLAGGVSSRFSGSGSGTHFSLKIKSLQPEDFGIYYCQHHYGILPTFGGGTKLEIK')
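
The exact behaviour of `to_two_seq` is not shown here. As a rough sketch of the idea only, assuming each `Antibody` entry stores both chains as a string-encoded two-element list (an assumption about the data format, not something confirmed above), a hypothetical `split_pairs` helper could look like this:

```python
import ast

def split_pairs(entries):
    """Split string-encoded [seq_1, seq_2] entries into two parallel lists."""
    first, second = [], []
    for entry in entries:
        seq_1, seq_2 = ast.literal_eval(entry)  # parse "['AAA', 'BBB']" safely
        first.append(seq_1)
        second.append(seq_2)
    return first, second

# Made-up entries in the assumed format (shortened sequences).
toy_entries = ["['QVKLQESG', 'DIELTQSP']", "['EVQLVESG', 'DIQMTQSP']"]
chain_1, chain_2 = split_pairs(toy_entries)
print(chain_1[0], chain_2[0])
```

`ast.literal_eval` is used instead of `eval` so that only Python literals are parsed.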