# Test BERT-Pytorch

BERT-pytorch is a PyTorch implementation of the BERT algorithm.

[BERT-pytorch](https://github.com/codertimo/BERT-pytorch)


## Embedding
In the BERT implementation (bert_pytorch/model/bert.py), the attention mask is computed as `x > 0`. Note that `x` holds token indices, so this expression masks out tokens whose index is 0 (the padding index in BERT-pytorch) rather than skipping the first position. In the original BERT paper, the first element of the input is always \[CLS\]. In our model, we will use the variant name as the \[CLS\] token; its possible values are:
[wt, alpha, delta, omicron, na], where "na" stands for not assigned.
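
One way to picture the variant-as-\[CLS\] idea is a small lookup that maps each variant label to its own token index and places that index at the front of the token list. The sketch below is only an illustration: the names `VARIANTS`, `variant_to_index`, `prepend_variant`, and the `offset` parameter are hypothetical and not part of BERT-pytorch.

```python
# Hypothetical sketch: treat the variant label as the first ([CLS]-like) token.
VARIANTS = ['wt', 'alpha', 'delta', 'omicron', 'na']  # 'na' = not assigned


def variant_to_index(variant, offset=0):
    """Map a variant label to a token index; unknown labels fall back to 'na'.

    `offset` is an assumed parameter that shifts the variant indices past an
    existing vocabulary so that the two index ranges do not collide.
    """
    label = variant.lower() if variant.lower() in VARIANTS else 'na'
    return offset + VARIANTS.index(label)


def prepend_variant(token_ids, variant, offset=0):
    """Return a new list with the variant token placed first."""
    return [variant_to_index(variant, offset)] + list(token_ids)


print(prepend_variant([10, 4, 18], 'delta', offset=26))  # [28, 10, 4, 18]
```

With `offset=26`, the five variant tokens would occupy indices 26 to 30, directly after a 26-token amino acid vocabulary.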

In [32]:
import io
import copy
import math
from Bio import SeqIO
import torch
import torch.nn as nn
from bert_pytorch.model import BERT

## Tokenization and Vocabulary
In [ProteinBERT](https://academic.oup.com/bioinformatics/article/38/8/2102/6502274), Brandes et al. used 26 unique tokens to represent the 20 standard amino acids, selenocysteine (U), an undefined amino acid (X), any other amino acid (\<OTHER\>), and three special tokens: \<START\>, \<END\>, and \<PAD\>.

In [4]:
# Based on the source code of protein_bert
ALL_AAS = 'ACDEFGHIKLMNPQRSTUVWXY'
ADDITIONAL_TOKENS = ['<OTHER>', '<START>', '<END>', '<PAD>']

# Each sequence gets a <START> and an <END> token added
ADDED_TOKENS_PER_SEQ = 2

n_aas = len(ALL_AAS)
aa_to_token_index = {aa: i for i, aa in enumerate(ALL_AAS)}
additional_token_to_index = {token: i + n_aas for i, token in enumerate(ADDITIONAL_TOKENS)}
token_to_index = {**aa_to_token_index, **additional_token_to_index}
index_to_token = {index: token for token, index in token_to_index.items()}
n_tokens = len(token_to_index)

def tokenize_seq(seq):
    other_token_index = additional_token_to_index['<OTHER>']
    return [additional_token_to_index['<START>']] + [aa_to_token_index.get(aa, other_token_index) for aa in seq] + [additional_token_to_index['<END>']]
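
As a quick check of the vocabulary above, the following sketch tokenizes a short made-up peptide and maps the indices back to tokens. The peptide `MKB` is an arbitrary example (`B` is not in `ALL_AAS`, so it maps to `<OTHER>`), and the vocabulary is re-declared in a compact form so that the snippet runs on its own.

```python
ALL_AAS = 'ACDEFGHIKLMNPQRSTUVWXY'
ADDITIONAL_TOKENS = ['<OTHER>', '<START>', '<END>', '<PAD>']

# Amino acids take indices 0..21, the additional tokens 22..25.
token_to_index = {aa: i for i, aa in enumerate(ALL_AAS)}
token_to_index.update({tok: i + len(ALL_AAS) for i, tok in enumerate(ADDITIONAL_TOKENS)})
index_to_token = {i: tok for tok, i in token_to_index.items()}


def tokenize_seq(seq):
    other = token_to_index['<OTHER>']
    body = [token_to_index.get(aa, other) for aa in seq]
    return [token_to_index['<START>']] + body + [token_to_index['<END>']]


ids = tokenize_seq('MKB')
print(ids)                               # [23, 10, 8, 22, 24]
print([index_to_token[i] for i in ids])  # ['<START>', 'M', 'K', '<OTHER>', '<END>']
```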

## Amino Acid Token Embeddings
We will derive the token embedding from the [torch.nn.Embedding class](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). The size of the vocabulary equals the number of tokens. This approach lets the model learn the embeddings itself. If we train the model with virus-specific sequences, the embeddings should reflect the hidden properties of the amino acids in the context of the training sequences. Note that the \<START\> and \<END\> tokens are always added at the beginning and the end of the sequence, respectively. \<PAD\> tokens may be added before the \<END\> token if the sequence is shorter than the model's input length.
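
The padding rule described above can be sketched as follows. This assumes a fixed input length (called `max_len` here, a hypothetical name) and places `<PAD>` tokens before `<END>`, as stated above. `pad_tokens` is an illustrative helper, not part of ProteinBERT or BERT-pytorch. The indices 23, 24, and 25 are those of `<START>`, `<END>`, and `<PAD>` in the vocabulary defined earlier.

```python
START, END, PAD = 23, 24, 25  # indices from the vocabulary defined earlier


def pad_tokens(token_ids, max_len):
    """Pad a <START> ... <END> token list to max_len, inserting <PAD> before <END>."""
    if len(token_ids) > max_len:
        raise ValueError(f'sequence of {len(token_ids)} tokens exceeds max_len={max_len}')
    n_pad = max_len - len(token_ids)
    return token_ids[:-1] + [PAD] * n_pad + token_ids[-1:]


print(pad_tokens([START, 10, 8, END], max_len=6))  # [23, 10, 8, 25, 25, 24]
```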

Note that the `from_pretrained` class method of torch.nn.Embedding can be used to load pretrained embedding weights.
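
A minimal sketch of loading pretrained weights is shown below. The weight matrix here is random and only stands in for real pretrained values. `freeze=False` is chosen so that the weights stay trainable (the default `freeze=True` would keep them fixed).

```python
import torch
import torch.nn as nn

n_tokens, embedding_dim, padding_idx = 26, 20, 25

# Stand-in for a real pretrained matrix of shape (n_tokens, embedding_dim).
pretrained_weights = torch.randn(n_tokens, embedding_dim)

embedding = nn.Embedding.from_pretrained(
    pretrained_weights,
    freeze=False,             # keep the embeddings trainable
    padding_idx=padding_idx,  # the <PAD> row receives no gradient updates
)

print(embedding.weight.shape)                                  # torch.Size([26, 20])
print(torch.equal(embedding.weight.data, pretrained_weights))  # True
```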


In [54]:
class TokenEmbedding(nn.Embedding):
    def __init__(self, num_embeddings: int, embedding_dim: int = 512, padding_idx=None):
        super().__init__(num_embeddings, embedding_dim, padding_idx)

padding_idx = token_to_index['<PAD>']
print(padding_idx)


25


In [7]:
test_wt_seq = """>sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDP
LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
SEPVLKGVKLHYT"""
len(test_wt_seq)

1413

In [8]:
test_seqs = []
fa_parser = SeqIO.parse(io.StringIO(test_wt_seq), 'fasta')
for record in fa_parser:
    seq = record.seq
    test_seqs.append(str(seq))

In [55]:
num_embeddings = n_tokens
embedding_dim = 20
embedding = TokenEmbedding(num_embeddings, embedding_dim, padding_idx)
test_embedding = embedding(torch.IntTensor(tokenize_seq(test_seqs[0])))
print(test_embedding)
print(f'Shape of test sequence embedding: {test_embedding.shape}')


tensor([[-1.1492, -0.1697, -0.6104,  ...,  0.6469, -0.9710, -0.0934],
        [-0.9537, -1.1770,  1.0682,  ...,  0.7800, -1.6534,  2.3468],
        [-1.0529, -0.4486, -0.1189,  ..., -2.6147,  1.1226,  0.2858],
        ...,
        [ 0.8581, -1.7343, -0.0180,  ..., -0.9839,  0.1881,  1.4502],
        [ 0.1738,  0.0885, -1.1182,  ...,  0.3204,  2.5281, -0.4910],
        [-1.6901, -0.4786, -0.4231,  ..., -1.8643,  0.5993, -0.9359]],
       grad_fn=<EmbeddingBackward0>)
Shape of test sequence embedding: torch.Size([1275, 20])


Let's take a look at the embedding weights:

In [28]:
embedding.weight

Parameter containing:
tensor([[-2.3978,  1.2127, -0.2234, -0.0264, -0.3820, -0.9479,  0.8689, -0.7566,
         -0.4174, -1.0082,  0.1021,  0.1771, -1.0763, -0.1874,  0.0987,  0.3682,
         -1.1995,  0.4575, -1.2001,  1.5923],
        [ 0.5006,  3.0301, -1.7250, -2.1034,  2.5169, -0.1309, -1.0380,  0.1059,
          0.7469, -0.3476,  0.5521,  0.7017, -1.5748,  0.3097,  1.1791,  2.1280,
         -0.7778, -0.7498,  1.1281,  2.1922],
        [ 1.9306,  1.8446, -0.9415, -0.2140,  1.0399, -0.8043,  1.3288, -0.4401,
         -0.4866, -0.9693, -0.6359,  1.3263,  0.2730, -0.7357, -0.5857, -1.0038,
         -0.0549,  0.3960,  0.5891,  1.5032],
        [-0.7671,  1.4075, -0.1768, -1.1197, -1.0238, -2.2906,  0.1624,  0.1044,
         -0.4115, -0.3973,  0.4986, -0.3979,  0.2723,  0.4332,  0.8461, -2.0731,
         -0.5507, -0.3212,  0.7629, -0.3854],
        [ 0.3769,  1.8458, -0.2702, -1.0592,  0.8349, -0.2808,  1.8812, -2.5554,
          1.7361,  0.4591,  0.8112,  1.6567, -0.7579,  0.8533,  1

In [29]:
embedding.weight.shape

torch.Size([26, 20])

## Positional Encoding
We will use sine and cosine functions of different frequencies to embed positional information, as in the original Transformer paper.
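
For reference, the sinusoidal encoding is usually written as below, where $pos$ is the position in the sequence, $i$ indexes pairs of embedding dimensions, and $d_{\text{model}}$ is the embedding size:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

The factor $1/10000^{2i/d_{\text{model}}}$ can be rewritten as $\exp\!\left(-2i \cdot \ln(10000)/d_{\text{model}}\right)$, which is the quantity stored in `div_term` in the code that follows.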

In [98]:
class PositionalEncoding(nn.Module):
    """
    Implement the PE function.

    The PE forward function differs from the one in BERT-pytorch. Here we follow the original Transformer
    method, so the positional encodings are added to the input embeddings and no gradient tracking is used.
    """

    def __init__(self, d_model, dropout, max_len=1500):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0)/d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        print(f'x.shape in PositionalEncoding: {x.shape}')
        print(f'x.shape: {x.shape},pe.shape: {self.pe.shape}')
        print(f'pe[:, : x.size(1)]: {self.pe[:, : x.size(1)]}')


        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

In [99]:
class SeqEncoding(nn.Module):
    """
    Encode an amino acid sequence. The input sequence is represented by summing the corresponding token,
    segment (e.g. question and answer, or any segments separated by <SEP>), and position embeddings. In our
    model, we only need the token and position embeddings, so the segment embedding is not implemented here.
    """
    def __init__(self, num_embeddings, embedding_dim, dropout=0.1):
        super().__init__()
        self.token_embedding = TokenEmbedding(num_embeddings, embedding_dim)
        self.position = PositionalEncoding(embedding_dim, dropout)
        self.embedding_dim = embedding_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, seq: str):
        # The token tensor is 1-D, shape (seq_len,), so the embedding below has shape
        # (seq_len, embedding_dim) with no batch dimension. PositionalEncoding slices
        # its table with x.size(1), so it expects (batch, seq_len, embedding_dim).
        x = torch.IntTensor(tokenize_seq(seq))
        x = self.token_embedding(x)
        x = self.position(x)
        return self.dropout(x)

In [100]:
test_seq_encode = SeqEncoding(n_tokens, 512, 0.1)
num_parameters_seq_encoding = sum(p.numel() for p in test_seq_encode.parameters() if p.requires_grad)
print(f'Parameters in SeqEncoding: {num_parameters_seq_encoding}')
print(test_seq_encode(test_seqs[0]))

Parameters in SeqEncoding: 13312
x.shape in PositionalEncoding: torch.Size([1275, 512])
x.shape: torch.Size([1275, 512]),pe.shape: torch.Size([1, 1500, 512])
pe[:, : x.size(1)]: tensor([[[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  ...,  1.0000e+00,
           0.0000e+00,  1.0000e+00],
         [ 8.4147e-01,  5.4030e-01,  8.2186e-01,  ...,  1.0000e+00,
           1.0366e-04,  1.0000e+00],
         [ 9.0930e-01, -4.1615e-01,  9.3641e-01,  ...,  1.0000e+00,
           2.0733e-04,  1.0000e+00],
         ...,
         [ 6.1950e-02,  9.9808e-01,  7.9820e-01,  ...,  9.9850e-01,
           5.2740e-02,  9.9861e-01],
         [ 8.7333e-01,  4.8714e-01,  9.4981e-01,  ...,  9.9850e-01,
           5.2844e-02,  9.9860e-01],
         [ 8.8177e-01, -4.7168e-01,  2.8401e-01,  ...,  9.9849e-01,
           5.2947e-02,  9.9860e-01]]])


RuntimeError: The size of tensor a (1275) must match the size of tensor b (512) at non-singleton dimension 1
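
The error above comes from a missing batch dimension: the token embedding has shape `(1275, 512)`, so `x.size(1)` is 512 (the embedding size) and `pe[:, :x.size(1)]` has shape `(1, 512, 512)`, which cannot be broadcast against 1275 rows. One possible fix, sketched below with a simplified stand-in for the positional table, is to add a batch dimension with `unsqueeze(0)` before adding the positional encoding. The tensor `pe` here is filled with zeros purely to illustrate the shapes.

```python
import torch

seq_len, d_model, max_len = 1275, 512, 1500

pe = torch.zeros(1, max_len, d_model)   # stand-in for the registered buffer
x = torch.randn(seq_len, d_model)       # unbatched embedding, as in the failing call

# Without a batch dimension, x.size(1) is d_model, so the slice has the wrong length.
print(pe[:, :x.size(1)].shape)          # torch.Size([1, 512, 512])

# Adding a batch dimension makes x.size(1) the sequence length.
x = x.unsqueeze(0)                      # (1, seq_len, d_model)
out = x + pe[:, :x.size(1)]
print(out.shape)                        # torch.Size([1, 1275, 512])
```

In `SeqEncoding.forward`, this would correspond to calling `unsqueeze(0)` on the token tensor before the embedding lookup.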

# Model Definition

Here we define a model based on BERT. Part of the implementation is based on [BERT-pytorch](https://github.com/codertimo/BERT-pytorch).

In [30]:
def clones(module, N):
    """Produce N identical layers"""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
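
A short usage sketch: because `clones` deep-copies the module, each layer gets its own parameters rather than sharing one set. The `nn.Linear` layer is just a small stand-in module.

```python
import copy

import torch
import torch.nn as nn


def clones(module, N):
    """Produce N identical layers"""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


layers = clones(nn.Linear(4, 4), 3)
print(len(layers))  # 3

# The copies start with equal values but are stored in separate tensors.
print(torch.equal(layers[0].weight, layers[1].weight))  # True
print(layers[0].weight is layers[1].weight)             # False
```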

In [None]:
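class Transformer(nn.Module):
    """Placeholder for the Transformer block; its layers and forward pass are still to be written."""
    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        super().__init__()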
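
As a rough illustration of what such a block could contain, the sketch below combines multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection with layer normalization. It is built on PyTorch's `nn.MultiheadAttention` rather than on the BERT-pytorch attention code, and the class name `TransformerBlockSketch` and the `padding_mask` argument are hypothetical.

```python
import torch
import torch.nn as nn


class TransformerBlockSketch(nn.Module):
    """Self-attention + feed-forward, each with a residual connection and LayerNorm."""

    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden, attn_heads, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden, feed_forward_hidden),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(feed_forward_hidden, hidden),
        )
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # x: (batch, seq_len, hidden); padding_mask: (batch, seq_len), True at <PAD> positions
        attn_out, _ = self.attention(x, x, x, key_padding_mask=padding_mask, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x


block = TransformerBlockSketch(hidden=64, attn_heads=4, feed_forward_hidden=256, dropout=0.1)
tokens = torch.randn(2, 10, 64)   # (batch, seq_len, hidden)
print(block(tokens).shape)        # torch.Size([2, 10, 64])
```

The constructor arguments mirror the call `Transformer(hidden, attn_heads, hidden * 4, dropout)` used in the BERT class, so a block of this shape could be passed to `clones`.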

In [None]:
class BERT(nn.Module):
    """
    BERT model
    """

    def __init__(self, vocab_size: int = 26, hidden: int = 768, n_layer: int = 12, attn_heads: int = 12, dropout: float = 0.1):
        """
        vocab_size: vocabulary (token) size
        hidden: BERT model size (used as input size and hidden size)
        n_layer: number of Transformer layers
        attn_heads: number of attention heads
        dropout: dropout ratio
        """

        super().__init__()
        self.hidden = hidden
        self.n_layer = n_layer
        self.attn_heads = attn_heads

        self.feed_forward_hidden = hidden * 4
        # 25 is the index of <PAD> in token_to_index
        self.embedding = TokenEmbedding(vocab_size, embedding_dim=hidden, padding_idx=25)

        self.transformer_blocks = clones(Transformer(hidden, attn_heads, hidden * 4, dropout), n_layer)

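    def forward(self, x, mask=None):
        # Assumed forward pass, following the BERT-pytorch pattern: embed the tokens,
        # then apply each Transformer block in turn. Each block is assumed to be
        # callable as block(x, mask).
        x = self.embedding(x)
        for transformer in self.transformer_blocks:
            x = transformer(x, mask)
        return x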
