# **AbLang Examples**

AbLang is a RoBERTa-inspired language model trained on antibody sequences. The following examples show possible use cases of AbLang.

In [1]:
import ablang

In [2]:
heavy_ablang = ablang.pretrained("heavy")
heavy_ablang.freeze()

--------------
## **AbLang building blocks**

For ease of use we have built the AbLang module (see below). However, for incorporating AbLang into personal codebases it might be more convenient to use the individual building blocks.

#### AbLang tokenizer

In [3]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

tokens = heavy_ablang.tokenizer(seqs, pad=True)
tokens

tensor([[ 0,  6, 15, 10, 20, 15,  6,  7, 12, 13, 12, 20, 15, 10, 13, 12,  4,  7,
         20,  2, 20,  7, 11, 15, 14,  7, 12, 17,  8, 17,  7, 12, 18, 12,  1,  3,
         19, 15,  2, 10, 14, 13, 12,  4, 12, 20,  6, 19, 16, 14, 20, 16, 16, 18,
          5,  6,  7,  9,  4, 18, 18, 14,  5,  7, 15,  4, 12,  2, 17,  8, 16,  7,
          2,  5,  9,  7,  4,  9,  8, 20, 18, 20, 10,  1,  7,  7, 20,  2, 14,  6,
          5,  8, 14, 15, 17, 18, 11, 14,  4, 15,  4, 17, 18,  5, 13,  8, 14, 13,
          9,  5, 18, 19, 12, 10, 12,  8, 20, 15,  8, 15,  7,  7, 22],
        [ 0, 10, 15, 10, 20, 15, 10,  7, 12, 14,  6, 15,  4,  4, 13, 12, 14,  7,
         15,  4, 15,  7, 11,  4, 14,  7, 12, 18,  8, 17,  8,  7, 18, 12, 16,  7,
         19, 15,  2, 10, 14, 13, 12, 10, 12, 20,  6, 19,  1, 12, 19, 16,  7, 14,
         18,  9, 12,  9,  8,  9, 18, 14, 10,  4, 20, 10, 12,  2, 15,  8,  1,  8,
          8,  5,  8,  7,  8,  7,  8, 14, 18,  1,  6, 20,  2,  7, 20,  2,  7,  5,
          5,  8, 14, 15, 18, 18, 11, 14
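
To make the token ids above easier to read, here is a minimal sketch of the mapping they appear to follow. It uses the vocabulary printed later in this notebook (`vocab_to_aa`). The functions `encode` and `decode` are hypothetical helpers written for illustration, not part of the AbLang API, and the placement of padding after the end token is an assumption based on the outputs shown here.

```python
# Vocabulary as printed by heavy_ablang.tokenizer.vocab_to_aa in this notebook.
VOCAB_TO_AA = {
    0: "<", 21: "-", 22: ">", 23: "*",
    1: "M", 2: "R", 3: "H", 4: "K", 5: "D", 6: "E", 7: "S", 8: "T", 9: "N", 10: "Q",
    11: "C", 12: "G", 13: "P", 14: "A", 15: "V", 16: "I", 17: "F", 18: "Y", 19: "W", 20: "L",
}
AA_TO_VOCAB = {aa: idx for idx, aa in VOCAB_TO_AA.items()}


def encode(seqs):
    """Hypothetical helper: wrap each sequence in '<' ... '>' and right-pad with '-'."""
    wrapped = ["<" + s + ">" for s in seqs]
    width = max(len(w) for w in wrapped)
    return [[AA_TO_VOCAB[c] for c in w.ljust(width, "-")] for w in wrapped]


def decode(token_rows):
    """Hypothetical helper: map ids back to characters, dropping start, end and padding."""
    return ["".join(VOCAB_TO_AA[t] for t in row).strip("<>-") for row in token_rows]


toy = ["EVQLV", "QVQ"]
toy_tokens = encode(toy)
print(toy_tokens)          # [[0, 6, 15, 10, 20, 15, 22], [0, 10, 15, 10, 22, 21, 21]]
print(decode(toy_tokens))  # ['EVQLV', 'QVQ']
```

The first ids of the toy output (0, 6, 15, 10, 20, 15) match the start of the first row in the tensor above, which begins with the same five residues.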

#### AbLang encoder (AbRep)

In [4]:
rescodings = heavy_ablang.AbRep(tokens)
rescodings

AbRepOutput(last_hidden_states=tensor([[[ 0.3613, -0.5545, -1.3733,  ...,  0.7854,  1.0435,  1.4019],
         [-0.0073,  0.9119,  0.3939,  ...,  1.0638, -0.1027,  3.0370],
         [-0.1704, -0.3076, -0.1893,  ...,  0.0445, -1.1808,  0.9443],
         ...,
         [-1.4265,  1.7327, -2.7284,  ...,  0.3252,  0.8509,  0.3174],
         [-1.2368,  0.9761, -2.5828,  ...,  0.6224,  1.1142, -0.5260],
         [-0.5692, -0.2824, -1.0787,  ..., -1.4483,  1.0727,  0.8714]],

        [[ 0.9622,  0.1451, -1.3975,  ...,  1.1158,  0.9947,  0.9987],
         [-0.3677,  0.1645,  0.2982,  ...,  1.3163,  1.3903,  1.3972],
         [-0.0105,  0.7078,  0.6299,  ...,  0.9115, -0.2759,  0.5338],
         ...,
         [-1.6057,  0.6062, -0.0898,  ..., -0.0652,  0.2848,  1.1870],
         [-1.6057,  0.6062, -0.0898,  ..., -0.0652,  0.2848,  1.1870],
         [-1.6057,  0.6062, -0.0898,  ..., -0.0652,  0.2848,  1.1870]]],
       grad_fn=<NativeLayerNormBackward0>), all_hidden_states=None, attentions=None)

#### AbLang full model (AbRep+AbHead)

In [5]:
likelihoods = heavy_ablang.AbLang(tokens)
likelihoods

tensor([[[ 26.2209,  -0.1150,   1.1180,  ...,   7.1572,   8.2511,   7.0621],
         [ -3.2988,  -2.4648,  -1.9716,  ...,  -3.6925,  -2.8412,  -3.9486],
         [-13.8630,   2.9315,  -2.1880,  ..., -12.2973, -11.4024, -12.1913],
         ...,
         [-13.3679,  -5.8459,  -3.4050,  ..., -13.6232, -12.3902, -13.7758],
         [-14.0443,  -6.0580,  -2.5352,  ..., -14.2198, -11.2575, -14.2093],
         [  4.2206,   2.3581,  -0.0688,  ...,   5.4753,  28.2524,   5.1772]],

        [[ 26.0735,   2.7258,   0.6690,  ...,   5.0711,   4.7992,   5.3501],
         [ -3.1518,  -1.2000,   0.7464,  ...,  -3.9869,  -5.1648,  -4.0318],
         [-12.9766,   3.7377,  -3.5336,  ..., -11.4977, -11.5676, -11.5029],
         ...,
         [-10.0570,  -0.4543,   3.8572,  ..., -11.4345, -10.0851, -11.4500],
         [-10.0570,  -0.4543,   3.8572,  ..., -11.4345, -10.0851, -11.4500],
         [-10.0570,  -0.4543,   3.8572,  ..., -11.4345, -10.0851, -11.4500]]],
       grad_fn=<ViewBackward0>)

-----
## **AbLang module: Res-codings**

The res-codings are the 768 values for each residue, describing both a residue's individual properties (e.g. size, hydrophobicity, etc.) and properties in relation to the rest of the sequence (e.g. secondary structure, position, etc.). 

To calculate the res-codings, you can use the mode "rescoding" as seen below. 

In [6]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

rescodings = heavy_ablang(seqs, mode='rescoding')

print("-"*100)
print("The output shape of a single sequence:", rescodings[0].shape)
print("This shape is different for each sequence, depending on their length.")
print("-"*100)
print(rescodings)

----------------------------------------------------------------------------------------------------
The output shape of a single sequence: (121, 768)
This shape is different for each sequence, depending on their length.
----------------------------------------------------------------------------------------------------
[array([[-0.00730015,  0.911948  ,  0.3939441 , ...,  1.0638114 ,
        -0.10272545,  3.037028  ],
       [-0.17044666, -0.307552  , -0.18925877, ...,  0.04447165,
        -1.1808295 ,  0.9442834 ],
       [-2.013703  , -1.1266949 , -0.27024814, ..., -1.8903987 ,
        -0.28660858,  0.9681651 ],
       ...,
       [-0.8431327 , -0.3233702 , -1.4710448 , ..., -0.26041767,
         0.75437   ,  1.1806053 ],
       [-1.4264786 ,  1.7326753 , -2.728447  , ...,  0.32516536,
         0.85092   ,  0.3174294 ],
       [-1.2367772 ,  0.97612107, -2.5827718 , ...,  0.62239   ,
         1.1142055 , -0.5260254 ]], dtype=float32), array([[-0.36770692,  0.1645376 ,  0.29816028, .

---- 
An additional feature is the ability to align the res-codings. This can be done by setting the parameter align to True.

Alignment is done by numbering the sequences with ANARCI and then aligning them to all unique numberings found in the input antibody sequences.

**NB:** You need to install anarci and pandas for this feature.

In [7]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

rescodings = heavy_ablang(seqs, mode='rescoding', align=True)

print("-"*100)
print("The output shape for the aligned sequences ('aligned_embeds'):", rescodings[0].aligned_embeds.shape)
print("This output also includes this numberings ('number_alignment') used for this set of sequences.")
print("-"*100)
print(rescodings[0].aligned_embeds)
print(rescodings[0].number_alignment)

----------------------------------------------------------------------------------------------------
The output shape for the aligned sequences ('aligned_embeds'): (2, 129, 769)
This output also includes this numberings ('number_alignment') used for this set of sequences.
----------------------------------------------------------------------------------------------------
[[[-0.007300154771655798 0.911948025226593 0.39394411444664 ...
   -0.10272544622421265 3.0370280742645264 'E']
  [-0.17044666409492493 -0.3075520098209381 -0.18925876915454865 ...
   -1.1808295249938965 0.9442834258079529 'V']
  [-2.0137031078338623 -1.126694917678833 -0.270248144865036 ...
   -0.28660857677459717 0.9681650996208191 'Q']
  ...
  [-0.8431326746940613 -0.32337018847465515 -1.4710447788238525 ...
   0.7543699741363525 1.1806052923202515 'V']
  [-1.426478624343872 1.732675313949585 -2.7284469604492188 ...
   0.8509200215339661 0.31742939352989197 'S']
  [-1.2367771863937378 0.9761210680007935 -2.582771778
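
The idea of aligning to "all unique numberings" can be sketched with NumPy alone. This is not AbLang's implementation: the function `align_rescodings`, the `(position, insertion_code)` label format, the plain tuple sort, and the zero-filled gaps are all assumptions made for illustration. In particular, real antibody numbering schemes order insertion codes in more involved ways, and the sketch omits the extra residue-letter column seen in the output above.

```python
import numpy as np


def align_rescodings(embeddings, numberings):
    """Place per-residue embeddings on the union of all numbering labels.

    embeddings: list of (length_i, dim) arrays, one per sequence.
    numberings: list of lists of (position, insertion_code) tuples, one per residue.
    """
    # Union of every label seen in any sequence, sorted by position then insertion code.
    columns = sorted({label for numbering in numberings for label in numbering})
    column_index = {label: i for i, label in enumerate(columns)}

    dim = embeddings[0].shape[1]
    # Positions a sequence does not have stay at zero (an assumed gap value).
    aligned = np.zeros((len(embeddings), len(columns), dim), dtype=np.float32)
    for row, (emb, numbering) in enumerate(zip(embeddings, numberings)):
        for residue_emb, label in zip(emb, numbering):
            aligned[row, column_index[label]] = residue_emb
    return aligned, columns


# Toy data: two short "sequences" with made-up numberings and 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 8)).astype(np.float32)
emb_b = rng.normal(size=(3, 8)).astype(np.float32)
num_a = [(1, ""), (2, ""), (3, ""), (3, "A")]
num_b = [(1, ""), (3, ""), (4, "")]

aligned, columns = align_rescodings([emb_a, emb_b], [num_a, num_b])
print(aligned.shape)  # (2, 5, 8)
print(columns)        # [(1, ''), (2, ''), (3, ''), (3, 'A'), (4, '')]
```

Because every sequence is placed on the same set of columns, the same numbered position can be compared across sequences of different lengths.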

---------
## **AbLang module: Seq-codings**

Seq-codings are a set of 768 values for each sequence, derived from averaging across the res-codings. Seq-codings allow one to avoid sequence alignments, as every antibody sequence, regardless of its length, will be represented with 768 values.

In [8]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

seqcodings = heavy_ablang(seqs, mode='seqcoding')
print("-"*100)
print("The output shape of the seq-codings:", seqcodings.shape)
print("-"*100)

print(seqcodings)

----------------------------------------------------------------------------------------------------
The output shape of the seq-codings: (2, 768)
----------------------------------------------------------------------------------------------------
[[-0.66159597  0.13918797 -0.97155616 ... -0.94305375  0.11071647
   0.72706918]
 [-0.48282028  0.16598192 -0.56525127 ...  0.13565184  0.0851997
   0.80196542]]
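
As a rough illustration of the averaging step, the sketch below reduces variable-length res-codings to fixed-length vectors with NumPy. The arrays are random stand-ins, and whether AbLang uses exactly this unweighted mean over residues is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for the res-codings of two sequences with 121 and 118 residues.
toy_rescodings = [rng.normal(size=(121, 768)), rng.normal(size=(118, 768))]

# Average over the residue axis to get one fixed-length vector per sequence.
toy_seqcodings = np.stack([r.mean(axis=0) for r in toy_rescodings])
print(toy_seqcodings.shape)  # (2, 768)
```

The result has one row per sequence regardless of sequence length, which is what makes seq-codings usable without alignment.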


-----
## **AbLang module: Residue likelihood**

Res- and seq-codings are both derived from the representations created by AbRep. Another interesting representation is the set of likelihoods created by AbHead. These values are the likelihoods of each amino acid at each position in the sequence. They can be used to explore which amino acids a position is most likely to mutate into, and thereby explore the mutational space.

**NB:** Currently, the likelihoods include the start and end tokens and padding.

In [9]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

likelihoods = heavy_ablang(seqs, mode='likelihood')
print("-"*100)
print("The output shape with paddings still there:", likelihoods.shape)
print("-"*100)
print(likelihoods)

----------------------------------------------------------------------------------------------------
The output shape with paddings still there: (2, 123, 20)
----------------------------------------------------------------------------------------------------
[[[-0.11498421  1.1179807   1.1934879  ...  0.45755565 -0.72848666
   -3.068036  ]
  [-2.464817   -1.9716127   3.3576703  ... -2.3272204  -1.1907938
   -4.0330477 ]
  [ 2.9314642  -2.1880102  -3.6571255  ... -2.2125928  -2.5663288
    1.6678787 ]
  ...
  [-5.845918   -3.40504    -1.579061   ...  1.7031232  -4.064082
   -1.3621801 ]
  [-6.0580072  -2.5351667  -3.9164515  ... -1.8216585  -3.6267085
   -2.031465  ]
  [ 2.3581345  -0.06876343  1.9812663  ... -2.0339162  -1.4751194
   -1.2847601 ]]

 [[ 2.7258353   0.6690303  -0.5187334  ... -0.50959873 -1.5229917
   -1.004743  ]
  [-1.2000117   0.7463537   4.91817    ... -2.7072291  -0.76294684
   -1.1908685 ]
  [ 3.7377124  -3.533647   -4.2123537  ... -3.5040321  -1.6032338
    1.0644

### The corresponding amino acids for each likelihood

For each position, the likelihoods of the 20 amino acids are returned. The amino acid order can be found by looking at the AbLang vocabulary. For this output, the likelihoods for '<', '-', '>' and '\*' have been removed.

In [10]:
ablang_vocab = heavy_ablang.tokenizer.vocab_to_aa
ablang_vocab

{0: '<',
 21: '-',
 22: '>',
 23: '*',
 1: 'M',
 2: 'R',
 3: 'H',
 4: 'K',
 5: 'D',
 6: 'E',
 7: 'S',
 8: 'T',
 9: 'N',
 10: 'Q',
 11: 'C',
 12: 'G',
 13: 'P',
 14: 'A',
 15: 'V',
 16: 'I',
 17: 'F',
 18: 'Y',
 19: 'W',
 20: 'L'}
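
A minimal sketch of how the 20 likelihood columns might be mapped back to amino acids. It assumes that column j corresponds to vocabulary index j + 1, which is consistent with the values printed in this notebook but not stated explicitly. It also assumes row 0 of each sequence is the start token. The helper `top_residues` is hypothetical and the input is toy data, not model output.

```python
import numpy as np

# Amino acids for vocabulary indices 1..20, in the order printed above.
AA_ORDER = "MRHKDESTNQCGPAVIFYWL"


def top_residues(seq_likelihoods, seq_len, k=3):
    """Return the k highest-scoring amino acids for each residue position.

    seq_likelihoods: (padded_len, 20) array for one sequence, where row 0 is
    assumed to be the start token and rows 1..seq_len the residues.
    """
    residue_rows = seq_likelihoods[1:seq_len + 1]
    order = np.argsort(-residue_rows, axis=1)[:, :k]
    return ["".join(AA_ORDER[j] for j in row) for row in order]


# Toy likelihoods for start token + 3 residues + end token; values are made up.
toy = np.tile(-np.arange(20, dtype=float), (5, 1))
toy[1, AA_ORDER.index("E")] = 4.0
toy[1, AA_ORDER.index("Q")] = 3.0
toy[2, AA_ORDER.index("V")] = 6.0
toy[3, AA_ORDER.index("Q")] = 2.0

print(top_residues(toy, seq_len=3, k=2))  # ['EQ', 'VM', 'QM']
```

With real output, `seq_likelihoods` would be one row of the array returned in likelihood mode and `seq_len` the length of the corresponding input sequence.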

-----
## **AbLang module: Antibody sequence restoration**

In some cases, an antibody sequence is missing some residues. This can result from sequencing errors or limitations of current sequencing methods. To address this, AbLang has the "restore" mode, as seen below, which picks the amino acid with the highest likelihood for each residue marked with an asterisk (*).

In [11]:
seqs = [
    'EV*LVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    '*************PGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNK*YADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTL*****',
]

print("-"*100)
print("Restoration of masked residues.")
print("-"*100)
print(heavy_ablang(seqs, mode='restore'))

----------------------------------------------------------------------------------------------------
Restoration of masked residues.
----------------------------------------------------------------------------------------------------
['EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS'
 'QVQLVESGGGVVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS']
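
The filling step described above can be sketched as follows. This only illustrates "pick the highest-likelihood amino acid at each asterisk". In AbLang the likelihoods come from the model run on the masked sequence, whereas here they are toy values. The helper `restore_masked` is hypothetical, and the column order is assumed to follow vocabulary indices 1 to 20.

```python
import numpy as np

# Amino acids for vocabulary indices 1..20 (assumed column order of the likelihoods).
AA_ORDER = "MRHKDESTNQCGPAVIFYWL"


def restore_masked(seq, residue_likelihoods):
    """Replace each '*' in seq with the highest-scoring amino acid at that position.

    residue_likelihoods: (len(seq), 20) array with start/end rows already removed.
    """
    best = residue_likelihoods.argmax(axis=1)
    return "".join(AA_ORDER[best[i]] if aa == "*" else aa for i, aa in enumerate(seq))


masked = "EV*LV"
toy = np.zeros((len(masked), 20))
toy[2, AA_ORDER.index("Q")] = 5.0  # toy value favouring Q at the masked position
toy[0, AA_ORDER.index("A")] = 5.0  # ignored, because position 0 is not masked

print(restore_masked(masked, toy))  # EVQLV
```

Unmasked residues are left untouched, so only positions marked with an asterisk are changed.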


In cases where sequences are missing an unknown number of residues at the ends, we can use the "align=True" argument.

In [12]:
seqs = [
    'EV*LVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'PGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNK*YADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTL',
]

print("-"*100)
print("Restoration of masked residues and unknown missing end lengths.")
print("-"*100)
print(heavy_ablang(seqs, mode='restore', align=True))

----------------------------------------------------------------------------------------------------
Restoration of masked residues and unknown missing end lengths.
----------------------------------------------------------------------------------------------------
['EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS'
 'QVQLVESGGGVVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS']
