# **AbLang Examples**

AbLang is a BERT-inspired language model trained on antibody sequences. The examples below show some of its possible use cases.

In [2]:
import ablang

In [3]:
heavy_ablang = ablang.pretrained("heavy")
heavy_ablang.freeze()

-----
## **Res-codings**

The res-codings are 768 values for each residue, describing both the residue's individual properties (e.g. size and hydrophobicity) and its properties in relation to the rest of the sequence (e.g. secondary structure and position).

To calculate the res-codings, you can use the mode "rescoding" as seen below. 

In [3]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

rescodings = heavy_ablang(seqs, mode='rescoding')

print(rescodings)

print("The shape of the output of a single sequence:", rescodings[0].shape)
print(rescodings)

[array([[-0.00730008,  0.91194767,  0.3939439 , ...,  1.0638115 ,
        -0.10272609,  3.037028  ],
       [-0.17044747, -0.30755213, -0.18925896, ...,  0.04447161,
        -1.1808295 ,  0.94428295],
       [-2.0137024 , -1.1266947 , -0.27024856, ..., -1.8903985 ,
        -0.28660882,  0.9681646 ],
       ...,
       [-0.84313226, -0.32336968, -1.4710451 , ..., -0.2604175 ,
         0.7543702 ,  1.1806058 ],
       [-1.4264785 ,  1.7326759 , -2.7284472 , ...,  0.32516527,
         0.8509197 ,  0.31742918],
       [-1.2367766 ,  0.9761213 , -2.582772  , ...,  0.6223901 ,
         1.1142057 , -0.5260258 ]], dtype=float32), array([[-0.36770657,  0.16453761,  0.29815987, ...,  1.3162804 ,
         1.3902713 ,  1.3971791 ],
       [-0.01045511,  0.70775384,  0.6299132 , ...,  0.911532  ,
        -0.27585816,  0.5337534 ],
       [-2.359949  , -0.8066429 ,  0.23082013, ..., -0.9835925 ,
         0.07312769,  0.14750186],
       ...,
       [-0.594226  , -0.1477494 , -0.59627694, ...,  1.183
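
The output above is a list with one array per sequence, so sequences of different lengths give arrays with different numbers of rows. The following is a minimal sketch of zero-padding such a list into a single array plus a mask. It assumes each array has shape (number of residues, 768); `pad_rescodings` is a hypothetical helper, not part of AbLang, and the random arrays (with illustrative lengths) only stand in for real res-codings.

```python
import numpy as np

def pad_rescodings(rescodings, pad_value=0.0):
    """Stack variable-length (n_residues, dim) arrays into one padded array.

    Returns (padded, mask): padded has shape (n_seqs, max_len, dim) and
    mask is True wherever a real residue is present.
    """
    n_seqs = len(rescodings)
    max_len = max(r.shape[0] for r in rescodings)
    dim = rescodings[0].shape[1]
    padded = np.full((n_seqs, max_len, dim), pad_value, dtype=np.float32)
    mask = np.zeros((n_seqs, max_len), dtype=bool)
    for i, r in enumerate(rescodings):
        padded[i, : r.shape[0]] = r
        mask[i, : r.shape[0]] = True
    return padded, mask

# Random stand-ins for the list returned by mode='rescoding' (assumed shapes).
rng = np.random.default_rng(0)
fake_rescodings = [
    rng.normal(size=(121, 768)).astype(np.float32),
    rng.normal(size=(118, 768)).astype(np.float32),
]
padded, mask = pad_rescodings(fake_rescodings)
print(padded.shape)       # (2, 121, 768)
print(mask.sum(axis=1))   # number of real residues per sequence
```

Padding like this ignores where residues sit in the antibody, so position i in one sequence need not correspond to position i in another.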

---- 
An additional feature is the ability to align the res-codings. This can be done by setting the parameter align to True.

**NB:** You need to install anarci and pandas for this feature.

In [4]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

rescodings = heavy_ablang(seqs, mode='rescoding', align=True)

print("The shape of the output:", rescodings[0].aligned_embeds.shape)
print(rescodings[0].aligned_embeds)
print(rescodings[0].number_alignment)

The shape of the output: (2, 129, 769)
[[[-0.0073000784032046795 0.9119476675987244 0.3939439058303833 ...
   -0.10272609442472458 3.0370280742645264 'E']
  [-0.1704474687576294 -0.30755212903022766 -0.18925896286964417 ...
   -1.1808295249938965 0.9442829489707947 'V']
  [-2.013702392578125 -1.126694679260254 -0.27024856209754944 ...
   -0.28660881519317627 0.9681646227836609 'Q']
  ...
  [-0.8431322574615479 -0.32336968183517456 -1.4710451364517212 ...
   0.7543702125549316 1.1806057691574097 'V']
  [-1.4264785051345825 1.7326759099960327 -2.728447198867798 ...
   0.8509197235107422 0.31742918491363525 'S']
  [-1.23677659034729 0.9761213064193726 -2.5827720165252686 ...
   1.1142057180404663 -0.5260257720947266 'S']]

 [[-0.3677065670490265 0.16453760862350464 0.2981598675251007 ...
   1.3902713060379028 1.397179126739502 'Q']
  [-0.010455112904310226 0.7077538371086121 0.6299132108688354 ...
   -0.27585816383361816 0.5337533950805664 'V']
  [-2.3599491119384766 -0.8066428899765015 0
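
Judging from the printed output above, the aligned array has 769 columns: 768 numeric values followed by the residue letter. The following is a minimal sketch of separating the two parts for downstream numeric work. `split_aligned` is a hypothetical helper, not part of AbLang, and the small synthetic array only mimics the layout seen above; how gap positions are encoded is not shown here, so the sketch assumes every embedding entry is numeric.

```python
import numpy as np

def split_aligned(aligned):
    """Split an (n_seqs, n_positions, dim + 1) object array into
    float embeddings and the residue letters stored in the last column."""
    letters = aligned[..., -1].astype(str)
    embeds = aligned[..., :-1].astype(np.float32)
    return embeds, letters

# Synthetic stand-in: 2 sequences, 3 positions, 4 embedding columns (not 768).
rng = np.random.default_rng(0)
values = rng.normal(size=(2, 3, 4))
residues = np.array([list("EVQ"), list("QVQ")])
fake_aligned = np.concatenate(
    [values.astype(object), residues[..., None].astype(object)], axis=-1
)

embeds, letters = split_aligned(fake_aligned)
print(embeds.shape, embeds.dtype)   # (2, 3, 4) float32
print("".join(letters[0]))          # EVQ
```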

---------
## **Seq-codings**

Seq-codings are a set of 768 values for each sequence, derived by averaging across the res-codings. Seq-codings allow one to avoid sequence alignments, as every antibody sequence, regardless of its length, is represented by 768 values.

In [5]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

seqcodings = heavy_ablang(seqs, mode='seqcoding')
print("The shape of the output:", seqcodings.shape)
print(seqcodings)

The shape of the output: (2, 768)
[[-0.66159577  0.13918797 -0.97155634 ... -0.94305359  0.11071627
   0.72706922]
 [-0.48282028  0.16598191 -0.56525127 ...  0.13565185  0.08519978
   0.8019654 ]]


-----
## **Residue likelihood**

Res- and seq-codings are both derived from the representations created by AbRep. Another interesting representation is the set of likelihoods created by AbHead: the likelihood of each amino acid at each position in the sequence. These can be used to explore which amino acids a position is most likely to mutate into, and thereby to explore the mutational space.

**NB:** Currently, the likelihoods include the start and end tokens and padding.

In [6]:
seqs = [
    'EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    'QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVLGWGSMDVWGQGTTVTVSS'
    ]

likelihoods = heavy_ablang(seqs, mode='likelihood')
print("The shape of the output:", likelihoods.shape)
print(likelihoods)

The shape of the output: (2, 123, 20)
[[[-0.11498569  1.1179823   1.1934892  ...  0.45755526 -0.7284867
   -3.0680373 ]
  [-2.4648163  -1.9716133   3.3576705  ... -2.3272207  -1.1907938
   -4.0330467 ]
  [ 2.9314635  -2.1880107  -3.6571255  ... -2.2125928  -2.5663297
    1.6678782 ]
  ...
  [-5.8459177  -3.4050395  -1.5790601  ...  1.7031232  -4.0640826
   -1.3621796 ]
  [-6.0580077  -2.5351663  -3.9164515  ... -1.8216588  -3.626708
   -2.031465  ]
  [ 2.3581345  -0.06876329  1.9812675  ... -2.0339167  -1.4751202
   -1.2847607 ]]

 [[ 2.7258344   0.6690306  -0.5187323  ... -0.5095991  -1.5229932
   -1.0047419 ]
  [-1.2000118   0.7463538   4.91817    ... -2.7072291  -0.76294655
   -1.1908679 ]
  [ 3.7377117  -3.5336478  -4.2123528  ... -3.504032   -1.6032335
    1.0644858 ]
  ...
  [-0.45430183  3.8572206  -3.0571616  ... -2.6384842  -4.9668417
   -3.8441541 ]
  [-0.45430183  3.8572206  -3.0571616  ... -2.6384842  -4.9668417
   -3.8441541 ]
  [-0.4543      3.857221   -3.0571606  ... -2.
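
The printed values include negative numbers, so they look like unnormalised scores rather than probabilities. The following is a minimal sketch of ranking amino acids per position under that assumption, applying a softmax over the 20 columns. The column order in `ALPHABET` is a placeholder (the actual order is not given here), and `softmax` and `top_k_residues` are hypothetical helpers, not part of AbLang.

```python
import numpy as np

# Placeholder column order; the real ordering of the 20 columns is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def top_k_residues(scores, k=3, alphabet=ALPHABET):
    """For one sequence's (n_positions, 20) scores, return for each position
    a list of (residue, probability) pairs, most probable first."""
    probs = softmax(scores)
    order = np.argsort(-probs, axis=-1)[:, :k]
    return [
        [(alphabet[j], float(probs[i, j])) for j in row]
        for i, row in enumerate(order)
    ]

# Random stand-in for the scores of a 5-position sequence.
rng = np.random.default_rng(0)
fake_scores = rng.normal(size=(5, 20))
ranked = top_k_residues(fake_scores, k=2)
for position, candidates in enumerate(ranked):
    print(position, candidates)
```

With real output, the rows belonging to the start token, end token, and padding would need to be dropped first.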

-----
## **Antibody sequence restoration**

In some cases, an antibody sequence is missing some residues. This can result from sequencing errors or from limitations of current sequencing methods. To address this, AbLang has the "restore" mode, shown below, which picks the amino acid with the highest likelihood for each residue marked with an asterisk (*).

In [4]:
seqs = [
    'EV*LVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
    '*************PGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNK*YADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVT***',
]

heavy_ablang(seqs, mode='restore')

array(['EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS',
       'QVQLVESGGGVVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYYADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS'],
      dtype='<U121')
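
The selection rule described in this section, picking the highest-likelihood amino acid at each masked position, can be sketched as follows. This is not AbLang's implementation: `fill_masked` is a hypothetical helper, the column order in `ALPHABET` is a placeholder, and the score matrix is assumed to have one row per residue with start token, end token, and padding rows already removed.

```python
import numpy as np

# Placeholder column order; the real ordering of the 20 columns is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def fill_masked(seq, scores, alphabet=ALPHABET, mask_char="*"):
    """Replace each mask_char in seq with the highest-scoring amino acid.

    scores must have shape (len(seq), len(alphabet)); unmasked residues
    are returned unchanged.
    """
    if scores.shape != (len(seq), len(alphabet)):
        raise ValueError("scores must have one row per residue in seq")
    best = scores.argmax(axis=-1)
    return "".join(
        alphabet[best[i]] if ch == mask_char else ch
        for i, ch in enumerate(seq)
    )

# Toy example: only position 2 is masked, and its scores favour 'Q'.
masked = "EV*LV"
toy_scores = np.zeros((len(masked), len(ALPHABET)))
toy_scores[2, ALPHABET.index("Q")] = 5.0
restored = fill_masked(masked, toy_scores)
print(restored)  # EVQLV
```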