This notebook shows how to use METL models through Hugging Face to predict on more sequences than the demo allows.

First, we will install the extra dependencies, download an example PDB file, and import the required 🤗 modules in order to download the METL wrapper through their API.

In [None]:
# @title Installing libraries not included with colab
!pip install -q biopandas==0.5.1


In [None]:
# @title download the example pdb file
!wget -O 2qmt_p.pdb https://raw.githubusercontent.com/gitter-lab/metl-pretrained/main/pdbs/2qmt_p.pdb

--2024-08-16 22:23:13--  https://raw.githubusercontent.com/gitter-lab/metl-pretrained/main/pdbs/2qmt_p.pdb
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78764 (77K) [text/plain]
Saving to: ‘2qmt_p.pdb’


2024-08-16 22:23:13 (4.79 MB/s) - ‘2qmt_p.pdb’ saved [78764/78764]



In [None]:
# @title Importing required libraries
from transformers import AutoModel, AutoConfig, logging
import ipywidgets as widgets
from IPython.display import clear_output, HTML, display
import pandas as pd
import torch
from huggingface_hub import login
import io
import json
import biopandas

logging.set_verbosity_error()

# Declaring this here so that it's available regardless if later cells are run or not
variant_file = None
pdb_file_path = '2qmt_p.pdb'

Next we will define a necessary helper function for later on in the file.

In [None]:
# @title To zero based helper function
def to_zero_based(variants):
    """Shift residue positions from 1-based to 0-based in each JSON-list line of variants."""
    zero_based = []
    for line in variants:
        line_as_json = json.loads(line)
        new_variants = []
        for variant in line_as_json:
            new_variant = []
            mutations = variant.split(',')
            for mutation in mutations:
                residue_zero_based = int(mutation[1:-1]) - 1
                new_variant.append(f"{mutation[0]}{residue_zero_based}{mutation[-1]}")
            new_variants.append(",".join(new_variant))
        zero_based.append(new_variants)

    return zero_based

We will then load a METL model through the 🤗 API. trust_remote_code=True is required to use METL models through 🤗.

In [None]:
# @title Loading METL from 🤗
from google.colab import userdata

API_KEY = userdata.get('SPACE_KEY')
login(API_KEY)
metl = AutoModel.from_pretrained('gitter-lab/METL', trust_remote_code=True)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


config.json:   0%|          | 0.00/282 [00:00<?, ?B/s]

huggingface_wrapper.py:   0%|          | 0.00/100k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/176 [00:00<?, ?B/s]

The METL 🤗 wrapper requires loading a specific METL model after the `metl` variable above has been initialized. Use the dropdown below to select a model to use for predicting.

The publicly available METL models are hosted on Zenodo [here](https://zenodo.org/records/11051645). **TODO: Add some link to where we eventually document what each metl model does**

In [None]:
# @title Available metl models
# @markdown Use this dropdown to choose a METL model to predict with. Running this cell will load the selected model.
metl_model = 'metl-l-2m-3d-gb1' # @param ["metl-g-20m-1d","metl-g-20m-3d","metl-g-50m-1d","metl-g-50m-3d","metl-l-2m-1d-gfp","metl-l-2m-3d-gfp","metl-l-2m-1d-dlg4","metl-l-2m-3d-dlg4","metl-l-2m-1d-gb1","metl-l-2m-3d-gb1","metl-l-2m-1d-grb2","metl-l-2m-3d-grb2","metl-l-2m-1d-pab1","metl-l-2m-3d-pab1","metl-l-2m-1d-tem-1","metl-l-2m-3d-tem-1","metl-l-2m-1d-ube4b","metl-l-2m-3d-ube4b","metl-bind-2m-3d-gb1-standard","metl-bind-2m-3d-gb1-binding","metl-l-2m-1d-gfp-ft-design","metl-l-2m-3d-gfp-ft-design"]
metl.load_from_ident(metl_model)

Downloading: "https://zenodo.org/records/11051645/files/METL-L-2M-3D-GB1-epegcFiH.pt?download=1" to /root/.cache/torch/hub/checkpoints/epegcFiH.pt
100%|██████████| 9.40M/9.40M [00:01<00:00, 7.44MB/s]


Initialized PDB bucket matrices in: 0.000
Initialized PDB bucket matrices in: 0.000


Depending on the model chosen, different files might be needed. This example is set up to use metl-l-2m-3d-gb1 and will need a few inputs for prediction. **Link to the model descriptions here again to describe the IO of each model**.

Specifically, for this 3d gb1 model we will need:
- A wild type sequence
- A PDB structure file (as this is a 3d model)
- Variants to use with METL

In [None]:
# @title Protein wild type
# @markdown Enter the wild type of your protein here. The wildtype for gb1 is provided to use with the default model example here.
wildtype = 'MQYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE' # @param {type:"string", placeholder:"Enter a wildtype here"}

The PDB file is something that we will reference from within the colab through its file path. If you don't need a PDB file for your use-case, simply set this variable to None.

In [None]:
# @title PDB file upload
# @markdown If your model needs a PDB file, run this cell and upload the file with the provided button that appears below.
# @markdown
# @markdown If you would like to change the file, simply upload another one. The last uploaded file will be what is used.
# @markdown If you would like to predict with the pre-loaded GB1 model, download [this pdb file](https://github.com/gitter-lab/metl-pretrained/blob/main/pdbs/2qmt_p.pdb)

def update_pdb_file(file_name):
  global pdb_file_path
  for name, data in file_name['new'].items():
    clear_output()
    display(pdb_upload)
    print(f"Selected file: {name}")
    pdb_file_path = f'./{name}'

    with open(name, 'wb') as f:
      f.write(data['content'])

pdb_upload = widgets.FileUpload(
    accept='.pdb',
    multiple=False
)
pdb_upload.observe(update_pdb_file, names='value')
pdb_upload

FileUpload(value={}, accept='.pdb', description='Upload')

Lastly, we will collect some variants. The code in this notebook supports variants in JSON list format: each line is a JSON list of variants, and each variant is a comma-separated string of mutations. Upload a file (two cells below), or enter JSON-list-formatted variants in the text box below.
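
As a minimal sketch of this input format, the snippet below parses two lines of JSON lists and splits each variant into its individual mutations. The helper name `parse_variant_lines` is hypothetical and is not part of METL or this notebook's prediction code.

```python
import json

def parse_variant_lines(lines):
    """Parse each non-empty line as a JSON list of variant strings."""
    parsed = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        variants = json.loads(line)
        # Each variant is a comma-separated string of mutations, e.g. "T17P,T54F"
        parsed.append([variant.split(",") for variant in variants])
    return parsed

example = '["T17P,T54F", "V28L,F51A"]\n["T17P"]'
print(parse_variant_lines(example.splitlines()))
```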

In [None]:
# @title Variant text input
variants_string = """["T17P,T54F", "V28L,F51A", "T17P,V28L,F51A,T54F"]
["T13P,T33F"]"""
style = {'description_width':'initial'}

variant_text = widgets.Textarea(
    value='',
    placeholder=variants_string,
    description='Variant String:',
    disabled=False,
    style = style,
    layout=widgets.Layout(height='100px', width='500px'),
)

variant_text.add_class('variant_text_area')

style = """
<style>
  .variant_text_area > textarea::placeholder {
    color: var(--colab-primary-text-color);
  }

  .variant_text_area > textarea {
    background-color: var(--colab-secondary-surface-color);
    color: var(--colab-primary-text-color);
  }
</style>
"""

display(HTML(style))
display(variant_text)

Textarea(value='', description='Variant String:', layout=Layout(height='100px', width='500px'), placeholder='[…

If you would rather upload a file, run the cell below and use it to upload one. Note that the text box above takes precedence: the uploaded file is only used when the text box is empty.


In [None]:
# @title Variant file upload
# @markdown If you want to upload a variant JSON file, run this cell and upload the file with the provided button that appears below.


def update_variant_file(button_input):
  global variant_file
  for name, data in button_input['new'].items():
    clear_output()
    display(variant_upload)
    print(f'Loaded file: {name}')
    variant_file = data['content'].decode('utf-8').splitlines()

variant_upload = widgets.FileUpload(
    accept='.json, .txt',
    multiple=False
)

variant_upload.observe(update_variant_file, names='value')
variant_upload

FileUpload(value={}, accept='.json, .txt', description='Upload')

In [None]:
# @title Variant Selecting Logic (just run)

clear_output()
if len(variant_text.value) > 0:
  print("Using text area input")
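  # Split into lines so each JSON list is processed separately, like the file and placeholder inputs
  variants = variant_text.value.splitlines()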
elif variant_file:
  print("Using variants file")
  variants = variant_file
else:
  print("Using variant placeholder")
  variants = variant_text.placeholder.splitlines()

Using variant placeholder


Biologists commonly use one-based indexing for residue positions. However, METL models were designed to use zero-based indexing. If your variants use one-based indexing, select "1" in the dropdown below so that they are converted before prediction.
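
To illustrate the conversion, here is a small sketch that shifts a single one-based variant to zero-based positions. The function name `shift_to_zero_based` is hypothetical and only mirrors the idea behind the helper defined earlier; it assumes single-letter residues on either side of the position.

```python
def shift_to_zero_based(variant):
    """Subtract 1 from the position of each mutation in a comma-separated variant."""
    shifted = []
    for mutation in variant.split(","):
        # A mutation looks like "T18P": wild-type residue, position, new residue
        wt_residue, position, new_residue = mutation[0], int(mutation[1:-1]), mutation[-1]
        shifted.append(f"{wt_residue}{position - 1}{new_residue}")
    return ",".join(shifted)

print(shift_to_zero_based("T18P,T55F"))  # T17P,T54F
```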

In [None]:
# @title Transform input from 1 based indexing to zero based indexing
# @markdown Select indexing for residue mutations
indexing = "0" # @param ['0', '1']

Since the file and text inputs both produce the same structure, we only need one variable moving forward. We will use the `variants` variable.

To predict with METL, we use the loaded model and its encoder together with the variables defined above. Because the input has multiple lines of variants, we wrap the prediction in a for loop that handles one line at a time.
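
Before predicting, it can help to check that each mutation's wild-type residue actually matches the wild-type sequence at that zero-based position. The sketch below is one way to do this in plain Python; `find_mismatches` is a hypothetical helper, and whether METL's encoder performs its own validation is not assumed here.

```python
GB1_WILDTYPE = "MQYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE"

def find_mismatches(wildtype, variant):
    """Return mutations whose stated wild-type residue disagrees with the sequence (zero-based)."""
    mismatches = []
    for mutation in variant.split(","):
        wt_residue, position = mutation[0], int(mutation[1:-1])
        # Flag positions past the end of the sequence or with a different residue
        if position >= len(wildtype) or wildtype[position] != wt_residue:
            mismatches.append(mutation)
    return mismatches

print(find_mismatches(GB1_WILDTYPE, "T17P,T54F"))  # []
print(find_mismatches(GB1_WILDTYPE, "T13P"))       # ['T13P']
```

In the second call, position 13 of the GB1 sequence is G rather than T, so the mutation is reported.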

In [None]:
# @title METL Predicting loop
output = []

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

metl = metl.to(device)

if indexing == "1":
  predict_variants = to_zero_based(variants)
else:
  predict_variants = variants

for variant in predict_variants:
    # First in METL we need to encode our variants
    if not isinstance(variant, list):
      variant = json.loads(variant)
    encoded_variants = metl.encoder.encode_variants(wildtype, variant)

    #Next, we predict
    with torch.no_grad():
        if pdb_file_path:
            predictions = metl(torch.tensor(encoded_variants).to(device), pdb_fn=pdb_file_path)
        else:
            predictions = metl(torch.tensor(encoded_variants).to(device))

        output.append({
            "wt": wildtype,
            "variants": variant,
            "logits": predictions.tolist()
        })

Finally, we will save our output as a list of JSON objects, one per input line, each holding the wild type, the variants, and the predictions.

In [None]:
# @title Saving the predictions
with open('./output.json', 'w') as f:
    json.dump(output, f, indent=2)
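
To work with the saved file later, the records can be read back and flattened into one row per variant. This is a minimal sketch that assumes `logits` holds one row per variant, which depends on the model's output shape. The sample record uses a toy wild type and dummy zero values purely as placeholders, and `flatten_predictions` is a hypothetical helper.

```python
import json
import os
import tempfile

# Placeholder record mimicking the structure written above; values are dummies, not real predictions.
sample_output = [
    {"wt": "MQYK", "variants": ["M0A", "Q1L,Y2F"], "logits": [[0.0], [0.0]]}
]

def flatten_predictions(records):
    """Pair each variant with its prediction row (assumes one row of logits per variant)."""
    rows = []
    for record in records:
        for variant, logits in zip(record["variants"], record["logits"]):
            rows.append({"variant": variant, "prediction": logits})
    return rows

# Round-trip through a temporary file to show the save/load cycle
path = os.path.join(tempfile.mkdtemp(), "output.json")
with open(path, "w") as f:
    json.dump(sample_output, f, indent=2)
with open(path) as f:
    rows = flatten_predictions(json.load(f))
print(rows)
```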