# De Novo Protein Design Workflow using NIMs on AIF w/ Jupyter-hub


This example notebook outlines a workflow for creating de novo protein binders using NVIDIA Inference Microservices (NIMs). This workflow leverages advanced AI models to enable computational biologists to design novel protein structures efficiently.

The input to this workflow is a protein sequence, which is then fed to AlphaFold2 for structural prediction; alternatively, this can be skipped and a precomputed protein structure (in PDB format) can be used as input. Protein backbones are then generated with RFDiffusion, sequences are generated with ProteinMPNN, and finally complex structures are predicted with AlphaFold2-multimer. 

This setup provides a powerful framework for exploring protein design, offering flexibility and precision in generating functional protein binders. For more information, refer to the respective repositories and documentation.

## Getting started with Demo NIMs

This setup is performed by the AI-Factory `blueprint`.

Initial startup of the `AlphaFold NIM` is time-consuming, and its data requires roughly 1.2TB of disk space.

After the AIF NIM setup is complete, check the status of the four running NIMs, for example with the following commands. Note that the internal service ports can now be used instead of the public addresses.

```bash
curl -fsS http://alphafold.protein-binder-design:8081/v1/health/ready
curl -fsS http://rfdiffusion.protein-binder-design:8082/v1/health/ready
curl -fsS http://proteinmpnn.protein-binder-design:8083/v1/health/ready
curl -fsS http://alphafold-multimer.protein-binder-design:8084/v1/health/ready
```

In [None]:
!curl -fsS http://alphafold.protein-binder-design:8081/v1/health/ready

First, we'll install some prerequisites so our examples work.

In [None]:
! pip install requests

In [None]:
from __future__ import annotations
from enum import Enum, StrEnum
from pathlib import Path
from typing import Dict, Any, List, Tuple, Optional, Union
import os, requests, json, time

One needs to use an NGC Personal Key to run the examples below.

In [None]:
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY") or input("Paste Run Key: ")

In [None]:
# --- Kubernetes wiring
NAMESPACE = "protein-binder-design"
SERVICES: Dict[str, int] = {
    "alphafold":          8081,
    "rfdiffusion":        8082,
    "proteinmpnn":        8083,
    "alphafold-multimer": 8084,
}
SERVICES_BY_PORT = {port: svc for svc, port in SERVICES.items()}

def svc_url(service: str, ns: str = NAMESPACE, port: Optional[int] = None, scheme: str = "http") -> str:
    p = SERVICES[service] if port is None else int(port)
    return f"{scheme}://{service}.{ns}:{p}"

# --- Auth header (optional); falls back to the key entered above if NGC_API_KEY is unset
HEADERS: Dict[str, str] = {}
if (k := os.getenv("NGC_API_KEY") or globals().get("NVIDIA_API_KEY")):
    HEADERS["Authorization"] = f"Bearer {k}"

# --- Requests session; ignore any proxy envs inside the pod
SESSION = requests.Session()
SESSION.trust_env = False

# --- Make sure in-cluster names never go through a proxy (harmless if no proxy present)
no_proxy = [
    "localhost","127.0.0.1",
    ".svc",".svc.cluster.local",".cluster.local",
    "10.0.0.0/8","172.16.0.0/12","192.168.0.0/16",
    "10.233.0.0/16",  # adjust if your cluster CIDR differs
]
no_proxy += [f"{s}.{NAMESPACE}" for s in SERVICES]
os.environ["NO_PROXY"] = ",".join(no_proxy)
os.environ["no_proxy"] = os.environ["NO_PROXY"]

In [None]:
class NIM_PORTS(Enum):
    ALPHAFOLD2_PORT  = 8081
    RFDIFFUSION_PORT = 8082
    PROTEINMPNN_PORT = 8083
    AF2_MULTIMER_PORT= 8084

class NIM_ENDPOINTS(StrEnum):
    ALPHAFOLD2   = "protein-structure/alphafold2/predict-structure-from-sequence"
    RFDIFFUSION  = "biology/ipd/rfdiffusion/generate"
    PROTEINMPNN  = "biology/ipd/proteinmpnn/predict"
    AF2_MULTIMER = "protein-structure/alphafold2/multimer/predict-structure-from-sequences"

ENDPOINT_TO_SERVICE = {
    NIM_ENDPOINTS.ALPHAFOLD2:   "alphafold",
    NIM_ENDPOINTS.RFDIFFUSION:  "rfdiffusion",
    NIM_ENDPOINTS.PROTEINMPNN:  "proteinmpnn",
    NIM_ENDPOINTS.AF2_MULTIMER: "alphafold-multimer",
}

In [None]:
# Health check (accepts service name, port, or enum)
def check_nim_readiness(
    target: Union[str, int, NIM_PORTS],
    ns: str = NAMESPACE,
    ready_path: str = "/v1/health/ready",
    timeout: int = 5
) -> bool:
    # normalize to (service, port)
    service: Optional[str] = None
    port: Optional[int] = None

    if isinstance(target, NIM_PORTS):
        port = int(target.value)
        service = SERVICES_BY_PORT.get(port)
    elif isinstance(target, int):
        port = target
        service = SERVICES_BY_PORT.get(port)
    else:
        service = str(target)
        port = SERVICES.get(service)

    if service is None or port is None:
        raise ValueError(f"Unknown service/port: {target}")

    url = f"{svc_url(service, ns, port=port)}/{ready_path.lstrip('/')}"
    try:
        r = SESSION.get(url, headers=HEADERS, timeout=timeout)
        return r.ok and r.json().get("status") == "ready"
    except Exception as e:
        print(f"[readiness:{service}] {e}")
        return False

Main request helper (cluster-native; supports long jobs and infinite wait)
* Uses `{service}.{namespace}:{port}` automatically.
* `read_timeout=0` or `-1` ⇒ wait forever.
* For long AF2 jobs, either set a very large timeout or keep it infinite.

In [None]:
LONG_ENDPOINTS = {NIM_ENDPOINTS.ALPHAFOLD2, NIM_ENDPOINTS.AF2_MULTIMER}

def query_nim(payload: Dict[str, Any],
              nim_endpoint: NIM_ENDPOINTS | str,
              *,
              service: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None,
              ns: str = NAMESPACE,
              connect_timeout: int = 5,
              read_timeout: Optional[int] = None,   # <=0 => wait forever
              echo: bool = False) -> Tuple[int, Dict]:

    # normalize endpoint
    if isinstance(nim_endpoint, str):
        nim_endpoint = NIM_ENDPOINTS(nim_endpoint)

    service = service or ENDPOINT_TO_SERVICE[nim_endpoint]
    base = svc_url(service, ns)
    url  = f"{base}/{str(nim_endpoint).lstrip('/')}"

    hdrs = dict(HEADERS)
    if headers: hdrs.update(headers)

    # default read timeout: long for AF2, moderate for others
    if read_timeout is None:
        read_timeout = 5400 if nim_endpoint in LONG_ENDPOINTS else 900

    timeout = None if (isinstance(read_timeout, (int, float)) and read_timeout <= 0) \
              else (connect_timeout, read_timeout)

    if echo:
        print("*"*80)
        print("URL:", url)
        print("Payload keys:", list(payload.keys()))
        print(f"Timeouts: {timeout!r}")  # None => wait forever
        print("*"*80)

    r = SESSION.post(url, json=payload, headers=hdrs, timeout=timeout)
    if r.status_code == 200:
        try:
            return r.status_code, r.json()
        except Exception:
            return r.status_code, {"text": r.text}
    raise Exception(f"Error {r.status_code}: {r.text[:500]}")

In [None]:
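def get_reduced_pdb(pdb_id: str, rcsb_path: Optional[str] = None) -> str:
    """Return only the ATOM records of a PDB file, downloading it first if needed."""
    pdb = Path(pdb_id)
    if not pdb.exists() and rcsb_path is not None:
        # Download once and cache the file locally
        resp = requests.get(rcsb_path, timeout=60)
        resp.raise_for_status()
        pdb.write_text(resp.text)
    lines = [line for line in pdb.read_text().splitlines() if line.startswith("ATOM")]
    return "\n".join(lines)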


In [None]:
class ExampleRequestParams:
    def __init__(self,
                target_sequence: str,
                contigs: str, 
                hotspot_res: List[str],
                input_pdb_chains: List[str],
                ca_only: bool,
                use_soluble_model: bool,
                sampling_temp: List[float],
                diffusion_steps: int = 15,
                num_seq_per_target: int = 20):
        self.target_sequence = target_sequence
        self.contigs = contigs
        self.hotspot_res = hotspot_res
        self.input_pdb_chains = input_pdb_chains
        self.ca_only = ca_only
        self.use_soluble_model = use_soluble_model
        self.sampling_temp = sampling_temp
        self.diffusion_steps = diffusion_steps
        self.num_seq_per_target = num_seq_per_target

### Example data
Below, we include three example input sets. Note that these are of varying difficulty and will exhibit different runtimes and resource utilizations.
- Example **1R42** should run on most systems with 4 GPUs with 40GB of VRAM or more.
- Example **5PTN**
- Example **6VXX** requires 4 GPUs with 80GB of VRAM each.

In [None]:
example_6vxx = ExampleRequestParams(
    target_sequence="MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPSGAGSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDPPEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKGSGRENLYFQGGGGSGYIPEAPRDGQAYVRKDGEWVLLSTFLGHHHHHHHH",
    contigs="A353-410/0 100-200",
    hotspot_res=["A360","A361","A362","A366"],
    input_pdb_chains=["A"],
    ca_only=False,
    use_soluble_model=False,
    sampling_temp=[0.1],
    diffusion_steps=15,
    num_seq_per_target=20
)
example_5ptn = ExampleRequestParams(
    target_sequence="NITEEFYQSTCSAVSKGYLSALRTGWYTSVITIELSNIKKIKCNGTDAKIKLIKQELDKYKNAVTELQLLMQSTPATNNQARGSGSGRSLGFLLGVGSAIASGVAVSKVLHLEGEVNKIKSALLSTNKAVVSLSNGVSVLTSKVLDLKNYIDKQLLPIVNKQSCSIPNIETVIEFQQKNNRLLEITREFSVNAGVTTPVSTYMLTNSELLSLINDMPITNDQKKLMSNNVQIVRQQSYSIMSIIKEEVLAYVVQLPLYGVIDTPCWKLHTSPLCTTNTKEGSNICLTRTDRGWYCDNAGSVSFFPQAETCKVQSNRVFCDTMNSLTLPSEVNLCNVDIFNPKYDCKIMTSKTDVSSSVITSLGAIVSCYGKTKCTASNKNRGIIKTFSNGCDYVSNKGVDTVSVGNTLYYVNKQEGKSLYVKGEPIINFYDPLVFPSDQFDASISQVNEKINQSLAFIRKSDELLSAIGGYIPEAPRDGQAYVRKDGEWVLLSTFLGGLVPRGSHHHHHH",
    contigs="A1-25/0 70-100",
    hotspot_res=["A14","A15","A17","A18"],
    input_pdb_chains=["A"],
    ca_only=False,
    use_soluble_model=False,
    sampling_temp=[0.1],
    diffusion_steps=15,
    num_seq_per_target=20
)
example_1r42 = ExampleRequestParams(
    target_sequence="STIEEQAKTFLDKFNHEAEDLFYQSSLASWNYNTNITEENVQNMNNAGDKWSAFLKEQSTLAQMYPLQEIQNLTVKLQLQALQQNGSSVLSEDKSKRLNTILNTMSTIYSTGKVCNPDNPQECLLLEPGLNEIMANSLDYNERLWAWESWRSEVGKQLRPLYEEYVVLKNEMARANHYEDYGDYWRGDYEVNGVDGYDYSRGQLIEDVEHTFEEIKPLYEHLHAYVRAKLMNAYPSYISPIGCLPAHLLGDMWGRFWTNLYSLTVPFGQKPNIDVTDAMVDQAWDAQRIFKEAEKFFVSVGLPNMTQGFWENSMLTDPGNVQKAVCHPTAWDLGKGDFRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGFHEAVGEIMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKGEIPKDQWMKKWWEMKREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQEALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRLGKSEPWTLALENVVGAKNMNVRPLLNYFEPLFTWLKDQNKNSFVGWSTDWSPYAD",
    contigs="A114-353/0 50-100",
    hotspot_res=["A119","A123","A233","A234","A235"],
    input_pdb_chains=["A"],
    ca_only=False,
    use_soluble_model=False,
    sampling_temp=[0.1],
    diffusion_steps=15,
    num_seq_per_target=20
)

In [None]:
## Set the example here to switch example inputs.
## Note: Example 6vxx requires 4 GPUs with 80GB of VRAM each (see the list above).
example = example_5ptn

### Check that the NIM is ready from Python

We can test whether each NIM is up and running using our `check_nim_readiness` function.

In [None]:
for s in ("alphafold","rfdiffusion","proteinmpnn","alphafold-multimer"):
    print(s, "ready:", check_nim_readiness(s))
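
Because the AlphaFold2 NIM can take a long time to start, a one-off readiness check may return `False` simply because the service is still loading. The following is a minimal sketch of a polling helper; `wait_until_ready` and its parameters are hypothetical names introduced here for illustration, not part of the NIM API. It accepts any zero-argument callable, so it can wrap `check_nim_readiness`.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     timeout_s: float = 600.0,
                     interval_s: float = 10.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage in this notebook (assumes check_nim_readiness is defined as above):
# wait_until_ready(lambda: check_nim_readiness("alphafold"), timeout_s=3600)
```

The default timeout and interval are arbitrary; adjust them to the startup time observed on your cluster.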

# AlphaFold2

AlphaFold2 is a deep learning model for predicting protein structure from amino acid sequence that has achieved state-of-the-art performance. The NVIDIA AlphaFold2 NIM includes GPU-accelerated MMseqs2, which accelerates the MSA portion of the structural prediction pipeline.

**Inputs**:
- `sequence`: An amino acid sequence
- `algorithm`: The algorithm used for Multiple Sequence Alignment (MSA). This can be either `jackhmmer` or `mmseqs2`. MMseqs2 is significantly faster.

**Outputs**:
- A list of predicted structures in PDB format.

In [None]:
from datetime import datetime
# Estimated 9 minutes on H100 for example 5ptn
alphafold2_query = {
    "sequence": example.target_sequence,
    "algorithm": "mmseqs2",  # if this needs internet, ensure namespace egress allows it
}

print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] start predicted structures in PDB format")
rc, alphafold2_response = query_nim(
    payload=alphafold2_query,
    nim_endpoint=NIM_ENDPOINTS.ALPHAFOLD2,  # enum, not .value
    read_timeout=0,                         # wait forever (or use a big number)
    echo=True
)
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] end predicted structures in PDB format")

In [None]:
## Print the first two lines (160 characters) of the alphafold2 response
alphafold2_response[0][0:160]
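
Since the structure prediction step is slow, it can be worth writing the returned PDB strings to disk so that later steps can be rerun without repeating it. The sketch below assumes the response is a list of PDB-formatted strings, as described above; `save_structures`, the output directory, and the file prefix are hypothetical names chosen for illustration.

```python
from pathlib import Path
from typing import List, Sequence


def save_structures(structures: Sequence[str],
                    out_dir: str,
                    prefix: str = "prediction") -> List[Path]:
    """Write each PDB string to `<out_dir>/<prefix>_<index>.pdb` and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for i, pdb_text in enumerate(structures):
        path = out / f"{prefix}_{i}.pdb"
        path.write_text(pdb_text)
        paths.append(path)
    return paths


# Usage in this notebook (assumes alphafold2_response is a list of PDB strings):
# save_structures(alphafold2_response, "outputs/alphafold2", prefix="af2")
```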

# RFDiffusion

This section demonstrates how to use RFDiffusion NIM in a *de novo* protein design workflow. Inspired by AI image generation models, RFDiffusion applies generative diffusion techniques to create novel protein structures. It excels in designing complex protein architectures, including binders and symmetric assemblies, by sculpting atomic clouds into functional proteins.

**Inputs**
- `input_pdb` is the protein target in PDB format
- `contigs` is the RFDiffusion language for specifying which regions to work on. See the official [RFDiffusion repo](https://github.com/RosettaCommons/RFdiffusion?tab=readme-ov-file#running-the-diffusion-script) for a full breakdown. For example, `A20-60/0 50-100` means: generate a binder to residues 20-60 of chain A, where the binder is 50-100 residues long. The `/0` specifies a chain break.
- `hotspot_res` lists the hotspot residues on the target (specifically for binders)
- `diffusion_steps` is the number of diffusion steps

**Output**:
- `output_pdb` is the output pdb
- `protein` is the input pdb
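
To make the contig notation described above more concrete, here is a small illustrative parser. It is only a sketch that covers the forms appearing in this notebook (fixed target segments such as `A20-60`, length ranges such as `50-100`, and the `0` chain-break marker); it is not RFDiffusion's own parser, and `parse_contigs` and the token dictionary layout are hypothetical.

```python
import re
from typing import Dict, List, Union

Token = Dict[str, Union[str, int]]


def parse_contigs(contigs: str) -> List[Token]:
    """Split a contig string into fixed segments, generated ranges, and chain breaks."""
    tokens: List[Token] = []
    for part in re.split(r"[/\s]+", contigs.strip()):
        if not part:
            continue
        if part == "0":
            # A bare 0 marks a chain break
            tokens.append({"kind": "chain_break"})
        elif m := re.fullmatch(r"([A-Za-z])(\d+)-(\d+)", part):
            # Chain letter plus residue range: a segment kept from the input PDB
            tokens.append({"kind": "fixed", "chain": m.group(1),
                           "start": int(m.group(2)), "end": int(m.group(3))})
        elif m := re.fullmatch(r"(\d+)(?:-(\d+))?", part):
            # Plain number or range: length of newly generated residues
            low = int(m.group(1))
            high = int(m.group(2) or m.group(1))
            tokens.append({"kind": "generated", "min_len": low, "max_len": high})
        else:
            raise ValueError(f"Unrecognized contig element: {part!r}")
    return tokens


for token in parse_contigs("A20-60/0 50-100"):
    print(token)
```

Printing the tokens before submitting a job is a cheap way to catch a mistyped chain letter or residue range.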

In [None]:
## Expected runtime: ~15 seconds to 1 minute
## H100 runtime: 2 seconds
rfdiffusion_query = {
        "input_pdb" : alphafold2_response[0], ## Take the first structure prediction (of 5) from AlphaFold2
        "contigs" : "51-51/A163-181/60-60", #example.contigs
        # "hotspot_res" : example.hotspot_res,
        "diffusion_steps" : example.diffusion_steps
    }
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] start RFDiffusion generate")
rc, rfdiffusion_response = query_nim(
    payload=rfdiffusion_query,
    nim_endpoint=NIM_ENDPOINTS.RFDIFFUSION,  # enum, not .value
    read_timeout=0,                          # wait forever (or set a big number)
    echo=True
)
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] end RFDiffusion -> {rc}")

In [None]:
## Print the first 160 characters of the RFDiffusion PDB output
print(rfdiffusion_response["output_pdb"][0:160])

# ProteinMPNN
ProteinMPNN (Protein Message Passing Neural Network) is a deep learning-based graph neural network used in *de novo* protein design workflows. It predicts amino acid sequences for given protein backbones, leveraging evolutionary, functional, and structural information to generate sequences that are likely to fold into the desired 3D structures. As a NIM, it fits into workflows that use RFDiffusion for backbone generation and AlphaFold2-Multimer for interaction prediction.

**Inputs**: 
- `input_pdb` Input protein for which amino acid sequences need to be predicted
- `ca_only` Defaults to false. The CA-only model addresses design cases that focus on the alpha carbon (CA) atoms of the backbone.
- `use_soluble_model` ProteinMPNN offers soluble models for applications requiring high solubility and non-soluble models for membrane protein studies and industrial applications where solubility is less critical.
- `num_seq_per_target` how many sequences to generate for a given target protein structure
- `sampling_temp` ranges from 0 to 1 and controls the diversity of design outcomes by adjusting the probability values for the 20 amino acids at each sequence position. Higher values increase diversity.
 
**Outputs**:
- `ProteinMPNN.fa` which is a fasta file containing the generated sequences for the given structure.
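
To build intuition for `sampling_temp`, the sketch below applies a generic temperature-scaled softmax to a toy set of scores. It is an illustration of the general technique under the assumption that temperature divides the scores before normalization; it is not taken from the ProteinMPNN NIM, and the scores are made up. Entropy is used as a rough measure of diversity.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means a more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy scores for four candidate amino acids at one position (made-up values)
toy_logits = [2.0, 1.0, 0.5, 0.0]
for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: max prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

At low temperature nearly all probability mass sits on the top-scoring residue, which is why a value such as `0.1` tends to give similar sequences across samples.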

In [None]:
## Expected runtime: < 30 seconds for 20 short sequences
## H100 Runtime: 5 seconds
proteinmpnn_query = {
        "input_pdb" : rfdiffusion_response["output_pdb"],
        "input_pdb_chains" : example.input_pdb_chains,
        "ca_only" : example.ca_only,
        "use_soluble_model" : example.use_soluble_model,
        "num_seq_per_target" : example.num_seq_per_target,
        "sampling_temp" : example.sampling_temp
}

print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] start ProteinMPNN predict")
rc, proteinmpnn_response = query_nim(
    payload=proteinmpnn_query,
    nim_endpoint=NIM_ENDPOINTS.PROTEINMPNN,  # enum, not .value
    read_timeout=0,                           # wait indefinitely (or set a big number)
    echo=True
)
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] end ProteinMPNN -> {rc}")

In the next step, we'll extract FASTA sequences from the output FASTA file created by ProteinMPNN. Then, we'll create binder-target pairs that we can feed to AlphaFold2-Multimer to predict the binder-target complex structure.

In [None]:
fasta_sequences = [x.strip() for x in proteinmpnn_response["mfasta"].split("\n") if '>' not in x][2:]

binder_target_pairs = [[binder, example.target_sequence] for binder in fasta_sequences]

print(f"Generated {len(fasta_sequences)} FASTA sequences and {len(binder_target_pairs)} binder-target pairs.")
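
The one-liner above drops header lines and then skips the first two remaining lines by position. If you want to see which record each sequence came from, a small parser that keeps the headers can help. This is a sketch only: `parse_fasta` is a hypothetical helper, and the sample text in the usage comment is made up rather than real ProteinMPNN output.

```python
from typing import List, Optional, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse multi-FASTA text into (header, sequence) pairs, ignoring blank lines."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Usage in this notebook (assumes proteinmpnn_response["mfasta"] holds FASTA text):
# for header, seq in parse_fasta(proteinmpnn_response["mfasta"]):
#     print(header[:60], len(seq))
```

Inspecting the headers this way makes it easier to confirm which records are designed sequences before pairing them with the target.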

# AlphaFold2-Multimer

AlphaFold2-Multimer is a deep learning model that extends the AlphaFold2 pipeline to predict the combined structure of a list of input peptide sequences.

**Inputs**:

- `sequences`: A list of peptide sequences. For this use case, a single pair of sequences (one peptide chain from the ProteinMPNN result plus the original protein sequence used as input to this workflow).
- `algorithm`: The algorithm used for Multiple Sequence Alignment (MSA). This can be either `jackhmmer` or `mmseqs2`. MMseqs2 is significantly faster.

**Output**:

- A list of lists of predicted structures in PDB format. A list of five predictions is returned for each input binder-target pair.

In [None]:
## Expected runtime: 20 min per binder-target pair.
## Total runtime: roughly 3 hours

from datetime import datetime
# sanity: ensure the multimer service is reachable once up front
assert check_nim_readiness("alphafold-multimer"), "alphafold-multimer service not ready"

n_processed = 0
multimer_response_codes = [0 for _ in binder_target_pairs]
multimer_results = [None for _ in binder_target_pairs]

# NOTE: change this to process more or fewer target-binder pairs.
pairs_to_process = 1

for binder_target_pair in binder_target_pairs:
    multimer_query = {
        "sequences": binder_target_pair,
        "selected_models": [1],
    }
    start = datetime.now()
    print(f"[{start:%Y-%m-%d %H:%M:%S}] Processing pair {n_processed+1} of {len(binder_target_pairs)}")

    try:
        rc, multimer_response = query_nim(
            payload=multimer_query,
            nim_endpoint=NIM_ENDPOINTS.AF2_MULTIMER,  # enum (not .value)
            read_timeout=0,                            # wait indefinitely; or set big seconds
            echo=False
        )
    except Exception as e:
        rc, multimer_response = -1, {"error": str(e)}

    multimer_response_codes[n_processed] = rc
    multimer_results[n_processed] = multimer_response

    end = datetime.now()
    elapsed = (end - start).total_seconds()
    print(f"[{end:%Y-%m-%d %H:%M:%S}] Finished pair {n_processed+1}/{len(binder_target_pairs)} in {elapsed:.1f}s (rc={rc})")

    n_processed += 1
    if n_processed >= pairs_to_process:
        break

In [None]:
## Print just the first 160 characters of the first multimer response
result_idx = 0
prediction_idx = 0
print(multimer_results[result_idx][prediction_idx][0:160])

### Assessing the predicted binders and structures

There are many metrics that can be used to assess the quality of the predicted binder-target structure. The predicted local distance difference test (pLDDT) is a measure of per-residue confidence in the local structure. It has a range of zero to one hundred, with higher scores considered more accurate.

The following snippet ranks the results of the binder-target pair AlphaFold2-Multimer predictions by their pLDDT.

In [None]:
# Function to calculate average pLDDT over all residues 
def calculate_average_pLDDT(pdb_string):
    total_pLDDT = 0.0
    atom_count = 0
    pdb_lines = pdb_string.splitlines()
    for line in pdb_lines:
        # PDB atom records start with "ATOM"
        if line.startswith("ATOM"):
            atom_name = line[12:16].strip() # Extract atom name
            if atom_name == "CA":  # Only consider atoms with name "CA"
                try:
                    # Extract the B-factor value from columns 61-66 (following PDB format specifications)
                    pLDDT = float(line[60:66].strip())
                    total_pLDDT += pLDDT
                    atom_count += 1
                except ValueError:
                    pass  # Skip lines where B-factor can't be parsed as a float

    if atom_count == 0:
        return 0.0  # Return 0 if no CA atoms were found

    average_pLDDT = total_pLDDT / atom_count
    return average_pLDDT
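
For a binder-target complex, the overall average can be dominated by the much longer target chain. The sketch below is a variation on the function above that reports the average CA pLDDT per chain, reading the chain ID from column 22 of each `ATOM` record. `average_plddt_by_chain` is a hypothetical helper, and which chain ID corresponds to the binder is an assumption you should verify against the returned PDB.

```python
from collections import defaultdict
from typing import Dict


def average_plddt_by_chain(pdb_string: str) -> Dict[str, float]:
    """Average the B-factor column (pLDDT) over CA atoms, grouped by chain ID."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for line in pdb_string.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":
            continue
        chain_id = line[21:22]  # column 22 in the PDB format
        try:
            value = float(line[60:66])  # columns 61-66: B-factor
        except ValueError:
            continue  # skip records without a parsable B-factor
        totals[chain_id] += value
        counts[chain_id] += 1
    return {chain: totals[chain] / counts[chain] for chain in totals}


# Usage in this notebook (assumes a successful multimer result):
# print(average_plddt_by_chain(multimer_results[0][0]))
```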


In [None]:
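plddts = []
for result in multimer_results:
    if result is None:
        break  # remaining pairs were not processed
    if isinstance(result, (list, tuple)) and result:
        # Successful response: score the first predicted structure
        plddts.append(calculate_average_pLDDT(result[0]))
    else:
        plddts.append(0.0)  # failed request (error dict): rank it last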

In [None]:
## Combine the results with their pLDDTs
binder_target_results = list(zip(binder_target_pairs, multimer_results, plddts))

## Sort the results by pLDDT, highest first
sorted_binder_target_results = sorted(binder_target_results, key=lambda x: x[2], reverse=True)

## Print the top 5 results
for i, (pair, _, plddt) in enumerate(sorted_binder_target_results[:5], start=1):
    print("-"*80)
    print(f"rank: {i}")
    print(f"binder: {pair[0]}")
    print(f"target: {pair[1]}")
    print(f"pLDDT: {plddt:.2f}")
    print("-"*80)

These sequences show the highest pLDDT for their binder-target pair.