# De Novo Protein Design Workflow using NIMs on AIF w/ Jupyter-hub


This example notebook outlines a workflow for creating de novo protein binders using NVIDIA Inference Microservices (NIMs). This workflow leverages advanced AI models to enable computational biologists to design novel protein structures efficiently.

The input to this workflow is a protein sequence, which is then fed to AlphaFold2 for structural prediction; alternatively, this can be skipped and a precomputed protein structure (in PDB format) can be used as input. Protein backbones are then generated with RFDiffusion, sequences are generated with ProteinMPNN, and finally complex structures are predicted with AlphaFold2-multimer. 

This setup provides a powerful framework for exploring protein design, offering flexibility and precision in generating functional protein binders. For more information, refer to the respective repositories and documentation.

## Getting started with Demo NIMs

This is all performed by the AI-Factory `blueprint`.

Initial startup of the `AlphaFold NIM` is time-consuming and requires roughly 1.2 TB of disk space for its data.

After the AIF NIM setup is complete, check the status of the four running NIMs, for example with the following commands. Because the notebook runs inside the cluster, we can use the internal service names and ports rather than the public addresses.

```bash
curl -fsS http://alphafold.protein-binder-design:8081/v1/health/ready
curl -fsS http://rfdiffusion.protein-binder-design:8082/v1/health/ready
curl -fsS http://proteinmpnn.protein-binder-design:8083/v1/health/ready
curl -fsS http://alphafold-multimer.protein-binder-design:8084/v1/health/ready
```
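
Because the AlphaFold NIM can take a long time to come up, a small wait loop can be handy before running the rest of the notebook. The following is a minimal sketch: `wait_until_ready`, `http_probe`, and the injectable `probe` argument are hypothetical names, and the probe assumes the `/v1/health/ready` endpoints shown above return `{"status":"ready"}`.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with JSON {"status": "ready"}."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode()).get("status") == "ready"
    except Exception:
        # Connection errors, HTTP errors, and bad JSON all count as "not ready".
        return False


def wait_until_ready(
    urls: Dict[str, str],
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 60,
    delay: float = 10.0,
) -> Dict[str, bool]:
    """Poll each health URL until all are ready or the attempts run out."""
    status = {name: False for name in urls}
    for attempt in range(attempts):
        for name, url in urls.items():
            if not status[name]:
                status[name] = probe(url)
        if all(status.values()):
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return status
```

Passing a dictionary such as `{"alphafold": "http://alphafold.protein-binder-design:8081/v1/health/ready"}` returns a per-service ready flag; the `probe` argument exists so the loop can be tried without a cluster.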

In [6]:
!curl -fsS http://alphafold-multimer.protein-binder-design:8084/v1/health/ready

{"status":"ready"}

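This notebook talks to the NIMs through their ephemeral pod IPs. We do this because an issue was found with using service ports for long-lived connections, such as those to alphafold-multimer. Pod IPs change whenever a pod is recreated, so re-run the lookup after any restart.

To find the pod IPs, run the following script: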

```bash
#!/usr/bin/env bash
set -euo pipefail

# Usage:
#   ./resolve_nim_pods.sh alphafold.protein-binder-design \
#       rfdiffusion.protein-binder-design \
#       proteinmpnn.protein-binder-design:8083 \
#       alphafold-multimer.protein-binder-design:8084
#
# If no arguments are given, it will default to the four NIM services.

if ! command -v kubectl >/dev/null 2>&1; then
  echo "ERROR: kubectl not found in PATH" >&2
  exit 1
fi

if [ "$#" -eq 0 ]; then
  SERVICES=(
    "alphafold.protein-binder-design"
    "rfdiffusion.protein-binder-design"
    "proteinmpnn.protein-binder-design:8083"
    "alphafold-multimer.protein-binder-design:8084"
  )
else
  SERVICES=("$@")
fi

printf "%-35s %-20s %s\n" "SERVICE (FQDN[:PORT])" "SERVICE->TARGET PORTS" "POD IPs"
printf "%-35s %-20s %s\n" "----------------------" "--------------------" "------"

for svc_fqdn in "${SERVICES[@]}"; do
  entry="$svc_fqdn"

  # Strip optional :port if present, keep it just for display
  base="${entry%%:*}"   # alphafold.protein-binder-design
  port_suffix=""
  if [[ "$entry" == *:* ]]; then
    port_suffix="${entry#*:}"  # e.g. 8083
  fi

  # Parse service name and namespace: <svc>.<namespace>
  svc_name="${base%%.*}"
  ns="${base#*.}"

  if [[ -z "$svc_name" || -z "$ns" || "$svc_name" == "$ns" ]]; then
    echo "WARN: cannot parse service/namespace from '$entry', skipping" >&2
    continue
  fi

  # Get port → targetPort mapping from the Service
  if ! svc_json=$(kubectl -n "$ns" get svc "$svc_name" -o json 2>/dev/null); then
    printf "%-35s %-20s %s\n" "$entry" "N/A" "Service not found"
    continue
  fi

  ports=$(echo "$svc_json" \
    | jq -r '.spec.ports[] | "\(.port)->\(.targetPort)"' 2>/dev/null || true)

  if [[ -z "$ports" ]]; then
    ports_str="(no ports)"
  else
    # Join lines into a single space-separated string
    ports_str=$(echo "$ports" | tr '\n' ' ' | sed 's/[[:space:]]*$//')
  fi

  # Get pod IPs from Endpoints
  if ! ep_json=$(kubectl -n "$ns" get endpoints "$svc_name" -o json 2>/dev/null); then
    printf "%-35s %-20s %s\n" "$entry" "$ports_str" "No Endpoints object"
    continue
  fi

  pod_ips=$(echo "$ep_json" \
    | jq -r '.subsets[]?.addresses[]?.ip' 2>/dev/null || true)

  if [[ -z "$pod_ips" ]]; then
    pod_ips_str="(no ready endpoints)"
  else
    # space-separated list of IPs
    pod_ips_str=$(echo "$pod_ips" | tr '\n' ' ' | sed 's/[[:space:]]*$//')
  fi

  printf "%-35s %-20s %s\n" "$entry" "$ports_str" "$pod_ips_str"
done
```

And it should produce...

```bash
[root@bills-k8s-clustermaster1 ~]# ./resolve_nim_pods.sh
SERVICE (FQDN[:PORT])                         SERVICE->TARGET PORTS  POD IPs
----------------------                        ---------------------  -------
alphafold.protein-binder-design               8081->8000             10.233.73.172
rfdiffusion.protein-binder-design             8082->8000             10.233.73.175
proteinmpnn.protein-binder-design:8083        8083->8000             10.233.73.176
alphafold-multimer.protein-binder-design:8084 8084->8000             10.233.73.177
```
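
The same extraction the script performs with `jq` can be done from Python when the Endpoints JSON is already in hand (for example, the output of `kubectl -n <ns> get endpoints <svc> -o json`). This is a minimal sketch; the function name `pod_ips_from_endpoints` and the trimmed sample document are illustrative assumptions.

```python
import json
from typing import Any, Dict, List


def pod_ips_from_endpoints(endpoints: Dict[str, Any]) -> List[str]:
    """Mirror jq '.subsets[]?.addresses[]?.ip': collect ready pod IPs."""
    ips: List[str] = []
    for subset in endpoints.get("subsets") or []:
        # Only "addresses" holds ready pods; "notReadyAddresses" is skipped.
        for address in subset.get("addresses") or []:
            ip = address.get("ip")
            if ip:
                ips.append(ip)
    return ips


# Hypothetical Endpoints document, trimmed to the fields used here.
sample = json.loads("""
{"kind": "Endpoints",
 "subsets": [{"addresses": [{"ip": "10.233.73.177"}],
              "ports": [{"port": 8000}]}]}
""")
print(pod_ips_from_endpoints(sample))
```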

In [7]:
!curl -fsS http://10.233.73.177:8000/v1/health/ready

{"status":"ready"}

In [8]:
import os

SERVICE_POD_IPS = {
    "alphafold":          "10.233.73.172",
    "rfdiffusion":        "10.233.73.175",
    "proteinmpnn":        "10.233.73.176",
    "alphafold-multimer": "10.233.73.177",
}

ENV_VAR_NAMES = {
    "alphafold":          "ALPHAFOLD_POD_IP",
    "rfdiffusion":        "RFDIFFUSION_POD_IP",
    "proteinmpnn":        "PROTEINMPNN_POD_IP",
    "alphafold-multimer": "ALPHAFOLD_MULTIMER_POD_IP",
}

for svc, ip in SERVICE_POD_IPS.items():
    env_name = ENV_VAR_NAMES[svc]
    os.environ[env_name] = ip
    print(f"Set {env_name}={ip}")

Set ALPHAFOLD_POD_IP=10.233.73.172
Set RFDIFFUSION_POD_IP=10.233.73.175
Set PROTEINMPNN_POD_IP=10.233.73.176
Set ALPHAFOLD_MULTIMER_POD_IP=10.233.73.177


First, we'll install some prerequisites so our examples work.

In [9]:
%%capture cap
pip install requests py3Dmol python-dotenv

In [10]:
from __future__ import annotations
from enum import Enum, StrEnum
from pathlib import Path
from typing import Dict, Any, List, Tuple, Optional, Union
import os, requests, json, time
from dotenv import load_dotenv

One needs to use an NGC Personal Key to run the examples below.

In [11]:
load_dotenv()
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY") or input("Paste Run Key: ")

The JupyterLab NVDashboard extension embeds GPU diagnostic dashboards as movable windows within an interactive JupyterLab environment.

See: [https://developer.nvidia.com/blog/gpu-dashboards-in-jupyter-lab/](https://developer.nvidia.com/blog/gpu-dashboards-in-jupyter-lab/)

In [12]:
%%capture cap
pip install jupyterlab-nvdashboard

**Refresh** the page; you should then see the GPU Dashboard on the left.

In [13]:
# --- Kubernetes wiring
NAMESPACE = "protein-binder-design"
SERVICES: Dict[str, int] = {
    "alphafold":          8081,
    "rfdiffusion":        8082,
    "proteinmpnn":        8083,
    "alphafold-multimer": 8084,
}
SERVICES_BY_PORT = {port: svc for svc, port in SERVICES.items()}

# When talking directly to the pod IP, all NIMs listen on 8000
POD_TARGET_PORT = 8000

# Optional per-service POD IP overrides, populated from env vars
# e.g. export ALPHAFOLD_POD_IP=10.233.73.176
POD_IP_ENV = {
    "alphafold":          "ALPHAFOLD_POD_IP",
    "rfdiffusion":        "RFDIFFUSION_POD_IP",
    "proteinmpnn":        "PROTEINMPNN_POD_IP",
    "alphafold-multimer": "ALPHAFOLD_MULTIMER_POD_IP",
}

POD_IP_OVERRIDES: Dict[str, str] = {
    svc: ip for svc, env_name in POD_IP_ENV.items()
    if (ip := os.getenv(env_name))
}

def svc_url(
    service: str,
    ns: str = NAMESPACE,
    port: Optional[int] = None,
    scheme: str = "http",
) -> str:
    """
    Build URL base for a NIM service.

    - If a POD IP env var is set (e.g. ALPHAFOLD_POD_IP), use:
        http://<pod-ip>:8000
    - Otherwise, fall back to ClusterIP service:
        http://<service>.<namespace>:<service-port>
    """
    env_var = POD_IP_ENV.get(service)
    pod_ip = os.getenv(env_var) if env_var else None

    if pod_ip:
        # use pod IP + pod port (8000)
        p = POD_TARGET_PORT if port is None else int(port)
        host = pod_ip
    else:
        # fall back to Service DNS + service port
        p = SERVICES[service] if port is None else int(port)
        host = f"{service}.{ns}"

    return f"{scheme}://{host}:{p}"

_POLL_SECS = os.getenv("NVCF_POOL_SECONDS","5")
# --- Auth header (optional)
HEADERS: Dict[str, str] = {
    "content-type": "application/json",
    "NVCF-POLL-SECONDS": _POLL_SECS,
}

# Optional bearer token from env
if (k := os.getenv("NGC_API_KEY")):
    HEADERS["Authorization"] = f"Bearer {k}"
elif (k := os.getenv("NVIDIA_API_KEY")):
    HEADERS["Authorization"] = f"Bearer {k}"

# --- Requests session; ignore any proxy envs inside the pod
SESSION = requests.Session()
SESSION.trust_env = False

# --- Make sure in-cluster names never go through a proxy (harmless if no proxy present)
no_proxy = [
    "localhost","127.0.0.1",
    ".svc",".svc.cluster.local",".cluster.local",
    "10.0.0.0/8","172.16.0.0/12","192.168.0.0/16",
    "10.233.0.0/16",  # adjust if your cluster CIDR differs
]
no_proxy += [f"{s}.{NAMESPACE}" for s in SERVICES]
os.environ["NO_PROXY"] = ",".join(no_proxy)
os.environ["no_proxy"] = os.environ["NO_PROXY"]

def nim_base_url(nim_endpoint: NIM_ENDPOINTS) -> str:
    """
    Resolve the base URL for a given NIM endpoint.

    - If a POD IP env var is set (e.g. ALPHAFOLD_POD_IP), this returns:
        http://<pod-ip>:8000
    - Otherwise, it falls back to:
        http://<service>.<namespace>:<service-port>
    """
    service = ENDPOINT_TO_SERVICE[nim_endpoint]
    return svc_url(service)   # svc_url already knows about pod IP overrides
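
To make the resolution rule concrete, here is a self-contained sketch of the same precedence: a pod IP override wins, otherwise the Service DNS name and port are used. It reimplements the logic in a hypothetical `resolve_base_url` helper that takes the environment as an argument so it can be exercised without a cluster; it is not a replacement for `svc_url`.

```python
from typing import Mapping

# Service ports and pod port copied from the cell above (subset for brevity).
SERVICE_PORTS = {"alphafold": 8081, "alphafold-multimer": 8084}
POD_PORT = 8000


def resolve_base_url(
    service: str,
    env: Mapping[str, str],
    namespace: str = "protein-binder-design",
) -> str:
    """Prefer <SERVICE>_POD_IP from env; fall back to Service DNS."""
    env_name = service.upper().replace("-", "_") + "_POD_IP"
    pod_ip = env.get(env_name)
    if pod_ip:
        return f"http://{pod_ip}:{POD_PORT}"
    return f"http://{service}.{namespace}:{SERVICE_PORTS[service]}"


print(resolve_base_url("alphafold-multimer", {"ALPHAFOLD_MULTIMER_POD_IP": "10.233.73.177"}))
print(resolve_base_url("alphafold", {}))
```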

In [14]:
class NIM_PORTS(Enum):
    ALPHAFOLD2_PORT  = 8081
    RFDIFFUSION_PORT = 8082
    PROTEINMPNN_PORT = 8083
    AF2_MULTIMER_PORT= 8084

class NIM_ENDPOINTS(StrEnum):
    ALPHAFOLD2   = "protein-structure/alphafold2/predict-structure-from-sequence"
    RFDIFFUSION  = "biology/ipd/rfdiffusion/generate"
    PROTEINMPNN  = "biology/ipd/proteinmpnn/predict"
    AF2_MULTIMER = "protein-structure/alphafold2/multimer/predict-structure-from-sequences"

ENDPOINT_TO_SERVICE = {
    NIM_ENDPOINTS.ALPHAFOLD2:   "alphafold",
    NIM_ENDPOINTS.RFDIFFUSION:  "rfdiffusion",
    NIM_ENDPOINTS.PROTEINMPNN:  "proteinmpnn",
    NIM_ENDPOINTS.AF2_MULTIMER: "alphafold-multimer",
}

In [15]:
# Health check (accepts service name, port, or enum)
def check_nim_readiness(
    target: Union[str, int, NIM_PORTS],
    ns: str = NAMESPACE,
    ready_path: str = "/v1/health/ready",
    timeout: int = 5,
    verbose: bool = False,
) -> bool:
    """
    Check NIM readiness.

    target can be:
      - service name: "alphafold", "rfdiffusion", ...
      - port number: 8081, 8082, ...
      - NIM_PORTS enum

    If a POD IP override is configured for the service (e.g. ALPHAFOLD_POD_IP),
    this will hit http://<pod-ip>:8000/v1/health/ready.
    Otherwise it falls back to http://<service>.<ns>:<service-port>/v1/health/ready.
    """
    # normalize to service name
    service: Optional[str] = None
    port: Optional[int] = None

    if isinstance(target, NIM_PORTS):
        port = int(target.value)
        service = SERVICES_BY_PORT.get(port)
    elif isinstance(target, int):
        port = target
        service = SERVICES_BY_PORT.get(port)
    else:
        service = str(target)
        port = SERVICES.get(service)

    if service is None:
        raise ValueError(f"Unknown service/port: {target}")

    # IMPORTANT: let svc_url decide port based on pod-IP override.
    # Do NOT pass `port` here, or you'll force the Service port.
    base = svc_url(service, ns)
    url = f"{base}/{ready_path.lstrip('/')}"

    if verbose:
        print(f"[readiness:{service}] GET {url}")

    try:
        r = SESSION.get(url, headers=HEADERS, timeout=timeout)
        if verbose:
            print(f"[readiness:{service}] status={r.status_code} body={r.text!r}")
        return r.ok and r.json().get("status") == "ready"
    except Exception as e:
        print(f"[readiness:{service}] {e}")
        return False

Main request helper (cluster-native; supports long jobs and an unbounded wait)
* Resolves the base URL automatically: the pod IP on port 8000 when an override is set, otherwise `{service}.{namespace}:{port}`.
* `read_timeout=0` or `None` ⇒ wait forever.
* For long AF2 jobs, you can either set a very large timeout or keep it infinite.

In [16]:
import time
from typing import Optional, Dict, Any, Tuple

def query_nim(
    *,
    payload: dict,
    nim_endpoint: NIM_ENDPOINTS,
    read_timeout: Optional[int] = 0,   # 0 or None => no timeout
    poll_on_202: bool = True,
    force_async: bool = True,
    echo: bool = False,
) -> tuple[int, Any]:
    """
    Generic NIM caller.

    - Builds URL from NIM_ENDPOINTS + ENDPOINT_TO_SERVICE + svc_url (pod IP aware).
    - If read_timeout is 0/None, we let requests.post wait indefinitely.
    - If poll_on_202 is True, we handle NIM-style 202+Location polling.
    - force_async is accepted for call-site symmetry but is not used by this helper.
    """
    base = nim_base_url(nim_endpoint)
    url = f"{base}/{nim_endpoint.value}"

    timeout = None if not read_timeout else read_timeout

    if echo:
        print(f"[query_nim] POST {url}")
        print(f"  Payload keys: {list(payload.keys())}")
        print(f"  Timeout: {timeout}  poll_on_202={poll_on_202}  force_async={force_async}")

    # For most on-prem NIMs, you can just do a straight POST and get 200
    resp = SESSION.post(url, headers=HEADERS, json=payload, timeout=timeout)

    # Simple 200 path
    if resp.status_code != 202 or not poll_on_202:
        try:
            body = resp.json()
        except Exception:
            body = resp.text
        return resp.status_code, body

    # 202 + polling path (if your NIM uses it)
    # Expecting a Location header pointing to a result URL
    loc = resp.headers.get("Location") or resp.headers.get("location")
    if not loc:
        # No Location header: just return what we got
        try:
            body = resp.json()
        except Exception:
            body = resp.text
        return resp.status_code, body

    if echo:
        print(f"[query_nim] 202 accepted, polling {loc}")

    # Basic polling loop
    while True:
        poll_resp = SESSION.get(loc, headers=HEADERS, timeout=timeout)
        if poll_resp.status_code in (200, 400, 404, 500):
            try:
                body = poll_resp.json()
            except Exception:
                body = poll_resp.text
            return poll_resp.status_code, body

        # Otherwise, sleep and try again
        time.sleep(int(os.getenv("NVCF_POOL_SECONDS", "5")))
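
The 202-plus-`Location` pattern handled above can be exercised without a server. The sketch below is an illustration under assumptions: `poll_until_done` and `FakeResponse` are hypothetical names, and the fetch function is injected so a canned sequence of responses can stand in for the NIM. Unlike the loop in `query_nim`, it also accepts an optional cap on the number of polls.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

TERMINAL = (200, 400, 404, 500)  # same terminal codes as query_nim


@dataclass
class FakeResponse:
    """Stand-in for an HTTP response (hypothetical, for illustration)."""
    status_code: int
    body: Any = None


def poll_until_done(
    fetch: Callable[[str], FakeResponse],
    location: str,
    interval: float = 5.0,
    max_polls: Optional[int] = None,
) -> Tuple[int, Any]:
    """Call fetch(location) until a terminal status code is seen."""
    polls = 0
    while True:
        resp = fetch(location)
        polls += 1
        if resp.status_code in TERMINAL:
            return resp.status_code, resp.body
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"no terminal status after {polls} polls")
        time.sleep(interval)


# Two "still running" answers followed by a result.
canned = iter([FakeResponse(202), FakeResponse(202), FakeResponse(200, {"ok": True})])
status, body = poll_until_done(lambda _loc: next(canned), "/hypothetical/status/url", interval=0)
print(status, body)
```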

In [17]:
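def get_reduced_pdb(pdb_id: str, rcsb_path: Optional[str] = None) -> str:
    """Return only the ATOM records of a PDB file, downloading it first if it is missing and rcsb_path is given."""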
    pdb = Path(pdb_id)
    if not pdb.exists() and rcsb_path is not None:
        pdb.write_text(requests.get(rcsb_path).text)
    lines = filter(lambda line: line.startswith("ATOM"), pdb.read_text().split("\n"))
    return "\n".join(list(lines))
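
As a quick illustration of what the reduction keeps, the sketch below applies the same `ATOM` filter to an in-memory string instead of a file. The helper name `keep_atom_records` and the sample records are made up for the example.

```python
def keep_atom_records(pdb_text: str) -> str:
    """Keep only lines that start with ATOM (drops HETATM, HEADER, END, ...)."""
    return "\n".join(
        line for line in pdb_text.splitlines() if line.startswith("ATOM")
    )


# Invented PDB fragment: two ATOM records mixed with other record types.
sample = "\n".join([
    "HEADER    EXAMPLE",
    "ATOM      1  N   SER A   1     -33.577 -10.521  32.547  1.00 68.70           N",
    "HETATM    2  O   HOH A 101       0.000   0.000   0.000  1.00  0.00           O",
    "ATOM      2  CA  SER A   1     -32.500 -10.000  32.000  1.00 68.70           C",
    "END",
])
reduced = keep_atom_records(sample)
print(len(reduced.splitlines()), "ATOM records kept")
```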


In [18]:
class ExampleRequestParams:
    def __init__(self,
                target_sequence: str,
                contigs: str, 
                hotspot_res: List[str],
                input_pdb_chains: List[str],
                ca_only: bool,
                use_soluble_model: bool,
                sampling_temp: List[float],
                diffusion_steps: int = 15,
                num_seq_per_target: int = 20):
        self.target_sequence = target_sequence
        self.contigs = contigs
        self.hotspot_res = hotspot_res
        self.input_pdb_chains = input_pdb_chains
        self.ca_only = ca_only
        self.use_soluble_model = use_soluble_model
        self.sampling_temp = sampling_temp
        self.diffusion_steps = diffusion_steps
        self.num_seq_per_target = num_seq_per_target

### Example data
Below, we include three example input sets. Note that these are of varying difficulty and will exhibit different runtimes and resource utilizations.
- Example **1R42** should run on most systems with 4 GPUs with 40GB of VRAM or more.
- Example **5PTN**
- Example **6VXX** requires 4 GPUs with 80GB of VRAM each.

In [19]:
example_6vxx = ExampleRequestParams(
    target_sequence="MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPSGAGSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDPPEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKGSGRENLYFQGGGGSGYIPEAPRDGQAYVRKDGEWVLLSTFLGHHHHHHHH",
    contigs="A353-410/0 100-200",
    hotspot_res=["A360","A361","A362","A366"],
    input_pdb_chains=["A"],
    ca_only=False,
    use_soluble_model=False,
    sampling_temp=[0.1],
    diffusion_steps=15,
    num_seq_per_target=20
)
example_5ptn = ExampleRequestParams(
    target_sequence="NITEEFYQSTCSAVSKGYLSALRTGWYTSVITIELSNIKKIKCNGTDAKIKLIKQELDKYKNAVTELQLLMQSTPATNNQARGSGSGRSLGFLLGVGSAIASGVAVSKVLHLEGEVNKIKSALLSTNKAVVSLSNGVSVLTSKVLDLKNYIDKQLLPIVNKQSCSIPNIETVIEFQQKNNRLLEITREFSVNAGVTTPVSTYMLTNSELLSLINDMPITNDQKKLMSNNVQIVRQQSYSIMSIIKEEVLAYVVQLPLYGVIDTPCWKLHTSPLCTTNTKEGSNICLTRTDRGWYCDNAGSVSFFPQAETCKVQSNRVFCDTMNSLTLPSEVNLCNVDIFNPKYDCKIMTSKTDVSSSVITSLGAIVSCYGKTKCTASNKNRGIIKTFSNGCDYVSNKGVDTVSVGNTLYYVNKQEGKSLYVKGEPIINFYDPLVFPSDQFDASISQVNEKINQSLAFIRKSDELLSAIGGYIPEAPRDGQAYVRKDGEWVLLSTFLGGLVPRGSHHHHHH",
    contigs="A1-25/0 70-100",
    hotspot_res=["A14","A15","A17","A18"],
    input_pdb_chains=["A"],
    ca_only=False,
    use_soluble_model=False,
    sampling_temp=[0.1],
    diffusion_steps=15,
    num_seq_per_target=20
)
example_1r42 = ExampleRequestParams(
    target_sequence="STIEEQAKTFLDKFNHEAEDLFYQSSLASWNYNTNITEENVQNMNNAGDKWSAFLKEQSTLAQMYPLQEIQNLTVKLQLQALQQNGSSVLSEDKSKRLNTILNTMSTIYSTGKVCNPDNPQECLLLEPGLNEIMANSLDYNERLWAWESWRSEVGKQLRPLYEEYVVLKNEMARANHYEDYGDYWRGDYEVNGVDGYDYSRGQLIEDVEHTFEEIKPLYEHLHAYVRAKLMNAYPSYISPIGCLPAHLLGDMWGRFWTNLYSLTVPFGQKPNIDVTDAMVDQAWDAQRIFKEAEKFFVSVGLPNMTQGFWENSMLTDPGNVQKAVCHPTAWDLGKGDFRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGFHEAVGEIMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKGEIPKDQWMKKWWEMKREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQEALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRLGKSEPWTLALENVVGAKNMNVRPLLNYFEPLFTWLKDQNKNSFVGWSTDWSPYAD",
    contigs="A114-353/0 50-100",
    hotspot_res=["A119","A123","A233","A234","A235"],
    input_pdb_chains=["A"],
    ca_only=False,
    use_soluble_model=False,
    sampling_temp=[0.1],
    diffusion_steps=15,
    num_seq_per_target=20
)

In [20]:
## Set the example here to switch example inputs.
## Note: Example 6vxx requires a GPU with at least 80GB of VRAM.
example = example_1r42

### Check that the NIM is ready from Python

We can test whether each NIM is up and running using our `check_nim_readiness` function.

In [21]:
for env in ("ALPHAFOLD_POD_IP", "RFDIFFUSION_POD_IP",
            "PROTEINMPNN_POD_IP", "ALPHAFOLD_MULTIMER_POD_IP"):
    print(env, "=", os.environ.get(env))

ALPHAFOLD_POD_IP = 10.233.73.172
RFDIFFUSION_POD_IP = 10.233.73.175
PROTEINMPNN_POD_IP = 10.233.73.176
ALPHAFOLD_MULTIMER_POD_IP = 10.233.73.177


In [22]:
for s in ("alphafold","rfdiffusion","proteinmpnn","alphafold-multimer"):
    print(s, "ready:", check_nim_readiness(s, verbose=True))

[readiness:alphafold] GET http://10.233.73.172:8000/v1/health/ready
[readiness:alphafold] status=200 body='{"status":"ready"}'
alphafold ready: True
[readiness:rfdiffusion] GET http://10.233.73.175:8000/v1/health/ready
[readiness:rfdiffusion] status=200 body='{"status":"ready"}'
rfdiffusion ready: True
[readiness:proteinmpnn] GET http://10.233.73.176:8000/v1/health/ready
[readiness:proteinmpnn] status=200 body='{"status":"ready"}'
proteinmpnn ready: True
[readiness:alphafold-multimer] GET http://10.233.73.177:8000/v1/health/ready
[readiness:alphafold-multimer] status=200 body='{"status":"ready"}'
alphafold-multimer ready: True


# AlphaFold2

AlphaFold2 is a deep learning model for predicting protein structure from amino acid sequence that has achieved state-of-the-art performance. The NVIDIA AlphaFold2 NIM includes GPU-accelerated MMseqs2, which accelerates the MSA portion of the structural prediction pipeline.

**Inputs**:
- `sequence`: An amino acid sequence
- `algorithm`: The algorithm used for Multiple Sequence Alignment (MSA). This can be either of `jackhmmer` or `mmseqs2`. MMSeqs2 is significantly faster.

**Outputs**:
- A list of predicted structures in PDB format.
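
A small guard before submitting can catch typos in the request early. The sketch below builds the two-field payload described above. `build_alphafold2_payload` is a hypothetical helper, and the 20-letter alphabet check is an assumption about what counts as a valid sequence, not a documented rule of the NIM.

```python
from typing import Dict

ALLOWED_ALGORITHMS = {"jackhmmer", "mmseqs2"}
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard residues only


def build_alphafold2_payload(sequence: str, algorithm: str = "mmseqs2") -> Dict[str, str]:
    """Normalize the sequence and validate both fields before sending."""
    seq = "".join(sequence.split()).upper()  # strip whitespace and newlines
    if not seq:
        raise ValueError("sequence is empty")
    unknown = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unexpected residue codes: {unknown}")
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"algorithm must be one of {sorted(ALLOWED_ALGORITHMS)}")
    return {"sequence": seq, "algorithm": algorithm}


print(build_alphafold2_payload("stie eq"))
```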

In [23]:
print(f"endpoint: {NIM_ENDPOINTS.ALPHAFOLD2}")

endpoint: protein-structure/alphafold2/predict-structure-from-sequence


In [24]:
from datetime import datetime
from cache_nim import query_nim_cached

# Estimated 9 minutes on H100 for example 5ptn
alphafold2_query = {
    "sequence": example.target_sequence,
    "algorithm": "mmseqs2",  # if this needs internet, ensure namespace egress allows it
}

def _call_alphafold2(q):
    # Your exact call:
    return query_nim(
        payload=q,
        nim_endpoint=NIM_ENDPOINTS.ALPHAFOLD2,  # enum, not .value
        read_timeout=0,                         # wait forever (or use a big number)
        poll_on_202=True,                       # if server is async, we'll poll
        force_async=True,                       # nudge service into 202+poll mode (ignored if unsupported)
        echo=True
    )

# Optional: write a PDB file if present in the response structure
def _save_alphafold2_artifacts(response, outdir):
    # Adjust these keys to match what your service returns
    pdb_text = None
    if isinstance(response, dict):
        # common places people park text blobs
        pdb_text = (
            response.get("pdb")
            or (response.get("result") or {}).get("pdb")
            or (response.get("outputs") or {}).get("pdb")
        )
    if pdb_text:
        (outdir / "result.pdb").write_text(pdb_text)
        
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] start predicted structures in PDB format")

rc, alphafold2_response = query_nim_cached(
    step="alphafold2",
    endpoint=str(NIM_ENDPOINTS.ALPHAFOLD2),
    payload=alphafold2_query,
    fetch_fn=_call_alphafold2,
    cache_dir=".nim_cache",              # customize if you like
    version=os.getenv("ALPHAFOLD2_TAG"),# helps invalidate when model/container changes
    ttl_seconds=None,                    # or e.g., 30*24*3600 for 30 days
    refresh=False,                       # set True to force recompute
    save_extra=_save_alphafold2_artifacts,
)
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] end predicted structures in PDB format")
if alphafold2_response and isinstance(alphafold2_response, dict):
    if alphafold2_response.get("_from_cache"):
        print("✅ cache hit for alphafold2")
    else:
        print("🧠 computed alphafold2 result and cached it")


[2025-11-19 18:38:46] start predicted structures in PDB format
[2025-11-19 18:38:46] end predicted structures in PDB format
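
`query_nim_cached` comes from a local `cache_nim` module that is not shown in this notebook. Purely as an illustration of the idea, and not the actual implementation, a payload-keyed disk cache could look like the sketch below. Every name here (`cache_key`, `cached_call`, the file layout, the hashing scheme, the third return value) is an assumption.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Optional, Tuple


def cache_key(step: str, endpoint: str, payload: dict, version: Optional[str]) -> str:
    """Hash everything that should invalidate the cache when it changes."""
    blob = json.dumps(
        {"step": step, "endpoint": endpoint, "payload": payload, "version": version},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def cached_call(
    step: str,
    endpoint: str,
    payload: dict,
    fetch_fn: Callable[[dict], Tuple[int, Any]],
    cache_dir: str,
    version: Optional[str] = None,
    refresh: bool = False,
) -> Tuple[int, Any, bool]:
    """Return (status, body, from_cache); only 200 responses are stored."""
    path = Path(cache_dir) / step / f"{cache_key(step, endpoint, payload, version)}.json"
    if path.exists() and not refresh:
        entry = json.loads(path.read_text())
        return entry["status"], entry["body"], True
    status, body = fetch_fn(payload)
    if status == 200:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"status": status, "body": body}))
    return status, body, False


# Demonstration with a fake fetch function and a throwaway directory.
calls = []

def fake_fetch(p: dict) -> Tuple[int, Any]:
    calls.append(p)
    return 200, ["ATOM ..."]

with tempfile.TemporaryDirectory() as tmp:
    first = cached_call("alphafold2", "demo", {"sequence": "STIEEQ"}, fake_fetch, tmp)
    second = cached_call("alphafold2", "demo", {"sequence": "STIEEQ"}, fake_fetch, tmp)

print(first[2], second[2], len(calls))
```

Keying on the payload plus a version string is what lets a changed sequence or a new container tag miss the cache while an identical rerun hits it.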


In [25]:
isinstance(alphafold2_response, dict)

False

In [26]:
## Print the first two lines (160 characters) of the alphafold2 response
alphafold2_response[0][0:160]

'ATOM      1  N   SER A   1     -33.577 -10.521  32.547  1.00 68.70           N  \nATOM      2  H   SER A   1     -33.701 -11.499  32.767  1.00 68.70           H '

In [27]:
if rc == 200:
    import py3Dmol
    view = py3Dmol.view(width=800, height=600)
    view.addModel(alphafold2_response[0], "pdb")
    view.setStyle({"cartoon": {"color": "spectrum"}})
    view.setBackgroundColor("black")
    
    view.zoomTo()
    view.show()
else:
    print(f"Unexpected HTTP status: {rc}")
    print(f"Response: {alphafold2_response}")

# RFDiffusion

This section demonstrates how to use RFDiffusion NIM in a *de novo* protein design workflow. Inspired by AI image generation models, RFDiffusion applies generative diffusion techniques to create novel protein structures. It excels in designing complex protein architectures, including binders and symmetric assemblies, by sculpting atomic clouds into functional proteins.

**Inputs**
- `input_pdb` is the protein target in PDB format
- `contigs` is the RFDiffusion language for specifying which regions to work on. See the official [RFDiffusion repo](https://github.com/RosettaCommons/RFdiffusion?tab=readme-ov-file#running-the-diffusion-script) for a full breakdown. `A20-60/0 50-100` means generate a binder to chain A residues 20-60, where the binder is 50-100 residues long. The `/0` specifies a chain break.
- `hotspot_res` hot spot residues (specifically for binders)
- `diffusion_steps` number of diffusion steps

**Output**:
- `output_pdb` is the output pdb
- `protein` is the input pdb
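
To make the contig notation concrete, the sketch below splits a contig string into its parts. It is a simplified reader for the patterns used in this notebook (chain ranges such as `A20-60`, length ranges such as `50-100`, and the `0` chain-break marker), not a full parser for the RFDiffusion contig language; `parse_binder_contig` is a hypothetical name.

```python
import re
from typing import Dict, List, Tuple

_TARGET = re.compile(r"^([A-Za-z])(\d+)-(\d+)$")  # e.g. A20-60
_LENGTH = re.compile(r"^(\d+)-(\d+)$")            # e.g. 50-100


def parse_binder_contig(contigs: str) -> Dict[str, object]:
    """Split a contig string into target ranges, length ranges, and chain breaks."""
    targets: List[Tuple[str, int, int]] = []
    lengths: List[Tuple[int, int]] = []
    chain_breaks = 0
    for token in re.split(r"[/\s]+", contigs.strip()):
        if token == "0":
            chain_breaks += 1
        elif (m := _TARGET.match(token)):
            targets.append((m.group(1), int(m.group(2)), int(m.group(3))))
        elif (m := _LENGTH.match(token)):
            lengths.append((int(m.group(1)), int(m.group(2))))
        else:
            raise ValueError(f"Unrecognized contig token: {token!r}")
    return {"targets": targets, "lengths": lengths, "chain_breaks": chain_breaks}


print(parse_binder_contig("A20-60/0 50-100"))
```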

In [28]:
## Expected runtime: ~15 seconds to 1 minute
## H100 runtime: 2 seconds
## The cache is not strictly needed for a step this short, but it does give us a checkpoint of the data.


def _call_rfdiffusion(p):
    # Your exact call:
    return query_nim(
        payload=p,
        nim_endpoint=NIM_ENDPOINTS.RFDIFFUSION,
        read_timeout=0,
        poll_on_202=True,
        force_async=True,
        echo=True,
    )
# Optional: write a PDB file if present in the response structure
def _save_rfdiffusion_artifacts(response, outdir):
    # Adjust these keys to match what your service returns
    pdb_text = None
    if isinstance(response, dict):
        # common places people park text blobs
        pdb_text = (
            response.get("pdb")
            or (response.get("result") or {}).get("pdb")
            or (response.get("outputs") or {}).get("pdb")
        )
    if pdb_text:
        (outdir / "result.pdb").write_text(pdb_text)
        
rfdiffusion_query = {
        "input_pdb" : alphafold2_response[0], ## Take the first structure prediction (of 5) from AlphaFold2
        "contigs" : "51-51/A163-181/60-60", #example.contigs
        # "hotspot_res" : example.hotspot_res,
        "diffusion_steps" : example.diffusion_steps
    }
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] start RFDiffusion generate")

rc, rfdiffusion_response = query_nim_cached(
    step="rfdiffusion",
    endpoint=str(NIM_ENDPOINTS.RFDIFFUSION),
    payload=rfdiffusion_query,
    fetch_fn=_call_rfdiffusion,
    cache_dir=".nim_cache",              # customize if you like
    version=os.getenv("RFDIFFUSION_TAG"),# helps invalidate when model/container changes
    ttl_seconds=None,                    # or e.g., 30*24*3600 for 30 days
    refresh=False,                       # set True to force recompute
    save_extra=_save_rfdiffusion_artifacts,
)
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] end RFDiffusion -> {rc}")
if rfdiffusion_response and isinstance(rfdiffusion_response, dict):
    if rfdiffusion_response.get("_from_cache"):
        print("✅ cache hit for rfdiffusion")
    else:
        print("🧠 computed rfdiffusion result and cached it")
        
#rc, rfdiffusion_response = query_nim(
#    payload=rfdiffusion_query,
#    nim_endpoint=NIM_ENDPOINTS.RFDIFFUSION,  # <-- key change
#    read_timeout=0,                          # wait forever (or set a big number)
#    poll_on_202=True,                       # if server is async, we'll poll
#    force_async=True,                       # nudge service into 202+poll mode (ignored if unsupported)
#    echo=True
#)
#print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] end RFDiffusion -> {rc}")

[2025-11-19 18:38:57] start RFDiffusion generate
[2025-11-19 18:38:57] end RFDiffusion -> 200
✅ cache hit for rfdiffusion


In [29]:
## Print the first 160 characters of the RFDiffusion PDB output
print(rfdiffusion_response["output_pdb"][0:160])

ATOM      1  N   GLY A   1      -9.691 -15.140  15.199  1.00  0.00
ATOM      2  CA  GLY A   1      -8.857 -15.213  14.006  1.00  0.00
ATOM      3  C   GLY A   1


In [30]:
if rc == 200:
    import py3Dmol
    view = py3Dmol.view(width=800, height=600)
    view.addModel(rfdiffusion_response["output_pdb"], "pdb")
    view.setStyle({"cartoon": {"color": "spectrum"}})
    view.setBackgroundColor("black")
    
    view.zoomTo()
    view.show()
else:
    print(f"Unexpected HTTP status: {rc}")


# ProteinMPNN
ProteinMPNN (Protein Message Passing Neural Network) is a deep learning-based graph neural network used in *de novo* protein design workflows. It predicts amino acid sequences for given protein backbones, leveraging evolutionary, functional, and structural information to generate sequences that are likely to fold into the desired 3D structures. This tool integrates seamlessly with NIMs into workflows involving RFDiffusion for backbone generation and AlphaFold-2 Multimer for interaction prediction, enhancing the accuracy and efficiency of protein design.

**Inputs**: 
- `input_pdb` Input protein for which amino acid sequences need to be predicted
- `ca_only` Defaults to `false`. The CA-only model addresses specific needs in protein design where the focus is on the alpha carbon (CA) atoms.
- `use_soluble_model` ProteinMPNN offers soluble models for applications requiring high solubility and non-soluble models for membrane protein studies and industrial applications where solubility is less critical.
- `num_seq_per_target` Number of sequences to generate for a given target protein structure.
- `sampling_temp` Ranges from 0 to 1 and controls the diversity of design outcomes by adjusting the probability values for the 20 amino acids at each sequence position. Higher values increase diversity.
 
**Outputs**:
- `ProteinMPNN.fa` which is a fasta file containing the generated sequences for the given structure.
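
The effect of a sampling temperature can be illustrated with a temperature-scaled softmax, a standard way such a parameter reshapes a categorical distribution. This is a generic sketch with made-up logits, not ProteinMPNN's internal code: lower temperatures concentrate probability on the top-scoring option, while higher temperatures flatten the distribution.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate residues at one position.
logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```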

In [31]:
## Expected runtime: < 30 seconds for 20 short sequences
## H100 Runtime: 5 seconds
proteinmpnn_query = {
        "input_pdb" : rfdiffusion_response["output_pdb"],
        "input_pdb_chains" : example.input_pdb_chains,
        "ca_only" : example.ca_only,
        "use_soluble_model" : example.use_soluble_model,
        "num_seq_per_target" : example.num_seq_per_target,
        "sampling_temp" : example.sampling_temp
}

def _call_proteinmpnn(payload: dict) -> Tuple[int, Any]:
    # This is exactly what you were doing before, just wrapped.
    return query_nim(
        payload=payload,
        nim_endpoint=NIM_ENDPOINTS.PROTEINMPNN,
        read_timeout=0,      # wait indefinitely (or set a big number)
        poll_on_202=True,    # if server is async, we'll poll
        force_async=True,    # nudge service into 202+poll mode
        echo=True,
    )
    
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] start ProteinMPNN predict")

rc, proteinmpnn_response = query_nim_cached(
    step="proteinmpnn",                          # logical pipeline step name
    endpoint="PROTEINMPNN",                      # any string; used in hash + metadata
    payload=proteinmpnn_query,
    fetch_fn=_call_proteinmpnn,
    cache_dir=".nim_cache",                      # or whatever you're already using
    version=None,                                # or e.g. "proteinmpnn-v1", image tag, git SHA
    ttl_seconds=None,                            # or set if you want cache expiry
    refresh=False,                               # set True to force recompute
    # save_extra=optional hook, see below
)
print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] end ProteinMPNN -> {rc}")

# If you want to know if it was cached:
if isinstance(proteinmpnn_response, dict) and proteinmpnn_response.get("_from_cache"):
    print("ProteinMPNN result came from cache")

[2025-11-19 18:39:02] start ProteinMPNN predict
[2025-11-19 18:39:02] end ProteinMPNN -> 200
ProteinMPNN result came from cache


In [32]:
def parse_mfasta(mfasta_str):
    """
    Parse a multi-FASTA string into:
    [
      {"idx": 0, "header": ">", "sequence": "ACDE..."},
      ...
    ]
    """
    records = []
    header = None
    seq_lines = []

    for line in mfasta_str.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # flush previous record
            if header is not None:
                records.append({
                    "idx": len(records),
                    "header": header,
                    "sequence": "".join(seq_lines),
                })
            header = line[1:].strip()  # drop '>'
            seq_lines = []
        else:
            seq_lines.append(line)

    # flush last record
    if header is not None:
        records.append({
            "idx": len(records),
            "header": header,
            "sequence": "".join(seq_lines),
        })

    return records

In [33]:
if rc == 200:
    mfasta = proteinmpnn_response["mfasta"]
    scores = proteinmpnn_response.get("scores", None)
    probs = proteinmpnn_response.get("probs", None)

    records = parse_mfasta(mfasta)

    print(f"\nProteinMPNN returned {len(records)} design(s):\n")

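    for i, rec in enumerate(records):
        # NOTE: scores[i] is paired with record i by position. The first record is the
        # "input" sequence, so if `scores` covers only the sampled designs, the appended
        # value belongs to the next record (compare the header scores in the output).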
        score_str = ""
        if isinstance(scores, (list, tuple)) and i < len(scores):
            score_str = f" score={scores[i]:.3f}" if isinstance(scores[i], (int, float)) else f" score={scores[i]}"

        header = rec["header"] or f"design_{i+1}"
        print(f">{header}{score_str}")
        print(rec["sequence"])
        print()
else:
    print(f"Unexpected HTTP status: {rc}")
    print("Response:", proteinmpnn_response)


ProteinMPNN returned 21 design(s):

>input, score=2.6207, global_score=2.6207, fixed_chains=[], designed_chains=['A'], model_name=v_48_002, git_hash=unknown, seed=299 score=0.991
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEYVVLKNEMARANHYEDYGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

>T=0.1, sample=1, score=0.9907, global_score=0.9907, seq_recovery=0.0923 score=1.035
MKELSKDMKALFKSILGDESIENVYKEIHKVYKEVEINGRKFVIADGTLKDEEVEEILNEIAKKLGYKSWKESGKHFSLFEKKLEGELSSLKVSEKVYQIETKYEDIDLVCMAISKEDGKVEYYLFRNNL

>T=0.1, sample=2, score=1.0352, global_score=1.0352, seq_recovery=0.1077 score=1.027
MKKIDKSMETLFLSILGDESIKNVYKKIYEVYKTVELDGYTFVIAKGELKDEEVEKILNAIAKKLGYESYEKSGKHFSIFTGKVEGELSSKEVSEKIYQIETKYENISLIAMEISKENGEVNYYLFKNNL

>T=0.1, sample=3, score=1.0273, global_score=1.0273, seq_recovery=0.0615 score=0.977
MKEIEESTKTLFLSVLGDESIENVYKKVFKVYKEVELDGRKFVIAEAELEREEVEEVLDKIAKKLGYKSWKESGSLFSLFTKEVKDELSSKKVSDEIFEIETKYEDISLIAMEISREDGKVKYYLFKNKL

>T=0.1, sample=4, score=0.9769, globa

In [34]:
import py3Dmol

# Structure from RFdiffusion / ProteinMPNN input
pdb_str = proteinmpnn_query["input_pdb"]

view = py3Dmol.view(width=800, height=600)
view.addModel(pdb_str, "pdb")
view.setStyle({"cartoon": {"color": "spectrum"}})
view.setBackgroundColor("black")
view.zoomTo()
view.show()

# Below: sequences from ProteinMPNN
records = parse_mfasta(proteinmpnn_response["mfasta"])
scores = proteinmpnn_response.get("scores", None)

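for i, rec in enumerate(records):
    # NOTE: positional pairing; the appended score may belong to the next record
    # because the first record is the "input" sequence (compare the header scores).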
    score_str = ""
    if isinstance(scores, (list, tuple)) and i < len(scores):
        score_str = f" score={scores[i]:.3f}" if isinstance(scores[i], (int, float)) else f" score={scores[i]}"
    print(f">{rec['header']}{score_str}")
    print(rec["sequence"])
    print()

>input, score=2.6207, global_score=2.6207, fixed_chains=[], designed_chains=['A'], model_name=v_48_002, git_hash=unknown, seed=299 score=0.991
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEYVVLKNEMARANHYEDYGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

>T=0.1, sample=1, score=0.9907, global_score=0.9907, seq_recovery=0.0923 score=1.035
MKELSKDMKALFKSILGDESIENVYKEIHKVYKEVEINGRKFVIADGTLKDEEVEEILNEIAKKLGYKSWKESGKHFSLFEKKLEGELSSLKVSEKVYQIETKYEDIDLVCMAISKEDGKVEYYLFRNNL

>T=0.1, sample=2, score=1.0352, global_score=1.0352, seq_recovery=0.1077 score=1.027
MKKIDKSMETLFLSILGDESIKNVYKKIYEVYKTVELDGYTFVIAKGELKDEEVEKILNAIAKKLGYESYEKSGKHFSIFTGKVEGELSSKEVSEKIYQIETKYENISLIAMEISKENGEVNYYLFKNNL

>T=0.1, sample=3, score=1.0273, global_score=1.0273, seq_recovery=0.0615 score=0.977
MKEIEESTKTLFLSVLGDESIENVYKKVFKVYKEVELDGRKFVIAEAELEREEVEEVLDKIAKKLGYKSWKESGSLFSLFTKEVKDELSSKKVSDEIFEIETKYEDISLIAMEISREDGKVKYYLFKNKL

>T=0.1, sample=4, score=0.9769, global_score=0.9769, seq_recovery=0.1077 s

In the next step, we'll extract the designed sequences from the multi-FASTA output returned by ProteinMPNN. Then, we'll create binder-target pairs that we can feed to AlphaFold2-Multimer to predict the binder-target complex structure.

In [35]:
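# NOTE: [2:] drops the first two non-header lines. With the output shown above, those are
# the "input" record and the first sampled design, so the first pair uses sample 2.
fasta_sequences = [x.strip() for x in proteinmpnn_response["mfasta"].split("\n") if '>' not in x][2:]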

binder_target_pairs = [[binder, example.target_sequence] for binder in fasta_sequences]

print(f"Generated {len(fasta_sequences)} FASTA sequences and {len(binder_target_pairs)} binder-target pairs.")

Generated 20 FASTA sequences and 20 binder-target pairs.
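
Index-based slicing depends on the exact line layout of the `mfasta` string (leading or trailing blank lines, wrapped sequences). As a hedged alternative, the sketch below selects designs by header instead. It assumes that the original backbone sequence is the record whose header starts with `input`, as in the output shown earlier; `design_sequences`, `toy_mfasta`, and `toy_target` are hypothetical names used only for this example.

```python
def design_sequences(mfasta, skip_header_prefix="input"):
    """Return the sequences of sampled designs from a multi-FASTA string.

    Records whose header starts with `skip_header_prefix` are skipped
    (assumed here to be the original backbone sequence), as are empty records.
    """
    records = []  # list of [header, sequence] pairs
    for line in mfasta.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            # Sequences may be wrapped over several lines; join them.
            records[-1][1] += line
    return [seq for header, seq in records
            if seq and not header.startswith(skip_header_prefix)]

# Toy multi-FASTA with a wrapped sequence and a trailing blank line.
toy_mfasta = (
    ">input, score=2.6\nGGGG\n"
    ">T=0.1, sample=1\nMKEL\nSKDM\n"
    ">T=0.1, sample=2\nMKKI\n\n"
)
toy_target = "STIEEQ"

binders = design_sequences(toy_mfasta)
pairs = [[binder, toy_target] for binder in binders]
print(f"Generated {len(binders)} designs and {len(pairs)} binder-target pairs.")
```

Selecting by header keeps every sampled design and never produces an empty binder, regardless of how the service pads the string.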


# AlphaFold2-Multimer

AlphaFold2-Multimer is a deep learning model that extends the AlphaFold2 pipeline to predict the combined structure of a list of input peptide sequences.

**Inputs**:

- `sequences`: A list of peptide sequences. For this use case, it is a single pair of sequences (one peptide chain from the ProteinMPNN result plus the original protein sequence used as input to this workflow).
- `algorithm`: The algorithm used for Multiple Sequence Alignment (MSA). This can be either `jackhmmer` or `mmseqs2`; MMseqs2 is significantly faster.

**Output**:

- A list of lists of predicted structures in PDB format. A list of five predictions is returned for each input binder-target pair.
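
As a rough sketch of how one binder-target pair could be turned into a request body, the helper below validates the two sequences and assembles a payload. The `sequences` and `algorithm` keys follow the input list above, and `selected_models` mirrors the request used later in this notebook; whether the service accepts exactly this shape is an assumption, and `build_multimer_payload` is a hypothetical helper, not part of any NIM client library.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")       # the 20 standard amino acids
VALID_ALGORITHMS = ("jackhmmer", "mmseqs2")  # MSA options named in the text

def build_multimer_payload(binder, target, algorithm="mmseqs2", selected_models=None):
    """Validate a binder-target pair and build a request body (assumed schema)."""
    for name, seq in (("binder", binder), ("target", target)):
        if not seq:
            raise ValueError(f"{name} sequence is empty")
        bad = sorted(set(seq.upper()) - VALID_AA)
        if bad:
            raise ValueError(f"{name} sequence has non-standard residues: {bad}")
    if algorithm not in VALID_ALGORITHMS:
        raise ValueError(f"algorithm must be one of {VALID_ALGORITHMS}")
    payload = {"sequences": [binder, target], "algorithm": algorithm}
    if selected_models is not None:
        # Optional: restrict the prediction to a subset of the models.
        payload["selected_models"] = list(selected_models)
    return payload

example_payload = build_multimer_payload("MKELSK", "STIEEQ", selected_models=[1])
print(example_payload)
```

Validating before the POST is cheap compared with a prediction that can run for many minutes, so an empty or malformed binder is caught before any GPU time is spent.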

In [67]:
import requests
import time
from datetime import datetime

AF2_URL = "http://10.233.73.177:8000/protein-structure/alphafold2/multimer/predict-structure-from-sequences"

def query_af2_multimer_debug(multimer_query, max_wait=7200):
    """
    Debug client for AF2 multimer using the exact URL query_nim uses.
    Handles both synchronous (200) and async (202+poll) patterns.
    """
    headers = {
        "Content-Type": "application/json",
        # Add auth header here if your NIM needs it, e.g.:
        # "Authorization": f"Bearer {NVIDIA_API_KEY}",
    }

    print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] POST {AF2_URL}")
    try:
        resp = requests.post(AF2_URL, json=multimer_query, headers=headers, timeout=None)
    except Exception as e:
        print("Initial POST failed:", repr(e))
        return -1, {"error": str(e)}

    print("Initial status:", resp.status_code)

    # Synchronous success
    if resp.status_code in (200, 201):
        try:
            return resp.status_code, resp.json()
        except Exception:
            return resp.status_code, resp.text

    # Non-async error
    if resp.status_code != 202:
        print("Unexpected status from AF2 POST:", resp.status_code)
        print("Body (truncated):", resp.text[:500])
        return resp.status_code, resp.text

    # Async: 202 -> figure out where to poll
    try:
        data = resp.json()
    except Exception:
        data = {}

    poll_url = (
        resp.headers.get("Location")
        or data.get("location")
        or data.get("result_url")
    )
    task_id = data.get("id") or data.get("task_id")

    if not poll_url and task_id:
        # adjust if your NIM uses a different task URL
        poll_url = f"{AF2_URL.rsplit('/', 1)[0]}/tasks/{task_id}"

    if not poll_url:
        print("Got 202 but no poll URL or task id; body:", data)
        return resp.status_code, data

    print("Polling URL:", poll_url)

    t0 = time.time()
    last_status = None
    while True:
        elapsed = time.time() - t0
        if max_wait is not None and elapsed > max_wait:
            print(f"Max wait {max_wait}s exceeded, stopping poll.")
            return -1, {"error": "max_wait exceeded", "last_status": last_status}

        try:
            poll_resp = requests.get(poll_url, headers=headers, timeout=None)
        except Exception as e:
            print("Poll request failed:", repr(e))
            return -1, {"error": str(e)}

        last_status = poll_resp.status_code
        print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] poll -> {poll_resp.status_code}")

        if poll_resp.status_code in (200, 201):
            try:
                return poll_resp.status_code, poll_resp.json()
            except Exception:
                return poll_resp.status_code, poll_resp.text

        if poll_resp.status_code >= 400 and poll_resp.status_code != 202:
            print("Error from poll endpoint:", poll_resp.status_code)
            print("Body (truncated):", poll_resp.text[:500])
            return poll_resp.status_code, poll_resp.text

        time.sleep(30)
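
The polling loop above can also be factored into a small transport-agnostic helper, which makes the timeout logic testable without a running service. This is a minimal sketch under stated assumptions: `poll_until` and its parameters are hypothetical names, and `fetch` stands in for whatever function performs the GET and returns a `(status_code, body)` tuple.

```python
import time

def poll_until(fetch, max_wait=7200.0, interval=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns a status other than 202 or the deadline passes.

    fetch must return a (status_code, body) tuple. On timeout, returns
    (-1, {...}) in the same style as the debug client above.
    """
    deadline = clock() + max_wait
    while True:
        status, body = fetch()
        if status != 202:
            # Finished (success or error); the caller inspects the status.
            return status, body
        if clock() >= deadline:
            return -1, {"error": "max_wait exceeded", "last_status": status}
        sleep(interval)

# Usage with a fake endpoint that is "still running" twice, then done.
responses = iter([(202, None), (202, None), (200, {"ok": True})])
result = poll_until(lambda: next(responses), max_wait=10, interval=0)
print(result)
```

`time.monotonic` is used for the deadline because it is unaffected by system clock adjustments, and injecting `sleep` and `clock` lets the timeout path be exercised instantly in a test.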

In [68]:
url = "http://10.233.73.177:8000/v1/health/ready"
r = requests.get(url, timeout=10)
print(r.status_code, r.text)

200 {"status":"ready"}


In [84]:
import requests, json, time
from datetime import datetime
url = "http://10.233.73.177:8000/protein-structure/alphafold2/multimer/predict-structure-from-sequences"

sequences = binder_target_pairs[0] # ["MNVIDIAIAMAI", "IAMNVIDIAAI"]

headers = {"content-type": "application/json"}
data = {
    "sequences": sequences, 
    "selected_models": [1],
    }

print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] BEFORE POST")
t0 = time.time()
try:
    # stream is left at its default (False), so requests downloads the full body before post() returns
    resp = requests.post(
        url, 
        headers=headers, 
        json=data,
        timeout=None,
    )
    t1 = time.time()
    print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] AFTER POST (elapsed {t1 - t0:.1f}s)")
    print("Status:", resp.status_code)
    print("Headers:", dict(resp.headers))
    # Try to read a tiny chunk from the body
    try:
        chunk = next(resp.iter_content(chunk_size=1024), b"")
        print("First chunk size:", len(chunk))
    except StopIteration:
        print("No body data (empty response)")
except Exception as e:
    t1 = time.time()
    print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] POST raised after {t1 - t0:.1f}s: {e!r}")

[2025-11-19 21:07:29] BEFORE POST
[2025-11-19 21:42:00] AFTER POST (elapsed 2071.3s)
Status: 200
Headers: {'date': 'Wed, 19 Nov 2025 21:07:28 GMT', 'server': 'uvicorn', 'content-length': '4780256', 'content-type': 'application/json'}
First chunk size: 1024


In [88]:
resp.raise_for_status()
body = resp.content.decode("utf-8", errors="replace")
print(f"Body length: {len(body)}")
print(f"Body start: {repr(body[:200])}")

Body length: 4780256
Body start: '["ATOM      1  N   MET A   1      -5.521  -1.624  -3.788  1.00 34.39           N  \\nATOM      2  H   MET A   1      -5.588  -1.042  -4.611  1.00 34.39           H  \\nATOM      3  H2  MET A   1      -6'


In [89]:
import json
import re

def extract_first_json_value(body: str):
    # 1) Try normal JSON first
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        pass

    # 2) Try newline-delimited JSON (NDJSON style)
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            continue

    # 3) Try to locate the first JSON-looking chunk and raw-decode it
    m = re.search(r"[\[\{]", body)
    if m:
        start = m.start()
        snippet = body[start:]
        dec = json.JSONDecoder()
        try:
            obj, _ = dec.raw_decode(snippet)
            return obj
        except json.JSONDecodeError:
            pass

    raise ValueError("Could not extract a JSON value from response body")

In [92]:
payload = extract_first_json_value(body)
# grab the first PDB from the payload
pdb_str = payload[0]

print("First 200 chars:", repr(pdb_str[:200]))
print("First line:", pdb_str.splitlines()[0])
print("ATOM lines:", sum(1 for ln in pdb_str.splitlines()
                        if ln.startswith(("ATOM", "HETATM"))))
# for py3Dmol cells that do [0]
multimer_alphafold2_response = [pdb_str]

# for your pLDDT loop: multimer_results[idx][0]
multimer_results = [(pdb_str,)]

First 200 chars: 'ATOM      1  N   MET A   1      -5.521  -1.624  -3.788  1.00 34.39           N  \nATOM      2  H   MET A   1      -5.588  -1.042  -4.611  1.00 34.39           H  \nATOM      3  H2  MET A   1      -6.349'
First line: ATOM      1  N   MET A   1      -5.521  -1.624  -3.788  1.00 34.39           N  
ATOM lines: 11657


### Assessing the predicted binders and structures

There are many metrics that can be used to assess the quality of the predicted binder-target structure. The predicted local distance difference test (pLDDT) is a measure of per-residue confidence in the local structure. It ranges from 0 to 100, with higher scores indicating higher confidence. AlphaFold2 writes pLDDT into the B-factor column of its PDB output, which is where the function below reads it from.

The following snippet ranks the results of the binder-target pair AlphaFold2-Multimer predictions by their pLDDT.

In [93]:
# Function to calculate average pLDDT over all residues 
def calculate_average_pLDDT(pdb_string):
    total_pLDDT = 0.0
    atom_count = 0
    pdb_lines = pdb_string.splitlines()
    for line in pdb_lines:
        # PDB atom records start with "ATOM"
        if line.startswith("ATOM"):
            atom_name = line[12:16].strip() # Extract atom name
            if atom_name == "CA":  # Only consider atoms with name "CA"
                try:
                    # Extract the B-factor value from columns 61-66 (following PDB format specifications)
                    pLDDT = float(line[60:66].strip())
                    total_pLDDT += pLDDT
                    atom_count += 1
                except ValueError:
                    pass  # Skip lines where B-factor can't be parsed as a float

    if atom_count == 0:
        return 0.0  # Return 0 if no CA atoms were found

    average_pLDDT = total_pLDDT / atom_count
    return average_pLDDT
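
A single average over the whole complex can hide a poorly predicted binder behind a well-predicted target. As a hedged complement, the sketch below computes the same CA-based average per chain, reading the chain ID from column 22 of each `ATOM` record. `per_chain_plddt` and the toy records are assumptions made for this example; which chain ID corresponds to the binder should be checked against the actual PDB output.

```python
from collections import defaultdict

def per_chain_plddt(pdb_string):
    """Average the B-factor (pLDDT) of CA atoms separately for each chain."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in pdb_string.splitlines():
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        try:
            value = float(line[60:66])  # B-factor, columns 61-66
        except ValueError:
            continue  # skip records with an unparsable B-factor
        chain_id = line[21]  # chain identifier, column 22
        totals[chain_id] += value
        counts[chain_id] += 1
    return {chain: totals[chain] / counts[chain] for chain in counts}

def _atom(serial, name, chain, resseq, bfactor):
    """Build a toy fixed-width ATOM record (coordinates are dummies)."""
    return (f"ATOM  {serial:5d} {name:<4s} ALA {chain}{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfactor:6.2f}")

toy_pdb = "\n".join([
    _atom(1, "N", "A", 1, 10.0),   # not a CA atom, ignored
    _atom(2, "CA", "A", 1, 80.0),
    _atom(3, "CA", "A", 2, 90.0),
    _atom(4, "CA", "B", 1, 50.0),
])
chain_scores = per_chain_plddt(toy_pdb)
print(chain_scores)
```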


In [94]:
plddts = []
for idx in range(0, len(multimer_results)):
    if multimer_results[idx] is not None:
        plddts.append(calculate_average_pLDDT(multimer_results[idx][0]))

In [95]:
## Combine the results with their pLDDTs
binder_target_results = list(zip(binder_target_pairs, multimer_results, plddts))

## Sort the results by pLDDT, highest first
sorted_binder_target_results = sorted(binder_target_results, key=lambda x : x[2], reverse=True)

## print the top 5 results
for i in range(0, min(5, len(sorted_binder_target_results))):
    print("-"*80)
    print(f"rank: {i}")
    print(f"binder: {sorted_binder_target_results[i][0][0]}")
    print(f"target: {sorted_binder_target_results[i][0][1]}")
    print(f"pLDDT: {sorted_binder_target_results[i][2]}")
    print("-"*80)

--------------------------------------------------------------------------------
rank: 0
binder: MKKIDKSMETLFLSILGDESIKNVYKKIYEVYKTVELDGYTFVIAKGELKDEEVEKILNAIAKKLGYESYEKSGKHFSIFTGKVEGELSSKEVSEKIYQIETKYENISLIAMEISKENGEVNYYLFKNNL
target: STIEEQAKTFLDKFNHEAEDLFYQSSLASWNYNTNITEENVQNMNNAGDKWSAFLKEQSTLAQMYPLQEIQNLTVKLQLQALQQNGSSVLSEDKSKRLNTILNTMSTIYSTGKVCNPDNPQECLLLEPGLNEIMANSLDYNERLWAWESWRSEVGKQLRPLYEEYVVLKNEMARANHYEDYGDYWRGDYEVNGVDGYDYSRGQLIEDVEHTFEEIKPLYEHLHAYVRAKLMNAYPSYISPIGCLPAHLLGDMWGRFWTNLYSLTVPFGQKPNIDVTDAMVDQAWDAQRIFKEAEKFFVSVGLPNMTQGFWENSMLTDPGNVQKAVCHPTAWDLGKGDFRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGFHEAVGEIMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKGEIPKDQWMKKWWEMKREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQEALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRLGKSEPWTLALENVVGAKNMNVRPLLNYFEPLFTWLKDQNKNSFVGWSTDWSPYAD
pLDDT: 78.20869325997245
--------------------------------------------------------------------------------


Only one binder-target pair (`binder_target_pairs[0]`) was submitted in this run, so the ranking contains a single entry.

In [96]:
view = py3Dmol.view(width=800, height=600)
view.addModel(multimer_alphafold2_response[0], "pdb")
view.setStyle({"stick": {}})
view.setBackgroundColor("black")
view.zoomTo()
view.show()

In [105]:
import py3Dmol
import ipywidgets as widgets
from IPython.display import display

def show_pdb_fancy(pdb_str: str,
                   width: int = 800,
                   height: int = 600,
                   cartoon_color: str = "spectrum",
                   bg_color: str = "black"):
    """
    Robust version for JupyterHub:
      - Re-renders the viewer into an Output widget on every change
      - Style and zoom controls work even if live py3Dmol updates don't
    """
    style_dd = widgets.Dropdown(
        options=["cartoon", "sticks", "lines", "cartoon+sticks", "surface"],
        value="cartoon",
        description="Style:",
    )

    zoom_slider = widgets.FloatSlider(
        value=1.0,
        min=0.3,
        max=2.5,
        step=0.1,
        description="Zoom:",
        continuous_update=False,
    )

    view_out = widgets.Output()

    def render():
        with view_out:
            view_out.clear_output()

            view = py3Dmol.view(width=width, height=height)
            view.addModel(pdb_str, "pdb")
            view.setBackgroundColor(bg_color)

            # XYZ axes
            axis_len = 20.0
            radius = 0.3

            # X axis (red)
            view.addCylinder({
                "start": {"x": 0, "y": 0, "z": 0},
                "end":   {"x": axis_len, "y": 0, "z": 0},
                "radius": radius,
                "color": "red",
            })
            view.addLabel("X", {
                "position": {"x": axis_len + 2, "y": 0, "z": 0},
                "backgroundOpacity": 0.0,
                "fontColor": "red",
            })

            # Y axis (green)
            view.addCylinder({
                "start": {"x": 0, "y": 0, "z": 0},
                "end":   {"x": 0, "y": axis_len, "z": 0},
                "radius": radius,
                "color": "green",
            })
            view.addLabel("Y", {
                "position": {"x": 0, "y": axis_len + 2, "z": 0},
                "backgroundOpacity": 0.0,
                "fontColor": "green",
            })

            # Z axis (blue)
            view.addCylinder({
                "start": {"x": 0, "y": 0, "z": 0},
                "end":   {"x": 0, "y": 0, "z": axis_len},
                "radius": radius,
                "color": "blue",
            })
            view.addLabel("Z", {
                "position": {"x": 0, "y": 0, "z": axis_len + 2},
                "backgroundOpacity": 0.0,
                "fontColor": "blue",
            })

            # Apply style
            style = style_dd.value
            if style == "cartoon":
                view.setStyle({'model': -1}, {"cartoon": {"color": cartoon_color}})
            elif style == "sticks":
                view.setStyle({'model': -1}, {"stick": {"colorscheme": "Jmol"}})
            elif style == "lines":
                view.setStyle({'model': -1}, {"line": {}})
            elif style == "cartoon+sticks":
                view.setStyle(
                    {'model': -1},
                    {
                        "cartoon": {"color": cartoon_color},
                        "stick":   {"colorscheme": "Jmol"},
                    },
                )
            elif style == "surface":
                # no cartoon, just surface
                view.addSurface(py3Dmol.SES, {"opacity": 0.7, "color": "white"})

            view.zoomTo()
            if zoom_slider.value != 1.0:
                view.zoom(zoom_slider.value)

            view.show()

    def on_style_change(change):
        render()

    def on_zoom_change(change):
        render()

    style_dd.observe(on_style_change, names="value")
    zoom_slider.observe(on_zoom_change, names="value")

    # Initial render
    render()

    controls = widgets.HBox([style_dd, zoom_slider])
    display(controls, view_out)

In [106]:
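# Use the status of the AlphaFold2-Multimer request (`resp`), not the ProteinMPNN `rc`
if resp.status_code == 200:
    pdb_str = multimer_alphafold2_response[0]  # same as before
    show_pdb_fancy(pdb_str)
else:
    print(f"Unexpected HTTP status: {resp.status_code}")
    print(f"Response: {resp.text[:500]}")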

HBox(children=(Dropdown(description='Style:', options=('cartoon', 'sticks', 'lines', 'cartoon+sticks', 'surfac…

Output()