Todo:
- Add all prediction apps, update URLs below
- Ensure prediction apps have clear visualizations
- Double check input/output files and example output format are correct!
- AntibodyProfiler: Accept HLT format PDB files, antigen chain
- Stress-test servers, spin up capacity before workshop

Nice to have:
- Add previously computed examples to visualize on BioLib.com
- New BioLib app: chotia_to_hlt for choosing own target and antibody framework
- Antibody liability app
- Leaderboard
- "Master" BioLib app implementing all functionality in one app (easier to parallelize)

## De novo generation of antibody binders with RFantibody
[![colab.ipynb](https://img.shields.io/badge/github-%23121011.svg?logo=github)](https://github.com/mhoie/workshop/blob/main/workshop.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mhoie/workshop/blob/main/workshop.ipynb)

In this notebook, you may choose your own antibody framework and target protein structure, and design novel antibody binders. This workflow has been shown to generate weak antibody binders in the μM to nM range, with up to ~5-10% experimental success rates for some degree of binding.

---

Antibody therapeutics represent a substantial market (approximately $550M USD) with tremendous potential for treating various diseases. Traditional approaches to antibody discovery are slow and laborious, typically involving immunizing mice or screening random libraries. 

This notebook implements the RFantibody pipeline for structure-based design of de novo antibodies against a chosen target.

It takes two inputs:
- i) An input antibody framework (e.g. hu-4D5-8_Fv.pdb - a humanized antibody Fv framework, with heavy- and light-chain variable domains, already used in two FDA-approved therapies)
- ii) A target protein of interest (e.g. respiratory syncytial virus (RSV) protein) with binding site (epitope residues)

And runs the following three methods:
1. **De novo design of an antibody backbone targeting a protein of interest** - using an antibody-finetuned version of RFdiffusion ([Nature paper](https://www.nature.com/articles/s41586-023-06415-8))
2. **Design of the CDR loop residues** - using ProteinMPNN ([Science paper](https://www.science.org/doi/10.1126/science.add2187))
3. **Filtering designs on predicted structure 'self-consistency'** - using an antibody-finetuned version of RoseTTAFold2 ([Preprint](https://www.biorxiv.org/content/10.1101/2023.05.24.542179v1)), shown to correlate with significantly improved experimental success rates.

The RFantibody pipeline itself is described in detail in [this preprint](https://www.biorxiv.org/content/10.1101/2024.03.14.585103v2).

Advantages:
- Designs novel antibodies binding a target protein
- Can target most epitope regions of interest (structured regions are preferred)
- Focuses on designing antibody CDR loops (the main residues determining binding)
- Designs may be filtered by "self-consistency" of predicted structures, which correlates with experimental success rates

Current limitations:
- Generated antibodies at best tend to be weak binders (low binding affinities in μM to nM range)
- Often low experimental success rates (~5-10% for some degree of binding) - heavily dependent on filtering
- No screening for e.g. human immunogenicity

## Pre-requisites for this workshop

The only pre-requisites for this workshop are the following:
- i) Register a BioLib account on [https://biolib.com/sign-up](https://biolib.com/sign-up), for running RFantibody jobs (requires a GPU) in the cloud.
- ii) An antibody PDB and target protein structure PDB. Examples are provided below, but you may also provide your own following the instructions below.

#### i) Install pybiolib and login to biolib.com

In [1]:
# Download BioLib
!pip install --quiet --upgrade pybiolib

In [2]:
# Login with BioLib
import biolib
biolib.login()

2025-03-25 08:54:50,699 | INFO : Already signed in


#### ii) Set input antibody and target PDB
We're working with two primary input files for RFantibody, plus a list of hotspot residues:

1. **framework_pdb**: `hu-4D5-8_Fv.pdb` - A humanized antibody Fv framework (heavy- and light-chain variable domains)
2. **target_pdb**: `rsv_site3.pdb` - The target protein (RSV site 3) of interest
3. List of hotspot residues defining our epitope (target binding site)

You may also choose your own antibody framework and target PDB, prepared with this script:
https://github.com/RosettaCommons/RFantibody?tab=readme-ov-file#input-preparation

Let's download the example files (if not already present) and verify that they exist.

In [40]:
# Download input files if not already present
!wget --no-verbose -nc "https://raw.githubusercontent.com/mhoie/workshop/refs/heads/main/example_input/hu-4D5-8_Fv.pdb"
!wget --no-verbose -nc "https://raw.githubusercontent.com/mhoie/workshop/refs/heads/main/example_input/rsv_site3.pdb"

2025-03-21 14:08:02 URL:https://raw.githubusercontent.com/mhoie/workshop/refs/heads/main/example_input/hu-4D5-8_Fv.pdb [131172/131172] -> "hu-4D5-8_Fv.pdb" [1]
2025-03-21 14:08:03 URL:https://raw.githubusercontent.com/mhoie/workshop/refs/heads/main/example_input/rsv_site3.pdb [461944/461944] -> "rsv_site3.pdb" [1]


In [3]:
import os
framework_pdb = "hu-4D5-8_Fv.pdb"
target_pdb = "rsv_site3.pdb"
hotspot_res = "[T305,T456]"
print(f"Framework PDB exists: {os.path.exists(framework_pdb)}")
print(f"Target PDB exists: {os.path.exists(target_pdb)}")
print(f"(If these are missing, please download from https://github.com/mhoie/bioit-rfantibody before proceeding)")

Framework PDB exists: True
Target PDB exists: True
(If these are missing, please download from https://github.com/mhoie/bioit-rfantibody before proceeding)
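
Before submitting a job, it can also help to confirm that every hotspot residue is actually present in the target structure. The following is a minimal stdlib-only sketch, not part of RFantibody or pybiolib: the helper names `parse_hotspots`, `residues_in_pdb` and `missing_hotspots` are hypothetical, and it assumes hotspots are written as a chain letter followed by a residue number, as in `[T305,T456]`.

```python
import re

def parse_hotspots(hotspot_res):
    """Parse a string like "[T305,T456]" into [("T", 305), ("T", 456)]."""
    hotspots = []
    for item in hotspot_res.strip().strip("[]").split(","):
        match = re.fullmatch(r"([A-Za-z])(-?\d+)", item.strip())
        if match is None:
            raise ValueError(f"Cannot parse hotspot {item!r}")
        hotspots.append((match.group(1), int(match.group(2))))
    return hotspots

def residues_in_pdb(pdb_text):
    """Return the set of (chain, residue number) pairs found in ATOM records."""
    found = set()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed PDB columns: chain ID in column 22, residue number in columns 23-26
            found.add((line[21], int(line[22:26])))
    return found

def missing_hotspots(hotspot_res, pdb_text):
    """List the hotspots that do not appear in the PDB text."""
    present = residues_in_pdb(pdb_text)
    return [h for h in parse_hotspots(hotspot_res) if h not in present]

# Tiny made-up example with a single residue on chain T.
# In the notebook you would pass open(target_pdb).read() instead.
example_pdb = "ATOM      1  CA  GLY T 305      10.000  10.000  10.000  1.00  0.00"
print(missing_hotspots("[T305,T456]", example_pdb))
```

An empty list means all hotspots were found; anything else points to a typo in `hotspot_res` or a chain/numbering mismatch in the prepared target PDB.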


## Step 1 of 4: Generate antibody-antigen docking pose, with [RFdiffusion (antibody-finetuned)](https://biolib.com/BioLibDevelopment/prediction-app/)
*Estimated time: ~2-3 minutes*

This step takes the input antibody framework and target protein, and designs the 3D structure of new CDR loops in interaction with the target protein (antibody-antigen docking pose). The CDR loops are generated as backbones only (no side chains or amino acid identities), with the actual residues determined in the next step.

#### [Input parameters](https://biolib.com/BioLibDevelopment/prediction-app/)

Already set above:
- **antibody.framework_pdb**: Path to the antibody framework we're using (e.g. hu-4D5-8_Fv.pdb)
- **antibody.target_pdb**: Path to the target protein structure (e.g. rsv_site3.pdb)
- **ppi.hotspot_res**: List of hotspot residues defining our epitope (target binding site)

New parameters:
- **antibody.design_loops**: Dictionary mapping each CDR loop to a range of allowed loop lengths
  - L1, L2, L3: Light chain CDR loops
  - H1, H2, H3: Heavy chain CDR loops
  - Numbers specify length ranges (e.g., L1:8-13 means loop L1 can be 8-13 residues long)
  - Example: `[L1:8-13,L2:7,L3:9-11,H1:7,H2:6,H3:5-13]`
- **inference.num_designs**: Number of designs to generate (e.g. 20; the cell below uses 1)
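
To make the `design_loops` syntax concrete, here is a small illustrative parser. It is a sketch only: `parse_design_loops` is a hypothetical helper, not an RFantibody function, and it handles just the two forms shown above (a range such as `L1:8-13` and a fixed length such as `L2:7`).

```python
def parse_design_loops(design_loops):
    """Parse "[L1:8-13,L2:7]" into {"L1": (8, 13), "L2": (7, 7)}."""
    loops = {}
    for item in design_loops.strip().strip("[]").split(","):
        name, _, length = item.strip().partition(":")
        low, _, high = length.partition("-")
        # A single number (e.g. "L2:7") means a fixed loop length
        loops[name] = (int(low), int(high or low))
    return loops

loops = parse_design_loops("[L1:8-13,L2:7,L3:9-11,H1:7,H2:6,H3:5-13]")
for name, (low, high) in loops.items():
    kind = "fixed" if low == high else "variable"
    print(f"{name}: {low}-{high} residues ({kind})")
```

Printing the parsed ranges before launching a job is a cheap way to catch a mistyped loop name or range.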


In [5]:
# Input parameters (antibody framework and target PDBs are set above)
design_loops = "[L1:8-13,L2:7,L3:9-11,H1:7,H2:6,H3:5-13]"
num_designs = 1

# Output directory
outdir_rfdiff = "output/rfdiffusion"

# Run RFdiffusion through BioLib
app_rfdiff = biolib.load('BioITWorkshop/RFDiffusionAntibody')  # Replace with actual RFantibody app ID
job_rfdiff = app_rfdiff.start(
    target_pdb=target_pdb,
    framework_pdb=framework_pdb,
    hotspot_res=hotspot_res,
    design_loops=design_loops,
    num_designs=num_designs,
)

2025-03-25 08:57:34,527 | INFO : Loaded project BioITWorkshop/RFDiffusionAntibody:0.0.2
2025-03-25 08:57:38,097 | INFO : View the result in your browser at: https://biolib.com/results/14a4d058-2126-412c-a42f-821dbf471bf0/


#### Wait for RFdiffusion output files (~2-3 minutes)
The RFdiffusion step generates PDB files containing the antibody framework with designed CDR loops docked to the target protein. At this stage, the CDR loops have backbone structures but no amino acid sequences yet!

Output files:
- ab_rfdiffusion_output.pdb - Antibody PDB backbone (N, Ca, C, O atoms only), lacking the CDR loop residues (which will be predicted in the next step)

Example output format:
```pdb
ATOM      1  N   GLU H   1      23.793  -8.718 -21.757  1.00  1.00
ATOM      2  CA  GLU H   1      23.755  -8.421 -20.330  1.00  1.00
ATOM      3  C   GLU H   1      23.563  -6.931 -20.082  1.00  1.00
ATOM      4  O   GLU H   1      23.856  -6.105 -20.947  1.00  1.00
ATOM      5  N   VAL H   2      22.855  -6.630 -18.891  1.00  1.00
```
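
As a quick sanity check on this format, the sketch below reports any atoms other than N, CA, C and O in a PDB text. It assumes standard fixed-column PDB records, and `non_backbone_atoms` is a hypothetical helper name; real RFdiffusion output may contain additional records that this simple check does not consider.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}

def non_backbone_atoms(pdb_text):
    """Return the atom names in ATOM records that are not N, CA, C or O."""
    extra = set()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atom_name = line[12:16].strip()  # PDB columns 13-16
            if atom_name not in BACKBONE_ATOMS:
                extra.add(atom_name)
    return extra

example = """\
ATOM      1  N   GLU H   1      23.793  -8.718 -21.757  1.00  1.00
ATOM      2  CA  GLU H   1      23.755  -8.421 -20.330  1.00  1.00
ATOM      3  C   GLU H   1      23.563  -6.931 -20.082  1.00  1.00
ATOM      4  O   GLU H   1      23.856  -6.105 -20.947  1.00  1.00
"""
print(non_backbone_atoms(example) or "backbone atoms only")
```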

In [None]:
# Try to save job output files
#job_rfdiff = biolib.get_job("ff4b0442-acb9-4c44-95fe-28e87b692e7b") # DEBUG
status = job_rfdiff.get_status()
if status == "completed":
    print("Saving files...")
    job_rfdiff.list_output_files()
    job_rfdiff.save_files(outdir_rfdiff)
else:
    print(f"Job status is not completed ({status}), please wait a moment (or try again if failed)")

Saving files...
2025-03-25 09:02:17,858 | INFO : Saving 4 files to output/rfdiffusion...
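
The cell above checks the job status once, so it may need to be re-run until the job finishes. As an alternative, the sketch below polls in a loop. It relies only on `get_status()`, which is already used above; `wait_for_job` is a hypothetical helper, and the failure status strings are assumptions that should be adjusted to whatever BioLib actually reports.

```python
import time

def wait_for_job(job, poll_seconds=15, timeout_seconds=1800,
                 failed_states=("failed", "cancelled")):
    """Poll job.get_status() until it reports "completed".

    `failed_states` is an assumption: adjust it to the status strings
    that BioLib actually reports for unsuccessful jobs.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = job.get_status()
        if status == "completed":
            return status
        if status in failed_states:
            raise RuntimeError(f"Job ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)

# Usage in this notebook (not run here):
# wait_for_job(job_rfdiff)
# job_rfdiff.save_files(outdir_rfdiff)
```

The same helper could be reused for the ProteinMPNN, RF2 and AntibodyProfiler jobs in the later steps.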


## Step 2 of 4: Design binding CDR loop residues with [ProteinMPNN](https://biolib.com/BioLibDevelopment/prediction-app/)
*Estimated time: <1 minute*

The second step takes the docked poses generated by RFdiffusion and assigns amino acid sequences to the CDR loops using ProteinMPNN. We use the base version of ProteinMPNN (not an antibody-finetuned model) with a wrapper script that restricts design to the CDR loops.

#### [Input parameters](https://biolib.com/BioLibDevelopment/prediction-app/)
- **pdbdir**: Directory containing the previous RFdiffusion output PDB files

In [44]:
# Input directory
input_dir = outdir_rfdiff  # Using the output from RFdiffusion

# Output directory
outdir_mpnn = "output/proteinmpnn"

# Run ProteinMPNN
app_mpnn = biolib.load('BioITWorkshop/ProteinMPNNAb')  # Replace with actual app ID
job_mpnn = app_mpnn.start(
    pdb=input_dir,
    num_seq_per_struct=3
)

2025-03-25 12:07:34,473 | INFO : Loaded project BioITWorkshop/ProteinMPNNAb:0.0.3
2025-03-25 12:07:38,260 | INFO : View the result in your browser at: https://biolib.com/results/1733e9ae-f906-4e20-ab85-0d08334b6119/


#### Wait for ProteinMPNN output files (<1 minute)
ProteinMPNN outputs PDB files with the same structures as the input but with amino acid sequences designed for the CDR loops. By default, it provides one sequence per input structure; the cell above requests three with `num_seq_per_struct=3`.

Output files:
- example.pdb (antibody structure with predicted CDR residues)

Example output:
```pdb
ATOM      1  N   GLU H   1      23.793  -8.718 -21.757  1.00  0.00
ATOM      2  CA  GLU H   1      23.755  -8.421 -20.330  1.00  0.00
ATOM      3  C   GLU H   1      23.563  -6.931 -20.082  1.00  0.00
ATOM      4  O   GLU H   1      23.856  -6.105 -20.947  1.00  0.00
ATOM      5  N   VAL H   2      22.855  -6.630 -18.891  1.00  0.00
ATOM      6  CA  VAL H   2      22.864  -5.216 -18.533  1.00  0.00
```
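
To compare designed sequences across output files, it can be handy to read the one-letter sequence of each chain straight from the PDB text. The sketch below does this using only CA atoms and the standard 20 amino acids; `chain_sequences` is a hypothetical helper, and any non-standard residue name is reported as `X`.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def chain_sequences(pdb_text):
    """Return {chain ID: one-letter sequence}, using one CA atom per residue."""
    sequences = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]       # PDB column 22
            residue = line[17:20]  # PDB columns 18-20
            # Unknown residue names are shown as "X"
            sequences[chain] = sequences.get(chain, "") + THREE_TO_ONE.get(residue, "X")
    return sequences

example = """\
ATOM      1  N   GLU H   1      23.793  -8.718 -21.757  1.00  0.00
ATOM      2  CA  GLU H   1      23.755  -8.421 -20.330  1.00  0.00
ATOM      6  CA  VAL H   2      22.864  -5.216 -18.533  1.00  0.00
"""
print(chain_sequences(example))
```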

In [59]:
# Try to save job output files
status = job_mpnn.get_status()
if status == "completed":
    job_mpnn.save_files(outdir_mpnn)
else:
    print(f"Job status is not completed ({status}), please wait a moment (or try again if failed)")

Job status is not completed (in_progress), please wait a moment (or try again if failed)


## Step 3 of 4: Filter designs for predicted structure self-consistency, with [RoseTTAFold2 (antibody-finetuned)](https://biolib.com/BioLibDevelopment/prediction-app/)
*Estimated time: ~1-2 minutes*

The third step uses an antibody-finetuned version of RoseTTAFold2 (RF2) to predict the structure of the designed sequences and assess whether RF2 is confident that the sequence will bind as designed.

The RFantibody protocol recommends filtering on the following metrics, which have been reported to improve experimental success rates by up to 10x:
- RF2 predicted alignment error (pAE) < 10
- RMSD between design and RF2 prediction < 2Å
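
The two thresholds above can be applied with a simple filter, sketched below. The design names and scores are made up for illustration, and the dictionary keys `pae` and `rmsd` are hypothetical; the actual RF2 output format may name or store these values differently.

```python
def passes_rf2_filters(pae, rmsd, max_pae=10.0, max_rmsd=2.0):
    """Apply the two thresholds listed above (pAE < 10, RMSD < 2 Å)."""
    return pae < max_pae and rmsd < max_rmsd

# Hypothetical scores for three designs; real values come from the RF2 output
designs = [
    {"name": "design_0", "pae": 6.2, "rmsd": 1.1},
    {"name": "design_1", "pae": 12.5, "rmsd": 0.9},
    {"name": "design_2", "pae": 8.0, "rmsd": 3.4},
]
kept = [d["name"] for d in designs if passes_rf2_filters(d["pae"], d["rmsd"])]
print(kept)
```

Only `design_0` satisfies both criteria here: `design_1` fails on pAE and `design_2` fails on RMSD.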

### [Input parameters](https://biolib.com/BioLibDevelopment/prediction-app/)

- **input.pdb_dir**: Directory containing the PDB files from ProteinMPNN
- **num_recycles**: Number of recycling iterations in the model (default: 10; the cell below uses 3)
- **hotspot**: Percentage of hotspots provided to the model (default: 10%)

In [20]:
# Input directory
input_dir = outdir_mpnn  # Using the output from ProteinMPNN

# Output directory
outdir_rf2 = "output/rosettafold2"

# Run RosettaFold2
app_rf2 = biolib.load('BioITWorkshop/RF2Antibody')  # Replace with actual app ID
job_rf2 = app_rf2.start(
    pdb=input_dir,
    num_recycles=3,
)

2025-03-25 09:07:27,249 | INFO : Loaded project BioITWorkshop/RF2Antibody:0.0.3
2025-03-25 09:07:29,278 | INFO : View the result in your browser at: https://biolib.com/results/67e46298-a095-4549-9b44-97bfcccd12a8/


#### Wait for RoseTTAFold2 output files (~1-2 minutes)
RoseTTAFold2 predicts the structure of the designed antibodies and provides confidence metrics. We can use these to filter for promising designs.

Output files:
- example.pdb - Predicted structure

Example output:
```
PLACEHOLDER
```

In [21]:
# Try to save job output files
status = job_rf2.get_status()
if status == "completed":
    job_rf2.save_files(outdir_rf2)
else:
    print(f"Job status is not completed ({status}), please wait a moment (or try again if failed)")

2025-03-25 09:16:42,313 | INFO : Saving 2 files to output/rosettafold2...


## Step 4 of 4: Assess designs for pharmaceutical liabilities (AntibodyProfiler)

*Estimated time: <1 minute*

Here, we use our in-house AntibodyProfiler tool to assess how closely the molecular properties of our designed antibodies match those of already approved therapeutic antibodies. Designs outside this range may carry a higher risk of pharmaceutical liabilities, so this provides another early filtering step before designs are sent to the lab for experimental validation.

We recommend selecting designs within 2-sigma deviation of approved antibodies for the following metrics:
- Total CDR length between X and Y 
- CDR3 heavy chain length between X and Y 
- Patches of surface hydrophobicity in CDR vicinity (PSH) between X and Y 
- Patches of positive charge in CDR vicinity (PPC) between X and Y
- Patches of negative charge in CDR vicinity (PNC) between X and Y 
- Structural fragment variable charge symmetry parameter (SFvCSP) between X and Y

### [Input parameters](https://biolib.com/BioLibDevelopment/prediction-app/)

- **pdb**: Directory containing the predicted structure PDB files from RoseTTAFold2

In [27]:
# Input directory
input_dir = outdir_rf2  # Using the output from RosettaFold2

# Output directory
outdir_abprofiler = "output/abprofiler"

# Run AntibodyProfiler
app_abprofiler = biolib.load('BioLibDevelopment/AntibodyProfiler') # TODO - push to main branch so the last version is used
job_abprofiler = app_abprofiler.start(
    pdb=input_dir, # DEBUG - need to add automatic csv file generation for dir input, accepting hlt format
)

2025-03-25 11:53:27,142 | INFO : Loaded project BioLibDevelopment/AntibodyProfiler:0.0.32
2025-03-25 11:53:30,678 | INFO : View the result in your browser at: https://biolib.com/results/c7054aa6-17ad-47eb-8128-997ff58ad845/


#### Wait for AntibodyProfiler output files (<1 minute)
AntibodyProfiler calculates developability metrics compared to approved therapeutic antibodies already on the market, giving warning flags for input PDBs outside this range.

Output file:
- metrics.csv (Summary metrics compared to therapeutically approved antibodies for each PDB)

Example output:
```csv
pdb,metric,value,recommended_range,status
example,cdrh3_length,7,6-23,good
example,psh,100.74,35-150,good
example,total_cdr_length,41,37-60,good
```
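
Once `metrics.csv` is saved, the per-metric flags can be collapsed into a per-design decision. The sketch below keeps only designs whose every metric is flagged `good`. It assumes the column names from the example above; the `warning` status and the second design are made up for illustration, and `designs_passing_all` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

def designs_passing_all(csv_text):
    """Return names of PDBs whose every metric row has status "good"."""
    statuses = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text))
    header = [field.strip() for field in next(reader)]
    for row in reader:
        if not any(field.strip() for field in row):
            continue  # skip blank lines
        # Strip padding (spaces/tabs) so aligned and compact CSV both work
        record = dict(zip(header, (field.strip() for field in row)))
        statuses[record["pdb"]].append(record["status"])
    return sorted(name for name, flags in statuses.items()
                  if all(flag == "good" for flag in flags))

example = """\
pdb,metric,value,recommended_range,status
example,cdrh3_length,7,6-23,good
example,psh,100.74,35-150,good
other,psh,180.2,35-150,warning
"""
print(designs_passing_all(example))
```

In the notebook, `csv_text` would be read from the saved file in `outdir_abprofiler`.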

In [29]:
# Try to save job output files
status = job_abprofiler.get_status()
if status == "completed":
    job_abprofiler.save_files(outdir_abprofiler)
else:
    print(f"Job status is not completed ({status}), please wait a moment (or try again if failed)")

2025-03-25 11:54:00,842 | INFO : Saving 10 files to output/abprofiler...


## Conclusion

This notebook has demonstrated the complete RFantibody pipeline for structure-based design of de novo antibodies, followed by a developability check. The workflow consists of four main steps:

1. **RFdiffusion (antibody fine-tuned)**: Generating antibody-target docking poses with designed CDR loop structures
2. **ProteinMPNN (base model)**: Designing amino acid sequences for the CDR loops
3. **RoseTTAFold2 (antibody fine-tuned)**: Filtering designs based on predicted structure quality
4. **AntibodyProfiler**: Further selection of designs based on similarity to therapeutically approved antibodies

This computational pipeline can generate designs with a success rate of approximately 5-10% for some degree of binding to the target. Further experimental validation and optimization are likely to be required to improve binding affinity and other pharmaceutical properties.

For larger-scale antibody design campaigns, we recommend generating thousands of designs to increase the chances of finding high-quality binders, as the current filtering metrics are still highly limited.
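
As a rough back-of-the-envelope illustration of why numbers matter, the sketch below estimates how many designs would need to be tested experimentally to have a given chance of at least one binder. It assumes each tested design succeeds independently with the same probability, which is a simplification, and `designs_needed` is a hypothetical helper.

```python
import math

def designs_needed(success_rate, target_probability=0.95):
    """Smallest n with 1 - (1 - success_rate)**n >= target_probability."""
    if not 0 < success_rate < 1 or not 0 < target_probability < 1:
        raise ValueError("Both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target_probability) / math.log(1 - success_rate))

for rate in (0.05, 0.10):
    print(f"success rate {rate:.0%}: test at least {designs_needed(rate)} designs")
```

Under these assumptions, a 5% success rate calls for 59 tested designs and a 10% rate for 29. Because these counts refer to designs that reach the lab after filtering, the number generated computationally needs to be correspondingly larger.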