# Benchmark creation with [Polaris](https://github.com/polaris-hub/polaris)

## Background

### Target details
Epidermal Growth Factor Receptor (EGFR) is a transmembrane protein that plays a critical role in cell growth, differentiation, and survival. It is frequently overexpressed or mutated in various cancers, including non-small cell lung cancer, colorectal cancer, and head and neck cancer. This makes EGFR a crucial target for cancer therapies such as Cetuximab, an antibody with more than 1B USD in annual revenue. 

- Target Protein: EGFR
- Organism: HUMAN
- Uniprot Accession ID: [P00533](https://www.uniprot.org/uniprotkb/P00533/entry)
- Protein sequence: LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPS
- Structure PDB: [6ARU](https://www.rcsb.org/structure/6aru)


![64ru](https://cdn.rcsb.org/images/structures/6aru_assembly-1.jpeg)

### Binding protein designs
This dataset contains 202 designed EGFR-binding protein sequences, along with experimental binding affinity results tested by the AdaptyvBio team, plus 11 additional sequences ordered by Anthony Gitter and tested by the AdaptyvBio team.

## Benchmark description

This retrospective benchmark evaluates protein design methods by challenge participants to design a binding protein for the extracellular domain of EGFR, a cancer-associated drug target. A set of 213 protein sequences including positive controls, along with their experimental binding affinities (binary labels), is available for testing. 

`Balanced accuracy` is used to evaluate the performance of design methods in differentiating between binders and non-binders.

## Additional notes
- `Cetuximab_scFv` and `P01133-971-1023`served as positive controls in the binding assay.
- `ahmedsameh-Q3` and `ahmedsameh-yy2` were disqualified from the competition due to high similarity levels with known EGFR-binder sequences. They are retained in this benchmark solely for evaluation purposes.
- The two weak binders `alecl-Sequence1` and `alan.blakely-design:5 n:6|mpnn:1.247|plddt:0.825|ptm:0.709|pae:10.151|rmsd:3.535` are classified as non-binders in this benchmark.

## Reference: 
- https://design.adaptyvbio.com/
- https://foundry.adaptyvbio.com/egfr_design_competition
- https://github.com/adaptyvbio/egfr_competition_1
- https://github.com/agitter/adaptyvbio-egfr

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import datamol as dm
import numpy as np

# polaris benchmark
from polaris.benchmark import SingleTaskBenchmarkSpecification

# polaris hub
from polaris.cli import PolarisHubClient
from polaris.utils.types import HubOwner

# utils
root = pathlib.Path("__file__").absolute().parents[3]
os.chdir(root)
sys.path.insert(0, str(root))

In [2]:
# Get the owner and organization
# Get the owner and organization
org = "AdaptyvBio"
data_name = "EGFR_binders"
dataset_version = "v1"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"

owner = HubOwner(slug="adaptyv-bio", type="organization")
owner

BENCHMARK_DIR = f"{gcp_root}/benchmarks"
DATASET_JSON = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}/datasets/{data_name}-{dataset_version}/dataset.json"

FIGURE_DIR = f"{gcp_root}/figures"

## Load existing Dataset

In [3]:
# Load the saved Dataset
import polaris as po

# Load the dataset from the Hub
dataset = po.load_dataset("adaptyv-bio/egfr-binders-v1")

⠙ Fetching dataset... 

[32m2024-10-25 10:42:05.356[0m | [1mINFO    [0m | [36mpolaris._artifact[0m:[36m_validate_version[0m:[36m66[0m - [1mThe version of Polaris that was used to create the artifact (0.8.7.dev1+g23fd61e.d20240926) is different from the currently installed version of Polaris (0.8.4.dev0+gd05937e.d20240903).[0m


✅ SUCCESS: [1mFetched dataset.[0m
 


  self._color = self._set_color(value) if value else value


## Benchmark creation with `Polaris`
This is a retrospective benchmark, all the data points are in the test split for evaluation. 

In [4]:
split = ([], list(range(dataset.table.shape[0])))

## Create a benchmark

In [5]:
from utils.docs_utils import load_readme

In [6]:
benchmark_version = "v1"
target_cols = ["binding_class"]
input_col = ["sequence"]
benchmark_name = "EGFR_binders_binary_cls"

In [8]:
benchmark_name = f"{benchmark_name}-{benchmark_version}"
benchmark = SingleTaskBenchmarkSpecification(
    name=benchmark_name,
    dataset=dataset,
    target_cols=target_cols,
    target_types={"binding_class": "classification"},
    input_cols=input_col,
    split=split,
    main_metric="balanced_accuracy",
    metrics=[
        "accuracy",
        "balanced_accuracy",
        "f1",
        "mcc",
        "cohen_kappa",
    ],
    tags=["protein-design", "singletask"],
    owner=owner,
    description=f"Single task benchmark for protein binder design targeting the EGFR.",
    readme=load_readme("org-AdaptyvBio/EGFR_binders/v1/benchmark.md")
)

[32m2024-10-25 10:42:44.597[0m | [1mINFO    [0m | [36mpolaris.benchmark._base[0m:[36m_validate_split[0m:[36m188[0m - [1mThis benchmark only specifies a test set. It will return an empty train set in `get_train_test_split()`[0m


In [9]:
benchmark

0,1
name,EGFR_binders_binary_cls-v1-v1
description,Single task benchmark for protein binder design targeting the EGFR.
tags,protein-designsingletask
user_attributes,
owner,adaptyv-bio
polaris_version,0.8.4.dev0+gd05937e.d20240903
target_cols,binding_class
input_cols,sequence
metrics,accuracybalanced_accuracyf1mcccohen_kappa
main_metric,balanced_accuracy

0,1
binding_class,classification

0,1
test,213

0,1
binding_class,2


In [14]:
# upload to polaris hub
benchmark.upload_to_hub(access="private", owner=owner)

✅ SUCCESS: [1mYour benchmark has been successfully uploaded to the Hub. View it here: https://polarishub.io/benchmarks/adaptyv-bio/EGFR_binders_binary_cls-v1[0m
 


  self._color = self._set_color(value) if value else value


{'id': 'ElsN9peBLY9q9yjju2mzu',
 'createdAt': '2024-10-25T14:34:41.653Z',
 'deletedAt': None,
 'name': 'EGFR_binders_binary_cls-v1',
 'slug': 'egfr-binders-binary-cls-v1',
 'description': 'Single task benchmark for protein binder design targeting the EGFR.',
 'tags': ['protein-design', 'singletask'],
 'userAttributes': {},
 'access': 'private',
 'isCertified': False,
 'polarisVersion': '0.8.4.dev0+gd05937e.d20240903',
 'readme': '',
 'state': 'ready',
 'ownerId': 'Ek6QRdreDbHuVNCUNjdbr',
 'creatorId': 'NKnaHGybLqwSHcaMEHqfF',
 'targetCols': ['binding_class'],
 'inputCols': ['sequence'],
 'md5Sum': '6c6060257e434989a7b401c1eab2e331',
 'metrics': ['accuracy', 'balanced_accuracy', 'f1', 'mcc', 'cohen_kappa'],
 'mainMetric': 'balanced_accuracy',
 'split': [[],
  [0,
   1,
   2,
   3,
   4,
   5,
   6,
   7,
   8,
   9,
   10,
   11,
   12,
   13,
   14,
   15,
   16,
   17,
   18,
   19,
   20,
   21,
   22,
   23,
   24,
   25,
   26,
   27,
   28,
   29,
   30,
   31,
   32,
   33,
   34

In [11]:
benchmark.dataset.table.tail

Unnamed: 0,name,username,sequence_name,KD,sequence,dna,plddt,pae_interaction,similarity_check,model_names,methods,binding_class,replicate,expression,binding,kon,koff,binding_strength
0,Cetuximab_scFv,,,6.638345e-09,QVQLKQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLE...,ATGCAGGTGCAGCTGAAACAGAGCGGCCCGGGCCTGGTGCAGCCAT...,,,,,,True,,,,,,
1,ahmedsameh-Q3,ahmedsameh,Q3,3.694188e-08,WVQLQESGGGLVQPGGSLRLSCAASGRTFSSYAMGWFRQAPGKQRE...,ATGTGGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.840455,28.217942,0.992,"[""Rosetta""]","[""Physics Based""]",True,,,,,,
2,ahmedsameh-yy2,ahmedsameh,yy2,6.275390e-08,QVQLQESGGGLVQPGGSLRLSCAASGRTFSSHAMGWFRQAPGKQRE...,ATGCAGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.288939,28.177070,0.992,"[""Rosetta""]","[""Physics Based""]",True,,,,,,
3,martin.pacesa-EGFR_l138_s90285_mpnn2,martin.pacesa,EGFR_l138_s90285_mpnn2,4.909414e-07,SPFDLFLDRLPEQDPEMTEEGKWWAEEMKRMVGPHFEELEEYIRNN...,ATGAGCCCGTTTGATCTGTTTCTGGATCGCCTGCCGGAACAGGATC...,88.653551,16.878782,,"[""AF2 Backprop""]","[""Hallucination""]",True,,,,,,
4,x.rustamov-m_18_41,x.rustamov,m_18_41,4.773972e-06,SAGQAQIEEVKARADKAKTLEELKELRKEAYEKNWKAYMAVVDETE...,ATGAGCGCGGGCCAGGCGCAGATTGAAGAAGTGAAAGCGCGCGCAG...,89.580600,14.921833,,"[""AF2 Backprop""]","[""Hallucination""]",True,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
208,gitter-yolo5,,,,MTTSSIRRQMKNIVNNYSEAEIKVREATSNDPWGPSSSLMTEIADL...,,,,,,,False,1.0,high,False,,,none
209,gitter-yolo6,,,,MQSVLTQSPASLSASVGDRVTITCRASQDISNYLNWYQQKPGKAPK...,,,,,,,False,1.0,high,False,,,none
210,gitter-yolo7,,,,QVQLQESGPGLVKPSETLSLTCTVSGGSISSGDYYWTWIRQPPGKG...,,,,,,,False,1.0,medium,False,,,none
211,gitter-yolo8,,,,DIQMTQSPSSLSASVGDRVTITCRASQDISNYLNWYQQKPGKAPKL...,,,,,,,False,1.0,medium,False,,,none
