# Dataset creation with [Polaris](https://github.com/polaris-hub/polaris)

## Background

### Target details
Epidermal Growth Factor Receptor (EGFR) is a transmembrane protein that plays a critical role in cell growth, differentiation, and survival. It is frequently overexpressed or mutated in various cancers, including non-small cell lung cancer, colorectal cancer, and head and neck cancer. This makes EGFR a crucial target for cancer therapies such as Cetuximab, an antibody with more than 1B USD in annual revenue. 

- Target Protein: EGFR
- Organism: HUMAN
- Uniprot Accession ID: [P00533](https://www.uniprot.org/uniprotkb/P00533/entry)
- Protein sequence: LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPS
- Structure PDB: [6ARU](https://www.rcsb.org/structure/6aru)


![6ARU](https://cdn.rcsb.org/images/structures/6aru_assembly-1.jpeg)

### Binding protein designs
This dataset contains 202 designed EGFR-binding protein sequences, along with experimental binding affinity results tested by the AdaptyvBio team.

## Reference:
- https://design.adaptyvbio.com/
- https://foundry.adaptyvbio.com/egfr_design_competition

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import pandas as pd
import datamol as dm

# polaris dataset
from polaris.dataset import Dataset, ColumnAnnotation
from polaris.utils.types import HubOwner


root = pathlib.Path("__file__").absolute().parents[3]
os.chdir(root)
sys.path.insert(0, str(root))


In [2]:
# Get the owner and organization
org = "AdaptyvBio"
data_name = "EGFR_binders"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"

owner = HubOwner(slug="adaptyv-bio", type="organization")
owner

HubOwner(slug='adaptyv-bio', external_id=None, type='organization')

In [3]:
BENCHMARK_DIR = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}/benchmarks"
DATASET_DIR = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}/datasets"
FIGURE_DIR = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}/figures"

## Load existing data

In [4]:
# Load the curated data
PATH = "gs://polaris-public/polaris-recipes/org-AdaptyvBio/EGFR_binders/raw/result_summary_with_class.csv"
table = pd.read_csv(PATH)

In [5]:
table

Unnamed: 0,name,username,sequence_name,kd,sequence,dna,plddt,pae_interaction,similarity_check,model_names,methods,binding_class
0,Cetuximab_scFv,,,6.638345e-09,QVQLKQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLE...,ATGCAGGTGCAGCTGAAACAGAGCGGCCCGGGCCTGGTGCAGCCAT...,,,,,,True
1,ahmedsameh-Q3,ahmedsameh,Q3,3.694188e-08,WVQLQESGGGLVQPGGSLRLSCAASGRTFSSYAMGWFRQAPGKQRE...,ATGTGGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.840455,28.217942,0.992,"[""Rosetta""]","[""Physics Based""]",True
2,ahmedsameh-yy2,ahmedsameh,yy2,6.275390e-08,QVQLQESGGGLVQPGGSLRLSCAASGRTFSSHAMGWFRQAPGKQRE...,ATGCAGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.288939,28.177070,0.992,"[""Rosetta""]","[""Physics Based""]",True
3,martin.pacesa-EGFR_l138_s90285_mpnn2,martin.pacesa,EGFR_l138_s90285_mpnn2,4.909414e-07,SPFDLFLDRLPEQDPEMTEEGKWWAEEMKRMVGPHFEELEEYIRNN...,ATGAGCCCGTTTGATCTGTTTCTGGATCGCCTGCCGGAACAGGATC...,88.653551,16.878782,,"[""AF2 Backprop""]","[""Hallucination""]",True
4,x.rustamov-m_18_41,x.rustamov,m_18_41,4.773972e-06,SAGQAQIEEVKARADKAKTLEELKELRKEAYEKNWKAYMAVVDETE...,ATGAGCGCGGGCCAGGCGCAGATTGAAGAAGTGAAAGCGCGCGCAG...,89.580600,14.921833,,"[""AF2 Backprop""]","[""Hallucination""]",True
...,...,...,...,...,...,...,...,...,...,...,...,...
197,tim-silica_corpora_sampled_epitope_1_generated...,tim,silica_corpora_sampled_epitope_1_generated_var...,,QVQLVESGGGLVKPGGSLRLSCAASGSTSSNYAAAWFRQAPGKERE...,ATGCAGGTGCAGCTGGTGGAAAGCGGCGGCGGCCTGGTTAAACCAG...,74.065691,28.327152,0.886,"[""Custom (Generative)""]","[""De Novo""]",False
198,ahmedsameh-y4,ahmedsameh,y4,,QVQLQESGGGLVQPGGSLRLSCAASGRTFSSYAMGWFRQAPGKQRE...,ATGCAGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.784242,28.178994,1.000,"[""Rosetta""]","[""Physics Based""]",False
199,ahmedsameh-y6,ahmedsameh,y6,,QVQLQESGGGLVQPGGSLRLSCAASGRTFSSYAMGWFRQAPGKQRE...,ATGCAGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.753258,28.208359,0.992,"[""Rosetta""]","[""Physics Based""]",False
200,ahmedsameh-s3,ahmedsameh,s3,,QVQLQESGGGLVQPGGSLRLSCAASGRTFSSYAMGWFRQAPGKQRE...,ATGCAGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.456894,28.246956,0.992,"[""Rosetta""]","[""Physics Based""]",False


### Specify the meta information of the data columns

`ColumnAnnotation` is used to describe the key bioactivity columns, molecular structures, and identifiers in the dataset. Arbitrary key-value pairs, such as `unit`, `organism`, `scale`, and the optimization `objective`, can be attached through `user_attributes` when needed.

This dataset includes two weak binders. Since only six designs exhibit moderate to strong binding affinities, these two weak binders are classified as positive binders in the binary classification setting. 
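The labelling rule described above can be sketched as follows. This is a minimal illustration, not the original curation code: it assumes a design counts as a binder whenever a KD value was measured, which matches the rows displayed in the table (rows without a KD carry `binding_class = False`). The helper `is_binder` and the toy `records` list are hypothetical names introduced only for this example.

```python
import math


def is_binder(kd):
    """Return True when a KD value (in M) was measured, False when it is missing."""
    return kd is not None and not math.isnan(kd)


# Toy records: KD values copied from rows shown in the table above.
# NaN stands for a design without a measured KD.
records = [
    {"name": "Cetuximab_scFv", "KD": 6.638345e-09},
    {"name": "x.rustamov-m_18_41", "KD": 4.773972e-06},
    {"name": "ahmedsameh-y4", "KD": float("nan")},
]

# Map each design name to its binary label under the assumed rule.
labels = {r["name"]: is_binder(r["KD"]) for r in records}
print(labels)
```

Under this assumption a weak binder with a micromolar KD is labelled positive in the same way as a nanomolar one, which is how the two weak binders end up in the positive class.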

In [6]:
# Rename columns to more comprehensive terms
table.rename(columns={"kd": "KD"}, inplace=True)

In [7]:
table.head(3)

Unnamed: 0,name,username,sequence_name,KD,sequence,dna,plddt,pae_interaction,similarity_check,model_names,methods,binding_class
0,Cetuximab_scFv,,,6.638345e-09,QVQLKQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLE...,ATGCAGGTGCAGCTGAAACAGAGCGGCCCGGGCCTGGTGCAGCCAT...,,,,,,True
1,ahmedsameh-Q3,ahmedsameh,Q3,3.694188e-08,WVQLQESGGGLVQPGGSLRLSCAASGRTFSSYAMGWFRQAPGKQRE...,ATGTGGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.840455,28.217942,0.992,"[""Rosetta""]","[""Physics Based""]",True
2,ahmedsameh-yy2,ahmedsameh,yy2,6.27539e-08,QVQLQESGGGLVQPGGSLRLSCAASGRTFSSHAMGWFRQAPGKQRE...,ATGCAGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.288939,28.17707,0.992,"[""Rosetta""]","[""Physics Based""]",True


In [8]:
annotations = {
    "name": ColumnAnnotation(description="Sequence design name."),
    "sequence": ColumnAnnotation(description="Protein sequence in fasta format."),
    "dna": ColumnAnnotation(description="DNA sequence of the design."),
    "plddt": ColumnAnnotation(
        description="pLDDT is a per-residue measure of local confidence."
    ),
    "pae_interaction": ColumnAnnotation(
        description="The confidence level in the interaction between different parts of a protein or between different proteins in a protein-protein complex."
    ),
    "similarity_check": ColumnAnnotation(
        description="Similar the designed sequence to reference known sequences."
    ),
    "KD": ColumnAnnotation(
        description="The equilibrium dissociation constant (KD) for the measure of binding affinity.",
        user_attributes={
            "unit": "M",
            "objective": "Lower value",
        },
    ),
    "binding_class": ColumnAnnotation(
        description="The binding affinity as boolean classes labels.",
        user_attributes={
            "objective": "True",
        },
    ),
    "model_names": ColumnAnnotation(
        description="The name of the model used for design."
    ),
    "methods": ColumnAnnotation(description="The method used for design."),
}
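Before building the dataset it can help to confirm that every annotated key matches a real column and to see which columns are left without a description. The sketch below uses plain lists copied from the table header and the `annotations` keys shown above; the variable names `table_columns`, `annotated`, `missing`, and `unannotated` are introduced here for illustration only.

```python
# Column names as displayed in the table header above (after renaming kd to KD).
table_columns = [
    "Unnamed: 0", "name", "username", "sequence_name", "KD", "sequence", "dna",
    "plddt", "pae_interaction", "similarity_check", "model_names", "methods",
    "binding_class",
]

# Keys of the `annotations` dictionary defined above.
annotated = [
    "name", "sequence", "dna", "plddt", "pae_interaction", "similarity_check",
    "KD", "binding_class", "model_names", "methods",
]

# Annotated keys that do not correspond to any column (should be empty).
missing = [c for c in annotated if c not in table_columns]

# Columns that carry no annotation.
unannotated = [c for c in table_columns if c not in annotated]

print("missing:", missing)
print("unannotated:", unannotated)
```

In the notebook itself the same check could be run with `list(table.columns)` and `list(annotations)` in place of the hard-coded lists.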

### Define `Dataset` object

In [9]:
from utils.docs_utils import load_readme

In [10]:
dataset_version = "v0"
dataset_name = f"{data_name}-{dataset_version}"

In [11]:
dataset = Dataset(
    table=table,
    name=dataset_name,
    description="This dataset includes binding protein designs targeting the Epidermal growth factor receptor(EGFR), a drug target associated with various diseases.",
    source="https://design.adaptyvbio.com/",
    annotations=annotations,
    owner=owner,
    tags=["protein-design"],
    license="CC-BY-4.0",
    readme=load_readme("org-AdaptyvBio/EGFR_binders/v0/dataset.md"),
)

### Dataset overview

In [12]:
dataset

2024-09-26 12:12:28.761 | INFO | polaris.mixins._checksum:md5sum:27 - Computing the checksum. This can be slow for large datasets.


0,1
name,EGFR_binders-v0
description,"This dataset includes binding protein designs targeting the Epidermal growth factor receptor(EGFR), a drug target associated with various diseases."
tags,protein-design
user_attributes,
owner,adaptyv-bio
polaris_version,0.8.7.dev1+g23fd61e.d20240926
default_adapters,
zarr_root_path,
readme,"## Background ### Target details Epidermal Growth Factor Receptor (EGFR) is a transmembrane protein that plays a critical role in cell growth, differentiation, and survival. It is frequently overexpressed or mutated in various cancers, including non-small cell lung cancer, colorectal cancer, and head and neck cancer. This makes EGFR a crucial target for cancer therapies such as Cetuximab, an antibody with more than 1B USD in annual revenue. - Target Protein: EGFR - Organism: HUMAN - Uniprot Accession ID: [P00533](https://www.uniprot.org/uniprotkb/P00533/entry) - Protein sequence: LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPS - Structure PDB: [6ARU](https://www.rcsb.org/structure/6aru) ![64ru](https://cdn.rcsb.org/images/structures/6aru_assembly-1.jpeg) ### Binding protein designs This dataset contains 202 designed EGFR-binding protein sequences, along with experimental binding affinity results tested by the AdaptyvBio team. ## Benchmark description This retrospective benchmark evaluates protein design methods by challenging participants to design a binding protein for the extracellular domain of EGFR, a cancer-associated drug target. A set of 202 previously designed protein sequences, along with their experimental binding affinities (binary labels), is available for testing. `Balenced Accuracy` is used to evaluate the performance of design methods on both binders and non-binders. 
## Reference: - https://design.adaptyvbio.com/ - https://foundry.adaptyvbio.com/egfr_design_competition"
annotations,

column,is_pointer,modality,description,user_attributes,dtype,content_type
name,False,UNKNOWN,Sequence design name.,,object,None
sequence,False,UNKNOWN,Protein sequence in fasta format.,,object,None
dna,False,UNKNOWN,DNA sequence of the design.,,object,None
plddt,False,UNKNOWN,pLDDT is a per-residue measure of local confidence.,,float64,None
pae_interaction,False,UNKNOWN,The confidence level in the interaction between different parts of a protein or between different proteins in a protein-protein complex.,,float64,None
similarity_check,False,UNKNOWN,Similar the designed sequence to reference known sequences.,,float64,None
KD,False,UNKNOWN,The equilibrium dissociation constant (KD) for the measure of binding affinity.,unit=M; objective=Lower value,float64,None
binding_class,False,UNKNOWN,The binding affinity as boolean classes labels.,objective=True,bool,None
model_names,False,UNKNOWN,The name of the model used for design.,,object,None
methods,False,UNKNOWN,The method used for design.,,object,None
username,False,UNKNOWN,None,,object,None
sequence_name,False,UNKNOWN,None,,object,None

### Upload the dataset to the hub

In [15]:
dataset.upload_to_hub()

2024-09-26 12:16:54.027 | INFO | polaris.hub.client:_upload_dataset:593 - Copying Parquet file to the Hub. This may take a while.

✅ SUCCESS: Your standard dataset has been successfully uploaded to the Hub. View it here: https://polarishub.io/datasets/adaptyv-bio/EGFR_binders-v0


In [15]:
# test to load the benchmark

import polaris as po

# Load the benchmark from the Hub
benchmark = po.load_benchmark("adaptyv-bio/EGFR_binders_binary_cls-v0")

# Get the train and test data-loaders
train, test = benchmark.get_train_test_split()

2024-09-24 11:22:09.695 | INFO | polaris.mixins._checksum:verify_checksum:65 - To verify the checksum, we need to recompute it. This can be slow for large datasets.
2024-09-24 11:22:09.701 | INFO | polaris.benchmark._base:_validate_split:189 - This benchmark only specifies a test set. It will return an empty train set in `get_train_test_split()`

✅ SUCCESS: Fetched artifact.
✅ SUCCESS: Fetched artifact.


In [16]:
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    explained_variance_score,
    f1_score,
    matthews_corrcoef,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
    balanced_accuracy_score,
)

In [17]:
benchmark.dataset.table["binding_class"].unique()

array([ True, False])

In [18]:
# Sanity check: score the ground-truth labels against themselves
labels = benchmark.dataset.table["binding_class"].to_numpy()
balanced_accuracy_score(labels, labels)

In [19]:
import numpy as np

# Work your magic to accurately predict the test set
predictions = np.array([True for x in test])

# Evaluate your predictions
results = benchmark.evaluate(predictions)

# Submit your results
# results.upload_to_hub(owner="lu-valencelabs")

In [20]:
results

Test set,Target label,Metric,Score
test,binding_class,accuracy,0.0396039604
test,binding_class,balanced_accuracy,0.5
test,binding_class,f1,0.0761904762
test,binding_class,mcc,0.0
test,binding_class,cohen_kappa,0.0
name,,,
description,,,
tags,,,
user_attributes,,,
owner,,,

0,1
slug,adaptyv-bio
external_id,org_2lqe2oSyR0fGZEEVd1NUlwqnfjx
type,organization
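The scores reported for the all-`True` baseline can be reproduced by hand from the confusion matrix. The sketch below assumes 202 test designs of which 8 are binders; the count of 8 is inferred from the text (six moderate-to-strong binders plus two weak ones) and agrees with the reported accuracy, but it is not read from the benchmark object. Treating a zero denominator in the MCC as a score of 0 is likewise an assumption made to match the reported value.

```python
import math

n_total = 202  # designs in the test set
n_pos = 8      # assumed number of binders (six moderate-to-strong plus two weak)

# Confusion matrix when every design is predicted to be a binder.
tp, fp = n_pos, n_total - n_pos
tn, fn = 0, 0

accuracy = (tp + tn) / n_total

# Balanced accuracy is the mean of the per-class recalls.
recall_pos = tp / (tp + fn)
recall_neg = tn / (tn + fp)
balanced_accuracy = (recall_pos + recall_neg) / 2

precision = tp / (tp + fp)
f1 = 2 * precision * recall_pos / (precision + recall_pos)

# Matthews correlation coefficient; a zero denominator is treated as a score of 0.
mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

print(f"accuracy={accuracy:.10f}")
print(f"balanced_accuracy={balanced_accuracy}")
print(f"f1={f1:.10f}")
print(f"mcc={mcc}")
```

Because the positive class is so small, plain accuracy for this trivial baseline is close to zero while balanced accuracy sits at the chance level of 0.5, which makes the latter the more informative number to compare against.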

