# Dataset creation with [Polaris](https://github.com/polaris-hub/polaris)

## Background

### Target details
Epidermal Growth Factor Receptor (EGFR) is a transmembrane protein that plays a critical role in cell growth, differentiation, and survival. It is frequently overexpressed or mutated in various cancers, including non-small cell lung cancer, colorectal cancer, and head and neck cancer. This makes EGFR a crucial target for cancer therapies such as Cetuximab, an antibody with more than 1B USD in annual revenue. 

- Target Protein: EGFR
- Organism: HUMAN
- Uniprot Accession ID: [P00533](https://www.uniprot.org/uniprotkb/P00533/entry)
- Protein sequence: LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPS
- Structure PDB: [6ARU](https://www.rcsb.org/structure/6aru)


![6ARU](https://cdn.rcsb.org/images/structures/6aru_assembly-1.jpeg)

### Binding protein designs
This dataset includes 202 designed EGFR-binding protein sequences with experimental binding affinity results from the AdaptyvBio team, along with 11 additional sequences ordered by Anthony Gitter and tested by the same team, resulting in 7 confirmed EGFR binders (including positive controls).

### Updates
Compared with v0, the two weak binders that were previously labeled as binders are labeled as non-binders in v1, because their relatively weak interactions may not be stable or effective in inhibiting EGFR.
The two weak binders are named "alecl-Sequence1" and "alan.blakely-design:5 n:6|mpnn:1.247|plddt:0.825|ptm:0.709|pae:10.151|rmsd:3.535".

## References
- https://design.adaptyvbio.com/
- https://foundry.adaptyvbio.com/egfr_design_competition
- https://github.com/agitter/adaptyvbio-egfr

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import pandas as pd
import datamol as dm

# polaris dataset
from polaris.dataset import Dataset, ColumnAnnotation
from polaris.utils.types import HubOwner


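# Notebooks define no __file__, so the literal string "__file__" resolves against the
# current working directory; parents[3] is then three directories above it.
root = pathlib.Path("__file__").absolute().parents[3]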
os.chdir(root)
sys.path.insert(0, str(root))
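The root-path idiom in the cell above can be illustrated with a pure path computation. This is only a sketch: the working directory below is a hypothetical example, and `PurePosixPath` is used so that nothing touches the real filesystem.

```python
from pathlib import PurePosixPath

# Hypothetical working directory of the notebook (an assumption for illustration).
cwd = PurePosixPath("/home/user/polaris-recipes/org-AdaptyvBio/EGFR_binders/v1")

# Path("__file__").absolute() amounts to joining the literal name onto the cwd.
fake_file = cwd / "__file__"

# parents[0] is the cwd itself; parents[3] is three directories above it.
root = fake_file.parents[3]
print(root)  # /home/user/polaris-recipes
```

If the notebook is launched from a different directory depth, `parents[3]` points somewhere else, so the index has to match the repository layout.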

In [2]:
# Get the owner and organization
org = "AdaptyvBio"
data_name = "EGFR_binders"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"

owner = HubOwner(slug="adaptyv-bio", type="organization")
owner

HubOwner(slug='adaptyv-bio', external_id=None, type='organization')

In [3]:
# Reuse gcp_root from the previous cell instead of repeating the bucket path
BENCHMARK_DIR = f"{gcp_root}/benchmarks"
DATASET_DIR = f"{gcp_root}/datasets"
FIGURE_DIR = f"{gcp_root}/figures"

## Load existing data

In [4]:
# Load the curated data
PATH = "gs://polaris-public/polaris-recipes/org-AdaptyvBio/EGFR_binders/raw/round1_results_summary_with_class_v1.csv"
table = pd.read_csv(PATH)

In [5]:
table

Unnamed: 0,name,username,sequence_name,kd,sequence,dna,plddt,pae_interaction,similarity_check,model_names,methods,binding_class,replicate,expression,binding,kon,koff,binding_strength
0,Cetuximab_scFv,,,6.638345e-09,QVQLKQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLE...,ATGCAGGTGCAGCTGAAACAGAGCGGCCCGGGCCTGGTGCAGCCAT...,,,,,,True,,,,,,
1,ahmedsameh-Q3,ahmedsameh,Q3,3.694188e-08,WVQLQESGGGLVQPGGSLRLSCAASGRTFSSYAMGWFRQAPGKQRE...,ATGTGGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.840455,28.217942,0.992,"[""Rosetta""]","[""Physics Based""]",True,,,,,,
2,ahmedsameh-yy2,ahmedsameh,yy2,6.275390e-08,QVQLQESGGGLVQPGGSLRLSCAASGRTFSSHAMGWFRQAPGKQRE...,ATGCAGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.288939,28.177070,0.992,"[""Rosetta""]","[""Physics Based""]",True,,,,,,
3,martin.pacesa-EGFR_l138_s90285_mpnn2,martin.pacesa,EGFR_l138_s90285_mpnn2,4.909414e-07,SPFDLFLDRLPEQDPEMTEEGKWWAEEMKRMVGPHFEELEEYIRNN...,ATGAGCCCGTTTGATCTGTTTCTGGATCGCCTGCCGGAACAGGATC...,88.653551,16.878782,,"[""AF2 Backprop""]","[""Hallucination""]",True,,,,,,
4,x.rustamov-m_18_41,x.rustamov,m_18_41,4.773972e-06,SAGQAQIEEVKARADKAKTLEELKELRKEAYEKNWKAYMAVVDETE...,ATGAGCGCGGGCCAGGCGCAGATTGAAGAAGTGAAAGCGCGCGCAG...,89.580600,14.921833,,"[""AF2 Backprop""]","[""Hallucination""]",True,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
208,gitter-yolo5,,,,MTTSSIRRQMKNIVNNYSEAEIKVREATSNDPWGPSSSLMTEIADL...,,,,,,,False,1.0,high,False,,,none
209,gitter-yolo6,,,,MQSVLTQSPASLSASVGDRVTITCRASQDISNYLNWYQQKPGKAPK...,,,,,,,False,1.0,high,False,,,none
210,gitter-yolo7,,,,QVQLQESGPGLVKPSETLSLTCTVSGGSISSGDYYWTWIRQPPGKG...,,,,,,,False,1.0,medium,False,,,none
211,gitter-yolo8,,,,DIQMTQSPSSLSASVGDRVTITCRASQDISNYLNWYQQKPGKAPKL...,,,,,,,False,1.0,medium,False,,,none


### Specify the metadata of the data columns

Annotate the key bioactivity columns, molecular structures, and identifiers in the dataset with `ColumnAnnotation`. Arbitrary key-value pairs, such as `unit`, `organism`, `scale`, and the optimization `objective`, can be attached through `user_attributes` when needed.

This dataset includes two weak binders. In v0 they were counted as positive binders because only six designs exhibit moderate to strong binding affinities; in v1 they are labeled as non-binders in the binary classification setting.
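According to the Updates section, v1 labels the two weak binders as non-binders. A minimal sketch of such a relabeling by design name is shown below. The toy rows and their starting labels are hypothetical stand-ins for the real table, and the curated v1 file already carries the final labels, so this step is illustrative only.

```python
import pandas as pd

# Names of the two weak binders, copied from the Updates section.
WEAK_BINDERS = [
    "alecl-Sequence1",
    "alan.blakely-design:5 n:6|mpnn:1.247|plddt:0.825|ptm:0.709|pae:10.151|rmsd:3.535",
]

# Toy rows with hypothetical v0-style labels (not the real data).
toy = pd.DataFrame(
    {
        "name": ["Cetuximab_scFv", WEAK_BINDERS[0], WEAK_BINDERS[1], "gitter-yolo5"],
        "binding_class": [True, True, True, False],
    }
)

# Flip only the rows whose name matches one of the weak binders.
relabeled = toy.copy()
relabeled.loc[relabeled["name"].isin(WEAK_BINDERS), "binding_class"] = False

n_binders = int(relabeled["binding_class"].sum())
print(n_binders)
```

Matching on the full `name` string avoids accidentally relabeling other designs from the same users.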

In [6]:
# Rename columns to more descriptive names
table.rename(columns={"kd": "KD"}, inplace=True)

In [7]:
table.head(3)

Unnamed: 0,name,username,sequence_name,KD,sequence,dna,plddt,pae_interaction,similarity_check,model_names,methods,binding_class,replicate,expression,binding,kon,koff,binding_strength
0,Cetuximab_scFv,,,6.638345e-09,QVQLKQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLE...,ATGCAGGTGCAGCTGAAACAGAGCGGCCCGGGCCTGGTGCAGCCAT...,,,,,,True,,,,,,
1,ahmedsameh-Q3,ahmedsameh,Q3,3.694188e-08,WVQLQESGGGLVQPGGSLRLSCAASGRTFSSYAMGWFRQAPGKQRE...,ATGTGGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.840455,28.217942,0.992,"[""Rosetta""]","[""Physics Based""]",True,,,,,,
2,ahmedsameh-yy2,ahmedsameh,yy2,6.27539e-08,QVQLQESGGGLVQPGGSLRLSCAASGRTFSSHAMGWFRQAPGKQRE...,ATGCAGGTGCAGCTGCAGGAAAGCGGCGGCGGCTTAGTGCAACCAG...,77.288939,28.17707,0.992,"[""Rosetta""]","[""Physics Based""]",True,,,,,,


In [8]:
annotations = {
    "name": ColumnAnnotation(description="Sequence design name."),
    "sequence": ColumnAnnotation(description="Protein sequence in fasta format."),
    "dna": ColumnAnnotation(description="DNA sequence of the design."),
    "plddt": ColumnAnnotation(
        description="pLDDT is a per-residue measure of local confidence."
    ),
    "pae_interaction": ColumnAnnotation(
        description="The confidence level in the interaction between different parts of a protein or between different proteins in a protein-protein complex."
    ),
    "similarity_check": ColumnAnnotation(
        description="Similar the designed sequence to reference known sequences."
    ),
    "KD": ColumnAnnotation(
        description="The equilibrium dissociation constant (KD) for the measure of binding affinity.",
        user_attributes={
            "unit": "M",
            "objective": "Lower value",
        },
    ),
    "binding_class": ColumnAnnotation(
        description="The binding affinity as boolean classes labels.",
        user_attributes={
            "objective": "True",
        },
    ),
    "model_names": ColumnAnnotation(
        description="The name of the model used for design."
    ),
    "methods": ColumnAnnotation(description="The method used for design."),
}
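The `KD` annotation above records the unit as M with lower values being better. Affinities spanning several orders of magnitude are often easier to compare on a negative log scale. The sketch below converts KD in mol/L to pKD = -log10(KD); the `kd_to_pkd` helper and the pKD scale are assumptions added for illustration and are not part of the dataset.

```python
import math


def kd_to_pkd(kd_molar: float) -> float:
    """Return pKD = -log10(KD) for a KD given in mol/L (hypothetical helper)."""
    # The negated comparison also rejects NaN, which marks unmeasured designs.
    if not kd_molar > 0:
        raise ValueError("KD must be a positive number")
    return -math.log10(kd_molar)


# KD values copied from the first two rows of the table shown earlier.
kd_values = {"Cetuximab_scFv": 6.638345e-09, "ahmedsameh-Q3": 3.694188e-08}
pkd_values = {name: kd_to_pkd(kd) for name, kd in kd_values.items()}

for name, pkd in pkd_values.items():
    print(f"{name}: pKD = {pkd:.2f}")
```

On this scale a higher value means tighter binding, so the optimization direction is the reverse of the `"Lower value"` objective recorded for `KD`.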

### Define `Dataset` object

In [9]:
from utils.docs_utils import load_readme

In [10]:
dataset_version = "v1"
dataset_name = f"{data_name}-{dataset_version}"

In [11]:
dataset = Dataset(
    table=table,
    name=dataset_name,
    description="This dataset includes binding protein designs targeting the Epidermal growth factor receptor(EGFR), a drug target associated with various diseases.",
    source="https://design.adaptyvbio.com/",
    annotations=annotations,
    owner=owner,
    tags=["protein-design"],
    license="CC-BY-4.0",
    readme=load_readme("org-AdaptyvBio/EGFR_binders/v1/dataset.md"),
)
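The `annotations` mapping passed to `Dataset` is keyed by column name, so a misspelled key would not line up with any table column. A small pre-check along these lines may help catch that. This is a sketch using plain lists copied from the table header and the annotation keys above, not a Polaris feature.

```python
# Column names copied from the table header shown earlier (after renaming kd to KD).
table_columns = [
    "name", "username", "sequence_name", "KD", "sequence", "dna", "plddt",
    "pae_interaction", "similarity_check", "model_names", "methods",
    "binding_class", "replicate", "expression", "binding", "kon", "koff",
    "binding_strength",
]

# Keys of the annotations dictionary defined above.
annotated = [
    "name", "sequence", "dna", "plddt", "pae_interaction", "similarity_check",
    "KD", "binding_class", "model_names", "methods",
]

# Annotation keys with no matching column would point to a typo.
missing = sorted(set(annotated) - set(table_columns))

# Columns without an annotation are kept, but carry no description.
unannotated = [col for col in table_columns if col not in annotated]

print("missing:", missing)
print("unannotated:", unannotated)
```

With a live DataFrame, the first list would be `list(table.columns)` and the second `list(annotations)`.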

### Dataset overview

In [12]:
dataset

2024-10-25 11:20:11.580 | INFO | polaris.mixins._checksum:md5sum:27 - Computing the checksum. This can be slow for large datasets.


0,1
name,EGFR_binders-v1
description,"This dataset includes binding protein designs targeting the Epidermal growth factor receptor(EGFR), a drug target associated with various diseases."
tags,protein-design
user_attributes,
owner,adaptyv-bio
polaris_version,0.8.4.dev0+gd05937e.d20240903
default_adapters,
zarr_root_path,
readme,"(Markdown text loaded from dataset.md; it largely repeats the introductory text at the top of this notebook, from Background through the reference links)"
annotations,(one entry per column; expanded in the table below)

column,is_pointer,modality,description,user_attributes,dtype
name,False,UNKNOWN,Sequence design name.,,object
sequence,False,UNKNOWN,Protein sequence in fasta format.,,object
dna,False,UNKNOWN,DNA sequence of the design.,,object
plddt,False,UNKNOWN,pLDDT is a per-residue measure of local confidence.,,float64
pae_interaction,False,UNKNOWN,The confidence level in the interaction between different parts of a protein or between different proteins in a protein-protein complex.,,float64
similarity_check,False,UNKNOWN,Similar the designed sequence to reference known sequences.,,float64
KD,False,UNKNOWN,The equilibrium dissociation constant (KD) for the measure of binding affinity.,unit=M; objective=Lower value,float64
binding_class,False,UNKNOWN,The binding affinity as boolean classes labels.,objective=True,bool
model_names,False,UNKNOWN,The name of the model used for design.,,object
methods,False,UNKNOWN,The method used for design.,,object
username,False,UNKNOWN,None,,object
sequence_name,False,UNKNOWN,None,,object
replicate,False,UNKNOWN,None,,float64
expression,False,UNKNOWN,None,,object
binding,False,UNKNOWN,None,,object
kon,False,UNKNOWN,None,,float64
koff,False,UNKNOWN,None,,float64
binding_strength,False,UNKNOWN,None,,object


### Upload the dataset to the hub

In [14]:
dataset.upload_to_hub()

✅ SUCCESS: Your dataset has been successfully uploaded to the Hub. View it here: https://polarishub.io/datasets/adaptyv-bio/EGFR_binders-v1
 


