# Dataset creation with [Polaris](https://github.com/polaris-hub/polaris)

## Background

### Target details
Epidermal Growth Factor Receptor (EGFR) is a transmembrane protein that plays a critical role in cell growth, differentiation, and survival. It is frequently overexpressed or mutated in various cancers, including non-small cell lung cancer, colorectal cancer, and head and neck cancer. This makes EGFR an important target for cancer therapies such as cetuximab, a monoclonal antibody with more than 1B USD in annual revenue.

- Target Protein: EGFR
- Organism: HUMAN
- Uniprot Accession ID: [P00533](https://www.uniprot.org/uniprotkb/P00533/entry)
- Protein sequence: LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPS
- Structure PDB: [6ARU](https://www.rcsb.org/structure/6aru)


![6aru](https://cdn.rcsb.org/images/structures/6aru_assembly-1.jpeg)

### Binding protein designs
This dataset contains 202 designed EGFR-binding protein sequences, together with binding affinities measured experimentally by the AdaptyvBio team.

## References
- https://design.adaptyvbio.com/
- https://foundry.adaptyvbio.com/egfr_design_competition

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import pandas as pd
import datamol as dm

# polaris dataset
from polaris.dataset import Dataset, ColumnAnnotation
from polaris.utils.types import HubOwner


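# `__file__` is not defined in a notebook; the literal string makes the path
# resolve against the current working directory, so `root` is two levels above it.
root = pathlib.Path("__file__").absolute().parents[2]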
os.chdir(root)
sys.path.insert(0, str(root))

In [2]:
root

PosixPath('/Users/lu.zhu/Documents/Codebase/ValenceLab/polaris-recipes')

In [3]:
# Get the owner and organization
org = "AdaptyvBio"
data_name = "EGFR_binders"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"

owner = HubOwner(slug="adaptyv-bio", type="organization")
owner

HubOwner(slug='adaptyv-bio', external_id=None, type='organization')

In [4]:
# Output locations, derived from the `gcp_root` defined in the previous cell
BENCHMARK_DIR = f"{gcp_root}/benchmarks"
DATASET_DIR = f"{gcp_root}/datasets"
FIGURE_DIR = f"{gcp_root}/figures"

## Load existing data

In [5]:
# Load the curated data
PATH = "gs://polaris-public/polaris-recipes/org-AdaptyvBio/EGFR_binders/raw/Competition_Binders_to_EGFR_mean.csv"
table = pd.read_csv(PATH)
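
Before annotating, it can help to confirm that the loaded table has the columns the following cells rely on. The snippet below is only a sketch: `check_columns` is a hypothetical helper, and the small `demo` frame with made-up values stands in for the real CSV. The raw column names `name`, `sequence`, `kd`, and `binding` are the ones referenced later in this notebook.

```python
import pandas as pd

# Raw column names used by the later cells (before renaming).
EXPECTED_COLUMNS = ["name", "sequence", "kd", "binding"]


def check_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected columns that are missing from `df`."""
    return [col for col in expected if col not in df.columns]


# Toy stand-in for the real table; the values are invented for illustration.
demo = pd.DataFrame(
    {
        "name": ["design_a", "design_b"],
        "sequence": ["MKT", "GSH"],
        "kd": [1e-8, None],
        "binding": ["True", "unknown"],
    }
)

missing = check_columns(demo)
print(missing)
```

On the real data, the same call would be `check_columns(table)`; an empty list means every expected column is present.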

### Specify the metadata of the data columns

Use `ColumnAnnotation` to describe the key bioactivity columns, molecule structures, and identifiers in the dataset. Arbitrary key-value pairs, such as `unit`, `organism`, `scale`, and the optimization `objective`, can be added through `user_attributes` when needed.

Only eight designs show any binding: six with moderate to strong affinity and two weak binders. Because positives are so scarce, the two weak binders are treated as positive binders in the binary classification setting. Designs with `unknown` binding results are labeled `False`.


In [6]:
# Relabel: weak binders count as positives, unknown results as negatives
table["binding"] = table["binding"].replace("weak", "True")
table["binding"] = table["binding"].replace("unknown", "False")

table["binding"].value_counts()

binding
False    194
True       8
Name: count, dtype: int64
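
The relabeling above keeps the classes as the strings `"True"` and `"False"`. As a possible alternative, the same rule can be written as one explicit mapping that yields real booleans and fails on unexpected labels. This is a sketch under assumptions: `LABEL_MAP` and the toy `raw` series are illustrative, and the set of raw labels (`True`, `False`, `weak`, `unknown`) is inferred from the cell above rather than checked against the CSV.

```python
import pandas as pd

# Assumed raw labels and the class each one is assigned to.
LABEL_MAP = {"True": True, "weak": True, "False": False, "unknown": False}

# Toy stand-in for table["binding"].
raw = pd.Series(["True", "weak", "unknown", "False", "unknown"], name="binding")

# astype(str) lets the lookup work whether labels were read as strings or booleans.
binding_bool = raw.astype(str).map(LABEL_MAP)

# Any label outside LABEL_MAP maps to NaN, so fail loudly if that happens.
assert binding_bool.notna().all(), "unexpected binding label"
binding_bool = binding_bool.astype(bool)

# Fraction of positive designs in the toy data.
positive_fraction = binding_bool.mean()
print(binding_bool.value_counts())
print(positive_fraction)
```

With the counts shown above (8 positives out of 202 designs), the positive fraction of the real dataset is about 4%, so the classification task is strongly imbalanced.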

In [7]:
# Rename columns to more descriptive names
table.rename(columns={"kd": "KD", "binding": "binding_class"}, inplace=True)

In [8]:
annotations = {
    "name": ColumnAnnotation(description="Sequence design name."),
    "sequence": ColumnAnnotation(description="Protein sequence in fasta format."),
    "KD": ColumnAnnotation(
        description="The equilibrium dissociation constant (KD) for the measure of binding affinity.",
        user_attributes={
            "unit": "M",
            "objective": "Lower value",
        },
    ),
    "binding_class": ColumnAnnotation(
        description="The binding affinity as boolean classes labels.",
        user_attributes={
            "objective": "True",
        },
    ),
}
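
Because `KD` is annotated in molar units with a lower-is-better objective, a log-transformed value can be easier to compare across designs. The sketch below assumes the conventional definition pKD = -log10(KD in M). The `pKD` and `KD_nM` columns and the toy values are illustrative and are not part of the dataset; it also assumes that designs without a measured KD are stored as NaN.

```python
import numpy as np
import pandas as pd

# Toy KD values in molar units; NaN stands for a design without a measured KD.
demo = pd.DataFrame({"KD": [1e-9, 5e-7, np.nan]})

# pKD = -log10(KD [M]); higher pKD means tighter binding.
demo["pKD"] = -np.log10(demo["KD"])

# The same affinities expressed in nanomolar.
demo["KD_nM"] = demo["KD"] * 1e9

print(demo)
```

Missing values propagate as NaN through both transformations, so no rows need to be dropped beforehand.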

### Define `Dataset` object

In [9]:
dataset_version = "v1"
dataset_name = f"{data_name}-{dataset_version}"

In [10]:
dataset = Dataset(
    table=table[annotations.keys()].copy(),
    name=dataset_name,
    description="This dataset includes binding protein designs targeting the Epidermal growth factor receptor(EGFR), a drug target associated with various diseases.",
    source="https://design.adaptyvbio.com/",
    annotations=annotations,
    owner=owner,
    tags=["protein-design"],
    license="CC-BY-4.0",
)

### Dataset overview

In [11]:
dataset

2024-09-23 16:21:14.227 | INFO     | polaris.mixins._checksum:md5sum:27 - Computing the checksum. This can be slow for large datasets.


| Field | Value |
| --- | --- |
| name | EGFR_binders-v1 |
| description | This dataset includes binding protein designs targeting the Epidermal growth factor receptor(EGFR), a drug target associated with various diseases. |
| tags | protein-design |
| user_attributes | |
| owner | adaptyv-bio |
| polaris_version | 0.8.5.dev6+gb44821c |
| default_adapters | |
| zarr_root_path | |
| readme | |

Annotations:

| Column | is_pointer | modality | description | user_attributes | dtype | content_type |
| --- | --- | --- | --- | --- | --- | --- |
| name | False | UNKNOWN | Sequence design name. | | object | None |
| sequence | False | UNKNOWN | Protein sequence in fasta format. | | object | None |
| KD | False | UNKNOWN | The equilibrium dissociation constant (KD) for the measure of binding affinity. | unit: M; objective: Lower value | float64 | None |
| binding_class | False | UNKNOWN | The binding affinity as boolean classes labels. | objective: True | object | None |


In [1]:
# # save the dataset to GCP
# SAVE_DIR = f"{DATASET_DIR}/{dataset_name}"
# dataset_path = dataset.to_json(SAVE_DIR)
# dataset_path

### Upload the dataset to the hub

In [15]:
dataset.upload_to_hub()

✅ SUCCESS: Your standard dataset has been successfully uploaded to the Hub. View it here: https://polarishub.io/datasets/adaptyv-bio/EGFR_binders-v1
 


