# Dataset creation with [Polaris](https://github.com/polaris-hub/polaris)

## Background
Epidermal Growth Factor Receptor (EGFR) is a transmembrane protein that plays a critical role in cell growth, differentiation, and survival. It is frequently overexpressed or mutated in various cancers, including non-small cell lung cancer, colorectal cancer, and head and neck cancer. This makes EGFR a crucial target for cancer therapies such as Cetuximab, an antibody with more than 1B USD in annual revenue.



In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import pandas as pd
import datamol as dm

# polaris dataset
from polaris.dataset import Dataset, ColumnAnnotation
from polaris.utils.types import HubOwner


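# `__file__` is undefined in a notebook, so the literal string "__file__" is resolved
# against the current working directory; parents[2] is the directory two levels above it.
root = pathlib.Path("__file__").absolute().parents[2]
os.chdir(root)
sys.path.insert(0, str(root))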

In [2]:
root

In [3]:
# Get the owner and organization
org = "AdaptyvBio"
data_name = "EGFR_binders"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"

owner = HubOwner(slug="adaptyv-bio", type="organization")
owner

In [4]:
# Output locations, all under the same GCS prefix as `gcp_root`
BENCHMARK_DIR = f"{gcp_root}/benchmarks"
DATASET_DIR = f"{gcp_root}/datasets"
FIGURE_DIR = f"{gcp_root}/figures"

## Load existing data

In [5]:
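# Load the curated data.
# Remote copy on GCS, kept for reference (not read below):
# PATH = "gs://polaris-public/polaris-recipes/org-AdaptyvBio/EGFR_binders/raw/Competition_Binders_to_EGFR_mean.csv"
# Local copy, relative to `root` (the file actually read):
PATH = "org-AdaptyvBio/EGFR_binders/Competition_Binders_to_EGFR_mean.csv"
table = pd.read_csv(PATH)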
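
Before annotating, it can help to confirm that the loaded table contains the columns the later cells rely on. The sketch below is a hypothetical helper, not part of Polaris or pandas; the column names are the ones used further down in this notebook (`name`, `sequence`, `kd`, `binding`), and the example input is a stand-in for `table.columns`.

```python
# Hypothetical helper (not a Polaris or pandas API).
# The required names are the raw columns referenced later in this notebook.
REQUIRED_COLUMNS = ("name", "sequence", "kd", "binding")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`, in order."""
    present = set(columns)
    return [col for col in required if col not in present]


# Stand-in for `table.columns`; here the "binding" column is deliberately left out.
example_columns = ["name", "sequence", "kd"]
missing = missing_columns(example_columns)
```

With the real frame this would be called as `missing_columns(table.columns)`, and a non-empty result would indicate that the CSV does not match what the following cells expect.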

### Specify the meta information of the data columns

The key bioactivity columns, molecule structures and identifiers in the dataset must be described with `ColumnAnnotation`. Arbitrary key-value pairs, such as `unit`, `organism`, `scale` and the optimization `objective`, can be added through `user_attributes` when needed.

This dataset includes two weak binders. Because only 6 designs show moderate to strong binding affinity, these two weak binders are counted as positive binders in the binary classification setting.
Designs with an `unknown` binding result are given the `False` label.

In [6]:
# Relabel: count "weak" binders as positives and "unknown" results as negatives
table["binding"] = table["binding"].replace("weak", "True")
table["binding"] = table["binding"].replace("unknown", "False")

table["binding"].value_counts()
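
The relabeling rule described above can also be written as one explicit mapping, which makes unexpected labels fail loudly instead of passing through unchanged. This is a minimal sketch, not the notebook's method: the raw label spellings (`"True"`, `"False"`, `"weak"`, `"unknown"`) are assumptions about the CSV, the helper name is hypothetical, and unlike the cell above it returns real booleans rather than strings.

```python
# Hypothetical mapping; the raw label spellings are assumed, not read from the dataset.
LABEL_MAP = {"true": True, "weak": True, "false": False, "unknown": False}


def to_binary_label(raw) -> bool:
    """Map a raw binding label to a boolean class, rejecting unknown spellings."""
    key = str(raw).strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"Unexpected binding label: {raw!r}")
    return LABEL_MAP[key]


# Toy values standing in for the `binding` column.
labels = [to_binary_label(value) for value in ["True", "weak", "unknown", "False"]]
```

Applied to the frame, this could be used as `table["binding"].map(to_binary_label)`, assuming the column contains only the four labels listed.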

In [7]:
# Rename columns to clearer names
table.rename(columns={"kd": "KD", "binding": "binding_class"}, inplace=True)

In [8]:
annotations = {
    "name": ColumnAnnotation(
        description="Sequence design name."
    ),
    "sequence": ColumnAnnotation(
        description="Protein sequence in FASTA format."
    ),
    "KD": ColumnAnnotation(
        description="The equilibrium dissociation constant (KD) for the measure of binding affinity.",
        user_attributes={
            "unit": "M",
            "objective": "Lower value",
        },
    ),
    "binding_class": ColumnAnnotation(
        description="The binding affinity as boolean class labels.",
        user_attributes={
            "objective": "True",
        },
    ),
}
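
The `KD` annotation records the unit as molar and the objective as "Lower value", since a smaller dissociation constant means tighter binding. As an optional illustration that is not part of this recipe, KD values in M are often reported on a negative log scale, pKD = -log10(KD), where larger means tighter. The function name below is hypothetical.

```python
import math


def kd_to_pkd(kd_molar: float) -> float:
    """Convert a dissociation constant in molar (M) to pKD = -log10(KD)."""
    if kd_molar <= 0:
        raise ValueError("KD must be a positive concentration in M")
    return -math.log10(kd_molar)


# A 1 nM binder (1e-9 M) versus a 1 uM binder (1e-6 M):
pkd_strong = kd_to_pkd(1e-9)
pkd_weak = kd_to_pkd(1e-6)
```

The tighter binder has the lower KD and the higher pKD, so any model trained on a log-transformed target would need the `objective` attribute flipped accordingly.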

### Define `Dataset` object

In [9]:
dataset_version = "v1"
dataset_name = f"{data_name}-{dataset_version}"

In [10]:
dataset = Dataset(
    # Keep only the annotated columns
    table=table[list(annotations)].copy(),
    name=dataset_name,
    description="This dataset includes binding protein designs targeting the Epidermal Growth Factor Receptor (EGFR), a drug target associated with various diseases.",
    source="https://design.adaptyvbio.com/",
    annotations=annotations,
    owner=owner,
    tags=["protein-design"],
    license="CC-BY-4.0",
)

### Dataset overview

In [11]:
dataset

In [12]:
# Save the dataset to GCP (disabled)
# SAVE_DIR = f"{DATASET_DIR}/{dataset_name}"
# dataset_path = dataset.to_json(SAVE_DIR)
# dataset_path

### Upload the dataset to the hub

In [14]:
dataset.upload_to_hub()