# Benchmarks for L1000 MCF7 dataset

## Background
LINCS L1000 is a high-throughput transcriptomics database that screened more than 30,000 perturbations against a set of 978 landmark genes [4] across multiple cell lines. VCAP and MCF7 are human prostate cancer and breast cancer cell lines, respectively. In L1000, most perturbagens are chemical: small drug-like molecules are added to the cells and the resulting changes in gene expression are measured. This yields biological signatures of the molecules, which are known to correlate with drug activity and side effects.

## Assay information
L1000 is a gene-expression profiling assay based on the direct measurement of a reduced representation of the transcriptome and computational inference of the portion of the transcriptome not explicitly measured. The number of landmark transcripts whose abundance is measured directly is approximately one thousand. Eighty additional invariant transcripts are also explicitly measured to enable quality control, scaling and normalization. Measurements of transcript abundance are made with a combination of a coupled ligase detection and polymerase chain reaction, optically-addressed microspheres, and a flow-cytometric detection system. 

For more information, see the [LINCS User Guide](https://docs.google.com/document/d/1q2gciWRhVCAAnlvF2iRLuJ7whrGP6QjpsCMq1yWz7dU/edit#heading=h.usef9o7fuux3).

## Benchmarking
**The goal** of this benchmark is to identify the model that best predicts the L1000 gene-expression signature, gene by gene. MSE (mean squared error) measures how far the predicted signature is from the measured one.

## Description of readout:
- Readouts: MSE
- Optimization objective: Lower value
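
As an illustration of the readout, the sketch below computes the MSE separately for each gene column. It is a minimal pure-Python example, not the implementation Polaris uses for its `mean_squared_error` metric; the helper name `per_gene_mse` and the toy values are assumptions made for this example.

```python
from typing import Dict, Sequence


def per_gene_mse(
    y_true: Dict[str, Sequence[float]],
    y_pred: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Return the mean squared error of each gene column separately."""
    scores = {}
    for gene, true_values in y_true.items():
        pred_values = y_pred[gene]
        if len(true_values) != len(pred_values):
            raise ValueError(f"Length mismatch for {gene}")
        squared_errors = [(t - p) ** 2 for t, p in zip(true_values, pred_values)]
        scores[gene] = sum(squared_errors) / len(squared_errors)
    return scores


# Toy values invented for illustration only (not taken from the dataset).
y_true = {"geneID-10007": [0.0, 1.0, 2.0], "geneID-1001": [1.0, 1.0, 1.0]}
y_pred = {"geneID-10007": [0.0, 2.0, 4.0], "geneID-1001": [1.0, 1.0, 1.0]}

scores = per_gene_mse(y_true, y_pred)
for gene, value in scores.items():
    print(f"{gene}: {value:.4f}")
```

A lower value is better for every gene, and a perfect prediction gives an MSE of exactly zero.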



In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import datamol as dm

# polaris benchmark
from polaris.benchmark import MultiTaskBenchmarkSpecification

# polaris hub
from polaris.utils.types import HubOwner

# utils
root = pathlib.Path("__file__").absolute().parents[3]
os.chdir(root)
sys.path.insert(0, str(root))
from utils.docs_utils import load_readme

In [2]:
# Get the owner and organization
org = "Graphium"
data_name = "l1000_mcf7"
dataset_name = f"{data_name}-v1"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"

owner = HubOwner(slug=org.lower(), type="organization")
owner

HubOwner(slug='graphium', external_id=None, type='organization')

In [3]:
BENCHMARK_DIR = f"{gcp_root}/benchmarks"
DATASET_JSON = f"{gcp_root}/datasets/{dataset_name}/dataset.json"

FIGURE_DIR = f"{gcp_root}/figures"

### Load existing data

In [4]:
# Load the saved Dataset
from polaris.dataset import Dataset

dataset = Dataset.from_json(DATASET_JSON)

<a id="benchmark"></a>
## Benchmark creation with `Polaris`
Creating a benchmark involves setting up a standard dataset, designing the train-test split, and defining the evaluation metrics that are used to establish a baseline performance level.

In [7]:
data_cols = [col for col in dataset.columns if col.startswith("geneID")]

mol_col = "SMILES"
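
The prefix filter above can be checked on a small stand-in column list. This is a sketch only: the helper `select_target_columns` and the example column names are hypothetical, and the real `dataset.columns` is not reproduced here. The check confirms that the molecule column never ends up among the targets.

```python
from typing import Iterable, List


def select_target_columns(columns: Iterable[str], prefix: str = "geneID") -> List[str]:
    """Keep only the columns whose name starts with the given prefix."""
    return [col for col in columns if col.startswith(prefix)]


# Hypothetical column names, standing in for `dataset.columns`.
columns = ["SMILES", "geneID-10007", "geneID-1001", "geneID-10013"]
mol_col = "SMILES"

target_cols = select_target_columns(columns)

# The input column must not be treated as a regression target.
if mol_col in target_cols:
    raise ValueError(f"{mol_col} was selected as a target column")

print(len(target_cols), "target columns:", target_cols)
```

If the dataset carries every landmark gene described in the background section, `len(data_cols)` would be expected to equal 978; that is an assumption worth verifying against the loaded dataset rather than a guarantee.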

### Get the train/test splits

In [11]:
import torch

split_path = f"{gcp_root}/data/raw/l1000_mcf7_random_splits.pt"
with dm.fs.fsspec.open(split_path) as f:
    split_dict = torch.load(f)

splits = [split_dict["train"], split_dict["val"], split_dict["test"]]

## Define the multitask benchmark with the split defined above

In [22]:
benchmark_splits = (splits[0] + splits[1], splits[2])
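
The expression `splits[0] + splits[1]` concatenates the two index collections only when they are Python lists. For `torch` tensors or NumPy arrays, `+` adds elementwise instead, or raises on a shape mismatch. The source does not show which type the `.pt` file stores, so the sketch below converts everything to plain integer lists before merging and also checks that the train and test sides are disjoint. The helper name `merge_index_splits` and the toy indices are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def merge_index_splits(
    train: Iterable, val: Iterable, test: Iterable
) -> Tuple[List[int], List[int]]:
    """Concatenate train and validation indices and verify the result is disjoint from test."""
    # int() also accepts 0-d tensors and NumPy integers, so this works for
    # lists, 1-D arrays, and 1-D tensors alike.
    train_val = [int(i) for i in train] + [int(i) for i in val]
    test_idx = [int(i) for i in test]

    if len(set(train_val)) != len(train_val):
        raise ValueError("train and val indices overlap or contain duplicates")

    overlap = set(train_val) & set(test_idx)
    if overlap:
        raise ValueError(f"{len(overlap)} indices appear in both train and test")

    return train_val, test_idx


# Toy indices invented for illustration only.
train_val, test_idx = merge_index_splits([0, 1, 2], [3, 4], [5, 6])
print(train_val, test_idx)
```

The returned pair has the same `(train, test)` shape as `benchmark_splits`, so it could be passed to the benchmark specification in the same way.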

In [26]:
benchmark_version = "v1"
benchmark_name = f"{data_name}-{benchmark_version}"
readme_name = f"org-Graphium/l1000/{data_name}/benchmark_readme.md"
BENCHMARK_SAVE_DIR = f"{BENCHMARK_DIR}/{benchmark_name}"


benchmark = MultiTaskBenchmarkSpecification(
    name=benchmark_name,
    dataset=dataset,
    target_cols=data_cols,
    target_types={col: "regression" for col in data_cols},
    input_cols=mol_col,
    split=benchmark_splits,
    metrics=["mean_squared_error"],
    tags=["multitask"],
    description="A multitask regression benchmark for the L1000 MCF7 dataset.",
    owner=owner,
    readme=load_readme(readme_name),
)
path = benchmark.to_json(BENCHMARK_SAVE_DIR)
print(path)

2024-07-17 00:25:01.083 | INFO     | polaris._mixins:md5sum:27 - Computing the checksum. This can be slow for large datasets.


gs://polaris-public/polaris-recipes/org-Graphium/l1000_mcf7/benchmarks/l1000_mcf7-v1/benchmark.json


In [27]:
# Upload to hub
benchmark.upload_to_hub(owner=owner, access="private")

2024-07-17 00:25:09.612 | SUCCESS  | polaris.hub.client:upload_benchmark:675 - Your benchmark has been successfully uploaded to the Hub. View it here: https://polarishub.io/benchmarks/graphium/l1000_mcf7-v1


{'id': 'mP3HMWUNfNWW2A3uB0x6t',
 'createdAt': '2024-07-17T04:25:09.487Z',
 'deletedAt': None,
 'name': 'l1000_mcf7-v1',
 'slug': 'l1000_mcf7-v1',
 'description': 'A multitask regression benchmark for ZINC12K dataset.',
 'tags': ['multitask'],
 'userAttributes': {},
 'access': 'private',
 'isCertified': False,
 'polarisVersion': 'dev',
 'readme': '## Background\n\n\n## Assay information\n\n\n## Description of readout:\n\n\n## Data resource\n\n',
 'state': 'ready',
 'ownerId': 'zMTB7lQiiukqEmLQF7EjT',
 'creatorId': 'NKnaHGybLqwSHcaMEHqfF',
 'datasetId': '7C7RxULp5PcqQkiNRdoKK',
 'targetCols': ['geneID-10007',
  'geneID-1001',
  'geneID-10013',
  'geneID-10038',
  'geneID-10046',
  'geneID-10049',
  'geneID-10051',
  'geneID-10057',
  'geneID-10058',
  'geneID-10059',
  'geneID-10099',
  'geneID-10112',
  'geneID-10123',
  'geneID-10131',
  'geneID-10146',
  'geneID-10150',
  'geneID-10153',
  'geneID-10165',
  'geneID-1017',
  'geneID-10174',
  'geneID-10180',
  'geneID-1019',
  'geneID-