# PertCurator

Here we use `PertCurator` to curate the perturbation-related columns of an `AnnData` object from [McFarland et al. 2020](https://www.nature.com/articles/s41467-020-17440-w).

In [None]:
# pip install 'lamindb[jupyter,wetlab]' cellxgene-lamin
!lamin init --storage ./test-pert-curator --modules bionty,wetlab,ourprojects

In [None]:
import lamindb as ln
import wetlab as wl
import bionty as bt
import ourprojects as ops
import pandas as pd
import scanpy as sc

ln.track("HIRTYxL3aZc70000")

In [None]:
adata = ln.Artifact.using("laminlabs/lamindata").get(uid="Xk7Qaik9vBLV4PKf0001").load()
adata.obs.head()

In [None]:
# Calculate an embedding because CELLxGENE requires one
sc.tl.pca(adata)

## Curate and register perturbations

Required columns:
- Either "pert_target", or both "pert_name" and "pert_type" ("pert_type" allows "genetic", "compound", "biologic", or "physical")
- If `pert_dose=True` (default), "pert_dose" is required, written as number+unit, e.g. "10.0nM"
- If `pert_time=True` (default), "pert_time" is required, written as number+unit, e.g. "10.0h"
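
To make the number+unit convention concrete, here is a minimal sketch that flags strings not following it. The regular expression and the helper `is_number_unit` are assumptions for illustration, not the check that `PertCurator` runs internally, and the toy values are made up.

```python
import re

import pandas as pd

# Hypothetical pattern: a number (optionally with decimals) followed directly by a unit.
# This is an assumption for illustration, not PertCurator's actual validation rule.
NUMBER_UNIT = re.compile(r"^\d+(\.\d+)?[A-Za-zµ%]+$")


def is_number_unit(value) -> bool:
    """Return True if value looks like '10.0nM' or '24h'; missing values pass."""
    if pd.isna(value):
        return True
    return bool(NUMBER_UNIT.match(str(value)))


# Toy obs table with one malformed entry per column.
obs = pd.DataFrame(
    {
        "pert_dose": ["10.0nM", "2.5uM", None, "10 nM"],
        "pert_time": ["24h", "6h", "24h", "24"],
    }
)

# Collect the entries that do not follow the number+unit form.
bad_dose = obs.loc[~obs["pert_dose"].map(is_number_unit), "pert_dose"].tolist()
bad_time = obs.loc[~obs["pert_time"].map(is_number_unit), "pert_time"].tolist()
print(bad_dose)  # ['10 nM']
print(bad_time)  # ['24']
```

A check like this, run before `curator.validate()`, can surface stray spaces or missing units early.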

In [None]:
# build and rename columns to match the names PertCurator expects
adata.obs["pert_time"] = adata.obs["time"].apply(
    lambda x: str(x).split(", ")[-1] + "h" if pd.notna(x) else x
)  # "time" may list several timepoints; we keep only the last one
# note: the unit of the first row is applied to every dose value
adata.obs["pert_dose"] = adata.obs["dose_value"].map(
    lambda x: f"{x}{adata.obs['dose_unit'].iloc[0]}" if pd.notna(x) else None
)
adata.obs.rename(
    columns={"perturbation": "pert_name", "perturbation_type": "pert_type"},
    inplace=True,
)
# rename the perturbation types to the values the curator accepts
adata.obs["pert_type"] = adata.obs["pert_type"].cat.rename_categories(
    {"CRISPR": "genetic", "drug": "compound"}
)

In [None]:
curator = wl.PertCurator(adata)

In [None]:
curator.validate()

### Genetic perturbations
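
The cell below derives "pert_target" from the guide names with `Series.map`. A small sketch of that lookup on made-up rows shows that names absent from the mapping, such as the non-cutting control, end up as missing values rather than raising an error:

```python
import pandas as pd

# Toy guide names (hypothetical rows); the mapping mirrors the notebook's pert_target_map.
pert_genetic = pd.Series(["sggpx4-1", "sgor2j2", "sglacz", None], name="pert_genetic")
pert_target_map = {"sggpx4-1": "GPX4", "sggpx4-2": "GPX4", "sgor2j2": "OR2J2"}

# Series.map looks up each value in the dict; anything absent becomes NaN.
pert_target = pert_genetic.map(pert_target_map)
print(pert_target.tolist())  # ['GPX4', 'OR2J2', nan, nan]

# Names that are present in the data but have no target in the mapping.
unmapped = sorted(set(pert_genetic.dropna()) - set(pert_target_map))
print(unmapped)  # ['sglacz']
```

Listing the unmapped names is a quick way to confirm that only intended controls are left without a target.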

In [None]:
# register genetic perturbations with their target genes
pert_target_map = {
    "sggpx4-1": "GPX4",
    "sggpx4-2": "GPX4",
    "sgor2j2": "OR2J2",  # cutting control
}

for sg_name, gene_symbol in pert_target_map.items():
    pert = wl.GeneticPerturbation(
        system="CRISPR-Cas9",
        name=sg_name,
        description="cutting control" if sg_name == "sgor2j2" else None,
    ).save()
    target = wl.PerturbationTarget(name=gene_symbol).save()
    pert.targets.add(target)
    gene = bt.Gene.from_source(symbol=gene_symbol, organism="human").save()
    target.genes.set([gene] if isinstance(gene, bt.Gene) else gene)

adata.obs["pert_target"] = adata.obs["pert_genetic"].map(pert_target_map)

# register the negative control without targets: Non-cutting control
wl.GeneticPerturbation(
    name="sglacz", system="CRISPR-Cas9", description="non-cutting control"
).save();

### Compounds

In [None]:
# the remaining compounds are not in ChEBI, so we create new records for them
curator.add_new_from("pert_compound")

## Curate non-pert metadata
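
The gene subsetting in this section relies on a boolean mask over `adata.var_names`. A minimal sketch of the same idea with plain NumPy and pandas, using made-up identifiers and a stand-in list for the curator's non-validated genes:

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 2 cells x 4 genes (hypothetical values and identifiers).
X = np.arange(8).reshape(2, 4)
var_names = pd.Index(["ENSG01", "ENSG02", "not-a-gene", "ENSG04"])

# Stand-in for the list of gene identifiers that failed validation.
non_validated = ["not-a-gene"]

# isin marks the genes to drop; ~ inverts the mask so True means "keep".
keep = ~var_names.isin(non_validated)
X_valid = X[:, keep]

print(var_names[keep].tolist())  # ['ENSG01', 'ENSG02', 'ENSG04']
print(X_valid.shape)  # (2, 3)
```

Indexing an `AnnData` object with such a mask returns a view, which is why the notebook calls `.copy()` on the result.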

In [None]:
# manually fix sex and set assay
adata.obs["sex"] = adata.obs["sex"].cat.rename_categories({"Unknown": "unknown"})
adata.obs["assay"] = "10x 3' v3"

# standardize disease and sex as suggested; do this before subsetting,
# because the curator still points at the original adata
curator.standardize("disease")
curator.standardize("sex")

# subset the adata to only include the validated genes
adata = adata[:, ~adata.var_names.isin(curator.non_validated["var_index"])].copy()

In [None]:
# Recreate Curator object because we are using a new adata
curator = wl.PertCurator(adata)
curator.validate()

In [None]:
curator.add_new_from("all")

In [None]:
curator.validate()

## References

In [None]:
reference = ops.Reference(
    name="Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action",
    abbr="McFarland 2020",
    url="https://www.nature.com/articles/s41467-020-17440-w",
    doi="10.1038/s41467-020-17440-w",
    text=(
        "Assays to study cancer cell responses to pharmacologic or genetic perturbations are typically "
        "restricted to using simple phenotypic readouts such as proliferation rate. Information-rich assays, "
        "such as gene-expression profiling, have generally not permitted efficient profiling of a given "
        "perturbation across multiple cellular contexts. Here, we develop MIX-Seq, a method for multiplexed "
        "transcriptional profiling of post-perturbation responses across a mixture of samples with single-cell "
        "resolution, using SNP-based computational demultiplexing of single-cell RNA-sequencing data. We show "
        "that MIX-Seq can be used to profile responses to chemical or genetic perturbations across pools of 100 "
        "or more cancer cell lines. We combine it with Cell Hashing to further multiplex additional experimental "
        "conditions, such as post-treatment time points or drug doses. Analyzing the high-content readout of "
        "scRNA-seq reveals both shared and context-specific transcriptional response components that can identify "
        "drug mechanism of action and enable prediction of long-term cell viability from short-term transcriptional "
        "responses to treatment."
    ),
).save()

## Register curated artifact

In [None]:
artifact = curator.save_artifact(description="McFarland AnnData")

In [None]:
# link the reference to the artifact
artifact.references.add(reference)

In [None]:
artifact.describe()