# Perturbation

This guide demonstrates how to curate a complex, real-world perturbation dataset from [McFarland et al. 2020](https://www.nature.com/articles/s41467-020-17440-w) using the {mod}`wetlab` schema.

In [None]:
# !pip install 'lamindb[jupyter,aws,bionty]' wetlab
!lamin init --storage ./test-perturbation --schema bionty,wetlab

In [None]:
import lamindb as ln
import bionty as bt
import wetlab as wl
import pandas as pd

pd.set_option("display.max_columns", None)

ln.context.uid = "K6sInKIQW5nt0003"
ln.context.track()

In [None]:
# See https://lamin.ai/laminlabs/lamindata/transform/13VINnFk89PE0006 to learn how this dataset was prepared
adata = ln.Artifact.using("laminlabs/lamindata").get(uid="Xk7Qaik9vBLV4PKf0001").load()
adata.obs.head(3)

In [None]:
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    organism="human",
    using_key="laminlabs/lamindata"
)
curate.validate()

In [None]:
# The cells were subjected to several types of perturbations, which we curate separately
adata.obs.perturbation_type.value_counts()

## Curate non-perturbation metadata

In [None]:
categoricals = {
    "depmap_id": bt.CellLine.ontology_id,
    "cell_line": bt.CellLine.name,
    "disease": bt.Disease.name,
    "organism": bt.Organism.name,
    "perturbation_type": ln.ULabel.name,
    "sex": bt.Phenotype.name,
    "time": ln.ULabel.name,
    "tissue_type": ln.ULabel.name,
}
sources = {
    "depmap_id": bt.Source.using("laminlabs/lamindata").filter(name="depmap").one(),
    "cell_line": bt.Source.using("laminlabs/lamindata").filter(name="depmap").one(),
}

curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals=categoricals,
    organism="human",
    sources=sources,
    using_key="laminlabs/lamindata"
)

curate.validate()

In [None]:
curate.add_new_from("perturbation_type")
curate.add_new_from("sex")
curate.add_new_from("time")
curate.add_new_from("tissue_type")
curate.add_new_from("cell_line")

## Model and curate perturbation metadata

The dataset has two types of perturbations: CRISPR and compounds.
We create their records and associated targets separately.

In [None]:
crispr_metadata = adata.obs[adata.obs["perturbation_type"] == "CRISPR"]
drug_metadata = adata.obs[adata.obs["perturbation_type"] == "drug"]
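
Before curating each subset, it can help to confirm that the two boolean masks together cover every observation. The following is a minimal sketch on a toy frame: the `obs` table and its values are illustrative stand-ins for `adata.obs`, not data from the actual dataset.

```python
import pandas as pd

# Toy stand-in for adata.obs; the values are illustrative only
obs = pd.DataFrame(
    {
        "perturbation": ["sgGPX4-1", "trametinib", "sgLACZ", "control"],
        "perturbation_type": ["CRISPR", "drug", "CRISPR", "drug"],
    },
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# Same boolean-mask split as in the cell above
crispr_rows = obs[obs["perturbation_type"] == "CRISPR"]
drug_rows = obs[obs["perturbation_type"] == "drug"]

# Every observation should land in exactly one of the two subsets
covered = set(crispr_rows.index) | set(drug_rows.index)
is_partition = (
    len(crispr_rows) + len(drug_rows) == len(obs) and covered == set(obs.index)
)
print(is_partition)
```

If the check fails on real data, a perturbation type other than the two handled here is present and needs its own curation step.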

The {mod}`wetlab` schema has two major components:

1. {class}`wetlab.EnvironmentalTreatment` to model perturbations such as heat, {class}`wetlab.GeneticTreatment` to model perturbations such as CRISPR, and {class}`wetlab.CompoundTreatment` to model, for example, drugs. Several treatments together can be modeled using {class}`wetlab.CombinationTreatment`.
2. Known targets of treatments can be modeled through {class}`wetlab.TreatmentTarget` which can be one or several of {class}`bionty.Gene`, {class}`bionty.Protein`, or {class}`bionty.Pathway` records.
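
The relationship between treatments and targets described above can be pictured with plain Python dataclasses. This is only a conceptual sketch: `ToyTreatment` and `ToyTarget` are hypothetical names invented for illustration and are not the `wetlab` API, which stores these links as database relations.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTarget:
    """Hypothetical stand-in for a treatment target."""

    name: str
    genes: List[str] = field(default_factory=list)
    proteins: List[str] = field(default_factory=list)
    pathways: List[str] = field(default_factory=list)


@dataclass
class ToyTreatment:
    """Hypothetical stand-in for a genetic, compound, or environmental treatment."""

    name: str
    kind: str  # "genetic", "compound", or "environmental"
    targets: List[ToyTarget] = field(default_factory=list)


# Two guide RNAs can point at one shared target record
gpx4 = ToyTarget(name="Glutathione Peroxidase 4", genes=["GPX4"])
guide_1 = ToyTreatment(name="sgGPX4-1", kind="genetic", targets=[gpx4])
guide_2 = ToyTreatment(name="sgGPX4-2", kind="genetic", targets=[gpx4])

shared = guide_1.targets[0] is guide_2.targets[0]
print(shared)
```

Keeping the target as its own record means several treatments can reference it without duplicating the gene annotation.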

### Genetic perturbations

Genetic perturbations can be modeled in two ways, depending on the available information, by populating a:

1. {class}`wetlab.GeneticTreatment` record if details of the system are known, such as the guide RNA name or sequence and the on- and off-target scores.
2. {class}`wetlab.TreatmentTarget` record that links to {class}`bionty.Gene` records.

In [None]:
crispr_metadata.head(3)

In [None]:
list(crispr_metadata["perturbation"].unique())

:::{dropdown} What are the associated targets?

The following targets are the direct targets of the perturbations, and while they may affect a pathway, we only curate the direct targets for simplicity.

1. **sgGPX4-1**: **Gene/Protein** - GPX4 (Glutathione Peroxidase 4)
2. **sgGPX4-2**: **Gene/Protein** - GPX4 (Glutathione Peroxidase 4)
3. **sgLACZ**: **Gene/Protein** - LACZ (β-galactosidase)
4. **sgOR2J2**: **Gene/Protein** - OR2J2 (Olfactory receptor family 2 subfamily J member 2)

:::

Since the perturbation metadata contains the guide RNA names, we model the genetic perturbations using both {class}`wetlab.GeneticTreatment` and {class}`wetlab.TreatmentTarget`.

In [None]:
treatments = [
    ("sgGPX4-1", "GPX4", "Glutathione Peroxidase 4"),
    ("sgGPX4-2", "GPX4", "Glutathione Peroxidase 4"),
    ("sgOR2J2", "OR2J2", "Olfactory receptor family 2 subfamily J member 2"),
    ("sgLACZ", "lacz", "beta-galactosidase control"),  # Control from E. coli
]
organism = bt.Organism.lookup().human

genetic_treatments = []
for name, symbol, target_name in treatments:
    treatment = wl.GeneticTreatment(system="CRISPR Cas9", name=name).save()
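    if symbol != "lacz":
        # Look up the human gene in the public gene source
        gene_result = bt.Gene.from_source(symbol=symbol, organism=organism)
        # Take the first record if the lookup returns a list
        gene = gene_result[0] if isinstance(gene_result, list) else gene_result
        gene = gene.save()
    else:
        # The lacZ control is an E. coli gene, so we create a custom record for it
        gene = bt.Gene(symbol=symbol, organism=organism).save()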
    target = wl.TreatmentTarget(name=target_name).save()
    target.genes.add(gene)
    treatment.targets.add(target)
    genetic_treatments.append(treatment)

### Compound perturbations

Although the targets are known for many compounds, we skip annotating them here to keep the guide brief.

:::{dropdown} What are the compound targets?

1. **AZD5591**: Unknown
2. **Afatinib**: **Proteins** - EGFR (Epidermal Growth Factor Receptor), HER2 (Human Epidermal growth factor Receptor 2)
3. **BRD3379**: Unknown
4. **Bortezomib**: **Protein complex** - Proteasome (specifically the 26S proteasome subunit)
5. **Dabrafenib**: **Gene/Protein** - BRAF (V600E mutation in the BRAF gene, which codes for a protein kinase)
6. **Everolimus**: **Protein** - mTOR (Mammalian Target of Rapamycin)
7. **Gemcitabine**: **Pathway/Process** - DNA synthesis (inhibition of ribonucleotide reductase and incorporation into DNA)
8. **Idasanutlin**: **Protein** - MDM2 (Mouse Double Minute 2 homolog)
9. **JQ1**: **Protein** - BRD4 (Bromodomain-containing protein 4)
10. **Navitoclax**: **Proteins** - BCL-2, BCL-XL (B-cell lymphoma 2 and B-cell lymphoma-extra large)
11. **Prexasertib**: **Protein** - CHK1 (Checkpoint kinase 1)
12. **Taselisib**: **Protein/Pathway** - PI3K (Phosphoinositide 3-kinase)
13. **Trametinib**: **Proteins** - MEK1/2 (Mitogen-Activated Protein Kinase Kinase 1 and 2)
14. **control**: Not applicable

:::

In [None]:
# We use the ChEBI chemistry ontology as the source for the drug perturbations
chebi_source = bt.Source.filter(entity="Drug", name="chebi").one()
wl.Compound.add_source(chebi_source)
compounds = wl.Compound.public()

In [None]:
drug_metadata.head(3)

In [None]:
compounds = wl.Compound.from_values(drug_metadata["perturbation"], field="name")

In [None]:
# The remaining compounds are not in ChEBI, so we create custom records for them
for missing in [
    "azd5591",
    "brd3379",
    "control",
    "idasanutlin",
    "prexasertib",
    "taselisib",
]:
    compounds.append(wl.Compound(name=missing))
ln.save(compounds)
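
The list of missing names above is typed by hand. As a hedged alternative, the unmatched names could be derived by comparing the observed values against the names that the ontology lookup recognized. The sketch below uses pandas only; `observed` and `matched_names` are toy values chosen for illustration and make no claim about what the ontology actually contains.

```python
import pandas as pd

# Toy stand-in for drug_metadata["perturbation"]
observed = pd.Series(
    ["afatinib", "azd5591", "afatinib", "control", "jq1"], name="perturbation"
)

# Hypothetical set of names that the ontology lookup recognized
matched_names = {"afatinib", "jq1"}

# Names that appear in the data but were not matched
unique_observed = observed.drop_duplicates()
missing = sorted(unique_observed[~unique_observed.isin(matched_names)])
print(missing)
```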

In [None]:
unique_treatments = drug_metadata[
    ["perturbation", "dose_unit", "dose_value"]
].drop_duplicates()

compound_treatments = []
for _, row in unique_treatments.iterrows():
    compound = wl.Compound.get(name=row["perturbation"])
    treatment = wl.CompoundTreatment(
        name=compound.name,
        concentration=row["dose_value"],
        concentration_unit=row["dose_unit"],
    )
    compound_treatments.append(treatment)

ln.save(compound_treatments)
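
Deduplicating on the compound together with its dose is what makes one compound yield several treatments. A minimal pandas sketch with hypothetical dose values (not taken from the dataset) shows the effect:

```python
import pandas as pd

# Toy stand-in for drug_metadata; compounds and doses are illustrative
drug_obs = pd.DataFrame(
    {
        "perturbation": ["trametinib", "trametinib", "trametinib", "jq1"],
        "dose_unit": ["uM", "uM", "uM", "uM"],
        "dose_value": [0.1, 0.1, 1.0, 1.0],
    }
)

# One row per distinct (compound, unit, value) combination
unique_rows = drug_obs[["perturbation", "dose_unit", "dose_value"]].drop_duplicates()

# A readable label per combination, useful for inspecting what will be created
labels = [
    f"{row.perturbation} {row.dose_value} {row.dose_unit}"
    for row in unique_rows.itertuples(index=False)
]
print(labels)
```

Here the two identical trametinib rows collapse into one, while the second trametinib dose remains a separate combination.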

## Register curated artifact

In [None]:
artifact = curate.save_artifact(description="McFarland AnnData")

In [None]:
artifact.genetic_treatments.set(genetic_treatments)
artifact.compound_treatments.set(compound_treatments)

In [None]:
artifact.describe()

In [None]:
# clean up test instance
!rm -r test-perturbation
!lamin delete --force test-perturbation