# Curating a perturbation dataset

In this guide, we demonstrate in detail how to annotate a complex, real-world perturbation dataset ([McFarland et al. 2020](https://www.nature.com/articles/s41467-020-17440-w)).
We use the {mod}`wetlab` schema to enable efficient search for specific treatment targets and their associated perturbations.

In [None]:
# !pip install 'lamindb[jupyter,aws,bionty]' wetlab 
!lamin init --storage ./test-perturbation --schema bionty,wetlab

In [None]:
!wget -nc https://zenodo.org/record/7041849/files/McFarlandTsherniak2020.h5ad

In [None]:
import lamindb as ln
import bionty as bt
import wetlab as wl
import anndata as ad
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

ln.settings.creation.search_names = False

ln.context.uid = "K6sInKIQW5nt0000"
ln.context.track()

In [None]:
adata = ad.read_h5ad("McFarlandTsherniak2020.h5ad")

In [None]:
# Subsample to speed up subsequent steps
adata = adata[np.random.choice(adata.n_obs, size=int(0.5 * adata.n_obs), replace=False), :].copy()
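
The subsample above differs on every run because no seed is set. If a reproducible subsample is preferred, a seeded NumPy generator could be used instead. The following is a minimal sketch on a stand-in row count; `n_obs` and the seed value are placeholders chosen for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical stand-in for adata.n_obs; the seed value is arbitrary.
n_obs = 1_000
rng = np.random.default_rng(seed=0)

# Draw half of the row positions without replacement, then sort them so that
# the subsample keeps the original row order.
idx = np.sort(rng.choice(n_obs, size=n_obs // 2, replace=False))

# The same positions could then be applied as adata[idx, :].copy().
print(len(idx), idx[:5])
```

Sorting is optional; it only keeps the cells in their original order.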

In [None]:
adata.obs.head(3)

In [None]:
# The cells were subject to several types of perturbations that we will curate separately
adata.obs.perturbation_type.value_counts()

In [None]:
adata.obs = adata.obs.drop(columns="percent.mito")

## Curate non-perturbation metadata

In [None]:
categoricals = {
    "DepMap_ID": bt.CellLine.ontology_id,
    "cell_line": bt.CellLine.name,
    "disease": bt.Disease.name,
    "organism": bt.Organism.name,
    "perturbation_type": ln.ULabel.name,
    "sex": bt.Phenotype.name,
    "time": ln.ULabel.name,
    "tissue_type": ln.ULabel.name,
}
sources = {
    "DepMap_ID": bt.Source.filter(name="depmap").one(),
    "cell_line": bt.Source.filter(name="depmap").one(),
}

In [None]:
curate = ln.Curate.from_anndata(
    adata, 
    var_index=bt.Gene.ensembl_gene_id,
    categoricals=categoricals, 
    organism="human",
    sources=sources
)

In [None]:
curate.add_new_from_columns()

In [None]:
curate.validate()

In [None]:
# The var_index contains a mix of Ensembl IDs and gene symbols -> map the gene symbols to Ensembl IDs
gene_mapper = bt.Gene.standardize(
    curate.non_validated["var_index"],
    field="symbol",
    return_field="ensembl_gene_id",
    return_mapper=True,
    organism="human",
)
# Identifiers without an entry in the mapper are kept unchanged
adata.var.index = adata.var.index.map(lambda x: gene_mapper.get(x, x))
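
The `gene_mapper.get(x, x)` pattern replaces an identifier only if it is a key of the mapper and otherwise returns it unchanged. The sketch below illustrates this on a small pandas index; all identifiers and the mapper contents are made-up placeholders, not real gene IDs.

```python
import pandas as pd

# Hypothetical mixed index of already-valid IDs and symbols (placeholders only).
var_index = pd.Index(["ENSG_PLACEHOLDER_1", "GENE_A", "GENE_B", "ENSG_PLACEHOLDER_4"])

# Hypothetical symbol-to-ID mapper, shaped like a plain dictionary.
gene_mapper = {"GENE_A": "ENSG_PLACEHOLDER_2", "GENE_B": "ENSG_PLACEHOLDER_3"}

# Keys of the mapper are replaced; everything else falls back to itself.
mapped = var_index.map(lambda x: gene_mapper.get(x, x))

# Mapping can create duplicates if a symbol resolves to an ID that is
# already present, so checking uniqueness afterwards is a cheap safeguard.
print(list(mapped), mapped.is_unique)
```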

In [None]:
# Since the focus of this guide is the curation of perturbations, we assume that we found the correct names by searching on LaminHub
adata.obs["disease"] = adata.obs["disease"].cat.rename_categories(
    {
        "colon/colorectal cancer": "colorectal cancer",
        "rhabdoid": "rhabdoid tumor",
        "bladder cancer": "urinary bladder carcinoma",
        "endometrial/uterine cancer": "uterine corpus cancer",
    }
)

In [None]:
depmap_cell_lines = bt.CellLine.public(source=bt.Source.filter(name="depmap").one())
adata.obs["cell_line"] = depmap_cell_lines.standardize(adata.obs["cell_line"], field="name")
depmap_cell_lines.inspect(adata.obs["cell_line"], field="name")

In [None]:
curate.add_validated_from_var_index()
curate.add_validated_from('DepMap_ID')
curate.add_new_from('perturbation_type')
curate.add_new_from('sex')
curate.add_new_from('time')
curate.add_new_from('tissue_type')
curate.add_validated_from('disease')
curate.add_new_from('cell_line')

In [None]:
curate = ln.Curate.from_anndata(
    adata, 
    var_index=bt.Gene.ensembl_gene_id,
    categoricals=categoricals, 
    organism="human",
    sources=sources
)
curate.validate()

In [None]:
adata = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()
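
The filter above builds a boolean mask over the variable index and negates it, so that only validated identifiers remain. A minimal sketch of the same masking on a plain pandas DataFrame, which stands in for `adata.var`; the identifiers are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for adata.var with placeholder identifiers.
var = pd.DataFrame(index=["ENSG_PLACEHOLDER_1", "UNMAPPED_X", "ENSG_PLACEHOLDER_2"])

# Hypothetical stand-in for curate.non_validated["var_index"].
non_validated = ["UNMAPPED_X"]

# isin() marks the entries to drop; ~ inverts the mask to mark entries to keep.
keep = ~var.index.isin(non_validated)
var_kept = var.loc[keep]

print(keep.tolist(), list(var_kept.index))
```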

In [None]:
curate = ln.Curate.from_anndata(
    adata, 
    var_index=bt.Gene.ensembl_gene_id,
    categoricals=categoricals, 
    organism="human",
)
curate.validate()

## Defining Treatment records

The dataset contains two types of perturbations: CRISPR and compounds (labeled `drug` in the metadata).
We create their records and associated targets separately.

In [None]:
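# Copy the subsets so that later column assignments do not operate on a slice of adata.obs
crispr_metadata = adata.obs[adata.obs["perturbation_type"] == "CRISPR"].copy()
drug_metadata = adata.obs[adata.obs["perturbation_type"] == "drug"].copy()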

## Genetic treatments

The perturbations may affect entire pathways, but for simplicity we only curate their direct targets:

1. **sgGPX4-1**: **Gene/Protein** - GPX4 (Glutathione Peroxidase 4)
2. **sgGPX4-2**: **Gene/Protein** - GPX4 (Glutathione Peroxidase 4)
3. **sgLACZ**: **Gene/Protein** - LACZ (β-galactosidase)
4. **sgOR2J2**: **Gene/Protein** - OR2J2 (Olfactory receptor family 2 subfamily J member 2)

In [None]:
crispr_metadata.head(3)

In [None]:
list(crispr_metadata["perturbation"].unique())

In [None]:
sgGPX4_1_treatment = wl.GeneticTreatment(
    system="CRISPR Cas9",
    name="sgGPX4-1 knockdown",
).save()
gpx4_prot = bt.Protein.from_source(gene_symbol="GPX4", organism="human")[0].save()
gpx4_target = wl.TreatmentTarget(name="Glutathione Peroxidase 4").save()
gpx4_target.proteins.add(gpx4_prot)
sgGPX4_1_treatment.targets.add(gpx4_target)

In [None]:
sgGPX4_2_treatment = wl.GeneticTreatment(
    system="CRISPR Cas9",
    name="sgGPX4-2 knockdown",
).save()
sgGPX4_2_treatment.targets.add(gpx4_target)

In [None]:
sglacz_treatment = wl.GeneticTreatment(
    system="CRISPR Cas9",
    name="sgLACZ knockdown",
).save()
lacz_prot = bt.Protein.from_source(name="beta-galactosidase", organism="human").save()
lacz_target = wl.TreatmentTarget(name="beta-galactosidase").save()
lacz_target.proteins.add(lacz_prot)
sglacz_treatment.targets.add(lacz_target)

In [None]:
sgor2j2_treatment = wl.GeneticTreatment(
    system="CRISPR Cas9",
    name="sgOR2J2 knockdown",
).save()
or2j2_prot = bt.Protein.from_source(name="Olfactory receptor 2J2", organism="human").save()
or2j2_target = wl.TreatmentTarget(name="Olfactory receptor family 2 subfamily J member 2").save()
or2j2_target.proteins.add(or2j2_prot)
sgor2j2_treatment.targets.add(or2j2_target)

In [None]:
genetic_treatments = [sgGPX4_1_treatment, sgGPX4_2_treatment, sglacz_treatment, sgor2j2_treatment]
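
Because the list of treatments is assembled by hand, a quick check against the perturbation labels can catch a record that is listed twice or left out. The following is a minimal sketch on plain strings; `expected` and `collected` are hypothetical names, and it assumes that each treatment name starts with its sgRNA label.

```python
from collections import Counter

# Labels as they appear in the `perturbation` column of the CRISPR subset.
expected = ["sgGPX4-1", "sgGPX4-2", "sgLACZ", "sgOR2J2"]

# Hypothetical names of the collected treatment records; one is deliberately
# duplicated and one is missing to show what the check reports.
collected = [
    "sgGPX4-1 knockdown",
    "sgGPX4-2 knockdown",
    "sgGPX4-1 knockdown",
    "sgOR2J2 knockdown",
]

# Count how often each label occurs among the collected names.
counts = Counter(name.split(" ")[0] for name in collected)
duplicated = sorted(label for label, n in counts.items() if n > 1)
missing = sorted(label for label in expected if label not in counts)

print("duplicated:", duplicated)
print("missing:", missing)
```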

## Compound treatments

Although the targets of many of these compounds are known, we do not annotate them here to keep the guide focused. They are listed below for reference only:

1. **AZD5591**: Unknown
2. **Afatinib**: **Proteins** - EGFR (Epidermal Growth Factor Receptor), HER2 (Human Epidermal growth factor Receptor 2)
3. **BRD3379**: Unknown
4. **Bortezomib**: **Protein complex** - Proteasome (specifically the 26S proteasome subunit)
5. **Dabrafenib**: **Gene/Protein** - BRAF (V600E mutation in the BRAF gene, which codes for a protein kinase)
6. **Everolimus**: **Protein** - mTOR (Mammalian Target of Rapamycin)
7. **Gemcitabine**: **Pathway/Process** - DNA synthesis (inhibition of ribonucleotide reductase and incorporation into DNA)
8. **Idasanutlin**: **Protein** - MDM2 (Mouse Double Minute 2 homolog)
9. **JQ1**: **Protein** - BRD4 (Bromodomain-containing protein 4)
10. **Navitoclax**: **Proteins** - BCL-2, BCL-XL (B-cell lymphoma 2 and B-cell lymphoma-extra large)
11. **Prexasertib**: **Protein** - CHK1 (Checkpoint kinase 1)
12. **Taselisib**: **Protein/Pathway** - PI3K (Phosphoinositide 3-kinase)
13. **Trametinib**: **Proteins** - MEK1/2 (Mitogen-Activated Protein Kinase Kinase 1 and 2)
14. **control**: Not applicable

In [None]:
# We use the ChEBI ontology as the source for the drug perturbations
chebi_source = bt.Source.filter(entity="Drug", name="chebi").one()
wl.Compound.add_source(chebi_source)
compounds = wl.Compound.public()
compounds.df().head(3)

In [None]:
drug_metadata.head(3)

In [None]:
drug_metadata["perturbation"] = drug_metadata["perturbation"].cat.rename_categories(lambda category: category.lower())
compounds = wl.Compound.from_values(drug_metadata["perturbation"], field="name")
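
Lowercasing through `rename_categories` changes only the category labels, so the column stays categorical. A minimal sketch on a toy table, assuming the subset is copied first so that the assignment does not touch the original frame; the rows are illustrative and reuse a few perturbation names from this guide.

```python
import pandas as pd

# Toy stand-in for adata.obs with categorical columns.
obs = pd.DataFrame(
    {
        "perturbation_type": pd.Categorical(["drug", "CRISPR", "drug"]),
        "perturbation": pd.Categorical(["Afatinib", "sgGPX4-1", "JQ1"]),
    }
)

# Copy the subset so that the assignment below works on an independent frame.
drug = obs[obs["perturbation_type"] == "drug"].copy()

# A callable is applied to every category label; the resulting labels must
# remain unique, which holds here.
drug["perturbation"] = drug["perturbation"].cat.rename_categories(lambda c: c.lower())

print(list(drug["perturbation"]))
print(list(obs["perturbation"]))
```

Categories that no longer occur in the subset (here the CRISPR label) are renamed as well, since they remain part of the categorical dtype.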

In [None]:
# The remaining compounds are not in chebi and we create records for them
for missing in ['azd5591', 'brd3379', 'control', 'idasanutlin', 'prexasertib', 'taselisib']:
    compounds.append(wl.Compound(name=missing))
ln.save(compounds)

In [None]:
unique_treatments = drug_metadata[['perturbation', 'dose_unit', 'dose_value']].drop_duplicates()

compound_treatments = []
for _, row in unique_treatments.iterrows():
    # Look up the compound record that matches the perturbation name
    compound = wl.Compound.search(row["perturbation"]).first()
    treatment = wl.CompoundTreatment(
        name=compound.name,
        concentration=row["dose_value"],
        concentration_unit=row["dose_unit"],
    )
    compound_treatments.append(treatment)
    
ln.save(compound_treatments)
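
One treatment record is created per unique combination of compound, dose unit, and dose value, which is why the cell metadata is deduplicated first. The sketch below shows that step on a toy table; the doses and units are made-up values, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the drug subset of the cell metadata (hypothetical doses).
drug_obs = pd.DataFrame(
    {
        "perturbation": ["afatinib", "afatinib", "jq1", "afatinib"],
        "dose_unit": ["µM", "µM", "µM", "µM"],
        "dose_value": [0.1, 0.1, 1.0, 1.0],
    }
)

# Many cells share the same treatment; keep one row per unique combination.
unique_treatments = drug_obs[["perturbation", "dose_unit", "dose_value"]].drop_duplicates()

# Each remaining row describes one treatment record to create.
records = unique_treatments.to_dict("records")
print(len(records), records[0])
```

The same compound at two different doses yields two rows and therefore two treatment records.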

## Set relationships

In [None]:
artifact = curate.save_artifact(description="McFarland AnnData")

In [None]:
artifact.genetic_treatments.set(genetic_treatments)
artifact.compound_treatments.set(compound_treatments)

In [None]:
artifact.describe()

In [None]:
ln.context.finish()

In [None]:
# clean up test instance
!rm -r test-perturbation
!lamin delete --force test-perturbation