# Curate DataFrames and AnnDatas

Curating a dataset with LaminDB means three things:

1. **Validate:** ensure the dataset meets predefined _validation criteria_
2. **Standardize:** transform the dataset so that it meets validation criteria, e.g., by fixing typos or using standard instead of ad hoc identifiers
3. **Annotate:** link the dataset against validated metadata so that it becomes queryable

If a dataset passes validation, curating it takes two lines of code:

```python
curator = ln.Curator.from_df(df, ...)  # create a Curator and pass criteria in "..."
curator.save_artifact()                # validates the content of the dataset and saves it as an annotated artifact
```

Beyond having valid content, the curated dataset is now queryable via the metadata identifiers it contains, because they have been validated and linked against LaminDB registries.

:::{admonition} Definition: valid metadata identifier

An identifier like `"Experiment 1"` is a valid value for `ULabel.name` if a record with `name` `"Experiment 1"` exists in the {class}`~lamindb.ULabel` registry.

```python
categoricals = {"experiment": ln.ULabel.name}  # the validation constraint
curator = ln.Curator.from_df(df, categoricals=categoricals)
curator.validate()
```

The DataFrame validates if:

- there is a column with name `"experiment"` in the dataframe whose values are all found in the `name` field of the {class}`~lamindb.ULabel` registry
- the column name `"experiment"` is found in the `name` field of the {class}`~lamindb.Feature` registry

:::
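
The two conditions in the definition above can be sketched in plain Python. This is a conceptual illustration only, not how LaminDB implements validation: the sets `ulabel_names` and `feature_names` are hypothetical stand-ins for the `name` fields of the `ULabel` and `Feature` registries, and `column_validates` is a hypothetical helper.

```python
# Hypothetical stand-ins for registry contents (not real LaminDB queries)
ulabel_names = {"Experiment 1", "Experiment 2"}
feature_names = {"experiment"}


def column_validates(column_name, values, feature_names, label_names):
    """Return True if the column name is a known feature and all values are known labels."""
    return column_name in feature_names and set(values) <= label_names


# a column whose values all exist in the (hypothetical) registry
is_valid = column_validates(
    "experiment", ["Experiment 1", "Experiment 2", "Experiment 1"], feature_names, ulabel_names
)

# a column containing a typo fails the check
typo_is_valid = column_validates(
    "experiment", ["Experiment 1", "Experimnt 2"], feature_names, ulabel_names
)
print(is_valid, typo_is_valid)
```

The first check passes, whereas the misspelled value `"Experimnt 2"` makes the second one fail.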

Beyond validating metadata identifiers, LaminDB also validates data types and dataset schema.

:::{dropdown} How does validation in LaminDB compare to validation in pandera?

Like LaminDB, [pandera](https://pandera.readthedocs.io/) validates the _dataset schema_ (i.e., column names and `dtype`s).

`pandera` is only available for `DataFrame`-like datasets and cannot annotate datasets, i.e., it can't make them queryable.

However, it offers an API for range checks on both numerical and string-like data. If you need such checks, you can combine LaminDB and pandera-based validation.

```python
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)  # this corresponds to curator.validate() in LaminDB
print(validated_df)
```

:::

In [None]:
# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --schema bionty

## Curate a DataFrame

Let's start with a DataFrame that we'd like to validate.

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd


df = pd.DataFrame(
    {
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"],
        "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
        "donor": ["D0001", "D0002", "D0003"]
    },
    index = ["obs1", "obs2", "obs3"]
)
df

Define validation criteria and create a {class}`~lamindb.Curator` object.

In [None]:
# in the dictionary, each key is a column name of the dataframe, and each value
# is a registry field onto which values are mapped
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

# pass validation criteria
curate = ln.Curator.from_df(df, categoricals=categoricals)

The {meth}`~lamindb.core.BaseCurator.validate` method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

In [None]:
curate.validate()

In [None]:
# check the non-validated terms
curate.non_validated

For `cell_type`, we saw that "cerebral pyramidal neuron" and "astrocytic glia" are not validated.

First, let's standardize the synonym "astrocytic glia" as suggested:

In [None]:
curate.standardize("cell_type")

In [None]:
# now we have only one non-validated term left
curate.non_validated
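
Conceptually, standardization replaces known synonyms with their canonical names and leaves all other values untouched. A minimal sketch in plain Python, assuming a hypothetical `synonyms` dictionary and a hypothetical `known_names` set in place of the registry (an illustration of the idea, not LaminDB's API):

```python
# Hypothetical synonym table: synonym -> canonical name (illustration only)
synonyms = {"astrocytic glia": "astrocyte"}
# Hypothetical set of names that already exist in the registry
known_names = {"astrocyte", "oligodendrocyte"}

cell_types = ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"]

# replace synonyms with canonical names, keep all other values as they are
standardized = [synonyms.get(value, value) for value in cell_types]

# values that are still unknown cannot be fixed automatically
still_non_validated = [value for value in standardized if value not in known_names]

print(standardized)
print(still_non_validated)
```

In this sketch, the synonym is resolved automatically, while the misspelled term remains and has to be fixed by hand.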

For "cerebral pyramidal neuron", let's understand which cell type in the public ontology might be the actual match.

In [None]:
# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curate.lookup()` to get a lookup object of existing records in your instance
lookup = curate.lookup(public=True)
lookup

In [None]:
# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron

In [None]:
# fix the cell type
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})

For `donor`, we want to add the new donors "D0001", "D0002", and "D0003":

In [None]:
# this adds donors that were _not_ validated
curate.add_new_from("donor")

In [None]:
# validate again
curate.validate()
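
Why the second validation passes can be sketched in plain Python: once the previously unknown values are registered, nothing is left to flag. The `registry` set and the `find_non_validated` helper are hypothetical stand-ins for the `ULabel` registry and the curator's bookkeeping, not LaminDB's implementation.

```python
# Hypothetical in-memory stand-in for the ULabel registry (starts empty)
registry = set()
donors = ["D0001", "D0002", "D0003"]


def find_non_validated(values, registry):
    """Return the values that are not yet present in the registry, preserving order."""
    return [value for value in values if value not in registry]


before = find_non_validated(donors, registry)  # all donors are unknown at first
registry.update(before)                        # register the new values
after = find_non_validated(donors, registry)   # nothing is left to validate

print(before)
print(after)
```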

Save a curated artifact.

In [None]:
artifact = curate.save_artifact(description="My curated dataframe")

In [None]:
artifact.describe(print_types=True)

## Curate an AnnData

Here we additionally specify which registry field to validate `var.index` against.

In [None]:
import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3], 
        "ENSG00000276977": [4, 5, 6], 
        "ENSG00000198851": [7, 8, 9], 
        "ENSG00000010610": [10, 11, 12], 
        "ENSG00000153563": [13, 14, 15],
        "ENSGcorrupted": [16, 17, 18]
    }, 
    index=df.index  # because we already curated the dataframe above, it will validate 
)
adata = ad.AnnData(X=X, obs=df)
adata

In [None]:
curate = ln.Curator.from_anndata(
    adata, 
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals, 
    organism="human",
)
curate.validate()

Non-validated terms can be accessed via:

In [None]:
curate.non_validated

Subset the `AnnData` to validated genes only:

In [None]:
adata_validated = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()
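
The boolean mask in the cell above keeps every variable whose identifier is not listed as non-validated. The same filtering logic in plain Python, assuming for illustration that `"ENSGcorrupted"` is the only non-validated identifier (the actual content of `curate.non_validated` may differ):

```python
var_index = [
    "ENSG00000081059",
    "ENSG00000276977",
    "ENSG00000198851",
    "ENSG00000010610",
    "ENSG00000153563",
    "ENSGcorrupted",
]
# Assumption for illustration: the identifiers flagged as non-validated
non_validated_genes = {"ENSGcorrupted"}

# True where the gene should be kept, mirroring `~index.isin(...)`
mask = [gene not in non_validated_genes for gene in var_index]
kept_genes = [gene for gene, keep in zip(var_index, mask) if keep]

print(kept_genes)
```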

Now let's validate the subsetted object:

In [None]:
curate = ln.Curator.from_anndata(
    adata_validated, 
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals, 
    organism="human",
)
curate.validate()

The validated object can then be saved as an {class}`~lamindb.Artifact`:

In [None]:
artifact = curate.save_artifact(description="test AnnData")

The saved artifact has been annotated with validated features and labels:

In [None]:
artifact.describe()

We've walked through validating, standardizing, and annotating datasets in these key steps:

1. Defining validation criteria
2. Validating data against existing registries
3. Adding new validated entries to registries
4. Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If you have datasets that aren't DataFrame-like or AnnData-like, read: {doc}`curate-any`.