[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/registries.ipynb)

# Query & search registries

This guide walks through different ways of querying & searching LaminDB registries.

Let's start by creating a few example datasets and saving them into a LaminDB instance (hidden cell).

In [None]:
# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-registries --schema bionty

# python
import lamindb as ln
import bionty as bt
from lamindb.core import datasets

ln.track("pd7UR7Z8hoTq0000")

# Create non-curated datasets
ln.Artifact(datasets.file_jpg_paradisi05(), key="images/my_image.jpg").save()
ln.Artifact(datasets.file_fastq(), key="raw/my_fastq.fastq").save()
ln.Artifact.from_df(datasets.df_iris(), key="iris/iris_collection.parquet").save()

# Create a more complex case
# observation-level metadata
ln.Feature(name="cell_medium", dtype="cat[ULabel]").save()
ln.Feature(name="sample_note", dtype="str").save()
ln.Feature(name="cell_type_by_expert", dtype="cat[bionty.CellType]").save()
ln.Feature(name="cell_type_by_model", dtype="cat[bionty.CellType]").save()
# dataset-level metadata
ln.Feature(name="temperature", dtype="float").save()
ln.Feature(name="study", dtype="cat[ULabel]").save()
ln.Feature(name="date_of_study", dtype="date").save()
ln.Feature(name="study_note", dtype="str").save()

# Permissible values for categoricals
ln.ULabel.from_values(["DMSO", "IFNG"], create=True).save()
ln.ULabel.from_values(
    ["Candidate marker study 1", "Candidate marker study 2"], create=True
).save()
bt.CellType.from_values(["B cell", "T cell"], create=True).save()

# Ingest dataset1
adata = datasets.small_dataset1(format="anndata")
curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={
        "cell_medium": ln.ULabel.name,
        "cell_type_by_expert": bt.CellType.name,
        "cell_type_by_model": bt.CellType.name,
    },
    organism="human",
)
artifact = curator.save_artifact(key="example_datasets/dataset1.h5ad")
artifact.features.add_values(adata.uns)

# Ingest dataset2
adata2 = datasets.small_dataset2(format="anndata")
curator = ln.Curator.from_anndata(
    adata2,
    var_index=bt.Gene.symbol,
    categoricals={
        "cell_medium": ln.ULabel.name,
        "cell_type_by_model": bt.CellType.name,
    },
    organism="human",
)
artifact2 = curator.save_artifact(key="example_datasets/dataset2.h5ad")
artifact2.features.add_values(adata2.uns)

## Get an overview

The easiest way to get an overview of all artifacts is to call {meth}`~lamindb.Artifact.df`, which returns the 100 latest artifacts in the {class}`~lamindb.Artifact` registry.

In [None]:
import lamindb as ln

ln.Artifact.df()

You can include fields from other registries.

In [None]:
ln.Artifact.df(include=["created_by__name", "ulabels__name", "cell_types__name", "feature_sets__registry", "suffix"])

You can include information about which artifact measures which `feature`.

In [None]:
df = ln.Artifact.df(features=True)
ln.view(df)  # for clarity, leverage ln.view() to display dtype annotations

The flattened table that includes information from all relevant registries is easier to understand than the normalized data. For comparison, here is how to see the latter.

In [None]:
ln.view()

## Auto-complete records

For registries with fewer than 100k records, auto-completing a `Lookup` object is the most convenient way of finding a record.

In [None]:
import bionty as bt

# query the database for all ulabels or all cell types
ulabels = ln.ULabel.lookup()
cell_types = bt.CellType.lookup()

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::

With auto-complete, we find a ulabel:

In [None]:
study1 = ulabels.candidate_marker_study_1
study1
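To illustrate how record names can become auto-completable attributes, here is a minimal sketch in plain Python. It is not LaminDB's implementation: the helper `make_lookup` and its normalization rule (lowercase, runs of non-alphanumeric characters replaced by underscores) are assumptions chosen to reproduce the `candidate_marker_study_1` attribute used above.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Hypothetical helper: expose each name as an identifier-like attribute."""
    attrs = {}
    for name in names:
        # Assumed rule: lowercase, then replace non-alphanumeric runs with "_"
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        attrs[key] = name
    return SimpleNamespace(**attrs)


lookup = make_lookup(["DMSO", "IFNG", "Candidate marker study 1"])
print(lookup.candidate_marker_study_1)
```

Because the attributes exist on a regular Python object, editors and notebooks can offer them through tab completion.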

## Get one record

{meth}`~lamindb.core.Record.get` errors if more than one matching record is found.

In [None]:
print(study1.uid)

# by uid
ln.ULabel.get(study1.uid)

# by field
ln.ULabel.get(name="Candidate marker study 1")

## Query multiple records

Filter for all artifacts annotated by a ulabel:

In [None]:
ln.Artifact.filter(ulabels=study1).df()

To access the results encoded in a filter statement, execute its return value with one of:

- {meth}`~lamindb.core.QuerySet.df`: A pandas `DataFrame` with each record in a row.
- {meth}`~lamindb.core.QuerySet.all`: A {class}`~lamindb.core.QuerySet`.
- {meth}`~lamindb.core.QuerySet.one`: Exactly one record. Raises an error if there is none. Equivalent to the `.get()` method shown above.
- {meth}`~lamindb.core.QuerySet.one_or_none`: Either one record or `None` if there is no query result.
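The difference between the last two accessors can be sketched over a plain Python list. The functions below are hypothetical stand-ins, not LaminDB code, and raising an error for more than one match is an assumption of this sketch.

```python
def one(results):
    """Return the single element; raise if there are zero or several."""
    if len(results) != 1:
        raise ValueError(f"expected exactly one result, got {len(results)}")
    return results[0]


def one_or_none(results):
    """Return the single element, or None if the result is empty."""
    if not results:
        return None
    return one(results)


print(one(["Candidate marker study 1"]))
print(one_or_none([]))
```

Use the `one_or_none` pattern when an empty result is an expected outcome rather than an error.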

```{note}

{meth}`~lamindb.core.Record.filter` returns a {class}`~lamindb.core.QuerySet`.

The registries in LaminDB are Django Models and any [Django query](https://docs.djangoproject.com/en/stable/topics/db/queries/) works.

LaminDB re-interprets Django's API for data scientists.

```

```{dropdown} What does this have to do with SQL?

Under the hood, any `.filter()` call translates into a SQL select statement.

LaminDB's registries are object relational mappers (ORMs) that rely on Django for all the heavy lifting.

Of note, `.one()` and `.one_or_none()` are the two parts of LaminDB's API that are borrowed from SQLAlchemy. In its first year, LaminDB built on SQLAlchemy.

```
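As a rough illustration of the SQL connection, the sketch below uses Python's built-in `sqlite3` module to run the kind of select statement that a filter on `suffix` conceptually corresponds to. The table and column names are assumptions made for this example and do not reflect LaminDB's actual schema or the exact SQL that Django generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified artifact table
conn.execute(
    "CREATE TABLE artifact (id INTEGER PRIMARY KEY, artifact_key TEXT, suffix TEXT)"
)
conn.executemany(
    "INSERT INTO artifact (artifact_key, suffix) VALUES (?, ?)",
    [
        ("images/my_image.jpg", ".jpg"),
        ("example_datasets/dataset1.h5ad", ".h5ad"),
        ("example_datasets/dataset2.h5ad", ".h5ad"),
    ],
)

# Conceptual counterpart of filtering with suffix=".h5ad"
rows = conn.execute(
    "SELECT artifact_key FROM artifact WHERE suffix = ? ORDER BY id",
    (".h5ad",),
).fetchall()
print(rows)
conn.close()
```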

## Search for records

You can search every registry via {meth}`~lamindb.core.Record.search`. For example, search the `Artifact` registry:

In [None]:
ln.Artifact.search("iris").df()

Here is more background on search and examples for searching the entire cell type ontology: {doc}`/faq/search` 

## Query related registries

Django has a double-underscore syntax to filter based on related tables.

This syntax enables you to traverse several layers of relations and leverage different comparators.

In [None]:
ln.Artifact.filter(created_by__handle__startswith="testuse").df()  

The filter selects all artifacts created by users whose handle starts with `testuse`. Under the hood, in the SQL database, it joins the artifact table with the user table.
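The join can be sketched with Python's built-in `sqlite3` module. The two tables, their columns, and the user handles below are made up for illustration; LaminDB's real schema and the SQL emitted by Django will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified tables
    CREATE TABLE app_user (id INTEGER PRIMARY KEY, handle TEXT);
    CREATE TABLE artifact (
        id INTEGER PRIMARY KEY,
        artifact_key TEXT,
        created_by_id INTEGER REFERENCES app_user (id)
    );
    INSERT INTO app_user (id, handle) VALUES (1, 'testuser1'), (2, 'someone_else');
    INSERT INTO artifact (artifact_key, created_by_id) VALUES
        ('raw/my_fastq.fastq', 1),
        ('iris/iris_collection.parquet', 2);
    """
)

# Follow the foreign key from artifact to user, then compare the handle prefix
rows = conn.execute(
    """
    SELECT artifact.artifact_key
    FROM artifact
    JOIN app_user ON artifact.created_by_id = app_user.id
    WHERE app_user.handle LIKE 'testuse%'
    ORDER BY artifact.id
    """
).fetchall()
print(rows)
conn.close()
```

Each additional double-underscore hop across a relation corresponds to one more join of this kind.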

Another typical example is querying all datasets that measure a particular feature. For instance, which datasets measure `"CD8A"`? Here is how to do it:

In [None]:
cd8a = bt.Gene.get(symbol="CD8A")
# query for all feature sets that contain CD8A
feature_sets_with_cd8a = ln.FeatureSet.filter(genes=cd8a).all()
# get all artifacts linked to any of these feature sets
ln.Artifact.filter(feature_sets__in=feature_sets_with_cd8a).df()

Instead of splitting this across three queries, the double-underscore syntax allows you to define a path for one query.

In [None]:
ln.Artifact.filter(feature_sets__genes__symbol="CD8A").df()

## Filter operators

You can qualify the type of comparison in a query by using a comparator.

Below is a list of the most important ones, but Django supports about [two dozen field comparators](https://docs.djangoproject.com/en/stable/ref/models/querysets/#field-lookups) of the form `field__comparator=value`.
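To make the `field__comparator=value` convention concrete, here is a toy matcher over plain dictionaries. It is only a sketch: `COMPARATORS` and `matches` are hypothetical names, the record sizes are made up, only a handful of comparators are covered, and relation traversal is not handled. Django's real implementation compiles such lookups to SQL instead of evaluating them in Python.

```python
import operator

# Hypothetical subset of comparators; Django supports many more
COMPARATORS = {
    "exact": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
    "in": lambda value, options: value in options,
    "contains": lambda value, part: part in value,
    "icontains": lambda value, part: part.lower() in value.lower(),
    "startswith": lambda value, prefix: value.startswith(prefix),
}


def matches(record, **conditions):
    """Return True if `record` satisfies every `field__comparator=value` condition."""
    for lookup, expected in conditions.items():
        # Split "size__gt" into ("size", "gt"); a bare field name means equality
        field, _, comparator = lookup.partition("__")
        comparator = comparator or "exact"
        if not COMPARATORS[comparator](record[field], expected):
            return False
    return True


# Made-up records for illustration
records = [
    {"suffix": ".jpg", "size": 29_000},
    {"suffix": ".h5ad", "size": 46_000},
    {"suffix": ".parquet", "size": 5_000},
]
large_h5ad = [r for r in records if matches(r, suffix=".h5ad", size__gt=1e4)]
print(large_h5ad)
```

Passing several keyword arguments combines the conditions with a logical "and", which mirrors the behavior shown in the next section.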

### and

In [None]:
ln.Artifact.filter(suffix=".h5ad", ulabels=study1).df()

### less than/ greater than

Or subset to artifacts larger than 10 kB. The `__gt` comparator expresses this directly as a keyword argument.

In [None]:
ln.Artifact.filter(ulabels=study1, size__gt=1e4).df()

### in

In [None]:
ln.Artifact.filter(suffix__in=[".jpg", ".fastq.gz"]).df()

### order by

In [None]:
ln.Artifact.filter().order_by("-created_at").df()

### contains

In [None]:
ln.Transform.filter(name__contains="search").df().head(5)

And case-insensitive:

In [None]:
ln.Transform.filter(name__icontains="Search").df().head(5)

### startswith

In [None]:
ln.Transform.filter(name__startswith="Research").df()

### or

In [None]:
ln.Artifact.filter(ln.Q(suffix=".jpg") | ln.Q(suffix=".fastq.gz")).df()

### negate/ unequal

In [None]:
ln.Artifact.filter(~ln.Q(suffix=".jpg")).df()
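The `|` and `~` operators in the last two sections work because condition objects overload Python's operators. The sketch below shows the idea with a hypothetical `Cond` class evaluated over dictionaries; it is not Django's `Q` class, which builds SQL expressions rather than Python predicates.

```python
class Cond:
    """Hypothetical composable condition; a stand-in to show operator overloading."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __or__(self, other):
        # cond_a | cond_b is true if either side is true
        return Cond(lambda record: self.predicate(record) or other.predicate(record))

    def __invert__(self):
        # ~cond negates the condition
        return Cond(lambda record: not self.predicate(record))

    def __call__(self, record):
        return self.predicate(record)


def suffix_is(value):
    """Build a condition that checks the record's suffix for equality."""
    return Cond(lambda record: record["suffix"] == value)


records = [{"suffix": ".jpg"}, {"suffix": ".fastq.gz"}, {"suffix": ".h5ad"}]

jpg_or_fastq = suffix_is(".jpg") | suffix_is(".fastq.gz")
not_jpg = ~suffix_is(".jpg")

print([r["suffix"] for r in records if jpg_or_fastq(r)])
print([r["suffix"] for r in records if not_jpg(r)])
```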