# Ingest data - `db.ingest`

LaminDB offers an `ingest` function, which ingests data of any format into the database.

In [None]:
import lamindb as ln

ln.nb.header()

## Ingest files

Let's first ingest a simple image file from [Paradisi *et al.* (2005)](https://bmcmolcellbiol.biomedcentral.com/articles/10.1186/1471-2121-6-27):

<img width="150" alt="Laminopathic nuclei" src="https://upload.wikimedia.org/wikipedia/commons/2/28/Laminopathic_nuclei.jpg">

In [None]:
filepath = ln.datasets.file_jpg_paradisi05()
filepath

To track this dataset, we stage it for ingestion via `.add`:

In [None]:
ln.db.ingest.add(filepath)

Staged files can be viewed via `.status`; each is assigned a unique id and a version number:

In [None]:
ln.db.ingest.status
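To make the staging idea concrete, here is a minimal sketch of a staging area that assigns an id and a version to each added file. It is not LaminDB's implementation: `ToyStaging`, `StagedEntry`, the hash-based id scheme, and the file name `example.jpg` are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class StagedEntry:
    """One staged file: its path plus the identifiers assigned to it."""

    path: Path
    uid: str
    version: str = "1"


@dataclass
class ToyStaging:
    """Hypothetical stand-in for a staging area; not LaminDB's implementation."""

    entries: dict = field(default_factory=dict)

    def add(self, path):
        path = Path(path)
        # Derive a short, deterministic id from the path string. This is an
        # assumption; a real system might use random ids or content hashes.
        uid = hashlib.sha1(str(path).encode()).hexdigest()[:12]
        self.entries[uid] = StagedEntry(path=path, uid=uid)
        return uid

    @property
    def status(self):
        # One row per staged entry: (file name, id, version).
        return [(e.path.name, e.uid, e.version) for e in self.entries.values()]


staging = ToyStaging()
uid = staging.add("example.jpg")  # placeholder file name
print(staging.status)
```

Because the id is derived from the path, adding the same path twice keeps a single entry in this sketch.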

## Ingest data objects

You can also ingest a data object loaded into memory, for instance, a dataframe:

In [None]:
import sklearn.datasets

df = sklearn.datasets.load_iris(as_frame=True).frame

df.head()

When ingesting in-memory objects, a `name` parameter needs to be passed:

In [None]:
ln.db.ingest.add(df, name="iris")

Upon ingestion, the data object is saved in a corresponding file format. In this case, the dataframe is saved as a `.feather` file in LaminDB.

In [None]:
ln.db.ingest.status
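The two rules in this section (in-memory objects need a `name`, and the object type determines the file format) can be sketched as follows. The function `resolve_filename`, the `SUFFIX_BY_TYPE` table, and the placeholder `DataFrame` class are hypothetical; only the dataframe-to-`.feather` mapping comes from the text above, and the `AnnData` entry is an assumption.

```python
from pathlib import Path

# Assumed mapping from in-memory type name to on-disk suffix. The dataframe
# entry follows the text above; the AnnData entry is an assumption.
SUFFIX_BY_TYPE = {"DataFrame": ".feather", "AnnData": ".h5ad"}


def resolve_filename(data, name=None):
    """Return the file name a staged object would be stored under."""
    if isinstance(data, (str, Path)):
        # Files on disk already carry a name and a suffix.
        return Path(data).name
    if name is None:
        raise ValueError("in-memory objects need an explicit name")
    suffix = SUFFIX_BY_TYPE.get(type(data).__name__)
    if suffix is None:
        raise TypeError(f"no storage format known for {type(data).__name__}")
    return f"{name}{suffix}"


class DataFrame:
    """Placeholder for pandas.DataFrame, used to keep the sketch dependency-free."""


print(resolve_filename("nuclei.jpg"))
print(resolve_filename(DataFrame(), name="iris"))
```

Dispatching on the type name keeps the sketch free of third-party imports; a real implementation would more likely check against the actual classes.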

## Ingest with feature models

So far, we haven't been able to query the features[^features] of ingested data; we have used LaminDB like a data lake.

[^features]: We'll mostly use the term feature as a synonym for variable (statistics), column and field (databases), and dimension (machine learning).

We can also use LaminDB like a queryable data warehouse to store links[^relations] and monitor data integrity.

Let us explain how to implement this by providing feature models at ingestion!

[^relations]: We mostly use the term link as a synonym for relation and reference.

Let's now ingest an scRNA-seq count matrix in the form of an in-memory `AnnData` object:

In [None]:
import scanpy as sc

data = sc.read(ln.datasets.file_mouse_sc_lymph_node())

data.var.head()

Features in the data are indexed by Ensembl gene ids. For an overview of currently available gene ids, see: [`bt.lookup.gene_id`](https://lamin.ai/docs/bionty/api).

Hence, we use a feature model (see all feature models at: [`bt.lookup.feature_model`](https://lamin.ai/docs/bionty/bionty.lookup#bionty.lookup.feature_model)) based on Ensembl ids and ingest the data with it.

[bionty (`bt`)](https://lamin.ai/docs/bionty) is a data model generator for biology.

In [None]:
import bionty as bt

In [None]:
feature_model = bt.Gene(
    id=bt.lookup.gene_id.ensembl_gene_id, species=bt.lookup.species.mouse
)

The `feature_model` curates features against a reference, in this case a gene table configured in [`bionty.Gene`](https://lamin.ai/docs/bionty/bionty.gene#bionty.Gene).

Ingesting data with a `feature_model` makes it queryable by any field that the model holds for the same feature. For instance, here we ingest genes by their Ensembl ids, but [we can query them based on symbol, NCBI ids, etc.](https://lamin.ai/docs/db/tutorials/query-load#Query-data-objects-by-linked-entities)

In [None]:
ln.db.ingest.add(
    data,
    name="mouse_sc_lymph_node",
    feature_model=feature_model,
    featureset_name="mouse_1k",  # optional
)
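As a rough illustration of what curating features against a reference means, the sketch below checks observed feature ids against a toy gene table and then looks ids up through another field. This is not how bionty or LaminDB implement it: `curate`, `find_by`, and `REFERENCE` are hypothetical names, and every id and symbol in the table is a made-up placeholder rather than a real Ensembl record.

```python
# Toy reference table keyed by gene id. All ids, symbols, and NCBI ids below
# are made-up placeholders, not real Ensembl records.
REFERENCE = {
    "ENSMUSG_TOY_1": {"symbol": "GeneA", "ncbi_id": "1001"},
    "ENSMUSG_TOY_2": {"symbol": "GeneB", "ncbi_id": "1002"},
    "ENSMUSG_TOY_3": {"symbol": "GeneC", "ncbi_id": "1003"},
}


def curate(feature_ids, reference):
    """Split feature ids into those found in the reference and those not."""
    mapped = [f for f in feature_ids if f in reference]
    unmapped = [f for f in feature_ids if f not in reference]
    return {
        "n_features": len(feature_ids),
        "mapped": mapped,
        "unmapped": unmapped,
        "frac_mapped": len(mapped) / len(feature_ids) if feature_ids else 0.0,
    }


def find_by(field, value, reference):
    """Look up gene ids through any field of the reference, e.g. the symbol."""
    return [gid for gid, record in reference.items() if record.get(field) == value]


observed = ["ENSMUSG_TOY_1", "ENSMUSG_TOY_3", "ENSMUSG_TOY_9"]
report = curate(observed, REFERENCE)
print(report["unmapped"])
print(find_by("symbol", "GeneC", REFERENCE))
```

The report plays the role of an integrity summary: it records how many features matched the reference and which ones did not.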

We can retrieve the integrity information through `.logs`:

In [None]:
ln.db.ingest.logs

In [None]:
ln.db.ingest.status

## Ingest from pipeline runs

We've now seen how individual datasets can be ingested. Let's move on to ingesting datasets generated by a pipeline run.

Here, we ingest a set of bioinformatics output files generated by [Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger).

In [None]:
bfx_run_output = ln.datasets.scrnaseq_cellranger()

In [None]:
bfx_run_output

[`lnbfx`](https://lamin.ai/docs/lnbfx) is an open-source package that manages bioinformatics pipelines.

In [None]:
import lnbfx

bfx_run = lnbfx.BfxRun(pipeline_name="scrnaseq-cellranger")

In [None]:
ln.db.ingest.add(bfx_run_output, pipeline_run=bfx_run)
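Conceptually, ingesting a pipeline output means staging every file in the output directory and tying all of them to the same run. The sketch below shows that grouping with the standard library only; `stage_run_outputs`, the random run id, and the two file names are assumptions for illustration and do not reflect what `lnbfx` or `ln.db.ingest` do internally.

```python
import tempfile
import uuid
from pathlib import Path


def stage_run_outputs(output_dir, pipeline_name):
    """Collect every file below output_dir and tag it with one shared run id."""
    output_dir = Path(output_dir)
    run_id = uuid.uuid4().hex[:8]  # hypothetical id scheme
    files = sorted(p for p in output_dir.rglob("*") if p.is_file())
    return [
        {
            "file": p.relative_to(output_dir).as_posix(),
            "pipeline": pipeline_name,
            "run_id": run_id,
        }
        for p in files
    ]


# Build a throwaway directory with placeholder files to stand in for a run output.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "outs").mkdir()
    (root / "outs" / "metrics_summary.csv").write_text("placeholder\n")
    (root / "outs" / "web_summary.html").write_text("placeholder\n")
    staged = stage_run_outputs(root, "scrnaseq-cellranger")

for row in staged:
    print(row)
```

Sharing one run id across all rows is what later allows every output file to be traced back to the run that produced it.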

## Complete ingestion

Before completing the ingestion, let's check what we staged:

In [None]:
ln.db.ingest.status

Let's now commit these data to LaminDB:

In [None]:
ln.db.ingest.commit()

We see that several links are made in the background: the data object is associated with its source (this Jupyter notebook, `jupynb`) and the user who operates the notebook (`test-user1`).

`ln.db.ingest` detects whether data comes from a notebook, a pipeline, a connector, or a custom graphical user interface.
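The links created on commit can be pictured as records that point from each data object to its source and its user. The following is only a sketch under assumed names: `Source`, `DObjectRecord`, `commit`, and the notebook name `ingest` are hypothetical and do not mirror the actual LaminDB schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Where data came from, e.g. kind "jupynb" for a notebook."""

    kind: str
    name: str


@dataclass(frozen=True)
class DObjectRecord:
    """A committed data object linked to its source and its user."""

    name: str
    source: Source
    user: str


def commit(staged_names, source, user):
    """Turn staged names into records linked to one source and one user."""
    return [DObjectRecord(name=n, source=source, user=user) for n in staged_names]


notebook = Source(kind="jupynb", name="ingest")  # notebook name is a placeholder
records = commit(["iris", "mouse_sc_lymph_node"], notebook, user="test-user1")
for r in records:
    print(r.name, "<-", r.source.kind, "by", r.user)
```

Keeping the source and user on every record is what makes questions like "which notebook produced this object?" answerable later.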

What is a data object (dobject) in more detail? See the API docs [here](https://lamin.ai/docs/lnschema-core/lnschema_core.dobject) or read on!