# Track data objects

We're now ready to track & query data!

```{tip}

In Jupyter Notebook and JupyterLab, you can view the documentation for a Python function by pressing `Shift` + `Tab`. Press it twice to expand the view.
```

In [None]:
import lamindb as ln

ln.nb.header()

```{note}

The call to `ln.nb.header()` tracks the notebook run as a data source.

Learn more: {doc}`/guide/run`.
```

## Track local files - data objects on disk

Let's add an image file ([Paradisi05](https://bmcmolcellbiol.biomedcentral.com/articles/10.1186/1471-2121-6-27)):

<img width="150" alt="Laminopathic nuclei" src="https://upload.wikimedia.org/wikipedia/commons/2/28/Laminopathic_nuclei.jpg">

In [None]:
filepath = ln.dev.datasets.file_jpg_paradisi05()
filepath

We'll work with a single class for data objects, whether in memory or on disk: {class}`~lamindb.DObject`. On disk, these are often files, but not always (a `zarr` array, for example, is typically stored as a directory). Instantiating `DObject` creates a `dobject` record:

In [None]:
dobject = ln.DObject(filepath)

The `dobject` record captures metadata about the file and will be our way to query and load data.

In [None]:
dobject
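
To make the idea of a metadata record concrete, here is a minimal, library-independent sketch of the kind of information such a record could hold for a file. The `FileRecord` class, its fields, and the `describe_file` helper are hypothetical names chosen for illustration; they are not `DObject`'s actual schema or implementation.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata record for a file (not lamindb's schema)."""

    name: str      # file name without the suffix
    suffix: str    # file extension, including the leading dot
    size: int      # size in bytes
    checksum: str  # SHA-256 hex digest of the file content


def describe_file(path: Path) -> FileRecord:
    """Collect basic metadata about a file on disk."""
    content = path.read_bytes()
    return FileRecord(
        name=path.stem,
        suffix=path.suffix,
        size=len(content),
        checksum=hashlib.sha256(content).hexdigest(),
    )


# Demonstrate on a small temporary file standing in for an image.
with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / "example.jpg"
    example_path.write_bytes(b"not a real image")
    record = describe_file(example_path)

print(record.name, record.suffix, record.size)
```

A record like this is cheap to store in a database and can be queried without touching the file itself, which is the role the `dobject` record plays in this guide.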

We can also access linked metadata records, for instance, the record that stores metadata about this run.

In [None]:
dobject.source

Because we're ingesting from a notebook, the source defaults to the notebook run printed above.

In [None]:
assert ln.nb.run == dobject.source

To add the metadata to the database and the data to storage, we can do both in a single transaction:

In [None]:
ln.add(dobject)
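
What "a single transaction" means for metadata plus data can be sketched with the standard library alone. The following is an illustration under stated assumptions, not how `ln.add` is implemented: the `add_file` helper, the `records` table, and the example IDs are all hypothetical. The point is that the database row and the stored file either both persist or both get undone.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def add_file(conn: sqlite3.Connection, storage: Path, record_id: str, source: Path) -> Path:
    """Store a file and its metadata row together, or neither (hypothetical helper)."""
    target = storage / f"{record_id}{source.suffix}"
    if target.exists():
        # Refuse to overwrite, so that cleanup below can never delete older data.
        raise FileExistsError(target)
    try:
        # `with conn` commits on success and rolls back the INSERT on any exception.
        with conn:
            conn.execute(
                "INSERT INTO records (id, name, suffix) VALUES (?, ?, ?)",
                (record_id, source.stem, source.suffix),
            )
            shutil.copyfile(source, target)
    except Exception:
        # Undo the storage side as well, so no orphaned file remains.
        target.unlink(missing_ok=True)
        raise
    return target


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    storage = tmp_path / "storage"
    storage.mkdir()
    source = tmp_path / "image.jpg"
    source.write_bytes(b"pixels")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, name TEXT, suffix TEXT)")

    stored = add_file(conn, storage, "a1B2c3", source)
    stored_name = stored.name
    n_records = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]

    # A failing copy (missing source file) leaves neither a row nor a file behind.
    try:
        add_file(conn, storage, "d4E5f6", tmp_path / "missing.jpg")
    except FileNotFoundError:
        pass
    n_after_failure = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    files_after_failure = sorted(p.name for p in storage.iterdir())
    conn.close()

print(stored_name, n_records, n_after_failure, files_after_failure)
```

Treating both writes as one unit avoids records that point to missing data and stored files that no record describes.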

To get the data back, call `.load()`. Here, it returns a cryptic file name; more on this below.

In [None]:
dobject.load()

## Track data objects in memory

In [None]:
import sklearn.datasets

Let's now ingest an in-memory `DataFrame` storing the iris dataset:

In [None]:
df = sklearn.datasets.load_iris(as_frame=True).frame

df.head()

In [None]:
df.shape

When ingesting in-memory objects, a `name` argument needs to be passed:

In [None]:
dobject = ln.DObject(df, name="iris")

In [None]:
ln.add(dobject)

Get the dataframe back:

In [None]:
dobject.load()

## The story of `DObject`

We have come to love the pydata family of `DataFrame`, `AnnData`, `torch.utils.data.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, and others for accessing lower-level data objects.

But we couldn’t find an object for accessing how data objects are linked to context.
So, we made `DObject` to help with modeling and understanding data objects in relation to their context.

Context can be other data objects, data transformations, ML models, the people and pipelines that performed transformations, and all other aspects of data lineage.
Context can also be hypotheses and any entity of the domain in which data is generated and modeled.

Depending on how `DObject`s are linked to context, they give rise to features of data lakes, warehouses and knowledge graphs.

We focused on linking `DObject` to biological concepts: entities, their types, records, transformations, and relations.
You'll learn about them further down the guide: {doc}`knowledge`.


```{note}

Learn more: {doc}`dobject`.
```

## What's in the DB now?

In [None]:
ln.view()

## And what's in storage?

Two cryptically named files:

In [None]:
!ls mytest

If you prefer semantic names, track existing data in place rather than ingesting it into a storage location: {doc}`/faq/existing-data`.

Naming data objects in storage by the primary key ID of their `DObject` is typically preferable when names might clash at large scale or when working with in-memory views.
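
The clash-avoidance argument can be illustrated with a short, library-independent sketch. The ID scheme here (20 random base62 characters) and the helpers `new_id` and `storage_key` are assumptions made for illustration and are not lamindb's actual ID generation. Two datasets that share the human-readable name `results` still get distinct storage keys, while the name lives on in the metadata index.

```python
import secrets
import string
import tempfile
from pathlib import Path

ALPHABET = string.ascii_letters + string.digits  # base62


def new_id(n_chars: int = 20) -> str:
    """Generate a random base62 ID (hypothetical scheme, not lamindb's)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(n_chars))


def storage_key(record_id: str, suffix: str) -> str:
    """Name the stored object by its ID, keeping only the file suffix."""
    return f"{record_id}{suffix}"


# Two different datasets that users happened to give the same name.
names = ["results", "results"]

with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)
    index = {}  # maps ID -> human-readable name, standing in for the database
    for i, name in enumerate(names):
        record_id = new_id()
        key = storage_key(record_id, ".csv")
        (storage / key).write_text(f"dataset {i}\n")
        index[record_id] = name
    stored_files = sorted(p.name for p in storage.iterdir())

# Both files survive, although they share the same semantic name.
print(len(stored_files), list(index.values()))
```

With semantic file names, the second `results.csv` would have overwritten the first or required a manual renaming convention.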

## Can I track a whole folder?

Yes: {doc}`/faq/ingest-folder`.