# Get started

LaminDB is a distributed data management system in which users collaborate on instances.

This is analogous to how developers collaborate on code in repositories but, unlike Git and DVC, LaminDB is queryable by entities.[^integrate]

An instance manages storage (local directory, S3, GCP, Azure) and a SQL database (SQLite, Postgres, BigQuery) for querying it.

[^integrate]: Just as Git helps integrate code contributions across repositories, LaminDB helps integrate data across instances.

## Sign up

As a first-time user, sign up with your email so that LaminDB can link you to data & analyses.

On the command line, run `lndb signup <email>`. For example: `lndb signup tuser1@foo.com`.[^github]

[^github]: Consider using the email associated with your GitHub account and your GitHub username as your handle, if you have one!

## Log in

After confirming the signup email, you can log in with your handle (or email) on the command line:

In [1]:
!lndb login test-user1  # test user 1 has handle test-user1

## Init

For this first tutorial, we initialize a local instance with storage in `mydata/` and a local SQLite database for managing it.

You can also directly pass `s3://my-bucket` to `--storage` or a Postgres URL to `--db`.

In [2]:
!lndb init --storage mydata --schema bionty,wetlab,bfx  # default bio entity and wetlab schema modules

ℹ️ Loading schema modules: core, bionty, wetlab, bfx.
ℹ️ Created instance mydata with core schema v0.4.0: /Users/sunnysun/Documents/repos/lamindb/docs/tutorials/mydata/mydata.lndb


In this local setup, all instance data is in `mydata/` and all metadata in the SQLite file `mydata/mydata.lndb`.
Settings persist in `~/.lndb/*.env` and can be accessed via [`db.settings`](https://lamin.ai/docs/lndb-setup/lndb_setup.settings).
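
Because the metadata of this local setup lives in a plain SQLite file, it can be inspected with Python's standard `sqlite3` module. The following is a minimal sketch and not part of the LaminDB API: `list_tables` is a hypothetical helper, and the table names it prints depend on the schema modules that were loaded.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    # Open read-only so that inspecting the file can never modify the instance.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


db_file = Path("mydata/mydata.lndb")  # location of the SQLite file in this tutorial
if db_file.exists():
    print(list_tables(db_file))
```

Opening the file in read-only mode is a deliberate choice here: the instance should only be written to through LaminDB itself.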

## Ingest

In [3]:
import lamindb as db
import sklearn.datasets

db.header()  # re-exports nbproject.header, https://lamin.ai/docs/nbproject



| | |
|---|---|
| id | GgD4VJbXtOOS |
| version | draft |
| time_init | 2022-06-23 14:16 |
| time_run | 2022-08-23 13:21 |
| pypackage | lamindb==0.2.1 scikit-learn==1.1.1 |


The `iris` dataset stores phenotypic measurements ([sepal & petal sizes](https://en.wikipedia.org/wiki/Iris_flower_data_set)), which we do **not** aim to query in this tutorial.

In [4]:
df = sklearn.datasets.load_iris(as_frame=True).frame

df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


To track this dataset, we stage it for ingestion.

In [5]:
db.do.ingest.add(df, name="iris")

Let's take another toy dataset, a processed collage of microscopy images:

<img width="150" alt="Laminopathic nuclei" src="https://upload.wikimedia.org/wikipedia/commons/2/28/Laminopathic_nuclei.jpg">

The image shows the morphology of fibroblasts from a control (a, b) and from a subject with [Progeria](https://en.wikipedia.org/wiki/Progeria) (c, d), whose nuclear envelopes are irregularly shaped ([Paradisi et al., 2005](https://doi.org/10.1186/1471-2121-6-27), CC BY 2.0 via [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Laminopathic_nuclei.jpg)).

In [6]:
filepath = db.datasets.file_jpg_paradisi05()
filepath


In [None]:
db.do.ingest.add(filepath)

Before completing the ingestion, let's check what we staged:

In [None]:
db.do.ingest.status

{'iris.feather': ('kfOLcWt59r0qzU7scjHiC', '1'),
 'paradisi05_laminopathic_nuclei.jpg': ('neGWUdqGRZCv34aR8SNiZ', '1')}
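
The printed status reads as a mapping from file name to a pair of strings. The sketch below assumes that each pair holds the dobject id and its version, which is consistent with the ids and versions reported when the data are committed. The dictionary is copied by hand from the output above rather than read from `db.do.ingest.status`, and `summarize` is a hypothetical helper.

```python
# Copied from the status output above.
# Assumed layout: file name -> (dobject id, version).
staged = {
    "iris.feather": ("kfOLcWt59r0qzU7scjHiC", "1"),
    "paradisi05_laminopathic_nuclei.jpg": ("neGWUdqGRZCv34aR8SNiZ", "1"),
}


def summarize(staged):
    """Format one line per staged file, sorted by file name."""
    return [
        f"{name}: id={dobject_id}, version={version}"
        for name, (dobject_id, version) in sorted(staged.items())
    ]


for line in summarize(staged):
    print(line)
```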

Let's now commit these data to LaminDB:

In [None]:
db.do.ingest.commit()

ℹ️ Added notebook 'Get started' (GgD4VJbXtOOS, 1) by user test-user1 (9ypQ1yrW).


✅ Ingested the following dobjects:
+---------------------------------------------------------------+---------------------------------+-----------------------+
|                            dobject                            |             jupynb              |         user          |
+---------------------------------------------------------------+---------------------------------+-----------------------+
|            iris.feather (kfOLcWt59r0qzU7scjHiC, 1)            | 'Get started' (GgD4VJbXtOOS, 1) | test-user1 (9ypQ1yrW) |
| paradisi05_laminopathic_nuclei.jpg (neGWUdqGRZCv34aR8SNiZ, 1) | 'Get started' (GgD4VJbXtOOS, 1) | test-user1 (9ypQ1yrW) |
+---------------------------------------------------------------+---------------------------------+-----------------------+


🔶 Cells [(6, None), (None, None)] were not run consecutively.


ℹ️ Set notebook version to 1 & wrote pypackages.


We see that several links are made in the background: each data object is linked to its source (this Jupyter notebook, `jupynb`) and to the user who ran the notebook (`test-user1`).
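
To make the idea of linked entities concrete, here is a minimal sketch that stores the same links in an in-memory SQLite database and queries them. The three tables and their columns are hypothetical simplifications chosen for illustration; they are not the actual core schema. Only the ids, names, and handle are taken from the ingestion output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified tables; the real core schema differs.
conn.executescript(
    """
    CREATE TABLE user (id TEXT PRIMARY KEY, handle TEXT);
    CREATE TABLE jupynb (id TEXT PRIMARY KEY, v TEXT, name TEXT, user_id TEXT);
    CREATE TABLE dobject (id TEXT PRIMARY KEY, v TEXT, name TEXT, jupynb_id TEXT);
    """
)

# Values taken from the ingestion output above.
conn.execute("INSERT INTO user VALUES ('9ypQ1yrW', 'test-user1')")
conn.execute("INSERT INTO jupynb VALUES ('GgD4VJbXtOOS', '1', 'Get started', '9ypQ1yrW')")
conn.executemany(
    "INSERT INTO dobject VALUES (?, ?, ?, ?)",
    [
        ("kfOLcWt59r0qzU7scjHiC", "1", "iris.feather", "GgD4VJbXtOOS"),
        ("neGWUdqGRZCv34aR8SNiZ", "1", "paradisi05_laminopathic_nuclei.jpg", "GgD4VJbXtOOS"),
    ],
)

# Query by entity: which data objects came from notebooks run by a given user?
rows = conn.execute(
    """
    SELECT dobject.name, jupynb.name, user.handle
    FROM dobject
    JOIN jupynb ON dobject.jupynb_id = jupynb.id
    JOIN user ON jupynb.user_id = user.id
    WHERE user.handle = ?
    ORDER BY dobject.name
    """,
    ("test-user1",),
).fetchall()

for row in rows:
    print(row)
```

Following the links from data object to notebook to user is what allows a query to start from any of the three entities.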

`db.do.ingest` detects whether data comes from a notebook, a pipeline, a connector, or a custom graphical user interface.

What is a data object (dobject) in more detail? See the API docs [here](https://lamin.ai/docs/lnschema-core/lnschema_core.dobject) or read on!