# Get started

LaminDB is a distributed data management system, much as git is a distributed version control system.

Just like you work with repositories in git, you work with <ins>instances</ins> in LaminDB.
However, unlike git (and dvc), LaminDB is queryable by metadata.

An instance is a data warehouse with storage (local directory, S3, GCP, Azure) and a SQL database (SQLite, Postgres, BigQuery) for querying it.
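To make the pairing of storage and a SQL database concrete, here is a minimal sketch using only the Python standard library. It is not LaminDB's implementation or schema: the table `toy_file`, the file `toy_metadata.sqlite`, and all function names are hypothetical and exist only to illustrate the idea that files live in storage while their metadata lives in a queryable database.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_toy_instance(root: Path) -> sqlite3.Connection:
    """Create a storage directory plus a SQLite file that indexes it (illustration only)."""
    root.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(root / "toy_metadata.sqlite"))
    # Hypothetical table: one row of metadata per stored file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS toy_file (name TEXT PRIMARY KEY, creator TEXT)"
    )
    return conn


def add_file(conn: sqlite3.Connection, root: Path, name: str, payload: bytes, creator: str) -> None:
    """Write the payload to storage and record its metadata in the database."""
    (root / name).write_bytes(payload)  # the data itself lives in storage
    conn.execute("INSERT INTO toy_file VALUES (?, ?)", (name, creator))  # metadata lives in SQL
    conn.commit()


def files_by_creator(conn: sqlite3.Connection, creator: str) -> list:
    """Query the metadata, not the files, to find what a given creator stored."""
    rows = conn.execute(
        "SELECT name FROM toy_file WHERE creator = ? ORDER BY name", (creator,)
    )
    return [name for (name,) in rows]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "toy_storage"
    conn = create_toy_instance(root)
    add_file(conn, root, "a.bin", b"\x00", "alice")
    add_file(conn, root, "b.bin", b"\x01", "bob")
    result = files_by_creator(conn, "alice")
    print(result)
    conn.close()
```

The point of the split is that a query touches only the small database, never the stored files themselves.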

## Sign up

As a first-time user, sign up with your email so that LaminDB can link you to data & analyses.

On the command line, run `lndb signup <email> <handle>`. For example: `lndb signup raspbear@gmx.de test-user1`.[^github]

[^github]: Consider using your GitHub-associated email if you have one!

## Log in

After confirming the signup email, you can log in with your handle (or email) on the command line:

In [1]:
!lndb login test-user1

## Initialize and configure an instance

For a simple demo project, let us configure a local instance with storage in `mydata/` and a local SQLite database for managing it.

You can also pass `s3://my-bucket` directly to `--storage` or a Postgres URL to `--db`.

In [2]:
!lndb init --storage mydata --schema bionty,wetlab  # a generic biology schema module based on bionty and wetlab

ℹ️ Using instance: mydata/mydata.lndb


The instance settings will persist in `~/.lndb/instance-mydata.env`.
All instance data is in `mydata`, and all metadata in the SQLite file `mydata.lndb`.
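Because the metadata lives in an ordinary SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below lists the tables in that file; it assumes the path `mydata/mydata.lndb` reported by `lndb init` above, and `list_tables` is a helper defined here, not part of LaminDB. Which tables appear depends on the schema modules you chose, so no output is shown.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [name for (name,) in rows]
    finally:
        conn.close()


# Assumed location of the instance database, taken from the `lndb init` output.
db_path = Path("mydata") / "mydata.lndb"

# sqlite3.connect() would silently create a missing file, so check first.
if db_path.exists():
    print(list_tables(db_path))
else:
    print(f"{db_path} not found; run `lndb init` first")
```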

## Ingest data

In [3]:
import lamindb as db
import sklearn.datasets

db.header()  # this is nbproject.header()

2022-08-01 13:21:41,766:INFO - Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-08-01 13:21:41,767:INFO - NumExpr defaulting to 8 threads.


| | |
|---|---|
| id | GgD4VJbXtOOS |
| version | draft |
| time_init | 2022-06-23 14:16 |
| time_run | 2022-08-01 05:21 |
| pypackage | lamindb==0.1.2 scikit-learn==1.1.1 |


To demonstrate ingesting data that is queryable only by provenance, let us choose a dataset that has little semantic meaning in the context of modern biology.

The `iris` dataset stores phenotypes of flowers in the form of [sepal & petal sizes](https://en.wikipedia.org/wiki/Iris_flower_data_set), which we do not aim to query in this tutorial.

In [4]:
df = sklearn.datasets.load_iris(as_frame=True).frame
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [5]:
db.do.ingest.add(df, name="iris")

Check the to-be-ingested list with assigned dobject ids and versions (here, version '1' of this data object):

In [6]:
db.do.ingest.status

{'iris.feather': ('jBs8hyAxxsumljmLiHPex', '1')}
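Judging from the output above, the status maps each staged file name to an `(id, version)` tuple. Under that assumption, a small helper can print one readable line per staged file. The dictionary below is copied from the output rather than read from a live session, and `describe_staged` is a hypothetical helper, not part of LaminDB.

```python
# Copied from the output above; in a live session this would be `db.do.ingest.status`.
status = {"iris.feather": ("jBs8hyAxxsumljmLiHPex", "1")}


def describe_staged(status):
    """Format one line per staged file from a {filename: (id, version)} mapping."""
    return [
        f"{filename}: id={dobject_id}, version={version}"
        for filename, (dobject_id, version) in status.items()
    ]


lines = describe_staged(status)
for line in lines:
    print(line)
```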

Complete the ingestion with:

In [7]:
db.do.ingest.commit()

ℹ️ Added notebook 'Get started' (GgD4VJbXtOOS, 1) by user raspbear@gmx.de (9ypQ1yrW).
✅ Ingested the following dobjects:
+-----------------------------------------+---------------------------------+----------------------------+
|                 dobject                 |             jupynb              |            user            |
+-----------------------------------------+---------------------------------+----------------------------+
| iris.feather (jBs8hyAxxsumljmLiHPex, 1) | 'Get started' (GgD4VJbXtOOS, 1) | raspbear@gmx.de (9ypQ1yrW) |
+-----------------------------------------+---------------------------------+----------------------------+


RuntimeError: Make sure you save the notebook in your editor before publishing!
You can avoid the need for manually saving in Jupyter Lab, which auto-saves the buffer during publish.

What is a [dobject](https://lamin.ai/docs/lndb-schema-core/lndb_schema_core.dobject)?