# Sample Submission

The sample submission Python API is exposed through a single function: `parse_and_insert_df`.

`parse_and_insert_df` is ACID-compliant and does not allow for duplicate inserts.

In [None]:
import pandas as pd
import lamindb as ln
from lnschema_wetlab.dev import parse_and_insert_df

In [None]:
!lndb init --storage "testsample" --schema "wetlab,bionty,bfx"

Let's instantiate a dataframe with test sample data.

In [None]:
biosample = pd.DataFrame(
    {
        "Name": [
            "dm3_pre_activation",
            "d0_post_activation",
            "d1_GFP",
            "d1_OSKMNL_arm1",
            "d1_OSKMNL_arm_2",
        ],
        "Experiment": ["x80", "x80", "x80", "x80", "x80"],
        "Condition": [
            "pre-activation",
            "post-activation",
            "GFP",
            "OSKMNL_x1",
            "OSKMNL_x3",
        ],
        "Transfection": [None, None, "GFP-LNP", "OSKMNL-LNP", "OSKMNL-LNP"],
        "Day": [-3, 0, 1, 1, 1],
        "Donor": ["Donor 5", "Donor 5", "Donor 5", "Donor 5", "Donor 5"],
        "Species": ["human", "human", "human", "human", "human"],
        "CMO": [None, None, 301, 302, 303],
        "Gene Expression File": ["Gm3", "G0", "G1", "G1", "G1"],
        "CMO File": [None, None, "T1", "T1", "T1"],
        "CSP File": [None, None, None, None, None],
    }
)

techsample = pd.DataFrame(
    {
        "Sample Name": ["Gm3", "G0", "G1", "G2", "G3"],
        "Sample ID": ["S1", "S2", "S3", "S4", "S5"],
        "Batch": [1, 1, 1, 1, 1],
        "File Type": ["fastq", "fastq", "fastq", "fastq", "fastq"],
        "Sample Type": [
            "Gene Expression",
            "Gene Expression",
            "Gene Expression",
            "Gene Expression",
            "Gene Expression",
        ],
        "% Total Read Allocation": ["3.28%", "4.69%", "13.14%", "13.14%", "13.14%"],
        "Filepath R1": [
            "Gm3_S1_L003_R1_001.fastq.gz",
            "G0_S2_L003_R1_001.fastq.gz",
            "G1_S3_L003_R1_001.fastq.gz",
            "G2_S4_L003_R1_001.fastq.gz",
            "G3_S5_L003_R1_001.fastq.gz",
        ],
        "Filepath R2": [
            "Gm3_S1_L003_R2_001.fastq.gz",
            "G0_S2_L003_R2_001.fastq.gz",
            "G1_S3_L003_R2_001.fastq.gz",
            "G2_S4_L003_R2_001.fastq.gz",
            "G3_S5_L003_R2_001.fastq.gz",
        ],
    }
)

Let's now use `parse_and_insert_df` to process each dataframe and insert the relevant entries into their respective tables.

`parse_and_insert_df` takes two arguments:
* A `pandas.DataFrame` with the sample data.
* A string with the name of the target LaminDB table (e.g. `"biosample"`, `"techsample"`).

`parse_and_insert_df` currently makes three matching assumptions:
* **Table matching (input string to the primary (sample) LaminDB table)**: exact, case-insensitive matching.
    * Custom schemas take precedence over core schemas (core, wetlab, bionty, bfx). E.g., `retro.Biosample` takes precedence over `wetlab.Biosample`.
* **Column matching (input DataFrame columns to secondary LaminDB tables)**: exact, case-insensitive matching.
    * Insertion assumption: the insertion column on the target table is the first column whose name contains the substring "name".
* **Column matching (input DataFrame columns to primary (sample) LaminDB table columns)**: exact, case-insensitive matching.
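The matching rules above can be sketched in plain Python. This is only an illustration of the stated assumptions, not the actual implementation inside `parse_and_insert_df`; the helper names (`match_table`, `match_column`, `first_name_column`) and the `(schema, table)` tuple layout are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Core schemas as listed in the text; any other schema is treated as custom.
CORE_SCHEMAS = ("core", "wetlab", "bionty", "bfx")


def match_table(
    name: str, tables: Sequence[Tuple[str, str]]
) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: exact, case-insensitive table match.

    `tables` is a sequence of (schema, table) pairs. When several schemas
    define the same table, a custom schema is preferred over a core one.
    """
    hits: List[Tuple[str, str]] = [
        (schema, table) for schema, table in tables if table.lower() == name.lower()
    ]
    # False sorts before True, so tables from non-core (custom) schemas come first.
    hits.sort(key=lambda hit: hit[0] in CORE_SCHEMAS)
    return hits[0] if hits else None


def match_column(name: str, columns: Sequence[str]) -> Optional[str]:
    """Hypothetical helper: exact, case-insensitive column match."""
    for column in columns:
        if column.lower() == name.lower():
            return column
    return None


def first_name_column(columns: Sequence[str]) -> Optional[str]:
    """Hypothetical helper: first column whose name contains "name"."""
    for column in columns:
        if "name" in column.lower():
            return column
    return None


# Made-up table and column listings, for illustration only.
tables = [("wetlab", "Biosample"), ("retro", "Biosample"), ("wetlab", "Techsample")]
print(match_table("biosample", tables))
print(match_column("donor", ["Donor", "Species"]))
print(first_name_column(["id", "common_name", "name"]))
```

Under these assumptions, `"biosample"` resolves to the `retro` table because it belongs to a custom schema, and the insertion column for a table with columns `id`, `common_name`, `name` would be `common_name`, since it is the first one containing "name".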

In the future, we will be able to relax some of these assumptions by building an interactive sample submission UI, where the user will be able to customize mappings.

In [None]:
res_biosample = parse_and_insert_df(biosample, "biosample")
res_biosample

In [None]:
res_techsample = parse_and_insert_df(techsample, "techsample")
res_techsample

`parse_and_insert_df` returns a dictionary with two items: column mappings and table entries.

**Column mappings** is a dictionary with all mappings between the DataFrame columns and their respective LaminDB tables and table columns.
* Key (str): DataFrame column name
* Value (tuple): (table name, table column)

**Table entries** is a dictionary with all added entries.
* Key (str): table name
* Value (list): list with all added LaminDB records
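As a rough sketch of how these two dictionaries could be inspected, the helper below turns them into readable summary lines. The function name `summarize_result` is hypothetical, the example values are made up, and the keys under which the two dictionaries sit in the returned result are not stated here, so the sketch takes them as separate arguments.

```python
from typing import Dict, List, Tuple


def summarize_result(
    column_mappings: Dict[str, Tuple[str, str]],
    table_entries: Dict[str, list],
) -> List[str]:
    """Hypothetical helper: describe mappings and inserted records as text."""
    lines = []
    # One line per DataFrame column: where it was mapped to.
    for df_column, (table, column) in column_mappings.items():
        lines.append(f"{df_column!r} -> {table}.{column}")
    # One line per table: how many records were added.
    for table, records in table_entries.items():
        lines.append(f"{table}: {len(records)} record(s) added")
    return lines


# Made-up stand-ins for the two dictionaries, following the structure described above.
example_mappings = {
    "Donor": ("donor", "name"),
    "Species": ("species", "common_name"),
}
example_entries = {"donor": ["Donor 5"], "species": ["human"]}

for line in summarize_result(example_mappings, example_entries):
    print(line)
```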

We can now check that the entries have been correctly inserted into the database.

In [None]:
ln.view()