# DSI Tutorial and Getting Started

The goal of the Data Science Infrastructure Project ([DSI](https://github.com/lanl/dsi)) is to provide a flexible, AI-ready metadata query capability which returns data subject to strict, POSIX-enforced file security. In this tutorial, you will learn how to:
 - initialize a DSI instance
 - load data into DSI
 - check the data loaded
 - query the data
 - create new data and save it to DSI
 - load complex schemas
 - use DSI writers

This tutorial uses data from the [Cloverleaf3D](https://github.com/UK-MAC/CloverLeaf3D) Lagrangian-Eulerian hydrodynamics solver. Archived data is provided in dsi/examples/clover3d. Before running the tutorial, follow the instructions in [Quick Start: Installation](https://lanl.github.io/dsi/installation.html) to set up DSI, then extract clover3d.zip.



In [None]:
from dsi.dsi import DSI

In [None]:
# Create instance of DSI
baseline = DSI()

# Available features

To see which backends, readers, and writers are available, you can call the functions below to list the features available in your installation.

In [None]:
# Lists available backends
baseline.list_backends()

In [None]:
# Lists available readers
baseline.list_readers()

In [None]:
# Lists available writers
baseline.list_writers()

# Reading Data into DSI

For this tutorial, we will use Cloverleaf3D data available in our repository at dsi/examples/clover3d/clover3d.zip.
Alternatively, you can download the data from this direct link: https://github.com/lanl/dsi/raw/refs/heads/main/examples/clover3d/clover3d.zip

* Use a Unix terminal or Windows PowerShell to download the data and extract it locally into the ./clover3d folder
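
If you would rather stay inside Python than switch to a terminal, the download and extraction step can be sketched with the standard library alone. The helper names `fetch_archive` and `extract_archive` are hypothetical and not part of DSI, and the sketch does not check whether the archive already contains a top-level folder, so you may need to adjust the destination directory.

```python
import urllib.request
import zipfile
from pathlib import Path

# Direct link given in this tutorial
DATA_URL = "https://github.com/lanl/dsi/raw/refs/heads/main/examples/clover3d/clover3d.zip"


def fetch_archive(url, zip_path):
    """Download the archive unless a local copy already exists (hypothetical helper)."""
    zip_path = Path(zip_path)
    if not zip_path.exists():
        urllib.request.urlretrieve(url, zip_path)
    return zip_path


def extract_archive(zip_path, dest_dir):
    """Extract every archive member into dest_dir and return the member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Calling `extract_archive(fetch_archive(DATA_URL, "clover3d.zip"), "./clover3d")` would then download the archive once and unpack it; listing the returned names is a quick way to confirm where the files ended up.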

The data is an ensemble of 8 runs, and has 4 metadata products of interest:

* clover.in - input deck
* clover.out - simulation statistics
* timestamps.txt - time when the simulation was launched on Slurm
* viz files - in situ outputs in VTK format

To begin the ingest:

In [None]:
# No backend is specified, so the target backend defaults to SQLite
store = DSI("dsi-tutorial.db")

# dsi.read(filename, reader)
store.read("./clover3d/", 'Cloverleaf')

# Exploring the loaded data

In [None]:
# How many tables do we have
store.num_tables()

In [None]:
# Let's see what tables were created
store.list()

In [None]:
# Let's get more details about the data
store.summary()

In [None]:
# Preview the contents of the visualization files
store.display("viz_files")

# DSI Find to search within the data

In [None]:
# Search for a string or value across all tables
store.find("Jun 2025")

In [None]:
# Perform a search and receive a collection
find_list = store.find("8.0", True) # Use True to return a collection

In [None]:
# Simply display what this collection (pandas dataframe) looks like
find_list

# Updating contents with DSI

DSI allows you to add to or modify the contents of a collection that was returned from
a find or a query operation when `True` is passed.

Example use case: we want to post-process the ingested data by appending additional information to our DSI database. Here, we convert the simulation date from text to numerical Unix time.

In [None]:
find_list = store.find("Jun 2025", True)

In [None]:
find_list

In [None]:
# Small amount of helper code to convert dates to unix time
from datetime import datetime
from zoneinfo import ZoneInfo
def str2unix(date_str):
    date_str_clean = date_str.rsplit(' ', 1)[0]  # remove 'MDT'
    dt_naive = datetime.strptime(date_str_clean, "%a %d %b %Y %I:%M:%S %p")
    # Set timezone
    dt_local = dt_naive.replace(tzinfo=ZoneInfo("America/Denver"))
    unix_time = int(dt_local.timestamp()) # Unix time in UTC
    return unix_time
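
To sanity-check the conversion before applying it to a whole column, it can help to run the helper on a single string and convert the result back to a readable time. The sample string below is hypothetical: it is only assumed to match the format string used in `str2unix`, and the real entries in your data may differ. The helper is repeated so the block runs on its own.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def str2unix(date_str):
    # Same logic as the tutorial helper, repeated so this block is self-contained
    date_str_clean = date_str.rsplit(' ', 1)[0]  # drop the trailing zone label
    dt_naive = datetime.strptime(date_str_clean, "%a %d %b %Y %I:%M:%S %p")
    dt_local = dt_naive.replace(tzinfo=ZoneInfo("America/Denver"))
    return int(dt_local.timestamp())


# Hypothetical sample, assumed to follow the format string above
sample = "Tue 03 Jun 2025 10:15:30 AM MDT"
unix_time = str2unix(sample)

# Convert back to UTC to confirm the six-hour daylight-time offset was applied
round_trip = datetime.fromtimestamp(unix_time, tz=timezone.utc)
print(unix_time, round_trip.isoformat())
```

Note that the helper discards the zone label and always interprets the time as America/Denver, so strings recorded in another time zone would need different handling.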

In [None]:
store.display("simulation") # display table before update

In [None]:
# Iterate through collection and append new metadata
for table in find_list:
    # Create a new column in the collection
    table["sim_unixtime"] = table["sim_datetime"].apply(str2unix)

#dsi.update(collection)
store.update(find_list) # update all tables in the list
#store.update(find_list[0]) # Optionally, update only first table in the list


In [None]:
# See the updated results
store.display("simulation")

# Query DSI

DSI supports direct SQL queries on the data that you have ingested.

In [None]:
# Use sql statement to directly query the backend store
store.query("SELECT sim_id, xmin, ymin, xmax, ymax, state2_density FROM input") # Adding 'True' gives a collection
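
Since the store in this tutorial defaults to SQLite, the effect of that SELECT can be illustrated with Python's built-in `sqlite3` module on a small stand-in table. This is only a sketch: the rows, the column types, and the extra `end_step` column are made up for illustration and are not taken from the Cloverleaf data or from DSI itself.

```python
import sqlite3

# In-memory stand-in for the "input" table; values and types are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input (sim_id INTEGER, xmin REAL, ymin REAL, "
    "xmax REAL, ymax REAL, state2_density REAL, end_step INTEGER)"
)
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 0.0, 0.0, 10.0, 10.0, 1.0, 100),
        (2, 0.0, 0.0, 10.0, 10.0, 8.0, 100),
    ],
)

# Same statement as in the tutorial: only the listed columns are returned
cursor = conn.execute(
    "SELECT sim_id, xmin, ymin, xmax, ymax, state2_density FROM input"
)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

# A WHERE clause narrows the result to matching runs
dense_runs = conn.execute(
    "SELECT sim_id, state2_density FROM input WHERE state2_density > 1.0"
).fetchall()
conn.close()

print(columns)
print(rows)
print(dense_runs)
```

The same kind of statement, including a WHERE clause, is what you would pass as the string argument to `store.query(...)`.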

In [None]:
# alternative to "query()" if you want to get a whole table
store.get_table("input") # Adding 'True' gives a collection

# DSI Write - Schemas

DSI can represent complex schemas. For example, to relate the different tables to one another, you can use the schema reader, which takes a .json file.

* schema.json

Before defining and ingesting a schema, what does an Entity Relationship Diagram look like in our initial ingest?

In [None]:
store.write("clover_er_diagram_no_schema.png", "ER_Diagram")

from IPython.display import Image
Image(filename="clover_er_diagram_no_schema.png", width=200)

In [None]:
# Create a new database where we will define a schema
schema_store = DSI("schema_tutorial.db")

# dsi.schema(filename)
schema_store.schema("./clover3d/schema.json") # Schema needs to be defined before reading Cloverleaf data

# dsi.read(filename, reader)
schema_store.read("./clover3d/", 'Cloverleaf') # read in Cloverleaf data

# dsi.write(filename, writer)
schema_store.write("clover_er_diagram.png", "ER_Diagram")

To preview the Entity Relationship Diagram (ER Diagram), import the library used to display images:

In [None]:
from IPython.display import Image
Image(filename="clover_er_diagram.png", width=300)

# DSI Write - CSV

DSI supports writing data out, for example if you would like to export it into another project.

In [None]:
store.write("input.csv", "Csv_Writer", "input")
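
To confirm what was exported, the file can be read back with the standard `csv` module. The sketch below assumes the written file starts with a header row of column names, which this tutorial does not state explicitly; `preview_csv` is a hypothetical helper, not a DSI function.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return the header names and the first `limit` rows of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)  # assumes the first row is a header
        rows = list(islice(reader, limit))
        return reader.fieldnames, rows
```

For example, `preview_csv("input.csv", limit=3)` would return the column names together with the first three exported rows as dictionaries.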

# DSI Write - Table plot

DSI has a built-in tool to assist in plotting tables.

In [None]:
store.write("output_table_plot.png", "Table_Plot", "output")

In [None]:
Image(filename="output_table_plot.png", width=400)

# Ending your workflow

In [None]:
store.close()
schema_store.close()
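
If a step in the middle of a workflow raises an exception, the `close()` calls above are never reached. One way to guard against that, sketched here with the standard library's `contextlib.closing`, is to wrap the store so that `close()` runs on exit either way. `DemoStore` is a hypothetical stand-in used only to make the block runnable; whether DSI instances provide their own context-manager support is not covered in this tutorial.

```python
from contextlib import closing


class DemoStore:
    """Hypothetical stand-in for an object that exposes close(), like a DSI instance."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


demo = DemoStore("dsi-tutorial.db")
try:
    # closing() calls demo.close() when the block exits, even on an error
    with closing(demo) as active:
        raise RuntimeError(f"simulated failure while using {active.name}")
except RuntimeError as error:
    print(error)

print(demo.closed)
```

In a real session the same pattern would be written as `with closing(DSI("dsi-tutorial.db")) as store:`, under the assumption that calling `close()` once at the end is all the cleanup the store needs.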