# DSI Getting started

In [4]:
from dsi.dsi import DSI

In [5]:
# Create instance of DSI
baseline = DSI()

Created an instance of DSI


# Available features

To see which backends, readers, and writers are available, call the listing functions below. They report the feature set available in your installation.

In [6]:
# Lists available backends
baseline.list_backends()


Valid Backends for `backend_name` in backend():
----------------------------------------
Sqlite : Lightweight, file-based SQL backend. Default backend used by DSI API.
DuckDB : In-process SQL backend optimized for fast analytics on large datasets.




In [7]:
# Lists available readers
baseline.list_readers()


Valid Readers for `reader_name` in read():
--------------------------------------------------
CSV                  : Loads data from CSV files (one table per call)
YAML1                : Loads data from YAML files of a certain structure
TOML1                : Loads data from TOML files of a certain structure
JSON                 : Loads single-table data from JSON files
Ensemble             : Loads a CSV file where each row is a simulation run; creates a simulation table
Cloverleaf           : Loads data from a directory with subfolders for each simulation run's input and output data
Bueno                : Loads performance data from Bueno (github.com/lanl/bueno) (.data text file format)
DublinCoreDatacard   : Loads dataset metadata adhering to the Dublin Core format (XML)
SchemaOrgDatacard    : Loads dataset metadata adhering to schema.org (JSON)
GoogleDatacard       : Loads dataset metadata adhering to the Google Data Cards Playbook (YAML)
Oceans11Datacard     : Loads dataset metada

In [8]:
# Lists available writers
baseline.list_writers()


Valid Writers for `writer_name` in write(): ['ER_Diagram', 'Table_Plot', 'Csv_Writer'] 

ER_Diagram  : Creates a visual ER diagram image based on all tables in DSI.
Table_Plot  : Generates a plot of numerical data from a specified table.
Csv_Writer  : Exports the data of a specified table to a CSV file.



# Reading Data into DSI

For this tutorial, we will use the Cloverleaf 3D data available in our repository at dsi/examples/clover3d/clover3d.zip.
Alternatively, you can download the data from this direct link: https://github.com/lanl/dsi/raw/refs/heads/main/examples/clover3d/clover3d.zip
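
If you prefer to fetch the archive from Python, the following sketch uses only the standard library. The helper names `extract_archive` and `fetch_clover3d` are hypothetical, and the idea that the archive unpacks into a `clover3d/` folder is an assumption based on the path used in the ingest step, not on inspecting the archive.

```python
import urllib.request
import zipfile
from pathlib import Path

# Direct link quoted in the tutorial text.
CLOVER3D_URL = "https://github.com/lanl/dsi/raw/refs/heads/main/examples/clover3d/clover3d.zip"


def extract_archive(zip_path, dest_dir="."):
    """Unpack zip_path into dest_dir and return the sorted top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Keep only the first path component of each archive member.
    return sorted({name.split("/", 1)[0] for name in names})


def fetch_clover3d(dest_dir=".", url=CLOVER3D_URL):
    """Download the archive (needs network access) and unpack it into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / "clover3d.zip"
    urllib.request.urlretrieve(url, zip_path)
    return extract_archive(zip_path, dest)
```

Calling `fetch_clover3d()` from the notebook's working directory returns the top-level names in the archive, which is a quick way to check that the folder you pass to the reader actually exists.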

The data is an ensemble of 8 runs and has 4 metadata products of interest:

* clover.in - input deck
* clover.out - simulation statistics
* timestamps.txt - time when the simulation was launched on Slurm
* viz files - in situ outputs in VTK format

To begin the ingest:

In [28]:
# No backend is specified, so DSI defaults to SQLite
store = DSI("dsi-tutorial.db")

# dsi.read(filename, reader)
store.read("./clover3d/", 'Cloverleaf')

Created an instance of DSI with the Sqlite backend, dsi-tutorial.db
Loaded ./clover3d/ into tables: input, output, simulation, viz_files


# Exploring the loaded data

In [29]:
# Let's see what tables were created
store.list()


Table: input
  - num of columns: 22
  - num of rows: 8

Table: output
  - num of columns: 10
  - num of rows: 720

Table: simulation
  - num of columns: 2
  - num of rows: 8

Table: viz_files
  - num of columns: 2
  - num of rows: 80
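
Because the default backend is a file-based SQLite database, the counts reported by `store.list()` can be cross-checked with Python's standard `sqlite3` module. The following is a minimal sketch, assuming DSI stores each table in `dsi-tutorial.db` under the same name shown above. `table_shapes` is a hypothetical helper, not part of DSI, and the file may contain extra bookkeeping tables that `store.list()` does not report.

```python
import sqlite3
from contextlib import closing


def table_shapes(db_path):
    """Return {table_name: (num_columns, num_rows)} for each table in a SQLite file."""
    shapes = {}
    with closing(sqlite3.connect(db_path)) as conn:
        # sqlite_master lists every table; skip SQLite's own internal tables.
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )]
        for name in tables:
            # Quote the identifier so unusual table names do not break the SQL.
            quoted = '"' + name.replace('"', '""') + '"'
            num_columns = len(conn.execute(f"PRAGMA table_info({quoted})").fetchall())
            num_rows = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
            shapes[name] = (num_columns, num_rows)
    return shapes
```

If the assumption holds, `table_shapes("dsi-tutorial.db")` should report the same column and row counts as the listing above.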




In [30]:
# Let's get more details about the data
store.summary()


Table: input

column           | type    | min  | max  | avg                 | std_dev               
---------------------------------------------------------------------------------------
sim_id           |         | None | None | None                | None                  
ymin             | FLOAT   | 0.0  | 0.0  | 0.0                 | 0.0                   
y_cells          | INTEGER | 60   | 60   | 60.0                | 0.0                   
test_problem     | INTEGER | 2    | 2    | 2.0                 | 0.0                   
xmin             | FLOAT   | 0.0  | 0.0  | 0.0                 | 0.0                   
max_timestep     | FLOAT   | 0.04 | 0.04 | 0.04                | 0.0                   
timestep_rise    | FLOAT   | 1.5  | 1.5  | 1.5                 | 0.0                   
state1_energy    | FLOAT   | 1.0  | 1.0  | 1.0                 | 0.0                   
state2_xmax      | FLOAT   | 5.0  | 5.0  | 5.0                 | 0.0                   
state2_xmin      

In [36]:
# Preview the contents of the visualization files
store.display("viz_files")


Table: viz_files

sim_id | image_filepath                    
-------------------------------------------
1      | run_1/clover.00000.00001.00000.vtk
1      | run_1/clover.00000.00001.00010.vtk
1      | run_1/clover.00000.00001.00020.vtk
1      | run_1/clover.00000.00001.00030.vtk
1      | run_1/clover.00000.00001.00040.vtk
1      | run_1/clover.00000.00001.00050.vtk
1      | run_1/clover.00000.00001.00060.vtk
1      | run_1/clover.00000.00001.00070.vtk
1      | run_1/clover.00000.00001.00080.vtk
1      | run_1/clover.00000.00001.00090.vtk
2      | run_2/clover.00000.00001.00000.vtk
2      | run_2/clover.00000.00001.00010.vtk
2      | run_2/clover.00000.00001.00020.vtk
2      | run_2/clover.00000.00001.00030.vtk
2      | run_2/clover.00000.00001.00040.vtk
2      | run_2/clover.00000.00001.00050.vtk
2      | run_2/clover.00000.00001.00060.vtk
2      | run_2/clover.00000.00001.00070.vtk
2      | run_2/clover.00000.00001.00080.vtk
2      | run_2/clover.00000.00001.00090.vtk
3      | run_

# Using DSI find to search within the data

In [37]:
# Search for a string or value across all tables
store.find("Jun 2025")

Finding all instances of 'Jun 2025' in the active DSI backend

Table: simulation
  - Columns: ['sim_id', 'sim_datetime']
  - Row Number: 1
  - Data: [1, 'Thu 05 Jun 2025 01:25:34 PM MDT']
Table: simulation
  - Columns: ['sim_id', 'sim_datetime']
  - Row Number: 2
  - Data: [2, 'Thu 05 Jun 2025 01:25:34 PM MDT']
Table: simulation
  - Columns: ['sim_id', 'sim_datetime']
  - Row Number: 3
  - Data: [3, 'Thu 05 Jun 2025 01:25:34 PM MDT']
Table: simulation
  - Columns: ['sim_id', 'sim_datetime']
  - Row Number: 4
  - Data: [4, 'Thu 05 Jun 2025 01:25:34 PM MDT']
Table: simulation
  - Columns: ['sim_id', 'sim_datetime']
  - Row Number: 5
  - Data: [5, 'Thu 05 Jun 2025 01:25:34 PM MDT']
Table: simulation
  - Columns: ['sim_id', 'sim_datetime']
  - Row Number: 6
  - Data: [6, 'Thu 05 Jun 2025 01:25:34 PM MDT']
Table: simulation
  - Columns: ['sim_id', 'sim_datetime']
  - Row Number: 7
  - Data: [7, 'Thu 05 Jun 2025 01:25:34 PM MDT']
Table: simulation
  - Columns: ['sim_id', 'sim_datetime']
  - 

In [42]:
# Perform a search and keep the matches as a collection (a list of DataFrames)
find_list = store.find("8.0")

find_list

Finding all instances of '8.0' in the active DSI backend

Table: input
  - Columns: ['sim_id', 'ymin', 'y_cells', 'test_problem', 'xmin', 'max_timestep', 'timestep_rise', 'state1_energy', 'state2_xmax', 'state2_xmin', 'state2_ymin', 'state1_density', 'state2_energy', 'initial_timestep', 'state2_density', 'state2_geometry', 'ymax', 'visit_frequency', 'state2_ymax', 'end_step', 'xmax', 'x_cells']
  - Row Number: 7
  - Data: [7, 0.0, 60, 2, 0.0, 0.04, 1.5, 1.0, 5.0, 0.0, 0.0, 0.2, 2.5, 0.04, 8.0, 'rectangle', 10.0, 10, 2.0, 90, 10.0, 60]
Table: output
  - Columns: ['sim_id', 'control', 'y', 'wall_clock', 'x', 'step', 'step_time_per_cell', 'timestep', 'average_time_per_cell', 'time']
  - Row Number: 533
  - Data: [6, 'sound', 0.0833, 0.10679292678833008, 0.0833, 83, 8.000267876519097e-08, 0.04, 3.5740604681502704e-07, 3.28]
Table: output
  - Columns: ['sim_id', 'control', 'y', 'wall_clock', 'x', 'step', 'step_time_per_cell', 'timestep', 'average_time_per_cell', 'time']
  - Row Number: 535
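
The search for `8.0` above also matches the value `8.000267876519097e-08`, which suggests that `find` compares against the text of each cell rather than its numeric value. The sketch below illustrates that idea on a plain `sqlite3` connection. `find_substring` is a hypothetical helper written for illustration, not DSI's implementation, and the 1-based row numbering is an assumption.

```python
import sqlite3


def find_substring(conn, needle):
    """Return (table, row_number, row) for every row with a cell whose text contains needle."""
    needle = str(needle)
    matches = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        rows = conn.execute(f"SELECT * FROM {quoted}")
        for row_number, row in enumerate(rows, start=1):
            # Compare against the string form of each cell, so "8.0" also
            # matches a float printed as 8.000267876519097e-08.
            if any(needle in str(cell) for cell in row):
                matches.append((table, row_number, list(row)))
    return matches
```

Under this reading, searching for a short numeric string can return more rows than expected, so a more specific search term narrows the results.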


# Updating contents with DSI

In [45]:
for table in find_list:
    store.display(table["dsi_table_name"][0], 5) # display the first 5 rows before the update

    table["new_col"] = 50   # add a new column to this DataFrame
    table["timestep"] = 100 # overwrite timestep where it exists; otherwise add it as a new column

# dsi.update(collection)
store.update(find_list) # update all tables in the list
store.update(find_list[0]) # update only the first table in the list



Table: output

sim_id | control | y      | wall_clock             | x      | step | step_time_per_cell     | timestep | average_time_per_cell  | time | new_col
------------------------------------------------------------------------------------------------------------------------------------------------
1      | sound   | 0.0833 | 0.00026607513427734375 | 0.0833 | 1    | 7.390975952148437e-08  | 100.0    | 7.390975952148437e-08  | 0.0  | 50.0   
1      | sound   | 0.0833 | 0.0004999637603759766  | 0.0833 | 2    | 6.245242224799262e-08  | 100.0    | 6.943941116333008e-08  | 0.04 | 50.0   
1      | sound   | 0.0833 | 0.0007319450378417969  | 0.0833 | 3    | 6.25186496310764e-08   | 100.0    | 6.777268868905527e-08  | 0.08 | 50.0   
1      | sound   | 0.0833 | 0.0009679794311523438  | 0.0833 | 4    | 6.357828776041667e-08  | 100.0    | 6.722079383002387e-08  | 0.12 | 50.0   
1      | sound   | 0.0833 | 0.0012009143829345703  | 0.0833 | 5    | 6.337960561116536e-08  | 100.0    | 6.6717465

In [46]:
# See the updated results
for table in find_list:
    store.display(table["dsi_table_name"][0], 5) # display table after update


Table: output

sim_id | control | y      | wall_clock             | x      | step | step_time_per_cell     | timestep | average_time_per_cell  | time | new_col
------------------------------------------------------------------------------------------------------------------------------------------------
1      | sound   | 0.0833 | 0.00026607513427734375 | 0.0833 | 1    | 7.390975952148437e-08  | 100.0    | 7.390975952148437e-08  | 0.0  | 50.0   
1      | sound   | 0.0833 | 0.0004999637603759766  | 0.0833 | 2    | 6.245242224799262e-08  | 100.0    | 6.943941116333008e-08  | 0.04 | 50.0   
1      | sound   | 0.0833 | 0.0007319450378417969  | 0.0833 | 3    | 6.25186496310764e-08   | 100.0    | 6.777268868905527e-08  | 0.08 | 50.0   
1      | sound   | 0.0833 | 0.0009679794311523438  | 0.0833 | 4    | 6.357828776041667e-08  | 100.0    | 6.722079383002387e-08  | 0.12 | 50.0   
1      | sound   | 0.0833 | 0.0012009143829345703  | 0.0833 | 5    | 6.337960561116536e-08  | 100.0    | 6.6717465
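
Assigning a value to a DataFrame column overwrites that column when it already exists and creates it otherwise, so the same assignment can change a table's contents in one case and its schema in another. The sketch below shows the equivalent overwrite-or-add logic in plain SQL on a `sqlite3` connection. `set_column` is a hypothetical helper, and this is not a description of how `store.update` works internally.

```python
import sqlite3


def set_column(conn, table, column, value):
    """Set column to value on every row, creating the column first if it is missing.

    Returns the table's column names after the change.
    """
    quoted_table = '"' + table.replace('"', '""') + '"'
    quoted_column = '"' + column.replace('"', '""') + '"'
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = [info[1] for info in conn.execute(f"PRAGMA table_info({quoted_table})")]
    if column not in existing:
        conn.execute(f"ALTER TABLE {quoted_table} ADD COLUMN {quoted_column}")
    conn.execute(f"UPDATE {quoted_table} SET {quoted_column} = ?", (value,))
    conn.commit()
    return [info[1] for info in conn.execute(f"PRAGMA table_info({quoted_table})")]
```

Checking a table's column list before and after an update is a simple way to notice when an assignment added a column instead of overwriting one.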

# Query DSI

In [49]:
# Use a SQL statement to query the backend store directly
store.query("SELECT * FROM input LIMIT 3")

Printing the result of the SQL query: SELECT * FROM input LIMIT 3

sim_id | ymin | y_cells | test_problem | xmin | max_timestep | timestep_rise | state1_energy | state2_xmax | state2_xmin | state2_ymin | state1_density | state2_energy | initial_timestep | state2_density | state2_geometry | ymax | visit_frequency | state2_ymax | end_step | xmax | x_cells | new_col | timestep
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1      | 0.0  | 60      | 2            | 0.0  | 0.04         | 1.5           | 1.0           | 5.0         | 0.0         | 0.0         | 0.2            | 2.5           | 0.04             | 2.0            | rectangle       | 10.0 | 10              | 2.0         | 90       | 10.0 | 60      | 50      | 100     
2  
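
Queries are not limited to `SELECT *`. As a sketch, the aggregate below counts rows and finds the last step for each run in the `output` table, using the `sim_id` and `step` columns shown earlier. It runs here against a small in-memory stand-in table through `sqlite3`; passing the same SQL string to `store.query` is assumed to work but is not verified here, and `steps_per_run` is a hypothetical helper.

```python
import sqlite3

# Column names (sim_id, step) are taken from the output table shown earlier.
STEPS_PER_RUN_SQL = (
    "SELECT sim_id, COUNT(*) AS num_rows, MAX(step) AS last_step "
    "FROM output GROUP BY sim_id ORDER BY sim_id"
)


def steps_per_run(conn):
    """Run the aggregate query on a sqlite3 connection and return the result rows."""
    return conn.execute(STEPS_PER_RUN_SQL).fetchall()


# Stand-in data so the sketch runs without the tutorial database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE output (sim_id INTEGER, step INTEGER)")
demo.executemany("INSERT INTO output VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])
print(steps_per_run(demo))  # [(1, 2, 2), (2, 1, 1)]
```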