# DSI Tutorial and getting started

The goal of the Data Science Infrastructure Project ([DSI](https://github.com/lanl/dsi)) is to provide a flexible, AI-ready metadata query capability which returns data subject to strict, POSIX-enforced file security. In this tutorial, you will learn how to:
 - initialize a DSI instance
 - load data into DSI
 - check the data loaded
 - query the data
 - create new data and save it to DSI
 - load complex schemas
 - use DSI writers
 - use DSI Sync to index and move data

This tutorial uses data from the [Cloverleaf3D](https://github.com/UK-MAC/CloverLeaf3D) Lagrangian-Eulerian hydrodynamics solver. Archived data is provided in dsi/examples/clover3d. Before running the tutorial, follow the instructions in the [Quick Start: Installation](https://lanl.github.io/dsi/installation.html) to set up DSI, then extract clover3d.zip using *unzip -j*.



In [1]:
from dsi.dsi import DSI

In [2]:
# Create instance of DSI
baseline = DSI()

Created an instance of DSI


# Available features

To see which backends, readers, and writers are available, call the functions below to list the feature set in your installation.

In [3]:
# Lists available backends
baseline.list_backends()


Valid Backends for `backend_name` in backend():
----------------------------------------
Sqlite : Lightweight, file-based SQL backend. Default backend used by DSI API.
DuckDB : In-process SQL backend optimized for fast analytics on large datasets.




In [4]:
# Lists available readers
baseline.list_readers()


Valid Readers for `reader_name` in read():
--------------------------------------------------
Collection           : Loads data from an Ordered Dict. If multiple tables, each table must be a nested OrderedDict.
CSV                  : Loads data from CSV files (one table per call)
Parquet              : Loads data from Parquet - a columnar storage format for Apache Hadoop (one table per call)
YAML1                : Loads data from YAML files of a certain structure
TOML1                : Loads data from TOML files of a certain structure
JSON                 : Loads single-table data from JSON files
Ensemble             : Loads a CSV file where each row is a simulation run; creates a simulation table
Cloverleaf           : Loads data from a directory with subfolders for each simulation run's input and output data
Bueno                : Loads performance data from Bueno (github.com/lanl/bueno) (.data text file format)
DublinCoreDatacard   : Loads dataset metadata adhering to the Dublin Co

In [5]:
# Lists available writers
baseline.list_writers()


Valid Writers for `writer_name` in write(): ['ER_Diagram', 'Table_Plot', 'Csv_Writer', 'Parquet_Writer'] 

ER_Diagram  : Creates a visual ER diagram image based on all tables in DSI.
Table_Plot  : Generates a plot of numerical data from a specified table.
Csv         : Exports the data of a specified table to a CSV file.
Parquet     : Exports the data of a specified table to a Parquet file.



# Reading Data into DSI

For this tutorial, we will use the Cloverleaf3D data available in the DSI repository.

* To get the repository, run `git clone https://github.com/lanl/dsi.git`
* The data is in examples/clover3d

The data is from [Cloverleaf3D](https://github.com/UK-MAC/CloverLeaf3D), a Lagrangian-Eulerian hydrodynamics solver.

The data is an **ensemble** of 8 runs, and has 4 metadata products of interest:

* clover.in - input deck
* clover.out - simulation statistics
* timestamps.txt - time when the simulation was launched on Slurm
* viz files - in situ outputs in VTK format


In [None]:
from IPython.display import HTML

HTML("""
<video width="256" height="208" controls loop>
  <source src="clover3d/movie.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
""")


To begin the ingest:

In [6]:
# Target backend defaults to SQLite because none is specified
store = DSI("dsi-tutorial.db")

# dsi.read(path, reader)
store.read("./clover3d/", 'Cloverleaf')

Created an instance of DSI with the Sqlite backend: dsi-tutorial.db
OrderedDict([('input', OrderedDict([('sim_id', [])])), ('output', OrderedDict([('sim_id', [])])), ('simulation', OrderedDict([('sim_id', []), ('sim_datetime', [])])), ('viz_files', OrderedDict([('sim_id', []), ('image_filepath', [])]))])
Loaded ./clover3d/ into tables: input, output, simulation, viz_files


In [7]:
store.read("/home/pascalgrosset/Desktop/sample_students.csv", 'CSV', 'student')

OrderedDict([('student', OrderedDict([('StudentID', [1, 2, 3, 4, 5]), ('FirstName', ['Alice', 'Brian', 'Chloe', 'David', 'Eva']), ('LastName', ['Johnson', 'Lopez', 'Smith', 'Brown', 'Martinez']), ('GradeLevel', [10, 11, 12, 9, 10]), ('Major', ['Mathematics', 'Computer Science', 'Biology', 'History', 'Physics'])]))])
Loaded /home/pascalgrosset/Desktop/sample_students.csv into the table student
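The structure printed by the reads above is column-oriented: a nested OrderedDict that maps a table name to column names, and each column name to a list of values. The sketch below builds the same shape from row records using only the standard library. The helper `rows_to_table` is hypothetical and not part of DSI, and the rows are a made-up subset of the student data shown above.

```python
from collections import OrderedDict


def rows_to_table(rows):
    """Turn a list of row dicts into an OrderedDict of column lists.

    Assumes every row has the same keys; otherwise columns would misalign.
    """
    table = OrderedDict()
    for row in rows:
        for column, value in row.items():
            table.setdefault(column, []).append(value)
    return table


# Made-up rows mirroring the first two students in the output above
rows = [
    {"StudentID": 1, "FirstName": "Alice", "GradeLevel": 10},
    {"StudentID": 2, "FirstName": "Brian", "GradeLevel": 11},
]

# One entry per table, matching the nested layout printed by read()
collection = OrderedDict([("student", rows_to_table(rows))])
print(collection["student"]["FirstName"])  # ['Alice', 'Brian']
```

A dictionary of this shape matches what the reader list describes for the Collection reader, although the exact call to load it is not shown in this tutorial.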


# Exploring the loaded data

In [7]:
# How many tables do we have
store.num_tables()

Database now has 4 tables


In [None]:
# Let's see what tables were created
store.list()

In [None]:
# Let's get more details about the data
store.summary()

In [None]:
# Preview the contents of the simulation table
store.display("simulation")

# DSI Find to search within the data

DSI's find capability lets you explore your data with conditions built from the following operators: >, <, >=, <=, =, ==, ~ (contains), ~~ (contains), !=, and (X, Y) for a range between values X and Y. Passing True as a second argument returns the matches as a collection.
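To make the condition syntax concrete, here is a minimal sketch of how a string such as `time(1.0,1.1)` or `wall_clock > 0.10` could be split into a column name and a test. It is an illustration only and not DSI's implementation: the function `parse_condition` is hypothetical, it handles numeric values only, it skips the `~` and `~~` string operators, and it assumes the range bounds are inclusive.

```python
import operator
import re

# Map each comparison symbol to the function that evaluates it
COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}

# column(low, high) form
RANGE_RE = re.compile(r"^\s*(\w+)\s*\(\s*([^,]+?)\s*,\s*([^)]+?)\s*\)\s*$")
# column <symbol> value form; two-character symbols are listed first so ">=" wins over ">"
COMPARE_RE = re.compile(r"^\s*(\w+)\s*(>=|<=|==|!=|>|<|=)\s*(.+?)\s*$")


def parse_condition(text):
    """Return (column, predicate) for a numeric comparison or range."""
    match = RANGE_RE.match(text)
    if match:
        column = match.group(1)
        low, high = float(match.group(2)), float(match.group(3))
        # Inclusive bounds are an assumption made for this sketch
        return column, lambda value: low <= value <= high
    match = COMPARE_RE.match(text)
    if match:
        column, symbol = match.group(1), match.group(2)
        target = float(match.group(3))
        compare = COMPARATORS[symbol]
        return column, lambda value: compare(value, target)
    raise ValueError(f"unsupported condition: {text!r}")


# Made-up rows to filter
rows = [{"time": 0.5}, {"time": 1.05}, {"time": 3.2}]
column, predicate = parse_condition("time(1.0,1.1)")
selected = [row for row in rows if predicate(row[column])]
print(selected)  # [{'time': 1.05}]
```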

In [None]:
# Search string or value within all tables
store.find("wall_clock > 0.10")

In [None]:
# Perform a find and receive a collection
find_list = store.find("state2_density==8.0", True) # Use True to return a collection

In [None]:
# Simply display what this collection (pandas dataframe) looks like
find_list

In [None]:
find_list = store.find("time>3.0", True)

In [None]:
find_list

In [None]:
find_list = store.find("time(1.0,1.1)", True)

In [None]:
find_list

# Updating contents with DSI

DSI allows you to add to or modify the contents of a collection returned by a find or query operation when True is passed.

Example use case: we want to post-process the ingested data and append additional information to our DSI database. In this example, we convert the simulation date from text to numerical Unix time.

In [None]:
collection = store.find("sim_id > 0", True)

In [None]:
collection = store.query("SELECT * FROM simulation WHERE sim_id > '1'", True, True)

In [None]:
collection

In [None]:
# Small amount of helper code to convert dates to unix time
from datetime import datetime
from zoneinfo import ZoneInfo
def str2unix(date_str):
    date_str_clean = date_str.rsplit(' ', 1)[0]  # remove 'MDT'
    dt_naive = datetime.strptime(date_str_clean, "%a %d %b %Y %I:%M:%S %p")
    # Set timezone
    dt_local = dt_naive.replace(tzinfo=ZoneInfo("America/Denver"))
    unix_time = int(dt_local.timestamp()) # Unix time in UTC
    return unix_time
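As a quick sanity check of the conversion, the sketch below runs the same logic on a single timestamp and converts the result back to UTC. The sample string is hypothetical and not taken from the dataset; it only follows the format that the helper parses. The function is repeated here so that the snippet runs on its own.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def str2unix(date_str):
    # Same logic as the helper above, repeated so this snippet is self-contained
    date_str_clean = date_str.rsplit(" ", 1)[0]  # drop the trailing zone abbreviation
    dt_naive = datetime.strptime(date_str_clean, "%a %d %b %Y %I:%M:%S %p")
    dt_local = dt_naive.replace(tzinfo=ZoneInfo("America/Denver"))
    return int(dt_local.timestamp())


sample = "Mon 01 Jul 2024 02:15:30 PM MDT"  # hypothetical timestamp for illustration
unix_time = str2unix(sample)

# Convert back to UTC to confirm the six-hour MDT offset was applied
round_trip = datetime.fromtimestamp(unix_time, tz=timezone.utc)
print(round_trip.isoformat())  # 2024-07-01T20:15:30+00:00
```

Note that the helper discards the zone abbreviation and always interprets the time as America/Denver, so it suits only data recorded in that time zone.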

In [None]:
store.display("simulation") # display table before update

In [None]:
print(collection)

In [None]:
# Iterate through collection and append new metadata
collection["sim_unixtime"] = collection["sim_datetime"].apply(str2unix)

#dsi.update(collection)
store.update(collection) # update all tables in the list

In [None]:
# See the updated results
store.display("simulation")

# Query DSI

DSI supports direct SQL queries on the data that you have ingested.

In [None]:
# Use sql statement to directly query the backend store
store.query("SELECT sim_id, xmin, ymin, xmax, ymax, state2_density FROM input") # Adding 'True' gives a collection

In [None]:
store.list()

In [None]:
# alternative to "query()" if you want to get a whole table
store.get_table("input") # Adding 'True' gives a collection
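For readers less familiar with SQL, the following sketch runs a comparable SELECT with Python's standard `sqlite3` module against an in-memory table. It does not use DSI: the table is a stand-in for "input", the rows are invented, and only a few of the column names from the query above are reproduced.

```python
import sqlite3

# In-memory stand-in for the "input" table; the rows are invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input (sim_id INTEGER, xmin REAL, xmax REAL, state2_density REAL)"
)
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?, ?)",
    [(1, 0.0, 10.0, 8.0), (2, 0.0, 10.0, 4.0), (3, 0.0, 10.0, 8.0)],
)

# Parameterized query: the placeholder keeps the value out of the SQL text
matches = conn.execute(
    "SELECT sim_id, state2_density FROM input "
    "WHERE state2_density = ? ORDER BY sim_id",
    (8.0,),
).fetchall()
print(matches)  # [(1, 8.0), (3, 8.0)]
conn.close()
```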

# DSI Write - Complex Schemas

By formatting your metadata and loading it into DSI, you have essentially created a schema. DSI can also represent complex schemas by defining relations between tables. For example, to relate the different tables to each other, you can use the schema reader, which takes a .json file.

* schema.json
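To illustrate what a relation between tables means, the sketch below links two tables with a primary key and a foreign key using the standard `sqlite3` module. This is a conceptual example under stated assumptions, not the format of schema.json and not DSI's implementation. The table and column names follow the tables loaded earlier, while the row values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# simulation.sim_id is the primary key; viz_files.sim_id must refer to it
conn.execute("CREATE TABLE simulation (sim_id INTEGER PRIMARY KEY, sim_datetime TEXT)")
conn.execute(
    "CREATE TABLE viz_files ("
    " sim_id INTEGER REFERENCES simulation(sim_id),"
    " image_filepath TEXT)"
)
conn.execute("INSERT INTO simulation VALUES (1, 'placeholder date')")
conn.execute("INSERT INTO viz_files VALUES (1, 'run1/frame_0001.vtk')")

# A row pointing at a simulation that does not exist violates the relation
try:
    conn.execute("INSERT INTO viz_files VALUES (99, 'orphan.vtk')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The relation is what makes this join meaningful
joined = conn.execute(
    "SELECT s.sim_id, v.image_filepath "
    "FROM simulation s JOIN viz_files v ON v.sim_id = s.sim_id"
).fetchall()
print(orphan_rejected, joined)  # True [(1, 'run1/frame_0001.vtk')]
conn.close()
```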

Before defining and ingesting a complex schema, what does an Entity Relationship Diagram look like in our initial schema?

* To run this portion of the example, the graphviz package is required

pip install graphviz

In [None]:
store.write("clover_er_diagram_no_schema.png", "ER_Diagram")

from IPython.display import Image
Image(filename="clover_er_diagram_no_schema.png", width=200)

In [None]:
# Create a new database where we will relate a complex schema
schema_store = DSI("schema_tutorial.db")

# dsi.schema(filename)
schema_store.schema("./clover3d/schema.json") # Schema needs to be defined before reading Cloverleaf data

# dsi.read(path, reader)
schema_store.read("./clover3d/", 'Cloverleaf') # read in Cloverleaf data

# dsi.write(filename, writer)
schema_store.write("clover_er_diagram.png", "ER_Diagram")

To preview the Entity Relationship Diagram (ER diagram), import the library used to display images.

In [None]:
from IPython.display import Image
Image(filename="clover_er_diagram.png", width=300)

# DSI Write - CSV

DSI supports writing data out so that you can export it to another project. For example, here we export the table "input" to a CSV file.

In [None]:
store.write("input.csv", "CSV", "input")
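For comparison, exporting a table to CSV by hand takes only a few lines with the standard `sqlite3` and `csv` modules. The sketch below is not how the DSI writer is implemented. It uses an in-memory stand-in for the "input" table with invented rows, and writes to a string buffer so that it leaves no file behind.

```python
import csv
import io
import sqlite3

# In-memory stand-in for the "input" table; the rows are invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (sim_id INTEGER, state2_density REAL)")
conn.executemany("INSERT INTO input VALUES (?, ?)", [(1, 8.0), (2, 4.0)])

cursor = conn.execute("SELECT * FROM input ORDER BY sim_id")
header = [column[0] for column in cursor.description]  # column names from the query

buffer = io.StringIO()  # swap for open("input.csv", "w", newline="") to write a file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(cursor)  # the cursor yields one tuple per row
conn.close()

csv_text = buffer.getvalue()
print(csv_text)
```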

# DSI Write - Table plot
DSI has a built-in tool to assist in plotting tables. In this example, we plot the contents of the "output" table. This is useful for automated tools and CI pipelines that track ongoing statistics.

In [None]:
store.write("output_table_plot.png", "Table_Plot", "output")

In [None]:
Image(filename="output_table_plot.png", width=400)

# Ending your workflow

In [None]:
store.close()
schema_store.close()

# Reloading your workflow

In [None]:
# Target backend defaults to SQLite because none is specified
store = DSI("dsi-tutorial.db")
store.summary()

# Moving your data with DSI

In [None]:
from dsi.core import Sync

In [None]:
#Origin
local_files = "./clover3d/"
# Remote (assumes macOS; change the path on other systems)
remote_path = "/Users/Shared/staging/"

In [None]:
# Create Sync type with project name
s = Sync("dsi-tutorial")

In [None]:
s.index(local_files,remote_path,True)

In [None]:
store.summary()

In [None]:
s.copy("copy",True)