# DSI Agent

This agent lets you interface with the Data Catalogue provided through the Data Science Infrastructure (DSI) project.

To use this agent, you will need:
 - Access to ChatGPT. The LANL AI team should be able to grant you access. The model currently in use is "gpt-5.1".

Some capabilities available to the agent, besides the ability to search databases, are:
 - Dataset download
 - Web search
 - arXiv paper search
 - Wikipedia search

These are all demonstrated in the notebook below.  

**Note:**
- LLMs can make mistakes and can exhibit random behavior from time to time.

## System setup  
Do not change the cells in this section.

In [1]:
import sqlite3
import os

from langchain_openai import ChatOpenAI
from langgraph.checkpoint.sqlite import SqliteSaver
from pathlib import Path

In [2]:
from ursa.agents import DSIAgent



In [3]:
# Function to display an image in a Jupyter notebook
from IPython.display import display
from PIL import Image


def display_image(img_path: str):
    img = Image.open(img_path)
    display(img)

In [4]:
# need to hide this better
workspace = "dsi_agent_example_1"
os.makedirs(workspace, exist_ok=True)
rdb_path = Path(workspace) / "dsi_agent_checkpoint.db"
rdb_path.parent.mkdir(parents=True, exist_ok=True)
rconn = sqlite3.connect(str(rdb_path), check_same_thread=False)
dsiagent_checkpointer = SqliteSaver(rconn)

## Load Dataset
Provide the agent with the path to the master dataset (the database containing the datacards).

In [5]:
#dataset_path = input("Please enter the address of the dataset to explore:")  # when in production
dataset_path = "data/oceans_11/ocean_11_datasets.db"

In [6]:
# Specify the model to use
model = ChatOpenAI(model="gpt-5.1", max_tokens=100000, timeout=None, max_retries=2)

In [8]:
# Initialize the agent
ai = DSIAgent(llm=model, database_path=dataset_path, checkpointer=dsiagent_checkpointer)

### Query the datasets

Use ```ai.ask("<query>")``` to query the dataset.  
  - e.g. ```ai.ask("Tell me about the datasets you have.")```

You can query each of the datasets, ask the agent to load one of them, or ask it to reload the master diana database.

In [9]:
ai.ask("Tell me about the datasets you have.")

Here are the datasets currently available:

1. Deep Water Impact Ensemble Dataset  
   - Domain: physics  
   - Keywords: asteroid impact, meteor, insitu, visualization, simulation  
   - Summary: Simulations for the IEEE SciVis 2018 contest studying asteroid impacts in deep ocean water, with varying asteroid sizes, speeds, and compositions.

2. Bowtie Dataset  
   - Domain: manufacturing  
   - Keywords: semiconductor, manufacturing  
   - Summary: ~3,600 images of a semiconductor part called “Bowtie,” labeled as accept/reject with subcategories (zoom levels, defect types like gouge, debris, etc.) and spreadsheets linking images to wafer locations.

3. Higrad Firetex Wildfire Simulations  
   - Domain: physics  
   - Keywords: wildfire, simulation, highgrad, firetec  
   - Summary: Time series 3D CFD simulations (Higrad + Firetec) of wildfires over mountainous/canyon terrain, including atmosphere, vegetation, topography, turbulence, and vorticity-driven lateral spread phenomena.

4. Gray-Scott reaction-diffusion dataset  
   - Domain: physics  
   - Keywords: gray-scott, PDE, simulation, complex dynamics  
   - Summary: 1,000-record HDF5 dataset of simulations of the Gray-Scott reaction-diffusion model, illustrating nonlinear pattern formation.

5. The High Explosives & Affected Targets (HEAT) Dataset  
   - Domain: eulerian  
   - Keywords: High Explosives, HEAT, AI Ready, ML, Eulerian  
   - Summary: 2D cylindrically symmetric shock-propagation simulations (CYL and PLI partitions) with thermodynamic and kinematic fields for multiple materials, designed as an AI/ML benchmark for multi-material shock propagation.

6. Heat Equations  
   - Domain: physics  
   - Keywords: heat, diffusion, simulation, partial differential equations  
   - Summary: 1,000-record HDF5 dataset of simulations of heat diffusion equations, with code and mathematical background linked.

7. Monopoly Dataset  
   - Domain: manufacturing  
   - Keywords: computed tomography, scans, monopoly hotels, steel, materials  
   - Summary: X-ray CT image data for 10 additively manufactured 316L steel specimens (“Monopoly Hotels”) with engineered planar lack-of-fusion defects; each specimen has >1000 high-quality TIFF slices.

8. 3D FLASH Computation of National Ignition Facility Shot  
   - Domain: fusion  
   - Keywords: NIF, fusion, 3D, simulation  
   - Summary: 61 HDF5 files from a 3D FLASH simulation of a NIF cylindrical implosion (shot N210201-001), including multi-material concentrations, density, pressure, temperatures, and energy deposition on an adaptive mesh over a 3D domain.  
   - Associated DSI database: nif.db

If you tell me which dataset you’re interested in, I can dive into more detail or (where available) load its associated database and analyze it.


Query took: 16.19 seconds, total tokens used: 4584.



## Diving into the 3D FLASH Computation of National Ignition Facility Shot dataset

In [10]:
ai.ask("Tell me more about the Ignition Facility dataset")

Here’s more detail on the 3D FLASH / National Ignition Facility dataset based on its database:

High‑level description  
- It is a 3D radiation‑hydrodynamics simulation of a cylindrical NIF implosion (shot N210201‑001), run with the FLASH code.  
- The simulation is saved as a time series of HDF5 files; the metadata table I see lists 61 timesteps (0–30 ns in the original description, every 0.5 ns).  
- Each timestep is stored in an adaptive mesh (AMR) format: many 16×16×16 blocks, with the number of blocks varying over time.

From the metadata table (`nif_metadata`)  
Each row corresponds to one timestep/file, with:

- `sim_name`: e.g. `nifcylxyz_hdf5_plt_cnt_0005`  
- `timestep`: integer index (e.g. 1, 3, 5, 7, 18, …)  
- `num_arrays`: always 9 in this DB; the arrays are:
  - `ablt` – ablator concentration (CH plastic)
  - `cham` – chamber-gas concentration (He)
  - `foam` – central foam (CH)
  - `mark` – marker material (Al)
  - `dens` – mass density [g/cm³]
  - `depo` – instantaneous laser energy deposition per unit mass [erg/g]
  - `pres` – pressure [erg/cm³]
  - `tele` – electron temperature [K]
  - `tion` – ion temperature [K]

- `shape`: e.g. `(8664, 16, 16, 16)`  
  - The first number is the number of AMR blocks; each block is 16×16×16 cells.  
  - So `(8664,16,16,16)` ≈ 8664 × 4096 ≈ 35.5 million cells at that timestep.  
  - Shapes vary with time as the AMR grid adapts (e.g. 6872, 8304, 9648 blocks in some samples).

- For each field, min and max values over the whole domain at that timestep:
  - Example at early time (`timestep = 1`):
    - `dens_min` ≈ 8.76×10⁻⁶ g/cm³, `dens_max` ≈ 2.70 g/cm³  
    - `pres_min` ≈ 6.51×10⁴, `pres_max` ≈ 4.98×10¹³ erg/cm³  
    - `tele_min` ≈ 100 K, `tele_max` ≈ 2.56×10⁷ K  
    - `tion_min` ≈ 100 K, `tion_max` ≈ 4.90×10⁷ K  
    - `ablt/cham/foam/mark` all between 0 and 1 (volume fractions)
  - At a later time (`timestep = 18`):
    - `dens_max` ~2.68 g/cm³ (similar order, but different min)
    - `pres_min` ~1.46×10⁸, `pres_max` ~3.10×10¹² erg/cm³
    - `tele_max` ~1.73×10⁶ K, `tion_max` ~2.76×10⁶ K
    - `depo_max` = 0 (no more active laser deposition at that time)

- `link`: URL to the actual FLASH file, e.g.  
  `https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0005.flash`

Auxiliary info (`array_and_types` table)  
- Lists the 9 arrays and their types (e.g., float fields for the variables above).  
- Confirms the dataset is organized as multi-field AMR blocks per timestep.

In practical terms, this dataset lets you:  
- Track implosion dynamics via density, pressure, and temperature evolution.  
- Separate materials (ablator, foam, marker, chamber gas) through their concentration fields.  
- Analyze where and when laser energy is deposited (`depo`) and its effects.  
- Work with high‑resolution, multi-physics 3D AMR data for tasks like:
  - developing or validating reduced-order/ML models,  
  - studying mix, shock propagation, and stagnation conditions,  
  - designing visualization and analysis workflows for large AMR simulations.

If you’d like, I can:
- List all timesteps and their basic stats,  
- Identify the most “extreme” timestep (e.g., max pressure or temperature), or  
- Help you design an analysis workflow (e.g., which fields to focus on for a specific physics question).


Query took: 18.11 seconds, total tokens used: 7577.
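
The cell-count arithmetic in the agent's answer can be checked by hand. The sketch below assumes the `shape` column holds a tuple-like string such as `(8664, 16, 16, 16)`, which is how the agent prints it; the helper name `cells_in_shape` is hypothetical and not part of the DSI agent.

```python
import ast
from math import prod


def cells_in_shape(shape_text: str) -> int:
    """Return the total cell count for a shape string like '(8664, 16, 16, 16)'."""
    # Assumption: the shape is stored as a Python-style tuple literal.
    shape = ast.literal_eval(shape_text)
    # Number of AMR blocks times the cells per block (16 * 16 * 16 = 4096).
    return prod(shape)


# Block counts quoted in the agent's answer above.
for n_blocks in (6872, 8304, 8664, 9648):
    shape_text = f"({n_blocks}, 16, 16, 16)"
    print(shape_text, "->", cells_in_shape(shape_text), "cells")
```

For the `(8664, 16, 16, 16)` example this gives 35,487,744 cells, which matches the agent's "≈ 35.5 million".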



In [None]:
ai.ask("Can you load that DSI database")

In [11]:
ai.ask("list the tables/variables")

For the Ignition Facility (NIF) dataset, the available simulation variables are:

- ablt (float) – ablator material fraction (CH plastic)  
- cham (float) – chamber gas fraction (He)  
- dens (float) – mass density  
- depo (float) – laser energy deposition per unit mass  
- foam (float) – central foam fraction (CH)  
- mark (float) – marker material fraction (Al)  
- pres (float) – pressure  
- tele (float) – electron temperature  
- tion (float) – ion temperature  

If by “tables” you meant database tables rather than simulation variables, the key ones here are:
- `nif_metadata` – one row per timestep/file with global stats and links  
- `array_and_types` – the list of variables above and their data types.


Query took: 6.90 seconds, total tokens used: 8087.



In [12]:
ai.ask("show me some rows from nif_metadata")

Here are 10 example rows from `nif_metadata`:

1.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0005  
- timestep: 5  
- num_arrays: 9  
- shape: (8664, 16, 16, 16)  
- dens_min / dens_max: 9.11e-06 / 5.67  
- pres_min / pres_max: 6.51e4 / 3.43e13  
- tele_min / tele_max: 1.00e2 / 2.38e7  
- tion_min / tion_max: 1.48e2 / 8.14e7  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0005.flash  

2.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0007  
- timestep: 7  
- shape: (8304, 16, 16, 16)  
- dens_min / dens_max: 1.00e-05 / 7.38  
- pres_min / pres_max: 6.51e4 / 3.57e13  
- tele_min / tele_max: 1.00e2 / 1.60e7  
- tion_min / tion_max: 2.03e2 / 4.69e7  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0007.flash  

3.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0003  
- timestep: 3  
- shape: (8656, 16, 16, 16)  
- dens_min / dens_max: 9.19e-06 / 4.55  
- pres_min / pres_max: 6.51e4 / 3.82e13  
- tele_min / tele_max: 1.05e2 / 2.38e7  
- tion_min / tion_max: 1.05e2 / 1.54e8  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0003.flash  

4.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0001  
- timestep: 1  
- shape: (6872, 16, 16, 16)  
- dens_min / dens_max: 8.76e-06 / 2.70  
- pres_min / pres_max: 6.51e4 / 4.98e13  
- tele_min / tele_max: 1.00e2 / 2.56e7  
- tion_min / tion_max: 1.00e2 / 4.90e7  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0001.flash  

5.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0018  
- timestep: 18  
- shape: (9648, 16, 16, 16)  
- dens_min / dens_max: 1.52e-05 / 2.68  
- pres_min / pres_max: 1.46e8 / 3.10e12  
- tele_min / tele_max: 1.00e2 / 1.73e6  
- tion_min / tion_max: 2.36e2 / 2.76e6  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0018.flash  

6.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0006  
- timestep: 6  
- shape: (8408, 16, 16, 16)  
- dens_min / dens_max: 9.85e-06 / 7.95  
- pres_min / pres_max: 6.51e4 / 4.04e13  
- tele_min / tele_max: 1.57e2 / 2.34e7  
- tion_min / tion_max: 1.57e2 / 5.00e7  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0006.flash  

7.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0004  
- timestep: 4  
- shape: (8408, 16, 16, 16)  
- dens_min / dens_max: 8.92e-06 / 4.65  
- pres_min / pres_max: 6.45e4 / 3.24e13  
- tele_min / tele_max: 1.54e2 / 2.47e7  
- tion_min / tion_max: 1.54e2 / 1.27e8  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0004.flash  

8.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0000  
- timestep: 0  
- shape: (6360, 16, 16, 16)  
- dens_min / dens_max: 1.00e-05 / 2.70  
- pres_min / pres_max: 6.51e4 / 1.33e10  
- tele_min / tele_max: 2.90e2 / 2.90e2  
- tion_min / tion_max: 2.90e2 / 2.90e2  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0000.flash  

9.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0019  
- timestep: 19  
- shape: (9848, 16, 16, 16)  
- dens_min / dens_max: 1.66e-05 / 2.63  
- pres_min / pres_max: 1.46e8 / 3.06e12  
- tele_min / tele_max: 1.00e2 / 1.54e6  
- tion_min / tion_max: 2.35e2 / 3.07e6  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0019.flash  

10.  
- sim_name: nifcylxyz_hdf5_plt_cnt_0002  
- timestep: 2  
- shape: (8240, 16, 16, 16)  
- dens_min / dens_max: 7.30e-06 / 4.14  
- pres_min / pres_max: 6.51e4 / 4.25e13  
- tele_min / tele_max: 1.03e2 / 2.35e7  
- tion_min / tion_max: 1.03e2 / 2.05e8  
- link: https://oceans11.lanl.gov/nif/N210201-001_3D_r2//nifcylxyz_hdf5_plt_cnt_0002.flash  

If you’d like a more compact table (e.g., only `timestep`, `shape`, and `pres_max`) or sorted by a particular quantity (like max pressure or temperature), I can pull that next.


Query took: 14.97 seconds, total tokens used: 12961.
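
The compact, sorted table the agent offers can also be produced with plain SQL. The sketch below does not touch the real database: it builds an in-memory SQLite table holding four of the rows printed above (values as rounded by the agent). The column types are assumptions, since the actual `nif_metadata` schema is not shown in this notebook.

```python
import sqlite3

# Rows copied from the agent's output above: (sim_name, timestep, shape, pres_max).
rows = [
    ("nifcylxyz_hdf5_plt_cnt_0000", 0, "(6360, 16, 16, 16)", 1.33e10),
    ("nifcylxyz_hdf5_plt_cnt_0001", 1, "(6872, 16, 16, 16)", 4.98e13),
    ("nifcylxyz_hdf5_plt_cnt_0002", 2, "(8240, 16, 16, 16)", 4.25e13),
    ("nifcylxyz_hdf5_plt_cnt_0018", 18, "(9648, 16, 16, 16)", 3.10e12),
]

# Assumed schema for illustration only; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nif_metadata "
    "(sim_name TEXT, timestep INTEGER, shape TEXT, pres_max REAL)"
)
conn.executemany("INSERT INTO nif_metadata VALUES (?, ?, ?, ?)", rows)

# Compact view: three columns only, highest maximum pressure first.
compact = conn.execute(
    "SELECT timestep, shape, pres_max FROM nif_metadata ORDER BY pres_max DESC"
).fetchall()
conn.close()

for timestep, shape, pres_max in compact:
    print(f"{timestep:>3}  {shape:<20}  {pres_max:.2e}")
```

Among these four rows, timestep 1 has the highest `pres_max` and timestep 0 the lowest. The same `SELECT` should work against the loaded database if its column names match those in the agent's output.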



In [None]:
ai.ask("Can you download the data for sim name: nifcylxyz_hdf5_plt_cnt_0018")

In [None]:
ai.ask("can you visualize the data you downloaded with yt?")  # you will need to install yt in your environment for this

In [None]:
display_image("nifcylxyz_hdf5_plt_cnt_0018_density_slice_z0.png")

In [None]:
ai.ask("can you create a plot of dens_max over time")

In [None]:
display_image("dens_max_vs_timestep.png")

In [None]:
ai.ask("explain to me how you generated this plot")

In [None]:
ai.ask("who is the owner of this dataset?")

## General Inquiries

In [None]:
ai.ask("can you reload the master database now")

In [None]:
ai.ask("Tell me again what datasets you have?")

In [None]:
ai.ask("what is this Gray-Scott reaction-diffusion dataset?")

In [None]:
ai.ask("can you find some arxiv papers related to this")

In [None]:
ai.ask("can you search osti for papers on it?")

In [None]:
ai.ask("I'm interested in asteroid impacts, is there anything related to it?")

In [None]:
ai.ask("do asteroids really impact earth")

In [None]:
ai.ask("can you search wikipedia for this impact: Chelyabinsk, Russia")