# Converting Data into Well Format

While the next [example](./walrus_example_1_RunningWalrus.ipynb) shows how to use Walrus independently of our training code, our training and validation code will generally be easier to use if the data is formatted uniformly.

To that end, this notebook takes a non-Well dataset, [BubbleML 2.0](https://arxiv.org/abs/2507.21244), and converts it into the Well format. We look specifically at the PoolBoilSubcooled data for R515B and FC72 that we used in the Walrus paper. Because this is for use with Walrus, where the goal is to infer a number of properties from history, we don't try to map every property stored in the dataset to the new format, only the ones we need. **If you use the BubbleML dataset in your work in the transformed format, please cite their paper.** Their project also contains a number of interesting scenarios, tasks, and forms of analysis that weren't used in Walrus.

## What is The Well Format?

The Well format is a standardized HDF5 structure for storing PDE simulation data. Key features:
- **Hierarchical organization**: Dimensions, fields, boundary conditions, and scalars in separate groups
- **Metadata-rich**: Attributes describe time-varying, sample-varying, and dimension-varying properties
- **Field organization by rank**: Scalars (t0), vectors (t1), tensors (t2)
- **Flexible**: Supports Cartesian, cylindrical, and spherical coordinates

To start, let's create paths for our raw and processed data, then download the data. The files we're going to download are quite large, so it may be better to fetch them through your preferred approach (manual, huggingface CLI, or other) and point `download_path` at the downloaded files.

In [None]:
%matplotlib inline
import os 
import matplotlib.pyplot as plt

# Define paths for raw (downloaded) and processed (Well-formatted) data
download_path = "./raw_data/"
processed_path = "./processed_data/"

# Create directories if they don't exist
os.makedirs(download_path, exist_ok=True)
os.makedirs(processed_path, exist_ok=True)

## Step 1: Download the BubbleML Data

Note: this is a large dataset (several GB), so you may want to download it separately and just point `download_path` at the download location.

We'll download two materials:
- **R515B**: A refrigerant commonly used in cooling systems
- **FC72**: A dielectric fluid used in electronic cooling

Each material has multiple HDF5 files (different wall temperatures) and accompanying JSON metadata files.
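
If you would rather not issue raw HTTP requests, the `huggingface_hub` package offers `snapshot_download`, which can restrict a download to one subfolder. The sketch below is an alternative under assumptions, not the approach used in the rest of this notebook: `material_patterns` and `download_material` are hypothetical helper names, and the package is assumed to be installed. Note that files fetched this way land under `local_dir/<material>/`, so later paths would need to point at that folder rather than at the `R515B` and `FC72` subdirectories used here.

```python
REPO_ID = "hpcforge/BubbleML_2"  # dataset repository referenced in this notebook


def material_patterns(material):
    """Glob patterns selecting every file in one material's subfolder."""
    return [f"{material}/*"]


def download_material(material, local_dir):
    """Fetch one material subfolder (hypothetical helper, not part of Walrus)."""
    # Imported lazily so this helper can be defined without the package present.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=material_patterns(material),
        local_dir=local_dir,
    )


# Left commented out because it triggers a multi-GB download:
# download_material("PoolBoiling-Subcooled-R515B-2D", "./raw_data/")
print(material_patterns("PoolBoiling-Subcooled-R515B-2D"))
```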

In [None]:
import requests
import os

os.makedirs(download_path, exist_ok=True)

# Map each HuggingFace subfolder to a local subdirectory
# This organizes data by material type
materials = {
    "PoolBoiling-Subcooled-R515B-2D": os.path.join(download_path, "R515B"),
    "PoolBoiling-Subcooled-FC72-2D":  os.path.join(download_path, "FC72"),
}

# Download files for each material
for material, subdir in materials.items():
    # Create subdirectory for this material
    os.makedirs(subdir, exist_ok=True)
    
    # Query HuggingFace API to get list of files in this folder
    api = f"https://huggingface.co/api/datasets/hpcforge/BubbleML_2/tree/main/{material}"
    resp = requests.get(api)
    resp.raise_for_status()  # Raise error if request fails
    items = resp.json()  # Parse JSON response

    # Download each file in the folder
    for item in items:
        path = item.get("path")
        
        # Skip if not a valid file path
        if not path or item.get("type") != "file":
            continue
        
        # Construct download URL
        url = f"https://huggingface.co/datasets/hpcforge/BubbleML_2/resolve/main/{path}"
        out_file = os.path.join(subdir, os.path.basename(path))
        
        # Skip if file already exists
        if os.path.exists(out_file):
            print("skipping (exists):", out_file)
            continue
        
        # Download with streaming to handle large files
        print("downloading:", path, "->", out_file)
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(out_file, "wb") as f:
                # Write in chunks (8KB at a time) for memory efficiency
                for chunk in r.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
        print("saved:", out_file)

## Step 2: Understand the Well Format Structure

If we look at the files we've downloaded, we have a collection of HDF5 and JSON files. To see what we're trying to create, let's look at the structure of a Well-formatted file.

### Example: Helmholtz Staircase Dataset

The overall file structure looks like:
```
<KeysViewHDF5 ['boundary_conditions', 'dimensions', 'scalars', 't0_fields', 't1_fields', 't2_fields']>
{'dataset_name': 'helmholtz_staircase', 'grid_type': 'cartesian', 'n_spatial_dims': 2, 'n_trajectories': 26, 'omega': 0.06283032, 'simulation_parameters': array(['omega'], dtype=object)}
├── boundary_conditions
│   ├── x_open_neumann
│   │   └── mask (1024,)              ← Boolean mask: 1 at boundary locations
│   ├── xy_wall
│   │   └── mask (1024, 256)
│   └── y_open_neumann
│       └── mask (256,)
├── dimensions
│   ├── time (50,)                     ← Temporal coordinate
│   ├── x (1024,)                      ← Spatial coordinates
│   └── y (256,)
├── scalars
│   └── omega ()                        ← Simulation parameters (0D)
├── t0_fields                           ← Scalar fields (0-tensor)
│   ├── mask (1024, 256)                  [spatial dims only]
│   ├── pressure_im (26, 50, 1024, 256)   [traj, time, x, y]
│   └── pressure_re (26, 50, 1024, 256)
├── t1_fields                           ← Vector fields (1-tensor)
│                                         [traj, time, x, y, components]
└── t2_fields                           ← Tensor fields (2-tensor)
                                          [traj, time, x, y, i, j]
```

### Key Observations:
1. **Top-level groups**: `boundary_conditions`, `dimensions`, `scalars`, `t0_fields`, `t1_fields`, `t2_fields`
2. **Metadata as attributes**: Dataset-level info stored in HDF5 attributes
3. **Field organization by tensor rank**: 
   - t0 = scalars (temperature, pressure)
   - t1 = vectors (velocity)
   - t2 = tensors (stress, strain)
4. **Consistent ordering**: `[trajectory, time, x, y, z, components]`

So for BubbleML data, we want to define these parameters based on the content of the HDF5 files and their associated metadata. We'll define a quick little helper to view the HDF5 structure once we're done.
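
To make the layout above concrete, here is a minimal sketch that writes a toy file with the same top-level groups and field shapes. The sizes, the dataset name `toy_example`, and the zero-filled fields are placeholders chosen for illustration; the file lives only in memory and omits most of the attributes a real Well file carries.

```python
import h5py
import numpy as np

n_traj, n_time, nx, ny = 1, 3, 4, 5  # toy sizes, not the BubbleML ones

# driver="core" with backing_store=False keeps the file in memory only
with h5py.File("toy_well.hdf5", "w", driver="core", backing_store=False) as f:
    # Dataset-level metadata lives in file attributes
    f.attrs["dataset_name"] = "toy_example"
    f.attrs["grid_type"] = "cartesian"
    f.attrs["n_spatial_dims"] = 2
    f.attrs["n_trajectories"] = n_traj

    # Coordinates go in the dimensions group
    dims = f.create_group("dimensions")
    dims.create_dataset("time", data=np.linspace(0.0, 1.0, n_time))
    dims.create_dataset("x", data=np.linspace(0.0, 1.0, nx))
    dims.create_dataset("y", data=np.linspace(0.0, 1.0, ny))

    f.create_group("boundary_conditions")
    f.create_group("scalars")

    # Fields are grouped by tensor rank: [traj, time, x, y, (components)]
    t0 = f.create_group("t0_fields")
    t0.create_dataset("temperature", data=np.zeros((n_traj, n_time, nx, ny), dtype=np.float32))
    t1 = f.create_group("t1_fields")
    t1.create_dataset("velocity", data=np.zeros((n_traj, n_time, nx, ny, 2), dtype=np.float32))
    f.create_group("t2_fields")  # left empty, as in the example above

    top_level = sorted(f.keys())
    temperature_shape = f["t0_fields/temperature"].shape
    velocity_shape = f["t1_fields/velocity"].shape

print(top_level)
print(temperature_shape, velocity_shape)
```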

In [None]:
import h5py

def h5_tree(group, pre=''):
    """Recursively print HDF5 file structure in a tree format.
    
    Args:
        group: HDF5 group or file object
        pre: Prefix string for tree formatting (for recursion)
    
    This is a helpful utility for visualizing HDF5 file structure.
    """
    n_items = len(group)
    for i, (key, item) in enumerate(group.items()):
        is_last = i == n_items - 1
        # The last entry gets a corner connector; earlier entries get a tee
        branch = '└── ' if is_last else '├── '
        if isinstance(item, h5py.Group):
            # It's a group - print its name and recurse into it
            print(pre + branch + key)
            # Keep drawing the vertical line only if more siblings follow
            h5_tree(item, pre + ('    ' if is_last else '│   '))
        else:
            # It's a dataset - show its shape (scalar datasets print as ())
            print(pre + branch + f'{key} {item.shape}')

## Step 3: Examine the Source Data Structure

Now let's look at the structure of the BubbleML data we downloaded.

In [None]:
# Let's look at an example JSON file
# This contains simulation parameters and configuration
!cat {download_path}/R515B/Twall_100.json

In [None]:
# Now let's look at the HDF5 structure
# This contains the actual simulation data (fields)
# Open read-only inside a context manager so the file handle is closed afterwards
with h5py.File(f"{download_path}/R515B/Twall_100.hdf5", 'r') as example_bubble:
    print("BubbleML HDF5 Structure:")
    print("=" * 50)
    h5_tree(example_bubble)

print("\n" + "=" * 50)
print("Key Observations:")
print("=" * 50)
print("- Shape: (2001, 512, 512) = (time, y, x)")
print("  Note: Python uses (y, x) order for images")
print("  Well format uses (x, y) order - we'll need to transpose!")
print("\n- Fields available:")
print("  • dfun: Distance function (gas-liquid interface SDF)")
print("  • temperature: Temperature field")
print("  • velx, vely: Velocity components")
print("  • pressure: Pressure field")
print("  • massflux: Mass flux")
print("  • normx, normy: Interface normals")
print("\n- Coordinates:")
print("  • x_centers, y_centers: Cell centers")
print("  • x_faces, y_faces: Cell faces (one extra point)")

## Step 4: Define the Transformation Function

We can see that each HDF5 file appears to be one trajectory generated under the configuration defined by its JSON file. The trajectory is 2001 steps long and contains snapshots of a 512x512 field, which we will place at the cell centers.

### Fields to Include

Trying to match the settings used in the Bubbleformer paper, we'll include:
- `dfun` (gas-liquid interface SDF)
- `temperature`
- `velocity` (both x and y components)

We won't try to include all of the information - Walrus is supposed to try to infer this from the history after all - but we'll add some of the high level details to the file as an example of how you would do this.

### Important Note on Axis Order

For Euclidean data, the Well canonically stores data in **[x, y, z]** order. 
2D images in Python are usually stored in **[y, x]** format, so we'll need to **transpose** these axes to match other data.
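
The transpose described above can be sketched on a small random array. The sizes and the helper name `to_well_layout` are illustrative assumptions; the operations are the same `np.swapaxes`, leading-axis insertion, and `np.stack` calls used in the conversion function.

```python
import numpy as np

n_time, ny, nx = 3, 4, 5  # toy sizes; BubbleML uses (2001, 512, 512)
rng = np.random.default_rng(0)
velx_tyx = rng.normal(size=(n_time, ny, nx))  # source layout: (time, y, x)
vely_tyx = rng.normal(size=(n_time, ny, nx))


def to_well_layout(field_tyx):
    """Swap the two spatial axes and add a leading trajectory axis."""
    return np.swapaxes(field_tyx, 1, 2)[None]  # (time, y, x) -> (1, time, x, y)


vx = to_well_layout(velx_tyx)
vy = to_well_layout(vely_tyx)

# Vector fields carry their components on a trailing axis
velocity = np.stack([vx, vy], axis=-1)

print(vx.shape, velocity.shape)  # (1, 3, 5, 4) (1, 3, 5, 4, 2)
```

After the swap, `vx[0, t, i, j]` holds the value that the source stored at `[t, j, i]`, which is a quick spot check to run on real data.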

### Boundary Condition Mapping

Boundaries are the main tricky element. BubbleML uses:
- **No-slip conditions**: Models contact between fluid and solid surface → maps to `"WALL"`
- **Outflow BCs**: Flow can continue past this point → maps to `"OPEN"`

The Well uses topological descriptors:
- `"WALL"`: Closed/reflective boundary (solid surface)
- `"OPEN"`: Open boundary (inflow/outflow)
- `"PERIODIC"`: Wrapping boundary (toroidal topology)
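
As a rough sketch of this mapping, the snippet below builds the one-dimensional edge masks for a box with walls on both x edges, a wall on the lower y edge, and an open upper y edge. The `BC_TYPE` lookup, its keys, and the `edge_mask` helper are hypothetical names introduced for illustration; they are not part of BubbleML or the Well.

```python
import numpy as np

# Hypothetical lookup from a source-side description to a Well descriptor
BC_TYPE = {"no_slip": "WALL", "outflow": "OPEN"}


def edge_mask(n_points, lower=False, upper=False):
    """Return an int8 mask along one axis with 1 at the selected edges."""
    mask = np.zeros(n_points, dtype=np.int8)
    if lower:
        mask[0] = 1
    if upper:
        mask[-1] = 1
    return mask


x_wall = edge_mask(8, lower=True, upper=True)  # walls on both x edges
y_wall = edge_mask(6, lower=True)              # heated wall at the bottom
y_open = edge_mask(6, upper=True)              # outflow at the top

print(BC_TYPE["no_slip"], x_wall)
print(BC_TYPE["outflow"], y_open)
```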

In [None]:
import glob
import json
import numpy as np

def translate_bubble(hdf5_path, json_path, subname, out_path):
    """Convert a single BubbleML HDF5 file to Well format.
    
    Args:
        hdf5_path: Path to source HDF5 file (contains field data)
        json_path: Path to corresponding JSON file (contains metadata)
        subname: Material name (e.g., 'R515B', 'FC72')
        out_path: Output directory for Well-formatted file
    
    The function performs these key transformations:
    1. Extracts fields from source HDF5
    2. Transposes spatial axes from (y, x) to (x, y)
    3. Organizes into Well format groups
    4. Adds appropriate metadata attributes
    """
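    # Extract temperature from filename (e.g., "Twall_100.hdf5" -> 100)
    # Parse only the basename so underscores in parent directories don't matter
    baseT = int(os.path.splitext(os.path.basename(hdf5_path))[0].split('_')[-1])
    
    # Load JSON metadata (the context manager closes the file handle)
    with open(json_path) as jf:
        json_file = json.load(jf)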
    
    # ========================================================================
    # STEP 1: Extract and transform field data from source HDF5
    # ========================================================================
    with h5py.File(hdf5_path, 'r') as f:
        # Extract gas-liquid interface distance function
        # Original: (T, Y, X) -> swapaxes(1,2) -> (T, X, Y) -> add batch dim -> (1, T, X, Y)
        gas_interface_sdf = np.swapaxes(f['dfun'][:], 1, 2)[None]
        
        # Extract velocity components and transpose
        # Original: (T, Y, X) -> (T, X, Y) -> (1, T, X, Y)
        vx = np.swapaxes(f['velx'][:], 1, 2)[None]
        vy = np.swapaxes(f['vely'][:], 1, 2)[None]
        # Stack into velocity vector: (1, T, X, Y, 2)
        vel = np.stack([vx, vy], -1)
        
        # Extract temperature field
        temp = np.swapaxes(f['temperature'][:], 1, 2)[None]
        
        # Extract spatial coordinates
        # Use cell centers (not faces) for field values
        x_coords = f['x_centers'][:]
        y_coords = f['y_centers'][:]
        
        # Create time coordinate array
        # One entry per stored snapshot (2001 here), evenly spaced from 0 to t_final
        t = np.linspace(0, json_file["t_final"], gas_interface_sdf.shape[1], endpoint=True)

    # ========================================================================
    # STEP 2: Create Well-formatted HDF5 file
    # ========================================================================
    outpath = os.path.join(out_path, f"bubbleML_PoolBoiling-Subcooled_{subname}_T{baseT}.hdf5")
    with h5py.File(outpath, 'w') as f:
        
        # ====================================================================
        # 2a. Top-level metadata (stored as file attributes)
        # ====================================================================
        f.attrs["dataset_name"] = "bubbleML_PoolBoiling-Subcooled"
        
        # Grid type determines which augmentations can be applied
        # Options: 'cartesian', 'cylindrical', 'spherical'
        f.attrs["grid_type"] = "cartesian"
        
        f.attrs["n_spatial_dims"] = 2  # 2D simulation
        f.attrs["n_trajectories"] = 1  # Each file contains one trajectory

        # Store simulation parameters used in BubbleML benchmarks
        # These describe the physical properties and numerical settings
        f.attrs["simulation_parameters"] = [
            "inv_reynolds",       # 1/Reynolds number (viscosity)
            "cpgas",              # Specific heat capacity of gas
            "mugas",              # Dynamic viscosity of gas
            "rhogas",             # Density of gas
            "thcogas",            # Thermal conductivity of gas
            "stefan",             # Stefan number (phase change)
            "prandtl",            # Prandtl number (momentum/thermal diffusion)
            "heater-nucWaitTime", # Nucleation wait time
            "heater-wallTemp"     # Wall temperature
        ]
        
        # Store each parameter value as an attribute
        f.attrs["inv_reynolds"] = json_file["inv_reynolds"]
        f.attrs["cpgas"] = json_file["cpgas"]
        f.attrs["mugas"] = json_file["mugas"]
        f.attrs["rhogas"] = json_file["rhogas"]
        f.attrs["thcogas"] = json_file["thcogas"]
        f.attrs["stefan"] = json_file["stefan"]
        f.attrs["prandtl"] = json_file["prandtl"]
        f.attrs["heater-nucWaitTime"] = json_file["heater"]["nucWaitTime"]
        f.attrs["heater-wallTemp"] = json_file["heater"]["wallTemp"]

        # ====================================================================
        # 2b. Spatial/temporal dimensions
        # ====================================================================
        dims = f.create_group("dimensions")
        dims.attrs["spatial_dims"] = ["x", "y"]  # Dimension names
        
        # Time dimension
        time = dims.create_dataset("time", data=t)
        time.attrs["time_varying"] = True   # Time varies over time (always true for time)
        time.attrs["sample_varying"] = False  # Same time grid for all trajectories
        
        # X spatial dimension
        x = dims.create_dataset("x", data=x_coords)
        x.attrs["time_varying"] = False  # Fixed grid (not a moving mesh)
        x.attrs["sample_varying"] = False  # Same grid for all trajectories
        
        # Y spatial dimension
        y = dims.create_dataset("y", data=y_coords)
        y.attrs["time_varying"] = False
        y.attrs["sample_varying"] = False

        # ====================================================================
        # 2c. Boundary conditions
        # ====================================================================
        # BCs are defined as masks indicating where each BC type applies
        boundaries = f.create_group("boundary_conditions")
        
        # X-direction: No-slip walls on both sides
        x_wall_noslip = boundaries.create_group("x_wall_noslip")
        xmask = np.zeros(x_coords.shape[0])  # Start with all zeros
        xmask[0] = 1   # Mark first index (lower boundary)
        xmask[-1] = 1  # Mark last index (upper boundary)
        x_wall_noslip.create_dataset("mask", data=xmask, dtype=np.int8)
        x_wall_noslip.attrs["bc_type"] = "WALL"  # Solid wall
        x_wall_noslip.attrs["associated_dims"] = ["x"]  # Applies to x-dimension
        x_wall_noslip.attrs["associated_fields"] = []  # No specific fields
        x_wall_noslip.attrs["sample_varying"] = False
        x_wall_noslip.attrs["time_varying"] = False
        
        # Y-direction lower: No-slip wall (heated surface)
        y_wall_noslip = boundaries.create_group("y_wall_noslip")
        ymask = np.zeros(y_coords.shape[0])
        ymask[0] = 1  # Only lower boundary is a wall
        y_wall_noslip.create_dataset("mask", data=ymask, dtype=np.int8)
        y_wall_noslip.attrs["bc_type"] = "WALL"
        y_wall_noslip.attrs["associated_dims"] = ["y"]
        y_wall_noslip.attrs["associated_fields"] = []
        y_wall_noslip.attrs["sample_varying"] = False
        y_wall_noslip.attrs["time_varying"] = False

        # Y-direction upper: Open (outflow)
        y_open = boundaries.create_group("y_open")
        ymask = np.zeros(y_coords.shape[0])
        ymask[-1] = 1  # Only upper boundary is open
        y_open.create_dataset("mask", data=ymask, dtype=np.int8)
        y_open.attrs["bc_type"] = "OPEN"  # Outflow boundary
        y_open.attrs["associated_dims"] = ["y"]
        y_open.attrs["associated_fields"] = []
        y_open.attrs["sample_varying"] = False
        y_open.attrs["time_varying"] = False

        # ====================================================================
        # 2d. Scalars (simulation parameters as datasets)
        # ====================================================================
        # These are 0-dimensional quantities (single numbers)
        # For most use cases in Walrus, these are unused metadata
        scalars = f.create_group("scalars")
        scalars.attrs["field_names"] = [
            "inv_reynolds", "cpgas", "mugas", "rhogas", "thcogas",
            "stefan", "prandtl", "heater-nucWaitTime", "heater-wallTemp"
        ]
        
        # Create a dataset for each scalar parameter
        # These are all constant (don't vary with time or trajectory)
        for param_name in ["inv_reynolds", "cpgas", "mugas", "rhogas", "thcogas",
                           "stefan", "prandtl"]:
            dset = scalars.create_dataset(param_name, data=json_file[param_name])
            dset.attrs["sample_varying"] = False
            dset.attrs["time_varying"] = False
        
        # Heater parameters come from nested JSON structure
        heater_nucWaitTime = scalars.create_dataset(
            "heater-nucWaitTime", data=json_file["heater"]["nucWaitTime"]
        )
        heater_nucWaitTime.attrs["sample_varying"] = False
        heater_nucWaitTime.attrs["time_varying"] = False

        heater_wallTemp = scalars.create_dataset(
            "heater-wallTemp", data=json_file["heater"]["wallTemp"]
        )
        heater_wallTemp.attrs["sample_varying"] = False
        heater_wallTemp.attrs["time_varying"] = False

        # ====================================================================
        # 2e. Fields - THE ACTUAL DATA WE'RE MODELING
        # ====================================================================
        
        # t0_fields: Scalar (0-tensor) fields
        # Shape: [trajectory, time, x, y]
        t0 = f.create_group("t0_fields")
        t0.attrs["field_names"] = ["gas-interface-sdf", "temperature"]
        
        # Gas-liquid interface signed distance function
        # Positive = gas phase, negative = liquid phase, zero = interface
        sdf_dset = t0.create_dataset("gas-interface-sdf", data=gas_interface_sdf, dtype=np.float32)
        sdf_dset.attrs["dim_varying"] = np.array([True, True])  # Varies over both x and y
        sdf_dset.attrs["sample_varying"] = True   # Different for each trajectory
        sdf_dset.attrs["time_varying"] = True     # Evolves over time

        # Temperature field
        temp_dset = t0.create_dataset("temperature", data=temp, dtype=np.float32)
        temp_dset.attrs["dim_varying"] = np.array([True, True])
        temp_dset.attrs["sample_varying"] = True
        temp_dset.attrs["time_varying"] = True

        # t1_fields: Vector (1-tensor) fields
        # Shape: [trajectory, time, x, y, components]
        # Components dimension has size = n_spatial_dims (2 for 2D velocity)
        t1 = f.create_group("t1_fields")
        t1.attrs["field_names"] = ["velocity"]
        
        # Velocity vector field (v_x, v_y)
        v_dset = t1.create_dataset("velocity", data=vel, dtype=np.float32)
        v_dset.attrs["dim_varying"] = np.array([True, True])  # Varies over x and y
        v_dset.attrs["sample_varying"] = True
        v_dset.attrs["time_varying"] = True

        # t2_fields: Tensor (2-tensor) fields
        # Shape: [trajectory, time, x, y, i, j]
        # Would include stress tensors, strain tensors, etc.
        # We don't have any 2-tensor fields in this dataset
        t2 = f.create_group("t2_fields")
        t2.attrs["field_names"] = []
    
    print(f"✓ Created: {outpath}")

## Step 5: Convert All Files

Now that we have our files and our transformation definition, let's apply the conversion to all downloaded data!

We'll organize the output into standard train/valid/test splits.
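
Before running the conversion, it can help to confirm which HDF5 files actually have a JSON partner and what wall temperature each name encodes. The sketch below does that with `pathlib` on empty placeholder files. The helper names are hypothetical, and the `Twall_<integer>` naming is an assumption based on the example file `Twall_100.hdf5` shown earlier.

```python
import re
import tempfile
from pathlib import Path


def wall_temperature(path):
    """Parse the integer wall temperature from a name like 'Twall_100.hdf5'."""
    match = re.fullmatch(r"Twall_(\d+)", Path(path).stem)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1))


def paired_files(material_dir):
    """Return (hdf5, json) path pairs for files that have both halves."""
    pairs = []
    for hdf5_path in sorted(Path(material_dir).glob("*.hdf5")):
        json_path = hdf5_path.with_suffix(".json")
        if json_path.exists():
            pairs.append((hdf5_path, json_path))
    return pairs


# Demo on empty placeholder files; Twall_97 has no JSON and is skipped
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Twall_100.hdf5", "Twall_100.json", "Twall_97.hdf5"]:
        (Path(tmp) / name).touch()
    pairs = paired_files(tmp)
    names = [(h.name, j.name) for h, j in pairs]
    temps = [wall_temperature(h) for h, _ in pairs]

print(names, temps)
```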

In [None]:
# Clean up any existing processed data
!rm -rf {processed_path}/data/

# Create directory structure for train/valid/test splits
!mkdir -p {processed_path}/data/train/
!mkdir -p {processed_path}/data/valid/
!mkdir -p {processed_path}/data/test/

# Process both materials
materials = ["FC72", "R515B"]

print("Converting BubbleML data to Well format...")
print("=" * 60)

for material in materials:
    print(f"\nProcessing {material}...")
    
    # Get all HDF5 files for this material
    base_path = os.path.join(download_path, material)
    out_path = f"{processed_path}/data/train/"
    files = glob.glob(base_path + '/*.hdf5')
    
    # Convert each file
    for file in files:
        hdf5_path = file
        json_path = file.replace('.hdf5', '.json')
        
        # Verify both files exist
        if not (os.path.exists(hdf5_path) and os.path.exists(json_path)):
            print(f"  ⚠ Warning: Missing files for {file}")
            continue
        
        # Perform conversion
        translate_bubble(hdf5_path, json_path, material, out_path)

print("\n" + "=" * 60)
print("✓ Conversion complete!")

## Step 6: Create Validation and Test Splits

The files should now be transformed! The problem now is we don't have separate validation and test sets yet.

In the experiments in the Walrus paper, only 2 wall temperatures were used for validation/testing per material, so we'll repeat this pattern here. 

**Note**: If you're doing extensive hyperparameter tuning on validation, you'll generally want to avoid having identical validation and test sets, but we're using this pattern to match the original experimental settings.
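
The cell below uses the `!mv` and `!cp` shell magics, which only work inside IPython on a Unix-like system. A portable variant using `shutil` might look like the following sketch. `make_splits` is a hypothetical helper, demonstrated here on empty placeholder files in a temporary directory rather than on the converted data.

```python
import shutil
import tempfile
from pathlib import Path


def make_splits(data_dir, held_out):
    """Move held-out files from train/ to valid/ and copy them to test/.

    Returns the names that were not found in train/.
    """
    data_dir = Path(data_dir)
    for split in ("train", "valid", "test"):
        (data_dir / split).mkdir(parents=True, exist_ok=True)

    missing = []
    for name in held_out:
        src = data_dir / "train" / name
        if not src.exists():
            missing.append(name)
            continue
        dst = data_dir / "valid" / name
        shutil.move(str(src), str(dst))              # train/ -> valid/
        shutil.copy2(dst, data_dir / "test" / name)  # valid/ -> test/ (same file)
    return missing


# Demo with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    train.mkdir()
    for name in ["a.hdf5", "b.hdf5", "c.hdf5"]:
        (train / name).touch()
    missing = make_splits(tmp, ["b.hdf5", "zzz.hdf5"])
    counts = {s: len(list((Path(tmp) / s).iterdir())) for s in ("train", "valid", "test")}

print(counts, missing)
```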

In [None]:
# Specify which files to use for validation and test
# These are specific wall temperatures chosen to be held out
valid_files = [
    "bubbleML_PoolBoiling-Subcooled_FC72_T107.hdf5",   # FC72 at 107°C
    "bubbleML_PoolBoiling-Subcooled_FC72_T97.hdf5",    # FC72 at 97°C
    "bubbleML_PoolBoiling-Subcooled_R515B_T30.hdf5",   # R515B at 30°C
    "bubbleML_PoolBoiling-Subcooled_R515B_T40.hdf5"    # R515B at 40°C
]

# Use same files for test (following original paper)
test_files = [
    "bubbleML_PoolBoiling-Subcooled_FC72_T107.hdf5",
    "bubbleML_PoolBoiling-Subcooled_FC72_T97.hdf5",
    "bubbleML_PoolBoiling-Subcooled_R515B_T30.hdf5",
    "bubbleML_PoolBoiling-Subcooled_R515B_T40.hdf5"
]

print("Creating validation and test splits...")
print("=" * 60)

# Move validation files from train to valid directory
print("\nMoving validation files...")
for vf in valid_files:
    src = f"{processed_path}/data/train/{vf}"
    dst = f"{processed_path}/data/valid/{vf}"
    if os.path.exists(src):
        !mv {src} {dst}
        print(f"  ✓ {vf}")
    else:
        print(f"  ⚠ Warning: {vf} not found")

# Copy validation files to test directory
print("\nCopying test files...")
for tf in test_files:
    src = f"{processed_path}/data/valid/{tf}"
    dst = f"{processed_path}/data/test/{tf}"
    if os.path.exists(src):
        !cp {src} {dst}
        print(f"  ✓ {tf}")
    else:
        print(f"  ⚠ Warning: {tf} not found")

print("\n" + "=" * 60)
print("✓ Data split complete!")
print(f"\nFinal structure:")
print(f"  train/: {len(os.listdir(processed_path + '/data/train/'))} files")
print(f"  valid/: {len(os.listdir(processed_path + '/data/valid/'))} files")
print(f"  test/:  {len(os.listdir(processed_path + '/data/test/'))} files")

## Step 7: Verify the Conversion

Perfect! The Well contains a number of utilities to help verify transformed data. In particular, there is the [format checker tool](https://github.com/PolymathicAI/the_well/blob/master/scripts/check_thewell_formatting.py).

However, here we're going to assume the formatting is OK and verify by loading the data directly with Walrus's dataset class.
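
If you want a quick sanity check before loading, something like the sketch below catches the most common layout mistakes. It is far less thorough than the official format checker and is not a substitute for it: `check_layout` is a hypothetical helper that only confirms the six top-level groups exist and that fields marked `time_varying` have a time axis matching `dimensions/time`.

```python
import h5py
import numpy as np

REQUIRED_GROUPS = (
    "boundary_conditions", "dimensions", "scalars",
    "t0_fields", "t1_fields", "t2_fields",
)


def check_layout(f):
    """Return a list of human-readable problems found in an open HDF5 file."""
    problems = [f"missing group: {g}" for g in REQUIRED_GROUPS if g not in f]
    if "dimensions" not in f or "time" not in f["dimensions"]:
        return problems + ["missing dimensions/time"]

    n_time = f["dimensions/time"].shape[0]
    for group in ("t0_fields", "t1_fields", "t2_fields"):
        if group not in f:
            continue
        for name, dset in f[group].items():
            # Only datasets explicitly marked as time-varying are checked
            if not isinstance(dset, h5py.Dataset) or not dset.attrs.get("time_varying", False):
                continue
            if dset.ndim < 2 or dset.shape[1] != n_time:
                problems.append(
                    f"{group}/{name}: shape {dset.shape} does not have {n_time} time steps on axis 1"
                )
    return problems


# Demo: an in-memory file that omits t2_fields and has one mis-shaped field
with h5py.File("check_demo.hdf5", "w", driver="core", backing_store=False) as f:
    for g in REQUIRED_GROUPS[:-1]:
        f.create_group(g)
    f["dimensions"].create_dataset("time", data=np.arange(3.0))
    good = f["t0_fields"].create_dataset("temperature", data=np.zeros((1, 3, 4, 5)))
    good.attrs["time_varying"] = True
    bad = f["t1_fields"].create_dataset("velocity", data=np.zeros((1, 2, 4, 5, 2)))
    bad.attrs["time_varying"] = True
    problems = check_layout(f)

for p in problems:
    print(p)
```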

In [None]:
from walrus.data.inflated_dataset import BatchInflatedWellDataset

# Load the training dataset
# This will read all HDF5 files in the train/ directory
print("Loading Well-formatted dataset...")
dataset = BatchInflatedWellDataset(
    path=f"{processed_path}",          # Base path containing data/ folder
    well_split_name="train",           # Use train split
    normalization_type=None            # No normalization for now
)

print(f"✓ Dataset loaded successfully!")
print(f"  Total samples: {len(dataset)}")
print(f"  Fields available: {dataset.metadata.field_names}")

## Step 8: Visualize a Sample

Let's load a sample and visualize the temperature field to verify everything looks correct.
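
The cell below picks the temperature channel by its hard-coded index. Looking the channel up by name is less fragile if the field order ever changes. The sketch here uses a plain list of names and a zero-filled stand-in array. Both are assumptions for illustration (including the `velocity_x` and `velocity_y` names), and the actual structure of `dataset.metadata.field_names` may differ, so inspect it before adapting this.

```python
import numpy as np


def select_field(fields, field_names, name, step=0):
    """Return the [x, y] slice of one named channel at one time step.

    Assumes `fields` is laid out as [time, x, y, channels].
    """
    channel = list(field_names).index(name)
    return fields[step, ..., channel]


# Stand-in data: 2 time steps on an 8 x 6 grid with 4 channels
field_names = ["gas-interface-sdf", "temperature", "velocity_x", "velocity_y"]
fields = np.zeros((2, 8, 6, len(field_names)))
fields[..., 1] = 300.0  # fill the temperature channel with a constant

temperature = select_field(fields, field_names, "temperature")
image = temperature.T  # imshow expects [row (y), column (x)]

print(temperature.shape, image.shape)  # (8, 6) (6, 8)
```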

In [None]:
import matplotlib.pyplot as plt

# Load sample 1000 from the dataset
samples = dataset[1000]

print("Sample structure:")
print(f"  Keys: {samples.keys()}")
print(f"  input_fields shape: {samples['input_fields'].shape}")
print(f"  Field order: {dataset.metadata.field_names}")

# Extract temperature field from first timestep
# Shape: [time, x, y, fields]
# Temperature is field index 1 (after gas-interface-sdf)
temperature = samples["input_fields"][0, ..., 1].squeeze()

# Visualize
plt.figure(figsize=(10, 8))
img = plt.imshow(temperature.T, origin='lower', cmap='hot')  # Transpose for visualization
plt.colorbar(img, label='Temperature')
plt.title('Temperature Field - BubbleML Pool Boiling\n(Sample 1000, first input step)', fontsize=14)
plt.xlabel('X coordinate')
plt.ylabel('Y coordinate')
plt.tight_layout()
plt.show()

print("\n✓ Visualization complete!")
print("  You should see the heated bottom boundary and cooler fluid above.")

## Summary and Next Steps

Congratulations! You've successfully converted the BubbleML dataset to Well format.

### What we did:
1. ✓ Downloaded BubbleML data from HuggingFace
2. ✓ Examined source data structure (HDF5 + JSON)
3. ✓ Defined transformation function to Well format
4. ✓ Handled axis transposition (y,x) → (x,y)
5. ✓ Mapped boundary conditions (no-slip → WALL, outflow → OPEN)
6. ✓ Organized fields by tensor rank (t0, t1, t2)
7. ✓ Created train/valid/test splits
8. ✓ Verified with Walrus dataset loader

### Key takeaways:
- **Well format is hierarchical**: dimensions, boundaries, scalars, fields
- **Fields organized by rank**: t0 (scalars), t1 (vectors), t2 (tensors)
- **Metadata is crucial**: Attributes describe varying properties
- **Axis order matters**: Well uses (x, y, z), Python images use (y, x)
- **BCs are topological**: WALL, OPEN, PERIODIC describe connectivity

### Next steps:

**To use this in a Walrus model**, check out the next [notebook](walrus_example_1_RunningWalrus.ipynb).

**To define a config file** for use in the full Walrus training codebase, check out the config file [here](https://github.com/PolymathicAI/walrus/blob/walrus/configs/data/bubbleml_poolboil_subcool.yaml).

**To convert your own data**, follow this pattern:
1. Examine your source data structure
2. Map your fields to Well's t0/t1/t2 groups
3. Map your boundary conditions to WALL/OPEN/PERIODIC
4. Transpose axes to match (x, y, z) order
5. Add appropriate metadata attributes
6. Verify with format checker or dataset loader