<img src="images/nvidia_header.png" style="margin-left: -30px; width: 300px; float: left;">

## PhysicsNeMo External Aerodynamics DLI

## Notebook 1 - Preprocessing Ahmed body *surface* dataset

### Introduction

For educational purposes, it is important to use lightweight datasets that are easy to store and manage, especially for users who may not have access to high-performance computing resources. One such dataset is the **Ahmed body surface data**, which includes 3D surface geometry, pressure, and wall shear stress data for variations in the Ahmed body geometry and inlet Reynolds number. The dataset is relatively small, yet it provides valuable information about aerodynamic simulations, which makes it well suited to teaching and experimentation without demanding excessive storage or computational power. *Note that this dataset was created by the NVIDIA PhysicsNeMo development team and differs from other similar datasets hosted on cloud platforms like AWS.* The complete Ahmed body surface dataset is hosted on NGC and is accessible from the following link:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/physicsnemo/resources/physicsnemo_ahmed_body_dataset

**Before proceeding, navigate to the data directory and make sure that the Ahmed body dataset is downloaded properly. The folder structure should be:**
```bash
physicsnemo_ahmed_body_dataset_vv1/dataset
├── train/
├── train_info/
├── train_stl_files/
├── validation/
├── validation_info/
├── validation_stl_files/
├── test/
├── test_info/
├── test_stl_files/
```
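The folder check can also be scripted. The snippet below is a minimal sketch, not part of the original notebook: `EXPECTED_SUBDIRS` and `find_missing_subdirs` are hypothetical names introduced here, and the root path is the one used later in this notebook, so adjust it to your own download location.

```python
from pathlib import Path

# Subfolders expected under the dataset root, taken from the tree above.
EXPECTED_SUBDIRS = [
    f"{split}{suffix}"
    for split in ("train", "validation", "test")
    for suffix in ("", "_info", "_stl_files")
]


def find_missing_subdirs(dataset_root):
    """Return the expected subfolder names that are absent under dataset_root."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Assumed location; this is the path configured later in the notebook.
missing = find_missing_subdirs("/workspace/physicsnemo_ahmed_body_dataset_vv1/dataset")
print("All folders present" if not missing else f"Missing folders: {missing}")
```

An empty result means all nine folders were found; any names that are printed point to an incomplete download or a wrong root path.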


In this notebook, we will walk through the preprocessing steps required to prepare the **Ahmed body surface dataset** for training with the **DoMINO model**, which predicts surface quantities such as pressure and wall shear stress.  

For **surface training**, the DoMINO model requires **3D surface geometry** in both **STL** and **VTP (VTK PolyData)** formats:  

- **STL (Stereolithography)**  
  - A widely used file format for representing 3D surface geometry in computer-aided design (CAD) applications.  
  - Describes the surface as a collection of triangular facets, suitable for computational geometry, 3D printing, and mesh-based simulations.  

- **VTP (VTK PolyData)**  
  - A format from the **VTK (Visualization Toolkit)** that stores surface data as **PolyData**, representing points, lines, and polygons on the surface.  
  - Commonly used in CFD and physics-informed simulations.  

Both data formats are required for DoMINO surface training:  

- **STL files** provide the 3D geometry information.  
- **VTP files** store additional surface quantities such as pressure and wall shear stress.  

To make the dataset easier to use in machine learning workflows and to load onto the GPU, the **data is converted into NumPy arrays (NPY format)**:  

- Allows efficient numerical operations and faster computations.  
- Facilitates convenient storage on disk, making the data readily accessible for training and further analysis.
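Each processed case in this notebook is a Python dictionary of arrays rather than a single array, so it helps to see how such an object round-trips through a `.npy` file. The following is a small sketch with made-up values and a hypothetical file name: `np.save` pickles the dictionary into a 0-d object array, and reading it back requires `allow_pickle=True` followed by `.item()`.

```python
import os
import tempfile

import numpy as np

# Toy sample with the same kind of mixed content as a processed case:
# a float array, a string, and a None placeholder. The values are made up.
sample = {
    "surface_fields": np.zeros((4, 4), dtype=np.float32),
    "filename": "case_example.vtp",  # hypothetical file name
    "volume_fields": None,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, "sample.npy")
    # np.save pickles a dict into a 0-d object array.
    np.save(out_file, sample)
    # Loading pickled objects must be enabled explicitly; .item() unwraps the dict.
    restored = np.load(out_file, allow_pickle=True).item()

print(sorted(restored))
print(restored["surface_fields"].shape, restored["surface_fields"].dtype)
```

Because unpickling can execute arbitrary code, `allow_pickle=True` should only be used on files you created yourself or otherwise trust.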

## Table of Contents
- [Step 1: Define Experiment Parameters and Dependencies](#step-1-define-experiment-parameters-and-dependencies)
  - [Loading Required Libraries](#loading-required-libraries)
  - [Dependencies](#dependencies)
  - [Experiment Parameters and Variables](#experiment-parameters-and-variables)
- [Step 2: Conversion Pipeline: VTP + STL + Global Parameters → NumPy](#step-2-conversion-pipeline-vtp--stl--global-parameters-→-numpy)

### **Step 1: Define Experiment Parameters and Dependencies**

The first step in training the DoMINO model on the Ahmed body surface dataset is to set up our experiment environment and define the necessary parameters. This includes specifying paths to our data, configuring training settings, and ensuring all required libraries are available.

Key components we need to set up:
- Data paths for training and validation sets
- Model hyperparameters and training configurations
- Visualization settings for results
- Required Python libraries for mesh processing and deep learning

### Loading Required Libraries

Before we proceed with the experiment setup, let's first import all the necessary libraries. These libraries will be used for:
- Mesh processing and visualization (`vtk`, `pyvista`)
- Data handling and file operations (`pathlib`, `concurrent.futures`)
- Progress tracking and visualization (`tqdm`, `matplotlib`)
- Dataset abstraction (`torch.utils.data.Dataset`), a PyTorch data primitive that stores samples and their corresponding labels and works with both pre-loaded datasets and your own data
- Utilities for data processing, training, and testing DoMINO (`physicsnemo.utils.domino.utils`)

### Dependencies
Ensure that the required Python libraries are installed:

```bash
pip install numpy pyvista vtk matplotlib tqdm numpy-stl
apt update
apt install -y xvfb
```

In [1]:
import os
import time
import random
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import Union

import numpy as np
import pyvista as pv
import vtk
from stl import mesh
from tqdm import tqdm

from physicsnemo.utils.domino.utils import *
from torch.utils.data import Dataset



### Experiment Parameters and Variables

In this section, we set up all the essential parameters and variables required for the Ahmed body experiment. 

In [2]:
# Directory and Path Configuration
DATA_DIR = Path("/workspace/physicsnemo_ahmed_body_dataset_vv1/dataset")  # Root directory for dataset

# Physical Variables
VOLUME_VARS = ["p"]  # Volume variables to predict (pressure)
SURFACE_VARS = ["p", "wallShearStress"]  # Surface variables to predict
GLOBAL_PARAMS_TYPES = {"inlet_velocity": "vector", "air_density": "scalar"}
GLOBAL_PARAMS_REFERENCE = {"inlet_velocity": [50.0], "air_density": 1.226}

* The function `setup_environment` prepares the environment for processing the Ahmed Body dataset.

* It generates structured paths for each split: **train**, **validation**, **test**.

* Each split has:

    * Raw **VTP files** for surface simulation data.
    
    * **Info files** for global parameters like inlet velocity.
    
    * **STL files** for 3D geometry.
    
    * Output folder for **processed NumPy surface data**.

* Printing the paths ensures the user can quickly verify that all directories are correctly set up.

The function is defined as follows:


In [3]:
def setup_environment(data_dir: Union[str, Path]):
    """
    Sets up the working environment by defining the folder paths for training, validation, and test splits.
    
    This function helps organize the dataset for preprocessing and downstream training.

    Args:
        data_dir: Root directory of the dataset.
    
    Returns:
        dataset_paths: Dict with paths to the raw VTP files for each split.
        info_paths: Dict with paths to global parameter info files.
        stl_paths: Dict with paths to STL geometry files.
        surface_paths: Dict with paths to save processed NumPy surface data.
    """
    print("=== Environment Setup ===")
    print(f"Current data directory: {data_dir}")

    # Paths to raw VTP/mesh data
    dataset_paths = {split: os.path.join(data_dir, split) for split in ["train", "validation", "test"]}
    
    # Paths to global parameter info files (text files)
    info_paths = {k: os.path.join(data_dir, f"{k}_info") for k in dataset_paths}
    
    # Paths to STL files
    stl_paths = {k: os.path.join(data_dir, f"{k}_stl_files") for k in dataset_paths}
    
    # Paths to save processed surface data as NumPy arrays
    surface_paths = {k: os.path.join(data_dir, f"{k}_prepared_surface_data") for k in dataset_paths}

    # Print all paths for confirmation
    print("\nConfigured directory paths:")
    for split in dataset_paths:
        print(f"  {split.capitalize()} data: {dataset_paths[split]}")
        print(f"  {split.capitalize()} info: {info_paths[split]}")
        print(f"  {split.capitalize()} STL: {stl_paths[split]}")
        print(f"  {split.capitalize()} prepared surface data: {surface_paths[split]}\n")

    return dataset_paths, info_paths, stl_paths, surface_paths

Let's set up the environment paths:

In [4]:
dataset_paths, info_paths, stl_paths, surface_paths = setup_environment(DATA_DIR)

=== Environment Setup ===
Current data directory: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset

Configured directory paths:
  Train data: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/train
  Train info: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/train_info
  Train STL: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/train_stl_files
  Train prepared surface data: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/train_prepared_surface_data

  Validation data: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/validation
  Validation info: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/validation_info
  Validation STL: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/validation_stl_files
  Validation prepared surface data: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/validation_prepared_surface_data

  Test data: /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/test
  Test info: /workspace/physicsnemo_ahmed_body_dataset_

### Step 2: Conversion Pipeline: VTP + STL + Global Parameters → NumPy

This code block defines a pipeline to convert 3D surface CFD data of the Ahmed body into a **NumPy-based format** suitable for training with the **DoMINO model**. The pipeline combines:

- **VTP (VTK PolyData)**: surface simulation results (pressure, wall shear stress)
- **STL**: 3D surface geometry
- **Global parameters**: inlet velocity (from the info files)
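The three inputs for a case are tied together by file name: the dataset class in this notebook looks up `<stem>.stl` and `<stem>_info.txt` for each VTP file. As a rough sketch of that convention, the hypothetical helpers below (`companion_paths` and `cases_with_missing_inputs` are names introduced here, not part of PhysicsNeMo) list VTP files whose companions are missing, which can be worth checking before a long conversion run.

```python
from pathlib import Path


def companion_paths(vtp_name, stl_dir, info_dir):
    """Return the STL and info paths that share the stem of a VTP file name."""
    stem = Path(vtp_name).stem
    return Path(stl_dir) / f"{stem}.stl", Path(info_dir) / f"{stem}_info.txt"


def cases_with_missing_inputs(vtp_dir, stl_dir, info_dir):
    """List VTP file names whose STL or info companion file is missing."""
    incomplete = []
    for vtp_file in sorted(Path(vtp_dir).glob("*.vtp")):
        stl_file, info_file = companion_paths(vtp_file.name, stl_dir, info_dir)
        if not (stl_file.is_file() and info_file.is_file()):
            incomplete.append(vtp_file.name)
    return incomplete
```

With the paths configured earlier, a call such as `cases_with_missing_inputs(dataset_paths["train"], stl_paths["train"], info_paths["train"])` would be expected to return an empty list for a complete download.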

### Pipeline Steps

- **Dataset Initialization (`OpenFoamAhmedBodySurfaceDataset`)**
  - Takes paths for VTP data, STL files, and info files.
  - Reads all filenames in the dataset folder and shuffles them.
  - Specifies which surface and volume variables to extract.
  
- **Load Data (`__getitem__`)**
  - For a given index:
    - Reads the CFD VTP file using `vtkXMLPolyDataReader`.
    - Reads the corresponding STL file using `pyvista` to get geometry:
      - Node coordinates (`points`)
      - Cell centers
      - Face connectivity
      - Cell areas
    - Reads the inlet velocity from the info text file corresponding to each case. The velocity is used to normalize the surface fields and is stored in the returned dictionary under `"global_params_values"`.
      ```python
      with open(info_path, "r") as file:
          velocity = next(float(line.split(":")[1].strip()) for line in file if "Velocity" in line)
      # later added to the returned dictionary via:
      # "global_params_values": global_params_values,
      ```
    - Extracts the surface fields (e.g., pressure, wall shear stress) and normalizes them by `air_density * velocity**2`.
    - Computes surface normals and area for each mesh cell.
    - Returns a dictionary containing:
      ```python
      {
        "stl_coordinates": ...,
        "stl_centers": ...,
        "stl_faces": ...,
        "stl_areas": ...,
        "surface_mesh_centers": ...,
        "surface_normals": ...,
        "surface_areas": ...,
        "volume_fields": None,
        "volume_mesh_centers": None,
        "surface_fields": ...,
        "filename": ...,
        "global_params_values": ...,
        "global_params_reference": ...
      }
      ```

- **File Processing (`process_file`)**
  - Saves each sample as a `.npy` file for efficient storage and fast loading.
  - Skips a file if its output already exists or if the input file is missing or empty.

- **Batch Processing (`process_surface_data_batch`)**
  - Iterates over all datasets and corresponding STL/info paths.
  - Creates output directories for the converted `.npy` files.
  - Uses a **ProcessPoolExecutor** for parallel processing of all files.
  - Progress is tracked with `tqdm`.
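The velocity lookup and the field normalization from the steps above can be tried in isolation. This is a minimal sketch under stated assumptions: the info-file text and field values are made up, the `key: value` line layout is inferred from the parsing code in this notebook, and `read_inlet_velocity` and `normalize_surface_fields` are hypothetical helper names. The scaling divides every surface field by $\rho U^2$, which is what the dataset class does with `air_density * velocity**2`.

```python
import numpy as np


def read_inlet_velocity(info_text):
    """Return the value from the first line that mentions 'Velocity'."""
    for line in info_text.splitlines():
        if "Velocity" in line:
            return float(line.split(":")[1].strip())
    raise ValueError("no 'Velocity' entry found")


def normalize_surface_fields(fields, velocity, air_density):
    """Divide pressure and wall shear stress by rho * U**2."""
    return np.asarray(fields, dtype=np.float32) / (air_density * velocity**2)


# Made-up info-file content and field values, for illustration only.
info_text = "Length: 1.0\nVelocity: 40.0\n"
velocity = read_inlet_velocity(info_text)

# One surface cell; columns: p, wall shear stress (x, y, z).
raw_fields = np.array([[-490.4, 1.2, 0.0, -0.3]])
scaled = normalize_surface_fields(raw_fields, velocity, air_density=1.226)
print(velocity, scaled.shape)
```

Dividing by $\rho U^2$ puts cases with different inlet velocities on a comparable scale, so the model sees targets of similar magnitude across the dataset.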

### Summary
- **VTP Files**: Contain CFD surface simulation results (pressure, wall shear stress, etc.).
- **STL Files**: Provide the 3D surface geometry of the mesh.
- **Info Files**: Contain global parameters such as inlet velocity.
- **OpenFoamAhmedBodySurfaceDataset**: Combines VTP, STL, and info data into a structured format.
- **Extracted Data Dictionary**: Includes mesh points, normals, areas, surface fields, and global parameters.
- **.npy Files**: Store the processed data efficiently for fast loading in machine learning workflows.
- **DoMINO Training**: The final stage, where the processed NumPy dataset is used for training the DoMINO model.
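To build intuition for the per-cell areas and normals in the extracted dictionary, the sketch below computes them for triangles with plain NumPy. It is an illustration only: the notebook itself relies on PyVista (`compute_cell_sizes` and `cell_normals`), and `triangle_areas_and_normals` is a hypothetical helper that assumes purely triangular, non-degenerate cells.

```python
import numpy as np


def triangle_areas_and_normals(points, faces):
    """Compute per-triangle area and unit normal from vertices and connectivity."""
    tri = points[faces]                  # shape (n_faces, 3, 3)
    edge_a = tri[:, 1] - tri[:, 0]
    edge_b = tri[:, 2] - tri[:, 0]
    cross = np.cross(edge_a, edge_b)     # its length equals twice the triangle area
    norms = np.linalg.norm(cross, axis=1)
    return 0.5 * norms, cross / norms[:, np.newaxis]


# Unit square in the z = 0 plane, split into two triangles.
points = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=np.float32)
faces = np.array([[0, 1, 2], [0, 2, 3]])
areas, normals = triangle_areas_and_normals(points, faces)
print(areas)    # each triangle covers half of the unit square
print(normals)  # both normals point along +z
```

The `faces` array here has the same layout as `stl_faces` in the dataset class before it is flattened: one row of three vertex indices per triangle.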

This diagram will help readers visualize the end-to-end data conversion pipeline.  

```mermaid
flowchart TD
    %% Input Section
    subgraph INPUT_DATA ["Input Data"]
        A[VTP Files - PolyData]
        B[STL Files - 3D Geometry]
        D[Info Files - e.g. Inlet Velocity]
    end

    %% Processing Node
    C[OpenFOAM Ahmed Body Surface Dataset]

    %% Output Section
    E[Extracted Data Dictionary]
    F[Saved as .npy Files]
    G[Ready for DoMINO Training]

    %% Data Flow
    A --> C
    B --> C
    D --> C
    C --> E
    E --> F
    F --> G
```

In [5]:
class OpenFoamAhmedBodySurfaceDataset(Dataset):
    """
    Datapipe for converting OpenFOAM Ahmed Body surface dataset into NumPy arrays.

    This class reads the VTP surface simulation files, STL geometry files, and info
    files containing global parameters (like inlet velocity) to prepare data for
    machine learning workflows.
    """

    def __init__(self, data_path: Union[str, Path], info_path: Union[str, Path], stl_path: Union[str, Path], surface_variables=None, volume_variables=None, global_params_types=None, global_params_reference=None, device: int = 0):
        """
        Initializes the dataset object.

        Args:
            data_path: Path to VTP files (surface CFD results).
            info_path: Path to global parameter files (text files).
            stl_path: Path to STL geometry files.
            surface_variables: List of surface fields to extract (default: ["p", "wallShearStress"]).
            volume_variables: List of volume fields (default: ["UMean", "pMean"]).
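            global_params_types: Dict mapping each global parameter name to "vector" or "scalar".
            global_params_reference: Dict with the reference value of each global parameter.
            device: Device ID for loading to GPU (optional).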
        """
        self.data_path = Path(data_path).expanduser()
        self.stl_path = Path(stl_path).expanduser()
        self.info_path = Path(info_path).expanduser()
        assert self.data_path.exists(), f"Path {self.data_path} does not exist"

        # List all VTP files and shuffle for random sampling
        self.filenames = get_filenames(self.data_path)
        random.shuffle(self.filenames)
        self.surface_variables = surface_variables or ["p", "wallShearStress"]
        self.volume_variables = volume_variables or ["UMean", "pMean"]
        self.global_params_types = global_params_types
        self.global_params_reference = global_params_reference
        self.device = device

    def __len__(self):
        """Returns the number of files in the dataset."""
        return len(self.filenames)

    def __getitem__(self, idx):
        """
        Reads one file and converts it to a dictionary of NumPy arrays.

        Steps:
        1. Read global parameter info (inlet velocity) from info file.
        2. Read STL file to get mesh points, faces, and surface areas.
        3. Read VTP file to get surface CFD fields (pressure, shear stress).
        4. Normalize surface fields using velocity and air density.
        5. Compute surface normals and areas.
        6. Return a dictionary containing all relevant NumPy arrays.
        """
        cfd_filename = self.filenames[idx]
        car_dir = self.data_path / cfd_filename

        stl_path = self.stl_path / f"{car_dir.stem}.stl"
        info_path = self.info_path / f"{car_dir.stem}_info.txt"

        # Read inlet velocity from info file
        with open(info_path, "r") as file:
            velocity = next(float(line.split(":")[1].strip()) for line in file if "Velocity" in line)
            
        air_density = self.global_params_reference["air_density"]
        # Read STL mesh
        mesh_stl = pv.get_reader(stl_path).read()
        stl_faces = mesh_stl.faces.reshape(-1, 4)[:, 1:]
        stl_sizes = np.array(mesh_stl.compute_cell_sizes(length=False, area=True, volume=False).cell_data["Area"])

        # Read VTP surface data
        reader = vtk.vtkXMLPolyDataReader()
        reader.SetFileName(str(car_dir))
        reader.Update()
        polydata = reader.GetOutput()

        celldata = get_node_to_elem(polydata).GetCellData()
        surface_fields = np.concatenate(get_fields(celldata, self.surface_variables), axis=-1) / (air_density * velocity**2)

        mesh = pv.PolyData(polydata)
        surface_sizes = np.array(mesh.compute_cell_sizes(length=False, area=True, volume=False).cell_data["Area"])
        surface_normals = mesh.cell_normals / np.linalg.norm(mesh.cell_normals, axis=1)[:, np.newaxis]

        # Arrange global parameters reference in a list based on the type of the parameter
        global_params_reference_list = []
        for name, param_type in self.global_params_types.items():
            if param_type == "vector":
                global_params_reference_list.extend(self.global_params_reference[name])
            elif param_type == "scalar":
                global_params_reference_list.append(self.global_params_reference[name])
            else:
                raise ValueError(
                    f"Global parameter type {param_type!r} for {name} is not supported for this dataset"
                )
        global_params_reference = np.array(
            global_params_reference_list, dtype=np.float32
        )

        # Prepare the list of global parameter values for each simulation file
        global_params_values_list = []
        for key in self.global_params_types.keys():
            if key == "inlet_velocity":
                global_params_values_list.append(velocity)
            elif key == "air_density":
                global_params_values_list.append(air_density)
            else:
                raise ValueError(
                    f"Global parameter {key} not supported for this dataset"
                )
        global_params_values = np.array(global_params_values_list, dtype=np.float32)
        

        return {
            "stl_coordinates": mesh_stl.points.astype(np.float32),
            "stl_centers": mesh_stl.cell_centers().points.astype(np.float32),
            "stl_faces": stl_faces.flatten().astype(np.float32),
            "stl_areas": stl_sizes.astype(np.float32),
            "surface_mesh_centers": mesh.cell_centers().points.astype(np.float32),
            "surface_normals": surface_normals.astype(np.float32),
            "surface_areas": surface_sizes.astype(np.float32),
            "volume_fields": None,
            "volume_mesh_centers": None,
            "surface_fields": surface_fields.astype(np.float32),
            "filename": cfd_filename,
            "global_params_values": global_params_values,
            "global_params_reference": global_params_reference,
        }


def process_file(fname: str, fm_data, output_path: str):
    """
    Converts a single VTP/STL file into a .npy file.

    Skips the file if the output already exists or if the input file is missing/empty.
    """
    full_path = os.path.join(fm_data.data_path, fname)
    output_file = os.path.join(output_path, f"{fname}.npy")
    # Skip work that is already done and inputs that are missing or empty
    if os.path.exists(output_file) or not os.path.exists(full_path) or os.path.getsize(full_path) == 0:
        return
    # np.save pickles the sample dictionary into a single .npy file
    np.save(output_file, fm_data[fm_data.filenames.index(fname)])


def process_surface_data_batch(dataset_paths: dict, info_paths: dict, stl_paths: dict, surface_paths: dict):
    """
    Converts all surface data in the dataset into NumPy format and saves them.

    Steps:
    - Ensures output directories exist.
    - Iterates through train/validation/test splits.
    - Loads the dataset using OpenFoamAhmedBodySurfaceDataset.
    - Processes files in parallel using ProcessPoolExecutor.
    - Converts VTP+STL+global velocity into a NumPy dictionary for each case.
    - Saves the .npy files in the corresponding prepared surface data folder.
    """
    for path in surface_paths.values():
        os.makedirs(path, exist_ok=True)

    print("=== Starting Processing ===")
    for key, dataset_path in dataset_paths.items():
        surface_path = surface_paths[key]
        os.makedirs(surface_path, exist_ok=True)
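        # Pass variables by keyword so surface and volume lists cannot be swapped
        fm_data = OpenFoamAhmedBodySurfaceDataset(
            dataset_path,
            info_paths[key],
            stl_paths[key],
            surface_variables=SURFACE_VARS,
            volume_variables=VOLUME_VARS,
            global_params_types=GLOBAL_PARAMS_TYPES,
            global_params_reference=GLOBAL_PARAMS_REFERENCE,
        )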
        file_list = [fname for fname in fm_data.filenames if fname.endswith(".vtp")]

        print(f"\nProcessing {len(file_list)} files from {dataset_path} → {surface_path}...")

        with ProcessPoolExecutor() as executor:
            list(tqdm(
                executor.map(process_file, file_list, [fm_data]*len(file_list), [surface_path]*len(file_list)),
                total=len(file_list),
                desc=f"Processing {key}",
                dynamic_ncols=True
            ))

    print("=== All Processing Completed Successfully ===")

Let's convert the files:

In [6]:

process_surface_data_batch(dataset_paths, info_paths, stl_paths, surface_paths)

=== Starting Processing ===

Processing 408 files from /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/train → /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/train_prepared_surface_data...


Processing train: 100%|██████████| 408/408 [00:01<00:00, 271.03it/s]



Processing 50 files from /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/validation → /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/validation_prepared_surface_data...


Processing validation: 100%|██████████| 50/50 [00:00<00:00, 81.42it/s]


Processing 50 files from /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/test → /workspace/physicsnemo_ahmed_body_dataset_vv1/dataset/test_prepared_surface_data...



Processing test: 100%|██████████| 50/50 [00:00<00:00, 95.25it/s]


=== All Processing Completed Successfully ===
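After conversion, a quick sanity check on one output file can catch problems before training. The sketch below assumes the dictionary keys returned by `__getitem__` in this notebook; `PER_CELL_KEYS`, `per_cell_row_counts`, and `is_consistent` are hypothetical names introduced for illustration. It verifies that all per-cell arrays describe the same number of surface cells.

```python
import numpy as np

# Keys that hold one row per surface cell in the dictionary returned by __getitem__.
PER_CELL_KEYS = ("surface_mesh_centers", "surface_normals", "surface_areas", "surface_fields")


def per_cell_row_counts(npy_file):
    """Load one processed case and return the number of rows of each per-cell array."""
    sample = np.load(npy_file, allow_pickle=True).item()
    return {key: sample[key].shape[0] for key in PER_CELL_KEYS}


def is_consistent(npy_file):
    """True when all per-cell arrays describe the same number of surface cells."""
    return len(set(per_cell_row_counts(npy_file).values())) == 1
```

For example, `is_consistent(sorted(Path(surface_paths["train"]).glob("*.npy"))[0])` checks the first converted training case; a `False` result would suggest that the VTP fields and the mesh geometry are out of sync for that file.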

