# Dataset

A `Dataset` is a collection of `Projects` that contain molecular dynamics simulations or related data and that share some metadata and characteristics because of how they were generated. In the context of the **MDDB Workflow**, a `Project` is a set of simulations/replicas with one or more trajectory files and a common topology file. To complete the definitions, individual simulations or replicas are referred to as `MD`.

The main purpose of this class is to keep track of the state of many Projects: whether they are still running, have finished, or have failed, and what caused the error. To enable this, we only need to provide the path where the main SQLite storage file will be kept. We can do this with the `dataset_path` flag when running the workflow:

`mwf run ... --dataset_path path/to/our_dataset.db`

Or, if we do not want to write the flag every time, by using the `dataset_path` field in the `inputs.yaml` config file:
```yaml
dataset_path: path/to/our_dataset.db
```
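
To make the idea of state tracking more concrete, here is a minimal sketch of how per-project states could be kept in an SQLite file. The table name, its columns and the `set_state` helper are assumptions made for illustration; they are not the actual schema used by the `Dataset` class.

```python
import sqlite3

# Hypothetical schema: one row per project, keyed by its relative path.
conn = sqlite3.connect(":memory:")  # a real dataset would use a .db file path
conn.execute(
    "CREATE TABLE projects ("
    " rel_path TEXT PRIMARY KEY,"
    " state TEXT NOT NULL,"
    " message TEXT NOT NULL)"
)


def set_state(conn, rel_path, state, message):
    """Insert the project if it is missing, otherwise overwrite its state and message."""
    conn.execute(
        "INSERT OR REPLACE INTO projects (rel_path, state, message) VALUES (?, ?, ?)",
        (rel_path, state, message),
    )
    conn.commit()


set_state(conn, "project_1", "running", "Workflow started.")
set_state(conn, "project_1", "error", "Topology file not found.")
set_state(conn, "project_2", "done", "Finished without errors.")

# Latest known state of every project
states = dict(conn.execute("SELECT rel_path, state FROM projects ORDER BY rel_path"))
print(states)
```

Keeping a single row per project means that only the latest state and message are visible, which matches the kind of information shown later in the dataset table.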


## Creating a new Dataset

However, modifying the inputs file for every project of the dataset may be very cumbersome, as Datasets can be formed by hundreds or thousands of projects. For this we can make use of another feature of this class: automatic inputs file generation.

### Directory Structure

For this, we start from a root folder. Everyone may organize it in their own way, but it normally follows a hierarchical structure that contains all the projects and may look something like this:

``` bash
new_dataset/
├── project_1/
├── project_2/
├── project_3/
├── project_4/
├── ...
├── special_cases/
│   ├── case_1/
│   ├── case_2/
│   └── ...
├── wrong_cases/
│   ├── case_1/
│   ├── case_2/
│   └── ...
├── scripts/
├── project_logs/
└── ...
```
Note that we do not specify anything about `MDs` yet, as we will take care of that later.

In [1]:
import os

# Create directory structure
dataset_dir = "new_dataset"
dirs = [
    dataset_dir+"/project_1",
    dataset_dir+"/project_2",
    dataset_dir+"/project_3",
    dataset_dir+"/project_4",
    dataset_dir+"/special_cases/case_1",
    dataset_dir+"/special_cases/not_this_one",
    dataset_dir+"/to_remove/case_1",
    dataset_dir+"/to_remove/case_2",
    dataset_dir+"/scripts",
    dataset_dir+"/project_logs",
]

for dir_path in dirs:
    os.makedirs(dir_path, exist_ok=True)

In [2]:
%load_ext autoreload
%autoreload 2
from mddb_workflow.core.dataset import Dataset

# Root directory of the dataset
dataset_dir = "new_dataset"
# Path to the dataset storage file
db_path = dataset_dir+"/new_dataset.db"
# Remove database in case the notebook is re-run
if os.path.exists(db_path):
    os.remove(db_path)

# Create the dataset (entries are added in the next section)
ds = Dataset(dataset_path=db_path)

### Adding entries

Adding entries to the dataset is the first step: it selects the projects that we are going to keep track of.

For this, we specify the root folders and the ones to ignore (those not containing projects, e.g., scripts, logs, etc.). We can do this by passing absolute paths, relative paths or glob patterns. For example:

In [3]:
# CLI: mwf dataset add new_dataset.db -p project_* special_cases/case_1 to_remove/* --ignore-dirs *logs
ds.add_entries([dataset_dir+'/project_*',
                dataset_dir+'/special_cases/case_1',
                dataset_dir+'/to_remove/*'],
                ignore_dirs=[dataset_dir+'/*logs'],
                verbose=True)

Ignoring project: project_logs
Adding project: project_2 (UUID: 96fe74ab-0d1e-4b2e-96b8-261c9d95ef05)
Adding project: project_1 (UUID: da544317-d17c-4e90-9713-7f54a8cf4fa1)
Adding project: project_4 (UUID: e502a318-7f54-4c90-b554-09e8c2b890f7)
Adding project: project_3 (UUID: f42b249e-9377-4250-bb22-1b0b40a80702)
Adding project: special_cases/case_1 (UUID: 1b6a4dcb-e585-4cc6-866f-c085077b2ddd)
Adding project: to_remove/case_1 (UUID: 965bc0fb-1072-43ac-a2af-7fdd3d14c027)
Adding project: to_remove/case_2 (UUID: e425e3b4-d897-4729-bbf3-7cd45b44b939)


Some useful glob patterns:

- `*`: matches all the folders.
- `**/*`: matches all subfolders.
- `**/[0-9]*`: matches subfolders starting with a digit.
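
To get a feeling for what these patterns select, the sketch below evaluates them with the standard `pathlib` module on a throwaway directory tree. This is only an illustration: it assumes that the dataset resolves patterns in a way similar to Python's own globbing, which may differ in details, and the folder names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub in ("project_1", "special_cases/case_1", "special_cases/1abc", "scripts"):
        (root / sub).mkdir(parents=True)

    def match(pattern):
        """Return the sorted relative paths of the directories matched by a pattern."""
        return sorted(
            p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_dir()
        )

    top_level = match("*")           # direct children of the root only
    any_depth = match("**/*")        # folders at any depth
    digit_start = match("**/[0-9]*")  # folders whose name starts with a digit

print(top_level)
print(any_depth)
print(digit_start)
```

With Python's globbing, `**/*` also returns the top-level folders, so combining it with `ignore_dirs` is a reasonable way to keep helper folders out of the dataset.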

### Removing entries

If we later find that a project should be deleted, or if the glob pattern added folders we did not want, we can remove the matching entries in a similar way to how we added them:

In [4]:
# CLI: mwf dataset remove new_dataset.db to_remove/*
ds.remove_entry('to_remove/*')

Deleted project with UUID '965bc0fb-1072-43ac-a2af-7fdd3d14c027'
Deleted project with UUID 'e425e3b4-d897-4729-bbf3-7cd45b44b939'


### Showing the dataset

Once we have initialized the entries, we can show the dataset to check if it is correct.

#### Dataset tables

In [5]:
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| da544317 | | projects | project_1 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |
| 96fe74ab | | projects | project_2 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |
| f42b249e | | projects | project_3 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |
| e502a318 | | projects | project_4 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |
| 1b6a4dcb | | projects | special_cases/case_1 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |


In [6]:
# CLI:
!mwf dataset show {dataset_dir}/new_dataset.db

                             MDDB Dataset (5 rows)
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ uuid    ┃ projec… ┃ scope   ┃ rel_pa… ┃ num_mds ┃ state ┃ message  ┃ last_m… ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ da5443… │         │ projec… │ ../pro… │ 0       │ new   │ No       │ 13:50:… │
│         │         │         │         │         │       │ informa… │ 12-02-… │
│         │         │         │         │         │       │ recorded │         │
│         │         │         │         │         │       │ yet.     │         │
│ 96fe74… │         │ projec… │ ../pro… │ 0       │ new   │ No       │ 13:50:… │
│         │         │         │         │         │       │ informa… │ 

In [7]:
# To show a more specific subset of the dataset, use the available flags:
!mwf dataset show -h

usage: mwf dataset show [-h] [-p [QUERY_PATH ...]] [-st [QUERY_STATE ...]]
                        [-sc QUERY_SCOPE] [-ms QUERY_MESSAGE] [-s SORT_BY]
                        [-n N_ROWS] [-l] [-m]
                        [dataset_path]

positional arguments:
  dataset_path
      Path to the dataset storage file, normally an .db file. If not provided,
      the first *.db file found in the current directory will be used.

options:
  -h, --help
      show this help message and exit
  -p, --query_path [QUERY_PATH ...]
      If provided, filters rows whose 'rel_path' matches these glob patterns.
      Default: ['*']
  -st, --query_state [QUERY_STATE ...]
      If provided, filters rows whose 'state' matches this value/list of
      values.
  -sc, --query_scope QUERY_SCOPE
      If provided, filters rows whose 'scope' matches this value
      ('project'/'p' or 'md'/'m').
  -ms, --query_message QUERY_MESSAGE
      If provided, filters rows whose 'message' matches these glob patterns
      (e.

#### Dataset summary

In [8]:
ds.summary()

| | state | count |
|---|---|---|
| 0 | new | 5 |


In [9]:
# CLI:
!mwf dataset show {dataset_dir}/new_dataset.db -m

Summary of project states:
  state  count
0   new      5
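
Conceptually, the summary is a count of rows per state. A rough equivalent with the standard `sqlite3` module is sketched below; the `projects` table and its columns are hypothetical and only stand in for the real dataset storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (rel_path TEXT PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?)",
    [("project_1", "new"), ("project_2", "new"), ("project_3", "error")],
)

# Count how many projects are in each state
summary = conn.execute(
    "SELECT state, COUNT(*) FROM projects GROUP BY state ORDER BY state"
).fetchall()
print(summary)
```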


#### Specific rows

In [10]:
ds.get_status(dataset_dir+'/special_cases/case_1')

{'uuid': '1b6a4dcb-e585-4cc6-866f-c085077b2ddd',
 'rel_path': 'special_cases/case_1',
 'num_mds': 0,
 'state': 'new',
 'message': 'No information recorded yet.',
 'last_modified': '13:50:27 12-02-2026',
 'scope': 'Project'}

In [11]:
# CLI
!mwf dataset status {dataset_dir}/new_dataset.db -p {dataset_dir}'/special_cases/case_1'

UUID:          1b6a4dcb-e585-4cc6-866f-c085077b2ddd
Path:          special_cases/case_1
State:         new
Scope:         Project
MDs:           0
Last Modified: 13:50:27 12-02-2026
Message:       No information recorded yet.


## Running the workflow

### Generating inputs files programmatically

The first step to run the workflow is generating the inputs files for each project. This can be done programmatically using the `generate_inputs_yaml` method of the `Dataset` class. This method generates an `inputs.yaml` file for each project in the dataset, with the same content we would otherwise write manually, but with the advantage that we can use variables that are replaced by the actual values when each file is generated.

#### Jinja2 templates

This is done using Jinja2 template syntax. For example, we can use the `{{DATASET}}` variable to refer to the dataset path and the `{{DIR}}` variable to refer to the project directory name. This way, we can write a single inputs file template that is used for all projects in the dataset, and we do not have to write a different inputs file for each project.

In [12]:
inputs_template_str = """
authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: {{DATASET}}
description: 10 ns simulation of {{DIR}} pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project {{DIR}}
"""

inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)

In [13]:
# CLI: mwf dataset inputs new_dataset.db -it inputs_template.yaml -o
ds.generate_inputs_yaml(inputs_template, overwrite=True)

Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4/inputs.yaml for project project_4
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1


In [14]:
# Notice how the {{DATASET}} and {{DIR}} variables have been replaced by the dataset path and the project directory name, respectively.
!cat {dataset_dir}/project_1/inputs.yaml


authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: ../new_dataset.db
description: 10 ns simulation of project_1 pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project project_1
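
The rendered `dataset_path` above is relative to the project directory. The sketch below shows one way such values could be derived and substituted. It assumes that `DIR` is the project folder name and that `DATASET` is the database path relative to the project, which is consistent with the output above but not confirmed as the internal logic. The `render` helper is a simplified stand-in for Jinja2 that only handles plain `{{NAME}}` placeholders.

```python
import os
import re


def template_variables(project_dir, dataset_path):
    """Build the two default variables (assumed semantics, see the text above)."""
    relative_db = os.path.relpath(dataset_path, start=project_dir)
    return {
        "DIR": os.path.basename(os.path.normpath(project_dir)),
        "DATASET": relative_db.replace(os.sep, "/"),
    }


def render(template, variables):
    """Replace {{NAME}} placeholders; unknown names are left untouched."""
    def substitute(match):
        return str(variables.get(match.group(1), match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


variables = template_variables("new_dataset/project_1", "new_dataset/new_dataset.db")
rendered = render("dataset_path: {{DATASET}}\nname: Project {{DIR}}\n", variables)
print(rendered)
```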

#### Adding custom fields

To generate more complex inputs files, we can make use of more advanced features of [Jinja2 templates](https://jinja.palletsprojects.com/en/stable/templates/), such as custom filters and functions, which are basically Python code that we can use in the templates to generate the inputs files.

The template will receive a dictionary generated by a custom function that we write, which takes the project directory as its argument:

Project directory -> Custom function -> Dictionary -> Template -> Rendered inputs.yaml
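
The first steps of this chain can be sketched in a few lines. The `build_context` helper is hypothetical and only illustrates the assumed behaviour: the default variables are combined with whatever the custom function returns, and a function that returns `None` contributes nothing.

```python
import os


def build_context(project_dir, inputs_generator=None):
    """Combine the default template variables with the custom function's output."""
    context = {"DIR": os.path.basename(os.path.normpath(project_dir))}
    extra = inputs_generator(project_dir) if inputs_generator else None
    context.update(extra or {})  # a None return value adds no variables
    return context


def flag_special(project_dir):
    """Example custom function: mark projects located under special_cases."""
    if "special_cases" in project_dir:
        return {"is_special_case": True}


regular = build_context("new_dataset/project_1", flag_special)
special = build_context("new_dataset/special_cases/case_1", flag_special)
print(regular)
print(special)
```

With Jinja2's default settings, a variable that is missing from the context evaluates as false in an `{% if %}` test, so a template can branch on a key that only some projects define.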

In [15]:
inputs_template_str = """
name: Project {{DIR}}
{%- if is_special_case %}
description: Special case description for {{DIR}}
{%- else %}
description: 10 ns simulation of {{DIR}} pdb structure
{% endif %}
"""

# Save the template to a file
inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)


# Define the custom function
def inputs_generator(project_dir: str):
    """Generate a dictionary with the information that we want to use in the template.
    This function will be called for each project directory, and the returned dictionary will be passed to the template as variables.
    """
    if "special_cases" in project_dir:
        return {'is_special_case': True}

In [16]:
# CLI: mwf dataset inputs new_dataset.db -it inputs_template.yaml -ig inputs_generator.py -o
ds.generate_inputs_yaml(inputs_template, overwrite=True,
                        inputs_generator=inputs_generator)

Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4/inputs.yaml for project project_4
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1


In [17]:
# Notice how the project in the special_cases directory has a different description than the rest of the projects.
!cat {dataset_dir}/project_1/inputs.yaml
!cat {dataset_dir}/special_cases/case_1/inputs.yaml


name: Project project_1
description: 10 ns simulation of project_1 pdb structure

name: Project case_1
description: Special case description for case_1

Similarly, we can use the CLI to generate the inputs files with the custom function, with the only difference that we pass a file instead of a function.

**IMPORTANT**: in this file there must be a function called `inputs_generator`.
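
For readers curious about why the function name matters, here is a sketch of how a function can be loaded from a Python file by path using the standard `importlib` machinery. The `load_inputs_generator` helper is an assumption for illustration and not necessarily how the CLI does it.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_inputs_generator(py_file):
    """Import a Python file by path and return its `inputs_generator` function."""
    spec = importlib.util.spec_from_file_location("inputs_generator_module", str(py_file))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if not hasattr(module, "inputs_generator"):
        raise AttributeError(f"{py_file} must define a function named 'inputs_generator'")
    return module.inputs_generator


with tempfile.TemporaryDirectory() as tmp:
    py_file = Path(tmp) / "inputs_generator.py"
    py_file.write_text(
        "def inputs_generator(project_dir):\n"
        "    return {'is_special_case': 'special_cases' in project_dir}\n"
    )
    generator = load_inputs_generator(py_file)
    result = generator("new_dataset/special_cases/case_1")

print(result)
```

Looking the function up by a fixed name keeps the command line simple: only the file path has to be passed.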

In [18]:

python_file_str = """
def inputs_generator(project_dir: str):
    if "special_cases" in project_dir:
        return {'is_special_case': True}
"""

inputs_generator_py = dataset_dir+'/inputs_generator.py'
with open(inputs_generator_py, 'w') as f:
    f.write(python_file_str)

!mwf dataset inputs {dataset_dir}/new_dataset.db -it {inputs_template} -ig {inputs_generator_py} -o

Loading inputs generator from file: new_dataset/inputs_generator.py
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1/inputs.yaml for project project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2/inputs.yaml for project project_2
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3/inputs.yaml for project project_3
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4/inputs.yaml for project project_4
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1/inputs.yaml for project special_cases/case_1


In [19]:
!cat {dataset_dir}/project_1/inputs.yaml
!cat {dataset_dir}/special_cases/case_1/inputs.yaml


name: Project project_1
description: 10 ns simulation of project_1 pdb structure

name: Project case_1
description: Special case description for case_1

#### A more realistic example: handling a variable number of MDs

In this case, we want to generate an inputs file that contains a list of all the MDs that we have in each project, but the number of MDs is not the same for all projects. For this, we can write a custom function that will look for all the MDs in each project, ignore any irrelevant files (equilibration trajectories, for example), and return a dictionary with the list of MDs, that we can then use in the template to generate the inputs file.

In [20]:
# Create directory structure
dirs = [
    dataset_dir+"/many_mds",
    dataset_dir+"/many_mds/project_1",
    dataset_dir+"/many_mds/project_2",
]
files = [
    # A project with 3 equilibration and 3 MD replicas
    dataset_dir+"/many_mds/project_1/equil_1.traj",
    dataset_dir+"/many_mds/project_1/equil_2.traj",
    dataset_dir+"/many_mds/project_1/equil_3.traj",
    dataset_dir+"/many_mds/project_1/prod_1.traj",
    dataset_dir+"/many_mds/project_1/prod_2.traj",
    dataset_dir+"/many_mds/project_1/prod_3.traj",
    # A project with 2 equilibration and 2 MD replicas
    dataset_dir+"/many_mds/project_2/equil_1.traj",
    dataset_dir+"/many_mds/project_2/equil_2.traj",
    dataset_dir+"/many_mds/project_2/prod_1.traj",
    dataset_dir+"/many_mds/project_2/prod_2.traj",
]
for dir_path in dirs:
    os.makedirs(dir_path, exist_ok=True)

for file_path in files:
    with open(file_path, 'w') as f:
        f.write("DUMMY TRAJ FILE\n")

In [21]:
ds.add_entries([dataset_dir+'/many_mds/*'], verbose=True)

Adding project: many_mds/project_2 (UUID: 167be7a4-62fb-4fd3-9dd6-0218c0d639b6)
Adding project: many_mds/project_1 (UUID: 4527fd4e-378a-40e2-a7d6-e6e53151f7b9)


In [22]:
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 4527fd4e | | projects | many_mds/project_1 | 0 | new | No information recorded yet. | 13:50:39 12-02-2026 |
| 167be7a4 | | projects | many_mds/project_2 | 0 | new | No information recorded yet. | 13:50:39 12-02-2026 |
| da544317 | | projects | project_1 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |
| 96fe74ab | | projects | project_2 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |
| f42b249e | | projects | project_3 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |
| e502a318 | | projects | project_4 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |
| 1b6a4dcb | | projects | special_cases/case_1 | 0 | new | No information recorded yet. | 13:50:27 12-02-2026 |


In [23]:
inputs_template_str = """
name: Project {{DIR}}
mds:
{% for md in mds %}
  -
    mdir: {{ md.mdir }}
    input_trajectory_filepaths: {{ md.traj }}
{% endfor %}
"""

inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)

In [24]:
from pathlib import Path


def mds_generator(project_dir: str):
    """Generate a list of MD replicas based on the traj files in the project directory."""
    mds = []
    project_path = Path(project_dir)
    prod_trajs = sorted(project_path.glob('prod_*.traj'))
    num_replicas = len(prod_trajs)
    for i in range(num_replicas):
        mds.append({
            'mdir': f'md_replica_{i+1}',
            'traj': prod_trajs[i].relative_to(project_path).as_posix(),
        })
    return {'mds': mds}

In [25]:
# Check that the function works as expected
mds_generator(dataset_dir+'/many_mds/project_1')

{'mds': [{'mdir': 'md_replica_1', 'traj': 'prod_1.traj'},
  {'mdir': 'md_replica_2', 'traj': 'prod_2.traj'},
  {'mdir': 'md_replica_3', 'traj': 'prod_3.traj'}]}

In [28]:
ds.generate_inputs_yaml(inputs_template,
                        inputs_generator=mds_generator,
                        overwrite=True,
                        query_path='*many_mds*'
                        )

Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/many_mds/project_1/inputs.yaml for project many_mds/project_1
Generating /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/many_mds/project_2/inputs.yaml for project many_mds/project_2


In [29]:
# Generated inputs.yaml for project with 3 replicas
!cat new_dataset/many_mds/project_1/inputs.yaml


name: Project project_1
mds:

  -
    mdir: md_replica_1
    input_trajectory_filepaths: prod_1.traj

  -
    mdir: md_replica_2
    input_trajectory_filepaths: prod_2.traj

  -
    mdir: md_replica_3
    input_trajectory_filepaths: prod_3.traj


In [30]:
# Generated inputs.yaml for project with 2 replicas
!cat new_dataset/many_mds/project_2/inputs.yaml


name: Project project_2
mds:

  -
    mdir: md_replica_1
    input_trajectory_filepaths: prod_1.traj

  -
    mdir: md_replica_2
    input_trajectory_filepaths: prod_2.traj


### Launching the workflow

#### Python

Once the inputs files are generated, we can launch the workflow for all projects in the dataset. The `launch_workflow` method provides several options for running the workflow:

- **Sequential execution**: Run projects one after another (default)
- **Parallel execution**: Run multiple projects simultaneously using a process pool
- **SLURM execution**: Submit jobs to a SLURM cluster

The method also supports filtering which projects to run using the same query parameters we've seen before (`query_path`, `query_state`, `query_message`).

##### Sequential execution

The simplest way to run the workflow is to call `launch_workflow` without any arguments, which runs the workflow sequentially for all projects in the dataset.

##### Filtering projects to run

You can filter which projects to run using the same query parameters we used before:

In [None]:
# Run only for projects in the special_cases directory
ds.launch_workflow(query_path=['*/special_cases/*'])

# Run only for projects that are in 'new' state
ds.launch_workflow(query_state=['new'])

# Run for projects matching a specific pattern and state
ds.launch_workflow(
    query_path=['project_*'],
    query_state=['new', 'error']
)
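
The selection logic behind these filters can be pictured with the sketch below. It works on hypothetical rows and assumes that a project is selected when its path matches any of the patterns and its state is one of the requested ones; the actual matching rules of `launch_workflow` may differ.

```python
from fnmatch import fnmatch

# Hypothetical rows, shaped like the entries shown in the dataset table
rows = [
    {"rel_path": "project_1", "state": "new"},
    {"rel_path": "project_2", "state": "error"},
    {"rel_path": "project_3", "state": "done"},
    {"rel_path": "special_cases/case_1", "state": "new"},
]


def select(rows, query_path=("*",), query_state=None):
    """Keep rows matching any path pattern and, if given, any of the states."""
    selected = []
    for row in rows:
        path_ok = any(fnmatch(row["rel_path"], pattern) for pattern in query_path)
        state_ok = query_state is None or row["state"] in query_state
        if path_ok and state_ok:
            selected.append(row["rel_path"])
    return selected


to_run = select(rows, query_path=["project_*"], query_state=["new", "error"])
everything = select(rows)
print(to_run)
```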

##### Parallel execution

To run multiple projects simultaneously, use the `n_jobs` parameter to specify the number of parallel workers. Use `n_jobs=-1` to use all available CPU cores:

In [None]:
# Run with 4 parallel workers
ds.launch_workflow(n_jobs=4)

# Use all available CPU cores
ds.launch_workflow(n_jobs=-1)
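
The meaning of `n_jobs` can be illustrated with a small sketch. All names below are hypothetical, and a thread pool is used for brevity where the text above mentions a process pool; `fake_run` is a placeholder for launching the workflow in one project.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs):
    """Translate n_jobs=-1 into the number of available CPU cores."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


def run_all(project_dirs, run_one, n_jobs=1):
    """Apply run_one to every project, sequentially or with a worker pool."""
    workers = resolve_n_jobs(n_jobs)
    if workers == 1:
        return [run_one(project_dir) for project_dir in project_dirs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, project_dirs))  # results keep the input order


def fake_run(project_dir):
    """Placeholder for launching the workflow in one project directory."""
    return f"finished {project_dir}"


results = run_all(["project_1", "project_2", "project_3"], fake_run, n_jobs=2)
print(results)
```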

##### Custom workflow command

By default, the workflow runs `mwf run` for each project. You can customize this command using the `mwf_run_cmd` parameter:

In [None]:
# Run with custom flags, e.g., only include specific tasks
ds.launch_workflow(mwf_run_cmd='mwf run --include meta network')

# Run with debug mode enabled
ds.launch_workflow(debug=True)

##### Using the CLI

All of the above functionality is also available through the command line interface:

In [None]:
# Run sequentially for all projects
!mwf dataset run {dataset_dir}/new_dataset.db

# Run with filtering
!mwf dataset run {dataset_dir}/new_dataset.db -p 'project_*' -st new error

# Run with parallel workers
!mwf dataset run {dataset_dir}/new_dataset.db -n 4

# Run with custom command
!mwf dataset run {dataset_dir}/new_dataset.db -c 'mwf run --include meta network'

# See all available options
!mwf dataset run -h

#### SLURM

For computing clusters using SLURM, you can submit each project as a separate job. This requires a job template file that defines the SLURM configuration.

##### Creating a SLURM job template

The job template is a Jinja2 template that will be rendered for each project. It should contain the SLURM directives and the command to run. The template has access to the following variables:

- `{{DIR}}`: Absolute path to the project directory
- Every field available in the inputs.yaml.

Here's an example job template:

In [None]:
job_template_str = """#!/bin/bash
#SBATCH --job-name=mddb_workflow
#SBATCH --output=mwf_%j.out
#SBATCH --error=mwf_%j.err
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Load required modules
module load anaconda3

# Activate virtual environment if needed
conda activate mwf_env

# Change to project directory
cd {{DIR}}

# Run the workflow command
mwf run -filt -fit -e energies clusters pockets -m largeaa
"""

# Save the template to a file
job_template_path = dataset_dir + '/slurm_job_template.sh'
with open(job_template_path, 'w') as f:
    f.write(job_template_str)
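
Rendering the template once per project can be pictured as follows. This is a sketch under stated assumptions: `str.replace` stands in for Jinja2, the `job.sh` file name and the `write_job_script` helper are hypothetical, and nothing is submitted to SLURM.

```python
import tempfile
from pathlib import Path

job_template = "#!/bin/bash\n#SBATCH --job-name=mddb_workflow\ncd {{DIR}}\nmwf run\n"


def write_job_script(project_dir, template):
    """Render the template for one project and save it inside the project folder."""
    project_path = Path(project_dir).resolve()
    script = template.replace("{{DIR}}", str(project_path))
    script_path = project_path / "job.sh"  # hypothetical file name
    script_path.write_text(script)
    return script_path


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project_1"
    project.mkdir()
    script_path = write_job_script(project, job_template)
    script_text = script_path.read_text()
    expected_cd = f"cd {project.resolve()}"
    # A real submission would now pass script_path to sbatch.

print(script_text)
```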

##### Submitting jobs to SLURM

Once you have a job template, you can submit jobs using the `slurm=True` parameter and providing the path to the template:

In [None]:
# Submit all projects as SLURM jobs
ds.launch_workflow(
    slurm=True,
    job_template=job_template_path
)

# Submit filtered projects as SLURM jobs
ds.launch_workflow(
    query_path=['project_*'],
    query_state=['new'],
    slurm=True,
    job_template=job_template_path
)

# Use custom workflow command with SLURM
ds.launch_workflow(
    slurm=True,
    job_template=job_template_path,
    mwf_run_cmd='mwf run --include meta network minimal'
)

##### Using SLURM with the CLI

In [None]:
# Submit all projects as SLURM jobs
!mwf dataset run {dataset_dir}/new_dataset.db --slurm --job-template {job_template_path}

# Submit with filtering
!mwf dataset run {dataset_dir}/new_dataset.db -p 'project_*' -st new --slurm -jt {job_template_path}

# Submit with custom workflow command
!mwf dataset run {dataset_dir}/new_dataset.db --slurm -jt {job_template_path} -c 'mwf run --include meta network'

##### Monitoring job status

When running workflows (either locally or via SLURM), the dataset automatically tracks the state of each project. You can monitor progress using:

In [40]:
ds.summary()

| | state | count |
|---|---|---|
| 0 | new | 7 |


In [None]:
# Check the summary of project states
ds.summary()

# View the full dataset with log files
ds.get_dataframe(include_logs=True)

# Filter to see only running or error states
ds.get_dataframe(query_state=['running', 'error'])

# CLI: Watch the dataset in real-time (updates every few seconds)
# mwf dataset watch new_dataset.db

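## Already run projects

The cells below mimic projects that were already processed by the workflow: each project and MD directory holds a `.mwf_cache.json` file with its UUID (MD caches also store the parent `project_uuid`), and `scan` registers them in the dataset.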

In [35]:
import os
import json
from uuid import uuid4
from contextlib import chdir

# Create test directory structure
test_dir = "old_dataset"
if test_dir not in os.getcwd():
    os.makedirs(test_dir, exist_ok=True)

# Define the directory structure
projects = {
    'project1': ['replica1', 'replica2'],
    'project2': ['replica1', 'replica2', 'replica3'],
    'project3': [],  # Project with no MDs
}
with chdir(test_dir):
    # Create directories and cache files
    for project_name, md_dirs in projects.items():
        # Create project directory
        os.makedirs(project_name, exist_ok=True)

        # Create cache file for project with UUID
        project_uuid = str(uuid4())
        project_cache = {
            'uuid': project_uuid,
            # Project cache does NOT have project_uuid
        }
        cache_path = os.path.join(project_name, '.mwf_cache.json')
        with open(cache_path, 'w') as f:
            json.dump(project_cache, f, indent=4)

        print(f"Created project: {project_name} (UUID: {project_uuid})")

        # Create MD directories with their cache files
        for md_dir in md_dirs:
            md_path = os.path.join(project_name, md_dir)
            os.makedirs(md_path, exist_ok=True)

            # Create cache file for MD with its own UUID and parent project_uuid
            md_uuid = str(uuid4())
            md_cache = {
                'uuid': md_uuid,
                'project_uuid': project_uuid,  # MD cache HAS project_uuid
            }
            md_cache_path = os.path.join(md_path, '.mwf_cache.json')
            with open(md_cache_path, 'w') as f:
                json.dump(md_cache, f, indent=4)

            print(f"  Created MD: {md_dir} (UUID: {md_uuid})")

print("\nTest directory structure created with cache files!")

Created project: project1 (UUID: f3d6c8cc-c795-49d3-b3b2-acd9228a8c5e)
  Created MD: replica1 (UUID: ef62c188-944c-457b-bb29-b3643c5a60a6)
  Created MD: replica2 (UUID: f89caec2-56c2-4d32-9992-cb3f55f27dad)
Created project: project2 (UUID: adef845e-3bee-40c8-bb53-69bd8869ce5d)
  Created MD: replica1 (UUID: 3bb86ed2-cc7c-4e60-919a-1758d3cfbcf1)
  Created MD: replica2 (UUID: 45e5df52-1216-43d9-afdc-05073a3ff293)
  Created MD: replica3 (UUID: bee0004b-6f25-44e5-9427-009d2198aadb)
Created project: project3 (UUID: 27556625-e806-42cf-96e4-267eb5fc2151)

Test directory structure created with cache files!


In [36]:
%load_ext autoreload
%autoreload 2
from mddb_workflow.core.dataset import Dataset

# Initialize the Dataset
db_path = test_dir+"/dataset.db"
# Remove database in case the notebook is re-run
if os.path.exists(db_path):
    os.remove(db_path)

# Create dataset and scan for projects and MDs
ds = Dataset(dataset_path=db_path)
ds.scan(verbose=True)

Adding project: project3 (UUID: 27556625-e806-42cf-96e4-267eb5fc2151)
Adding project: project1 (UUID: f3d6c8cc-c795-49d3-b3b2-acd9228a8c5e)
Adding project: project2 (UUID: adef845e-3bee-40c8-bb53-69bd8869ce5d)
  Adding MD: project1/replica1 (UUID: ef62c188-944c-457b-bb29-b3643c5a60a6, Project UUID: f3d6c8cc-c795-49d3-b3b2-acd9228a8c5e)
  Adding MD: project1/replica2 (UUID: f89caec2-56c2-4d32-9992-cb3f55f27dad, Project UUID: f3d6c8cc-c795-49d3-b3b2-acd9228a8c5e)
  Adding MD: project2/replica1 (UUID: 3bb86ed2-cc7c-4e60-919a-1758d3cfbcf1, Project UUID: adef845e-3bee-40c8-bb53-69bd8869ce5d)
  Adding MD: project2/replica2 (UUID: 45e5df52-1216-43d9-afdc-05073a3ff293, Project UUID: adef845e-3bee-40c8-bb53-69bd8869ce5d)
  Adding MD: project2/replica3 (UUID: bee0004b-6f25-44e5-9427-009d2198aadb, Project UUID: adef845e-3bee-40c8-bb53-69bd8869ce5d)
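
The distinction between projects and MDs in the output above follows from the cache files: an MD cache carries a `project_uuid`, a project cache does not. The sketch below shows how a directory tree could be classified on that basis. The `scan_cache_files` helper and the short UUID values are made up for illustration and do not reproduce the actual `scan` implementation.

```python
import json
import tempfile
from pathlib import Path

CACHE_NAME = ".mwf_cache.json"


def scan_cache_files(root):
    """Classify every directory holding a cache file as a project or an MD."""
    projects, mds = {}, {}
    for cache_file in sorted(Path(root).rglob(CACHE_NAME)):
        cache = json.loads(cache_file.read_text())
        rel_path = cache_file.parent.relative_to(root).as_posix()
        if "project_uuid" in cache:
            mds[rel_path] = cache["project_uuid"]  # MD: points to its parent project
        else:
            projects[rel_path] = cache["uuid"]  # project: no parent reference
    return projects, mds


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "project1" / "replica1").mkdir(parents=True)
    (root / "project1" / CACHE_NAME).write_text(json.dumps({"uuid": "p-1"}))
    (root / "project1" / "replica1" / CACHE_NAME).write_text(
        json.dumps({"uuid": "m-1", "project_uuid": "p-1"})
    )
    projects, mds = scan_cache_files(root)

print(projects)
print(mds)
```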


In [37]:
# UUIDs are shortened and the paths are shown relative to the dataset
# To display full uuids and absolute paths, use ds.get_dataframe()
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| f3d6c8cc | | projects | project1 | 2.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| adef845e | | projects | project2 | 3.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 27556625 | | projects | project3 | 0.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| ef62c188 | f3d6c8cc | mds | project1/replica1 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| f89caec2 | f3d6c8cc | mds | project1/replica2 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 3bb86ed2 | adef845e | mds | project2/replica1 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 45e5df52 | adef845e | mds | project2/replica2 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| bee0004b | adef845e | mds | project2/replica3 | | new | No information recorded yet. | 11:11:38 15-01-2026 |


In [38]:
# Test adding a new MD to an existing project
# Get UUID of project1
with open(test_dir+'/project1/.mwf_cache.json', 'r') as f:
    project1_data = json.load(f)
    project1_uuid = project1_data['uuid']

# Create new MD directory
new_md_dir = test_dir+'/project1/replica3'
os.makedirs(new_md_dir, exist_ok=True)

# Create cache for new MD with project_uuid
new_md_uuid = str(uuid4())
with open(os.path.join(new_md_dir, '.mwf_cache.json'), 'w') as f:
    json.dump({
        'uuid': new_md_uuid,
        'project_uuid': project1_uuid
    }, f, indent=4)

print(f"Created new MD: {new_md_dir} (UUID: {new_md_uuid}, Project: {project1_uuid})")

# Add it to dataset
ds.add_project(new_md_dir, verbose=True)

Created new MD: old_dataset/project1/replica3 (UUID: fbb396c7-045f-4406-aa5f-85cbbda84eec, Project: f3d6c8cc-c795-49d3-b3b2-acd9228a8c5e)
Adding project: project1/replica3 (UUID: fbb396c7-045f-4406-aa5f-85cbbda84eec)


In [39]:
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| fbb396c7 | | projects | project1/replica3 | 0.0 | new | No information recorded yet. | 11:12:06 15-01-2026 |
| f3d6c8cc | | projects | project1 | 2.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| adef845e | | projects | project2 | 3.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 27556625 | | projects | project3 | 0.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| ef62c188 | f3d6c8cc | mds | project1/replica1 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| f89caec2 | f3d6c8cc | mds | project1/replica2 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 3bb86ed2 | adef845e | mds | project2/replica1 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 45e5df52 | adef845e | mds | project2/replica2 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| bee0004b | adef845e | mds | project2/replica3 | | new | No information recorded yet. | 11:11:38 15-01-2026 |


In [40]:
# Test removing an MD by directory path
ds.remove_entry(test_dir+'/project2/replica1', verbose=True)
ds.dataframe

Deleted MD with UUID '3bb86ed2-cc7c-4e60-919a-1758d3cfbcf1'


| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| fbb396c7 | | projects | project1/replica3 | 0.0 | new | No information recorded yet. | 11:12:06 15-01-2026 |
| f3d6c8cc | | projects | project1 | 2.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| adef845e | | projects | project2 | 2.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 27556625 | | projects | project3 | 0.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| ef62c188 | f3d6c8cc | mds | project1/replica1 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| f89caec2 | f3d6c8cc | mds | project1/replica2 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 45e5df52 | adef845e | mds | project2/replica2 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| bee0004b | adef845e | mds | project2/replica3 | | new | No information recorded yet. | 11:11:38 15-01-2026 |


In [41]:
# Test removing an entire project (will cascade delete all MDs)
ds.remove_entry(test_dir+'/project2/', verbose=True)
ds.dataframe

Deleted project with UUID 'adef845e-3bee-40c8-bb53-69bd8869ce5d'


| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| fbb396c7 | | projects | project1/replica3 | 0.0 | new | No information recorded yet. | 11:12:06 15-01-2026 |
| f3d6c8cc | | projects | project1 | 2.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| 27556625 | | projects | project3 | 0.0 | new | No information recorded yet. | 11:11:38 15-01-2026 |
| ef62c188 | f3d6c8cc | mds | project1/replica1 | | new | No information recorded yet. | 11:11:38 15-01-2026 |
| f89caec2 | f3d6c8cc | mds | project1/replica2 | | new | No information recorded yet. | 11:11:38 15-01-2026 |


In [42]:
# Test get_status by directory path
ds.get_status(test_dir+'/project1')

{'uuid': 'f3d6c8cc-c795-49d3-b3b2-acd9228a8c5e',
 'rel_path': 'project1',
 'num_mds': 2,
 'state': 'new',
 'message': 'No information recorded yet.',
 'last_modified': '11:11:38 15-01-2026',
 'scope': 'Project'}

## Dataset limitations

- File locking may not work reliably when the dataset file is stored on sshfs or other network filesystems.
- The history of flags used is not stored in the dataset, so if we change the flags used for a project, the dataset will not be aware of it and may show outdated information about the state of the project.