# Dataset

A `Dataset` is a collection of `Projects` containing molecular dynamics simulations or related data, which share some metadata and characteristics because of how they were generated. In the context of the **MDDB Workflow**, a `Project` is a set of simulations or replicas with one or more trajectory files and a common topology file. Each individual simulation or replica is referred to as an `MD`.

The main purpose of this class is to keep track of the state of many Projects: whether they are still running, have finished, or have failed, and in the last case what caused the error. The only setup required is the path where the main SQLite storage file will be kept. We can provide it with the `dataset_path` flag when running the workflow:

`mwf run ... --dataset_path path/to/our_dataset.db`

Or, if we do not want to write the flag every time, by setting the `dataset_path` field in the `inputs.yaml` config file:
```yaml
dataset_path: path/to/our_dataset.db
```
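
To make the idea of a state-tracking SQLite file more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, its columns, the state names and the `set_state` helper are all hypothetical and chosen for illustration; they are not the schema or API that `Dataset` actually uses.

```python
import sqlite3

# Hypothetical schema for illustration only, not the real Dataset schema
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE projects (
        rel_path TEXT PRIMARY KEY,
        state    TEXT NOT NULL,
        message  TEXT
    )
    """
)


def set_state(rel_path: str, state: str, message: str = "") -> None:
    """Insert a project, or overwrite its row if it is already tracked."""
    conn.execute(
        "INSERT OR REPLACE INTO projects (rel_path, state, message) "
        "VALUES (?, ?, ?)",
        (rel_path, state, message),
    )
    conn.commit()


set_state("project_1", "running")
set_state("project_2", "running")
set_state("project_1", "done", "Done!")
set_state("project_2", "error", "Missing topology file")

# One row per project, holding only its latest state
rows = conn.execute(
    "SELECT rel_path, state, message FROM projects ORDER BY rel_path"
).fetchall()
for row in rows:
    print(row)
```

Keeping a single row per project means that a query for the current state stays simple, at the cost of not preserving the history of earlier states.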

In [None]:
# TODO: Examples showing the states of some runs

## Creating a new Dataset

However, modifying the inputs file of every project by hand can be very cumbersome, as a Dataset can be formed by hundreds or thousands of projects. For this we can use another feature of this class: automatic inputs file generation.

### Directory Structure

We start from a root folder. Everyone organizes theirs differently, but it normally follows a hierarchical structure holding all the projects, which may look something like this:

``` bash
new_dataset/
├── project_1/
├── project_2/
├── project_3/
├── project_4/
├── ...
├── special_cases/
│   ├── case_1/
│   ├── case_2/
│   └── ...
├── wrong_cases/
│   ├── case_1/
│   ├── case_2/
│   └── ...
├── scripts/
├── project_logs/
└── ...
```
Note that we do not specify anything about `MDs` yet, as we will take care of that later.

In [30]:
import os

# Create directory structure
dataset_dir = "new_dataset"
dirs = [
    dataset_dir+"/project_1",
    dataset_dir+"/project_2",
    dataset_dir+"/project_3",
    dataset_dir+"/project_4",
    dataset_dir+"/special_cases/case_1",
    dataset_dir+"/special_cases/case_2",
    dataset_dir+"/wrong_cases/case_1",
    dataset_dir+"/wrong_cases/case_2",
    dataset_dir+"/scripts",
    dataset_dir+"/project_logs",
]

for dir_path in dirs:
    os.makedirs(dir_path, exist_ok=True)

In [31]:
%load_ext autoreload
%autoreload 2
from mddb_workflow.core.dataset import Dataset

# Root directory of the dataset created above
dataset_dir = "new_dataset"
# Path of the SQLite file backing the Dataset
db_path = dataset_dir+"/new_dataset.db"
# Remove database in case the notebook is re-run
if os.path.exists(db_path):
    os.remove(db_path)

# Create the dataset; projects are added in the next cell
ds = Dataset(dataset_path=db_path)



In [32]:
ds.add_entries(
    [dataset_dir + '/project_*', dataset_dir + '/special_cases/*'],
    ignore_dirs=[dataset_dir + '/*logs'],
    verbose=True,
)

Adding project: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_4 (UUID: c613c8e7-a6b7-4c3b-8f51-8843e862d61a)
Adding project: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_3 (UUID: 4549c7e0-6df6-4f35-a429-3f11a2c5a5b4)
Adding project: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_1 (UUID: 309696e4-9025-41d0-8aa4-e1b10d15ed36)
Ignoring project: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_logs
Adding project: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/project_2 (UUID: 4394a53a-74cf-4492-8968-bb197c614fed)
Adding project: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_1 (UUID: 08235d5b-5ad1-498c-bc5f-9677d76f0197)
Adding project: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/special_cases/case_2 (UUID: 93e9244a-da3f-4cad-9b2a-7f54cf932188)


Some useful glob patterns:

- `*`: matches all the folders.
- `**/*`: matches all subfolders.
- `**/[0-9]*`: matches subfolders starting with a digit.
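
To see what these patterns select, the sketch below expands them with Python's standard `glob` module on a throwaway directory tree. The folder names are made up for the example. How `**` is expanded depends on the tool: in Python's `glob` it only means "any depth" when `recursive=True` is passed, and this sketch makes no assumption about how `add_entries` expands its patterns internally.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Throwaway layout: two top-level folders, each with subfolders
    for rel in ("project_a/1_run", "project_a/notes", "special/2_run"):
        os.makedirs(os.path.join(root, rel))

    def match(pattern: str) -> list:
        """Return the matches of a pattern as sorted paths relative to root."""
        hits = glob.glob(os.path.join(root, pattern), recursive=True)
        return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)

    top_level = match("*")          # entries directly under the root
    any_depth = match("**/*")       # entries at every depth
    digit_dirs = match("**/[0-9]*")  # entries whose name starts with a digit

print(top_level)
print(any_depth)
print(digit_dirs)
```

With `recursive=True`, `**/*` also includes the top-level folders, because `**` can match zero directories.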

In [33]:
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 0 |  | Project | project_1 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 1 |  | Project | project_2 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 2 |  | Project | project_3 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 3 |  | Project | project_4 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 4 |  | Project | special_cases/case_1 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 5 |  | Project | special_cases/case_2 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |


### Generating inputs files programmatically

#### Jinja2 templates

The template is a regular YAML file with Jinja2 placeholders. As the generated file below shows, `{{DATASET}}` is replaced by the absolute path of the dataset file and `{{DIR}}` by the name of the project directory.

In [34]:
inputs_template_str = """
authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: {{DATASET}}
description: 10 ns simulation of {{DIR}} pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project {{DIR}}
"""

inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)

In [35]:
ds.generate_inputs_yaml(inputs_template, overwrite=True)

In [36]:
!cat {dataset_dir}/project_1/inputs.yaml


authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/new_dataset.db
description: 10 ns simulation of project_1 pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project project_1

#### Adding custom fields

An `input_generator` function receives the project directory and returns a dictionary of extra variables that the template can use, for example inside an `{% if %}` block.

In [37]:
inputs_template_str = """
authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: {{DATASET}}
description: 10 ns simulation of {{DIR}} pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project {{DIR}}
{%- if is_special_case %}
description: Special case description for {{DIR}}
{% endif %}
"""

inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)

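def input_generator(project_dir: str):
    """Return extra template variables for a project directory."""
    # Only special cases get the extra flag; other projects return None
    if "special_cases" in project_dir:
        return {'is_special_case': True}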

In [38]:
ds.generate_inputs_yaml(inputs_template, overwrite=True,
                        input_generator=input_generator)

In [39]:
# Now we add a field only for special cases
!cat {dataset_dir}/special_cases/case_1/inputs.yaml


authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/new_dataset.db
description: 10 ns simulation of case_1 pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project case_1
description: Special case description for case_1


In [40]:
# While for regular projects the file remains unchanged
!cat {dataset_dir}/project_1/inputs.yaml


authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/new_dataset.db
description: 10 ns simulation of project_1 pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project project_1

#### Handling multiple and variable number of MDs

When the number of replicas differs between projects, the generator can build the list of MDs from the files found in each project directory, and the template can loop over that list.

In [47]:
# Create directory structure
dirs = [
    dataset_dir+"/many_mds",
    dataset_dir+"/many_mds/project_1",
    dataset_dir+"/many_mds/project_2",
]
files = [
    # A project with 3 equilibration and 3 MD replicas
    dataset_dir+"/many_mds/project_1/equil_1.traj",
    dataset_dir+"/many_mds/project_1/equil_2.traj",
    dataset_dir+"/many_mds/project_1/equil_3.traj",
    dataset_dir+"/many_mds/project_1/prod_1.traj",
    dataset_dir+"/many_mds/project_1/prod_2.traj",
    dataset_dir+"/many_mds/project_1/prod_3.traj",
    # A project with 2 equilibration and 2 MD replicas
    dataset_dir+"/many_mds/project_2/equil_1.traj",
    dataset_dir+"/many_mds/project_2/equil_2.traj",
    dataset_dir+"/many_mds/project_2/prod_1.traj",
    dataset_dir+"/many_mds/project_2/prod_2.traj",
]
for dir_path in dirs:
    os.makedirs(dir_path, exist_ok=True)

for file_path in files:
    with open(file_path, 'w') as f:
        f.write("DUMMY TRAJ FILE\n")

In [48]:
ds.add_entries([dataset_dir+'/many_mds/*'], verbose=True)

In [49]:
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 0 |  | Project | many_mds/project_1 | 0 | new | No information have been recorded yet. | 11:32:42 07-01-2026 |
| 1 |  | Project | many_mds/project_2 | 0 | new | No information have been recorded yet. | 11:32:42 07-01-2026 |
| 2 |  | Project | project_1 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 3 |  | Project | project_2 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 4 |  | Project | project_3 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 5 |  | Project | project_4 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 6 |  | Project | special_cases/case_1 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |
| 7 |  | Project | special_cases/case_2 | 0 | new | No information have been recorded yet. | 11:32:41 07-01-2026 |


In [58]:
inputs_template_str = """
authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: {{DATASET}}
description: 10 ns simulation of {{DIR}} pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project {{DIR}}
mds:
{% for md in mds %}
  -
    mdir: {{ md.mdir }}
    input_trajectory_filepaths: {{ md.traj }}
{% endfor %}
"""

inputs_template = dataset_dir+'/inputs_template.yaml'
with open(inputs_template, 'w') as f:
    f.write(inputs_template_str)

In [59]:
from pathlib import Path


def md_input_generator(project_dir: str):
    """Generate a list of MD replicas based on the traj files in the project directory."""
    mds = []
    project_path = Path(project_dir)
    prod_trajs = sorted(project_path.glob('prod_*.traj'))
    num_replicas = len(prod_trajs)
    for i in range(num_replicas):
        mds.append({
            'mdir': f'md_replica_{i+1}',
            'traj': prod_trajs[i].relative_to(project_path).as_posix(),
        })
    return {'mds': mds}

In [60]:
md_input_generator(dataset_dir+'/many_mds/project_1')

{'mds': [{'mdir': 'md_replica_1', 'traj': 'prod_1.traj'},
  {'mdir': 'md_replica_2', 'traj': 'prod_2.traj'},
  {'mdir': 'md_replica_3', 'traj': 'prod_3.traj'}]}
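
One detail to keep in mind: `sorted()` orders file names lexicographically, so with ten or more replicas `prod_10.traj` would come before `prod_2.traj`, and the replica numbering produced by `md_input_generator` would no longer follow the file numbering. A minimal sketch of a numeric sort key follows. The helper name `replica_index` is hypothetical, and it assumes the `prod_<N>.traj` naming used above.

```python
import re
from pathlib import Path


def replica_index(path: Path) -> int:
    """Extract the trailing replica number from names like 'prod_12.traj'."""
    found = re.search(r"_(\d+)$", path.stem)
    if found is None:
        raise ValueError(f"No replica number found in {path.name}")
    return int(found.group(1))


names = [Path(f"prod_{i}.traj") for i in (1, 2, 10)]

# Default ordering compares the names character by character
lexicographic = [p.name for p in sorted(names)]
# Ordering by the extracted integer keeps replica 10 after replica 2
numeric = [p.name for p in sorted(names, key=replica_index)]

print(lexicographic)
print(numeric)
```

Passing `key=replica_index` to the `sorted()` call inside the generator would be enough to apply this ordering.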

In [61]:
ds.generate_inputs_yaml(inputs_template, overwrite=True,
                        input_generator=md_input_generator)

In [None]:
# Generated inputs.yaml for project with 3 replicas
!cat new_dataset/many_mds/project_1/inputs.yaml


authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/new_dataset.db
description: 10 ns simulation of project_1 pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project project_1
mds:

  -
    mdir: md_replica_1
    input_trajectory_filepaths: prod_1.traj

  -
    mdir: md_replica_2
    input_trajectory_filepaths: prod_2.traj

  -
    mdir: md_replica_3
    input_trajectory_filepaths: prod_3.traj


In [None]:
# Generated inputs.yaml for project with 2 replicas
!cat new_dataset/many_mds/project_2/inputs.yaml


authors:
- Rubén Chaves
collections:
- mdbind
contact: For any questions please send a mail to ruben.chaves@irbbarcelona.org
dataset_path: /home/rchaves/repo/MDDB/workflow/docs/source/new_dataset/new_dataset.db
description: 10 ns simulation of project_2 pdb structure
linkcense: https://creativecommons.org/licenses/by/4.0/
name: Project project_2
mds:

  -
    mdir: md_replica_1
    input_trajectory_filepaths: prod_1.traj

  -
    mdir: md_replica_2
    input_trajectory_filepaths: prod_2.traj


# Already run projects

In [76]:
import os
import json
from uuid import uuid4
from contextlib import chdir

# Create test directory structure
test_dir = "old_dataset"
if test_dir not in os.getcwd():
    os.makedirs(test_dir, exist_ok=True)

# Define the directory structure
projects = {
    'project1': ['replica1', 'replica2'],
    'project2': ['replica1', 'replica2', 'replica3'],
    'project3': [],  # Project with no MDs
}
with chdir(test_dir):
    # Create directories and cache files
    for project_name, md_dirs in projects.items():
        # Create project directory
        os.makedirs(project_name, exist_ok=True)

        # Create cache file for project with UUID
        project_uuid = str(uuid4())
        project_cache = {
            'uuid': project_uuid,
            # Project cache does NOT have project_uuid
        }
        cache_path = os.path.join(project_name, '.mwf_cache.json')
        with open(cache_path, 'w') as f:
            json.dump(project_cache, f, indent=4)

        print(f"Created project: {project_name} (UUID: {project_uuid})")

        # Create MD directories with their cache files
        for md_dir in md_dirs:
            md_path = os.path.join(project_name, md_dir)
            os.makedirs(md_path, exist_ok=True)

            # Create cache file for MD with its own UUID and parent project_uuid
            md_uuid = str(uuid4())
            md_cache = {
                'uuid': md_uuid,
                'project_uuid': project_uuid,  # MD cache HAS project_uuid
            }
            md_cache_path = os.path.join(md_path, '.mwf_cache.json')
            with open(md_cache_path, 'w') as f:
                json.dump(md_cache, f, indent=4)

            print(f"  Created MD: {md_dir} (UUID: {md_uuid})")

print("\nTest directory structure created with cache files!")

Created project: project1 (UUID: 82ff7f88-c24c-4e3d-90a2-541721046fc3)
  Created MD: replica1 (UUID: 957f28b2-1941-4ada-b058-fc1f807629d7)
  Created MD: replica2 (UUID: 39f9fbe5-a239-4c31-b398-cb6a6a0f86a0)
Created project: project2 (UUID: c527aeef-84bb-4343-97fa-461d0e556f31)
  Created MD: replica1 (UUID: f7910129-d21c-47b5-bc39-4975e1f8d184)
  Created MD: replica2 (UUID: e694403d-1c63-40d1-a9da-eefede93eebc)
  Created MD: replica3 (UUID: 493f6e55-26bc-4bc2-8dcc-0ff135a543ed)
Created project: project3 (UUID: 9044d2a2-68af-4060-8303-8299c791242d)

Test directory structure created with cache files!
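
The cell above encodes one convention: a project's `.mwf_cache.json` holds only its own `uuid`, while an MD's cache also holds the `project_uuid` of its parent. The following sketch shows how that distinction could be used to classify directories. `classify_caches` is a hypothetical helper written for this example, not how `Dataset.scan` is implemented, and the UUID values are placeholders.

```python
import json
import tempfile
from pathlib import Path


def classify_caches(root: Path) -> dict:
    """Map each directory holding a cache file to 'Project' or 'MD'."""
    scopes = {}
    for cache_file in sorted(root.rglob(".mwf_cache.json")):
        with open(cache_file) as f:
            cache = json.load(f)
        rel_dir = cache_file.parent.relative_to(root).as_posix()
        # Only MD caches carry a reference to their parent project
        scopes[rel_dir] = "MD" if "project_uuid" in cache else "Project"
    return scopes


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Placeholder UUIDs, just to build a tiny layout
    caches = {
        "project1": {"uuid": "p1"},
        "project1/replica1": {"uuid": "m1", "project_uuid": "p1"},
        "project3": {"uuid": "p3"},
    }
    for rel_dir, cache in caches.items():
        (root / rel_dir).mkdir(parents=True, exist_ok=True)
        (root / rel_dir / ".mwf_cache.json").write_text(json.dumps(cache))

    scopes = classify_caches(root)

print(scopes)
```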


In [77]:
%load_ext autoreload
%autoreload 2
from mddb_workflow.core.dataset import Dataset

# Initialize the Dataset
db_path = test_dir+"/dataset.db"
# Remove database in case the notebook is re-run
if os.path.exists(db_path):
    os.remove(db_path)

# Create dataset and scan for projects and MDs
ds = Dataset(dataset_path=db_path)
ds.scan(verbose=True)

Adding project: project3 (UUID: 9044d2a2-68af-4060-8303-8299c791242d)
Adding project: project1 (UUID: 82ff7f88-c24c-4e3d-90a2-541721046fc3)
Adding project: project2 (UUID: c527aeef-84bb-4343-97fa-461d0e556f31)
  Adding MD: project1/replica1 (UUID: 957f28b2-1941-4ada-b058-fc1f807629d7, Project UUID: 82ff7f88-c24c-4e3d-90a2-541721046fc3)
  Adding MD: project1/replica2 (UUID: 39f9fbe5-a239-4c31-b398-cb6a6a0f86a0, Project UUID: 82ff7f88-c24c-4e3d-90a2-541721046fc3)
  Adding MD: project2/replica1 (UUID: f7910129-d21c-47b5-bc39-4975e1f8d184, Project UUID: c527aeef-84bb-4343-97fa-461d0e556f31)
  Adding MD: project2/replica2 (UUID: e694403d-1c63-40d1-a9da-eefede93eebc, Project UUID: c527aeef-84bb-4343-97fa-461d0e556f31)
  Adding MD: project2/replica3 (UUID: 493f6e55-26bc-4bc2-8dcc-0ff135a543ed, Project UUID: c527aeef-84bb-4343-97fa-461d0e556f31)


In [78]:
# UUIDs are shortened and paths are shown relative to the dataset.
# To display full UUIDs and absolute paths, use ds.get_dataframe()
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 82ff7f88 |  | Project | project1 | 2.0 | new | No information have been recorded yet. | 11:52:06 07-01-2026 |
| 957f28b2 | 82ff7f88 | MD | project1/replica1 |  | new | No information have been recorded yet. | 11:52:06 07-01-2026 |
| 39f9fbe5 | 82ff7f88 | MD | project1/replica2 |  | new | No information have been recorded yet. | 11:52:06 07-01-2026 |
| c527aeef |  | Project | project2 | 3.0 | new | No information have been recorded yet. | 11:52:06 07-01-2026 |
| f7910129 | c527aeef | MD | project2/replica1 |  | new | No information have been recorded yet. | 11:52:06 07-01-2026 |
| e694403d | c527aeef | MD | project2/replica2 |  | new | No information have been recorded yet. | 11:52:06 07-01-2026 |
| 493f6e55 | c527aeef | MD | project2/replica3 |  | new | No information have been recorded yet. | 11:52:06 07-01-2026 |
| 9044d2a2 |  | Project | project3 | 0.0 | new | No information have been recorded yet. | 11:52:06 07-01-2026 |


In [67]:
# Test adding a new project manually
new_project_dir = 'project4'
os.makedirs(new_project_dir, exist_ok=True)

# Create cache for new project
new_uuid = str(uuid4())
with open(os.path.join(new_project_dir, '.mwf_cache.json'), 'w') as f:
    json.dump({'uuid': new_uuid}, f, indent=4)

print(f"Created new project: {new_project_dir} (UUID: {new_uuid})")

# Add it to dataset
ds.add_project(new_project_dir, verbose=True)

Created new project: project4 (UUID: 244599c8-21b9-4d91-88ea-f5ca6ea65c66)
Adding project: /home/rchaves/repo/MDDB/workflow/docs/source/project4 (UUID: 244599c8-21b9-4d91-88ea-f5ca6ea65c66)


In [68]:
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 0 |  | Project | project1 | 2.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 1 | 0.0 | MD | project1/replica1 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 2 | 0.0 | MD | project1/replica2 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 3 |  | Project | project2 | 3.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 4 | 3.0 | MD | project2/replica1 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 5 | 3.0 | MD | project2/replica2 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 6 | 3.0 | MD | project2/replica3 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 7 |  | Project | project3 | 0.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 8 |  | Project | ../project4 | 0.0 | new | No information have been recorded yet. | 11:44:00 07-01-2026 |


In [69]:
# Test adding a new MD to an existing project
# Get UUID of project1
with open(test_dir+'/project1/.mwf_cache.json', 'r') as f:
    project1_data = json.load(f)
    project1_uuid = project1_data['uuid']

# Create new MD directory
new_md_dir = test_dir+'/project1/replica3'
os.makedirs(new_md_dir, exist_ok=True)

# Create cache for new MD with project_uuid
new_md_uuid = str(uuid4())
with open(os.path.join(new_md_dir, '.mwf_cache.json'), 'w') as f:
    json.dump({
        'uuid': new_md_uuid,
        'project_uuid': project1_uuid
    }, f, indent=4)

print(f"Created new MD: {new_md_dir} (UUID: {new_md_uuid}, Project: {project1_uuid})")

# Add it to dataset
ds.add_project(new_md_dir, verbose=True)

Created new MD: old_dataset/project1/replica3 (UUID: eec1500a-6ee1-4d3c-abff-973d3b695517, Project: 08fe8605-7d20-4dca-8b32-0077412b8147)
Adding project: /home/rchaves/repo/MDDB/workflow/docs/source/old_dataset/project1/replica3 (UUID: eec1500a-6ee1-4d3c-abff-973d3b695517)


In [70]:
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 0 |  | Project | project1 | 2.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 1 | 0.0 | MD | project1/replica1 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 2 | 0.0 | MD | project1/replica2 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 3 |  | Project | project1/replica3 | 0.0 | new | No information have been recorded yet. | 11:44:05 07-01-2026 |
| 4 |  | Project | project2 | 3.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 5 | 4.0 | MD | project2/replica1 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 6 | 4.0 | MD | project2/replica2 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 7 | 4.0 | MD | project2/replica3 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 8 |  | Project | project3 | 0.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 9 |  | Project | ../project4 | 0.0 | new | No information have been recorded yet. | 11:44:00 07-01-2026 |


In [71]:
# Test removing an MD by directory path
ds.remove_entry(test_dir+'/project2/replica1')
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 0 |  | Project | project1 | 2.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 1 | 0.0 | MD | project1/replica1 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 2 | 0.0 | MD | project1/replica2 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 3 |  | Project | project1/replica3 | 0.0 | new | No information have been recorded yet. | 11:44:05 07-01-2026 |
| 4 |  | Project | project2 | 2.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 5 | 4.0 | MD | project2/replica2 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 6 | 4.0 | MD | project2/replica3 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 7 |  | Project | project3 | 0.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 8 |  | Project | ../project4 | 0.0 | new | No information have been recorded yet. | 11:44:00 07-01-2026 |


In [72]:
# Test removing an entire project (will cascade delete all MDs)
ds.remove_entry(test_dir+'/project2/')
ds.dataframe

| uuid | project_uuid | scope | rel_path | num_mds | state | message | last_modified |
|---|---|---|---|---|---|---|---|
| 0 |  | Project | project1 | 2.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 1 | 0.0 | MD | project1/replica1 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 2 | 0.0 | MD | project1/replica2 |  | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 3 |  | Project | project1/replica3 | 0.0 | new | No information have been recorded yet. | 11:44:05 07-01-2026 |
| 4 |  | Project | project3 | 0.0 | new | No information have been recorded yet. | 11:43:23 07-01-2026 |
| 5 |  | Project | ../project4 | 0.0 | new | No information have been recorded yet. | 11:44:00 07-01-2026 |


In [73]:
# Test get_status by directory path
ds.get_status(test_dir+'/project1')

{'uuid': '08fe8605-7d20-4dca-8b32-0077412b8147',
 'abs_path': '/home/rchaves/repo/MDDB/workflow/docs/source/old_dataset/project1',
 'num_mds': 2,
 'state': 'new',
 'message': 'No information have been recorded yet.',
 'last_modified': '11:43:23 07-01-2026',
 'type': 'project'}

In [74]:
ds.status

{'projects': [('08fe8605-7d20-4dca-8b32-0077412b8147',
   '/home/rchaves/repo/MDDB/workflow/docs/source/old_dataset/project1',
   2,
   'new',
   'No information have been recorded yet.',
   '11:43:23 07-01-2026'),
  ('eec1500a-6ee1-4d3c-abff-973d3b695517',
   '/home/rchaves/repo/MDDB/workflow/docs/source/old_dataset/project1/replica3',
   0,
   'new',
   'No information have been recorded yet.',
   '11:44:05 07-01-2026'),
  ('24755cf7-ca3a-4c6a-bef8-63c7a2fc0a67',
   '/home/rchaves/repo/MDDB/workflow/docs/source/old_dataset/project3',
   0,
   'new',
   'No information have been recorded yet.',
   '11:43:23 07-01-2026'),
  ('244599c8-21b9-4d91-88ea-f5ca6ea65c66',
   '/home/rchaves/repo/MDDB/workflow/docs/source/project4',
   0,
   'new',
   'No information have been recorded yet.',
   '11:44:00 07-01-2026')],
 'mds': [('48e4c63c-745f-45a5-bf1f-a89020f959ea',
   '08fe8605-7d20-4dca-8b32-0077412b8147',
   '/home/rchaves/repo/MDDB/workflow/docs/source/old_dataset/project1/replica1',
   '

# Old Dataset

In [12]:
# dataset_dir = '/home/rchaves/ssh_dirs/mn5/res/others/agus_MoDeL-CNS'
# dataset_dir = '/home/rchaves/ssh_dirs/irbcluster/scratch/model-cns/'
# dataset_dir = '/home/rchaves/repo/MDDB/workflow/test/data/input/dataset/'
# dataset_dir = '/home/rchaves/ssh_dirs/mn5/projects/model/'
dataset_dir = '/home/rchaves/ssh_dirs/irbcluster/data/AB-DB/'
# dataset_dir = '/home/rchaves/ssh_dirs/mn5/res/raw/johnson'

In [7]:
# os.path.join adds the separator when dataset_dir has no trailing slash
job_template = os.path.join(dataset_dir, "job_template.sh")
print(job_template)
!cat {job_template}


In [None]:
# To launch the workflow with SLURM
mn5 = True
dt.launch_workflow(
    include_groups=[],
    exclude_groups=[],
    slurm=mn5,
    job_template=job_template,
    pool_size=3,
    n_jobs=0,
    debug=mn5,
    fi={'step': 1},  # -fi step 1
    )

cd /home/rchaves/ssh_dirs/mn5/projects/model/6jzh
Job script: /home/rchaves/ssh_dirs/mn5/projects/model/6jzh/mwf_slurm_job.sh
sbatch --output=logs/mwf_%j.out --error=logs/mwf_%j.err mwf_slurm_job.sh 
cd /home/rchaves/ssh_dirs/mn5/projects/model/6ps7
Job script: /home/rchaves/ssh_dirs/mn5/projects/model/6ps7/mwf_slurm_job.sh
sbatch --output=logs/mwf_%j.out --error=logs/mwf_%j.err mwf_slurm_job.sh 
cd /home/rchaves/ssh_dirs/mn5/projects/model/6ps5
Job script: /home/rchaves/ssh_dirs/mn5/projects/model/6ps5/mwf_slurm_job.sh
sbatch --output=logs/mwf_%j.out --error=logs/mwf_%j.err mwf_slurm_job.sh 


## Command Line

In [None]:
# TODO: make Dataset cache data such as the templates so they do not have to be written every time in the commands. Save them in a special table.

In [None]:
!mwf dataset groups {dataset_yaml_path}

Project groups based on status messages:

Group 0:
Message: Done!
Projects:
  - 6gdg
  - 6j8h
  - 6jzh
  - 6k42
  - 6kux
  - 6kuy
  - 6ni3
  - 6nt3
  - 6oik
  - 6ps5
  - 6qfa
  - 6wjc
  - 7bz2
  - 7cmu
  - 7dhr
  - 7dtd
  - 7e2y
  - 7e2z
  - 7jvr

Group 1:
Message: No output log available
Projects:
  - 6ps7
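
The grouping printed above can be pictured as bucketing projects by their last status message. The sketch below illustrates that idea on a few `(project, message)` pairs copied from the output. It is an illustration of the concept under that assumption, not the actual implementation of `mwf dataset groups`.

```python
from collections import defaultdict

# A few (project, message) pairs taken from the output above
records = [
    ("6gdg", "Done!"),
    ("6j8h", "Done!"),
    ("6ps7", "No output log available"),
    ("6jzh", "Done!"),
]

# Bucket the projects by message, keeping first-appearance order
by_message = defaultdict(list)
for project, message in records:
    by_message[message].append(project)

# Number the groups in the order their message was first seen
groups = list(enumerate(by_message.items()))
for group_id, (message, projects) in groups:
    print(f"Group {group_id}:")
    print(f"Message: {message}")
    for project in sorted(projects):
        print(f"  - {project}")
```

Group IDs assigned this way are only stable as long as the messages keep appearing in the same order, so it is worth listing the groups again before passing IDs to `-ig` or `-eg`.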



In [37]:
!mwf dataset run -h

usage: mwf dataset run [-h] [-ns] [-nc] [-ig [INCLUDE_GROUPS ...]]
                       [-jt JOB_TEMPLATE] [--debug]
                       dataset_yaml

positional arguments:
  dataset_yaml
      Path to the dataset YAML file.

options:
  -h, --help
      show this help message and exit
  -ns, --no_symlinks
      Do not use symlinks internally
  -nc, --no_colors
      Do not use colors for logging
  -ig, --include-groups [INCLUDE_GROUPS ...]
      List of group IDs to be run.
  -eg, --exclude-groups [EXCLUDE_GROUPS ...]
      List of group IDs to be excluded.
  -n, --n_jobs N_JOBS
      Number of jobs to run.
  --slurm
      Submit the workflow to SLURM.
  -jt, --job-template JOB_TEMPLATE
      Path to the SLURM job template file. Required if --slurm is used.
  --debug
      Enable debug mode.


In [36]:
# In cmd: #mwf dataset run dataset.yaml --slurm -jt job_template.sh -eg 3 4 0
!mwf dataset run {dataset_yaml_path} --slurm -jt {job_template} -eg 1 --debug -n 2

cd /home/rchaves/ssh_dirs/mn5/ruben/model/6kuy
sbatch --output=logs/mwf_%j.out --error=logs/mwf_%j.err mwf_slurm_job.sh 
cd /home/rchaves/ssh_dirs/mn5/ruben/model/7cmu
sbatch --output=logs/mwf_%j.out --error=logs/mwf_%j.err mwf_slurm_job.sh 
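
When the command is assembled from a script rather than typed by hand, building it as an argument list avoids quoting mistakes. The sketch below uses only the flags listed in the help output above. `build_run_command` is a hypothetical helper written for this example and is not part of `mwf`.

```python
import shlex


def build_run_command(dataset_yaml, job_template=None, exclude_groups=(),
                      n_jobs=None, debug=False):
    """Assemble an `mwf dataset run` argument list from Python values."""
    # The positional argument goes first, before any multi-value option
    cmd = ["mwf", "dataset", "run", dataset_yaml]
    if job_template is not None:
        # The help text says a job template is required with --slurm
        cmd += ["--slurm", "-jt", job_template]
    if exclude_groups:
        cmd += ["-eg", *map(str, exclude_groups)]
    if n_jobs is not None:
        cmd += ["-n", str(n_jobs)]
    if debug:
        cmd.append("--debug")
    return cmd


cmd = build_run_command("dataset.yaml", job_template="job_template.sh",
                        exclude_groups=[1], n_jobs=2, debug=True)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd)`, while `shlex.join` gives a copy-pasteable string for logging.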
