Automatic ensemble new #721

Aske-Rosted · 2024-05-17T04:27:05Z

The following is an implementation of the feature mentioned in #720.

I had to include a type ignore on line 68, which I was not happy with, but I could not satisfy the pre-commit hooks without it.

The solution uses the pre-existing EnsembleDataset class to automatically combine databases from a list of databases.
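
As a rough illustration of what combining several databases into one ensemble means, here is a minimal, self-contained sketch. `ToyEnsemble` is a hypothetical stand-in written for this example, not graphnet's `EnsembleDataset`, and the "databases" are plain lists. It only shows the general concatenation idea: lengths add up, and a global index is mapped to a (dataset, local offset) pair.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


class ToyEnsemble:
    """Hypothetical stand-in: presents several datasets as one long dataset."""

    def __init__(self, datasets: Sequence[Sequence[str]]) -> None:
        if not datasets:
            raise ValueError("Need at least one dataset.")
        self._datasets = list(datasets)
        # Running totals of the dataset lengths, e.g. [3, 5] for lengths 3 and 2.
        self._cumulative: List[int] = list(
            accumulate(len(d) for d in self._datasets)
        )

    def __len__(self) -> int:
        return self._cumulative[-1]

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Locate the dataset that holds the global index, then the local offset.
        dataset_idx = bisect_right(self._cumulative, index)
        start = self._cumulative[dataset_idx - 1] if dataset_idx else 0
        return self._datasets[dataset_idx][index - start]


# Two "databases", represented here as plain lists of event labels.
db_a = ["a0", "a1", "a2"]
db_b = ["b0", "b1"]
ensemble = ToyEnsemble([db_a, db_b])
```

With this sketch, `len(ensemble)` is the sum of the member lengths, and indices past the end of `db_a` fall through to `db_b`.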

merge from main

RasmusOrsoe · 2024-05-17T11:45:47Z

src/graphnet/data/dataloader.py

-            datasets = Dataset.from_config(config)
+
+            if isinstance(config.path, list):
+                datasets: Union[Dict[str, Dataset], Dict[str, EnsembleDataset]] = {}  # type: ignore


I think the reason for your problem with mypy could be that you define datasets to contain both Dataset and EnsembleDataset, but in the typehints of DataLoader the dataset is only type hinted to be dataset: Dataset. See here. I think if you changed that type hint to dataset: Union[Dataset, EnsembleDataset], the type hinting should be fine. Did you try this?
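
To make the type-hint suggestion concrete, here is a small sketch using hypothetical stand-in classes (`ToyDataset`, `ToyEnsembleDataset`, and `describe` are invented for this example and are not graphnet code). It assumes the two classes are unrelated as far as the type checker is concerned, which is the situation where a parameter annotated with only one of them would reject the other.

```python
from typing import List, Union


class ToyDataset:
    """Hypothetical single-file dataset."""

    def __init__(self, n_events: int) -> None:
        self._n_events = n_events

    def __len__(self) -> int:
        return self._n_events


class ToyEnsembleDataset:
    """Hypothetical ensemble; deliberately NOT a subclass of ToyDataset."""

    def __init__(self, datasets: List[ToyDataset]) -> None:
        self._datasets = datasets

    def __len__(self) -> int:
        return sum(len(d) for d in self._datasets)


def describe(dataset: Union[ToyDataset, ToyEnsembleDataset]) -> str:
    """Accept either kind of dataset.

    Annotating the parameter as `dataset: ToyDataset` alone would make a
    type checker flag calls that pass a ToyEnsembleDataset.
    """
    return f"{type(dataset).__name__} with {len(dataset)} events"


single = ToyDataset(3)
combined = ToyEnsembleDataset([ToyDataset(3), ToyDataset(2)])
```

Both `describe(single)` and `describe(combined)` type-check under the `Union` annotation, so no `# type: ignore` is needed at the call site in this toy setup.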

RasmusOrsoe

Hey @Aske-Rosted - Thank you very much for adding this new feature. I added a hint on what might be causing the mypy challenges that you mentioned.

Given that this is new functionality, could you be persuaded to add a small unit test for this here? I added an example of what such a test could look like below:

@pytest.mark.order(6)
@pytest.mark.parametrize("backend", ["sqlite"])
def test_dataset_config_dict_selection(backend: str) -> None:
    """Test constructing Dataset with multiple data paths."""
    # Arrange
    config_path = CONFIG_PATHS[backend]

    # Single dataset
    config = DatasetConfig.load(config_path)
    dataset = Dataset.from_config(config)
    # Construct multiple datasets
    config_ensemble = DatasetConfig.load(config_path)
    config_ensemble.path = [config_ensemble.path, config_ensemble.path]

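    # Build from the modified config, not the single-path one
    ensemble_dataset = Dataset.from_config(config_ensemble)

    # Assert: the same file listed twice should give twice the events
    assert len(dataset) * 2 == len(ensemble_dataset)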

Aske-Rosted · 2024-05-28T07:38:37Z

Hey Rasmus, I looked into the suggestion for a bit, but since the automatic ensembling happens in the dataloader and not in the Dataset.from_config(config) call, the suggested test does not exercise the ensembling code. This does, however, expose an issue with the current implementation of the automatic ensembling: a dataset config file with a list of paths works fine as long as you only feed it to a dataloader, but if you try to load the dataset manually from the config using the Dataset class, it fails because the path is a list of files rather than a single file.

RasmusOrsoe · 2024-05-28T09:21:41Z

@Aske-Rosted thanks for pointing this out. It's an essential function to be able to load datasets from their respective configuration files, so I think we need to make sure that this new usage doesn't break the core intention of the files.

I had a quick look at the code again, and I think if you just moved (with slight modifications) the new logic to handle multiple paths into Dataset.from_config (see https://github.com/Aske-Rosted/graphnet/blob/7baa60d2fd729e84214a9f1c5a1d0882cfeac14a/src/graphnet/data/dataset/dataset.py#L107) the problem would be solved. What do you think?
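
A minimal sketch of that idea, under stated assumptions: `ToyConfig`, `ToyDataset`, and `ToyEnsemble` are hypothetical stand-ins invented for this example, and the real `Dataset.from_config` and `DatasetConfig` have more fields and logic than shown here. The only point is the dispatch: when `path` is a list, build one dataset per path and wrap them in an ensemble; otherwise build a single dataset as before.

```python
from dataclasses import dataclass, replace
from typing import List, Union


@dataclass
class ToyConfig:
    """Hypothetical config: `path` may be one file or a list of files."""

    path: Union[str, List[str]]
    n_events: int = 3  # pretend every file holds this many events


class ToyEnsemble:
    """Hypothetical ensemble whose length is the sum of its members."""

    def __init__(self, datasets: List["ToyDataset"]) -> None:
        self.datasets = datasets

    def __len__(self) -> int:
        return sum(len(d) for d in self.datasets)


class ToyDataset:
    """Hypothetical dataset backed by a single file."""

    def __init__(self, path: str, n_events: int) -> None:
        self.path = path
        self._n_events = n_events

    def __len__(self) -> int:
        return self._n_events

    @classmethod
    def from_config(
        cls, config: ToyConfig
    ) -> Union["ToyDataset", ToyEnsemble]:
        if isinstance(config.path, list):
            # One dataset per path, each built from a copy of the config
            # whose `path` is a single string.
            members = [cls(p, config.n_events) for p in config.path]
            return ToyEnsemble(members)
        return cls(config.path, config.n_events)


single = ToyDataset.from_config(ToyConfig(path="events.db"))
ensemble = ToyDataset.from_config(
    replace(ToyConfig(path="events.db"), path=["events.db", "events.db"])
)
```

Because the list handling lives in `from_config`, the same config works both when loaded directly and when passed through a dataloader that calls `from_config` internally, and a test comparing `len(single) * 2` with `len(ensemble)` exercises the ensembling path.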

Aske-Rosted · 2024-05-29T02:27:20Z

I agree that is probably the more appropriate location to handle multiple file paths. I will have a look at it.

Aske-Rosted added 4 commits March 13, 2024 14:05

automatic ensemble creation from list

9cf8ad3

Merge branch 'main' into automatic_ensemble

a4f2ef2

merge from main

update embedding to main

ad17223

Merge branch 'automatic_ensemble' into automatic_ensemble_new

d0d9874

RasmusOrsoe reviewed May 17, 2024

Aske-Rosted added 4 commits May 28, 2024 14:37

typing fix

b4fa590

add test

0932394

Merge branch 'main' into automatic_ensemble_new

de7e10b

remove test

7baa60d
