
Automatic ensemble new #721

Open
wants to merge 8 commits into main

Conversation

Aske-Rosted
Collaborator

The following is an implementation of the feature mentioned in #720.

I had to include a "type: ignore" on line 68, which I personally was not too happy with, but I did not manage to satisfy the pre-commit hooks without it.

The solution uses the pre-existing EnsembleDataset class to automatically combine the databases given as a list of paths into a single dataset.

datasets = Dataset.from_config(config)

if isinstance(config.path, list):
    datasets: Union[Dict[str, Dataset], Dict[str, EnsembleDataset]] = {}  # type: ignore
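For context, the combining behavior can be sketched with simplified stand-ins (these placeholder classes are illustrative only, not graphnet's actual Dataset/EnsembleDataset implementations):

```python
from typing import List


class Dataset:
    """Stand-in for a single-database graphnet Dataset."""

    def __init__(self, path: str) -> None:
        self.path = path

    def __len__(self) -> int:
        return 10  # pretend every database holds 10 events


class EnsembleDataset:
    """Stand-in: presents several Datasets as one combined dataset."""

    def __init__(self, datasets: List[Dataset]) -> None:
        self.datasets = datasets

    def __len__(self) -> int:
        return sum(len(d) for d in self.datasets)


# The PR's idea: when the config holds a list of paths, build one
# Dataset per path and wrap them all in a single EnsembleDataset.
paths = ["first.db", "second.db"]  # hypothetical database paths
ensemble = EnsembleDataset([Dataset(p) for p in paths])
```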
Collaborator

I think the reason for your problem with mypy could be that you define datasets to contain both Dataset and EnsembleDataset, but in the type hints of DataLoader the dataset argument is hinted only as dataset: Dataset. See here. I think if you changed that type hint to dataset: Union[Dataset, EnsembleDataset], the type checking should be fine. Did you try this?
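A minimal, self-contained sketch of the suggested widening (placeholder classes, not graphnet's real DataLoader signature; it assumes, as the reported mypy error suggests, that EnsembleDataset is not a subclass of Dataset):

```python
from typing import Union


class Dataset:
    """Placeholder for graphnet's Dataset."""


class EnsembleDataset:
    """Placeholder; deliberately NOT a subclass of Dataset, mirroring
    why the narrow hint `dataset: Dataset` would reject it."""


# With `def make_loader(dataset: Dataset) -> str:` mypy would flag an
# EnsembleDataset argument. Widening the hint accepts both, with no
# need for a "type: ignore":
def make_loader(dataset: Union[Dataset, EnsembleDataset]) -> str:
    """Pretend to build a dataloader; just report what it received."""
    return type(dataset).__name__
```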

Collaborator

@RasmusOrsoe left a comment

Hey @Aske-Rosted - Thank you very much for adding this new feature. I added a hint on what might be causing the mypy challenges you mentioned.

Given that this is new functionality, could you be persuaded to add a small unit test for this here? I added an example of what such a test could look like below:

@pytest.mark.order(6)
@pytest.mark.parametrize("backend", ["sqlite"])
def test_dataset_config_dict_selection(backend: str) -> None:
    """Test constructing Dataset with multiple data paths."""
    # Arrange
    config_path = CONFIG_PATHS[backend]

    # Single dataset
    config = DatasetConfig.load(config_path)
    dataset = Dataset.from_config(config)
    # Construct multiple datasets
    config_ensemble = DatasetConfig.load(config_path)
    config_ensemble.path = [config_ensemble.path, config_ensemble.path]

    ensemble_dataset = Dataset.from_config(config_ensemble)

    assert len(dataset)*2 == len(ensemble_dataset)

@Aske-Rosted
Collaborator Author


Hey Rasmus, I looked into the suggestion for a bit, but since this automatic ensembling happens in the dataloader, and not in the Dataset.from_config(config) call, the current test suggestion does not exercise the ensembling code. This does, however, illuminate an issue with the current implementation of the automatic ensembling: you can have a dataset config file that works fine as long as you only feed it to a dataloader, but if you try to manually load the dataset from the config using the Dataset class, it will fail because the config holds a list of files rather than a single file...

@RasmusOrsoe
Collaborator

@Aske-Rosted thanks for pointing this out. It's an essential function to be able to load datasets from their respective configuration files, so I think we need to make sure that this new usage doesn't break the core intention of the files.

I had a quick look at the code again, and I think if you just moved (with slight modifications) the new logic to handle multiple paths into Dataset.from_config (see https://github.com/Aske-Rosted/graphnet/blob/7baa60d2fd729e84214a9f1c5a1d0882cfeac14a/src/graphnet/data/dataset/dataset.py#L107) the problem would be solved. What do you think?
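One way that move could look, again sketched with hypothetical stand-ins (the real Dataset.from_config takes a DatasetConfig and does considerably more):

```python
from typing import List, Union


class SimpleConfig:
    """Hypothetical stand-in for graphnet's DatasetConfig."""

    def __init__(self, path: Union[str, List[str]]) -> None:
        self.path = path


class EnsembleDataset:
    """Stand-in that concatenates member datasets."""

    def __init__(self, datasets: List["Dataset"]) -> None:
        self.datasets = datasets

    def __len__(self) -> int:
        return sum(len(d) for d in self.datasets)


class Dataset:
    """Stand-in for graphnet's Dataset."""

    def __init__(self, path: str) -> None:
        self.path = path

    def __len__(self) -> int:
        return 5  # pretend each database holds 5 events

    @classmethod
    def from_config(
        cls, config: SimpleConfig
    ) -> Union["Dataset", EnsembleDataset]:
        # Handling the list case here means both the dataloader and a
        # direct Dataset.from_config(...) call support multiple paths.
        if isinstance(config.path, list):
            return EnsembleDataset([cls(p) for p in config.path])
        return cls(config.path)
```

With this placement, the unit test suggested earlier would exercise the ensembling code directly, since from_config on a list-valued path would return an EnsembleDataset twice the length of the single dataset.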

@Aske-Rosted
Collaborator Author


I agree that is probably the more appropriate location to handle multiple file paths. I will have a look at it.
