-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic ensemble new #721
base: main
Are you sure you want to change the base?
Automatic ensemble new #721
Conversation
src/graphnet/data/dataloader.py
Outdated
datasets = Dataset.from_config(config) | ||
|
||
if isinstance(config.path, list): | ||
datasets: Union[Dict[str, Dataset], Dict[str, EnsembleDataset]] = {} # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason for your problem with mypy could be that you define datasets
to contain both Dataset
and EnsembleDataset
, but in the typehints of DataLoader
the dataset
is only type hinted to be dataset: Dataset
. See here. I think if you changed that type hint to dataset: Union[Dataset, EnsembleDataset]
, the type hinting should be fine. Did you try this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Aske-Rosted - Thank you very much for adding this new feature. I added a hint on what might be causing the mypy challenges that you mentioned.
Given that this is new functionality, could you be persuaded to add a small unit test for this here? I added an example of how such a test could look like below:
@pytest.mark.order(6)
@pytest.mark.parametrize("backend", ["sqlite"])
def test_dataset_config_dict_selection(backend: str) -> None:
"""Test constructing Dataset with multiple data paths."""
# Arrange
config_path = CONFIG_PATHS[backend]
# Single dataset
config = DatasetConfig.load(config_path)
dataset = Dataset.from_config(config)
# Construct multiple datasets
config_ensemble = DatasetConfig.load(config_path)
config_ensemble.path = [config_ensemble.path, config_ensemble.path]
ensemble_dataset = Dataset.from_config(config)
assert len(dataset)*2 == len(ensemble_dataset)
Hey Rasmus looked into the suggestion for a bit but since this automatic ensembling is happing during the dataloader, and not in the |
@Aske-Rosted thanks for pointing this out. It's an essential function to be able to load datasets from their respective configuration files, so I think we need to make sure that this new usage doesn't break the core intention of the files. I had a quick look at the code again, and I think if you just moved (with slight modifications) the new logic to handle multiple paths into |
I agree that is probably the more appropriate location to handle multiple file paths. I will have a look at it. |
the following is an implementation of the feature mentioned in #720.
I had to include a type ignore on line 68. which I personally was not too happy with, but I did not manage to satisfy the pre-commit hooks without it.
The solution uses the pre-existing
EnsembleDataset
class to automatically combine databases from a list of databases.