Model Output File Management Class For gempyor #257

Open
TimothyWillard opened this issue Jul 18, 2024 · 6 comments
Labels
enhancement (Request for improvement or addition of new feature(s).), gempyor (Concerns the Python core.), low priority (Low priority.)

@TimothyWillard
Contributor

File IO related to model output is a bit scattered at the moment and difficult to test. There are also assumptions about the output directory structure spread throughout the package. These are challenging to change because there is no single place to change them, let alone unit test them.

A helpful abstraction would be a ModelOutput class where each instance corresponds to a single model output folder. The class would have methods for reading/writing files of a particular type (e.g. hosp, spar, etc.), the ability to accept an arbitrary FolderIODriver that handles the reading/writing of files to a folder (e.g. ParquetFolderIODriver, CsvFolderIODriver, etc.), and the ability to construct an instance from an existing folder for ease of use in post-processing/analysis.
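As a rough illustration of the proposed abstraction, here is a minimal sketch that uses only the standard library. ModelOutput, FolderIODriver, and CsvFolderIODriver are the names from the proposal above, but every method name, the row type, and the `<file_type>/<index>.<file_type>.<extension>` layout are assumptions made for this example, not gempyor's actual API or directory structure.

```python
import csv
from pathlib import Path
from typing import Protocol

Rows = list[dict[str, str]]


class FolderIODriver(Protocol):
    """Hypothetical driver interface; the method names are assumptions."""

    extension: str

    def write(self, path: Path, rows: Rows) -> None: ...

    def read(self, path: Path) -> Rows: ...


class CsvFolderIODriver:
    """Stdlib-only CSV driver; every value is read back as a string."""

    extension = "csv"

    def write(self, path: Path, rows: Rows) -> None:
        if not rows:
            raise ValueError("need at least one row to derive the header")
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    def read(self, path: Path) -> Rows:
        with path.open(newline="") as handle:
            return list(csv.DictReader(handle))


class ModelOutput:
    """One instance per model output folder; all path logic lives here."""

    def __init__(self, folder, driver: FolderIODriver) -> None:
        self.folder = Path(folder)
        self.driver = driver

    @classmethod
    def from_existing(cls, folder, driver: FolderIODriver) -> "ModelOutput":
        # Convenience constructor for post-processing an existing run.
        if not Path(folder).is_dir():
            raise FileNotFoundError(f"no such output folder: {folder}")
        return cls(folder, driver)

    def path_for(self, file_type: str, index: int) -> Path:
        # Assumed layout for illustration only, not gempyor's real one.
        name = f"{index:09d}.{file_type}.{self.driver.extension}"
        return self.folder / file_type / name

    def write(self, file_type: str, index: int, rows: Rows) -> None:
        self.driver.write(self.path_for(file_type, index), rows)

    def read(self, file_type: str, index: int) -> Rows:
        return self.driver.read(self.path_for(file_type, index))
```

With this shape, swapping the file format means passing a different driver, and the directory layout can be changed and unit tested in `path_for` alone.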

We will also need to document the output structure as part of this (relates to GH-229). There has already been some related discussion in GH-198.

I'll leave it to @jcblemai to comment on priority and fill in other details I have missed here, but I think this covers the main points.

TimothyWillard added the enhancement and gempyor labels on Jul 18, 2024

@twallema
Member

Hi Timothy, could you read through #253 on using xarray as the primary simulation output and see if it's complementary to this issue?

@TimothyWillard
Contributor Author

TimothyWillard commented Jul 18, 2024

Hi @twallema! Funnily enough, the combination of reading that issue and working on another one spurred this thought. When discussing this issue with @jcblemai I even mentioned the possibility of other IO formats (hence the folder driver idea, or whatever we end up calling it). This setup would make it easier to just use CSV files for everything when working with sample/testing configs, or potentially to use netcdf for xarray objects in the future.

The first pass focus will be on centralizing the directory structure logic though.

@pearsonca
Contributor

I have run into what I think is a related problem: the outputs aren't just nested with their corresponding configuration files.

So currently one can: get config + know infrastructure => find output files (assuming a bunch of defaults). Alternatively, get output file + know infrastructure => find config file.

For my particular use, it seems possible to infer configuration entries etc. from the data, but all of this is a bit painful / fragile long term. Basically, I want something like a single entry point object (for users / tools), which then knows how to inflate the concepts of interest independent of the underlying representation. The tools should be able to easily discover which representation is present (csv vs arrow vs database vs ...) and abstract that for the user.
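One simple way a tool could discover which representation is present is to look at the file extensions inside the run folder. The sketch below is only an illustration: the function name, the extension table, and the "most common extension wins" rule are all assumptions, and it covers file-based formats only (not a database).

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file suffix to representation name.
KNOWN_SUFFIXES = {".csv": "csv", ".parquet": "parquet", ".nc": "netcdf"}


def detect_representation(folder):
    """Return the most common known representation under folder, or None."""
    counts = Counter(
        KNOWN_SUFFIXES[path.suffix]
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix in KNOWN_SUFFIXES
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

An entry point object could call something like this once and then pick the matching reader, so the user never has to name the format.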

@TimothyWillard
Contributor Author

@pearsonca I do not fully understand your comment. Could you provide a concrete example of what you're describing? I don't think configuration files fall under this issue; that might be better as a separate issue.

TimothyWillard added the low priority label on Sep 30, 2024

@pearsonca
Contributor

Sure: let's say I want to plot some outputs from a run.

I'd like to be able to do a somewhat-useful version of that just given the enclosing folder for that run. Given the known folder structure, perfectly fine to descend and grab the relevant results file(s).

But with the file(s) read in, I still have to introspect out all the features (e.g. compartments, populations, etc.). The alternative would be to grab those from the corresponding configuration file.

So: either have to also provide its location OR attempt to find it based on the output folder location (+some other introspection).

I think the same problem will arise for a hypothetical ModelOutput object: it's probably going to want to know about the configuration associated with the output to properly structure itself.

One easy way to solve this might be to write a snapshot of the config file to the output directory?
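A minimal sketch of the snapshot idea, assuming the config is a single file and that a fixed file name inside the output directory is acceptable. The function names and the `config_snapshot.yml` file name are hypothetical and not part of gempyor.

```python
import shutil
from pathlib import Path

# Hypothetical file name for the copied config.
SNAPSHOT_NAME = "config_snapshot.yml"


def snapshot_config(config_path, output_dir):
    """Copy the config used for a run into that run's output directory."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    destination = output_dir / SNAPSHOT_NAME
    shutil.copy2(config_path, destination)
    return destination


def find_config_snapshot(output_dir):
    """Return the snapshot path if the run folder has one, else None."""
    candidate = Path(output_dir) / SNAPSHOT_NAME
    return candidate if candidate.is_file() else None
```

A plotting tool given only the run folder could then call `find_config_snapshot` and fall back to introspecting the data when it returns None, for example on runs produced before snapshots existed.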

@TimothyWillard
Contributor Author

Ah, I see. That seems slightly larger in scope than what is currently described in this issue, and it involves changing the output structure slightly to add either a copy of the config or a parsed version of it. I'll defer to @jcblemai, but I think changing the output structure is challenging for legacy reasons? I suppose adding a new directory shouldn't be as bad, since it maintains backwards compatibility.


3 participants