Model Output File Management Class For `gempyor` #257

TimothyWillard · 2024-07-18T13:47:55Z

File IO related to model output is a bit scattered at the moment and difficult to test. Assumptions about the output directory structure are also spread throughout the package, which makes them challenging to change since that cannot be done in one place, let alone unit tested.

A helpful abstraction would be a `ModelOutput` class where each instance would correspond to a single model output folder. The class would have methods for reading/writing files of a particular type (e.g. hosp, spar, etc.), the ability to accept an arbitrary `FolderIODriver` that will handle the reading/writing of files to a folder (e.g. `ParquetFolderIODriver`, `CsvFolderIODriver`, etc.), and the ability to construct an instance from an existing folder for ease of use in post-processing/analysis.
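To make the proposal concrete, here is a minimal sketch of how the pieces could fit together. Nothing here exists in `gempyor` yet: the method names (`path_for`, `read`, `write`), the `extension` attribute, and the directory layout are all assumptions made for illustration, and the CSV driver uses only the standard library rather than whatever the package would actually use.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path
from typing import Protocol


class FolderIODriver(Protocol):
    """Hypothetical interface: how a table of rows is stored on disk."""

    extension: str

    def write(self, path: Path, rows: list[dict[str, str]]) -> None: ...

    def read(self, path: Path) -> list[dict[str, str]]: ...


class CsvFolderIODriver:
    """Stores rows as CSV using only the standard library."""

    extension = "csv"

    def write(self, path: Path, rows: list[dict[str, str]]) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        fieldnames = list(rows[0]) if rows else []
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

    def read(self, path: Path) -> list[dict[str, str]]:
        with path.open(newline="") as f:
            return list(csv.DictReader(f))


class ModelOutput:
    """One instance per model output folder."""

    def __init__(self, folder: Path, driver: FolderIODriver) -> None:
        self.folder = Path(folder)
        self.driver = driver

    def path_for(self, file_type: str, index: int) -> Path:
        # The only place that knows the directory layout.
        # This layout is invented for the sketch, not the real gempyor one.
        name = f"{index:09d}.{file_type}.{self.driver.extension}"
        return self.folder / file_type / name

    def write(self, file_type: str, index: int, rows: list[dict[str, str]]) -> None:
        self.driver.write(self.path_for(file_type, index), rows)

    def read(self, file_type: str, index: int) -> list[dict[str, str]]:
        return self.driver.read(self.path_for(file_type, index))


with tempfile.TemporaryDirectory() as tmp:
    output = ModelOutput(Path(tmp), CsvFolderIODriver())
    output.write("hosp", 1, [{"date": "2024-01-01", "incidH": "3"}])
    roundtrip = output.read("hosp", 1)
    relative_path = output.path_for("hosp", 1).relative_to(tmp).as_posix()
```

Because the layout lives only in `path_for`, a unit test can point a `ModelOutput` at a temporary directory and check paths and round trips without running a model. Constructing an instance from an existing folder is the same call with a path that already contains files.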

We will also need to document the output structure as a part of this (relates to GH-229). There has also already been some related discussion in GH-198.

I'll leave it to @jcblemai to comment on priority and fill in other details I have missed here, but I think this covers the main points.


twallema · 2024-07-18T14:16:55Z

Hi Timothy, could you read through #253 on using xarray as the primary simulation output and see if it's complementary to this issue?

TimothyWillard · 2024-07-18T14:25:01Z

Hi @twallema! Funnily enough, reading that issue while working on another one is what spurred this thought. And when discussing this issue with @jcblemai I even mentioned the possibility of other IO formats (hence the folder driver idea or whatever we end up calling it). This setup would make it easier to just use CSV files for everything when working with sample/testing configs or potentially using NetCDF for xarray objects in the future.

The first pass focus will be on centralizing the directory structure logic though.

pearsonca · 2024-09-30T16:36:14Z

I have run into what I think is a related problem: the outputs aren't just nested with their corresponding configuration files.

So currently can: get config + know infrastructure => find output files (assuming a bunch of defaults). Alternatively, get output file + know infrastructure => find config file.

For my particular use, seems possible to infer configuration entries etc from the data, but all of this is a bit painful / fragile long term. Basically, want something like a single entry point object (for users / tools), which then knows how to inflate the concepts of interest independent of the underlying representation - the tools should be able to easily discover which representation is present (csv vs arrow vs database vs ...) and abstract that for the user.
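One way the "discover which representation is present" step could look, as a rough sketch only: the function name `detect_format` and the suffix-to-format table are hypothetical, and a real version would need to cover whatever formats the package actually writes (a database, for instance, would not be found by file suffix).

```python
import tempfile
from pathlib import Path

# Hypothetical mapping from file suffix to a representation name.
KNOWN_FORMATS = {".csv": "csv", ".parquet": "parquet", ".nc": "netcdf"}


def detect_format(folder: Path) -> str:
    """Return the single on-disk representation used under `folder`."""
    found = {
        KNOWN_FORMATS[path.suffix]
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix in KNOWN_FORMATS
    }
    if not found:
        raise FileNotFoundError(f"No recognised output files under {folder}")
    if len(found) > 1:
        raise ValueError(f"Mixed representations under {folder}: {sorted(found)}")
    return found.pop()


with tempfile.TemporaryDirectory() as tmp:
    hosp_dir = Path(tmp) / "hosp"
    hosp_dir.mkdir()
    (hosp_dir / "000000001.hosp.csv").write_text("date,incidH\n2024-01-01,3\n")
    detected = detect_format(Path(tmp))
```

A single entry point object could call something like this once and then pick the matching reader, so users and tools never have to name the format themselves.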

TimothyWillard · 2024-09-30T17:50:11Z

@pearsonca I do not understand your comment, could you perhaps provide a concrete example of what you're describing? I don't think configuration files fall under this issue, might be better as a separate issue.

pearsonca · 2024-10-01T13:08:00Z

Sure: let's say I want to plot some outputs from a run.

I'd like to be able to do a somewhat-useful version of that just given the enclosing folder for that run. Given the known folder structure, perfectly fine to descend and grab the relevant results file(s).

But with the file(s) read in, still have to introspect out all the features (e.g. compartments, populations, etc). The alternative would be to grab those from the corresponding configuration file.

So: either have to also provide its location OR attempt to find it based on the output folder location (+some other introspection).

I think the same problem will arise for a hypothetical `ModelOutput` object - it's probably going to want to know about the configuration associated with the output to properly structure itself.

One easy way to solve this might be to write a snapshot of the config file to the output directory?
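A minimal sketch of the snapshot idea, assuming the config is a single file and that the snapshot sits at the top of the output directory. The file name `config_snapshot.yml` and both function names are made up for illustration; nothing like this exists in the package today.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional

SNAPSHOT_NAME = "config_snapshot.yml"  # hypothetical file name


def snapshot_config(config_path: Path, output_dir: Path) -> Path:
    """Copy the config used for a run into that run's output directory."""
    output_dir.mkdir(parents=True, exist_ok=True)
    destination = output_dir / SNAPSHOT_NAME
    shutil.copy2(config_path, destination)
    return destination


def find_config(output_dir: Path) -> Optional[Path]:
    """Return the snapshot path for an output directory, if one was written."""
    candidate = output_dir / SNAPSHOT_NAME
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "config.yml"
    config.write_text("name: sample\n")
    run_dir = Path(tmp) / "model_output"

    missing = find_config(run_dir)  # no snapshot has been written yet
    snapshot_config(config, run_dir)
    found = find_config(run_dir)
    found_name = found.name if found else None
    snapshot_text = found.read_text() if found else None
```

With a snapshot in place, a plotting tool given only the enclosing folder can read compartments and populations from the config instead of inferring them from the data. Output folders written before the change would simply return `None` here.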

TimothyWillard · 2024-10-01T14:25:58Z

Ah, I see. That seems slightly larger in scope than what is described in this issue currently and involves changing the output structure slightly to now add either just a copy of the config or a parsed version of it. I'll defer to @jcblemai but I think changing output structure is challenging for legacy reasons? I suppose adding a new directory shouldn't be as bad since it maintains backwards compatibility.

TimothyWillard added the enhancement (Request for improvement or addition of new feature(s).) and gempyor (Concerns the Python core.) labels Jul 18, 2024

saraloo mentioned this issue Jul 19, 2024

Structure of simulation output is not generalizable to n-dimensional epi models (n subpopulations) #253

Open

twallema mentioned this issue Jul 23, 2024

Tryout of joint parquet/xarray output #264

Draft

TimothyWillard added this to the flepiMoP 3.0 milestone Aug 26, 2024

TimothyWillard added the low priority Low priority. label Sep 30, 2024
