
Fetch and load public base model snapshot #102

Merged
merged 27 commits into main from enh/zenodo-snapshot on Jun 20, 2023

Conversation

@khaeru (Member) commented May 24, 2023

Following the release of https://doi.org/10.5281/zenodo.5793870, this PR adds code to fetch the snapshot from Zenodo and load it into a Platform of the user's choice, for various uses including further development and testing of other code.

Notes

Units

The first snapshot contains the unit strings "USD_2005/t" and "USD_2005/t " (the latter with a trailing space).

  • With the current (only) JDBCBackend, the ixmp_source Java code rejects an attempt to define either of these unit strings if the other is already defined. However, on the ixmp-dev platform (IIASA ECE's internal Oracle database), both of these unit strings are already defined (it is unclear how this happened; possibly it was done with an older version of the ixmp_source Java code).
  • As a result, attempting to load the snapshot with the built-in Scenario.read_excel() fails.
  • Further, rewriting just a few values in the snapshot Excel file using pandas or openpyxl has very poor performance.
  • This PR works around these issues using a utility function, .snapshot._unpack(), which unpacks ("explodes") the entire Excel snapshot into one compressed CSV file per parameter; these files are then added to the scenario individually. A rough sketch of the idea follows this list.
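
A rough sketch of the unpack idea, for orientation only — this is not the actual .snapshot._unpack() implementation, and the "unit" column name and output layout are assumptions:

```python
from pathlib import Path

import pandas as pd


def unpack_sketch(snapshot_xlsx: Path, out_dir: Path) -> None:
    """Explode an Excel snapshot into one compressed CSV file per sheet."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Read every sheet once; this single call dominates the run time
    sheets = pd.read_excel(snapshot_xlsx, sheet_name=None, engine="openpyxl")

    for name, df in sheets.items():
        # Work around the duplicated unit strings by stripping the trailing
        # space, so only "USD_2005/t" is ever defined
        if "unit" in df.columns:
            df["unit"] = df["unit"].str.strip()

        # pandas infers gzip compression from the ".csv.gz" suffix
        df.to_csv(out_dir / f"{name}.csv.gz", index=False)
```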

Testing

  • Initially, running the entire process failed on GitHub Actions macOS and Windows runners with java.lang.OutOfMemoryError: Java heap space.
  • On Linux, the job would time out after 6 hours, which seems excessive: on my machine, the conversion step only takes about 5 minutes.
  • See #102 (comment) below: most of this time is simply reading the Excel file using pandas/openpyxl.
  • This PR works around these issues with the following steps:
    • The already-existing --jvmargs pytest option (defined in message_ix_models.testing) is used to increase JVM heap space to 6 GB; this is slightly below the total available on GHA runners.
    • An already-unpacked set of files is added in message_ix_models/data/test/MESSAGEix-GLOBIOM…
      These files are excluded from packaging (MANIFEST.in).
    • A fixture is added, unpacked_snapshot_data, which moves these files into the location they would be unpacked to.
      The files are thus not read from Excel again when the tests execute. (A sketch of such a fixture follows this list.)
    • test_snapshot.test_load is limited to the ubuntu-latest runners, i.e. skipped on macOS and Windows. This is because it increases the total job run time from:
      • Linux: ~3 to ~9–11 minutes = still tolerable.
      • macOS: ~8 to ~37–41 minutes = too long.
      • Windows: ~7 minutes to >2.5 hours = ditto.
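
A rough sketch of such a fixture — the body here is illustrative, not the actual unpacked_snapshot_data implementation, and the package-data path is an assumption:

```python
import shutil
from pathlib import Path

import pytest

# Illustrative location of the pre-unpacked .csv.gz files shipped as test
# data; the real fixture uses message_ix_models' own data and cache paths
PACKAGE_DATA = Path(__file__).parents[1] / "data" / "test" / "MESSAGEix-GLOBIOM"


@pytest.fixture
def unpacked_snapshot_data(tmp_path: Path) -> Path:
    """Place pre-unpacked files where _unpack() would write them.

    With the files already present, the slow Excel read is skipped when the
    tests execute.
    """
    target = tmp_path / "snapshot-0"
    shutil.copytree(PACKAGE_DATA, target)
    return target
```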

Other changes

  • Update from Python 3.10 to 3.11 in CI workflows.
  • All XLSX files are stored using Git LFS; this includes those previously added in #88 (Add MESSAGEix-Nexus module).

How to review

  • Read the added documentation and ensure it is clear about how to use the code.
  • Run mix-models snapshot fetch 0 on the branch; confirm the code works.
  • Note that the CI checks all pass.

PR checklist

  • Continuous integration checks all ✅
  • Add or expand tests; coverage checks both ✅
  • Add, expand, or update documentation.
  • Update doc/whatsnew.

@khaeru added the "enh (New features or functionality)" label on May 24, 2023
@khaeru self-assigned this on May 24, 2023
@khaeru marked this pull request as draft on May 24, 2023 09:01
@codecov (bot) commented May 24, 2023

Codecov Report

Merging #102 (2c685e3) into main (1cd26f9) will increase coverage by 0.76%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
+ Coverage   67.11%   67.87%   +0.76%     
==========================================
  Files          58       61       +3     
  Lines        4011     4106      +95     
==========================================
+ Hits         2692     2787      +95     
  Misses       1319     1319              
Impacted Files                                   Coverage Δ
message_ix_models/cli.py                         100.00% <100.00%> (ø)
message_ix_models/model/snapshot.py              100.00% <100.00%> (ø)
message_ix_models/testing.py                     100.00% <100.00%> (ø)
message_ix_models/tests/model/test_snapshot.py   100.00% <100.00%> (ø)
message_ix_models/tests/util/test_context.py     100.00% <100.00%> (ø)
message_ix_models/util/pooch.py                  100.00% <100.00%> (ø)

@khaeru (Member, Author) commented Jun 16, 2023

> On Linux, the job times out after 6 hours, which seems excessive: on my machine, the conversion step only takes about 5 minutes.

One solution here would be to increase the JVM heap space for the JDBCBackend used in the tests when on GHA. However, because this uses ixmp.testing._platform_fixture, it would require changing that module to accept/use a configuration option to set the specific value. (Edit: I forgot that I had already added this feature, in #26.)
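
For reference, the same heap increase can be made directly when constructing a Platform, assuming (as I believe is the case) that ixmp's JDBCBackend accepts a jvmargs argument; the 6 GB value and local database path below are illustrative. In the test suite, the PR instead relies on the existing --jvmargs pytest option from message_ix_models.testing.

```python
import ixmp

# Start a local HSQLDB-backed Platform with a 6 GB JVM heap; the keyword
# arguments are passed through to JDBCBackend. The path and heap size are
# illustrative values only.
mp = ixmp.Platform(
    backend="jdbc",
    driver="hsqldb",
    path="/tmp/snapshot-test-db",
    jvmargs="-Xmx6G",
)
```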

@glatterf42 (Member) commented:

[attached image: "combined" (profiling results)]

Before your latest commit, I started a profiling run locally. Maybe that wasn't necessary, but I thought I should still share what it produced. I don't see a single culprit (except that this goes back to _unpack()), but some functions are called thousands or millions of times, which obviously is a huge factor. Though I don't know if that's avoidable.

@glatterf42 (Member) commented:

Also, since we are actually working on ubuntu-latest as well, we could use https://github.com/nektos/act to run the tests locally as if we were using GHA.

@khaeru (Member, Author) commented Jun 16, 2023

> Before your latest commit, I started a profiling run locally. Maybe that wasn't necessary, but I thought I should still share what it produced. I don't see a single culprit (except that this goes back to _unpack()), but some functions are called thousands or millions of times, which obviously is a huge factor. Though I don't know if that's avoidable.

Thanks, that's useful. Looking at snapshot._read_excel(), we see that it accounts for 99% of the run time, and within it parse_item_sheets() accounts for 73%; all of that is internal to pandas and openpyxl.

This means everything else, i.e.:

  • Write the parsed data to many CSV files.
  • Re-read those CSV files.
  • Correct the unit error.
  • Add them to the Scenario.

…altogether takes up 99 - 73 = 26% of the run time, i.e. about 1/3 of the duration of the Excel read itself.

If there's a faster way of reading large Excel files, we could incorporate that. I have searched repeatedly, but not found anything.
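
To illustrate where the time goes, a rough timing sketch of the dominant step (the file name is a placeholder):

```python
import time

import pandas as pd

# Time the single call that dominates the profile: reading every sheet of
# the snapshot workbook via pandas/openpyxl. "snapshot.xlsx" is a placeholder.
start = time.perf_counter()
sheets = pd.read_excel("snapshot.xlsx", sheet_name=None, engine="openpyxl")
print(f"Read {len(sheets)} sheets in {time.perf_counter() - start:.1f} s")
```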

Commit: "Track all .csv.gz and .xlsx files using Git LFS."
@khaeru marked this pull request as ready for review on June 16, 2023 20:00
@khaeru requested a review from glatterf42 on June 16, 2023 20:04
@khaeru (Member, Author) commented Jun 16, 2023

FYI @awais307: @glatterf42 mentioned that you wanted to write some code to fetch GLOBIOM data from Zenodo. I didn't know that GLOBIOM data was already published there! In any case, please look at the message_ix_models.util.pooch.fetch() function added by this PR. I'd recommend using this for your purposes. If it doesn't meet your needs, please say how and we can talk about how to extend it.
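
The exact signature of message_ix_models.util.pooch.fetch() is not shown in this thread; for orientation, a generic sketch using the underlying pooch library directly looks like the following (the URL, file name, and hash are placeholders, not the actual Zenodo record contents):

```python
import pooch

# Download a file from the Zenodo record into a local cache directory and
# return its path. Replace the URL/file name and supply a real SHA256 to
# verify the download; known_hash=None skips verification.
path = pooch.retrieve(
    url="https://zenodo.org/record/5793870/files/snapshot.xlsx",
    known_hash=None,
    path="./cache",
)
print(path)
```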

@glatterf42 (Member) left a review comment:

The tests are all passing and the CLI command works perfectly, but the docs could use improvement:
Trying to execute the code in doc/api/model-snapshot.rst, I find ImportError: cannot import name 'snapshot' from 'message_ix_models' (/home/fridolin/message-ix-models/message_ix_models/__init__.py) or Module "message_ix_models" has no attribute "snapshot". Also, line 24 should either be scenario = ... or line 26 should be snapshot.load(s, 0), I think.

@khaeru (Member, Author) commented Jun 19, 2023

Thanks for catching those! I will push another commit to fix them, so that you can approve.

@glatterf42 (Member) left a review comment:

Thanks for the fixes, snapshot is importable now. And while I trust that you checked the load function yourself, I don't have enough modeling experience to get this code snippet to work (I fail to provide suitable parameters for Scenario(...)). This might indicate that the example should be expanded, depending on who the intended users are, but I also don't doubt that I could get it to run if I spent more time reading the docs about Scenario().

@khaeru (Member, Author) commented Jun 20, 2023

> Thanks for the fixes, snapshot is importable now.

Great, can you then please approve? ✅

This code is indeed meant for users who have already learned how to create new Scenarios with message_ix, and links to the documentation of the Scenario class are there if they need to remind themselves.
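
For readers following along, a minimal sketch of the intended usage, assuming the module lives at message_ix_models.model.snapshot (as the coverage report above suggests) and standard message_ix Scenario arguments; the model and scenario names are placeholders:

```python
import ixmp
import message_ix

from message_ix_models.model import snapshot

# Connect to a Platform of your choice (local HSQLDB, ixmp-dev, …)
mp = ixmp.Platform()

# Create a new, empty Scenario as the target; any model/scenario names will do
scenario = message_ix.Scenario(
    mp, model="MESSAGEix-GLOBIOM", scenario="baseline", version="new"
)

# Load snapshot 0 (previously fetched from Zenodo) into the Scenario
snapshot.load(scenario, 0)
```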

@glatterf42 (Member) left a review comment:

I see, looks good to me then.

@khaeru merged commit e9aeb26 into main on Jun 20, 2023
10 checks passed
@glatterf42 deleted the enh/zenodo-snapshot branch on June 20, 2023 10:36
measrainsey pushed a commit that referenced this pull request Jul 19, 2023
Labels: enh (New features or functionality)
Projects: none yet
Linked issues that merging may close: none yet
2 participants