
Fetch and load public base model snapshot #102

Merged
merged 27 commits into main from enh/zenodo-snapshot on Jun 20, 2023

Conversation

@khaeru (Member) commented May 24, 2023

Following the release of https://doi.org/10.5281/zenodo.5793870, this PR adds code to fetch the snapshot from Zenodo and load it into a Platform of the user's choice, for various uses including further development and testing of other code.

Notes

Units

The first snapshot contains the unit strings "USD_2005/t" and "USD_2005/t " (the latter with a trailing space).

  • With the current (only) JDBCBackend, the ixmp_source Java code rejects an attempt to define either of these unit strings if the other is already defined. However, on the ixmp-dev platform (IIASA ECE's internal Oracle database), both of these unit strings are already defined (it is unclear how this happened; possibly it was done with an older version of the ixmp_source Java code).
  • As a result, attempting to load the snapshot with the built-in Scenario.read_excel() fails.
  • Further, rewriting just a few values in the snapshot Excel file using pandas or openpyxl has very poor performance.
  • This PR works around these issues using a utility function, .snapshot._unpack(), which unpacks ("explodes") the entire Excel snapshot into one compressed CSV file per parameter; these files are then added to the scenario individually. A rough sketch of the idea follows this list.
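
A rough sketch of the unpack idea, for orientation only — this is not the actual .snapshot._unpack() implementation, and the "unit" column name and output layout are assumptions:

```python
from pathlib import Path

import pandas as pd


def unpack_sketch(snapshot_xlsx: Path, out_dir: Path) -> None:
    """Explode an Excel snapshot into one compressed CSV file per sheet."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Read every sheet once; this single call dominates the run time
    sheets = pd.read_excel(snapshot_xlsx, sheet_name=None, engine="openpyxl")

    for name, df in sheets.items():
        # Work around the duplicated unit strings by stripping the trailing
        # space, so only "USD_2005/t" is ever defined
        if "unit" in df.columns:
            df["unit"] = df["unit"].str.strip()

        # pandas infers gzip compression from the ".csv.gz" suffix
        df.to_csv(out_dir / f"{name}.csv.gz", index=False)
```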

Testing

  • Initially, running the entire process failed on GitHub Actions macOS and Windows runners with java.lang.OutOfMemoryError: Java heap space.
  • On Linux, the job would time out after 6 hours, which seems excessive: on my machine, the conversion step only takes about 5 minutes.
  • See #102 (comment) below: most of this time is simply reading the Excel file using pandas/openpyxl.
  • This PR works around these issues with the following steps:
    • The already-existing --jvmargs pytest option (defined in message_ix_models.testing) is used to increase JVM heap space to 6 GB; this is slightly below the total available on GHA runners.
    • An already-unpacked set of files is added in message_ix_models/data/test/MESSAGEix-GLOBIOM…
      These files are excluded from packaging (MANIFEST.in).
    • A fixture is added, unpacked_snapshot_data, which moves these files into the location they would be unpacked to.
      The files are thus not read from Excel again when the tests execute. (A sketch of such a fixture follows this list.)
    • test_snapshot.test_load is limited to the ubuntu-latest runners, i.e. skipped on macOS and Windows. This is because it increases the total job run time from:
      • Linux: ~3 to ~9–11 minutes = still tolerable.
      • macOS: ~8 to ~37–41 minutes = too long.
      • Windows: ~7 minutes to >2.5 hours = ditto.
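
A rough sketch of such a fixture — the body here is illustrative, not the actual unpacked_snapshot_data implementation, and the package-data path is an assumption:

```python
import shutil
from pathlib import Path

import pytest

# Illustrative location of the pre-unpacked .csv.gz files shipped as test
# data; the real fixture uses message_ix_models' own data and cache paths
PACKAGE_DATA = Path(__file__).parents[1] / "data" / "test" / "MESSAGEix-GLOBIOM"


@pytest.fixture
def unpacked_snapshot_data(tmp_path: Path) -> Path:
    """Place pre-unpacked files where _unpack() would write them.

    With the files already present, the slow Excel read is skipped when the
    tests execute.
    """
    target = tmp_path / "snapshot-0"
    shutil.copytree(PACKAGE_DATA, target)
    return target
```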

Other changes

  • Update from Python 3.10 to 3.11 in CI workflows.
  • All XLSX files are stored using Git LFS; this includes those previously added in #88 (Add MESSAGEix-Nexus module).

How to review

  • Read the added documentation and ensure it is clear about how to use the code.
  • Run mix-models snapshot fetch 0 on the branch; confirm the code works.
  • Note that the CI checks all pass.

PR checklist

  • Continuous integration checks all ✅
  • Add or expand tests; coverage checks both ✅
  • Add, expand, or update documentation.
  • Update doc/whatsnew.

@khaeru added the "enh (New features or functionality)" label on May 24, 2023
@khaeru self-assigned this on May 24, 2023
@khaeru marked this pull request as draft on May 24, 2023 09:01
@codecov (bot) commented May 24, 2023

Codecov Report

Merging #102 (2c685e3) into main (1cd26f9) will increase coverage by 0.76%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
+ Coverage   67.11%   67.87%   +0.76%     
==========================================
  Files          58       61       +3     
  Lines        4011     4106      +95     
==========================================
+ Hits         2692     2787      +95     
  Misses       1319     1319              
Impacted Files                                   Coverage Δ
message_ix_models/cli.py                         100.00% <100.00%> (ø)
message_ix_models/model/snapshot.py              100.00% <100.00%> (ø)
message_ix_models/testing.py                     100.00% <100.00%> (ø)
message_ix_models/tests/model/test_snapshot.py   100.00% <100.00%> (ø)
message_ix_models/tests/util/test_context.py     100.00% <100.00%> (ø)
message_ix_models/util/pooch.py                  100.00% <100.00%> (ø)

@khaeru (Member, Author) commented Jun 16, 2023

> On Linux, the job times out after 6 hours, which seems excessive: on my machine, the conversion step only takes about 5 minutes.

One solution here would be to increase the JVM heap space for the JDBCBackend used in the tests when on GHA. However, because this uses ixmp.testing._platform_fixture, it would require changing that module to accept/use a configuration option to set the specific value. (Edit: I forgot that I had already added this feature, in #26.)
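
For reference, the same heap increase can be made directly when constructing a Platform, assuming (as I believe is the case) that ixmp's JDBCBackend accepts a jvmargs argument; the 6 GB value and local database path below are illustrative. In the test suite, the PR instead relies on the existing --jvmargs pytest option from message_ix_models.testing.

```python
import ixmp

# Start a local HSQLDB-backed Platform with a 6 GB JVM heap; the keyword
# arguments are passed through to JDBCBackend. The path and heap size are
# illustrative values only.
mp = ixmp.Platform(
    backend="jdbc",
    driver="hsqldb",
    path="/tmp/snapshot-test-db",
    jvmargs="-Xmx6G",
)
```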

@glatterf42 (Member) commented:

[attached image: "combined" (profiling results)]

Before your latest commit, I started a profiling run locally. Maybe that wasn't necessary, but I thought I should still share what it produced. I don't see a single culprit (except that this goes back to _unpack()), but some functions are called thousands or millions of times, which obviously is a huge factor. Though I don't know if that's avoidable.

@glatterf42 (Member) commented:

Also, since we are actually working on ubuntu-latest as well, we could use https://github.com/nektos/act to run the tests locally as if we were using GHA.

@khaeru (Member, Author) commented Jun 16, 2023

> Before your latest commit, I started a profiling run locally. Maybe that wasn't necessary, but I thought I should still share what it produced. I don't see a single culprit (except that this goes back to _unpack()), but some functions are called thousands or millions of times, which obviously is a huge factor. Though I don't know if that's avoidable.

Thanks, that's useful. Looking at snapshot._read_excel(), we see that it accounts for 99% of the run time, and within it parse_item_sheets() accounts for 73%; all of that is internal to pandas and openpyxl.

This means everything else, i.e.:

  • Write the parsed data to many CSV files.
  • Re-read those CSV files.
  • Correct the unit error.
  • Add them to the Scenario.

…altogether takes up 99 - 73 = 26% of the run time, i.e. about 1/3 of the duration of the Excel read itself.

If there's a faster way of reading large Excel files, we could incorporate that. I have searched repeatedly, but not found anything.
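
To illustrate where the time goes, a rough timing sketch of the dominant step (the file name is a placeholder):

```python
import time

import pandas as pd

# Time the single call that dominates the profile: reading every sheet of
# the snapshot workbook via pandas/openpyxl. "snapshot.xlsx" is a placeholder.
start = time.perf_counter()
sheets = pd.read_excel("snapshot.xlsx", sheet_name=None, engine="openpyxl")
print(f"Read {len(sheets)} sheets in {time.perf_counter() - start:.1f} s")
```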

Commit: "Track all .csv.gz and .xlsx files using Git LFS."
@khaeru marked this pull request as ready for review on June 16, 2023 20:00
@khaeru requested a review from glatterf42 on June 16, 2023 20:04
@khaeru (Member, Author) commented Jun 16, 2023

FYI @awais307: @glatterf42 mentioned that you wanted to write some code to fetch GLOBIOM data from Zenodo. I didn't know that GLOBIOM data was already published there! In any case, please look at the message_ix_models.util.pooch.fetch() function added by this PR. I'd recommend using this for your purposes. If it doesn't meet your needs, please say how and we can talk about how to extend it.
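
The exact signature of message_ix_models.util.pooch.fetch() is not shown in this thread; for orientation, a generic sketch using the underlying pooch library directly looks like the following (the URL, file name, and hash are placeholders, not the actual Zenodo record contents):

```python
import pooch

# Download a file from the Zenodo record into a local cache directory and
# return its path. Replace the URL/file name and supply a real SHA256 to
# verify the download; known_hash=None skips verification.
path = pooch.retrieve(
    url="https://zenodo.org/record/5793870/files/snapshot.xlsx",
    known_hash=None,
    path="./cache",
)
print(path)
```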

@glatterf42 (Member) left a review comment:

The tests are all passing and the CLI command works perfectly, but the docs could use improvement:
Trying to execute the code in doc/api/model-snapshot.rst, I find ImportError: cannot import name 'snapshot' from 'message_ix_models' (/home/fridolin/message-ix-models/message_ix_models/__init__.py) or Module "message_ix_models" has no attribute "snapshot". Also, line 24 should either be scenario = ... or line 26 should be snapshot.load(s, 0), I think.

@khaeru (Member, Author) commented Jun 19, 2023

Thanks for catching those! I will push another commit to fix them, so that you can approve.

@glatterf42 (Member) left a review comment:

Thanks for the fixes, snapshot is importable now. And while I trust that you checked the load function yourself, I don't have enough modeling experience to get this code snippet to work (I fail to provide suitable parameters for Scenario(...)). This might indicate that the example should be expanded, depending on who the intended users are, but I also don't doubt that I could get it to run if I spent more time reading the docs about Scenario().

@khaeru (Member, Author) commented Jun 20, 2023

> Thanks for the fixes, snapshot is importable now.

Great, can you then please approve? ✅

This code is indeed meant for users who have already learned how to create new Scenarios with message_ix, and links to the documentation of the Scenario class are there if they need to remind themselves.
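
For readers following along, a minimal sketch of the intended usage, assuming the module lives at message_ix_models.model.snapshot (as the coverage report above suggests) and standard message_ix Scenario arguments; the model and scenario names are placeholders:

```python
import ixmp
import message_ix

from message_ix_models.model import snapshot

# Connect to a Platform of your choice (local HSQLDB, ixmp-dev, …)
mp = ixmp.Platform()

# Create a new, empty Scenario as the target; any model/scenario names will do
scenario = message_ix.Scenario(
    mp, model="MESSAGEix-GLOBIOM", scenario="baseline", version="new"
)

# Load snapshot 0 (previously fetched from Zenodo) into the Scenario
snapshot.load(scenario, 0)
```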

@glatterf42 (Member) left a review comment:

I see, looks good to me then.

@khaeru merged commit e9aeb26 into main on Jun 20, 2023
10 checks passed
@glatterf42 deleted the enh/zenodo-snapshot branch on June 20, 2023 10:36
measrainsey pushed a commit that referenced this pull request Jul 19, 2023
Labels: enh (New features or functionality)
Projects: none yet
Linked issues that merging may close: none yet
2 participants