
Add Python test infrastructure for testing memory usage limits #15231

Closed
itamarst opened this issue Mar 22, 2024 · 2 comments · Fixed by #15285
Labels
enhancement New feature or an improvement of an existing feature

Comments

itamarst commented Mar 22, 2024

Description

Motivation

Polars makes a variety of implicit or explicit guarantees about memory usage. For example:

  • Only the specifically requested columns will be loaded into memory when reading files. #15098 ("High memory usage reading single column with read_parquet") was a regression that significantly increased memory usage for some users of read_parquet().
  • Streaming APIs will process data in chunks, so the full dataset is never loaded into memory at once.

Unfortunately, bugs like #15098 won't be caught by normal tests. There is a need for test infrastructure specifically focused on measuring memory usage.

In particular, the goal is to measure peak memory usage, because that is the bottleneck that matters to users. (Allocating 100MB and freeing it in a loop 10 times means you've allocated a total of 1GB, but peak usage is still only 100MB. This is very different from allocating 1GB at once, which has a peak of 1GB.)
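The distinction can be demonstrated with Python's built-in tracemalloc (allocation sizes here are illustrative):

```python
import tracemalloc

tracemalloc.start()

# Allocate ~10 MB and free it, ten times. Total bytes allocated is
# ~100 MB, but peak usage stays around 10 MB because each chunk is
# freed before the next one is allocated.
for _ in range(10):
    chunk = bytearray(10_000_000)
    del chunk

_current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak: {peak / 1_000_000:.0f} MB")  # roughly 10 MB, not 100 MB
```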

Possible requirements

Keeping the existing allocators in place is helpful: the allocator is a fairly core piece of infrastructure, so making it significantly different between CI runs and released code is not ideal for testing purposes.

Options

Memray

pytest-memray uses the Memray memory profiler, which works at the level of operating-system allocator hooks (malloc/free/calloc/mmap). Memray doesn't support Windows.
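For reference, pytest-memray provides a marker for asserting a per-test peak-allocation limit; a minimal sketch (the test name and the 100 MB limit are arbitrary):

```python
import pytest


# With the pytest-memray plugin installed, this test fails if peak
# allocated memory during the test exceeds the stated limit.
@pytest.mark.limit_memory("100 MB")
def test_reading_one_column_is_cheap():
    data = bytearray(10_000_000)  # ~10 MB, well under the limit
    assert len(data) == 10_000_000
```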

The benefit is that it tracks everything, regardless of source.

Downsides:

Insofar as Polars uses jemalloc on the platforms Memray supports, catching smaller allocations probably won't work well: Memray will only see the larger mmap() chunks that jemalloc requests for its memory pools, not the individual allocations served from those pools. Options to deal with this:

  1. Tests for memory usage would need to allocate sufficiently large amounts of memory to not fit in an existing jemalloc pool.
  2. Disable jemalloc in test builds used to run tests, allowing smaller allocations to be tracked.

Another downside is that Memray impacts every test, not just those that care about memory usage. This adds some overhead, though perhaps not enough to matter in practice.

Finally, this approach is very much tied to Python.

jemalloc or mimalloc debugging hooks

jemalloc has profiling APIs, which can be turned on or off at runtime (provided jemalloc is compiled with profiling support). https://www.magiroux.com/rust-jemalloc-profiling/ discusses this a bit. It's not clear whether these APIs actually report peak memory, which is the metric of interest, and they seem oriented towards dumping data for later analysis.

It's not clear whether mimalloc has an equivalent.

The downside is that this approach is limited to Rust allocations; it wouldn't see Python, NumPy, or PyArrow allocations.

tracemalloc

tracemalloc is Python's built-in memory API. It makes it very easy to get peak memory over a period of time, and you can choose to only enable it for specific tests.

By default, Polars' allocations won't be tracked by tracemalloc. However, there is a C API (PyTraceMalloc_Track()/PyTraceMalloc_Untrack()) for registering allocations, which Polars could use.

Essentially this would involve a wrapper around the global allocator used in the Rust extension. It could be enabled only in test builds, or it could always be present but disabled by default.

Benefits:

  • Tests could use small allocations.
  • Works on all platforms where Python runs.
  • NumPy registers its allocations with tracemalloc.
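The NumPy point is easy to verify directly:

```python
import tracemalloc

import numpy as np

tracemalloc.start()
# ~8 MB of float64 data; NumPy reports this allocation to tracemalloc.
arr = np.ones(1_000_000)
_current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(peak)  # at least ~8 MB
```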

Downsides:

  • It's not clear whether e.g. PyArrow registers with tracemalloc (I can check; I'm guessing not), so it's not clear that everything can be tracked this way. PyArrow does have its own hooks.
  • Python-specific.

peak_alloc

This is a global allocator that lets you get peak allocated memory easily. This could be conditionally enabled for easy Rust-only testing.

Hybrid options

Use e.g. peak_alloc + tracemalloc + PyArrow hooks to figure out peak memory.

Notes

It appears you can plug in new memory pools for PyArrow, so one could create a new one for a test and then use its max_memory() method.

itamarst added the enhancement label Mar 22, 2024

itamarst commented Mar 22, 2024

Based on the above, it seems like the best initial approach is probably:

  1. Create a wrapper global allocator for Rust that wraps jemalloc/mimalloc (depending on platform) and registers memory with tracemalloc.
  2. Use this wrapper only for dev profiles. (Doing it in release builds would also be nice and is probably low-overhead, but that requires benchmarking, so it's probably out of scope for this issue.)
  3. Add some PyArrow utility.
  4. Write tests for the above, to make sure they're actually tracking things!
  5. Write test infrastructure that allows asserting peak memory doesn't get above a certain level.
  6. Write a demonstration test focusing on read_parquet().
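A sketch of what step 5 might look like as a pytest fixture (the names here are hypothetical, and this only sees allocations that tracemalloc knows about):

```python
import tracemalloc

import pytest


@pytest.fixture
def assert_peak_memory():
    """Yield a checker that fails the test if tracked peak memory
    exceeded the given number of bytes."""
    tracemalloc.start()
    tracemalloc.reset_peak()

    def check(limit_bytes):
        _current, peak = tracemalloc.get_traced_memory()
        assert peak <= limit_bytes, f"peak {peak} exceeded limit {limit_bytes}"

    yield check
    tracemalloc.stop()


def test_small_allocations(assert_peak_memory):
    data = bytearray(1_000_000)  # ~1 MB
    assert_peak_memory(50_000_000)
```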

itamarst commented

OK, I have this mostly working, will finish up and do PR on Monday.
