Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make int64 casting optional #267

Closed
wants to merge 3 commits into from

Conversation

maxfenv
Copy link

@maxfenv maxfenv commented Oct 25, 2022

With 'large' raster datasets of integer datatype, rasterstats wastefully casts the ndarray read from the raster up to int64. In the extreme case, this uses 8 times more memory than necessary if the underlying datatype of the raster dataset is Byte.

At least for categorical rasters and when using categorical=True, it seems desirable to disable this behaviour to save memory.

In a not particularly extreme case, this saved us on the order of 20G of memory:

The raster datasets covers all of Scotland and is of the Byte type. The individual features in the vector dataset are the Scottish country boundaries and therefore cover a decent portion of the raster area (and their bounding box an even larger portion).

The raster dataset is 46478 * 69365 pixels, 1 byte per pixel = 3.00GiB uncompressed in memory if stored in an ndarray of dtype int8. It would be 24GiB in int64.

@maxfenv
Copy link
Author

maxfenv commented Oct 25, 2022

@sebastianclarke for visibility.

Our current hack to reduce memory usage is monkey patching rasterstats.main.sys.maxsize = 2**32, which is not ideal.

@maxfenv maxfenv marked this pull request as ready for review October 25, 2022 10:27
@groutr
Copy link

groutr commented Oct 25, 2022

These numpy stats functions accept a dtype argument that determines the dtype of the accumulator for functions like mean, sum, etc (see: https://numpy.org/doc/stable/reference/generated/numpy.sum.html).
When calculating the mean, for example, we can force that calculation to happen with int64 without having to cast the input to int64 by doing masked.mean(dtype='int64') instead of masked.mean()

Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either).

@maxfenv
Copy link
Author

maxfenv commented Oct 26, 2022

Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either).

I would tend to agree that casting seems unnecessary, but I'm insufficiently familiar with the operations happening here to be sure. Certainly for categorical rasters, it seems unlikely that any value in the array should overflow the datatype in the raster dataset. With non categorical rasters, I don't know. Maybe some mathematical operations happen on the data that would cause an overflow?

I can only assume there was a good reason for introducing the cast in the first place.

Copy link
Owner

@perrygeo perrygeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxfenv the casting is necessary to avoid integer overflows on math ops. Without it, users can get back garbage data which I'd like to avoid at any cost. Hence up-casting to bigint as the default.

Avoiding the int cast is a great idea for memory optimization, especially if you know a priori that your values are small enough and/or you're not requesting any stats that could potentially overflow. It should be opt-in though; the default behavior is there for a very good reason and should not change.

1 change requested, otherwise, looks good.

@@ -47,7 +47,8 @@ def gen_zonal_stats(
raster_out=False,
prefix=None,
geojson_out=False,
boundless=True, **kwargs):
boundless=True,
cast_to_int64=None, **kwargs):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make cast_to_int64 default to True to keep backwards compatibility and avoid the potential for int overflows?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see it this already keeps backwards compatibility. cast_to_int64, if it remains unspecified, will be set to True on 64 bit systems on line 139 of this file.

On 32 bit systems it is set to False. This is exactly the same casting behaviour that existed prior to this PR

@perrygeo
Copy link
Owner

@groutr Thanks for the tip - the dtype accumulator might be exactly what we need here - the array stays in its native dtype but the result can't overflow.

@maxfenv if you'd like, I can implement the above as an alternative to this PR. It should be the best of both worlds rather than having a boolean flag.

@maxfenv
Copy link
Author

maxfenv commented Jan 16, 2023

@perrygeo Yeah go ahead with that if you prefer and you have the time! It is certainly cleaner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants