New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make int64 casting optional #267
Conversation
@sebastianclarke for visibility. Our current hack to reduce memory usage is monkey patching |
These numpy stats functions accept a dtype argument that determines the dtype of the accumulator for functions like mean, sum, etc (see: https://numpy.org/doc/stable/reference/generated/numpy.sum.html). Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either). |
I would tend to agree that casting seems unnecessary, but I'm insufficiently familiar with the operations happening here to be sure. Certainly for categorical rasters, it seems unlikely that any value in the array should overflow the datatype in the raster dataset. With non categorical rasters, I don't know. Maybe some mathematical operations happen on the data that would cause an overflow? I can only assume there was a good reason for introducing the cast in the first place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxfenv the casting is necessary to avoid integer overflows on math ops. Without it, users can get back garbage data which I'd like to avoid at any cost. Hence up-casting to bigint as the default.
Avoiding the int cast is a great idea for memory optimization, especially if you know a priori that your values are small enough and/or you're not requesting any stats that could potentially overflow. It should be opt-in though; the default behavior is there for a very good reason and should not change.
1 change requested, otherwise, looks good.
@@ -47,7 +47,8 @@ def gen_zonal_stats( | |||
raster_out=False, | |||
prefix=None, | |||
geojson_out=False, | |||
boundless=True, **kwargs): | |||
boundless=True, | |||
cast_to_int64=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make cast_to_int64
default to True to keep backwards compatibility and avoid the potential for int overflows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I see it this already keeps backwards compatibility. cast_to_int64, if it remains unspecified, will be set to True
on 64 bit systems on line 139 of this file.
On 32 bit systems it is set to False
. This is exactly the same casting behaviour that existed prior to this PR
@groutr Thanks for the tip - the dtype accumulator might be exactly what we need here - the array stays in its native dtype but the result can't overflow. @maxfenv if you'd like, I can implement the above as an alternative to this PR. It should be the best of both worlds rather than having a boolean flag. |
@perrygeo Yeah go ahead with that if you prefer and you have the time! It is certainly cleaner. |
With 'large' raster datasets of integer datatype, rasterstats wastefully casts the ndarray read from the raster up to int64. In the extreme case, this uses 8 times more memory than necessary if the underlying datatype of the raster dataset is Byte.
At least for categorical rasters and when using
categorical=True
, it seems desirable to disable this behaviour to save memory.In a not particularly extreme case, this saved us on the order of 20G of memory:
The raster datasets covers all of Scotland and is of the Byte type. The individual features in the vector dataset are the Scottish country boundaries and therefore cover a decent portion of the raster area (and their bounding box an even larger portion).
The raster dataset is 46478 * 69365 pixels, 1 byte per pixel = 3.00GiB uncompressed in memory if stored in an ndarray of dtype int8. It would be 24GiB in int64.