Make int64 casting optional #267

maxfenv · 2022-10-25T10:25:11Z

With 'large' raster datasets of integer datatype, rasterstats wastefully casts the ndarray read from the raster up to int64. In the extreme case, this uses 8 times more memory than necessary if the underlying datatype of the raster dataset is Byte.

At least for categorical rasters and when using categorical=True, it seems desirable to disable this behaviour to save memory.

In a not particularly extreme case, this saved us on the order of 20G of memory:

The raster dataset covers all of Scotland and is of the Byte type. The individual features in the vector dataset are the Scottish country boundaries and therefore cover a decent portion of the raster area (and their bounding boxes an even larger portion).

The raster dataset is 46478 * 69365 pixels at 1 byte per pixel, which is about 3.00 GiB uncompressed in memory if stored in an ndarray of dtype uint8 (the NumPy equivalent of Byte). It would be about 24 GiB in int64.
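The arithmetic above can be checked with a few lines of Python. This is only a sketch: the raster dimensions come from the comment, while the variable names and the `gib` helper are illustrative.

```python
# Raster dimensions quoted in the comment above.
rows, cols = 46478, 69365
pixels = rows * cols

# Bytes per pixel for the two dtypes being compared.
BYTES_PER_PIXEL = {"uint8": 1, "int64": 8}


def gib(nbytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return nbytes / 2**30


size_uint8 = gib(pixels * BYTES_PER_PIXEL["uint8"])
size_int64 = gib(pixels * BYTES_PER_PIXEL["int64"])

print(f"{size_uint8:.2f} GiB as uint8, {size_int64:.2f} GiB as int64")
```

The int64 figure is exactly eight times the 8-bit one, which is where the "8 times more memory" estimate comes from.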

maxfenv · 2022-10-25T10:27:12Z

@sebastianclarke for visibility.

Our current hack to reduce memory usage is monkey patching rasterstats.main.sys.maxsize = 2**32, which is not ideal.
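One likely reason the hack is not ideal: `rasterstats.main.sys` is just the standard `sys` module, so the assignment changes `sys.maxsize` for the whole process, not only for rasterstats. The sketch below shows the effect using `os.sys` as a stand-in, so that it runs without rasterstats installed. The `patched_maxsize` context manager is a hypothetical helper, not part of any library, and it at least limits how long the patch is in effect.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def patched_maxsize(value):
    """Temporarily override sys.maxsize, restoring the original on exit."""
    original = sys.maxsize
    sys.maxsize = value
    try:
        yield
    finally:
        sys.maxsize = original


# A `sys` attribute reached through another module is the same module object,
# so patching it there is a process-wide change.
same_module = os.sys is sys

with patched_maxsize(2**32):
    seen_inside = sys.maxsize  # every module sees the patched value here

seen_after = sys.maxsize  # original value is back
```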

groutr · 2022-10-25T19:23:19Z

These numpy stats functions accept a dtype argument that determines the dtype of the accumulator for functions like mean, sum, etc (see: https://numpy.org/doc/stable/reference/generated/numpy.sum.html).
When calculating the mean, for example, we can force that calculation to happen with int64 without having to cast the input to int64 by doing masked.mean(dtype='int64') instead of masked.mean()
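A minimal sketch of the suggestion, using a made-up 8-bit array rather than real raster data. The array contents and the mask are assumptions for illustration; only `sum` and `mean` with a `dtype` argument come from the comment above.

```python
import numpy as np

# Hypothetical stand-in for a Byte raster window: 1000 pixels, all equal to 200.
data = np.full(1000, 200, dtype=np.uint8)

# Pretend the first 100 pixels fall outside the feature geometry.
mask = np.zeros(1000, dtype=bool)
mask[:100] = True
masked = np.ma.masked_array(data, mask=mask)

# Accumulate in int64 while the input array stays uint8.
total = masked.sum(dtype="int64")
mean = masked.mean(dtype="int64")

# Forcing an 8-bit accumulator shows the overflow being guarded against:
# the true sum (900 * 200 = 180000) wraps modulo 256.
wrapped = masked.sum(dtype="uint8")

print(masked.dtype, total, mean, wrapped)
```

Only the accumulator is widened, so the memory cost of the input is unchanged.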

Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either).

maxfenv · 2022-10-26T08:41:56Z

> Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either).

I would tend to agree that casting seems unnecessary, but I'm insufficiently familiar with the operations happening here to be sure. Certainly for categorical rasters, it seems unlikely that any value in the array should overflow the datatype in the raster dataset. With non categorical rasters, I don't know. Maybe some mathematical operations happen on the data that would cause an overflow?

I can only assume there was a good reason for introducing the cast in the first place.

perrygeo

@maxfenv the casting is necessary to avoid integer overflows on math ops. Without it, users can get back garbage data which I'd like to avoid at any cost. Hence up-casting to bigint as the default.

Avoiding the int cast is a great idea for memory optimization, especially if you know a priori that your values are small enough and/or you're not requesting any stats that could potentially overflow. It should be opt-in though; the default behavior is there for a very good reason and should not change.
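To illustrate the kind of overflow described above, here is a small sketch with made-up values. It is not taken from rasterstats; it only shows that NumPy element-wise arithmetic on 8-bit arrays wraps silently, and that widening first gives the expected result.

```python
import numpy as np

# Two made-up pixel values stored as unsigned 8-bit integers.
values = np.array([200, 100], dtype=np.uint8)

# Arithmetic between uint8 arrays stays uint8 and wraps modulo 256,
# so 200 + 200 comes back as 144 rather than 400.
doubled_native = values + values

# Casting up first avoids the wrap, at 8x the memory for the array.
doubled_wide = values.astype(np.int64) * 2

print(doubled_native.tolist(), doubled_wide.tolist())
```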

1 change requested, otherwise, looks good.

perrygeo · 2023-01-15T04:44:11Z

src/rasterstats/main.py

@@ -47,7 +47,8 @@ def gen_zonal_stats(
        raster_out=False,
        prefix=None,
        geojson_out=False, 
-        boundless=True, **kwargs):
+        boundless=True,
+        cast_to_int64=None, **kwargs):


Can you make cast_to_int64 default to True to keep backwards compatibility and avoid the potential for int overflows?

As I see it, this already keeps backwards compatibility. If cast_to_int64 is left unspecified, it will be set to True on 64-bit systems on line 139 of this file.

On 32-bit systems it is set to False. This is exactly the same casting behaviour that existed prior to this PR.
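The default resolution described here could look roughly like the sketch below. The function name `resolve_cast_to_int64` is hypothetical, and the exact platform check (`sys.maxsize > 2**32`) is an assumption, not a copy of the code in the PR.

```python
import sys


def resolve_cast_to_int64(cast_to_int64=None):
    """Hypothetical helper: pick the effective casting behaviour.

    None means "not specified by the caller", in which case the choice
    falls back to a platform check. An explicit True or False wins.
    """
    if cast_to_int64 is None:
        # Assumed check: True on 64-bit Python builds, False on 32-bit ones.
        return sys.maxsize > 2**32
    return bool(cast_to_int64)


print(resolve_cast_to_int64())       # platform-dependent default
print(resolve_cast_to_int64(False))  # explicit opt-out of the cast
```

Using None as the sentinel is what lets the keyword keep the previous platform-dependent behaviour while still allowing an explicit override.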

perrygeo · 2023-01-15T04:51:20Z

@groutr Thanks for the tip - the dtype accumulator might be exactly what we need here - the array stays in its native dtype but the result can't overflow.

@maxfenv if you'd like, I can implement the above as an alternative to this PR. It should be the best of both worlds rather than having a boolean flag.

maxfenv · 2023-01-16T09:04:58Z

@perrygeo Yeah go ahead with that if you prefer and you have the time! It is certainly cleaner.

maxf130 added 3 commits October 25, 2022 11:04

Make casting optional

af61ef4

Add a note about cast_to_int64 in manual

1e6f8ef

Link to the numpy issue in docstring

3d82847

maxfenv marked this pull request as ready for review October 25, 2022 10:27

perrygeo requested changes Jan 15, 2023

perrygeo added the Needs Additional Info label Jan 15, 2023

perrygeo mentioned this pull request Feb 16, 2023

Use numpy dtype arg selectively, instead of casting all integer data #279

Merged

perrygeo closed this in #279 Feb 16, 2023
