Performance optimizations #268

Open · groutr wants to merge 3 commits into master from groutr:perf1
Conversation

@groutr commented Nov 4, 2022

A small collection of performance enhancements aimed at reducing the number of times the data is traversed.

Also improved percentile computation by computing all requested percentiles at once instead of one at a time (and avoiding recreating the compressed array each time).
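For illustration, the percentile change amounts to something like the following (a minimal sketch with made-up data, not the exact diff; `masked` and `feature_stats` stand in for the variables inside `gen_zonal_stats`):

```python
import numpy as np

# Illustrative zone data; in rasterstats this is the masked array for one feature
masked = np.ma.masked_less(np.arange(100, dtype="float64"), 10)
feature_stats = {}

# Before: one np.percentile call (and one .compressed() copy) per requested percentile
# for q in (10, 50, 90):
#     feature_stats[f"percentile_{q}"] = float(np.percentile(masked.compressed(), q))

# After: compress once, then compute all requested percentiles in a single call
valid = masked.compressed()
percentiles = [10, 50, 90]  # illustrative values
for q, value in zip(percentiles, np.percentile(valid, percentiles)):
    feature_stats[f"percentile_{q}"] = float(value)
```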

            except AttributeError:
                pixel_count = dict(zip([np.asscalar(k) for k in keys],
                                       [np.asscalar(c) for c in counts]))
            pixel_count = dict(zip(keys.tolist(), counts.tolist()))
@groutr (Author) Nov 9, 2022

This builds pixel_count in C instead of Python, and is consequently significantly faster.
Also, pixel_count should really be a list of tuples; floats do not make good keys in a hash table.
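Roughly the difference being described, with toy data (a sketch; note that `np.asscalar` is deprecated in favor of `.item()`):

```python
import numpy as np

keys, counts = np.unique(np.array([1, 1, 2, 3, 3, 3]), return_counts=True)

# Per-element conversion: a Python-level call for every array element
# (np.asscalar is deprecated; .item() is the equivalent)
slow = dict(zip([k.item() for k in keys], [c.item() for c in counts]))

# .tolist() converts the whole array to native Python scalars in one C-level pass
fast = dict(zip(keys.tolist(), counts.tolist()))

assert slow == fast == {1: 2, 2: 1, 3: 3}
```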

@perrygeo (Owner)

Good idea on the pixel_count calcs.

Tuples would work, though I'm hesitant to change that now due to backwards compatibility with existing code. I'd consider it, though. In practice, categorical rasters (the only rasters it would make sense to request pixel counts on) are going to be some integer dtype; a categorical float would be exceedingly rare.

@groutr (Author)

Perhaps adding a note to the docstring that this stat is intended for categorical rasters would be enough.

@@ -205,14 +207,10 @@ def gen_zonal_stats(
        if 'count' in stats:  # special case, zero makes sense here
            feature_stats['count'] = 0
    else:
        valid = masked.compressed()
        keys, counts = np.unique(valid, return_counts=True)
@groutr (Author) Nov 9, 2022

The array of unique elements, keys, has the very useful property of being sorted. We can take advantage of that for several stats and thus eliminate two O(n) passes (min, max) over the raster data. We can also use it to avoid a sort when computing the median.
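A sketch of the idea with toy data (illustrative, not the PR's exact code):

```python
import numpy as np

valid = np.array([4.0, 1.0, 3.0, 1.0, 4.0, 4.0, 2.0])  # one zone's unmasked pixels
keys, counts = np.unique(valid, return_counts=True)     # keys come back sorted ascending

zone_min = keys[0]    # no separate O(n) min pass
zone_max = keys[-1]   # no separate O(n) max pass

# Median from the counts, without re-sorting the raw values:
cumulative = np.cumsum(counts)
n = cumulative[-1]
mid = (n - 1) / 2
lo = keys[np.searchsorted(cumulative, mid, side="right")]
hi = keys[np.searchsorted(cumulative, mid + 0.5, side="right")]
zone_median = (lo + hi) / 2

assert zone_min == valid.min() and zone_max == valid.max()
assert zone_median == np.median(valid)
```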

@perrygeo (Owner) Jan 15, 2023

Nice idea. I like this approach. Two comments:

  • We could eliminate the np.unique call when the requested stats do not require it. I guess most stats do now... but we could avoid it entirely if the stats were only mean, count, or std (see the sketch below).

  • This impacts memory usage. Keeping the entire keys and counts arrays in memory for the remainder of the loop might be a bad tradeoff in some cases. Prior to this PR, the unique stat was the only one that called np.unique, and users had to opt in. That stat is intended for categorical rasters; using it on continuous values would not make much sense. Now we're calling np.unique on every invocation and using it to derive other stats, which is fine in theory. But for continuous rasters, or those with 16+ bits per band, there are not likely to be any duplicates, meaning we're allocating keys and counts arrays that are each roughly the length of the masked array itself (a possible 200% increase in memory footprint). This causes a performance regression for several use cases: code that never had to call np.unique now does, and allocates huge chunks of memory as a result. Multiple O(n) passes have the advantage of not requiring new allocations unless unique is specifically requested.

So while this may speed up some use cases, it may slow down others, and it will certainly increase resource usage. I'll have to do some more testing to quantify the impact and evaluate the tradeoffs.
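On the first point, the guard could look roughly like this (a sketch with illustrative names such as `NEEDS_UNIQUE` and `summarize`, not the actual rasterstats internals):

```python
import numpy as np

# Stats that are derived from (or only make sense with) the sorted unique keys/counts
NEEDS_UNIQUE = {"unique", "majority", "minority", "variety", "min", "max", "median"}

def summarize(masked, stats):
    feature_stats = {}
    valid = masked.compressed()

    if NEEDS_UNIQUE & set(stats):
        keys, counts = np.unique(valid, return_counts=True)
    else:
        keys = counts = None  # skip the extra allocation entirely

    if "count" in stats:
        feature_stats["count"] = int(valid.size)
    if "mean" in stats:
        feature_stats["mean"] = float(valid.mean())
    if "min" in stats:
        feature_stats["min"] = float(keys[0])  # keys is guaranteed here: "min" is in NEEDS_UNIQUE
    # ... remaining stats follow the same pattern
    return feature_stats

masked = np.ma.masked_equal(np.array([[3, 3, 0], [5, 9, 0]]), 0)
print(summarize(masked, ["mean", "count"]))  # no np.unique call
print(summarize(masked, ["min", "count"]))   # np.unique called once
```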

@groutr (Author) commented Nov 9, 2022

A quick benchmark using some datasets from your blog @perrygeo (https://www.perrygeo.com/zonal-stats-with-postgis-rasters-part-2.html)

import time
import rasterstats
print(rasterstats)
from rasterstats import zonal_stats

start = time.perf_counter()
stats = zonal_stats(
    "ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp",
    "wc2.1_2.5m_tavg/wc2.1_2.5m_tavg_07.tif",
    stats=["unique", "range", "std", "sum", "mean", "count", "std", "min", "max", "median", "percentile_.1", "percentile_.5", "percentile_.9"]
)
print(time.perf_counter() - start)

On my laptop, the latest version of rasterstats runs in 11.847925074999694s. With this PR, that runtime drops to 8.6056675789996s.

@perrygeo (Owner) left a comment

Thanks @groutr - excellent work!

I'm concerned about the additional memory usage here. I need to run some more rigorous benchmarks to quantify the impact on all the various edge cases.

Hopefully it's a net win for speed in most cases, and a tolerable increase in memory footprint.


@perrygeo (Owner) commented Jan 15, 2023

Initial testing

Setup

  • Uint16 tif, 25235x25944, tiled (source)
  • shapefile with 11 multipolygons, fully covering the valid tif values. (source)

Results (basic stats)

zonal_stats(polygon, raster, stats=["min", "max", "count"])

command time --verbose python bench.py

master (fastest, lowest memory footprint)

Maximum resident set size (kbytes): 2458736
Elapsed (wall clock) time: 0:15.30

groutr:perf1

Maximum resident set size (kbytes): 2742956
Elapsed (wall clock) time: 0:22.77

So, as I suspected, in the case where the user requests only basic stats, this PR increases the memory footprint (in this case +12%) and the run time (+48%).

There are a significant number of duplicates in this dataset (the tif contains integer elevation values that are highly spatially autocorrelated), so I'd expect datasets with larger integers or floats to have more unique values and worse numbers. But for now, we'll just focus on this dataset...

Results (basic stats + unique)

Let's add unique to invoke np.unique calls in both

zonal_stats(polygon, raster, stats=["min", "max", "count", "unique"])

command time --verbose python bench.py

master: (lowest memory footprint)

Maximum resident set size (kbytes): 2458940
Elapsed (wall clock) time: 0:25.61

groutr:perf1 (fastest)

Maximum resident set size (kbytes): 2743124
Elapsed (wall clock) time: 0:22.59

It's a draw. This PR is faster at the expense of an increased memory footprint.

Results (all stats)

Let's try it with every statistic we can calculate:

zonal_stats(polygon, raster, stats="*")

command time --verbose python bench.py

master:

Maximum resident set size (kbytes): 3435952
Elapsed (wall clock) time: 0:35.89

groutr:perf1 (fastest, lowest memory footprint)

Maximum resident set size (kbytes): 3149592
Elapsed (wall clock) time: 0:25.81

This PR shines ✨

Conclusion

There's no definitive answer, only tradeoffs :-)

This PR's approach is faster, with a lower memory footprint, when all the stats are requested.

This PR's approach is slower, with an increased memory footprint, when only a small subset of stats is requested and the data has fewer duplicate pixel values.

For this dataset, the number of unique values was 4 orders of magnitude lower than the total count, i.e. plenty of duplicates. We can intuit that the memory footprint and performance of this PR would worsen as the number of unique values approached the total count. If I have time in the future, I'd like to test with a floating point grid of random noise to simulate the worst case scenario.
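For reference, a worst-case grid like that could be generated with something along these lines (a sketch using rasterio; the size, CRS, and filename are arbitrary choices, not part of the benchmark above):

```python
import numpy as np
import rasterio
from rasterio.transform import from_origin

# Random float32 noise: essentially every pixel value is unique,
# so the keys/counts from np.unique are nearly as large as the data itself.
height, width = 10_000, 10_000
data = np.random.default_rng(0).random((height, width)).astype("float32")

profile = {
    "driver": "GTiff",
    "height": height,
    "width": width,
    "count": 1,
    "dtype": "float32",
    "crs": "EPSG:4326",
    "transform": from_origin(-180, 90, 360 / width, 180 / height),
    "tiled": True,
    "nodata": -9999.0,
}

with rasterio.open("random_noise.tif", "w", **profile) as dst:
    dst.write(data, 1)
```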

We know that this PR improves performance for certain cases but introduces performance regressions in others; it is not a clear win over the current approach. The question is: on average, do the benefits outweigh the negatives across a wide variety of use cases? I'm going to need to see more benchmark data from real-world datasets to make the decision.

@perrygeo (Owner) left a comment

Looking for feedback/testing on real-world datasets.

Does this speed things up for you? Does it increase or decrease memory usage? Please let me know in the comments.

@perrygeo (Owner)

@groutr we could also try to make this np.unique approach optional so that users could opt-in to this behavior if they knew their datasets would benefit from it.

@perrygeo (Owner)

@groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.

@groutr (Author) commented Jan 18, 2023

@perrygeo I will need to dig up my real-world test case today. IIRC, the changes in the PR yielded a dramatic performance increase.

@groutr (Author) commented Jan 18, 2023

@groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.

Sure!

@groutr (Author) commented Feb 22, 2023

@perrygeo My real-world test case was flawed. The performance difference was largely accounted for by a change in formats that escaped my notice.

However, I've been thinking about this further and wanted to hear your thoughts on the following idea.
What if the stats were broken out into functions that are called on the zone? For example, requesting the minimum of a zone would call a function that finds the minimum and returns it. This would allow user-defined functions to be supplied, tailored to specific datasets, so that computations could be optimized or even parallelized.
The main loop of rasterstats would simply orchestrate the execution of these functions and provide the appropriate input data. At a very high level, it would map a list of functions over the zone data and return the collected results from each function.
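Very roughly, the shape being proposed (purely a sketch of the idea with hypothetical names, not existing rasterstats API):

```python
import numpy as np

# Each stat is a function that takes one zone's masked array and returns a value
def zone_min(masked):
    return float(masked.min())

def zone_mean(masked):
    return float(masked.mean())

BUILTIN_STATS = {"min": zone_min, "mean": zone_mean}

def run_zone(masked, stat_funcs):
    """Map the requested stat functions over one zone's data and collect results."""
    return {name: func(masked) for name, func in stat_funcs.items()}

# A user-defined stat tailored to a specific dataset slots in the same way
def p95(masked):
    return float(np.percentile(masked.compressed(), 95))

requested = {**BUILTIN_STATS, "p95": p95}
masked = np.ma.masked_equal(np.array([[3, 3, 0], [5, 9, 0]]), 0)
print(run_zone(masked, requested))  # {'min': 3.0, 'mean': 5.0, 'p95': 8.4}
```

(rasterstats' existing add_stats parameter already accepts custom functions of the masked array; the proposal here would extend that pattern to the built-in stats and the main loop.)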
