Performance optimizations #268

Open · groutr wants to merge 3 commits into master from groutr:perf1
Conversation

@groutr commented Nov 4, 2022

A small collection of performance enhancements aimed at reducing the number of times the data is traversed.

Also improved percentile computation by computing all requested percentiles at once instead of one at a time (and avoiding recreating the compressed array each time).
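For illustration, the percentile change amounts to something like the following (a minimal sketch with made-up data, not the exact diff; `masked` and `feature_stats` stand in for the variables inside `gen_zonal_stats`):

```python
import numpy as np

# Illustrative zone data; in rasterstats this is the masked array for one feature
masked = np.ma.masked_less(np.arange(100, dtype="float64"), 10)
feature_stats = {}

# Before: one np.percentile call (and one .compressed() copy) per requested percentile
# for q in (10, 50, 90):
#     feature_stats[f"percentile_{q}"] = float(np.percentile(masked.compressed(), q))

# After: compress once, then compute all requested percentiles in a single call
valid = masked.compressed()
percentiles = [10, 50, 90]  # illustrative values
for q, value in zip(percentiles, np.percentile(valid, percentiles)):
    feature_stats[f"percentile_{q}"] = float(value)
```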

            except AttributeError:
                pixel_count = dict(zip([np.asscalar(k) for k in keys],
                                       [np.asscalar(c) for c in counts]))
            pixel_count = dict(zip(keys.tolist(), counts.tolist()))
@groutr (Author) Nov 9, 2022

This builds pixel_count in C instead of Python, and is consequently significantly faster.
Also, pixel_count should really be a list of tuples; floats do not make good keys in a hash table.
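Roughly the difference being described, with toy data (a sketch; note that `np.asscalar` is deprecated in favor of `.item()`):

```python
import numpy as np

keys, counts = np.unique(np.array([1, 1, 2, 3, 3, 3]), return_counts=True)

# Per-element conversion: a Python-level call for every array element
# (np.asscalar is deprecated; .item() is the equivalent)
slow = dict(zip([k.item() for k in keys], [c.item() for c in counts]))

# .tolist() converts the whole array to native Python scalars in one C-level pass
fast = dict(zip(keys.tolist(), counts.tolist()))

assert slow == fast == {1: 2, 2: 1, 3: 3}
```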

@perrygeo (Owner)

Good idea on the pixel_count calcs.

Tuples would work, though I'm hesitant to change that now due to backwards compatibility with existing code. I'd consider it, though. In practice, categorical rasters (the only rasters it would make sense to request pixel counts on) are going to be some integer dtype; a categorical float would be exceedingly rare.

@groutr (Author)

Perhaps adding a note to the docstring that this stat is intended for categorical rasters would be enough.

@@ -205,14 +207,10 @@ def gen_zonal_stats(
        if 'count' in stats:  # special case, zero makes sense here
            feature_stats['count'] = 0
    else:
        valid = masked.compressed()
        keys, counts = np.unique(valid, return_counts=True)
@groutr (Author) Nov 9, 2022

The array of unique elements, keys, has the very useful property of being sorted. We can take advantage of that for several stats and thus eliminate two O(n) passes (min, max) over the raster data. We can also use it to avoid a sort when computing the median.
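A sketch of the idea with toy data (illustrative, not the PR's exact code):

```python
import numpy as np

valid = np.array([4.0, 1.0, 3.0, 1.0, 4.0, 4.0, 2.0])  # one zone's unmasked pixels
keys, counts = np.unique(valid, return_counts=True)     # keys come back sorted ascending

zone_min = keys[0]    # no separate O(n) min pass
zone_max = keys[-1]   # no separate O(n) max pass

# Median from the counts, without re-sorting the raw values:
cumulative = np.cumsum(counts)
n = cumulative[-1]
mid = (n - 1) / 2
lo = keys[np.searchsorted(cumulative, mid, side="right")]
hi = keys[np.searchsorted(cumulative, mid + 0.5, side="right")]
zone_median = (lo + hi) / 2

assert zone_min == valid.min() and zone_max == valid.max()
assert zone_median == np.median(valid)
```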

@perrygeo (Owner) Jan 15, 2023

Nice idea. I like this approach. Two comments:

  • We could eliminate the np.unique call when the requested stats do not require it. I guess most stats do now... but we could avoid it entirely if the stats were only mean, count, or std (see the sketch below).

  • This impacts memory usage. Keeping the entire keys and counts arrays in memory for the remainder of the loop might be a bad tradeoff in some cases. Prior to this PR, the unique stat was the only one that called np.unique, and users had to opt in. That stat is intended for categorical rasters; using it on continuous values would not make much sense. Now we're calling np.unique on every invocation and using it to derive other stats, which is fine in theory. But for continuous rasters, or those with 16+ bits per band, there are not likely to be any duplicates, meaning we're allocating keys and counts arrays that are each roughly the length of the masked array itself (a possible 200% increase in memory footprint). This causes a performance regression for several use cases: code that never had to call np.unique now does, and allocates huge chunks of memory as a result. Multiple O(n) passes have the advantage of not requiring new allocations unless unique is specifically requested.

So while this may speed up some use cases, it may slow down others, and it will certainly increase resource usage. I'll have to do some more testing to quantify the impact and evaluate the tradeoffs.
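On the first point, the guard could look roughly like this (a sketch with illustrative names such as `NEEDS_UNIQUE` and `summarize`, not the actual rasterstats internals):

```python
import numpy as np

# Stats that are derived from (or only make sense with) the sorted unique keys/counts
NEEDS_UNIQUE = {"unique", "majority", "minority", "variety", "min", "max", "median"}

def summarize(masked, stats):
    feature_stats = {}
    valid = masked.compressed()

    if NEEDS_UNIQUE & set(stats):
        keys, counts = np.unique(valid, return_counts=True)
    else:
        keys = counts = None  # skip the extra allocation entirely

    if "count" in stats:
        feature_stats["count"] = int(valid.size)
    if "mean" in stats:
        feature_stats["mean"] = float(valid.mean())
    if "min" in stats:
        feature_stats["min"] = float(keys[0])  # keys is guaranteed here: "min" is in NEEDS_UNIQUE
    # ... remaining stats follow the same pattern
    return feature_stats

masked = np.ma.masked_equal(np.array([[3, 3, 0], [5, 9, 0]]), 0)
print(summarize(masked, ["mean", "count"]))  # no np.unique call
print(summarize(masked, ["min", "count"]))   # np.unique called once
```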

@groutr (Author) commented Nov 9, 2022

A quick benchmark using some datasets from your blog @perrygeo (https://www.perrygeo.com/zonal-stats-with-postgis-rasters-part-2.html)

import time
import rasterstats
print(rasterstats)
from rasterstats import zonal_stats

start = time.perf_counter()
stats = zonal_stats(
    "ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp",
    "wc2.1_2.5m_tavg/wc2.1_2.5m_tavg_07.tif",
    stats=["unique", "range", "std", "sum", "mean", "count", "std", "min", "max", "median", "percentile_.1", "percentile_.5", "percentile_.9"]
)
print(time.perf_counter() - start)

On my laptop, the latest version of rasterstats runs in 11.847925074999694s. With this PR, that runtime drops to 8.6056675789996s.

@perrygeo (Owner) left a comment

Thanks @groutr - excellent work!

I'm concerned about the additional memory usage here. I need to run some more rigorous benchmarks to quantify the impact on all the various edge cases.

Hopefully it's a net win for speed in most cases, and a tolerable increase in memory footprint.


@perrygeo (Owner) commented Jan 15, 2023

Initial testing

Setup

  • Uint16 tif, 25235x25944, tiled (source)
  • shapefile with 11 multipolygons, fully covering the valid tif values. (source)

Results (basic stats)

zonal_stats(polygon, raster, stats=["min", "max", "count"])

command time --verbose python bench.py

master (fastest, lowest memory footprint)

Maximum resident set size (kbytes): 2458736
Elapsed (wall clock) time: 0:15.30

groutr:perf1

Maximum resident set size (kbytes): 2742956
Elapsed (wall clock) time: 0:22.77

So, as I suspected, in the case where the user requests only basic stats, this PR increases the memory footprint (in this case +12%) and the run time (+48%).

There are a significant number of duplicates in this dataset (the tif contains integer elevation values that are highly spatially autocorrelated), so I'd expect datasets with larger integers or floats to have more unique values and worse numbers. But for now, we'll just focus on this dataset...

Results (basic stats + unique)

Let's add unique to invoke np.unique calls in both

zonal_stats(polygon, raster, stats=["min", "max", "count", "unique"])

command time --verbose python bench.py

master: (lowest memory footprint)

Maximum resident set size (kbytes): 2458940
Elapsed (wall clock) time: 0:25.61

groutr:perf1 (fastest)

Maximum resident set size (kbytes): 2743124
Elapsed (wall clock) time: 0:22.59

It's a draw. This PR is faster at the expense of an increased memory footprint.

Results (all stats)

Let's try it with every statistic we can calculate:

zonal_stats(polygon, raster, stats="*")

command time --verbose python bench.py

master:

Maximum resident set size (kbytes): 3435952
Elapsed (wall clock) time: 0:35.89

groutr:perf1 (fastest, lowest memory footprint)

Maximum resident set size (kbytes): 3149592
Elapsed (wall clock) time: 0:25.81

This PR shines ✨

Conclusion

There's no definitive answer, only tradeoffs :-)

This PR's approach is faster, with a lower memory footprint, when all the stats are requested.

This PR's approach is slower, with an increased memory footprint, when only a small subset of stats is requested and the data has fewer duplicate pixel values.

For this dataset, the number of unique values was 4 orders of magnitude lower than the total count, i.e. plenty of duplicates. We can intuit that the memory footprint and performance of this PR would worsen as the number of unique values approached the total count. If I have time in the future, I'd like to test with a floating point grid of random noise to simulate the worst case scenario.
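For reference, a worst-case grid like that could be generated with something along these lines (a sketch using rasterio; the size, CRS, and filename are arbitrary choices, not part of the benchmark above):

```python
import numpy as np
import rasterio
from rasterio.transform import from_origin

# Random float32 noise: essentially every pixel value is unique,
# so the keys/counts from np.unique are nearly as large as the data itself.
height, width = 10_000, 10_000
data = np.random.default_rng(0).random((height, width)).astype("float32")

profile = {
    "driver": "GTiff",
    "height": height,
    "width": width,
    "count": 1,
    "dtype": "float32",
    "crs": "EPSG:4326",
    "transform": from_origin(-180, 90, 360 / width, 180 / height),
    "tiled": True,
    "nodata": -9999.0,
}

with rasterio.open("random_noise.tif", "w", **profile) as dst:
    dst.write(data, 1)
```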

We know that this PR improves performance for certain cases but introduces performance regressions in others; it is not a clear win over the current approach. The question is: on average, do the benefits outweigh the negatives across a wide variety of use cases? I'm going to need to see more benchmark data from real-world datasets to make the decision.

@perrygeo (Owner) left a comment

Looking for feedback/testing on real-world datasets.

Does this speed things up for you? Does it increase or decrease memory usage? Please let me know in the comments.

@perrygeo (Owner)

@groutr we could also try to make this np.unique approach optional so that users could opt-in to this behavior if they knew their datasets would benefit from it.

@perrygeo (Owner)

@groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.

@groutr (Author) commented Jan 18, 2023

@perrygeo I will need to dig up my real-world test case today. IIRC, the changes in the PR yielded a dramatic performance increase.

@groutr (Author) commented Jan 18, 2023

@groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.

Sure!

@groutr (Author) commented Feb 22, 2023

@perrygeo My real-world test case was flawed. The performance difference was largely accounted for by a change in formats that escaped my notice.

However, I've been thinking about this further and wanted to hear your thoughts on the following idea.
What if the stats were broken out into functions that are called on the zone? For example, requesting the minimum of a zone would call a function that finds the minimum and returns it. This would allow user-defined functions to be supplied, tailored to specific datasets, so that computations could be optimized or even parallelized.
The main loop of rasterstats would simply orchestrate the execution of these functions and provide the appropriate input data. At a very high level, it would map a list of functions over the zone data and return the collected results from each function.
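Very roughly, the shape being proposed (purely a sketch of the idea with hypothetical names, not existing rasterstats API):

```python
import numpy as np

# Each stat is a function that takes one zone's masked array and returns a value
def zone_min(masked):
    return float(masked.min())

def zone_mean(masked):
    return float(masked.mean())

BUILTIN_STATS = {"min": zone_min, "mean": zone_mean}

def run_zone(masked, stat_funcs):
    """Map the requested stat functions over one zone's data and collect results."""
    return {name: func(masked) for name, func in stat_funcs.items()}

# A user-defined stat tailored to a specific dataset slots in the same way
def p95(masked):
    return float(np.percentile(masked.compressed(), 95))

requested = {**BUILTIN_STATS, "p95": p95}
masked = np.ma.masked_equal(np.array([[3, 3, 0], [5, 9, 0]]), 0)
print(run_zone(masked, requested))  # {'min': 3.0, 'mean': 5.0, 'p95': 8.4}
```

(rasterstats' existing add_stats parameter already accepts custom functions of the masked array; the proposal here would extend that pattern to the built-in stats and the main loop.)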
