
Improving Space Efficiency for Zonal Analysis #17

Closed · ffrosch opened this issue Aug 12, 2021 · 2 comments


ffrosch commented Aug 12, 2021

Description

Idea

Space efficiency could be improved by reducing the size of landscape arrays, either with sparse arrays (compression) or by using references instead of copies (duplication avoidance).

Reason

Improving the space efficiency of landscape arrays would greatly improve the usability of the library for zonal analyses of large datasets with many regions.

E.g., I am using pylandstats on a large dataset with about 1100 regions and a raster of size 2600x2400. Zonal analysis creates a copy of the raster for each region, which leads to high memory consumption (about 12 GB in my case, and that only after some manual improvements such as choosing the smallest possible dtype).
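A rough back-of-the-envelope check, assuming one full-raster copy per region and the numbers above (actual usage also depends on dtype choice and per-object overhead):

```python
# Approximate memory cost of one full-raster copy per region.
rows, cols, n_regions = 2400, 2600, 1100
bytes_per_cell = 2  # e.g. int16 after choosing the smallest possible dtype
total_gib = rows * cols * n_regions * bytes_per_cell / 1024**3
print(f"~{total_gib:.1f} GiB")  # ~12.8 GiB, in line with the ~12 GB observed
```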

Possible improvements

  1. Using zonal analysis with a large set of regions means that each landscape has only a small fraction of non-null values. It should be possible to implement the landscape arrays as sparse arrays with the Python library sparse without many adjustments to the rest of the code (see the sketch after this list). I haven't tested it yet, though, and would appreciate an assessment of the feasibility and of possible problems with other parts of the code.
  2. Would it be possible to not copy the landscape for each region and instead keep a reference to the original landscape together with the mask for the region? On request, the array for the region could be computed on the fly from the mask and the original landscape and discarded after the computation.
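A rough, untested sketch of both ideas (array and variable names are illustrative only, not pylandstats internals), using the pydata sparse library for idea 1:

```python
import numpy as np
import sparse  # pydata/sparse

# Illustrative stand-ins for a landscape raster and a single region mask.
landscape_arr = np.random.randint(1, 5, size=(2400, 2600), dtype=np.int16)
region_mask = np.zeros(landscape_arr.shape, dtype=bool)
region_mask[100:200, 100:200] = True  # one small region of a large raster
nodata = 0

# Idea 1: store the per-region landscape as a sparse COO array, so the
# overwhelmingly null cells outside the region cost (almost) no memory.
region_dense = np.where(region_mask, landscape_arr, nodata)
region_sparse = sparse.COO.from_numpy(region_dense, fill_value=nodata)
print(region_dense.nbytes, region_sparse.nbytes)  # the sparse copy is far smaller

# Idea 2: keep a single shared landscape plus one boolean mask per region and
# materialize the masked array only on request, discarding it afterwards.
def region_array(landscape, mask, nodata=0):
    return np.where(mask, landscape, nodata)
```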

Let me know what you think about these suggestions; I might be able to submit a pull request if the change seems reasonable.

martibosch (Owner) commented

Hello @ffrosch,

sorry for the huge delay in my response; I have not had time to work on pylandstats for a while. Thank you for sharing your ideas. This was indeed a design error on my end that made using pylandstats for zonal analysis with a large number of zones practically impossible.

This should be fixed in v3.0.0rc0, where zones are defined by vector geometries (a geoseries) and the zone landscapes are instantiated for the zone bounds only (using rasterio's mask.mask with crop=True). The change is implemented in the ZonalAnalysis class, but its child classes (BufferAnalysis, ZonalGridAnalysis, and even the spatiotemporal implementations) operate accordingly.
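For illustration, this is roughly what the cropping approach looks like with rasterio's mask.mask (file names and geometries are placeholders, not pylandstats internals):

```python
import geopandas as gpd
import rasterio
import rasterio.mask

zones = gpd.read_file("zones.gpkg")  # placeholder vector file with zone geometries

with rasterio.open("landscape.tif") as src:
    # With crop=True only the window covering the zone's bounds is kept,
    # instead of a full-raster copy per zone.
    zone_arr, zone_transform = rasterio.mask.mask(
        src, [zones.geometry.iloc[0]], crop=True
    )
```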

You can install v3.0.0rc0 from PyPI or conda-forge and try whether it now works for your use case. Note that the API has changed a little: the new signature takes a positional zones argument right after the landscape file path (see https://pylandstats.readthedocs.io/en/latest/zonal.html). You can find an overview in the example notebook at https://github.com/martibosch/pylandstats-notebooks/blob/main/notebooks/03-zonal-analysis.ipynb.
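A minimal usage sketch of that signature (file and zone names are placeholders, and the metrics call is only indicative; see the linked docs and notebook for the authoritative API):

```python
import geopandas as gpd
import pylandstats as pls

zones = gpd.read_file("zones.gpkg")  # placeholder zone geometries

# The zones argument is now passed positionally, right after the landscape file path.
za = pls.ZonalAnalysis("landscape.tif", zones)
class_metrics_df = za.compute_class_metrics_df(metrics=["proportion_of_landscape"])
```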

I hope this addresses your remarks, and sorry again for the delay. Feel free to reopen if needed. Best,
Martí

ffrosch (Author) commented Jan 16, 2024

Hello @martibosch,

thank you for coming back to this issue and taking care of it. That's awesome and should be a huge performance boost!

At the moment I don't have projects where I need the library, but I'm looking forward to trying it out in the future :-)

Best,
Florian
