
Add 4k SWO Ecoplot dataset #8

Closed
aazuspan opened this issue Apr 1, 2024 · 7 comments · Fixed by #11
Labels: enhancement (New feature or request)

aazuspan (Contributor) commented Apr 1, 2024

The current SWO Ecoplot dataset rasters are 128x128 - good for a quick test but too small to really evaluate Dask performance. @grovduck processed some 4096x4096 versions that would be a good addition. The files are too large to include in the package (~300MB), so we'll need to use an on-the-fly downloader. We could build that from scratch, but I think it makes more sense to use a pre-existing solution like pooch that handles caching and versioning. In the interest of keeping dependencies light, it could be an optional dependency that's checked when using the data loader. Once that's in place, it may be worth throwing the 128x128 version in there as well, just to keep all the data consolidated and the package minimal.

In terms of API design, I could see this as a separate dataset (e.g. load_swo_ecoplot_large) or as a parameter (e.g. load_swo_ecoplot(large=True)). If we wanted to include a lot of different resolution options, a size parameter like load_swo_ecoplot(size=4096) might make sense, but with just two options that seems more confusing than helpful.
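For concreteness, a rough sketch of what a pooch-backed loader could look like (the URL, hash, and `_load_packaged_128` helper are placeholders, not a final design):

```python
def load_swo_ecoplot(large=False):
    """Rough sketch of a loader that fetches the 4096x4096 rasters on demand."""
    if not large:
        return _load_packaged_128()  # placeholder for the bundled 128x128 loader

    try:
        import pooch  # optional dependency, only checked when it's actually needed
    except ImportError as exc:
        raise ImportError(
            "The 4096x4096 SWO Ecoplot dataset requires the optional `pooch` package."
        ) from exc

    # Placeholder URL and hash; a real registry would pin these to a release asset.
    paths = pooch.retrieve(
        url="https://example.com/swo_ecoplot_4096.zip",
        known_hash=None,
        processor=pooch.Unzip(),  # caches the download and unpacks it locally
    )
    return paths
```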

Any preferences on API design, or better naming ideas @grovduck?

@aazuspan aazuspan added the enhancement New feature or request label Apr 1, 2024
@aazuspan aazuspan self-assigned this Apr 1, 2024
grovduck (Member) commented Apr 2, 2024

@aazuspan, pooch looks really cool! I agree with keeping all data out of the package and serving it this way.

As far as the API goes, I suppose it all depends on what we intend to serve as sample data. For just the two datasets we have currently, I like either load_swo_ecoplot_large or load_swo_ecoplot(large=True), but if you think we might benefit from multiple resolutions, then I like your load_swo_ecoplot(size=4096) idea better.

If we went with the latter option, would any valid value 1 <= size <= 4096 work? We could arbitrarily decide to anchor the rasters at top-left or center and then return the user-requested clip size. It might be more trouble than it's worth for test data, but would there be any advantage to having customizable resolutions for optimizing chunk sizes? In fact, if we went this route, we would only need to maintain the 4096x4096 raster set and could generate the smaller ones from it.
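Something along these lines, maybe (rough sketch using a rasterio windowed read; assumes the full 4096x4096 rasters are already cached locally, and `clip_raster` is just an illustrative name):

```python
import rasterio
from rasterio.windows import Window

def clip_raster(path, size):
    """Rough sketch: read a top-left-anchored clip of the requested size."""
    if not 1 <= size <= 4096:
        raise ValueError("size must be between 1 and 4096")
    with rasterio.open(path) as src:
        # Window(col_off, row_off, width, height) anchored at the top-left corner
        return src.read(window=Window(0, 0, size, size))
```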

Absent that flexibility, I'd go with one of the first two options.

aazuspan (Contributor, Author) commented Apr 2, 2024

> If we went with the latter option, would any valid value 1 <= size <= 4096 work? We could arbitrarily decide to anchor the rasters at top-left or center and then return the user-requested clip size?

I hadn't considered clipping on-the-fly! I love the flexibility and the fact that we would only need to provide a single raster size, but I'm a little worried about the time it would take to load a small dataset (about 20s on my laptop to clip each raster, excluding download time). Maybe that's not a big issue if the files are cached in a temp dir after the first request? We could also provide a few different sizes and clip from whichever is closest to reduce IO time, but that adds more complexity...

I think at this point my inclination is to keep things simple with just a couple premade sizes, but getting the Pooch system in place with the 4096 dataset leaves this option open if we find there's a need for testing with more arbitrary sizes.

For distributing the 4096x4096 rasters, I'm hoping to package them as a single ZIP to simplify and speed up the data fetching, but currently we're well beyond GitHub's 100MB per-file limit (the ZIP is ~290MB). We could reduce the number of covariates or shrink the image size, but what do you think about using lossy compression? I can get it down to 94MB while keeping each band within a 0.5% max error using LERC_DEFLATE, but it definitely creates artifacts in a few covariates with big ranges of values and gradual changes, like the DEM below:

LZW (22.1MB at 4096x4096):
[image: DEM with LZW compression]

LERC_DEFLATE (4.4MB at 4096x4096):
[image: DEM with LERC_DEFLATE compression, showing artifacts]
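For reference, a rough sketch of one way to write such a lossy copy (rasterio-based; assumes GDAL was built with LERC support, and applies the 0.5% figure as an absolute MAX_Z_ERROR scaled from the raster's value range; not necessarily how the files above were produced):

```python
import rasterio

def write_lerc_copy(src_path, dst_path, rel_error=0.005):
    """Rough sketch: lossy LERC_DEFLATE copy with ~0.5% of the value range as error."""
    with rasterio.open(src_path) as src:
        data = src.read()
        # MAX_Z_ERROR is an absolute bound, so derive it from the raster's range.
        max_z_error = float((data.max() - data.min()) * rel_error)
        profile = src.profile.copy()
        profile.update(compress="LERC_DEFLATE", max_z_error=max_z_error)
        with rasterio.open(dst_path, "w", **profile) as dst:
            dst.write(data)
```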

What do you think? Is that acceptable, or are we better off just using a smaller dataset or downloading all the files individually?

grovduck (Member) commented Apr 3, 2024

> I hadn't considered clipping on-the-fly! I love the flexibility and the fact that we would only need to provide a single raster size, but I'm a little worried about the time it would take to load a small dataset (about 20s on my laptop to clip each raster, excluding download time). Maybe that's not a big issue if the files are cached in a temp dir after the first request? We could also provide a few different sizes and clip from whichever is closest to reduce IO time, but that adds more complexity...

Right, I was thinking that once the user had requested a certain resolution, that image would exist in cache locally for that resolution, so it would be a one-time cost. But I totally understand your hesitation with that approach.

> I think at this point my inclination is to keep things simple with just a couple premade sizes, but getting the Pooch system in place with the 4096 dataset leaves this option open if we find there's a need for testing with more arbitrary sizes.

That sounds like a great path forward.

> What do you think? Is that acceptable, or are we better off just using a smaller dataset or downloading all the files individually?

I think your proposed tests were more about performance than about adherence to correct values, right? If that's the case, I don't see any issue with using lossy compression to get down to a file size that can be hosted on GitHub. Keeping all the files in a single ZIP makes sense to me as well.

aazuspan (Contributor, Author) commented Apr 3, 2024

> I think your proposed tests were more about performance than about adherence to correct values, right? If that's the case, I don't see any issue with using lossy compression to get down to a file size that can be hosted on GitHub.

Yeah, that was my thought as well - thanks for the sanity check!

aazuspan (Contributor, Author) commented Apr 8, 2024

@grovduck I just noticed that I've been committing the latitude and longitude rasters (including in the new zipped datasets) even though they're not in the tabular data and wouldn't be used for prediction.

Any reason you can think of that I should leave those in? If not, I'll replace those zips when I make the next PR.

grovduck (Member) commented Apr 8, 2024

@aazuspan, totally my bad. Sorry for having those in there. I think I stripped them out as covariates in the model out of an abundance of caution. Sounds good to strip those out in the next PR.

aazuspan (Contributor, Author) commented Apr 8, 2024

No problem! It turns out they compressed so well that they barely affected the file size anyway, but they're still probably worth pulling out to avoid confusion in the future.
