
Add 4k SWO Ecoplot dataset #8

Closed
aazuspan opened this issue Apr 1, 2024 · 7 comments · Fixed by #11
Labels: enhancement (New feature or request)

aazuspan (Contributor) commented Apr 1, 2024

The current SWO Ecoplot dataset rasters are 128x128 - good for a quick test but too small to really evaluate Dask performance. @grovduck processed some 4096x4096 versions that would be a good addition. The files are too large to include in the package (~300MB), so we'll need to use an on-the-fly downloader. We could build that from scratch, but I think it makes more sense to use a pre-existing solution like pooch that handles caching and versioning. In the interest of keeping dependencies light, it could be an optional dependency that's checked when using the data loader. Once that's in place, it may be worth throwing the 128x128 version in there as well, just to keep all the data consolidated and the package minimal.

In terms of API design, I could see this as a separate dataset (e.g. load_swo_ecoplot_large) or as a parameter (e.g. load_swo_ecoplot(large=True)). If we wanted to include a lot of different resolution options, a size parameter like load_swo_ecoplot(size=4096) might make sense, but with just two options that seems more confusing than helpful.
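For concreteness, a rough sketch of what a pooch-backed loader could look like (the URL, hash, and `_load_packaged_128` helper are placeholders, not a final design):

```python
def load_swo_ecoplot(large=False):
    """Rough sketch of a loader that fetches the 4096x4096 rasters on demand."""
    if not large:
        return _load_packaged_128()  # placeholder for the bundled 128x128 loader

    try:
        import pooch  # optional dependency, only checked when it's actually needed
    except ImportError as exc:
        raise ImportError(
            "The 4096x4096 SWO Ecoplot dataset requires the optional `pooch` package."
        ) from exc

    # Placeholder URL and hash; a real registry would pin these to a release asset.
    paths = pooch.retrieve(
        url="https://example.com/swo_ecoplot_4096.zip",
        known_hash=None,
        processor=pooch.Unzip(),  # caches the download and unpacks it locally
    )
    return paths
```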

Any preferences on API design, or better naming ideas @grovduck?

@aazuspan aazuspan added the enhancement New feature or request label Apr 1, 2024
@aazuspan aazuspan self-assigned this Apr 1, 2024
grovduck (Member) commented Apr 2, 2024

@aazuspan, pooch looks really cool! I agree with keeping all data out of the package and serving it this way.

As far as the API goes, I suppose it all depends on what we intend to serve as sample data. For just the two datasets we have currently, I like either load_swo_ecoplot_large or load_swo_ecoplot(large=True), but if you think we might benefit from multiple resolutions, then I like your load_swo_ecoplot(size=4096) idea better.

If we went with the latter option, would any valid value 1 <= size <= 4096 work? We could arbitrarily decide to anchor the rasters at top-left or center and then return the user-requested clip size. It might be more trouble than it's worth for test data, but would there be any advantage to having customizable resolutions for optimizing chunk sizes? In fact, if we went this route, we would only need to maintain the 4096x4096 raster set and could generate the smaller ones from it.
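Something along these lines, maybe (rough sketch using a rasterio windowed read; assumes the full 4096x4096 rasters are already cached locally, and `clip_raster` is just an illustrative name):

```python
import rasterio
from rasterio.windows import Window

def clip_raster(path, size):
    """Rough sketch: read a top-left-anchored clip of the requested size."""
    if not 1 <= size <= 4096:
        raise ValueError("size must be between 1 and 4096")
    with rasterio.open(path) as src:
        # Window(col_off, row_off, width, height) anchored at the top-left corner
        return src.read(window=Window(0, 0, size, size))
```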

Absent that flexibility, I'd go with one of the first two options.

aazuspan (Contributor, Author) commented Apr 2, 2024

> If we went with the latter option, would any valid value 1 <= size <= 4096 work? We could arbitrarily decide to anchor the rasters at top-left or center and then return the user-requested clip size?

I hadn't considered clipping on-the-fly! I love the flexibility and the fact that we would only need to provide a single raster size, but I'm a little worried about the time it would take to load a small dataset (about 20s on my laptop to clip each raster, excluding download time). Maybe that's not a big issue if the files are cached in a temp dir after the first request? We could also provide a few different sizes and clip from whichever is closest to reduce IO time, but that adds more complexity...

I think at this point my inclination is to keep things simple with just a couple premade sizes, but getting the Pooch system in place with the 4096 dataset leaves this option open if we find there's a need for testing with more arbitrary sizes.

For distributing the 4096x4096 rasters, I'm hoping to package them as a single ZIP to simplify and speed up the data fetching, but currently we're well beyond GitHub's 100MB per-file limit (the ZIP is ~290MB). We could reduce the number of covariates or shrink the image size, but what do you think about using lossy compression? I can get it down to 94MB while keeping each band within a 0.5% max error using LERC_DEFLATE, but it definitely creates artifacts in a few covariates with big ranges of values and gradual changes, like the DEM below:

LZW (22.1MB at 4096x4096):
[image: DEM with LZW compression]

LERC_DEFLATE (4.4MB at 4096x4096):
[image: DEM with LERC_DEFLATE compression, showing artifacts]
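For reference, a rough sketch of one way to write such a lossy copy (rasterio-based; assumes GDAL was built with LERC support, and applies the 0.5% figure as an absolute MAX_Z_ERROR scaled from the raster's value range; not necessarily how the files above were produced):

```python
import rasterio

def write_lerc_copy(src_path, dst_path, rel_error=0.005):
    """Rough sketch: lossy LERC_DEFLATE copy with ~0.5% of the value range as error."""
    with rasterio.open(src_path) as src:
        data = src.read()
        # MAX_Z_ERROR is an absolute bound, so derive it from the raster's range.
        max_z_error = float((data.max() - data.min()) * rel_error)
        profile = src.profile.copy()
        profile.update(compress="LERC_DEFLATE", max_z_error=max_z_error)
        with rasterio.open(dst_path, "w", **profile) as dst:
            dst.write(data)
```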

What do you think? Is that acceptable, or are we better off just using a smaller dataset or downloading all the files individually?

grovduck (Member) commented Apr 3, 2024

> I hadn't considered clipping on-the-fly! I love the flexibility and the fact that we would only need to provide a single raster size, but I'm a little worried about the time it would take to load a small dataset (about 20s on my laptop to clip each raster, excluding download time). Maybe that's not a big issue if the files are cached in a temp dir after the first request? We could also provide a few different sizes and clip from whichever is closest to reduce IO time, but that adds more complexity...

Right, I was thinking that once the user had requested a certain resolution, that image would exist in cache locally for that resolution, so it would be a one-time cost. But I totally understand your hesitation with that approach.

> I think at this point my inclination is to keep things simple with just a couple premade sizes, but getting the Pooch system in place with the 4096 dataset leaves this option open if we find there's a need for testing with more arbitrary sizes.

That sounds like a great path forward.

> What do you think? Is that acceptable, or are we better off just using a smaller dataset or downloading all the files individually?

I think your proposed tests were more about performance than about adherence to correct values, right? If that's the case, I don't see any issue with using lossy compression to get down to a file size that can be hosted on GitHub. Keeping all the files in a single ZIP makes sense to me as well.

aazuspan (Contributor, Author) commented Apr 3, 2024

> I think your proposed tests were more about performance than about adherence to correct values, right? If that's the case, I don't see any issue with using lossy compression to get down to a file size that can be hosted on GitHub.

Yeah, that was my thought as well - thanks for the sanity check!

aazuspan (Contributor, Author) commented Apr 8, 2024

@grovduck I just noticed that I've been committing the latitude and longitude rasters (including in the new zipped datasets) even though they're not in the tabular data and wouldn't be used for prediction.

Any reason you can think of that I should leave those in? If not, I'll replace those zips when I make the next PR.

grovduck (Member) commented Apr 8, 2024

@aazuspan, totally my bad. Sorry for having those in there. I think I stripped them out as covariates in the model out of an abundance of caution. Sounds good to strip those out in the next PR.

aazuspan (Contributor, Author) commented Apr 8, 2024

No problem! It turns out they compressed so well that they barely affected the file size anyway, but they're still probably worth pulling out to avoid confusion in the future.
