Feature: GeoTiff Layers and Cloud Optimized GeoTiff Support #2284

Closed
lossyrob opened this Issue Jul 16, 2017 · 17 comments

Member

lossyrob commented Jul 16, 2017

This issue will explain the reasoning behind and initial architecture considerations of
a set of related features that are upcoming in the next release of
GeoTrellis (after 1.1): GeoTiff Layers and Cloud Optimized GeoTiff Support

Context and Background

Issues with current GeoTrellis Layer architecture

These two features will address a number of issues:

The "many small files" issue

A large pain point for using GeoTrellis on AWS S3 is the number of small files a GeoTrellis ingest
produces. If you ingest a layer to a tile layout with 256x256 tiles covering a large area, you'll be
saving off millions of small tiles to S3. S3 is not built for this use case, and is not priced for
it either: each GET and PUT costs money, and the high number of them when dealing with GeoTrellis
layers is a burden on our users.

This issue is mostly felt by S3 backend users, but could potentially affect Hadoop and file system
users, as well as any future object store backends we might support, like Google Cloud Storage.
It is less of a concern with the Cassandra, Accumulo, or HBase backends.

Data format interoperability

GeoTrellis saves tiles in an Avro serialization format. Once a GeoTrellis layer is written,
that data can only be read using the GeoTrellis library. This is unlike storing raster
data as GeoTiffs, which can be read by most raster libraries under the sun (and most importantly
GDAL-based software).

Duplication of Data

Users can currently use GeoTrellis to perform Spark jobs on GeoTiffs without having to save off
GeoTrellis layers. However, to take full advantage of GeoTrellis, utilizing its ability
to query space-filling-curve-indexed tile sets and combine raster data quickly, users are
always advised to ingest data into GeoTrellis layers stored in one of our supported backends.
Once the data is saved as GeoTrellis layers,
not only is it readable only by GeoTrellis, as mentioned above, but it also
duplicates the source data. For users dealing with very large raster data, this
is a burden: unless they commit to storing their data only as GeoTrellis layers, they
are required to hold onto twice as much data as they would like.

Cloud Optimized GeoTiffs (COGs)

As the page from GDAL's wiki on Cloud Optimized GeoTiffs states:

A cloud optimized GeoTIFF is a regular GeoTIFF file, aimed at being hosted on a HTTP file server, whose internal organization is friendly for consumption by clients issuing HTTP GET range request ("bytes: start_offset-end_offset" HTTP header).

COGs don't actually add anything to GeoTiffs; they simply specify an ideal layout of the data inside the TIFF that makes HTTP GET range requests optimal.
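
To make the mechanism concrete, here is a minimal sketch of such a range read in Scala, using plain java.net rather than any GeoTrellis API:

import java.net.{HttpURLConnection, URL}

// fetch only the bytes of a single tile segment rather than the whole file
def readRange(url: String, startOffset: Long, endOffset: Long): Array[Byte] = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestProperty("Range", s"bytes=$startOffset-$endOffset") // the header quoted above
  val in = conn.getInputStream
  try Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  finally { in.close(); conn.disconnect() }
}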

There is a community effort to make more raster data available through COGs,
which GeoTrellis can both take advantage of and help support. By treating COGs
as first class citizens in GeoTrellis and working with raster data that is stored in COGs directly,
we can solve the duplication, interoperability, and file number problems mentioned above.

COG extension: Web COGs

GeoTrellis layers are often created such that there is a layer per {z}/{x}/{y} zoom level,
to allow easy retrieval of tiles for the purpose of serving on a web map such as Leaflet or
Open Layers. In this way we represent large raster data as smaller web-aligned tiles along zoom levels.

We can further scope COGs so that they support this access pattern with additional constraints on the data format and layout. Because this is a more specialized scope than COGs in general, we can call GeoTiffs that fit this extended set of constraints "Web COGs".

In addition to the required and optional COG file structure attributes, the following
conditions will apply to Web COGs:

  • The projection is WebMercator (EPSG:3857).
  • The Web COG internal tiling scheme for each overview is 256x256.
  • The bounding box of the Web COG must align with the tile layout of a web map tiling scheme,
    as described in Open Street Map's Slippy Tile wiki.
  • The smallest overview is a single 256x256 tile. This data represents the Web COG at its "min zoom level".
  • Each successive overview increases the number of tile columns and rows by a factor of 2,
    and represents the tiles of the next zoom level.
  • The full resolution overview represents the "max zoom level".
  • Optional: band interleave, so that bands can be subsetted via HTTP GET range requests.

We can represent data such as images as single Web COGs with a min zoom of 0 and a max zoom whose resolution most closely matches the native image. To serve web tiles out
of the image, only one HTTP GET range request is necessary to fulfill each tile request, provided the metadata of the Web COG is known.

If the Web COG is stored with band interleave (INTERLEAVE=BAND), subsets of the bands can be retrieved. This is beneficial if, for instance, we were storing all 11 bands of Landsat 8 and wanted to serve out an NDVI operation.
For each web tile, we would make two HTTP GET range read requests
for 256x256 uint16 tiles: one for the red band and one for the near infrared band. If we stored
the 11 bands with pixel interleave (INTERLEAVE=PIXEL), we would make only one HTTP GET range read request, but
we would be pulling down 256x256x11 uint16 values, as opposed to 256x256x2 uint16 values over two requests
with band interleaving.
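
To make that trade-off concrete, a back-of-the-envelope calculation for this Landsat 8 example:

val cellsPerTile = 256 * 256 // 65,536 cells in one web tile
val bytesPerCell = 2         // uint16
val pixelInterleaved = cellsPerTile * 11 * bytesPerCell // one request, all 11 bands: 1,441,792 bytes
val bandInterleaved  = cellsPerTile * 2  * bytesPerCell // two requests, red + NIR only: 262,144 bytes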

A set of Web COGs can represent a large amount of data in reasonably sized tiles. Though you could store
an arbitrary amount of data in a single BigTiff Web COG, there might be reasons to break the
data over a sequence of files. In that case, each Web COG in the set can represent a portion of the data
over some range of zoom levels in its overviews. If we broke the data up over some sequence of zoom levels
[z0, z1, ..., zN], we would have a situation where:

  • The "top level" Web COG represented the whole of the data from zoom level z0 to zoom z1 - 1.
  • The next level of Web COGs would have as many files as it would take to cover the data in z1
    tiles, and each of those Web COGs min zoom levels would be z1. The max zoom level would be z2 - 1.
  • In general, for each level of Web COG, the min zoom level overview would be the tiles of zX that
    cover the data, and the maximum zoom level would contain an internal tiling that represented z(X+1) - 1
    zoom level tiles covering the data.
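
As a small sketch of the zoom bookkeeping this scheme implies (illustrative only, not proposed API):

// Given split zoom levels [z0, z1, ..., zN], level i serves zooms [z_i, z_{i+1} - 1]
def cogLevelFor(zoom: Int, splits: Seq[Int]): Int =
  splits.lastIndexWhere(_ <= zoom)

// e.g. cogLevelFor(7, Seq(0, 5, 10)) == 1: zoom 7 falls in [5, 9], served by the level-1 Web COGs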

I believe we could construct a VRT over the set of Web COGs such that users could
interact with the VRT as they would a single Web COG for the entire layer.

Decisions and Initial Architecture Considerations

This section will outline the two features that are aimed at solving the issues mentioned above, and
the work that is required to provide support for COGs.

Feature A - Cloud Optimized GeoTiff support

In order to support COGs, we will need to flesh out some functionality that is missing
from our GeoTiff reader and writer, including:

  • The ability to do selective reads of band subsets of a band-interleave GeoTiff (that
    work has been started in PR 2115)
  • The ability to read and write overviews; we currently do not support any overview
    reading or writing.
  • Adding smart resample reads: while we support window reads that will only fetch
    data via HTTP GET range requests, we need to also support resampling that will smartly
    choose the relevant overview that is available to the GeoTiff, if overviews are present.
  • API for dealing with COGs: if a GeoTiff is a COG or Web COG, there should be a specific
    API that makes interacting with it as simple as possible for the user. For example, if I
    wanted a specific {z}/{x}/{y} tile from a COG, there should be a method to do so,
    such as geoTiff.getTile(z, x, y) (see the sketch below).
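
A hedged sketch of what that might look like at the call site; openCog and getTile are illustrative names for the proposed API, not existing GeoTrellis methods:

val geoTiff = openCog("s3://bucket/layer.tif") // hypothetical COG-aware, streaming reader
val tile: MultibandTile = geoTiff.getTile(z = 12, x = 1205, y = 1539) // one range read per tile if metadata is known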

We should also supply library functionality for converting sets of GeoTiffs to COGs or Web COGs
in Spark, so that users who want to move their big raster data to COG format can utilize GeoTrellis
and Spark for this task.

Feature B - GeoTiff Layers

Once we have COG support in GeoTrellis, we can create a new type of GeoTrellis Layer architecture
that takes advantage of COGs as the storage mechanism.

More generally, we can create a layer architecture that uses arbitrary GeoTiffs as its storage
mechanism. If the GeoTiffs backing these layers happen to be COGs, then performance will be optimal.
However, these layers should be able to work over an arbitrary set of GeoTiffs.

Decoupling the index metadata from the raster data

The way indexing currently works with GeoTrellis layers is that each tile (or set of tiles,
in the case of spatiotemporal datasets) is stored along with its key (a SpatialKey,
SpaceTimeKey, or some custom client key) in an entry in the backend that is indexed by
the space filling curve index for that layer. To satisfy, say, a bounding box query against a spatial
layer, the key index is deserialized from the metadata (stored inside an AttributeStore)
and used to compute the space filling curve indexes that need to be read to satisfy the query.
The space filling curve indexes are then used to query values directly out of the backend. In the
case of a filesystem or S3 backend, the tiles that satisfy the query are stored as files whose
names are the space filling curve index; when we read the data
at that space filling curve index, we are reading the raster data directly.

The proposal here is to decouple the indexing mechanism from the raster data that it indexes.
This requires substituting another value for the raster data that was directly tied to the
space filling curve index: we can store how to get the data, not the actual data itself.

To think of it in another way, we currently store data like this:

Space Filling Curve index    Value
Array[Byte]*                 Seq[(K, V)]

where K is the key type and V is the value type (the raster data).

* Currently we store the SFC index as Long values, but we are moving to Array[Byte]; we use
that here to indicate that the key might contain additional complexity, such as periodicity.

The proposal is to shift to storing instead something like:

Space Filling Curve index    Value
Array[Byte]                  Seq[() => (K, V)]

so that each value is a function, let's call it the "fetch function", that can return the data
that would previously have been held statically in the backend, with no raster data stored in this structure.
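
In Scala terms, the shift is roughly the following (illustrative type aliases only):

// before: the backend stores the raster data itself at each SFC index
type StoredBefore[K, V] = Seq[(K, V)]
// after: the backend stores a "fetch function" per entry; the raster data stays where it is
type StoredAfter[K, V]  = Seq[() => (K, V)]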

This opens up a world of possibilities for what this function that supplies the data could do.
For instance, if my raster data is stored in EPSG:4326, but I want to interact with
the layer as an EPSG:3857 layer, this function can do the necessary reads on the source
data and reproject on the fly as needed.
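
As a sketch of such a fetch function (readWindow is a hypothetical helper; the reproject call stands in for GeoTrellis's raster reprojection):

import geotrellis.proj4.{LatLng, WebMercator}
import geotrellis.raster._
import geotrellis.spark.SpatialKey

// hypothetical: windowed read of the key's extent via HTTP GET range requests
def readWindow(uri: String, key: SpatialKey): Raster[MultibandTile] = ???

def reprojectingFetch(key: SpatialKey, uri: String): () => (SpatialKey, MultibandTile) =
  () => {
    val raster = readWindow(uri, key)
    (key, raster.reproject(LatLng, WebMercator).tile) // EPSG:4326 -> EPSG:3857 on the fly
  }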

In this way, we can move towards thinking of GeoTrellis layers as collections of metadata,
instead of collections of the data itself. If the fetch function is straightforward,
the fetch is fast; if the fetch function is more complicated, the fetch is slower but
still yields the same result.

For instance, if my fetch function runs against a layer of Web COGs, then the fetch is a simple, single
HTTP GET range request against the COG per tile of the layer. If the layer is backed by
a set of GeoTiffs that are not COGs, then each tile fetch may need to read from
multiple segments or multiple GeoTiffs, and be less optimal. However, the results
are the same from the user's perspective, and optimization paths are there if desired.

Because the metadata and raster data are currently coupled, they exist in the same place (in the case
of S3 files, the SFC-indexed file contains both the metadata and the raster data). With this decoupling,
we can also concentrate on the storage format for the metadata that is most optimal for its access
patterns, so that we do not end up doing more fetches to S3 than we need to.

In this architecture, an "ingest" of a layer can simply be the reading of all the GeoTiff metadata, the construction of the SFC-indexed fetch functions, and the storage of that metadata (which should be minuscule compared to the
pixel data) into a metadata backend.
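
As a sketch, the records such an ingest might persist could look like this (hypothetical types, not existing GeoTrellis classes):

// where a tile's bytes live inside some GeoTiff
case class SegmentRef(uri: String, byteOffset: Long, byteLength: Int)
// one entry per SFC index; an "ingest" scans GeoTiff headers and stores only these
case class FetchEntry(sfcIndex: Array[Byte], segments: Seq[SegmentRef])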

The querying of layers can look the same as it does now; this new layer architecture can mostly be transparent
to the user.

@lossyrob lossyrob added this to the 1.2 milestone Jul 16, 2017

Contributor

echeipesh commented Jul 17, 2017

One concern that needs to be addressed directly is a mechanism for keeping the data at its original resolution in a Web COG. Unless such a mechanism exists, we fall short of avoiding data duplication, because resampling is a lossy transformation.

One way this could be achieved is to tile the segments of the input rasters at their native resolution while matching the tile boundaries of the closest lower-resolution zoom level. You would end up with segments sized somewhere between 256x256 and 512x512 pixels.

Example: let there be a GeoTiff layer whose resolution falls somewhere between zoom 15 and zoom 16 when tiled to the TMS layout for EPSG:3857.

Tiling it to match the layout of the zoom 15 layer produces 480x480 tiles.
When fulfilling a request for the tile at zoom=16, col=626, row=214 we:

  • fetch the segment corresponding to col:(626/2)=313, row:(214/2)=107
  • the result is a tile of 480x480 pixels
  • crop to the (626%2)=0, (214%2)=0 quadrant of said tile: pixels from 0,0 to 239,239
  • resample the cropped pixels to 256x256 on the fly.

Because the resolution is preserved and tiling does not change the extent of any pixel, the original data is preserved.

Because the segment extent matches the previous zoom level, a request for any higher zoom level needs to resample only one segment.
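
In code, the index arithmetic of this example looks roughly like the following (GridBounds is GeoTrellis's grid window type; the helper names are illustrative):

import geotrellis.raster.GridBounds

def parentSegment(col: Int, row: Int): (Int, Int) = (col / 2, row / 2)

def quadrantWindow(col: Int, row: Int, segmentSize: Int): GridBounds = {
  val half = segmentSize / 2            // 240 for a 480x480 segment
  val (qc, qr) = (col % 2, row % 2)     // which quadrant of the parent segment
  GridBounds(qc * half, qr * half, qc * half + half - 1, qr * half + half - 1)
}

// parentSegment(626, 214) == (313, 107)
// quadrantWindow(626, 214, 480) == GridBounds(0, 0, 239, 239) -- then resample to 256x256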

Member

metasim commented Jul 20, 2017

@echeipesh To echo your concern, being able to retrieve the data in its native resolution and projection is likely very important to our use cases. For example, when working with MODIS we operate in the native sinusoidal projection. An automatic reprojection into 3857 would likely inject additional error we'd rather not have.

Member

lossyrob commented Jul 20, 2017

I wasn't suggesting we would force or automatically reproject any data. The point of Web COGs is to support the use case where we want EPSG:3857 (a common use case) and are already manually reprojecting to the WebMercator TMS. For other use cases, we can create COGs in a profile that fits the native resolution and provides an appropriate tile layout (e.g. where we currently use a FloatingLayoutScheme).

Can you help me clarify the language of the issue a bit? I don't want to give the impression that Web COGs or EPSG:3857 are the only supported case, and want to modify the language so as not to give that impression.

Member

lossyrob commented Jul 20, 2017

@echeipesh That use case can be supported, although I wouldn't call them Web COGs. The Web COG idea is that if you want to format your data in a pyramided TMS, you can do so in a set of GeoTiffs that's readable by GeoTrellis and also by a set of other tools.

If data integrity is your goal, then you probably don't want to be using Web Mercator at all. However, you could still use COGs that would be beneficial in GeoTrellis layers; they just wouldn't be of the Web COG "profile".

We can imagine a series of "profiles" of COGs (@cholmes's language) that fit specific needs; for example, to keep data integrity in GeoTrellis we currently use a floating layout scheme in the original projection. We can have a specific profile of COG that performs optimally in GeoTrellis for this use case, and encourage users to use it; however, if they don't want to reformat or duplicate their data, we could just read the base COGs (or non-COG GeoTiffs, for that matter) in whatever format they have, and work a little harder to get the right data.

Your idea hits a use case where the user either already has EPSG:3857 data, or is willing to reproject to it, but wants to keep the "native resolution", i.e. the resolution in EPSG:3857 that most closely matches the original (I'm assuming for doing analytics as well as visualization, for things not nearing the poles), without duplicating it into separate analytics and visualization layers. It's an interesting middle ground, and might be one type of COG profile. I'd vote on keeping Web COGs fitting the zoom levels exactly, though.

Contributor

hectcastro commented Aug 23, 2017

Just as a validation step to ensure that we're playing nicely with other ecosystem tools, it would be great to ensure that COGs produced by GeoTrellis validate with validate_cloud_optimized_geotiff.py.

See: https://trac.osgeo.org/gdal/wiki/CloudOptimizedGeoTIFF#HowtocheckifaGeoTIFFhasacloudoptimizationinternalorganization

Member

pomadchin commented Aug 29, 2017

I used a slightly modified version of this script: https://gist.github.com/pomadchin/6704ee210456a1a11debf296fd6545fd
The difference is here. As I understand it, this is not an obligatory TIFF tag, and if it's not set it defaults to 8. Is there any concrete information about it, given that even GDAL won't set this tag?

Member

pomadchin commented Aug 29, 2017

Current PR (against band subsetting PR though): https://github.com/pomadchin/geotrellis/compare/feature/geotiff-streaming-band-subset...pomadchin:feature/geotiff-overview?expand=1

The optimizedOrder flag in the writer controls whether segment bytes are written in the default or the optimized order.
As @lossyrob mentioned, it probably makes sense to rename it to a cloudOptimized flag, which would take into account (1) byte ordering and (2) overview writing (by default, TIFFs would be saved without overviews at all, even if the source GeoTiff has them).
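
As a hedged sketch, the call site after such a rename might look like this (the cloudOptimized parameter is the proposal, not existing API):

// hypothetical flag: write with COG byte ordering and overviews included
GeoTiffWriter.write(geoTiff, "/tmp/out-cog.tif", cloudOptimized = true)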

Member

metasim commented Aug 29, 2017

Sorry for the tangent, but this just came to mind: has there been discussion as to whether the GDAL VRT format could fulfill (some) requirements of a COG?

Member

lossyrob commented Aug 29, 2017

VRTs could mimic an internal tiling structure and overviews, but I'm unclear on the advantages of storing separate files fronted by a VRT vs a COG. @cholmes might have some input here.

Member

metasim commented Aug 29, 2017

Certainly doesn't address the S3 optimization goals, but I wonder if that could be handled by generating the VRT in different ways. Bringing this up is probably just a distraction, but I like hacking existing (de facto) standards when possible.

@echeipesh echeipesh modified the milestones: 2.0, 1.2 Sep 7, 2017

Member

pomadchin commented Sep 11, 2017

Some results of our talk with @lossyrob.

There is still an open question: what should a GeoTiff / COG layer look like?
What should the "index" of such a GeoTiff catalog look like? What data do we need to keep in it, and where? Should it be an SQL / NoSQL DB, or some file?

To give a bit more context: it's not enough to just have a set of tiffs in some directory to call it a layer; there needs to be additional metadata to keep information about the tiffs' URIs / segment offsets / other tiff metadata. So such a layer is still some GeoTiff data plus GeoTiff metadata.

A potential result API design:

val source = "s3://bucket/prefix" // path to some set of tiffs
val layerPath = "s3://source/layer"
// a Spark job to create the index
// JSON? DB?
val metadata = GeoTiffLayer.fromFiles(source)
metadata.save(layerPath)
val layer = GeoTiffLayer(layerPath)
// query the layer
val rdd: RDD[(URI, Raster[MultibandTile])] = layer.query().where(Intersects(extent)) 

// somehow merge into a single tile
val tile: MultibandTile =
  layer
    .createTile(geom = extent, crs = WebMercator, bands=List(1, 2, 3), dateInternal=???)
    .merge()

layer.toCogLayer(tileLayout, etc) // explanations below

Also there can be a COG Layer:

  • a group of COGs that are:
    • non-overlapping
    • fit exactly to some tile layout, both between files and with respect to the internal tiling

val layer = COGGeoTiffLayer.index("s3://some-set-of-cogs-that-fit-cog-layer")

val layer = COGGeoTiffLayer("s3://pathtocoglayer") // <- JSON? DB?
// Runs the Spark job to create the index.
// Determines the layout definition.

layer.get(x, y, z)
layer.query().where(Intersects(...)) ...

Two goals:

  • Create a queryable layer from any set of GeoTiffs.
    • This allows us to avoid rewriting data, and turns the layer into something like a queryable database over the files.
  • Store GeoTrellis layers as COG Layers (and Web COG layers).
    • This still requires rewriting data, but we get the benefit of storing large files on S3, plus band subsetting.

Potentially (somewhere in the future) we'll be able to use GeoTiffs instead of our own internal binary format, which is currently what our ingest process produces.

I'd like to hear some additional opinions and feedback on this.

@echeipesh @notthatbreezy @hectcastro @moradology @metasim and others.

Contributor

notthatbreezy commented Sep 13, 2017

I have a few thoughts/questions -- not sure how much of this is feasible, but mostly thinking about how I think COGs could be most useful in Raster Foundry and the path that would be easiest to adopt given Raster Foundry's architecture.

I think adhering as closely as possible to the current API would be ideal, so that we can adopt this quickly while gradually transitioning existing layers to COGs where possible. In my ideal scenario, where we have both COGs and Avro-encoded tiles, this would look something like the following pseudo-ish code:

val store = PostgresAttributeStore()
val reader = sceneIngestFormat match {
  case S3  => new S3ValueReader(store).reader[SpatialKey, MultibandTile](layerId)
  case COG => new COGValueReader(store).reader[SpatialKey, MultibandTile](layerId)
}
reader.read(key)

I think sticking to JSON and continuing to use the attribute store would result in the least amount of friction. JSON has benefits in that it is fairly agnostic to where it gets put and how it is read, and it is easy to debug because it is human-readable. Adding a PostgresAttributeStore, for instance, was straightforward for this reason. I'm not sure what additional metadata is necessary to construct the spatial keys to request the correct bytes from a COG, but that seems like the tough thing to figure out.

In conversation off the issue (and a little bit here above) there was the question: what is a COG layer? I think @lossyrob got pretty close to ideal in the original formulation of the issue:

In this architecture, an "ingest" of a layer can simply be the reading of all the GeoTiff metadata, the construction of the SFC-indexed fetch functions, and the storage of that metadata (which should be minuscule compared to the pixel data) into a metadata backend.

A couple of other comments:

Potentially we'll be able to use GeoTiffs instead of our own internal binary format which appears to be the result of our current ingest process (somewhere in the future).

This seems like a good idea, but we'd be giving up a lot -- a lot of data has already been ingested in the current format, and it's not clear how COGs can (if at all) play with the other storage backends.

Member

metasim commented Sep 13, 2017

WRT:

Potentially we'll be able to use GeoTiffs instead of our own internal binary format which appears to be the result of our current ingest process (somewhere in the future).

This seems like a good idea, but we'd be giving up a lot -- a lot of data has already been ingested in the current format and it's not clear how COGs can (if at all) play with other storage backends.

While dealing with legacy data ingests is something to consider, my gut says that if you can replace the internal binary format with something standards-based (just mini GeoTiffs or another GDAL format), there are a lot of downstream benefits to be had. I could imagine certain export/transfer/transform operations being simplified.

Perhaps you can support both by maintaining readers/modifiers for the legacy forms, while making whatever the new format is the default for creation. IMHO, in this context, sticking with standards will make adoption even more favorable.

Contributor

notthatbreezy commented Sep 13, 2017

Definitely -- I'm completely on board with switching to something more standards-based, for all of the downstream benefits associated with it and for the other reasons you pointed out.

I agree this should be the default format for creation, and that maintaining legacy support for readers would be good -- it would make adoption much easier and more straightforward for us, since we can transition to defaulting to COGs without losing support for existing layers.

cholmes commented Sep 19, 2017

The legacy support with new data stored in COG sounds ideal.

One thing to think about in at least RasterFoundry would be making sure that the layers exposed in the UI reference the COG source of data that is used in the system, so that users could point directly at the data with other tools.

Also, on the Web COG, @mojodna had some concerns in another thread about band interleaving - it 'prevents YCbCr JPEGs from being created, which have dramatically decreased our storage needs on the OAM end of things (~10x).'

I'll try to get a discussion going on "profiles" of COG; indeed, I think it'd be good to figure out exactly what should be in the COG validator Even wrote.

cholmes commented Jul 6, 2018

Should this be closed? Seems like it's pretty well supported now. Though some docs on it would be nice.

Member

lossyrob commented Jul 6, 2018

Good call, I'm closing, and will create an issue for docs.
