Ghost header or; data outside IFDs and Images #99

Open
HeroicKatora opened this issue Oct 14, 2020 · 4 comments

@HeroicKatora
Member

At least two formats, GeoTIFF and ScanImage, use a similar technique to include additional data in TIFF files without affecting the structure of the file for conformant, pure image readers. The usual process when reading the first frame is as follows:

  1. Read the header: 8 bytes for classic TIFF or 16 bytes for BigTIFF. The header contains the offset of the first image file directory (IFD).
  2. Seek forward to the first IFD.
  3. Find the StripOffsets tag or a similar one, and seek to the image data accordingly.
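Step 1 above can be sketched in a few lines. This is a minimal illustration based on the classic TIFF and BigTIFF header layouts, not the code used by the tiff crate; the names `TiffHeader`, `ByteOrder`, and `parse_header` are hypothetical.

```rust
/// Byte order declared in the first two bytes of a TIFF file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ByteOrder {
    LittleEndian,
    BigEndian,
}

/// Header fields needed to locate the first IFD (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TiffHeader {
    pub byte_order: ByteOrder,
    pub big_tiff: bool,
    /// 8 for classic TIFF, 16 for BigTIFF.
    pub header_len: u64,
    pub first_ifd_offset: u64,
}

/// Interprets up to 8 bytes as an unsigned integer in the given byte order.
fn read_uint(bytes: &[u8], order: ByteOrder) -> u64 {
    let mut value = 0u64;
    match order {
        ByteOrder::BigEndian => {
            for &b in bytes {
                value = (value << 8) | u64::from(b);
            }
        }
        ByteOrder::LittleEndian => {
            for &b in bytes.iter().rev() {
                value = (value << 8) | u64::from(b);
            }
        }
    }
    value
}

/// Parses a classic TIFF (magic 42) or BigTIFF (magic 43) header.
/// Returns `None` if the bytes are too short or not a TIFF header.
pub fn parse_header(bytes: &[u8]) -> Option<TiffHeader> {
    let byte_order = match bytes.get(0..2)? {
        [b'I', b'I'] => ByteOrder::LittleEndian,
        [b'M', b'M'] => ByteOrder::BigEndian,
        _ => return None,
    };
    match read_uint(bytes.get(2..4)?, byte_order) {
        42 => Some(TiffHeader {
            byte_order,
            big_tiff: false,
            header_len: 8,
            first_ifd_offset: read_uint(bytes.get(4..8)?, byte_order),
        }),
        43 => {
            // BigTIFF: bytes 4..6 hold the offset size (8), bytes 6..8 are
            // reserved, bytes 8..16 hold the first IFD offset.
            if read_uint(bytes.get(4..6)?, byte_order) != 8 {
                return None;
            }
            Some(TiffHeader {
                byte_order,
                big_tiff: true,
                header_len: 16,
                first_ifd_offset: read_uint(bytes.get(8..16)?, byte_order),
            })
        }
        _ => None,
    }
}
```

Comparing `first_ifd_offset` with `header_len` is then enough to tell whether anything sits between the header and the first IFD.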

Note that, by adjusting all offsets accordingly, it is possible to 'hide' data inside the file, in the sense that conformant image decoders will simply seek over it. In particular, choosing an initial offset greater than 8 (or 16 for BigTIFF) causes some number of bytes immediately after the TIFF header to be skipped. GeoTIFF calls this a Ghost Header and uses it to provide additional data for 'Cloud compatibility': it contains a separate index that can be used to optimize reads by transferring less of the file, whose only index would otherwise be the singly linked list of IFDs. (See also #85).
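As a rough sketch of what opt-in access to these bytes could look like, the function below reads whatever lies between the end of the header and the first IFD. The name `read_leading_gap` and the `max_len` cap are assumptions made for illustration, not part of the tiff crate. Note that a gap alone does not prove a ghost header exists: many ordinary TIFF files place image data directly after the header and the IFD at the end, so the caller still has to interpret the bytes.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads the bytes between the TIFF header and the first IFD, if any.
///
/// `header_len` is 8 for classic TIFF and 16 for BigTIFF. At most `max_len`
/// bytes are read, because in a file without a ghost header the gap may
/// simply be (large) image data.
pub fn read_leading_gap<R: Read + Seek>(
    reader: &mut R,
    header_len: u64,
    first_ifd_offset: u64,
    max_len: u64,
) -> io::Result<Option<Vec<u8>>> {
    if first_ifd_offset <= header_len {
        // The first IFD directly follows the header: nothing is hidden.
        return Ok(None);
    }
    let len = (first_ifd_offset - header_len).min(max_len);
    reader.seek(SeekFrom::Start(header_len))?;
    let mut buf = Vec::new();
    reader.by_ref().take(len).read_to_end(&mut buf)?;
    if (buf.len() as u64) < len {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "file ends before the first IFD offset",
        ));
    }
    Ok(Some(buf))
}
```

Returning raw bytes keeps the function agnostic of GeoTIFF, ScanImage, or any other variant; parsing the content is left to the caller.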

It seems worthwhile to provide access to this data in the tiff crate, in a way that is agnostic of how the variants of the TIFF format use it and that is purely opt-in. The main questions are:

  • What extra methods are necessary in the decoder? Note that we skip the Ghost Header already in new.
  • How should the encoder be instructed to write such data?
  • How can we ensure that the internal state and the offsets stored in the encoder and decoder are consistent with the additional data? Are there any hard-coded offsets we need to adjust?
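To make the encoder question concrete, here is one possible shape for the write side, assuming a classic little-endian TIFF. The function `write_header_with_ghost` is hypothetical and not an existing encoder method. It writes the header, the extra bytes, and a padding byte if needed (the TIFF 6.0 specification requires an IFD to begin on a word boundary), and returns the offset at which the first IFD must then be written.

```rust
use std::io::{self, Write};

/// Writes a classic little-endian TIFF header followed by `ghost` bytes.
/// Returns the file offset where the first IFD has to start.
pub fn write_header_with_ghost<W: Write>(w: &mut W, ghost: &[u8]) -> io::Result<u32> {
    const HEADER_LEN: u64 = 8;
    let mut first_ifd = HEADER_LEN + ghost.len() as u64;
    // Pad to an even offset so the IFD starts on a word boundary.
    let pad = first_ifd % 2;
    first_ifd += pad;
    if first_ifd > u64::from(u32::MAX) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "ghost data does not fit into classic TIFF offsets",
        ));
    }
    let first_ifd = first_ifd as u32;

    w.write_all(b"II")?; // little-endian byte order mark
    w.write_all(&42u16.to_le_bytes())?; // classic TIFF magic number
    w.write_all(&first_ifd.to_le_bytes())?; // offset of the first IFD
    w.write_all(ghost)?;
    if pad == 1 {
        w.write_all(&[0])?;
    }
    Ok(first_ifd)
}
```

Because every later offset in the file is absolute, an encoder that tracks its write position from the returned value stays consistent with the extra data; an encoder that assumes the first IFD or strip starts at byte 8 would not.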
@Farkal
Member

Farkal commented Oct 15, 2020

I found a really interesting explanation of the optimization with the ghost header here -> https://lists.osgeo.org/pipermail/gdal-dev/2019-May/050169.html
The ghost header for cloud optimized geotiff is described here -> https://gdal.org/drivers/raster/cog.html#header-ghost-area
We could just add a method to check for the presence of a ghost header, store it in the Decoder, and add a method to get it. A user who is interested could then retrieve the ghost header data.
The current issue is that there is no method to get some tile in the lib today so we only get the full image or strip but there is no way to jump in the tiff. I use this lib to get the IFD data and then make the range requests myself. If I know there is a ghost header, I can optimize my storage and my range requests.
We could add a function like get_tile(index) that looks up the offset in TileOffsets. If there is a ghost header, it reads the 4 bytes before that offset to get the size of the tile; if there is none, it takes the size from TileByteCounts. Either way it then returns the tile.
But I don't think that fits the current design. From my point of view, we should rethink the design of the decoder to make it easier to use with cloud files: it should read all the IFDs and then offer methods to select an IFD and ask for an image, a strip, or a tile.
COG works like this: each IFD is a Z level and the tile index is Y * TilesAcross + X, where TilesAcross is ceil(ImageWidth / TileWidth).
With this behavior we could easily implement a reader that translates the indices requested by the lib into range requests, and the cloud compatibility would be great.
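The index arithmetic and the resulting byte ranges could look roughly like this. It is a sketch that assumes the row-major tile order of the TIFF specification, where the number of tiles per row is ImageWidth divided by TileWidth, rounded up. All function names are hypothetical, and `leader_range` only applies to files whose tiles are actually preceded by a 4-byte size field.

```rust
use std::ops::Range;

/// Index of tile (x, y) in TileOffsets / TileByteCounts, in row-major order.
/// Returns `None` for a zero tile width or an x beyond the last tile column.
pub fn tile_index(x: u32, y: u32, image_width: u32, tile_width: u32) -> Option<u64> {
    if tile_width == 0 {
        return None;
    }
    let tile_width = u64::from(tile_width);
    // Number of tiles per row, rounding up for a partial last tile.
    let tiles_across = (u64::from(image_width) + tile_width - 1) / tile_width;
    if u64::from(x) >= tiles_across {
        return None;
    }
    Some(u64::from(y) * tiles_across + u64::from(x))
}

/// Byte range of a tile, taken from the TileOffsets and TileByteCounts values.
pub fn tile_range(
    tile_offsets: &[u64],
    tile_byte_counts: &[u64],
    index: usize,
) -> Option<Range<u64>> {
    let start = *tile_offsets.get(index)?;
    let len = *tile_byte_counts.get(index)?;
    Some(start..start.checked_add(len)?)
}

/// Byte range of the 4-byte size leader in front of a tile. Only meaningful
/// if the file was written with such leaders (an assumption of this sketch).
pub fn leader_range(tile_offsets: &[u64], index: usize) -> Option<Range<u64>> {
    let start = *tile_offsets.get(index)?;
    Some(start.checked_sub(4)?..start)
}
```

Each returned range maps directly onto an HTTP range request, so a caller can fetch one tile without touching the rest of the file.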

@HeroicKatora
Member Author

HeroicKatora commented Oct 15, 2020

The current issue is that there is no method to get some tile in the lib today so we only get the full image or strip but there is no way to jump in the tiff.

That seems to be the root cause. Being able to store an IFD with its offset and then jump back to it (or to the start of its image) seems to solve a lot of related problems in one single operation. It shouldn't even be expensive in terms of interface, as the decoder already requires Seek. If it's possible to seek back to any IFD, then that also enables a bunch of other interesting APIs, such as:

  • Seeking to and reading from arbitrary positions in the file while reading. Then you don't need to add get_tile in the main crate but can implement it directly.
  • .. which in turn allows reading the Ghost Header without any dedicated interface.
  • Doing partial reads of an IFD, retrieving only the necessary tags and skipping the rest as an optimization.
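A minimal sketch of the "store an offset, jump to it, come back" idea, written against plain `Read + Seek` rather than the crate's decoder. It assumes a classic little-endian TIFF with 12-byte IFD entries; `peek_ifd_tags` and `read_tags` are hypothetical names. Only the tag numbers are read, which also illustrates a partial read of an IFD.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Lists the tag numbers of the IFD at `ifd_offset`, then restores the
/// reader to the position it had before the call.
pub fn peek_ifd_tags<R: Read + Seek>(reader: &mut R, ifd_offset: u64) -> io::Result<Vec<u16>> {
    let saved = reader.stream_position()?;
    let result = read_tags(reader, ifd_offset);
    // Restore the position even if reading the IFD failed.
    reader.seek(SeekFrom::Start(saved))?;
    result
}

fn read_tags<R: Read + Seek>(reader: &mut R, ifd_offset: u64) -> io::Result<Vec<u16>> {
    reader.seek(SeekFrom::Start(ifd_offset))?;

    // A classic IFD starts with a 2-byte entry count.
    let mut count = [0u8; 2];
    reader.read_exact(&mut count)?;
    let count = u16::from_le_bytes(count);

    // Each entry is 12 bytes: tag (2), type (2), count (4), value/offset (4).
    let mut tags = Vec::with_capacity(usize::from(count));
    let mut entry = [0u8; 12];
    for _ in 0..count {
        reader.read_exact(&mut entry)?;
        tags.push(u16::from_le_bytes([entry[0], entry[1]]));
    }
    Ok(tags)
}
```

Since the position is restored afterwards, such a helper can be called in the middle of sequential decoding without disturbing it.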

@HeroicKatora
Member Author

I'm not even sure why init, which reads and interprets the TIFF header, is exposed. Is there any constructor other than new that leaves the decoder without having read the header?

@Farkal
Member

Farkal commented Oct 17, 2020

Yes, I don't think init needs to be public, as it's already called in new 😉
Being able to seek would solve a lot of issues, yes!
