flatpoints is an in-development data format for representing coordinates, particularly points of interest, as a compressed list of integers (i.e. an inverted list).
Warning: this library is currently in active development, so its API and the underlying data format are subject to change at any moment.
Section | Description | Bytes |
---|---|---|
Header | Magic Number | 3 Bytes |
Header | Start of Data (S) | 8 Bytes |
Header | Coordinate Count (N) | 8 Bytes |
Header | Properties Count | 8 Bytes |
Header | Property Types | N Bytes |
Header | Property Names | ((8 * (N + 3) + 3) - S) Bytes |
Header | Property Offsets | 8 * N Bytes |
Data | Coordinate Data | |
Data | Property 1 Data | |
Data | Property 2 Data | |
Data | ... | |
Data | Property N Data |
- Coordinate Encoding Codec: Sorted Hilbert Delta Indexing
- Integer Compression Codec: LEB128
- Properties Compression: Gzip
Note: These codecs are defaults for now, the plan is to allow for other encodings in later implementations.
The general idea for the flatpoints data format is to store point geometry in a small footprint, with cloud-native capabilities. The offset approach allows users to take advantage of HTTP Range Requests.
Coordinates are first projected to 1D by gridding the coordinates using the world extent, then filling the space with a Hilbert Curve. Point positions are then associated along the curve to get an unsigned 64-bit integer representation. These integers (and the associated properties) are then sorted in ascending order, and delta encoded. The resulting differences are then compressed using a codec of choice. In this case, they are compressed into LEB128 form. This encoding scheme is slightly lossy, but can be mitigated to be practically lossless by modifying the extent of the coordinate grid, and the number of iterations (exponent) for the Hilbert Curve. The defaults used (world extent with N = 31
) are the maximums, and result in a (at most) ±1m margin of error in accuracy (see: Decimal Degrees). This can be reduced to <±1cm margin of error in accuracy by reducing the extent to the coordinate's extent, but would require storing the extent to disk/in-memory.
See spress for the basis of this idea.
Reading the flatpoints format starts with a pre-flight request of reading the first 11 bytes, to ensure the file is a flatpoints file by the magic number, and to get the header size. The start of data (SOD) explicitly states the first (inclusive) position where the coordinate data starts, and implicitly gives the header size (SOD - 1)
.
A header request can then be made to get the entirety of the header.
Coordinate data and the feature properties can be read independently of one another. So, if a user only requires the coordinates and property 1, then they can read only those byte ranges via the offsets & SOD.
- Giulio Ermanno Pibiri, Rossano Venturini (2019). "Techniques for Inverted Index Compression." arXiv:1908.10598v2.
- Vicente Romera Calero (2022). "VTEnc". GitHub Repository.
- Vukovic, Tibor (2016). "Hilbert-Geohash - Hashing Geographical Point Data Using the Hilbert Space-Filling Curve."
The reference implementation of flatpoints is licensed under APLv2. The format specification for the flatpoints data format is public domain (or under a CC0 license as applicable).