Skip to content

program--/flatpoints

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

flatpoints

flatpoints is an in-development data format for representing coordinates, particularly points of interest, as a compressed list of integers (i.e. an inverted list).

Warning: this library is currently in active development, so its API and the underlying data format are subject to change at any moment.

Format

Section Description Bytes
Header Magic Number 3 Bytes
Header Start of Data (S) 8 Bytes
Header Coordinate Count (N) 8 Bytes
Header Properties Count 8 Bytes
Header Property Types N Bytes
Header Property Names ((8 * (N + 3) + 3) - S) Bytes
Header Property Offsets 8 * N Bytes
Data Coordinate Data
Data Property 1 Data
Data Property 2 Data
Data ...
Data Property N Data

Braindumping

  • Coordinate Encoding Codec: Sorted Hilbert Delta Indexing
  • Integer Compression Codec: LEB128
  • Properties Compression: Gzip

Note: These codecs are defaults for now, the plan is to allow for other encodings in later implementations.

The general idea for the flatpoints data format is to store point geometry in a small footprint, with cloud-native capabilities. The offset approach allows users to take advantage of HTTP Range Requests.

Coordinates are first projected to 1D by gridding the coordinates using the world extent, then filling the space with a Hilbert Curve. Point positions are then associated along the curve to get an unsigned 64-bit integer representation. These integers (and the associated properties) are then sorted in ascending order, and delta encoded. The resulting differences are then compressed using a codec of choice. In this case, they are compressed into LEB128 form. This encoding scheme is slightly lossy, but can be mitigated to be practically lossless by modifying the extent of the coordinate grid, and the number of iterations (exponent) for the Hilbert Curve. The defaults used (world extent with N = 31) are the maximums, and result in a (at most) ±1m margin of error in accuracy (see: Decimal Degrees). This can be reduced to <±1cm margin of error in accuracy by reducing the extent to the coordinate's extent, but would require storing the extent to disk/in-memory.

See spress for the basis of this idea.

Reading the flatpoints format starts with a pre-flight request of reading the first 11 bytes, to ensure the file is a flatpoints file by the magic number, and to get the header size. The start of data (SOD) explicitly states the first (inclusive) position where the coordinate data starts, and implicitly gives the header size (SOD - 1).

A header request can then be made to get the entirety of the header.

Coordinate data and the feature properties can be read independently of one another. So, if a user only requires the coordinates and property 1, then they can read only those byte ranges via the offsets & SOD.

References

  • Giulio Ermanno Pibiri, Rossano Venturini (2019). "Techniques for Inverted Index Compression." arXiv:1908.10598v2.
  • Vicente Romera Calero (2022). "VTEnc". GitHub Repository.
  • Vukovic, Tibor (2016). "Hilbert-Geohash - Hashing Geographical Point Data Using the Hilbert Space-Filling Curve."

License

The reference implementation of flatpoints is licensed under APLv2. The format specification for the flatpoints data format is public domain (or under a CC0 license as applicable).

About

Flattened Points of Interest Data Format

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published