OSM Express is a database file format for OpenStreetMap data (.osmx), as well as a command line tool and C++ library for reading and writing .osmx files. Find it on GitHub at github.com/protomaps/OSMExpress

Illustration of the cell covering for a rectangular input region and its overlap with indexed OpenStreetMap geometries.

Motivation

Here are some use cases that OSM Express fits well.

  • You want an offline copy of OpenStreetMap, which can be updated every day, hour or minute from the main openstreetmap.org database, instead of redownloading the entire planet.
  • You want to quickly access all OSM objects in a geographical region, such as a neighborhood, city or small country.
  • You want to quickly look up OSM objects by ID, such as getting the height and name tags for a given way that represents a building, and construct geometries for ways and relations.
  • You want to embed a database that does any of the above, such as in a web application that returns OSM objects as GeoJSON.

Quick Start

Command Line

Binaries are available for macOS (Darwin) and GNU/Linux at GitHub Releases.

For information on how to compile the osmx program from source, see the Programming Guide.

Once you have the osmx command line program, you'll need to start with an .osm.pbf or OSM XML file. The Planet file is available at planet.openstreetmap.org, but it's preferable to begin with something smaller to learn with.

There are numerous sites for downloading .osm.pbf extracts, including Protomaps Minutely Extracts, a service itself powered by OSM Express. For testing purposes let's start with this small PBF I generated of New York County:

new_york_county.pbf (5.86 MB, generated 2019/09/02 6:42PM UTC)

Create an .osmx file by using the expand command on the .osm.pbf file:

osmx expand new_york_county.osm.pbf new_york_county.osmx

This will result in a 91 MB .osmx file.

We can look up an object inside this .osmx file by ID. For a way, the query prints the IDs of its member nodes, followed by all of its tags:

osmx query new_york_county.osmx way 34633854

> 402743563 402743567 402743571 402743573 2709307502 2709307499 2709307464 402743563
addr:city=New York City addr:housenumber=350 addr:postcode=10018 ...

We can also extract regions of the .osmx file into a new .osm.pbf file, which is useful for interoperability with other OSM tools.

osmx extract new_york_county.osmx downtown.osm.pbf --bbox 40.7411\,-73.9937\,40.7486\,-73.9821
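
When scripting extracts, it can help to build the command as an argument list so that no shell escaping of the commas is needed. The following is a minimal sketch, not part of OSM Express: the helper name extract_command is hypothetical, and the corner order (min_lat, min_lon, max_lat, max_lon) is an assumption inferred from the example command above.

```python
from typing import List


def extract_command(osmx_file: str, out_pbf: str,
                    min_lat: float, min_lon: float,
                    max_lat: float, max_lon: float) -> List[str]:
    """Build the argument list for `osmx extract` with a bounding box.

    Assumption: the corner order (min_lat, min_lon, max_lat, max_lon) is
    inferred from the example command in this manual, not from a formal spec.
    """
    if min_lat > max_lat or min_lon > max_lon:
        raise ValueError("bounding box corners are out of order")
    bbox = f"{min_lat},{min_lon},{max_lat},{max_lon}"
    return ["osmx", "extract", osmx_file, out_pbf, "--bbox", bbox]


cmd = extract_command("new_york_county.osmx", "downtown.osm.pbf",
                      40.7411, -73.9937, 40.7486, -73.9821)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the osmx binary is on the PATH.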

Updating

The utils/osmx-update script brings an .osmx file up to date with the most recent state on a replication server, applying the changes with osmx update. For example, to update a planet.osmx file with minutely updates:

python utils/osmx-update planet.osmx https://planet.openstreetmap.org/replication/minute/
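
For orientation, the replication server stores each change file under a path derived from its nine-digit sequence number, split into three groups of three digits. The sketch below only illustrates that directory layout; it is not taken from osmx-update, and the function name replication_urls and the sample sequence number are made up for the example.

```python
from typing import Tuple


def replication_urls(base_url: str, sequence: int) -> Tuple[str, str]:
    """Return (state_url, diff_url) for one replication sequence number.

    Sequence 1234567 is stored as 001/234/567.state.txt and 001/234/567.osc.gz.
    """
    if not 0 <= sequence <= 999_999_999:
        raise ValueError("sequence number must fit in nine digits")
    digits = f"{sequence:09d}"
    path = f"{digits[0:3]}/{digits[3:6]}/{digits[6:9]}"
    base = base_url.rstrip("/")
    return f"{base}/{path}.state.txt", f"{base}/{path}.osc.gz"


# Hypothetical sequence number, for illustration only.
state_url, diff_url = replication_urls(
    "https://planet.openstreetmap.org/replication/minute/", 1234567)
print(state_url)
print(diff_url)
```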

Library

The OSM Express library is intentionally minimal and non-opinionated: no attempt is made to transform OSM tags to a fixed schema, distinguish between polygon and linear ways, or assemble multipolygon relations into polygons. For these typical tasks, it's recommended to use OSM Express as a library in your own program. Documentation and example code are available at the Programming Guide.

Other Languages

An .osmx file can be opened and queried directly in a Python program using the osmx Python package. See Python for details.

Languages other than Python may be supported in the future by either language-specific libraries or a new C API. See Development if you're interested or discuss on GitHub.

Technical Details

Storage Requirements

A full planet.osmx created from planet.osm.pbf (47 GB) is around 580 GB.

OSM Express is optimized for fast lookups, extracts and updates. These goals conflict with making the database as compact as possible. A typical .osmx file can be 10 times the size of the corresponding .osm.pbf, because:

  • Relationships between parent elements and member elements are encoded in both directions, to enable lookups from node to way, way to relation, etc.
  • The storage engine (LMDB) has no built-in compression, unlike some LSM-tree storage engines such as LevelDB.
  • The mmap-based design of LMDB and Cap'n Proto requires that fields are word-aligned on disk, causing storage overhead.
  • Tag keys and values are stored in full as strings. Tag keys could be hardcoded in a lookup table, saving about 10% space, but this would make the database less portable.
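
To get a feel for the alignment overhead mentioned in the list above, here is a rough back-of-the-envelope sketch. It is a simplified model, not a measurement of real .osmx files: it assumes each tag string is stored NUL-terminated and padded to an 8-byte word, and it ignores pointers and other framing. The tag strings are sample values based on the query output shown earlier.

```python
WORD_SIZE = 8  # Cap'n Proto aligns data to 8-byte words


def padded_size(num_bytes: int) -> int:
    """Round a byte count up to the next multiple of the word size."""
    return (num_bytes + WORD_SIZE - 1) // WORD_SIZE * WORD_SIZE


def padding_overhead(strings) -> float:
    """Fraction of stored bytes that is padding, for NUL-terminated strings."""
    raw_sizes = [len(s.encode("utf-8")) + 1 for s in strings]  # +1 for the NUL
    raw = sum(raw_sizes)
    stored = sum(padded_size(n) for n in raw_sizes)
    return (stored - raw) / stored


# Sample tag keys and values (simplified model, pointers not counted).
tags = ["addr:city", "New York City", "addr:housenumber", "350",
        "addr:postcode", "10018"]
print(f"{padding_overhead(tags):.0%} of the stored string bytes are padding")
```

Short strings suffer the most in this model, since a three-character value still occupies a full word.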

As of 2019, fast local storage is cheap; 1 terabyte solid state drives are less than 150 USD. On managed hosting providers like AWS and Google Cloud, extra storage is affordable compared to more memory or CPU cores.

If it's necessary to optimize for storage space, an .osmx file can be stored on a filesystem with transparent compression such as ZFS or Btrfs, at the cost of CPU overhead. This can reduce planet.osmx to around 200 GB.

Privacy

OSM Express stores all metadata (version, timestamp, changeset, username and user ID) for all OSM objects, except for untagged nodes. The osmx extract --noUserData flag omits changeset, username and user ID information from extracts, to comply with GDPR guidelines.

Performance

OSM Express should work with reasonable amounts of memory, less than 8 gigabytes, even for expand and extract on planet.osmx. The strongest predictor of performance is I/O latency. If benchmarking different storage environments, I/O latency can be best measured via IOPS at queue depth 1.
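
As a rough illustration of what "IOPS at queue depth 1" means, the sketch below issues one blocking 4 KiB random read at a time and reports reads per second. It is only a toy, with a hypothetical function name; it does not bypass the operating system's page cache, so a dedicated benchmarking tool is the better choice for real comparisons.

```python
import os
import random
import time


def random_read_iops(path: str, reads: int = 1000, block_size: int = 4096) -> float:
    """Issue one blocking random read at a time and return reads per second.

    One outstanding request at a time corresponds to queue depth 1.
    Caveat: reads served from the OS page cache look far faster than the
    underlying device, so test against a file much larger than RAM.
    """
    size = os.path.getsize(path)
    if size < block_size:
        raise ValueError("file is smaller than one block")
    blocks = size // block_size
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(reads):
            offset = random.randrange(blocks) * block_size
            os.pread(fd, block_size, offset)  # blocks until the read completes
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return reads / elapsed
```

For example, random_read_iops("planet.osmx") could be run on each candidate storage volume and the results compared.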

WIP: benchmarks

Alternatives

  • osmium-tool for creating extracts from osm.pbf files. This is more efficient for large country or continent sized extracts, or any task where the entire dataset needs to be read.
  • Overpass API is a powerful server application for interactive querying and tag-based lookup of OSM data.
  • conveyal/osm-lib is a similar design, written in Java.
  • imposm3, osm2pgsql if you want OSM data in PostgreSQL and/or want to render maps.

Concepts

File Layout

Running the osmx query command with only a database path prints the number of entries in each sub-database, which reveals the layout of an .osmx database:

osmx query planet.osmx
locations: 5313351219
nodes: 144307630
ways: 590470034
relations: 6895065
cell_node: 5313351219
node_way: 5906888644
node_relation: 10242142
way_relation: 63350432
relation_relation: 497137

An .osmx file is an LMDB database with 10 sub-databases. All keys are 64-bit integers in host byte order (little-endian on most modern CPUs).

  • locations: maps OSM node IDs to Locations, which store the coordinates and version number of the node (documented below).
  • nodes, ways, relations map OSM object IDs to a Cap'n Proto message defined in include/osmx/messages.capnp.
    • nodes only contains tagged nodes; the value for each key describes the node's tags and other metadata. Untagged nodes are included only in locations to save space on disk.
    • ways contains all ways; the value for each key describes the way's tags, metadata, and the list of node IDs that are part of the way.
    • relations contains all relations; the value for each key contains the relation's tags, metadata, and the IDs and roles of its members.
  • cell_node maps a level 16 S2 cell ID to a node ID, using LMDB's DUPSORT to store multiple values for each key (since each S2 cell can contain many nodes).
  • node_way, node_relation, way_relation and relation_relation map OSM object IDs to their parent object IDs, also using DUPSORT (since nodes can belong to multiple ways, ways to multiple relations, etc).

Finally, the metadata sub-database holds arbitrary string:string values. This is used to store the replication sequence number and timestamp.

LMDB transactions span all sub-databases. This means that a read operation will retrieve the correct timestamp for the data it fetches, even if the database is written to while the read is happening.
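
The key encoding and the parent lookup tables described above can be modeled in a few lines. This is an in-memory sketch only, not how OSM Express or LMDB is implemented: it assumes keys are unsigned 64-bit integers, imitates DUPSORT with a sorted list per key, and uses made-up way and node IDs.

```python
import struct
from bisect import insort
from collections import defaultdict


def encode_key(object_id: int) -> bytes:
    """Pack an ID as a 64-bit integer in host byte order (unsigned assumed)."""
    return struct.pack("=Q", object_id)


def decode_key(raw: bytes) -> int:
    """Inverse of encode_key."""
    return struct.unpack("=Q", raw)[0]


def build_node_way(ways: dict) -> dict:
    """Build a node -> parent ways index from a way -> member nodes mapping.

    Each key holds a sorted list of distinct values, mimicking how a
    DUPSORT sub-database keeps several sorted values under one key.
    """
    index = defaultdict(list)
    for way_id, node_ids in ways.items():
        for node_id in set(node_ids):  # a closed way repeats its first node
            insort(index[encode_key(node_id)], way_id)
    return index


# Hypothetical data: way 10 is a closed ring, way 11 shares node 3 with it.
ways = {10: [1, 2, 3, 1], 11: [3, 4]}
node_way = build_node_way(ways)
print(node_way[encode_key(3)])
```

Storing this reverse mapping is what makes a lookup from a node to its ways a single key read, at the price of the extra disk space discussed under Storage Requirements.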

Encoding of Locations

Values in the locations sub-database are structs with the following layout:

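#include <cstdint>

struct Location {
    int32_t longitude_i; // longitude in units of 1e-7 degrees
    int32_t latitude_i;  // latitude in units of 1e-7 degrees
    int32_t version;     // version number of the node
};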

Each field is serialized in host byte order.

Longitude and latitude are stored as integers. To obtain the actual longitude and latitude as decimal numbers, divide the integer value by 10000000 (1e7). One integer step is 1e-7 degrees, roughly 1.1 cm of latitude, so this encoding is precise to within a few centimeters anywhere on Earth. The same encoding is used by libosmium and by the openstreetmap.org database internally.
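
As an illustration of the layout above, a Location value can be decoded with Python's struct module. This is a minimal sketch that assumes the 12 raw bytes have already been read from the locations sub-database on the same machine that wrote them (hence host byte order); the function names and sample coordinates are made up for the example.

```python
import struct

LOCATION = struct.Struct("=iii")  # three int32 fields, host byte order
COORDINATE_PRECISION = 10_000_000  # 1e7


def decode_location(raw: bytes):
    """Return (longitude, latitude, version) from a 12-byte Location value."""
    lon_i, lat_i, version = LOCATION.unpack(raw)
    return lon_i / COORDINATE_PRECISION, lat_i / COORDINATE_PRECISION, version


def encode_location(lon: float, lat: float, version: int) -> bytes:
    """Inverse of decode_location, rounding to the nearest 1e-7 degree."""
    return LOCATION.pack(round(lon * COORDINATE_PRECISION),
                         round(lat * COORDINATE_PRECISION), version)


# Sample coordinates in Manhattan, for illustration only.
raw = encode_location(-73.9857, 40.7484, 3)
print(len(raw), decode_location(raw))
```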

Spatial Indexing

OSM Express avoids expensive point-in-polygon computations for spatial operations. Instead, a query region is approximated by S2 cells with a maximum level of 16. Level 16 was chosen as a reasonable tradeoff between covering precision and storage space.
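
To put level 16 in perspective, the following arithmetic sketch estimates how many S2 cells exist at a given level and how large they are on average. It relies only on the S2 hierarchy (6 cube faces, each subdivided into 4 children per level) and an approximate figure for Earth's surface area; real cells vary in size, so the numbers are averages.

```python
import math

EARTH_SURFACE_M2 = 5.1e14  # approximate surface area of Earth in square meters


def s2_cell_count(level: int) -> int:
    """Number of S2 cells at a level: 6 cube faces, each split in 4 per level."""
    if not 0 <= level <= 30:
        raise ValueError("S2 levels range from 0 to 30")
    return 6 * 4 ** level


def average_cell_area_m2(level: int) -> float:
    """Average cell area, assuming cells of equal size (an approximation)."""
    return EARTH_SURFACE_M2 / s2_cell_count(level)


for level in (12, 16, 20):
    area = average_cell_area_m2(level)
    print(f"level {level}: {s2_cell_count(level):,} cells, "
          f"~{area:,.0f} m^2 each (~{math.sqrt(area):.0f} m across)")
```

A finer level yields smaller cells and tighter coverings, at the cost of more cells per query region.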

Author's note: the S2 Covering of a region may differ depending on choice of architecture and compiler, while still being valid. Let me know if you know how to make this consistent.

Further Development

If you'd like to sponsor development of OSM Express features, or integrate it into your product, get in contact at brandon@protomaps.com.

Presentations

State of the Map US 2019, Minneapolis - Video