Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store slim table tags in hstore #692

Closed
pnorman opened this issue Feb 28, 2017 · 5 comments · Fixed by #1969
Closed

Store slim table tags in hstore #692

pnorman opened this issue Feb 28, 2017 · 5 comments · Fixed by #1969

Comments

@pnorman
Copy link
Collaborator

pnorman commented Feb 28, 2017

I did some testing, and the representation of tags in a hstore is more efficient than our current text[] array of alternating keys and values. For testing latter can be turned into the former with hstore(tags).

Doing this on the slim table of a planet and summing the pg_column_size gives a size of 38GB for text[] representation and 30GB for hstore. The overall size of the current slim table is 94GB for the table and 190GB for the indexes so this is about a 5% space savings.

I considered jsonb, but it is the same size as hstore.

This change would require the hstore extension for osm2pgsql databases. I do not consider this an issue because hstore is a standard part of contrib, and many databases require it already.

In addition to being more efficient, if someone did want to use our internal slim tables to query by tag, it is much easier to do so with hstore.

@Komzpa
Copy link
Contributor

Komzpa commented Mar 6, 2017

Internally jsonb and hstore are similar, except that jsonb is supported in core Postgres. Reasons to use hstore nowadays are about it being a legacy. If you're writing new code please use jsonb.

@Komzpa
Copy link
Contributor

Komzpa commented Mar 6, 2017

If you want to go extensions, there's between-row deduplicating jsonb drop-in replacement from the same people who made jsonb and hstore: https://github.com/postgrespro/zson

@pnorman
Copy link
Collaborator Author

pnorman commented Mar 6, 2017

The version requirements for jsonb would require two different implementations in osm2pgsql, one for 9.4+ and another for previous versions. This rules it out.

I do not see requiring the extension as an issue, given we require the postgis extension and many uses will require the hstore extension anyways.

This would be true regardless of if I felt if jsonb was a better option for new code. Absent good evidence on performance, I don't.

@pnorman
Copy link
Collaborator Author

pnorman commented Mar 6, 2017

As for zson, requiring an extension that's not in contrib and compiled by default (or part of postgis) is a barrier.

@ppKrauss
Copy link
Contributor

Hi, I am using direct jsonb_object(text[]) to convert tags of slim tables, was fine. For geometry tables also easy, alter table USING hstore_to_jsonb_loose(tags) as commented here (instead ::jsonb default cast)... And it is not only "to being more efficient", is also to being more standard.

About other (above) comments, nowdays — 2018 and pg v9.3 end of life, the v9.4+ is the reference version — is time to use jsonb datatype... Even when equivalent (as @Komzpa say "Internally jsonb and hstore are similar"), jsonb is better. It is important an osm2pgsql option to user decision.

The jsonb datatype (a JSON-standard) is a de facto standard for PostgreSQL developer community:

if you're choosing a dynamic structure you should choose jsonb over hstore,
dba.stackexchange C. Ringer (2015)

jsonb-vs-hstore performance: hstore have equal performance only "especially if it's indexed with GIN", M. Marchel (2016)... But default osm2pgsql is without indexation, so hstore have low performance.

In most cases JSONB is likely what you want when looking for a NoSQL, schema-less, datatype (...)
(...) JSONB - In most cases;
hstore - Can work fine for text based key-value looks, but in general JSONB can still work great here, citusdata.com C. Kerstiens (2016).

joto added a commit to joto/osm2pgsql that referenced this issue May 27, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=NUM - (1 for old format, 2 for new format)
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

Note that this commit currently doesn't have a mechanism for detecting
which format a database has when doing updates. This will be added
later. For the time being you have to use the same command line options
for updates that were used on import.

This commit doesn't properly do testing. If you want to run all the
tests with the new database format, set `middle_database_format = 2` in
options.hpp and recompile.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue May 27, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=NUM - (1 for old format, 2 for new format)
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

Note that this commit currently doesn't have a mechanism for detecting
which format a database has when doing updates. This will be added
later. For the time being you have to use the same command line options
for updates that were used on import.

This commit doesn't properly do testing. If you want to run all the
tests with the new database format, set `middle_database_format = 2` in
options.hpp and recompile.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jun 22, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit doesn't properly do testing. If you want to run all the
tests with the new database format, you have to set
`options.middle_database_format = 2` in osm2pgsql.cpp and recompile.
We'll need a better solution for that.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jun 23, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jun 23, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jun 24, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jun 25, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jun 25, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jun 30, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jun 30, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
joto added a commit to joto/osm2pgsql that referenced this issue Jul 3, 2023
The database format we have been using for the middle has some problems:

* Node tags are not stored in the planet_osm_nodes table, because they
  are not needed for updates. But having access to those tags is useful
  in some cases, for instance when relations are processed after import.
* Attributes (version, timestamp, changeset, uid, user) of ways and
  relations are stored as special pseudo-tags ("osm_*") (if
  --extra-attributes) is used. This has the potential of name clashes
  and format problems and makes the attributes difficult to access from
  the database. Attributes for nodes are never stored.
* The way we store tags as array of text fields with keys and values
  intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use
  from the database.
* The way relation members are stored is rather arcane.
* When using --extra-attributes/-x the middle tables become huge (due
  to storage in pseudo-tags).

This commit fixes all those problems introducing a new database
structure:

* Tags are stored in JSONB columns.
* The nodes table gets a new "tags" column.
* Attributes are optionally stored in normal typed database columns. The
  columns are only added when --extra-attributes is specified and the
  columns can be NULL if not used which makes the overhead tiny.
* Relation members are now stored as JSONB as an array of objects, for
  example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using
  JSONB allows us to build the indexes needed to find all relations with
  certain members.
* The format for way nodes has been kept as an array of bigints.

The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with
"osm_planet" as default prefix) has been kept, but we might want to
change this and get rid of the prefix, schemas are a better mechanism
and they have been available for a while.

There is a new table PREFIX_users which contains a user id->name lookup
table. The user name isn't stored in the other tables, just the id. This
saves disk space and has the added benefit of updating the user name
correctly if a user name changes.

There are two new command line options:
* --middle-database-format=FORMAT - 'legacy' (default) or 'new'
* --middle-with-nodes - set this to store tagged nodes in the database
  even if a flat-node file is used. Untagged nodes are only stored in
  the database if there is no flat-node file.

For the first time this new format allows you to have a database created
by osm2pgsql that contains *all* the information in an OSM file, all
nodes, ways, and relations with all their tags and attributes.

A new property "db_format" is written to the osm2pgsql_properties table
with the value "0" (non-slim import), "1" (slim import with legacy
format) or "2" (slim import with new format). This is read in append
mode and handled appropriately.

This commit adds a new dependency on a [JSON
library](https://github.com/nlohmann/json). Parsing JSON isn't something
we want to do ourselves. This library has been around for a while, is
available everywhere and is well supported with regular releases unless
the RapidJSON library we were using before.

Closes osm2pgsql-dev#692
Closes osm2pgsql-dev#1170
See osm2pgsql-dev#1502
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants