
Conversation

@Komzpa
Contributor

@Komzpa Komzpa commented Jan 3, 2019

nodes.cache is 50GB nowadays:

-rw-r--r--  1 gis gis 49423874128 Jan  3 23:53 nodes.cache

@lonvia lonvia merged commit 45ec5dc into osm2pgsql-dev:master Jan 4, 2019
@lonvia
Collaborator

lonvia commented Jan 4, 2019

Eventually we should avoid absolute numbers in the help texts and switch to wording along the lines of: the flat node file is about the size of the planet PBF, and the cache should be about the size of the imported PBF.

@StyXman
Contributor

StyXman commented Jan 5, 2019

"flat node file is about the size of the planet pbf"

Did you measure with stat or du? I have the impression that the file would be fairly sparse, especially with partial imports.

@mmd-osm
Contributor

mmd-osm commented Jan 5, 2019

I think that depends a bit on the size of the extract and the distribution of node ids: on a file system with a 4K block size, 12 million nodes would be the theoretical lower bound to still allocate 50GB, assuming an equal distribution of node ids. In reality it won't be this bad, though.

One way to keep nodes.cache small for a one-time import could be renumbering ids via https://docs.osmcode.org/osmium/latest/osmium-renumber.html.

@lonvia
Collaborator

lonvia commented Jan 5, 2019

Sparseness does not matter for the flat node file. Unused nodes are still written out to disk (they are -1 not 0 because 0 is a valid coordinate value).

@mmd-osm
Contributor

mmd-osm commented Jan 6, 2019

Getting rid of that special -1 shouldn't be too difficult: computing {node location} bitwise-XOR {undefined location magic number} before writing locations to disk would turn 512 consecutive undefined locations into a file system page of all zeros. This way a zero always represents the undefined value rather than the max int value.

It could be a starting point to enable sparse files. For sure this would require some support on the libosmium side as well. I don't know if this would actually help, it's just an idea and I didn't test anything.

Today, flat nodes are recommended for planet file imports, where it's quite unlikely to find larger numbers of sparse blocks. For smaller extracts, this option might become more interesting with sparse files, in cases where memory is limited and writing a full 50GB flat node file would be prohibitively expensive.

In reality though, some extracts like D-A-CH (size: 3.7GB, 390 million nodes) have nodes in every single 4K block. In this case, this all becomes a bit futile.

@Komzpa Komzpa deleted the patch-2 branch January 6, 2019 10:20
@Komzpa
Contributor Author

Komzpa commented Jan 10, 2019

a) is it a problem if non-mentioned nodes go to (0,0) in flat mode during way reconstruction?
b) if it is, can it be worked around by just shifting every (0,0) by 1e-15 so that it's binarily different?
c) isn't -1 a valid coordinate too?

@joto
Collaborator

joto commented Jan 10, 2019

When lonvia mentioned "-1" as the invalid coordinate, what she meant was the largest positive int32 value. That can never be a valid coordinate; -1 is valid, of course.

The real solution here is not to fiddle around with the invalid value, but to find a different encoding of the flat node file for small datasets. Something like the FlexMem index in libosmium, but one that also works on disk. This would be totally doable; it just needs some careful work defining such a format and making sure it resizes to the other format when the dataset grows too much.

The reason for this is the following: if we change the current format to use the zero byte for the invalid value, we could potentially recover disk space for blocks that are completely empty. But if you have a sizable number of completely empty blocks (and only then would this matter), chances are you also have lots of blocks containing just one, two, or three locations. For those the optimization would not work, so you still use 4K or so for a single location or a few. In this case a different format would be much better: even if it used, say, 10 bytes per location, it would still be two orders of magnitude better.
