Investigate cost of adding metadata #12

Closed
bdon opened this issue Oct 26, 2019 · 24 comments

Comments

@bdon
Member

bdon commented Oct 26, 2019

Possibly add one or all of:

  • Changeset ID
  • User ID
  • Object version number
  • Timestamp

We would ignore metadata for nodes that have no tags.

@invisiblefunnel
Sponsor

Hi @bdon. I'm a big fan of OSMExpress and the Protomaps extract service. At my company we have some internal tooling that relies on the object version number for caching. We don't need the changeset/user/timestamp. Would you consider adding version numbers to Protomaps extracts? Thanks.

@bdon
Member Author

bdon commented Feb 15, 2020

Are you working with a .osmx locally or just a .pbf extract? If an .osmx, is it a region or the whole planet? I'm wary of implementing this because it will probably double the total db size.

Ideally, metadata is optional and you don't pay the storage cost for it if you don't use it. But I think this depends on migrating from capnproto to flatbuffers (#1) because of how empty fields are stored.

@invisiblefunnel
Sponsor

We are just working with .pbf extracts for now.

@bdon
Member Author

bdon commented Jun 4, 2020

  • added version, timestamp, changeset, uid, username to database
  • currently working on a new planet import to confirm the expansion in size is reasonable
    • untagged nodes are ignored
    • still using capnproto

@bdon
Member Author

bdon commented Jun 4, 2020

on an AWS i3.xlarge instance, osmx expand planet.osm.pbf planet.osmx took exactly 7 hours and resulted in a 643G planet.osmx file. The expansion in size when adding all metadata (ignoring untagged nodes) should be less than 10% total, so I'd prefer to always include metadata.

@bdon
Member Author

bdon commented Jun 5, 2020

download server at http://protomaps.com/extracts now includes version and timestamp information

@invisiblefunnel let me know if this is working for you; I'm working on the ecosystem around these tools so I'm interested in what people are building!

@bdon closed this as completed Jun 5, 2020

@invisiblefunnel
Sponsor

Thanks @bdon! This is great news. I'll take a look this week and reply back.

@blackboxlogic

I just grabbed an extract from protomaps, loaded it into josm, fixed a road's name, and uploaded the change. This demonstrates that the extract had the required meta-data (version). I also manually verified that elements had edited at and edited by attributes.

However... I cannot use this as a source to change the shape of a road, since most of the way's nodes are tag-less, and you don't provide them with a version.

Please reconsider including meta-data (or at least version) on tag-less points. That would allow the extract to be used for any type of edit.

@invisiblefunnel
Sponsor

invisiblefunnel commented Jun 9, 2020

> Please reconsider including meta-data (or at least version) on tag-less points.

FWIW this is also a blocker for my use cases, which rely on the ID and version to uniquely identify objects in time and detect changes.
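
The ID-plus-version idea can be illustrated with a minimal sketch. It assumes each extract has already been reduced to a mapping from (element type, ID) to version; the function name and the snapshot layout are hypothetical and not part of OSMExpress or of any tooling mentioned in this thread.

```python
from typing import Dict, Set, Tuple

Key = Tuple[str, int]  # (element type, OSM ID), e.g. ("node", 42)


def diff_snapshots(old: Dict[Key, int], new: Dict[Key, int]) -> Dict[str, Set[Key]]:
    """Classify elements by comparing version numbers between two snapshots."""
    old_keys, new_keys = set(old), set(new)
    return {
        "created": new_keys - old_keys,
        "deleted": old_keys - new_keys,
        # Present in both snapshots, but the version number differs.
        "modified": {k for k in old_keys & new_keys if old[k] != new[k]},
    }


# Hypothetical data: node 2 was edited, way 10 was removed, way 11 was added.
old = {("node", 1): 1, ("node", 2): 3, ("way", 10): 2}
new = {("node", 1): 1, ("node", 2): 4, ("way", 11): 1}
changes = diff_snapshots(old, new)
print(changes)
```

If untagged nodes carry no version, a moved way node cannot be told apart from an unchanged one in a comparison like this.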

@bdon
Member Author

bdon commented Jun 9, 2020

just to confirm - to make this work for your use cases only version is needed and no other metadata?

@invisiblefunnel
Sponsor

Yes, just the version is needed. We don't use timestamps at all.

@bdon reopened this Jun 9, 2020

@blackboxlogic

Confirmed, version would make exports usable for editing projects. I can't think of a reason I'd want other meta-data on tag-less nodes, and I'm sure any reason I eventually think of won't justify the cost.

@bdon
Member Author

bdon commented Jun 10, 2020

Changed location values from a 64-bit integer to a 96-bit struct that includes the version.

AWS i3.xlarge: osmx expand planet.osm.pbf planet.osmx took 7.38 hours and resulted in a 666G planet.osmx file, so another 3-5% bump in expand time and planet size. I need to verify now that this is correct and benchmark some extracts, because the page fault rate when accessing locations should be higher now.
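
For reference, the relative growth can be worked out from the figures quoted in this thread (643G and 7 hours for the previous import, 666G and 7.38 hours for this one). The helper name is illustrative only.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


size_growth = percent_increase(643, 666)    # planet.osmx size in GB
time_growth = percent_increase(7.0, 7.38)   # expand time in hours
print(f"size: +{size_growth:.1f}%, time: +{time_growth:.1f}%")
# size: +3.6%, time: +5.4%
```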

@CloudNiner
Contributor

> so another 3-5% bump in expand time and planet size

That seems pretty reasonable. For the augmented diff use case #17, version information is useful for the same reason @invisiblefunnel mentioned above: it allows for unique identification of a particular node in order to match it to its metadata.

> the page fault rate when accessing locations should be higher now

Can you describe this a bit more?

@bdon
Member Author

bdon commented Jun 10, 2020

> the page fault rate when accessing locations should be higher now

Locations were previously stored as 64-bit integers. The records for the "Locations" table in the osmx file occupy contiguous pages of storage on disk, ordered by node ID. Adding a 32-bit version number increases the record size by 50%, so fewer records fit on a single disk page.
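
A rough sketch of that arithmetic, assuming a 4096-byte page and ignoring page headers and key storage (both are simplifications; the actual on-disk layout is not described in this thread). The struct formats only stand in for "64 bits" and "64 bits plus a 32-bit version".

```python
import struct

PAGE_SIZE = 4096  # assumed page size in bytes; ignores page headers and keys

old_record = struct.calcsize("<q")   # 64-bit location only
new_record = struct.calcsize("<qI")  # location plus a 32-bit version

old_per_page = PAGE_SIZE // old_record
new_per_page = PAGE_SIZE // new_record
print(old_record, new_record)        # 8 12
print(old_per_page, new_per_page)    # 512 341
```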

When osmx extract is run, a way's member nd references are resolved into lat/lng by seeking over the locations table in order of increasing way ID. This has very poor locality: extracting Boston might include ways 12345 and 12346, but those ways might reference nodes anywhere from 1 to 1000000. The node ID is essentially random (unless it's a set of ways and nodes that were all created around the same time and not edited heavily).

The osmx design (by using lmdb) implements no application-level caching. It relies on the kernel to cache pages as they are retrieved from disk; the kernel automatically manages a pool in RAM of cached disk pages. Since the locations table is now less dense, it's more likely when fetching Locations that you will need a page that has not been fetched yet or has been evicted from cache.

This is just my performance hypothesis; I need to run some benchmarks to determine whether it makes any significant difference.
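
A toy model of the hypothesis, not a measurement. It assumes fixed-size records packed densely in node-ID order and 4096-byte pages, which gives 512 records per page at 8 bytes and 341 at 12 bytes. The real lmdb B-tree layout is more complicated, and the function name and ID ranges are made up for illustration.

```python
from typing import Iterable


def pages_touched(node_ids: Iterable[int], records_per_page: int) -> int:
    """Count distinct pages holding the given node IDs, assuming fixed-size
    records stored densely in node-ID order (a simplification)."""
    return len({node_id // records_per_page for node_id in node_ids})


clustered = range(10_000)                 # nodes created together
scattered = range(0, 1_000_000, 1_000)    # nodes spread across the ID space

for name, ids in (("clustered", clustered), ("scattered", scattered)):
    print(name, pages_touched(ids, 512), pages_touched(ids, 341))
# clustered 20 30
# scattered 1000 1000
```

In this model the larger record only costs extra pages when the referenced node IDs are close together. When they are widely scattered, each lookup already lands on its own page either way.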

@bdon
Member Author

bdon commented Jun 11, 2020

Here's my test region:

osmx extract planet.osmx benchmark.osm.pbf --bbox 38.462,-77.519,41.0130,-73.333

first run on versionless planet: 943 seconds
second run: 919 seconds

version planet: 873 seconds
version planet 2nd time: 773 seconds

echo 3 > /proc/sys/vm/drop_caches can be used to clear the page cache, but the extract is probably big enough that it doesn't make a difference. This isn't a very controlled experiment because the versionless planet has been receiving updates for a few weeks and might be more fragmented. In any case, it doesn't look like adding versions to locations negatively affects the speed by much.
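
For anyone who wants to repeat this kind of comparison, here is a minimal timing harness. It is a sketch that only measures wall-clock time of repeated runs of a command; it does not drop the page cache itself, and the script name used below is hypothetical.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def time_runs(argv: Sequence[str], runs: int = 2) -> List[float]:
    """Run a command `runs` times and return wall-clock seconds for each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True)  # raises if the command fails
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # With no arguments, time a trivial placeholder command.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    for i, seconds in enumerate(time_runs(command), start=1):
        print(f"run {i}: {seconds:.1f} s")
```

Saved as bench.py, it could be invoked as python bench.py osmx extract planet.osmx benchmark.osm.pbf --bbox 38.462,-77.519,41.0130,-73.333. To compare cold and warm runs, the drop_caches command above would be run as root before the first invocation.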

@bdon
Member Author

bdon commented Jun 11, 2020

@blackboxlogic @invisiblefunnel new planet with versions is now online - can you try on https://protomaps.com/extracts ?

@invisiblefunnel
Sponsor

Works perfectly for me. Many thanks @bdon.

@blackboxlogic

Every element has a version number, so the extracts are usable for editing.
Tag-less nodes don't have edited at, which is expected. However, I'm noticing that edited by and changeset are both 0 for all objects. Is that intentional?

@bdon
Member Author

bdon commented Jun 14, 2020

Yes, the data is stored but I am intentionally excluding it. That seems to be the convention for GDPR compliance. Is that needed for any of your applications?

@blackboxlogic

I definitely don't need it, but it could plausibly be useful* and if you're storing it already then there isn't much to gain by withholding it. Other services handle GDPR by offering the PII only to OSM users who have signed in with OAuth, since they have agreed to terms of service. That would, of course, complicate your service by involving OAuth.

*Possible use-case: A vandal changes all buildings into parks, I want to remove all leisure=park where [vandal's name] was the last editor. I've had to do this sort of thing a few times.
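
A minimal sketch of that kind of filter, assuming elements have already been parsed into simple records that carry the last editor's username. The Element class, its field names, and the sample data are hypothetical and do not come from any particular OSM library.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Element:
    """Hypothetical parsed OSM element with its last editor's username."""
    id: int
    user: str
    tags: Dict[str, str] = field(default_factory=dict)


def parks_last_edited_by(elements: Iterable[Element], username: str) -> List[Element]:
    """Return elements tagged leisure=park whose last editor is `username`."""
    return [
        e for e in elements
        if e.user == username and e.tags.get("leisure") == "park"
    ]


elements = [
    Element(1, "vandal", {"leisure": "park"}),
    Element(2, "mapper", {"leisure": "park"}),
    Element(3, "vandal", {"building": "yes"}),
]
suspects = parks_last_edited_by(elements, "vandal")
print([e.id for e in suspects])  # [1]
```

This only works if the extract includes the username, which is why the field being zeroed out matters for this use case.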

@bdon
Member Author

bdon commented Jun 14, 2020

I have an auth system built which is separate from OSM OAuth. I could make PII available only to logged-in users.

Can you describe your editing workflow in more detail? I’d like to include it in my SOTM talk and I can mention your username if that’s ok.

@blackboxlogic

blackboxlogic commented Jun 14, 2020

Re: "Describe your workflow"
Short version: "I've built up a collection of scripts which can be chained together" but I think that's just called "programming"?
Here's a recent example of my work, but I plan to do more and there are two parts of my pipeline I'm rewriting (pulling data from OSM, and schema translation). One of the more cumbersome parts was retrieving up-to-date large regions of data from OSM. It was awkward for multiple reasons and my future projects will benefit from your work.
If you want a longer description shoot me an email [blackboxlogic at gmail dot com] with your phone number and best time to call, I'd love to chat.

Yes, "ok" to mention my username.

@bdon
Member Author

bdon commented Jun 15, 2020

Great, we can discuss over email.
