Proof of Concept #3
- Does it make sense to start splitting the code into logical modules in separate source files?
- Implementing convenient command-line args may save some time (see the sketch after this list).
- Some common data-preparation tasks or common test scenarios could also be implemented in Rust behind a command-line flag, if it saves time in the longer term.
- Is there a Rust library to read and extract tags from planet.o5m files directly? Just curious; using osmfilter or osmium-tool is also fine.
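On the command-line args point, a minimal sketch of what that might look like with the `clap` derive API; the flag names and fields here are illustrative assumptions, not the PR's actual interface:

```rust
use std::path::PathBuf;

use clap::Parser;

/// Hypothetical CLI shape for the extractor (names are placeholders).
#[derive(Parser)]
struct Args {
    /// Directory to write simplified article html into.
    #[arg(long)]
    output_dir: PathBuf,
    /// Languages to keep, e.g. `--languages en,de`.
    #[arg(long, value_delimiter = ',')]
    languages: Vec<String>,
}

fn main() {
    let args = Args::parse();
    println!(
        "writing {} languages to {}",
        args.languages.len(),
        args.output_dir.display()
    );
}
```

This assumes `clap = { version = "4", features = ["derive"] }` in `Cargo.toml`.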
I think so; at the very least it needs a
I agree, but I wanted to get the logic working before worrying about that.
Good point, I'll keep that in mind if I run into any other work that needs to be done.
There are a number of crates related to the
Can pbf be used to get fresh OpenStreetMap.org updates without converting it to o5m?
The planet downloads are available in pbf: https://planet.openstreetmap.org/pbf/

I think this is saying you can apply the daily/hourly diffs to pbf, but I'm not sure I'm reading it correctly: https://wiki.openstreetmap.org/wiki/Planet.osm/diffs#Using_the_replication_diffs

This makes it sound like diffs can be appended at the end of the file, which suggests you'd need to read the entire file to get the true state of any node/object? I don't understand enough about these formats or how the generator works to know if they'd be compatible.
That's why we use o5m. It is suitable for incremental updates, and the osmupdate tool creates a new planet.o5m file automatically by applying these updates.
It looks like osmupdate supports pbf for the same thing, but I don't understand how that works.
Tested, it works for pbf too. Cool! The whole planet can be stripped down to 58 GB:
Alright, here's what I've learned from comparing the html outputs:

The simplifying step of the Python and Rust versions gives comparable output (proper minification is still needed), but the html that the Python version gets from the API is much simpler than I expected. It uses the Extracts API, which strips most of the markup. The html in the dumps, on the other hand, is much closer to the content of a complete article. Size-wise, the dump html is around 10x the size of the extracts for the subset I looked at. To get to parity with that output, we'll need additional steps in the html processing to remove:
This is doable; it will involve some more exploration of what exactly to match on. I'll make issues for that and for the minification.
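For the whitespace-only elements, one possible shape of that matching, sketched with the `scraper` crate; the selector list is an assumption, and this is not necessarily how the PR does it:

```rust
use scraper::{Html, Selector};

/// Sketch: detach block-level elements whose text is pure whitespace.
/// The selector list here is an assumed starting point.
fn remove_empty_elements(html: &mut Html) {
    let blocks = Selector::parse("p, ul, ol, dl, section").expect("static selector");
    // Collect node ids first: the tree can't be mutated while iterating it.
    let empty: Vec<_> = html
        .select(&blocks)
        .filter(|el| el.text().all(|t| t.trim().is_empty()))
        .map(|el| el.id())
        .collect();
    for id in empty {
        if let Some(mut node) = html.tree.get_mut(id) {
            node.detach(); // removes the whole subtree from the document
        }
    }
}
```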
Minification is not a blocker and can be done later, if necessary. It would be great to compare final mwm sizes; AFAIR, some compression is used there. Showing images for those who want them, and maybe leaving links too, may be a good idea.
👍
Good point. I've seen the compression referenced in
I know that's a goal, but right now the links/images are all relative URLs, so they'll still need some processing.
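For what it's worth, the `url` crate can already resolve a relative href once a base URL is picked; a small sketch (the base and link values are assumed examples):

```rust
use url::Url;

fn main() {
    // Assumed base url; dump articles typically link with relative hrefs.
    let base = Url::parse("https://en.wikipedia.org/wiki/Example").unwrap();
    // Resolve a relative link against the article's own url.
    let absolute = base.join("/wiki/Other_Article").unwrap();
    assert_eq!(absolute.as_str(), "https://en.wikipedia.org/wiki/Other_Article");
}
```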
Next steps are:
After that, I think this will be ready to try out in production!
LGTM with a few comments
Thanks, I'll address those comments, rebase into individual changes, and merge.
The html processing should perform both of the main steps handled by the original `descriptions_downloader.py` script:

- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text

Determining how similar the output is will require more testing. A separate binary target is included for standalone html processing.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Initial results:

```
running 4 tests
test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
test hash_wikipedia  ... bench:          34 ns/iter (+/- 1)
test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
test parse_wikipedia ... bench:      60,682 ns/iter (+/- 83,376)
```

Based on these results and a flamegraph of loading the file, url parsing is the most expensive part.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
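For context, output in this shape comes from the nightly `#[bench]` harness (`cargo +nightly bench`); a minimal sketch, with a placeholder body standing in for the real parsing code:

```rust
#![feature(test)] // nightly-only
extern crate test;

use test::{black_box, Bencher};

// Illustrative only: the real benchmark parses actual tag values.
#[bench]
fn parse_wikidata(b: &mut Bencher) {
    b.iter(|| {
        // black_box keeps the optimizer from constant-folding the input away.
        black_box("Q42").trim_start_matches('Q').parse::<u64>()
    });
}
```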
The largest time by far was wasted because I used `ok_or` instead of `ok_or_else` for converting to the anyhow errors. I thought that with a static string and no arguments they weren't that expensive, but they still took a lot of time, even with backtraces disabled. Removing them reduces the benchmark time 60x:

```
running 4 tests
test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
test hash_wikipedia  ... bench:          36 ns/iter (+/- 11)
test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
test parse_wikipedia ... bench:         835 ns/iter (+/- 68)
```

I also tried removing `url::parse` and using string operations. That got it down another 10x, but I don't think that's worth the potential bugs right now.

A small optimization for reading the json: `std::io::stdin()` is behind a lock by default and already has an internal buffer.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
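The difference is that `ok_or` constructs its error argument eagerly on every call, even on the happy path, while `ok_or_else` takes a closure that only runs when the `Option` is `None`. An illustrative sketch of the pattern (not the PR's exact code):

```rust
use anyhow::anyhow;
use url::Url;

/// Extract the article title from a wikipedia-style url.
fn title_from_url(url: &Url) -> anyhow::Result<&str> {
    url.path_segments()
        // `.ok_or(anyhow!("..."))?` here would build the anyhow::Error
        // (an allocation) on every call; `ok_or_else` defers that work
        // to the error path only.
        .ok_or_else(|| anyhow!("url cannot be a base"))?
        .last()
        .ok_or_else(|| anyhow!("url has an empty path"))
}
```

The stdin point is the same idea at a different layer: `std::io::stdin().lock()` takes the lock once instead of re-acquiring it on every read.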
Per-language section removal is configured with a static json file. This includes a test to make sure the file exists and is formatted correctly. Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
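A sketch of what such a test could look like, assuming the config maps a language code to the section titles to strip; the file name and structure here are guesses, not the PR's actual layout:

```rust
use std::collections::HashMap;

#[test]
fn section_config_is_valid() {
    // Hypothetical path; include_str! fails the build if the file is missing.
    let raw = include_str!("../config/sections_to_remove.json");
    // Deserializing fails the test if the file is malformed.
    let parsed: HashMap<String, Vec<String>> =
        serde_json::from_str(raw).expect("config must be valid json");
    assert!(!parsed.is_empty(), "config should list at least one language");
}
```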
This is an initial implementation of searching for Wikipedia urls/Wikidata QIDs and simplifying html. It's a little messy, inefficient, and doesn't handle multiple languages or writing to an output directory.

Remaining work for this PR:

- `WikipediaTitleNorm` design (parsing 3 million QIDs is almost instant; 1.5 million titles takes around a minute)