
Proof of Concept #3

Merged: 6 commits merged into main from proof-of-concept on Jun 23, 2023
Conversation

@newsch (Collaborator) commented Jun 1, 2023

This is an initial implementation of searching for Wikipedia urls/Wikidata QIDs and simplifying html.
It's a little messy, inefficient, and doesn't handle multiple languages or writing to an output directory.

Remaining work for this PR:

  • Refactor into library/modules
  • Set up comparison of html processing with python implementation
    • processing of same html files
    • scraped vs enterprise dump
  • Add unit tests
  • Revisit WikipediaTitleNorm design (parsing 3 million QIDs is almost instant, 1.5 million titles around a minute)
  • Add logging/tracing
  • Add proper CLI
  • Look through TODOs
  • Add multiple language support
  • Write to directory
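
For context, a rough sketch of the kind of url/QID parsing described above. The types and helpers are hypothetical stand-ins for the actual `WikipediaTitleNorm` design (and the `url` crate is assumed); this is not code from the branch:

    use url::Url;

    /// A Wikidata item id such as "Q42" (hypothetical simplified type).
    #[derive(Debug, PartialEq, Eq, Hash)]
    struct Qid(u64);

    fn parse_qid(value: &str) -> Option<Qid> {
        // "Q42" -> 42
        value.strip_prefix('Q')?.parse().ok().map(Qid)
    }

    /// A normalized (language, title) pair (hypothetical simplified type).
    #[derive(Debug, PartialEq, Eq, Hash)]
    struct WikipediaTitleNorm {
        lang: String,
        title: String,
    }

    fn parse_wikipedia_url(value: &str) -> Option<WikipediaTitleNorm> {
        let url = Url::parse(value).ok()?;
        // "en.wikipedia.org" -> "en"
        let lang = url.host_str()?.split('.').next()?.to_owned();
        // "/wiki/Article_Title" -> "Article_Title"
        let title = url.path().strip_prefix("/wiki/")?.to_owned();
        Some(WikipediaTitleNorm { lang, title })
    }

    fn main() {
        assert_eq!(parse_qid("Q42"), Some(Qid(42)));
        let norm = parse_wikipedia_url("https://en.wikipedia.org/wiki/Article_Title").unwrap();
        assert_eq!(norm.lang, "en");
        assert_eq!(norm.title, "Article_Title");
    }

The benchmark numbers further down suggest the QID path is trivial while the title path is dominated by full URL parsing, which lines up with the "revisit WikipediaTitleNorm design" item.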

@biodranik (Member) left a comment

  1. Does it make sense to start splitting the code into logical modules in separate sources?
  2. Implementing convenient cmd line args may save some time.
  3. Some common data preparation tasks or common test scenarios can also be implemented in rust with a cmd line param if it saves time in the longer term.
  4. Is there a Rust library to read and extract tags from planet.o5m files directly? Just curious, using osmfilter or osmium-tool is also ok.

@newsch (Collaborator, Author) commented Jun 1, 2023

  1. Does it make sense to start splitting the code into logical modules in separate sources?

I think so, at the very least it needs a lib.rs to make a separate binary for testing the html simplification. I won't go overboard 😉.

  2. Implementing convenient cmd line args may save some time.

I agree, but I wanted to get the logic working before worrying about that.

  3. Some common data preparation tasks or common test scenarios can also be implemented in rust with a cmd line param if it saves time in the longer term.

Good point, I'll keep that in mind if I run into any other work that needs to be done.

  4. Is there a Rust library to read and extract tags from planet.o5m files directly? Just curious, using osmfilter or osmium-tool is also ok.

There are a number of crates related to the .pbf format, and bindings for libosmium, but I haven't come across any .o5m ones that seem thoroughly developed.
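
For what it's worth, a minimal sketch of reading tags from a pbf file with the `osmpbf` crate, one of the pbf crates mentioned above. This is illustrative only and not necessarily the crate or approach the generator integration would use:

    use osmpbf::{Element, ElementReader};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let reader = ElementReader::from_path("planet.osm.pbf")?;
        let mut count = 0_u64;

        // Walk every element and count those carrying a wikidata or wikipedia tag.
        reader.for_each(|element| {
            let tagged = match &element {
                Element::Node(n) => n.tags().any(|(k, _)| k == "wikidata" || k == "wikipedia"),
                Element::DenseNode(n) => n.tags().any(|(k, _)| k == "wikidata" || k == "wikipedia"),
                Element::Way(w) => w.tags().any(|(k, _)| k == "wikidata" || k == "wikipedia"),
                Element::Relation(r) => r.tags().any(|(k, _)| k == "wikidata" || k == "wikipedia"),
            };
            if tagged {
                count += 1;
            }
        })?;

        println!("elements with wikidata/wikipedia tags: {count}");
        Ok(())
    }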

@biodranik (Member)

Can pbf be used to get fresh OpenStreetMap.org updates without converting it to o5m?

@newsch (Collaborator, Author) commented Jun 2, 2023

The planet downloads are available in pbf: https://planet.openstreetmap.org/pbf/

I think this is saying you can apply the daily/hourly diffs to pbf, but I'm not sure I'm reading it correctly: https://wiki.openstreetmap.org/wiki/Planet.osm/diffs#Using_the_replication_diffs

This makes it seem like diffs can be appended at the end of the file, in which case wouldn't you need to read the entire file to get the true state of any node/object?
https://wiki.openstreetmap.org/wiki/PBF_Format#What_are_the_replication_fields_for?

I don't understand enough about these formats or how the generator works to know if they'd be compatible.

@biodranik (Member)

That's why we use o5m. It is suitable for incremental updates, and the osmupdate tool creates a new planet.o5m file automatically by applying these updates.

@newsch (Collaborator, Author) commented Jun 2, 2023

It looks like osmupdate supports pbf for the same thing, but I don't understand how that works.

@biodranik (Member)

Tested, it works for pbf too. Cool! The whole planet can be stripped down to 58 GB:

    "$OSMUPDATE" -v --drop-authors --drop-version --hash-memory=16000 planet-230529.osm.pbf planet-updated.osm.pbf

@newsch (Collaborator, Author) commented Jun 5, 2023

Alright, here's what I've learned from comparing the html outputs:

The simplifying step of the python and the rust versions gives comparable output (proper minification is still needed), but the html that the python version gets from the api is much simpler than I expected.

It uses the Extracts API, which strips most of the markup.

The html in the dumps, on the other hand, seems much closer to the content of a complete article. Size-wise, the dump html is around 10x the size of the extracts for the subset I looked at.

To get to parity with that output, we'll need to add additional steps to the html processing to remove:

  • mw-specific tags and attributes
  • links
  • images
  • info boxes

This is doable, but it will involve some more exploration of what exactly to match on (see the sketch below).

I'll make issues for that and the minification.
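
For reference, a rough sketch of what that extra stripping could look like with the `scraper` crate (assuming that's the html library in use here); the selectors are placeholders, not the rules this PR or #4 will settle on:

    use scraper::{Html, Selector};

    /// Detach every element matching `css` from the parsed document.
    fn strip(document: &mut Html, css: &str) {
        let selector = Selector::parse(css).expect("valid selector");
        // Collect node ids first; the tree can't be mutated while the
        // select() iterator still borrows it.
        let ids: Vec<_> = document.select(&selector).map(|el| el.id()).collect();
        for id in ids {
            if let Some(mut node) = document.tree.get_mut(id) {
                node.detach();
            }
        }
    }

    fn main() {
        let mut doc = Html::parse_document(
            r#"<p>Keep this. <a href="./Foo">a link</a></p>
               <figure><img src="pic.jpg"></figure>
               <table class="infobox"><tr><td>box</td></tr></table>"#,
        );
        // Placeholder selectors for links, images, and info boxes.
        // Note: detaching <a> drops its text too; unwrapping links while
        // keeping their text would need a different tree operation.
        for css in ["a", "figure", "img", "table.infobox"] {
            strip(&mut doc, css);
        }
        println!("{}", doc.root_element().html());
    }

The mw-specific tags and attributes would presumably be handled the same way once the exact names to match on are known.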

@biodranik (Member)

Minification is not a blocker and can be done later, if necessary. It would be great to compare the final mwm sizes; AFAIR, some compression is used there.

Showing images for those who want them, and maybe leaving links too, may be a good idea.

@newsch (Collaborator, Author) commented Jun 6, 2023

Minification is not a blocker and can be done later, if necessary.

👍

It would be great to compare the final mwm sizes; AFAIR, some compression is used there.

Good point. I've seen the compression referenced in text_storage.hpp.
I'll look into doing this locally.

Showing images for those who want them, and maybe leaving links too, may be a good idea.

I know that's a goal, but right now the links/images are all relative urls, so they'll still need some processing.
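
To illustrate the kind of processing needed, resolving a relative href against an article's base URL with the `url` crate could look like this sketch (the base URL and href forms here are assumptions for illustration):

    use url::Url;

    fn main() -> Result<(), url::ParseError> {
        // Assumed base: the article's canonical URL.
        let base = Url::parse("https://en.wikipedia.org/wiki/Article_Title")?;

        // A relative article link.
        let article = base.join("./Other_Article")?;
        assert_eq!(article.as_str(), "https://en.wikipedia.org/wiki/Other_Article");

        // A protocol-relative image link.
        let image = base.join("//upload.wikimedia.org/example.jpg")?;
        assert_eq!(image.as_str(), "https://upload.wikimedia.org/example.jpg");

        Ok(())
    }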

newsch marked this pull request as ready for review June 7, 2023 22:13
@newsch (Collaborator, Author) commented Jun 7, 2023

Next steps are:

  • reviewing html output (Additional HTML Simplification #4)
  • connecting output to map generator (mainly handling wikipedia -> wikidata mapping)
  • defining some custom errors for better handling
  • initial profiling of the pipeline (right now it can roughly keep up with tar on my machine, and that's with writing the output files to a ramdisk)

After that I think this will be ready to try out in production!

(review thread on src/html.rs, resolved)
newsch requested a review from biodranik June 22, 2023 17:16
@biodranik (Member) left a comment

LGTM with a few comments

(review threads on README.md, article_processing_config.json, src/html.rs, and src/wm/mod.rs, all resolved)
@newsch (Collaborator, Author) commented Jun 22, 2023

Thanks, I'll address those comments, rebase into individual changes, and merge.

The html processing should perform both of the main steps handled by the original
`descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text

Determining how similar the output is will require more testing.

A separate binary target is included for standalone html processing.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Initial results:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          34 ns/iter (+/- 1)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:      60,682 ns/iter (+/- 83,376)

Based on these results and a flamegraph of loading the file, url parsing
is the most expensive part.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The largest amount of time by far was wasted because I used `ok_or` instead of
`ok_or_else` for converting to the anyhow errors. I thought that with a
static string and no arguments they wouldn't be that expensive, but they
still took a lot of time, even with backtraces disabled.

Removing them reduces the benchmark time 60x:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          36 ns/iter (+/- 11)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:         835 ns/iter (+/- 68)

I also tried removing url::parse and using string operations. That got
it down another 10x, but I don't think that's worth the potential bugs
right now.

A small optimization for reading the json: std::io::stdin() is behind a
lock by default, and already has an internal buffer.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
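
As a side note on the `ok_or`/`ok_or_else` point above, the difference is just eager vs. lazy construction of the error value; a minimal sketch (not code from this PR):

    use anyhow::{anyhow, Result};

    fn first_char(s: &str) -> Result<char> {
        // Eager: `ok_or(anyhow!("empty string"))` would build the
        // anyhow::Error on every call, even when the Option is Some.
        // Lazy: the closure below only runs on the None path.
        s.chars().next().ok_or_else(|| anyhow!("empty string"))
    }

    fn main() {
        assert_eq!(first_char("abc").unwrap(), 'a');
        assert!(first_char("").is_err());
    }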
Per-language section removal is configured with a static json file.

This includes a test to make sure the file exists and is formatted correctly.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
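
A sketch of what such a test could look like, assuming only that `article_processing_config.json` sits at the crate root and should parse as a json object (the real structure and the real test may differ):

    #[cfg(test)]
    mod tests {
        #[test]
        fn config_is_valid_json() {
            // include_str! fails the build if the file is missing;
            // the path is relative to this source file.
            let raw = include_str!("../article_processing_config.json");
            let parsed: serde_json::Value =
                serde_json::from_str(raw).expect("config should be valid json");
            assert!(parsed.is_object(), "config should be a json object");
        }
    }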
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
newsch merged commit bb1f897 into main on Jun 23, 2023
1 check passed
newsch deleted the proof-of-concept branch June 23, 2023 19:50
newsch added this to the v0.1 milestone Jul 4, 2023