
Proof of Concept #3

Merged: 6 commits merged into main from proof-of-concept on Jun 23, 2023
Conversation

@newsch (Collaborator) commented Jun 1, 2023

This is an initial implementation of searching for Wikipedia urls/Wikidata QIDs and simplifying html.
It's a little messy, inefficient, and doesn't handle multiple languages or writing to an output directory.

Remaining work for this PR:

  • Refactor into library/modules
  • Set up comparison of html processing with python implementation
    • processing of same html files
    • scraped vs enterprise dump
  • Add unit tests
  • Revisit WikipediaTitleNorm design (parsing 3 million QIDs is almost instant, 1.5 million titles around a minute)
  • Add logging/tracing
  • Add proper CLI
  • Look through TODOs
  • Add multiple language support
  • Write to directory
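
For context, a rough sketch of the kind of url/QID parsing described above. The types and helpers are hypothetical stand-ins for the actual `WikipediaTitleNorm` design (and the `url` crate is assumed); this is not code from the branch:

    use url::Url;

    /// A Wikidata item id such as "Q42" (hypothetical simplified type).
    #[derive(Debug, PartialEq, Eq, Hash)]
    struct Qid(u64);

    fn parse_qid(value: &str) -> Option<Qid> {
        // "Q42" -> 42
        value.strip_prefix('Q')?.parse().ok().map(Qid)
    }

    /// A normalized (language, title) pair (hypothetical simplified type).
    #[derive(Debug, PartialEq, Eq, Hash)]
    struct WikipediaTitleNorm {
        lang: String,
        title: String,
    }

    fn parse_wikipedia_url(value: &str) -> Option<WikipediaTitleNorm> {
        let url = Url::parse(value).ok()?;
        // "en.wikipedia.org" -> "en"
        let lang = url.host_str()?.split('.').next()?.to_owned();
        // "/wiki/Article_Title" -> "Article_Title"
        let title = url.path().strip_prefix("/wiki/")?.to_owned();
        Some(WikipediaTitleNorm { lang, title })
    }

    fn main() {
        assert_eq!(parse_qid("Q42"), Some(Qid(42)));
        let norm = parse_wikipedia_url("https://en.wikipedia.org/wiki/Article_Title").unwrap();
        assert_eq!(norm.lang, "en");
        assert_eq!(norm.title, "Article_Title");
    }

The benchmark numbers further down suggest the QID path is trivial while the title path is dominated by full URL parsing, which lines up with the "revisit WikipediaTitleNorm design" item.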

@biodranik (Member) left a comment

  1. Does it make sense to start splitting the code into logical modules in separate sources?
  2. Implementing convenient cmd line args may save some time.
  3. Some common data preparation tasks or common test scenarios can also be implemented in rust with a cmd line param if it saves time in the longer term.
  4. Is there a Rust library to read and extract tags from planet.o5m files directly? Just curious, using osmfilter or osmium-tool is also ok.

@newsch (Collaborator, Author) commented Jun 1, 2023

  1. Does it make sense to start splitting the code into logical modules in separate sources?

I think so, at the very least it needs a lib.rs to make a separate binary for testing the html simplification. I won't go overboard 😉.

  2. Implementing convenient cmd line args may save some time.

I agree, but I wanted to get the logic working before worrying about that.

  3. Some common data preparation tasks or common test scenarios can also be implemented in rust with a cmd line param if it saves time in the longer term.

Good point, I'll keep that in mind if I run into any other work that needs to be done.

  4. Is there a Rust library to read and extract tags from planet.o5m files directly? Just curious, using osmfilter or osmium-tool is also ok.

There are a number of crates related to the .pbf format, and bindings for libosmium, but I haven't come across any .o5m ones that seem thoroughly developed.
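
For what it's worth, a minimal sketch of reading tags from a pbf file with the `osmpbf` crate, one of the pbf crates mentioned above. This is illustrative only and not necessarily the crate or approach the generator integration would use:

    use osmpbf::{Element, ElementReader};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let reader = ElementReader::from_path("planet.osm.pbf")?;
        let mut count = 0_u64;

        // Walk every element and count those carrying a wikidata or wikipedia tag.
        reader.for_each(|element| {
            let tagged = match &element {
                Element::Node(n) => n.tags().any(|(k, _)| k == "wikidata" || k == "wikipedia"),
                Element::DenseNode(n) => n.tags().any(|(k, _)| k == "wikidata" || k == "wikipedia"),
                Element::Way(w) => w.tags().any(|(k, _)| k == "wikidata" || k == "wikipedia"),
                Element::Relation(r) => r.tags().any(|(k, _)| k == "wikidata" || k == "wikipedia"),
            };
            if tagged {
                count += 1;
            }
        })?;

        println!("elements with wikidata/wikipedia tags: {count}");
        Ok(())
    }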

@biodranik (Member)

Can pbf be used to get fresh OpenStreetMap.org updates without converting it to o5m?

@newsch (Collaborator, Author) commented Jun 2, 2023

The planet downloads are available in pbf: https://planet.openstreetmap.org/pbf/

I think this is saying you can apply the daily/hourly diffs to pbf, but I'm not sure I'm reading it correctly: https://wiki.openstreetmap.org/wiki/Planet.osm/diffs#Using_the_replication_diffs

This makes it seem like diffs can be appended at the end of the file, in which case wouldn't you need to read the entire file to get the true state of any node/object?
https://wiki.openstreetmap.org/wiki/PBF_Format#What_are_the_replication_fields_for?

I don't understand enough about these formats or how the generator works to know if they'd be compatible.

@biodranik (Member)

That's why we use o5m. It is suitable for incremental updates, and the osmupdate tool creates a new planet.o5m file automatically by applying these updates.

@newsch (Collaborator, Author) commented Jun 2, 2023

It looks like osmupdate supports pbf for the same thing, but I don't understand how that works.

@biodranik (Member)

Tested, it works for pbf too. Cool! The whole planet can be stripped down to 58 GB:

    "$OSMUPDATE" -v --drop-authors --drop-version --hash-memory=16000 planet-230529.osm.pbf planet-updated.osm.pbf

@newsch (Collaborator, Author) commented Jun 5, 2023

Alright, here's what I've learned from comparing the html outputs:

The simplifying step of the python and the rust versions gives comparable output (proper minification is still needed), but the html that the python version gets from the api is much simpler than I expected.

It uses the Extracts API, which strips most of the markup.

The html in the dumps, on the other hand, seems much closer to the content of a complete article. Size-wise, the dump html is around 10x the size of the extracts for the subset I looked at.

To get to parity with that output, we'll need to add additional steps to the html processing to remove:

  • mw-specific tags and attributes
  • links
  • images
  • info boxes

This is doable, but it will involve some more exploration of what exactly to match on (see the sketch below).

I'll make issues for that and the minification.
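
For reference, a rough sketch of what that extra stripping could look like with the `scraper` crate (assuming that's the html library in use here); the selectors are placeholders, not the rules this PR or #4 will settle on:

    use scraper::{Html, Selector};

    /// Detach every element matching `css` from the parsed document.
    fn strip(document: &mut Html, css: &str) {
        let selector = Selector::parse(css).expect("valid selector");
        // Collect node ids first; the tree can't be mutated while the
        // select() iterator still borrows it.
        let ids: Vec<_> = document.select(&selector).map(|el| el.id()).collect();
        for id in ids {
            if let Some(mut node) = document.tree.get_mut(id) {
                node.detach();
            }
        }
    }

    fn main() {
        let mut doc = Html::parse_document(
            r#"<p>Keep this. <a href="./Foo">a link</a></p>
               <figure><img src="pic.jpg"></figure>
               <table class="infobox"><tr><td>box</td></tr></table>"#,
        );
        // Placeholder selectors for links, images, and info boxes.
        // Note: detaching <a> drops its text too; unwrapping links while
        // keeping their text would need a different tree operation.
        for css in ["a", "figure", "img", "table.infobox"] {
            strip(&mut doc, css);
        }
        println!("{}", doc.root_element().html());
    }

The mw-specific tags and attributes would presumably be handled the same way once the exact names to match on are known.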

@biodranik (Member)

Minification is not a blocker and can be done later, if necessary. It would be great to compare the final mwm sizes; AFAIR, some compression is used there.

Showing images for those who want them, and maybe leaving links too, may be a good idea.

@newsch (Collaborator, Author) commented Jun 6, 2023

Minification is not a blocker and can be done later, if necessary.

👍

It would be great to compare the final mwm sizes; AFAIR, some compression is used there.

Good point. I've seen the compression referenced in text_storage.hpp.
I'll look into doing this locally.

Showing images for those who want them, and maybe leaving links too, may be a good idea.

I know that's a goal, but right now the links/images are all relative urls, so they'll still need some processing.
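
To illustrate the kind of processing needed, resolving a relative href against an article's base URL with the `url` crate could look like this sketch (the base URL and href forms here are assumptions for illustration):

    use url::Url;

    fn main() -> Result<(), url::ParseError> {
        // Assumed base: the article's canonical URL.
        let base = Url::parse("https://en.wikipedia.org/wiki/Article_Title")?;

        // A relative article link.
        let article = base.join("./Other_Article")?;
        assert_eq!(article.as_str(), "https://en.wikipedia.org/wiki/Other_Article");

        // A protocol-relative image link.
        let image = base.join("//upload.wikimedia.org/example.jpg")?;
        assert_eq!(image.as_str(), "https://upload.wikimedia.org/example.jpg");

        Ok(())
    }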

newsch marked this pull request as ready for review June 7, 2023 22:13
@newsch (Collaborator, Author) commented Jun 7, 2023

Next steps are:

  • reviewing html output (Additional HTML Simplification #4)
  • connecting output to map generator (mainly handling wikipedia -> wikidata mapping)
  • defining some custom errors for better handling
  • initial profiling of the pipeline (right now it can roughly keep up with tar on my machine, and that's with writing the output files to a ramdisk)

After that I think this will be ready to try out in production!

(review thread on src/html.rs, resolved)
newsch requested a review from biodranik June 22, 2023 17:16
@biodranik (Member) left a comment

LGTM with a few comments

(review threads on README.md, article_processing_config.json, src/html.rs, and src/wm/mod.rs, all resolved)
@newsch (Collaborator, Author) commented Jun 22, 2023

Thanks, I'll address those comments, rebase into individual changes, and merge.

The html processing should perform both of the main steps handled by the original
`descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text

Determining how similar the output is will require more testing.

A separate binary target is included for standalone html processing.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Initial results:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          34 ns/iter (+/- 1)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:      60,682 ns/iter (+/- 83,376)

Based on these results and a flamegraph of loading the file, url parsing
is the most expensive part.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The largest amount of time by far was wasted because I used `ok_or` instead of
`ok_or_else` for converting to the anyhow errors. I thought that with a
static string and no arguments they wouldn't be that expensive, but they
still took a lot of time, even with backtraces disabled.

Removing them reduces the benchmark time 60x:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          36 ns/iter (+/- 11)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:         835 ns/iter (+/- 68)

I also tried removing url::parse and using string operations. That got
it down another 10x, but I don't think that's worth the potential bugs
right now.

A small optimization for reading the json: std::io::stdin() is behind a
lock by default, and already has an internal buffer.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
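
As a side note on the `ok_or`/`ok_or_else` point above, the difference is just eager vs. lazy construction of the error value; a minimal sketch (not code from this PR):

    use anyhow::{anyhow, Result};

    fn first_char(s: &str) -> Result<char> {
        // Eager: `ok_or(anyhow!("empty string"))` would build the
        // anyhow::Error on every call, even when the Option is Some.
        // Lazy: the closure below only runs on the None path.
        s.chars().next().ok_or_else(|| anyhow!("empty string"))
    }

    fn main() {
        assert_eq!(first_char("abc").unwrap(), 'a');
        assert!(first_char("").is_err());
    }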
Per-language section removal is configured with a static json file.

This includes a test to make sure the file exists and is formatted correctly.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
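
A sketch of what such a test could look like, assuming only that `article_processing_config.json` sits at the crate root and should parse as a json object (the real structure and the real test may differ):

    #[cfg(test)]
    mod tests {
        #[test]
        fn config_is_valid_json() {
            // include_str! fails the build if the file is missing;
            // the path is relative to this source file.
            let raw = include_str!("../article_processing_config.json");
            let parsed: serde_json::Value =
                serde_json::from_str(raw).expect("config should be valid json");
            assert!(parsed.is_object(), "config should be a json object");
        }
    }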
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
newsch merged commit bb1f897 into main on Jun 23, 2023
1 check passed
newsch deleted the proof-of-concept branch June 23, 2023 19:50
newsch added this to the v0.1 milestone Jul 4, 2023