Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Html Processing #10

Merged
merged 1 commit into from
Jul 10, 2023
Merged

Initial Html Processing #10

merged 1 commit into from
Jul 10, 2023

Conversation

newsch
Copy link
Collaborator

@newsch newsch commented Jun 29, 2023

While I'm waiting for more dump files to download, I got started with this.

This PR will bring the html output to functional parity with the scraped ones.
There will still be extra metadata and other bloat covered in #4.

Remaining steps:

  • convert all relative articles to absolute
  • Remove img/picture elements
  • Make sure interwiki links are handled correctly
  • Merge Generator directory format #6

env_logger::Builder::new()
.filter_level(log::LevelFilter::Info)
.parse_default_env()
.try_init()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It enables the logger and sets a default log level that can still be overridden by the default environment variable. I've run into trouble before when I do it in a different order and then the env var doesn't work, or can filter higher levels but not enable lower levels.

src/html.rs Outdated

for id in to_remove.drain(..) {
let base_url = Url::parse(&format!("https://{}.wikipedia.org/wiki/", &lang)).unwrap();
fix_relative_urls(&mut document, base_url);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we insert full URLs on every page now? That is an overhead, webview on iOS and Android should work properly with relative URLs. Let's investigate and fix it in a separate issue later (TODO may be good here too).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When looking at this more I found that the html does include a base element set to //lang.wikipedia.org/wiki/, but when opened as a file in firefox it assumes the scheme is file: so they don't work.
I'll remove this, and later if we run into a similar problem with the webviews, setting the scheme once in the base element should handle it.

src/html.rs Outdated

document.html()
fn is_empty_or_whitespace(el: &ElementRef) -> bool {
el.text().all(|t| t.trim().is_empty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to check whitespaces without modification? E.g. use is_whitespace ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do el.text().flat_map(str::chars).all(char::is_whitespace). I was going to say that working with characters might be less efficient, since the Patterns can work directly in UTF-8, but it looks like the implementation of trim also uses char::is_whitespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As trim actually returns a slice, your implementation is already optimal )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I realized though is that trim still has to check the right side if it encounters non-whitespace characters from the left. trim_left is better, but both still need to construct the slice. I don't think the compiler can optimize that away.

I tried the three approaches with a selection of strings and found that the char iterator was fastest. Ultimately a micro-optimization but still interesting 😉.

const MIXED: &str =          " \tcd\nfg ";
const NOT_WHITESPACE: &str = "abcdefgh";
const WHITESPACE: &str =     " \t\n \t\t  ";
running 9 tests
test chars_mixed              ... bench:           3 ns/iter (+/- 0)
test chars_not_whitespace     ... bench:           0 ns/iter (+/- 0)
test chars_whitespace         ... bench:           7 ns/iter (+/- 1)
test trim_left_mixed          ... bench:           4 ns/iter (+/- 1)
test trim_left_not_whitespace ... bench:           3 ns/iter (+/- 0)
test trim_left_whitespace     ... bench:           9 ns/iter (+/- 0)
test trim_mixed               ... bench:          10 ns/iter (+/- 1)
test trim_not_whitespace      ... bench:           5 ns/iter (+/- 4)
test trim_whitespace          ... bench:          12 ns/iter (+/- 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an important outcome of an interesting project: to learn something new )

src/html.rs Outdated Show resolved Hide resolved
src/html.rs Outdated Show resolved Hide resolved
@newsch newsch marked this pull request as ready for review June 30, 2023 18:29
@newsch newsch added this to the v0.1 milestone Jul 4, 2023
src/html.rs Outdated
links
);

// orphan one of the links
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this comment be clarified?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// orphan one of the links
// Detach one of the links from the root tree (as if previously deleted) to ensure it handles orphan nodes nicely.

Some of the tree operations panic when the node doesn't have a parent, and we could be processing nodes that were removed from the tree in a previous pass.

src/html.rs Outdated Show resolved Hide resolved
See #11 for next steps

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
@newsch newsch merged commit 45efd77 into main Jul 10, 2023
1 check passed
@newsch newsch deleted the html-processing branch July 10, 2023 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants