Initial Html Processing #10

newsch · 2023-06-29T20:02:31Z

While I'm waiting for more dump files to download, I got started with this.

This PR brings the HTML output to functional parity with the scraped pages.
Extra metadata and other bloat will remain; those are covered in #4.

Remaining steps:

Convert all relative article links to absolute
Remove img/picture elements
Make sure interwiki links are handled correctly
Merge Generator directory format #6

biodranik · 2023-06-30T05:32:12Z

src/bin/simplify_html.rs

+    env_logger::Builder::new()
+        .filter_level(log::LevelFilter::Info)
+        .parse_default_env()
+        .try_init()?;


What does it do?

It enables the logger and sets a default log level that can still be overridden by the default environment variable (RUST_LOG). I've run into trouble before when doing this in a different order: the env var then either doesn't work at all, or can only filter out higher levels but not enable lower ones.
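
To illustrate the ordering described above (set the default first, then let the environment override it), here is a minimal std-only sketch. It is a simplified model and not env_logger's implementation: the `Level` enum, `parse_level`, and `effective_level` are hypothetical names, and the real RUST_LOG syntax also accepts per-module directives that this sketch ignores.

```rust
/// Log levels ordered from least to most verbose (hypothetical, mirrors common level names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Off,
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Parse a level name case-insensitively; unknown names yield `None`.
fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "off" => Some(Level::Off),
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Start from the built-in default, then let a valid environment value replace it.
/// Because the override is applied last, it can both raise and lower the verbosity.
fn effective_level(default: Level, env_value: Option<&str>) -> Level {
    env_value.and_then(parse_level).unwrap_or(default)
}
```

With a default of `Level::Info`, passing `Some("debug")` enables a lower (more verbose) level, and passing `None` or an unrecognized value leaves the default in place.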

biodranik · 2023-06-30T05:34:00Z

src/html.rs


-    for id in to_remove.drain(..) {
+    let base_url = Url::parse(&format!("https://{}.wikipedia.org/wiki/", &lang)).unwrap();
+    fix_relative_urls(&mut document, base_url);


Do we insert full URLs on every page now? That adds overhead; webviews on iOS and Android should work properly with relative URLs. Let's investigate and fix it in a separate issue later (a TODO may be good here too).

When looking at this more I found that the HTML does include a base element set to //lang.wikipedia.org/wiki/, but when the page is opened as a file in Firefox, the scheme is assumed to be file:, so the links don't work.
I'll remove this, and if we later run into a similar problem with the webviews, setting the scheme once in the base element should handle it.
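
The Firefox behavior follows from how scheme-relative references are resolved: a reference starting with `//` inherits the scheme of the document that contains it (RFC 3986, section 5.2). A minimal sketch of just that one case, using a hypothetical helper `resolve_scheme_relative` and `en` as an assumed language code:

```rust
/// Resolve a reference against the scheme of the containing document.
/// Only the scheme-relative ("//host/path") case is handled here;
/// any other reference is returned unchanged.
fn resolve_scheme_relative(document_scheme: &str, reference: &str) -> String {
    if reference.starts_with("//") {
        // The reference carries a host and path but no scheme, so it borrows the document's.
        format!("{}:{}", document_scheme, reference)
    } else {
        reference.to_string()
    }
}
```

For a page loaded from disk, `resolve_scheme_relative("file", "//en.wikipedia.org/wiki/")` yields a `file:` URL, which is why the links break locally; the same base resolved from an `https` document points at Wikipedia as intended.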

biodranik · 2023-06-30T05:35:35Z

src/html.rs


-    document.html()
+fn is_empty_or_whitespace(el: &ElementRef) -> bool {
+    el.text().all(|t| t.trim().is_empty())


Is it possible to check for whitespace without modifying the string? E.g. use is_whitespace?

We could do el.text().flat_map(str::chars).all(char::is_whitespace). I was going to say that working with characters might be less efficient, since the Patterns can work directly in UTF-8, but it looks like the implementation of trim also uses char::is_whitespace.

As trim actually returns a slice, your implementation is already optimal )

One thing I realized though is that trim still has to check the right side if it encounters non-whitespace characters from the left. trim_left is better, but both still need to construct the slice. I don't think the compiler can optimize that away.

I tried the three approaches with a selection of strings and found that the char iterator was fastest. Ultimately a micro-optimization but still interesting 😉.

const MIXED: &str = " \tcd\nfg ";
const NOT_WHITESPACE: &str = "abcdefgh";
const WHITESPACE: &str = " \t\n \t\t ";

running 9 tests
test chars_mixed              ... bench:           3 ns/iter (+/- 0)
test chars_not_whitespace     ... bench:           0 ns/iter (+/- 0)
test chars_whitespace         ... bench:           7 ns/iter (+/- 1)
test trim_left_mixed          ... bench:           4 ns/iter (+/- 1)
test trim_left_not_whitespace ... bench:           3 ns/iter (+/- 0)
test trim_left_whitespace     ... bench:           9 ns/iter (+/- 0)
test trim_mixed               ... bench:          10 ns/iter (+/- 1)
test trim_not_whitespace      ... bench:           5 ns/iter (+/- 4)
test trim_whitespace          ... bench:          12 ns/iter (+/- 0)
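
For reference, the three variants discussed in this thread might look like the sketch below when written against plain `&str` values. The function names are hypothetical and the benchmark harness itself is not reproduced; `trim_start` is used because `trim_left` has since been deprecated in favor of it.

```rust
const MIXED: &str = " \tcd\nfg ";
const NOT_WHITESPACE: &str = "abcdefgh";
const WHITESPACE: &str = " \t\n \t\t ";

/// Variant 1: walk the characters and stop at the first non-whitespace one.
fn is_blank_chars(s: &str) -> bool {
    s.chars().all(char::is_whitespace)
}

/// Variant 2: trim only the left side, then check whether anything remains.
fn is_blank_trim_start(s: &str) -> bool {
    s.trim_start().is_empty()
}

/// Variant 3: trim both sides, then check whether anything remains.
fn is_blank_trim(s: &str) -> bool {
    s.trim().is_empty()
}
```

All three agree on every input; they differ only in how much of the string they inspect before answering, which matches the ordering seen in the benchmark numbers.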

This is an important outcome of an interesting project: to learn something new )

biodranik · 2023-07-06T10:19:19Z

src/html.rs

+            links
+        );
+
+        // orphan one of the links


Can this comment be clarified?

Suggested change

-        // orphan one of the links
+        // Detach one of the links from the root tree (as if previously deleted) to ensure it handles orphan nodes nicely.

Some of the tree operations panic when the node doesn't have a parent, and we could be processing nodes that were removed from the tree in a previous pass.
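
The failure mode can be sketched with a tiny arena-backed tree. This is an illustration only and not the tree type used in src/html.rs; the `Node` struct and `detach` function are hypothetical. The point is that an operation which needs the parent should handle a missing parent explicitly rather than unwrap it.

```rust
/// A node in a tiny arena-backed tree; `parent` is `None` for the root
/// and for nodes that have already been detached.
struct Node {
    parent: Option<usize>,
    children: Vec<usize>,
}

/// Detach `id` from its parent.
/// Returns `false` instead of panicking when the node has no parent.
fn detach(nodes: &mut [Node], id: usize) -> bool {
    match nodes[id].parent.take() {
        Some(parent) => {
            // Remove the back-reference from the parent's child list.
            nodes[parent].children.retain(|&child| child != id);
            true
        }
        // Already an orphan (or the root): nothing to do.
        None => false,
    }
}
```

Calling `detach` twice on the same node is then harmless: the first call removes it from its parent and the second call reports `false`, which is the situation a later processing pass would encounter.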

biodranik reviewed Jun 30, 2023

newsch mentioned this pull request Jun 30, 2023

Check relative link handling in webviews #11

Open

newsch marked this pull request as ready for review June 30, 2023 18:29

newsch added this to the v0.1 milestone Jul 4, 2023

biodranik approved these changes Jul 6, 2023

Remove images and links

8ec696c

See #11 for next steps Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

newsch force-pushed the html-processing branch from fad10eb to 8ec696c July 10, 2023 14:57

newsch merged commit 45efd77 into main Jul 10, 2023
1 check passed

newsch deleted the html-processing branch July 10, 2023 14:58
