Initial Html Processing #10
Conversation
env_logger::Builder::new()
    .filter_level(log::LevelFilter::Info)
    .parse_default_env()
    .try_init()?;
What does it do?
It enables the logger and sets a default log level that can still be overridden by the default environment variable (RUST_LOG). I've run into trouble before when calling these in a different order: the env var either doesn't work at all, or can filter out higher levels but can't enable lower ones.
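The precedence described above can be sketched without the env_logger crate. This is a hypothetical, std-only illustration (the helper name effective_filter is not from the PR): the level set in code acts as a default, and the environment variable wins when present, mirroring filter_level() followed by parse_default_env().

```rust
use std::env;

// Hypothetical helper sketching the precedence: the in-code level is
// only a default; the env var, when present, overrides it, mirroring
// filter_level(...) followed by parse_default_env().
fn effective_filter(code_default: &str, rust_log: Option<&str>) -> String {
    match rust_log {
        Some(level) => level.to_string(), // env var overrides the default
        None => code_default.to_string(), // fall back to the in-code level
    }
}

fn main() {
    // Mirror the RUST_LOG lookup; unset means the code default applies.
    let from_env = env::var("RUST_LOG").ok();
    println!("{}", effective_filter("info", from_env.as_deref()));
}
```

Calling parse_default_env() after filter_level() is what gives the env var the last word; reversing the order would let the hard-coded default clobber it.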
src/html.rs
Outdated
for id in to_remove.drain(..) {
let base_url = Url::parse(&format!("https://{}.wikipedia.org/wiki/", &lang)).unwrap();
fix_relative_urls(&mut document, base_url);
Do we insert full URLs on every page now? That adds overhead; the webviews on iOS and Android should work properly with relative URLs. Let's investigate and fix it in a separate issue later (a TODO may be good here too).
When looking at this more I found that the html does include a base element set to //lang.wikipedia.org/wiki/, but when opened as a file in Firefox it assumes the scheme is file: so the links don't work.
I'll remove this, and later if we run into a similar problem with the webviews, setting the scheme once in the base element should handle it.
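The fallback mentioned here — setting the scheme once on the base element — can be sketched as a small std-only helper. This is hypothetical (absolutize_base is not a function from the PR): it gives a protocol-relative href an explicit scheme so that a page opened via file:// still resolves its links against https.

```rust
// Hypothetical sketch: give a protocol-relative <base> href an explicit
// scheme, so links resolve correctly even when the page is opened from
// a file:// URL.
fn absolutize_base(href: &str, scheme: &str) -> String {
    match href.strip_prefix("//") {
        // "//en.wikipedia.org/wiki/" -> "https://en.wikipedia.org/wiki/"
        Some(rest) => format!("{}://{}", scheme, rest),
        // Already has a scheme (or is a plain relative path): leave as-is.
        None => href.to_string(),
    }
}

fn main() {
    println!("{}", absolutize_base("//en.wikipedia.org/wiki/", "https"));
}
```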
src/html.rs
Outdated
document.html()
fn is_empty_or_whitespace(el: &ElementRef) -> bool {
    el.text().all(|t| t.trim().is_empty())
Is it possible to check for whitespace without constructing a new slice? E.g. use is_whitespace?
We could do el.text().flat_map(str::chars).all(char::is_whitespace). I was going to say that working with characters might be less efficient, since the Patterns can work directly in UTF-8, but it looks like the implementation of trim also uses char::is_whitespace.
As trim actually returns a slice, your implementation is already optimal )
One thing I realized though is that trim still has to check the right side if it encounters non-whitespace characters from the left. trim_left is better, but both still need to construct the slice. I don't think the compiler can optimize that away.
I tried the three approaches with a selection of strings and found that the char iterator was fastest. Ultimately a micro-optimization but still interesting 😉.
const MIXED: &str = " \tcd\nfg ";
const NOT_WHITESPACE: &str = "abcdefgh";
const WHITESPACE: &str = " \t\n \t\t ";
running 9 tests
test chars_mixed ... bench: 3 ns/iter (+/- 0)
test chars_not_whitespace ... bench: 0 ns/iter (+/- 0)
test chars_whitespace ... bench: 7 ns/iter (+/- 1)
test trim_left_mixed ... bench: 4 ns/iter (+/- 1)
test trim_left_not_whitespace ... bench: 3 ns/iter (+/- 0)
test trim_left_whitespace ... bench: 9 ns/iter (+/- 0)
test trim_mixed ... bench: 10 ns/iter (+/- 1)
test trim_not_whitespace ... bench: 5 ns/iter (+/- 4)
test trim_whitespace ... bench: 12 ns/iter (+/- 0)
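The three variants benchmarked above can be sketched over a plain &str (the real code iterates over an element's text nodes). Note that trim_left is the deprecated name; trim_start is its current equivalent in std.

```rust
// Char-iterator version: no subslice is constructed at all.
fn ws_chars(s: &str) -> bool {
    s.chars().all(char::is_whitespace)
}

// trim version: may scan from both ends before the emptiness check.
fn ws_trim(s: &str) -> bool {
    s.trim().is_empty()
}

// trim_start (formerly trim_left) only scans from the left, but still
// constructs the trimmed subslice before checking emptiness.
fn ws_trim_start(s: &str) -> bool {
    s.trim_start().is_empty()
}

fn main() {
    // The inputs from the benchmark above; all three variants agree.
    for s in [" \tcd\nfg ", "abcdefgh", " \t\n \t\t "] {
        assert_eq!(ws_chars(s), ws_trim(s));
        assert_eq!(ws_chars(s), ws_trim_start(s));
    }
    println!("all three agree");
}
```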
This is an important outcome of an interesting project: to learn something new )
src/html.rs
Outdated
links
);
// orphan one of the links
Can this comment be clarified?
Suggested change:
- // orphan one of the links
+ // Detach one of the links from the root tree (as if previously deleted) to ensure it handles orphan nodes nicely.
Some of the tree operations panic when the node doesn't have a parent, and we could be processing nodes that were removed from the tree in a previous pass.
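The hazard described here — tree operations panicking on parentless nodes — can be modeled with a minimal, hypothetical sketch (these are not the actual scraper/ego-tree types from the PR): each node records an optional parent, and the detach operation guards against orphans instead of panicking.

```rust
// Minimal sketch, not the PR's real tree types: a node with an
// optional parent index.
struct Node {
    parent: Option<usize>,
}

// Returns true if the node was attached and is now detached; false if
// it was already an orphan (e.g. removed in an earlier pass), in which
// case we skip it instead of panicking.
fn detach(nodes: &mut [Node], id: usize) -> bool {
    nodes[id].parent.take().is_some()
}

fn main() {
    let mut nodes = vec![Node { parent: None }, Node { parent: Some(0) }];
    assert!(detach(&mut nodes, 1));  // first detach succeeds
    assert!(!detach(&mut nodes, 1)); // already an orphan: skipped safely
    println!("ok");
}
```

The same idea motivates the test above: deliberately orphan one node so the processing pass exercises the "no parent" branch.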
See #11 for next steps.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
While I'm waiting for more dump files to download, I got started with this.
This PR will bring the html output to functional parity with the scraped ones.
There will still be extra metadata and other bloat covered in #4.
Remaining steps:
- img/picture elements