Skip to content
This repository has been archived by the owner on Jan 21, 2023. It is now read-only.

How to get all text of document except text inside script/style/noscript tags? #85

Closed
vedantroy opened this issue Feb 19, 2021 · 3 comments

Comments

@vedantroy
Copy link

Kind of a cross-post of this.

I'm trying to get all visible text in a document (text that is not part of script/style/noscript tags).

I've come up with the following algo:

let parser = kuchiki::parse_html().one(content);
for child in parser.inclusive_descendants() {
    if let Some(el) = child.as_element() {
        let tag_name = &el.name.local;
        if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
            child.detach();
        }
    }
}
let text = parser.text_contents();
println!("{}", text);

However, this doesn't seem to work. parser.text_contents() still returns inline Javascript in style tags.

Am I using the detach API incorrectly?

@SimonSapin
Copy link
Collaborator

SimonSapin commented Feb 19, 2021

Are you sure this code doesn’t remove one element, then stops?

The tree traversal code inside inclusive_descendants basically assumes the tree doesn’t change while it’s running. After it’s done with a sub-tree it continues by following the "next sibling" link, and if there isn’t one the parent’s next sibling etc. But if a node has been detached it doesn’t have a next sibling or parent anymore, so the traversal code thinks it has reached the root again and stops.

At a high level this is an iterator invalidation bug. This same kind of bug could happen for example with a Vec if borrowing it through .iter() didn’t prevent taking &mut exclusive references to mutate it while iteration is going on. Here Kuchiki uses shared ownership with reference-counting so there isn’t such a compile-time check, but this is "only" a logic bug and there is no Undefined Behavior.

It’s been a while since I wrote this code, but maybe here

kuchiki/src/iter.rs

Lines 288 to 290 in f652e38

None => node.parent().map(|parent| {
State { $next: NodeEdge::$End(parent), $next_back: next_back }
})
instead of node.parent().map(…) returning None when there’s no parent, we should use expect to panic because reaching a node without a parent without reaching the next == next_back case above indicates that the tree has been modified like in your case and the traversal is now incorrect. Maybe. I’m not sure. This way your program would panic with a message that explains about mutating the tree while iterating it.

Anyway, back to your issue, consider first accumulating the nodes to detach into a Vec and only after the traversal is done detatching them.

@vedantroy
Copy link
Author

I ended up using a recursive function as follows:

fn get_visible_text(root: NodeRef, processed_text: &mut String) {
    for child in root.children() {
        if let Some(el) = child.as_element() {
            let tag_name = &el.name.local;
            if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
                return;
            }
            get_visible_text(child, processed_text);
        } else if let Some(text_node) = child.as_text() {
            let text = text_node.borrow();
            processed_text.push_str(&text);
        }
    }
}

but I'll try out your solution too.

@SimonSapin
Copy link
Collaborator

I will soon archive this repository and make it read-only, so this issue will not be addressed: https://github.com/kuchiki-rs/kuchiki#archived

@SimonSapin SimonSapin closed this as not planned Won't fix, can't repro, duplicate, stale Jan 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants