How to get all text of document except text inside script/style/noscript tags? #85

vedantroy · 2021-02-19T07:25:24Z

Kind of a cross-post of this.

I'm trying to get all visible text in a document (text that is not part of script/style/noscript tags).

I've come up with the following algo:

let parser = kuchiki::parse_html().one(content);
for child in parser.inclusive_descendants() {
    if let Some(el) = child.as_element() {
        let tag_name = &el.name.local;
        if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
            child.detach();
        }
    }
}
let text = parser.text_contents();
println!("{}", text);

However, this doesn't seem to work. parser.text_contents() still returns inline Javascript in style tags.

Am I using the detach API incorrectly?

The text was updated successfully, but these errors were encountered:

SimonSapin · 2021-02-19T07:50:32Z

Are you sure this code doesn’t remove one element, then stops?

The tree traversal code inside inclusive_descendants basically assumes the tree doesn’t change while it’s running. After it’s done with a sub-tree it continues by following the "next sibling" link, and if there isn’t one the parent’s next sibling etc. But if a node has been detached it doesn’t have a next sibling or parent anymore, so the traversal code thinks it has reached the root again and stops.

At a high level this is an iterator invalidation bug. This same kind of bug could happen for example with a Vec if borrowing it through .iter() didn’t prevent taking &mut exclusive references to mutate it while iteration is going on. Here Kuchiki uses shared ownership with reference-counting so there isn’t such a compile-time check, but this is "only" a logic bug and there is no Undefined Behavior.

It’s been a while since I wrote this code, but maybe here

kuchiki/src/iter.rs

Lines 288 to 290 in f652e38

    
           None => node.parent().map(|parent| { 
        
               State { $next: NodeEdge::$End(parent), $next_back: next_back } 
        
           })

instead of node.parent().map(…) returning None when there’s no parent, we should use expect to panic because reaching a node without a parent without reaching the next == next_back case above indicates that the tree has been modified like in your case and the traversal is now incorrect. Maybe. I’m not sure. This way your program would panic with a message that explains about mutating the tree while iterating it.

Anyway, back to your issue, consider first accumulating the nodes to detach into a Vec and only after the traversal is done detatching them.

vedantroy · 2021-02-19T08:07:54Z

I ended up using a recursive function as follows:

fn get_visible_text(root: NodeRef, processed_text: &mut String) {
    for child in root.children() {
        if let Some(el) = child.as_element() {
            let tag_name = &el.name.local;
            if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
                return;
            }
            get_visible_text(child, processed_text);
        } else if let Some(text_node) = child.as_text() {
            let text = text_node.borrow();
            processed_text.push_str(&text);
        }
    }
}

but I'll try out your solution too.

SimonSapin · 2023-01-21T22:24:41Z

I will soon archive this repository and make it read-only, so this issue will not be addressed: https://github.com/kuchiki-rs/kuchiki#archived

ezekg mentioned this issue Mar 30, 2022

How to configure Stork to ignore <pre> tags? jameslittle230/stork#279

Closed

SimonSapin closed this as completed Jan 21, 2023

SimonSapin closed this as not planned Won't fix, can't repro, duplicate, stale Jan 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to get all text of document except text inside script/style/noscript tags? #85

How to get all text of document except text inside script/style/noscript tags? #85

vedantroy commented Feb 19, 2021

SimonSapin commented Feb 19, 2021 •

edited

vedantroy commented Feb 19, 2021

SimonSapin commented Jan 21, 2023

How to get all text of document except text inside script/style/noscript tags? #85

How to get all text of document except text inside script/style/noscript tags? #85

Comments

vedantroy commented Feb 19, 2021

SimonSapin commented Feb 19, 2021 • edited

vedantroy commented Feb 19, 2021

SimonSapin commented Jan 21, 2023

SimonSapin commented Feb 19, 2021 •

edited