How to get all text of document except text inside script/style/noscript tags? #85
Comments
Are you sure this code doesn't remove one element, then stop? The tree traversal code inside … At a high level this is an iterator invalidation bug. This same kind of bug could happen for example with a …
It's been a while since I wrote this code, but maybe here (lines 288 to 290 in f652e38), instead of node.parent().map(…) returning None when there's no parent, we should use expect to panic, because reaching a node without a parent, without first reaching the next == next_back case above, indicates that the tree has been modified (as in your case) and the traversal is now incorrect. Maybe. I'm not sure. This way your program would panic with a message that explains the problem of mutating the tree while iterating over it.
Anyway, back to your issue: consider first accumulating the nodes to detach into a …
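The "collect first, then detach" idea can be sketched without any external crate. The block below is only an illustration on a toy arena tree: `Tree`, `Node`, `descendants`, `detach`, `text_contents` and `strip_hidden` are hypothetical names invented for this sketch and are not kuchiki's API. The point it tries to show is that the traversal finishes and produces an owned list before any node is unlinked, so no iterator is ever walking a tree that is being modified.

```rust
// Toy arena tree for illustration only. None of these names come from kuchiki.
struct Node {
    tag: Option<&'static str>, // Some(tag) for elements, None for text nodes
    text: &'static str,        // text content (empty for elements)
    parent: Option<usize>,
    children: Vec<usize>,
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    fn new() -> Self {
        // Node 0 is the root element.
        Tree { nodes: vec![Node { tag: Some("html"), text: "", parent: None, children: vec![] }] }
    }

    fn add(&mut self, parent: usize, tag: Option<&'static str>, text: &'static str) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node { tag, text, parent: Some(parent), children: vec![] });
        self.nodes[parent].children.push(id);
        id
    }

    /// Pre-order list of node ids reachable from `start`, returned as an owned Vec.
    fn descendants(&self, start: usize) -> Vec<usize> {
        let mut out = Vec::new();
        let mut stack = vec![start];
        while let Some(id) = stack.pop() {
            out.push(id);
            // Push children in reverse so they pop in document order.
            stack.extend(self.nodes[id].children.iter().rev());
        }
        out
    }

    /// Unlink a node (and therefore its whole subtree) from its parent.
    fn detach(&mut self, id: usize) {
        if let Some(p) = self.nodes[id].parent.take() {
            self.nodes[p].children.retain(|&c| c != id);
        }
    }

    fn text_contents(&self, start: usize) -> String {
        self.descendants(start).into_iter().map(|id| self.nodes[id].text).collect()
    }
}

/// Phase 1 collects the ids to remove; phase 2 detaches them.
fn strip_hidden(tree: &mut Tree) {
    let to_detach: Vec<usize> = tree
        .descendants(0)
        .into_iter()
        .filter(|&id| matches!(tree.nodes[id].tag, Some("script" | "style" | "noscript")))
        .collect();
    // The traversal is finished here, so mutating the tree is safe.
    for id in to_detach {
        tree.detach(id);
    }
}
```

With a real DOM library the same two-phase shape would presumably apply: gather the matching nodes into a vector while iterating, then detach each one in a second loop.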
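I ended up using a recursive function as follows, but I'll try out your solution too:

    use kuchiki::NodeRef;

    // Appends the text of every descendant of `root` to `processed_text`,
    // skipping the contents of script, style and noscript elements.
    fn get_visible_text(root: NodeRef, processed_text: &mut String) {
        for child in root.children() {
            if let Some(el) = child.as_element() {
                let tag_name = &el.name.local;
                if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
                    // Skip only this subtree. A `return` here would also
                    // drop every sibling that follows the skipped element.
                    continue;
                }
                get_visible_text(child, processed_text);
            } else if let Some(text_node) = child.as_text() {
                let text = text_node.borrow();
                processed_text.push_str(&text);
            }
        }
    }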
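For readers who want to try the recursive approach without pulling in a parser, here is a minimal self-contained sketch of the same idea on a toy tree. `HtmlNode` and `visible_text` are hypothetical stand-ins made up for this example, not kuchiki types. The detail it is meant to highlight is that a hidden element is skipped with `continue`, so the siblings that come after it are still visited; leaving the function with `return` at that point would silently drop them.

```rust
// Toy node type standing in for an HTML tree; hypothetical, not kuchiki's NodeRef.
enum HtmlNode {
    Element { tag: &'static str, children: Vec<HtmlNode> },
    Text(&'static str),
}

/// Appends all text under `node` to `out`, except text inside
/// script, style and noscript elements.
fn visible_text(node: &HtmlNode, out: &mut String) {
    match node {
        HtmlNode::Text(t) => out.push_str(t),
        HtmlNode::Element { children, .. } => {
            for child in children {
                if let HtmlNode::Element { tag, .. } = child {
                    if matches!(*tag, "script" | "style" | "noscript") {
                        // Skip this subtree only and keep going with the
                        // remaining siblings.
                        continue;
                    }
                }
                visible_text(child, out);
            }
        }
    }
}
```

Because nothing is detached, this variant never mutates the tree during traversal and so sidesteps the iterator invalidation problem entirely.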
I will soon archive this repository and make it read-only, so this issue will not be addressed: https://github.com/kuchiki-rs/kuchiki#archived
Kind of a cross-post of this.
I'm trying to get all visible text in a document (text that is not part of script/style/noscript tags).
I've come up with the following algo:
However, this doesn't seem to work. parser.text_contents() still returns inline JavaScript in style tags. Am I using the detach API incorrectly?