How to configure Stork to ignore <pre> tags? #279
Comments
I also tried |
Hmm, that's not great. I'll take a look and see if -James |
It seems to happen when more than one element matches the excluded selector. Here's a reproducible test case:
[input]
exclude_html_selector = '.noindex'
html_selector = 'main'
files = [
{ path = 'index.html', url = '/', title = 'Index' },
]
<main>
<p>DO_INDEX</p>
<p class='noindex'>DO_NOT_INDEX</p>
<p class='noindex'>DO_NOT_INDEX</p>
</main>
When you run
Possibly related to kuchiki-rs/kuchiki#81 and Stork's usage below?
Lines 50 to 54 in 4d301cd
Also kuchiki-rs/kuchiki#85 (comment) is worth reading too. |
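To make the suspected failure mode concrete, here is a toy model in plain Rust. It assumes, without confirmation from this thread, that the problem is detaching nodes while still walking the sibling chain. The Siblings type and its methods are hypothetical stand-ins and are not kuchiki's or Stork's API. The sketch only shows why removing while walking can stop after the first match, and why collecting the matches first avoids that.

```rust
/// Toy node: its text, whether it matches the exclude selector,
/// and links to its neighbours (indices into `Siblings::nodes`).
struct Node {
    text: &'static str,
    excluded: bool,
    prev: Option<usize>,
    next: Option<usize>,
}

/// Hypothetical stand-in for a list of sibling elements.
struct Siblings {
    nodes: Vec<Node>,
    first: Option<usize>,
}

impl Siblings {
    fn new(items: &[(&'static str, bool)]) -> Self {
        let n = items.len();
        let nodes = items
            .iter()
            .enumerate()
            .map(|(i, &(text, excluded))| Node {
                text,
                excluded,
                prev: if i > 0 { Some(i - 1) } else { None },
                next: if i + 1 < n { Some(i + 1) } else { None },
            })
            .collect();
        Siblings { nodes, first: if n > 0 { Some(0) } else { None } }
    }

    /// Unlink node `i` from its neighbours and clear its own links.
    fn detach(&mut self, i: usize) {
        let (prev, next) = (self.nodes[i].prev, self.nodes[i].next);
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.first = next,
        }
        if let Some(n) = next {
            self.nodes[n].prev = prev;
        }
        self.nodes[i].prev = None;
        self.nodes[i].next = None;
    }

    /// Text of the nodes still linked into the list, in order.
    fn text(&self) -> Vec<&'static str> {
        let mut out = Vec::new();
        let mut cur = self.first;
        while let Some(i) = cur {
            out.push(self.nodes[i].text);
            cur = self.nodes[i].next;
        }
        out
    }

    /// Hazard: the cursor advances through a node that was just detached,
    /// so its `next` link is already gone and the walk ends early.
    fn remove_while_walking(&mut self) {
        let mut cur = self.first;
        while let Some(i) = cur {
            if self.nodes[i].excluded {
                self.detach(i);
            }
            cur = self.nodes[i].next;
        }
    }

    /// Safer: finish the walk first, then detach every collected match.
    fn remove_after_collecting(&mut self) {
        let mut matches = Vec::new();
        let mut cur = self.first;
        while let Some(i) = cur {
            if self.nodes[i].excluded {
                matches.push(i);
            }
            cur = self.nodes[i].next;
        }
        for i in matches {
            self.detach(i);
        }
    }
}
```

With the three paragraphs from the test case above (one DO_INDEX followed by two DO_NOT_INDEX), remove_while_walking leaves the second DO_NOT_INDEX in place, while remove_after_collecting leaves only DO_INDEX. Whether Stork's real code path behaves this way depends on the lines referenced above, which are not reproduced here.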
Here's a failing test for
#[test]
fn test_html_content_extraction_with_multiple_excluded_selectors() {
run_html_parse_test(
"This content should be indexed This content should also be indexed",
Some(".yes"),
Some(".no"),
r#"
<html>
<head></head>
<body>
<h1>This is a title</h1>
<main>
<section class="yes" id="first">
<p>This content should be indexed</p>
<p id="second">This content should also be indexed</p>
<p class="no">This content should not be indexed</p>
<p class="no">This content should also not be indexed</p>
</section>
</main>
</body>
</html>"#,
)
}
|
Thanks for the fix, @jameslittle230! All |
That's great to hear! |
My company's documentation has a lot of <pre> tags containing code snippets, and Stork seems to be indexing all of these. Is there a way for me to configure Stork to ignore <pre> and possibly even <code> tags? I tried setting exclude_html_selector = 'pre' and also tried exclude_html_selector = 'pre, code', but neither seems to have an effect.
I'm guessing this is also why our search index is about 120MB uncompressed, which I'd really like to lower. 😃