How to configure Stork to ignore <pre> tags? #279

Closed
ezekg opened this issue Mar 29, 2022 · 6 comments · Fixed by #281
@ezekg

ezekg commented Mar 29, 2022

My company's documentation has a lot of <pre> tags containing code snippets, and Stork seems to be indexing all of them. Is there a way to configure Stork to ignore <pre> tags, and possibly <code> tags as well? I tried setting exclude_html_selector = 'pre' and also exclude_html_selector = 'pre, code', but neither seems to have any effect.

I'm guessing this is also why our search index is about 120MB uncompressed, which I'd really like to lower. 😃

@ezekg
Author

ezekg commented Mar 29, 2022

I also tried exclude_html_selector = '.noindex', but I still see code blocks with that class showing up in the index. I’ll try to put together a reproducible example tomorrow morning.

@jameslittle230
Owner

Hmm, that's not great. I'll take a look and see if exclude_html_selector stopped working at some point, and make sure it works with pre tags. Thanks for the report :)

-James

@ezekg
Author

ezekg commented Mar 30, 2022

It seems to happen when there is more than one .noindex element on a page. Only the first element is excluded.

Here's a reproducible test case:

stork.toml:

[input]
exclude_html_selector = '.noindex'
html_selector = 'main'
files = [
  { path = 'index.html', url = '/', title = 'Index' },
]

index.html:

<main>
  <p>DO_INDEX</p>
  <p class='noindex'>DO_NOT_INDEX</p>
  <p class='noindex'>DO_NOT_INDEX</p>
</main>

When you run stork test --config stork.toml, the word DO_NOT_INDEX is still indexed once, even though both elements containing it should be excluded.

Possibly related to kuchiki-rs/kuchiki#81 and Stork's usage below?

if let Ok(excluded_elements) = as_node.select(exclude_selector) {
    for excluded_element in excluded_elements {
        excluded_element.as_node().detach();
    }
}

Also kuchiki-rs/kuchiki#85 (comment) is worth reading too.
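
To illustrate the kind of failure suspected here, below is a toy model in plain Rust. It is not kuchiki's API and not Stork's code: the `Siblings` arena, the `excluded` flag (standing in for "matches exclude_html_selector"), and both method names are hypothetical. It assumes the traversal finds the next sibling by asking the node it just visited, so detaching that node mid-walk cuts the walk short. Whether that is exactly what happens inside kuchiki, or how #281 fixes it, is an assumption on my part.

```rust
// Toy model only: this is NOT kuchiki's API. Siblings live in an arena and
// are linked by index, the way a DOM links siblings by pointer.
struct Node {
    text: &'static str,
    excluded: bool, // stands in for "matches exclude_html_selector"
    prev: Option<usize>,
    next: Option<usize>,
}

struct Siblings {
    nodes: Vec<Node>,
    first: Option<usize>,
}

impl Siblings {
    /// Builds a sibling list from (text, excluded) pairs.
    fn new(items: &[(&'static str, bool)]) -> Self {
        let len = items.len();
        let nodes = items
            .iter()
            .enumerate()
            .map(|(i, &(text, excluded))| Node {
                text,
                excluded,
                prev: if i > 0 { Some(i - 1) } else { None },
                next: if i + 1 < len { Some(i + 1) } else { None },
            })
            .collect();
        Siblings { nodes, first: if len > 0 { Some(0) } else { None } }
    }

    /// Unlinks node `i` from its neighbours and clears its own links.
    fn detach(&mut self, i: usize) {
        let (prev, next) = (self.nodes[i].prev, self.nodes[i].next);
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.first = next,
        }
        if let Some(n) = next {
            self.nodes[n].prev = prev;
        }
        self.nodes[i].prev = None;
        self.nodes[i].next = None;
    }

    /// Text of the nodes that are still attached, in order.
    fn remaining(&self) -> Vec<&'static str> {
        let mut out = Vec::new();
        let mut cursor = self.first;
        while let Some(i) = cursor {
            out.push(self.nodes[i].text);
            cursor = self.nodes[i].next;
        }
        out
    }

    /// Suspected buggy pattern: detach while the walk is still in progress.
    fn detach_while_walking(&mut self) {
        let mut cursor = self.first;
        while let Some(i) = cursor {
            if self.nodes[i].excluded {
                self.detach(i);
            }
            // After a detach this link is None, so the walk ends early.
            cursor = self.nodes[i].next;
        }
    }

    /// Workaround: finish the walk first, then detach every match.
    fn collect_then_detach(&mut self) {
        let mut matches = Vec::new();
        let mut cursor = self.first;
        while let Some(i) = cursor {
            if self.nodes[i].excluded {
                matches.push(i);
            }
            cursor = self.nodes[i].next;
        }
        for i in matches {
            self.detach(i);
        }
    }
}
```

With the three paragraphs from the test case above (one DO_INDEX followed by two excluded DO_NOT_INDEX), `detach_while_walking` leaves the second DO_NOT_INDEX attached, which matches the symptom. `collect_then_detach` removes both. In the real code, the analogous change would presumably be to collect the selected elements into a Vec before detaching any of them.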

@ezekg
Author

ezekg commented Mar 30, 2022

Here's a failing test for stork-lib/src/index_v3/build/fill_intermediate_entries/word_list_generators/html_word_list_generator.rs:

#[test]
fn test_html_content_extraction_with_multiple_excluded_selectors() {
    run_html_parse_test(
        "This content should be indexed This content should also be indexed",
        Some(".yes"),
        Some(".no"),
        r#"
    <html>
        <head></head>
        <body>
            <h1>This is a title</h1>
            <main>
                <section class="yes" id="first">
                    <p>This content should be indexed</p>
                    <p id="second">This content should also be indexed</p>
                    <p class="no">This content should not be indexed</p>
                    <p class="no">This content should also not be indexed</p>
                </section>
            </main>
        </body>
    </html>"#,
    )
}

failures:

---- index_v3::build::fill_intermediate_entries::word_list_generators::html_word_list_generator::tests::test_html_content_extraction_with_multiple_excluded_selectors stdout ----
thread 'index_v3::build::fill_intermediate_entries::word_list_generators::html_word_list_generator::tests::test_html_content_extraction_with_multiple_excluded_selectors' panicked at 'assertion failed: `(left == right)`
  left: `"This content should be indexed This content should also be indexed"`,
 right: `"This content should be indexed This content should also be indexed This content should also not be indexed"`: expected: This content should be indexed This content should also be indexed
computed: This content should be indexed This content should also be indexed This content should also not be indexed', stork-lib/src/index_v3/build/fill_intermediate_entries/word_list_generators/html_word_list_generator.rs:172:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@ezekg
Author

ezekg commented Apr 4, 2022

Thanks for the fix, @jameslittle230! All .noindex tags are now ignored, and our search index is down to 25MB uncompressed (607 KB compressed with Brotli). 🖖

@jameslittle230
Owner

That's great to hear!
