
Commit

[docs-only] Update search README.md
References: #7553 (enhancement: improve content extraction stop word cleaning)

Make the term `stop word` and the use of the environment variable clearer.
mmattel committed Nov 3, 2023
1 parent 0df009e commit ed17f31
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions services/search/README.md
@@ -70,8 +70,7 @@ When the search service can reach Tika, it begins to read out the content on dem

Content extraction and handling the extracted content can be very resource intensive. Content extraction is therefore limited to files with a certain file size. The default limit is 20MB and can be configured using the `SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT` variable.

- When extracting the content you can specify whether filler words are ignored or not.
- To keep them, the environment variable `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` must be set to false.
+ When extracting content, you can specify whether [stop words](https://en.wikipedia.org/wiki/Stop_word) like `I`, `you`, `the` are ignored or not. Normally, these stop words are removed automatically. To keep them, the environment variable `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` must be set to `false`.

When using the Tika container and docker-compose, consider the following:

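As a rough illustration of the two settings described in this hunk, the variables could be exported in the shell that starts the search service. The value `false` comes from the README text. Giving the size limit in bytes is an assumption that the excerpt does not confirm, so check the service's configuration reference before relying on it.

```shell
# Keep stop words in the extracted content (they are removed by default).
export SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS=false

# ASSUMPTION: the limit is expressed in bytes. 20 * 1024 * 1024 mirrors the
# documented 20MB default; adjust the unit if the service expects otherwise.
export SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT=$((20 * 1024 * 1024))
```

Leaving `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` unset keeps the default behavior of removing stop words.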
