enhancement: improve content extraction stop word cleaning #7553

Merged: 3 commits into owncloud:master on Oct 23, 2023

fschade (Contributor) commented Oct 20, 2023

Description

Until now, it was not possible to configure whether stop words are removed
from the content extracted for search.

This can now be controlled with the new setting SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS.
Stop word cleaning is enabled by default; set SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS=false to disable it.

In addition, the stop word cleanup is less aggressive: it now leaves numbers, URLs,
and essentially everything other than the defined stop words untouched.
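
To illustrate the behavior described above, here is a minimal sketch in Python. It is not the code from this pull request: the stop word list, the helper names `env_flag` and `clean_stop_words`, and the way the environment variable is parsed are all assumptions made for the example. Only the variable name SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS and its default (enabled) come from the description.

```python
import os

# Hypothetical stop word list, for illustration only.
# The list actually used by the extractor is not shown in this PR.
STOP_WORDS = {"a", "an", "and", "at", "has", "is", "of", "the", "to"}


def env_flag(name, default=True):
    """Read a boolean setting from the environment (assumed parsing rules)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def clean_stop_words(text, stop_words=STOP_WORDS, enabled=True):
    """Drop only tokens that are defined stop words; keep everything else."""
    if not enabled:
        return text
    # Whitespace tokenization keeps numbers and URLs intact as single tokens.
    kept = [tok for tok in text.split() if tok.lower() not in stop_words]
    return " ".join(kept)


if __name__ == "__main__":
    enabled = env_flag("SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS", default=True)
    sample = "The report at https://example.com has 42 pages"
    print(clean_stop_words(sample, enabled=enabled))
```

With the setting left at its default, the sample keeps the URL and the number and loses only the listed stop words. With the setting set to false, the text passes through unchanged.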

Related Issue

Motivation and Context

..., no bugs, ... yummy

How Has This Been Tested?

  • unit tests

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation ticket raised:

Co-authored-by: Martin <github@diemattels.at>

sonarcloud bot commented Oct 23, 2023

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 0 (rating A)
  • Coverage: 100.0%
  • Duplication: 0.0%

@fschade fschade merged commit cdd2100 into owncloud:master Oct 23, 2023
2 checks passed
ownclouders pushed a commit that referenced this pull request Oct 23, 2023
* enhancement: improve content extraction stop word cleaning

* fix: cleanup documentation

Co-authored-by: Martin <github@diemattels.at>

* fix: failing tika stop word unit tests

---------

Co-authored-by: Martin <github@diemattels.at>
nabim777 pushed a commit that referenced this pull request Oct 26, 2023
mmattel added a commit that referenced this pull request Nov 3, 2023
References: #7553 (enhancement: improve content extraction stop word cleaning)

Making the term `stop word` and the use of the envvar more clear.
fschade pushed a commit that referenced this pull request Nov 3, 2023
ownclouders pushed a commit that referenced this pull request Nov 3, 2023
ownclouders pushed a commit that referenced this pull request Nov 4, 2023
ownclouders pushed a commit that referenced this pull request Nov 5, 2023
ownclouders pushed a commit that referenced this pull request Nov 6, 2023
Development

Successfully merging this pull request may close these issues.

Full-search. No result on some words or characters
3 participants