New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smart Search common words assigned too much weight #12450

Merged
merged 4 commits into from Jan 8, 2018

Conversation

Projects
None yet
9 participants
@chrisdavenport
Contributor

chrisdavenport commented Oct 17, 2016

Common words (also known as "stop words") in Smart Search were being assigned too much weight in search queries and were not flagged as being common. The reason turned out to be that the default language "*" was not being recognised as matching the language code used in the common words table ("en"). Thus common words were simply not being recognised as such.

Note that this only affects English because no other language has common words at the present time (unless you added them to the database yourself).

Summary of Changes

This PR amends the FinderIndexerHelper::isCommon method so that "*" is recognised as shorthand for the default language.

Testing Instructions

It's difficult to construct a search query where it makes a significant difference to the outcome. In fact, I've given up trying! The easiest way is to simply look in the #__finder_terms table and notice that the word "the" has dropped from a weight of 0.2 to 0.025 and it has a 1 in the "common" column. Prior to applying this PR there wouldn't be any terms flagged as common.

Note that you will need to purge and re-index after applying this PR. Re-indexing without purging will not force the weights to be recalculated.

Fixing this bug is really about correctly labelling common words so as to pave the way for more sophisticated ranking algorithms in the future.

Documentation Changes Required

None. This is a bug fix.

@brianteeman

This comment has been minimized.

Contributor

brianteeman commented Oct 17, 2016

Note that you will need to purge and re-index after applying this PR.
Re-indexing without purging will not force the weights to be recalculated.

I assume "purge" means using the "Clear Index" button?

Will this not need to be documented in the upgrade notes as people dont
usually "clear the index" do they (I dont use Smart Search)?

On 17 October 2016 at 23:26, Chris Davenport notifications@github.com
wrote:

Common words (also known as "stop words") in Smart Search were being
assigned too much weight in search queries and were not flagged as being
common. The reason turned out to be that the default language "*" was not
being recognised as matching the language code used in the common words
table ("en"). Thus common words were simply not being recognised as such.

Note that this only affects English because no other language has common
words at the present time (unless you added them to the database yourself).
Summary of Changes

This PR amends the FinderIndexerHelper::isCommon method so that "*" is
recognised as shorthand for the default language.
Testing Instructions

It's difficult to construct a search query where it makes a significant
difference to the outcome. In fact, I've given up trying! The easiest way
is to simply look in the #__finder_terms table and notice that the word
"the" has dropped from a weight of 0.2 to 0.025 and it has a 1 in the
"common" column. Prior to applying this PR there wouldn't be any terms
flagged as common.

Note that you will need to purge and re-index after applying this PR.
Re-indexing without purging will not force the weights to be recalculated.

Fixing this bug is really about correctly labelling common words so as to
pave the way for more sophisticated ranking algorithms in the future.
Documentation Changes Required

None. This is a bug fix.

You can view, comment on, or merge this pull request online at:

#12450
Commit Summary

  • Smart Search common words assigned too much weight

File Changes

Patch Links:


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#12450, or mute the thread
https://github.com/notifications/unsubscribe-auth/ABPH8QW-vp97S_t17Vnw8Vel45Q0FDlzks5q0_YhgaJpZM4KZLYS
.

Brian Teeman
Co-founder Joomla! and OpenSourceMatters Inc.
https://brian.teeman.net/ http://brian.teeman.net/

@chrisdavenport

This comment has been minimized.

Contributor

chrisdavenport commented Oct 18, 2016

Yes, purge = clear index. It's still --purge on the cli.

For anything other than testing this PR re-indexing isn't important. As I noted in the summary, I very much doubt anyone will notice the difference anyway. I wouldn't want people to think they have to re-index, but the next time they do, they'll get the new weights.

@piotr-cz

This comment has been minimized.

Contributor

piotr-cz commented Oct 18, 2016

I know it's OT, but how is the table #__finder_terms_common populated?
AFAIK only on new installation (see com_finder installation file), but I think this should be part of language xml manifest.
There is also no user interface to manage the list.

@chrisdavenport

This comment has been minimized.

Contributor

chrisdavenport commented Oct 18, 2016

@piotr-cz You are correct and I would like to change that.

There needs to be a mechanism for including a common words table in language packs. We also need a mechanism for overriding entries for a specific website so that it can be tuned for the particular statistical distribution of words found on that site. And we need a more sophisticated mechanism for site administrators to influence the ranking calculations. Do we even need a common words database table? We could just load the common words into memory from a JSON file in the language pack as needed, then load an override file from another location to override/customise it. There are many possibilities and your suggestions are welcome.

But, one step at a time. :-)

@alikon

This comment has been minimized.

Contributor

alikon commented Oct 23, 2016

before patch
bfp
after patch
afp

@alikon

This comment has been minimized.

Contributor

alikon commented Oct 23, 2016

I have tested this item successfully on 590939b


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

@brianteeman

This comment has been minimized.

Contributor

brianteeman commented Nov 14, 2016

I have tested this item successfully on 98ef480

database changes observed


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

@franz-wohlkoenig

This comment has been minimized.

franz-wohlkoenig commented Jan 3, 2017

I have tested this item 🔴 unsuccessfully on 98ef480

Test on Joomla! 3.7.0-alpha1. Looked for german-lang "die" (similar to "the"), before and after Patch weight: 0.2.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

@piotr-cz

This comment has been minimized.

Contributor

piotr-cz commented Jan 3, 2017

@franz-wohlkoenig I'm afraid that this works only for English language at the moment, see #12450 (comment)

@franz-wohlkoenig

This comment has been minimized.

franz-wohlkoenig commented Jan 3, 2017

Thanks for information @piotr-cz

@brianteeman

This comment has been minimized.

Contributor

brianteeman commented May 19, 2017

@franz-wohlkoenig could you reset your test result please


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

@franz-wohlkoenig

This comment has been minimized.

franz-wohlkoenig commented May 20, 2017

I have not tested this item.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

@franz-wohlkoenig

This comment has been minimized.

franz-wohlkoenig commented May 20, 2017

@brianteeman reset on "not tested". Will test using English lang.

@franz-wohlkoenig

This comment has been minimized.

franz-wohlkoenig commented May 20, 2017

couldn't test cause Issue

@brianteeman

This comment has been minimized.

Contributor

brianteeman commented Jan 4, 2018

Please mark RTC as it has two good tests

@Quy

This comment has been minimized.

Contributor

Quy commented Jan 4, 2018

RTC

@joomla-cms-bot joomla-cms-bot added the RTC label Jan 4, 2018

@mbabker mbabker added this to the Joomla 3.8.4 milestone Jan 8, 2018

@mbabker mbabker merged commit f3db025 into joomla:staging Jan 8, 2018

3 of 4 checks passed

continuous-integration/jenkins/pr-merge This commit cannot be built
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/drone/pr the build was successful
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@joomla-cms-bot joomla-cms-bot removed the RTC label Jan 8, 2018

@chrisdavenport chrisdavenport deleted the chrisdavenport:ss-common-weights branch Jan 30, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment