Smart Search common words assigned too much weight #12450
Common words (also known as "stop words") in Smart Search were being assigned too much weight in search queries and were not flagged as being common. The reason turned out to be that the default language "*" was not being recognised as matching the language code used in the common words table ("en"). Thus common words were simply not being recognised as such.
Note that this only affects English because no other language has common words at the present time (unless you added them to the database yourself).
Summary of Changes
This PR amends the FinderIndexerHelper::isCommon method so that "*" is recognised as shorthand for the default language.
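To make the idea concrete, here is a minimal sketch of that lookup, written in Python for brevity. It is not the PHP code from this PR: `COMMON_WORDS`, `DEFAULT_LANGUAGE` and `is_common` are hypothetical names, and the word list is a tiny stand-in for the common words table.

```python
# Illustrative only: the actual change is in the PHP method
# FinderIndexerHelper::isCommon. All names below are hypothetical.

DEFAULT_LANGUAGE = "en"  # assumed default language of the site

# Stand-in for the common words table, keyed by language code.
COMMON_WORDS = {
    "en": {"the", "and", "of"},
}


def is_common(token: str, lang: str) -> bool:
    """Return True if token is a common word in the given language."""
    # Treat "*" as shorthand for the default language before the lookup.
    # Without this step, "*" never matches "en" and nothing is flagged.
    if lang == "*":
        lang = DEFAULT_LANGUAGE
    return token in COMMON_WORDS.get(lang, set())
```

With this mapping in place, `is_common("the", "*")` and `is_common("the", "en")` give the same answer, which is the behaviour the PR is after.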
It's difficult to construct a search query where it makes a significant difference to the outcome. In fact, I've given up trying! The easiest way is to simply look in the #__finder_terms table and notice that the word "the" has dropped from a weight of 0.2 to 0.025 and it has a 1 in the "common" column. Prior to applying this PR there wouldn't be any terms flagged as common.
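The check described above amounts to a single query for terms flagged as common. The sketch below runs it against an in-memory SQLite table so that it is self-contained; the `jos_` prefix, the `term` column name and the reduced column set are assumptions, and a real site would run the equivalent query against its own database.

```python
import sqlite3

# Hypothetical stand-in for #__finder_terms with only the columns needed here.
# The "jos_" prefix and the "term" column name are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jos_finder_terms (term TEXT, weight REAL, common INTEGER)"
)
# The row for "the" as described above, after applying the PR and re-indexing.
conn.execute(
    "INSERT INTO jos_finder_terms VALUES (?, ?, ?)", ("the", 0.025, 1)
)


def common_terms(connection):
    """Return (term, weight) for every term flagged as common."""
    cursor = connection.execute(
        "SELECT term, weight FROM jos_finder_terms "
        "WHERE common = 1 ORDER BY term"
    )
    return cursor.fetchall()


print(common_terms(conn))
```

Before the PR the equivalent query would return no rows, since no terms were flagged as common.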
Note that you will need to purge and re-index after applying this PR. Re-indexing without purging will not force the weights to be recalculated.
Fixing this bug is really about correctly labelling common words so as to pave the way for more sophisticated ranking algorithms in the future.
Documentation Changes Required
None. This is a bug fix.
I assume "purge" means using the "Clear Index" button?
Will this not need to be documented in the upgrade notes, as people don't
On 17 October 2016 at 23:26, Chris Davenport firstname.lastname@example.org
Yes, purge = clear index. It's still --purge on the cli.
For anything other than testing this PR, re-indexing isn't important. As I noted in the summary, I very much doubt anyone will notice the difference anyway. I wouldn't want people to think they have to re-index, but the next time they do, they'll get the new weights.
@piotr-cz You are correct and I would like to change that.
There needs to be a mechanism for including a common words table in language packs. We also need a mechanism for overriding entries for a specific website so that it can be tuned for the particular statistical distribution of words found on that site. And we need a more sophisticated mechanism for site administrators to influence the ranking calculations. Do we even need a common words database table? We could just load the common words into memory from a JSON file in the language pack as needed, then load an override file from another location to override/customise it. There are many possibilities and your suggestions are welcome.
But, one step at a time. :-)
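As a rough illustration of the JSON idea floated above, the following Python sketch loads a base list and then applies a site-specific override. Nothing here exists in Joomla: the function name, the file layout (a JSON array for the base list, an object with `add` and `remove` keys for the override) and the paths are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Optional, Set


def load_common_words(
    base_file: Path, override_file: Optional[Path] = None
) -> Set[str]:
    """Load common words from a language pack file, then apply overrides."""
    # Base list from the language pack: a JSON array of words (assumed format).
    words = set(json.loads(base_file.read_text(encoding="utf-8")))

    # Optional site override: {"add": [...], "remove": [...]} (assumed format).
    if override_file is not None and override_file.exists():
        override = json.loads(override_file.read_text(encoding="utf-8"))
        words |= set(override.get("add", []))
        words -= set(override.get("remove", []))

    return words
```

Keeping the override as separate add and remove lists means a site can tune the list for its own word distribution without copying the whole language pack file.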