New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[4.0] Refactoring com_search & com_finder #12664
Conversation
So, I've been looking into the tokenizer of com_finder and did some testing. The tokenizer of com_finder splits strings in steps of 2kb at a time and then processes it. Quite frankly, I think that is bullshit. We are not working with Arduinos that virtualise a CPU and what not, so that we have to expect to exhaust our resources with 2kb of data. So I tried a tokenizer that does all the splitting in a better way and at the same time does not split this up in god knows how many steps. I then took a stemmer into the equation and stemmed all words and measured the time all of this took. To test this, I took a 6MB file with several books in it (Sherlock Holmes, War & Peace, The History of the USA, 1.1 million words). Tokenizing that into single words takes 1.4 seconds on my machine, sending that array of words through array_unique() (down to 32000 words) pushes that to 4 seconds and sending all of this through a Porter Stemmer brings us to a total of 12 seconds. Again calling array_unique() on this didn't change the execution time (it actually consistently pushed this down from 12s to 11.5s) but brought the word count down to 20000 words. So I'm going to remove such pseudo-optimizations from the codebase. Joomla should try to be grounded in reality and yes, there might be websites out there that have gigantic texts that are even larger than this, but I doubt that those will be done with Joomla. Now my text was plaintext without any HTML tags. Adding those in would most likely inflate the text by 20-30%. Adding another 100% onto this for good measure and we are still below the magic number of 30s for execution time. But we are at the limit of the #__content introtext column, which takes 16MB max. I guess it is safe to say that 99.9% of all content in a single article in Joomla sites is going to be smaller than War and Peace and definitely smaller than this test-file that I used. So I'm pretty sure that we can take the naive approach here. There can't be that crappy hosts out there with such big sites that this is going to be a problem. |
Doing all of this to just War and Peace alone brings this down to 5s. The file would still be 3.2 MB |
Implementing language based tokenizer Moving stuff around
…4search3 Conflicts: administrator/components/com_finder/helpers/indexer/stemmer.php administrator/components/com_search/helpers/indexer/indexer.php administrator/components/com_search/helpers/indexer/query.php administrator/components/com_search/views/index/tmpl/default.php components/com_finder/views/search/view.html.php components/com_search/models/search_old.php
|
||
// Check if a parser exists for the format. | ||
if (file_exists($path)) | ||
{ | ||
// Instantiate the parser. | ||
JLoader::register($class, $path); | ||
include_once $path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this change please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do. That slipped in when pulling in the changes from origin...
@@ -336,7 +334,7 @@ protected function tokenizeToDb($input, $context, $lang, $format) | |||
$string = substr($buffer, 0, $ls); | |||
|
|||
// Adjust the buffer based on the last space for the next iteration and trim. | |||
$buffer = StringHelper::trim(substr($buffer, $ls)); | |||
$buffer = JString::trim(substr($buffer, $ls)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise to the above
@@ -10,20 +10,19 @@ | |||
defined('_JEXEC') or die; | |||
|
|||
use Joomla\Registry\Registry; | |||
use Joomla\String\StringHelper; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for this (and all uses) here.
Merging these two components just doesn't feel right. Ya, it's an interesting situation shipping two search components as part of core, but the two components take such vastly different approaches to the concept of search that I don't think you can sanely manage both of those through a single component and use a plugin to "bridge" the more advanced and extremely technical approach into the "simple" component. The other changes look promising and should be continued on, it's just this one aspect that I'm not quite comfortable with at the moment. |
Fixing indexing for strict mysql
@Hackwar could you resolve the merge conflicts please |
I will need a lot more time to work on this. I'm not sure if we should close this for now or want to keep it open as a reminder... Fixing the merge conflicts right now does not really help, since this is a work in progress. (Yes, I know, half a year without any action is a long time.) |
As an end user I had dropped using Finder on my multilanguage site of 30K+ articles for the following reasons: |
In a joomla 3.8.2 site with less than 300 articles the smart search indexer runs for a few minutes and stops: The table '#__finder_tokens' is full. Re-indexing does not add new previously not indexed articles. It stops at the exact location of the previous run. This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12664. |
Speed of the indexer (smart search) is mostly effected by
Triggering content plugins is considered "necessary" to create the real text for indexing but it can
i think it should be configurable to triggering or not content plugins, it many websites do not need to add text that depends on plugins or in J4.0 we could require that plugins have a different method for the search indexer ? @onderzoekspraktijk |
@ggppdk: |
I'm still trying to find the time to work on this for Joomla 4.0. Wish me luck |
I do wish you luck, and time, and if it would help I also wish you a great group of sponsors! |
This is being made obsolete by among other things this: #20185 |
Joomla has two search components since Joomla 1.6, com_search and com_finder. Both have several issues and both lack features and useability. This PR is an attempt to largely rewrite both and their accompanying plugins and code. This is an ongoing development effort and submitted as a work in progress so that others can comment and help steer the direction.
Planned changes
I will update this list and mark done parts accordingly
Remove the partitioning tables
Finder currently has 16 tables to store the mapping between term and content. This was done for performance reasons, since these tables might become really big really fast and for large sites it would be better to have several smaller tables, or at least that was the theory. That theory might have been correct in 2008/2009, when Finder was written, (I don't know enough about this to judget that) but it is not correct in 2016 anymore.
Finder basically implemented database table partitioning in PHP, which I would call a bad idea. MySQL supports partitioning since 5.6. I would argue that we should drop this PHP implementation for several reasons:
To me it at least feels as if with these changes, indexing the sample data is quicker than before. Searching should also be a bit quicker. This will require some updates to the documentation. We have several places, where we talk about the 16 mapping tables.
Move backend com_finder into com_search and change frontend com_finder to a search plugin for com_search
The main goal is to merge the two components. com_search is needed, because there are several extensions out there that only provide a search plugin for com_search. com_finder on the other hand does provide the indexed search. The solution will be to move the necessary code from com_finder into com_search and change the search-code of com_finder (basically the search model of com_finder) into a search plugin. I would like to research the possibilities in using new plugin events and providing interfaces for a search plugin to implement, to both merge the "search" and "finder" plugins into one plugin and provide something that developers can work with. Old search plugins could be still used by providing a wrapper-legacy-plugin like we have for the legacy component routers.
Introduce result weighting and an overhaul of the search result data structure for com_search search plugins
Right now, com_search does not have any weighting done for the search results. Results simply come out the way the search plugins push them in, which could simply be the way they have been added to the database or completely random. For most cases however, we want search results that are ordered by relevance. For that, each result should have a weight that describes how relevant this result is and by which we can sort the whole result set. To prevent developers from messing with this by simply setting the weight to some ridicoulus number, so that the results from their plugin is always at the top, we limit the weights to a value between 0 and 1, using a default of 0.5 when no weight is given.
Improve indexing and searching in smart search
Instead of splitting words by spaces (and thus preventing languages like chinese from being indexed), we will have to provide a better, language specific tokenizer. Instead of hardcoding a few stemmers into the core codebase, we will "push" this to the translation teams. They will have to provide a stemmer in their language pack to stem the words properly. I will try to provide a few stemmers where I can get a hold of them. Instead of indexing tuples of 1-3 words, we are going to index just single words per default and longer tuples only on request. To be honest, I would remove multiple-word-indexes altogether. Searching should also be improved to not only return results for consecutive words (=3-word-tuples), but for words all over the content. Searching for "Beginner Joomla" in the sample data should return the "Beginners" article...
Improve filter and content map feature of Finder
I personally don't understand that feature yet and I think it is something that a lot of people have issues with. Maybe we can find some way to make this useable.
Code cleanup
The code is from somewhere around 2009 and while it is of high quality, some of it has been superseeded by changes in our framework. Also, with all the above changes, some cleanup will be necessary.
Notes
This PR is based on #12592. Since writing a dozen PRs up and down, I opted for creating just one big one instead.
I invite you to help me to make this happen. Feel free to comment or write code yourself and send a PR against my branch to get this done. 😄