Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.0] Refactoring com_search & com_finder #12664

Closed
wants to merge 9 commits into from

Conversation

Hackwar
Copy link
Member

@Hackwar Hackwar commented Oct 31, 2016

Joomla has two search components since Joomla 1.6, com_search and com_finder. Both have several issues and both lack features and useability. This PR is an attempt to largely rewrite both and their accompanying plugins and code. This is an ongoing development effort and submitted as a work in progress so that others can comment and help steer the direction.

Planned changes

  • Remove the partitioning tables
  • Move backend com_finder into com_search and change frontend com_finder to a search plugin for com_search
  • Introduce result weighting and an overhaul of the search result data structure for com_search search plugins
  • Improve indexing and searching in smart search by using proper stemmers, configurable tuple length, etc. In this step, both the stemmer and tokenizer have to move into the language packs, thus allowing support for chinese, etc. Also reviewing the search process itself, since I think there are some flaws in there...
  • Understand the filter and content map feature of Finder and see how this could be improved and maybe made more understandable.
  • Code cleanup.

I will update this list and mark done parts accordingly

Remove the partitioning tables

Finder currently has 16 tables to store the mapping between term and content. This was done for performance reasons, since these tables might become really big really fast and for large sites it would be better to have several smaller tables, or at least that was the theory. That theory might have been correct in 2008/2009, when Finder was written, (I don't know enough about this to judget that) but it is not correct in 2016 anymore.

Finder basically implemented database table partitioning in PHP, which I would call a bad idea. MySQL supports partitioning since 5.6. I would argue that we should drop this PHP implementation for several reasons:

  1. The performance gain for this partitioning only really has an effect on large sites. Those sites however should rather implement real partitioning in the database than using this here.
  2. For smaller sites it actually has a detrimental effect
  3. It is a lot of code and additional tables that confuse people, including me.

To me it at least feels as if with these changes, indexing the sample data is quicker than before. Searching should also be a bit quicker. This will require some updates to the documentation. We have several places, where we talk about the 16 mapping tables.

Move backend com_finder into com_search and change frontend com_finder to a search plugin for com_search

The main goal is to merge the two components. com_search is needed, because there are several extensions out there that only provide a search plugin for com_search. com_finder on the other hand does provide the indexed search. The solution will be to move the necessary code from com_finder into com_search and change the search-code of com_finder (basically the search model of com_finder) into a search plugin. I would like to research the possibilities in using new plugin events and providing interfaces for a search plugin to implement, to both merge the "search" and "finder" plugins into one plugin and provide something that developers can work with. Old search plugins could be still used by providing a wrapper-legacy-plugin like we have for the legacy component routers.

Introduce result weighting and an overhaul of the search result data structure for com_search search plugins

Right now, com_search does not have any weighting done for the search results. Results simply come out the way the search plugins push them in, which could simply be the way they have been added to the database or completely random. For most cases however, we want search results that are ordered by relevance. For that, each result should have a weight that describes how relevant this result is and by which we can sort the whole result set. To prevent developers from messing with this by simply setting the weight to some ridicoulus number, so that the results from their plugin is always at the top, we limit the weights to a value between 0 and 1, using a default of 0.5 when no weight is given.

Improve indexing and searching in smart search

Instead of splitting words by spaces (and thus preventing languages like chinese from being indexed), we will have to provide a better, language specific tokenizer. Instead of hardcoding a few stemmers into the core codebase, we will "push" this to the translation teams. They will have to provide a stemmer in their language pack to stem the words properly. I will try to provide a few stemmers where I can get a hold of them. Instead of indexing tuples of 1-3 words, we are going to index just single words per default and longer tuples only on request. To be honest, I would remove multiple-word-indexes altogether. Searching should also be improved to not only return results for consecutive words (=3-word-tuples), but for words all over the content. Searching for "Beginner Joomla" in the sample data should return the "Beginners" article...

Improve filter and content map feature of Finder

I personally don't understand that feature yet and I think it is something that a lot of people have issues with. Maybe we can find some way to make this useable.

Code cleanup

The code is from somewhere around 2009 and while it is of high quality, some of it has been superseeded by changes in our framework. Also, with all the above changes, some cleanup will be necessary.

Notes

This PR is based on #12592. Since writing a dozen PRs up and down, I opted for creating just one big one instead.
I invite you to help me to make this happen. Feel free to comment or write code yourself and send a PR against my branch to get this done. 😄

@Hackwar
Copy link
Member Author

Hackwar commented Nov 10, 2016

So, I've been looking into the tokenizer of com_finder and did some testing. The tokenizer of com_finder splits strings in steps of 2kb at a time and then processes it. Quite frankly, I think that is bullshit. We are not working with Arduinos that virtualise a CPU and what not, so that we have to expect to exhaust our resources with 2kb of data. So I tried a tokenizer that does all the splitting in a better way and at the same time does not split this up in god knows how many steps. I then took a stemmer into the equation and stemmed all words and measured the time all of this took. To test this, I took a 6MB file with several books in it (Sherlock Holmes, War & Peace, The History of the USA, 1.1 million words). Tokenizing that into single words takes 1.4 seconds on my machine, sending that array of words through array_unique() (down to 32000 words) pushes that to 4 seconds and sending all of this through a Porter Stemmer brings us to a total of 12 seconds. Again calling array_unique() on this didn't change the execution time (it actually consistently pushed this down from 12s to 11.5s) but brought the word count down to 20000 words.

So I'm going to remove such pseudo-optimizations from the codebase. Joomla should try to be grounded in reality and yes, there might be websites out there that have gigantic texts that are even larger than this, but I doubt that those will be done with Joomla. Now my text was plaintext without any HTML tags. Adding those in would most likely inflate the text by 20-30%. Adding another 100% onto this for good measure and we are still below the magic number of 30s for execution time. But we are at the limit of the #__content introtext column, which takes 16MB max. I guess it is safe to say that 99.9% of all content in a single article in Joomla sites is going to be smaller than War and Peace and definitely smaller than this test-file that I used. So I'm pretty sure that we can take the naive approach here. There can't be that crappy hosts out there with such big sites that this is going to be a problem.

@Hackwar
Copy link
Member Author

Hackwar commented Nov 10, 2016

Doing all of this to just War and Peace alone brings this down to 5s. The file would still be 3.2 MB

Implementing language based tokenizer
Moving stuff around
…4search3

Conflicts:
	administrator/components/com_finder/helpers/indexer/stemmer.php
	administrator/components/com_search/helpers/indexer/indexer.php
	administrator/components/com_search/helpers/indexer/query.php
	administrator/components/com_search/views/index/tmpl/default.php
	components/com_finder/views/search/view.html.php
	components/com_search/models/search_old.php

// Check if a parser exists for the format.
if (file_exists($path))
{
// Instantiate the parser.
JLoader::register($class, $path);
include_once $path;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert this change please.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do. That slipped in when pulling in the changes from origin...

@@ -336,7 +334,7 @@ protected function tokenizeToDb($input, $context, $lang, $format)
$string = substr($buffer, 0, $ls);

// Adjust the buffer based on the last space for the next iteration and trim.
$buffer = StringHelper::trim(substr($buffer, $ls));
$buffer = JString::trim(substr($buffer, $ls));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise to the above

@@ -10,20 +10,19 @@
defined('_JEXEC') or die;

use Joomla\Registry\Registry;
use Joomla\String\StringHelper;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this (and all uses) here.

@mbabker
Copy link
Contributor

mbabker commented Nov 12, 2016

Merging these two components just doesn't feel right. Ya, it's an interesting situation shipping two search components as part of core, but the two components take such vastly different approaches to the concept of search that I don't think you can sanely manage both of those through a single component and use a plugin to "bridge" the more advanced and extremely technical approach into the "simple" component. The other changes look promising and should be continued on, it's just this one aspect that I'm not quite comfortable with at the moment.

@brianteeman brianteeman changed the title Joomla 4.0: Refactoring com_search & com_finder [4.0] Refactoring com_search & com_finder May 20, 2017
@brianteeman
Copy link
Contributor

@Hackwar could you resolve the merge conflicts please

@Hackwar
Copy link
Member Author

Hackwar commented May 21, 2017

I will need a lot more time to work on this. I'm not sure if we should close this for now or want to keep it open as a reminder... Fixing the merge conflicts right now does not really help, since this is a work in progress. (Yes, I know, half a year without any action is a long time.)

@brianteeman brianteeman modified the milestone: Joomla 4.0 Jun 8, 2017
@humblehumanbeing
Copy link

As an end user I had dropped using Finder on my multilanguage site of 30K+ articles for the following reasons:
1- Finder indexer at backend was starting, doing some job but was unable to index all site in most cases because of PHP/MySQL restrictions. I needed to run indexer many times using backend or cli indexer.
2- Finder tables were topping a few gigs in total after indexing was finished in some way. It tries indexing every portion of a sentence and thus indexes unnecessary search terms. It certainly needs a vast list of terms in every language that will not be included in Finder tables.
3- Finder was unable to index some multi-byte languages such as Chinese

@onderzoekspraktijk
Copy link

In a joomla 3.8.2 site with less than 300 articles the smart search indexer runs for a few minutes and stops: The table '#__finder_tokens' is full. Re-indexing does not add new previously not indexed articles. It stops at the exact location of the previous run.
This error message keeps appearing more quickly after every update.
Joomla once had professional search, I am sorry to see it deteriorate like this.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12664.

@ggppdk
Copy link
Contributor

ggppdk commented Dec 4, 2017

In a joomla 3.8.2 site with less than 300 articles the smart search indexer runs for a few minutes

Speed of the indexer (smart search) is mostly effected by

  • content plugin triggering
  • db updating

It stops at the exact location of the previous run.

Triggering content plugins is considered "necessary" to create the real text for indexing
(the above is for the general case, many websites simply do not really need it),

but it can

  • severly effect speed depending on the (core or 3rd party) content/system plugins having code for the event for 'Content Preparing'
  • and it can cause errors, so if you check the server response at the network TAB of your browser console you will propably find the plugin causing this issue, also you might want to set error reporting to maximum and enable DEBUG mode

i think it should be configurable to triggering or not content plugins, it many websites do not need to add text that depends on plugins

or in J4.0 we could require that plugins have a different method for the search indexer ?
e.g. onContentPrepareSearchText

@onderzoekspraktijk
about debugging your issue, better post in forums, not in this PR

@onderzoekspraktijk
Copy link

@ggppdk:
Thanks for responding to my cri de coeur. Smart search in my opinion is beyond debugging, it is very unstable where once it was very robust. I have been told it is very complicated software that really nobody wants to rewrite. I fear to use it now, and I find this is a great loss to Joomla.

@Hackwar
Copy link
Member Author

Hackwar commented Dec 5, 2017

I'm still trying to find the time to work on this for Joomla 4.0. Wish me luck

@onderzoekspraktijk
Copy link

@Hackwar:

I do wish you luck, and time, and if it would help I also wish you a great group of sponsors!

@Hackwar
Copy link
Member Author

Hackwar commented Apr 17, 2018

This is being made obsolete by among other things this: #20185

@Hackwar Hackwar closed this Apr 17, 2018
@Hackwar Hackwar deleted the j4search3 branch May 18, 2019 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants