Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate crippling performance of content search plugin for large sites with custom fields #18915

Merged
merged 15 commits into from Dec 3, 2017

Conversation

@GeraintEdwards
Copy link
Contributor

@GeraintEdwards GeraintEdwards commented Nov 29, 2017

Pull Request for Issue # .

For MySQL based sites with a large number of articles (more than a few hundred) that have custom fields enabled the com_content search plugin ends up being so slow that it can cripple a site completely. For a test dataset (details below) of 3,900 articles with 2 custom fields each the database query for a search took between 700,000 and 1,300,000 milliseconds to return a set of search results (on an i7-6700k 4Ghz 8 core processor with 32GB or ram) . In addition the search results for matching all on a multi-select field incorrectly returns zero results.

Postgres performs a LOT better (query times in the region of 400 milliseconds) but the changes described below still cut the query times by 50%.

Summary of Changes

The plugin has been re-written and rather than joining all the custom fields rows and searching for custom fields at the same time as all the other content fields we now search custom fields and incorporate the results into the main query. In Postgres this can be done as a subquery but MySQL treats this independent subquery as a 'dependent subquery' with punishing performance implications. This may explain in part why MySQL performs so badly with the existing code.

Testing Instructions

This test requires a large dataset of articles/custom fields to test. It can be generated using com_overload ( Copyright (C) 2011 Nicholas K. Dionysopoulos) - an updated version is attached below that will work in Joomla 3.8 and that also creates 2 custom fields and populates these with the articles.

You should install this component and view it in the backend. A decent dataset to test with would use values of 3 categories, 3 levels, 100 articles -> this will give 3900 articles (WARNING this will take several minutes to generate the data)

This adds a text field with random values of "one" to "seven" as text and a multi-select list field with 2 random values from opt1, opt2. opt3, opt4 and op5.

Then use the search module/com_search in the frontend and search for

"seven" or "opt1" or "opt1 opt2" etc.

Expected result

1; The correct set of results within a couple of seconds max.

  1. Search for 'opt1 opt2' and match all and it should return the matching set of articles

Actual result

  1. Go and make a cup of coffee and then drink it, call your friends, have another cup of coffee - depending on your processor you are likely to be waiting 20 minutes for the results to return.

  2. The search for 'opt1 opt2' and match all returns NO values which is NOT correct since there are several hundred matches

If you implement the patch you will get a VERY FAST result for 1. and the correct results for 2.

Discussion Point

I have tested the changes on Postgres and it supports a subquery rather than having a separate list of ids from a different query. I have not been able to test this on SQL server.

Documentation Changes Required

None

geraintedwards and others added 5 commits Oct 3, 2017
@GeraintEdwards
Copy link
Contributor Author

@GeraintEdwards GeraintEdwards commented Nov 29, 2017

This is version of com_overload that can generate the test data set
com_overload.zip

Copy link
Member

@nibra nibra left a comment

Good approach, actually exactly what I was thinking of. Just a few things:

  1. Use Joomla's Code Style (fx, the diff is broken because you're not formatting your code correctly)
  2. Remove unused code instead of commenting it out
  3. In the cleanup loop, you have all the article ids for the custom fields already, so you could retrieve all of the custom fields at once outside of the loop.
@GeraintEdwards
Copy link
Contributor Author

@GeraintEdwards GeraintEdwards commented Nov 29, 2017

Agree entirely about the article ids and fetching the custom fields in one go. We need to do the same elsewhere in custom fields too. I'll look at the code formatting - but my IDE is setup to match Joomla's standards already

@mbabker
Copy link
Contributor

@mbabker mbabker commented Nov 29, 2017

I'll look at the code formatting - but my IDE is setup to match Joomla's standards already

There's an extra indent on the method's code. Take that out and it should be mostly OK at a cursory glance.

@GeraintEdwards
Copy link
Contributor Author

@GeraintEdwards GeraintEdwards commented Nov 29, 2017

Thanks @mbabker - not sure where the extra tab came from. That had tidied up the diff a lot.

Do we really care about aligning assignment operators?

@nibra I'll do a new PR later to implement the idea of picking up all the field values in one pass. We need to do the same elsewhere in the core (in my opinion too) instead of duplicated or repetitive queries.

@brianteeman
Copy link
Member

@brianteeman brianteeman commented Nov 29, 2017

Do we really care about aligning assignment operators?

it does make code more readable

}

if ($serverType == "mysql")
{

This comment has been minimized.

@csthomas

csthomas Nov 29, 2017
Contributor

IMO we always should use this variant, means 2 queries. Complex query take up too much memory.

This comment has been minimized.

@GeraintEdwards

GeraintEdwards Nov 30, 2017
Author Contributor

In an ideal world the subquery approach should be the fastest since the database engine can do its own optimisations. Unfortunately MySQL had a bug where the independent sub query is treated as if it were dependent and gets executed once for every article searched (and a ridiculous performance hit).

Postgres treats the query correctly and is about twice as fast as a subquery compared to passing the field ids in as a string.

So its a trade off between easier to read code and a small performance hit in Postgres.

@alikon
Copy link
Contributor

@alikon alikon commented Nov 29, 2017

i've made a quick test on mysql and after apply pr no need to

make a cup of coffee and then drink it, call your friends, have another cup of coffee

to get results 👍

i'll make some test on postgresql too


if ($serverType == "mysql")
{

This comment has been minimized.

@Quy

Quy Nov 29, 2017
Contributor

Remove blank line.

$fieldids = $db->loadColumn();
if (count($fieldids))
{

This comment has been minimized.

@Quy

Quy Nov 29, 2017
Contributor

Remove blank line.


if ($serverType == "mysql")
{

This comment has been minimized.

@Quy

Quy Nov 29, 2017
Contributor

Remove blank line.

$fieldids = $db->loadColumn();
if (count($fieldids))
{

This comment has been minimized.

@Quy

Quy Nov 29, 2017
Contributor

Remove blank line.

->where('(f.context IS NULL OR f.context = ' . $db->q('com_content.article') . ')')
->where('(f.state IS NULL OR f.state = 1)')
->where('(f.access IS NULL OR f.access IN (' . $groups . '))');
*/

This comment has been minimized.

@nibra

nibra Nov 29, 2017
Member

Remove this dead code

This comment has been minimized.

@Quy

Quy Nov 30, 2017
Contributor

Do you still need the comment? If yes, neded is misspelled.

removing commented out code
@angieradtke
Copy link
Contributor

@angieradtke angieradtke commented Nov 30, 2017

I tested you changes at a real project: 8 Fields per article within 6000 articles in 4 different languages.
It seems to work .-)
Thanks Angie

@angieradtke
Copy link
Contributor

@angieradtke angieradtke commented Nov 30, 2017

One more idea, maybe we should use a param; search in fields as well, so it is more flexible

@laoneo
Copy link
Member

@laoneo laoneo commented Nov 30, 2017

Thanks @GeraintEdwards for the proper test instructions. Now I could easily reproduce the issue on my dev server.
But is there really no way to do that server type independent and in a more easy way? I tried a different approach in #18929 to split up the queries. What do you guys think?

@GeraintEdwards
Copy link
Contributor Author

@GeraintEdwards GeraintEdwards commented Nov 30, 2017

Hi @laoneo - see my comments on your other PR.

Re the server type dependent code - as I explained in my comment in the code the subquery approach should be the fastest since the database engine can do its own optimisations. Unfortunately MySQL had a bug where the independent sub query is treated as if it were dependent and gets executed once for every article searched (and a ridiculous performance hit).

Postgres treats the query correctly and is about twice as fast as a subquery compared to passing the field ids in as a string.

So its a trade off between easier to read code and a small performance hit in Postgres.

@GeraintEdwards
Copy link
Contributor Author

@GeraintEdwards GeraintEdwards commented Nov 30, 2017

@angieradtke I suppose it does make sense to be able to mark specific fields as 'searchable' and other not. That would require a new config option for individual fields - @laoneo what do you think?

@laoneo
Copy link
Member

@laoneo laoneo commented Nov 30, 2017

What's the reason to not search in a field? If you want such an option, then I would add it as a parameter in the search plugin in which fields to search.

@laoneo
Copy link
Member

@laoneo laoneo commented Nov 30, 2017

So its a trade off between easier to read code and a small performance hit in Postgres.

I'm more worried about other extension developers who want to implement a search plugin. Mostly the content search plugin acts as example. It is already very complex.

@alikon
Copy link
Contributor

@alikon alikon commented Dec 2, 2017

the offending statement is when we have a.id IN (cfv.item_id)
cause we are comparing the id field of #__content table wich is a SERIAL with the item_id field of #__fields_values table wich is a VARCHAR

we need an explicit cast

from
$wheres2[] = 'a.id IN( ' . (string) $subQuery . ')';
to
$wheres2[] = 'CAST(a.id AS VARCHAR) IN(' . (string) $subQuery . ')';

lines 125,177,216

Copy link
Contributor

@alikon alikon left a comment

use CAST for postgresql

}
else
{
$wheres[] = 'a.id IN( ' . (string) $subQuery . ')';

This comment has been minimized.

@alikon

alikon Dec 2, 2017
Contributor

use CAST for postgresql

}
else
{
$wheres2[] = 'a.id IN( ' . (string) $subQuery . ')';

This comment has been minimized.

@alikon

alikon Dec 2, 2017
Contributor

use CAST for postgresql

$fieldids = $db->loadColumn();
if (count($fieldids))
{
$wheres2[] = 'a.id IN(' . implode(",", $fieldids) . ')';

This comment has been minimized.

@alikon

alikon Dec 2, 2017
Contributor

use CAST for postgresql

@alikon
Copy link
Contributor

@alikon alikon commented Dec 2, 2017

with this changes in place even on postgresql the performance are better

@wilsonge wilsonge removed the RTC label Dec 2, 2017
@wilsonge
Copy link
Contributor

@wilsonge wilsonge commented Dec 2, 2017

Removing RTC so we can fix postgres :)


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

@GeraintEdwards
Copy link
Contributor Author

@GeraintEdwards GeraintEdwards commented Dec 3, 2017

Really sorry about the Postgres - I had 2 development sites running and fixed the casting in the postgres version but hadn't committed from there.

JDatabaseQuery doesn't implement castAsInteger which would be the most efficient casting/matching so I'm using castAsChar or a.id for the match instead.

@alikon
Copy link
Contributor

@alikon alikon commented Dec 3, 2017

JDatabaseQuery doesn't implement castAsInteger which would be the most efficient casting/matching so I'm using castAsChar or a.id for the match instead.

good catch, i hope #18961 will fill the gap

we should remeber to use it, when and if #18961 will be merged, but for now we should go even without it 😍

@alikon
Copy link
Contributor

@alikon alikon commented Dec 3, 2017

I have tested this item successfully on 351b0c6

mysql & postrgesql


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

@nibra
Copy link
Member

@nibra nibra commented Dec 3, 2017

@angieradtke, @tonypartridge, could you please re-test?

@angieradtke
Copy link
Contributor

@angieradtke angieradtke commented Dec 3, 2017

Second retest succesful on a live project .-)

Copy link
Member

@nibra nibra left a comment

Only one CS issue left: Please add spaces around concat operators. Then all checks will pass and this is ready for commit.

}
else
{
$wheres2[] = $subQuery->castAsChar('a.id').' IN( ' . (string) $subQuery . ')';

This comment has been minimized.

@nibra

nibra Dec 3, 2017
Member

Please add spaces around concat operator

}
else
{
$wheres2[] = $subQuery->castAsChar('a.id').' IN( ' . (string) $subQuery . ')';

This comment has been minimized.

@nibra

nibra Dec 3, 2017
Member

Please add spaces around concat operator

}
else
{
$wheres[] = $subQuery->castAsChar('a.id').' IN( ' . (string) $subQuery . ')';

This comment has been minimized.

@nibra

nibra Dec 3, 2017
Member

Please add spaces around concat operator

CS
@ghost
Copy link

@ghost ghost commented Dec 3, 2017

@angieradtke please mark your Test as successfully:

  • open Issue Tracker
  • Login with your github-Account
  • Click on blue "Test this"-Button above Authors-Picture
  • mark your Test as successfully
  • hit "submit test result"
@angieradtke
Copy link
Contributor

@angieradtke angieradtke commented Dec 3, 2017

I have tested this item successfully on 6e1c893


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

@nibra
nibra approved these changes Dec 3, 2017
Copy link
Member

@nibra nibra left a comment

Thank you!

@wilsonge wilsonge merged commit 50e1d61 into joomla:staging Dec 3, 2017
1 of 4 checks passed
1 of 4 checks passed
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/drone/pr this build is pending
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
hound No violations found. Woof!
@wilsonge
Copy link
Contributor

@wilsonge wilsonge commented Dec 3, 2017

I tweaked a couple of small c/s issues and merged. Thankyou so much for this :)

@wilsonge wilsonge added this to the Joomla 3.8.3 milestone Dec 3, 2017
alikon added a commit to alikon/joomla-cms that referenced this pull request Dec 5, 2017
@joomlabeat
Copy link
Contributor

@joomlabeat joomlabeat commented Jan 18, 2018

Do we actually need the GROUP BY statement in the plugin's query? It doesn't seem it's needful here and taking it off it results to an extra ~30% gain in performance.

@zero-24
Copy link
Member

@zero-24 zero-24 commented Jan 18, 2018

@joomlabeat as you can see this is a old issue / merged PR. Can you please open a new issue linking to this one? Else this get missed fast. Thanks!

@nibra
Copy link
Member

@nibra nibra commented Jan 18, 2018

Yes, the GROUP BY is required by the SQL standard (MySQL can deal without, others can't). The drivers are AFAIK not building the GROUP BY causes transparently if needed (which is what they should do).

@csthomas
Copy link
Contributor

@csthomas csthomas commented Jan 19, 2018

@joomlabeat is right, line 288 can be completely removed.

There is only (#__content INNER JOIN #__categories), there are no duplicates and the query has no aggregate function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet