New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate crippling performance of content search plugin for large sites with custom fields #18915

Merged
merged 15 commits into from Dec 3, 2017

Conversation

Projects
None yet
@GeraintEdwards
Contributor

GeraintEdwards commented Nov 29, 2017

Pull Request for Issue # .

For MySQL based sites with a large number of articles (more than a few hundred) that have custom fields enabled the com_content search plugin ends up being so slow that it can cripple a site completely. For a test dataset (details below) of 3,900 articles with 2 custom fields each the database query for a search took between 700,000 and 1,300,000 milliseconds to return a set of search results (on an i7-6700k 4Ghz 8 core processor with 32GB or ram) . In addition the search results for matching all on a multi-select field incorrectly returns zero results.

Postgres performs a LOT better (query times in the region of 400 milliseconds) but the changes described below still cut the query times by 50%.

Summary of Changes

The plugin has been re-written and rather than joining all the custom fields rows and searching for custom fields at the same time as all the other content fields we now search custom fields and incorporate the results into the main query. In Postgres this can be done as a subquery but MySQL treats this independent subquery as a 'dependent subquery' with punishing performance implications. This may explain in part why MySQL performs so badly with the existing code.

Testing Instructions

This test requires a large dataset of articles/custom fields to test. It can be generated using com_overload ( Copyright (C) 2011 Nicholas K. Dionysopoulos) - an updated version is attached below that will work in Joomla 3.8 and that also creates 2 custom fields and populates these with the articles.

You should install this component and view it in the backend. A decent dataset to test with would use values of 3 categories, 3 levels, 100 articles -> this will give 3900 articles (WARNING this will take several minutes to generate the data)

This adds a text field with random values of "one" to "seven" as text and a multi-select list field with 2 random values from opt1, opt2. opt3, opt4 and op5.

Then use the search module/com_search in the frontend and search for

"seven" or "opt1" or "opt1 opt2" etc.

Expected result

1; The correct set of results within a couple of seconds max.

  1. Search for 'opt1 opt2' and match all and it should return the matching set of articles

Actual result

  1. Go and make a cup of coffee and then drink it, call your friends, have another cup of coffee - depending on your processor you are likely to be waiting 20 minutes for the results to return.

  2. The search for 'opt1 opt2' and match all returns NO values which is NOT correct since there are several hundred matches

If you implement the patch you will get a VERY FAST result for 1. and the correct results for 2.

Discussion Point

I have tested the changes on Postgres and it supports a subquery rather than having a separate list of ids from a different query. I have not been able to test this on SQL server.

Documentation Changes Required

None

geraintedwards and others added some commits Oct 3, 2017

@GeraintEdwards

This comment has been minimized.

Show comment
Hide comment
@GeraintEdwards

GeraintEdwards Nov 29, 2017

Contributor

This is version of com_overload that can generate the test data set
com_overload.zip

Contributor

GeraintEdwards commented Nov 29, 2017

This is version of com_overload that can generate the test data set
com_overload.zip

@nibra

nibra suggested changes Nov 29, 2017 edited

Good approach, actually exactly what I was thinking of. Just a few things:

  1. Use Joomla's Code Style (fx, the diff is broken because you're not formatting your code correctly)
  2. Remove unused code instead of commenting it out
  3. In the cleanup loop, you have all the article ids for the custom fields already, so you could retrieve all of the custom fields at once outside of the loop.
@GeraintEdwards

This comment has been minimized.

Show comment
Hide comment
@GeraintEdwards

GeraintEdwards Nov 29, 2017

Contributor

Agree entirely about the article ids and fetching the custom fields in one go. We need to do the same elsewhere in custom fields too. I'll look at the code formatting - but my IDE is setup to match Joomla's standards already

Contributor

GeraintEdwards commented Nov 29, 2017

Agree entirely about the article ids and fetching the custom fields in one go. We need to do the same elsewhere in custom fields too. I'll look at the code formatting - but my IDE is setup to match Joomla's standards already

@mbabker

This comment has been minimized.

Show comment
Hide comment
@mbabker

mbabker Nov 29, 2017

Member

I'll look at the code formatting - but my IDE is setup to match Joomla's standards already

There's an extra indent on the method's code. Take that out and it should be mostly OK at a cursory glance.

Member

mbabker commented Nov 29, 2017

I'll look at the code formatting - but my IDE is setup to match Joomla's standards already

There's an extra indent on the method's code. Take that out and it should be mostly OK at a cursory glance.

@GeraintEdwards

This comment has been minimized.

Show comment
Hide comment
@GeraintEdwards

GeraintEdwards Nov 29, 2017

Contributor

Thanks @mbabker - not sure where the extra tab came from. That had tidied up the diff a lot.

Do we really care about aligning assignment operators?

@nibra I'll do a new PR later to implement the idea of picking up all the field values in one pass. We need to do the same elsewhere in the core (in my opinion too) instead of duplicated or repetitive queries.

Contributor

GeraintEdwards commented Nov 29, 2017

Thanks @mbabker - not sure where the extra tab came from. That had tidied up the diff a lot.

Do we really care about aligning assignment operators?

@nibra I'll do a new PR later to implement the idea of picking up all the field values in one pass. We need to do the same elsewhere in the core (in my opinion too) instead of duplicated or repetitive queries.

@brianteeman

This comment has been minimized.

Show comment
Hide comment
@brianteeman

brianteeman Nov 29, 2017

Contributor

Do we really care about aligning assignment operators?

it does make code more readable

Contributor

brianteeman commented Nov 29, 2017

Do we really care about aligning assignment operators?

it does make code more readable

}
if ($serverType == "mysql")
{

This comment has been minimized.

@csthomas

csthomas Nov 29, 2017

Contributor

IMO we always should use this variant, means 2 queries. Complex query take up too much memory.

@csthomas

csthomas Nov 29, 2017

Contributor

IMO we always should use this variant, means 2 queries. Complex query take up too much memory.

This comment has been minimized.

@GeraintEdwards

GeraintEdwards Nov 30, 2017

Contributor

In an ideal world the subquery approach should be the fastest since the database engine can do its own optimisations. Unfortunately MySQL had a bug where the independent sub query is treated as if it were dependent and gets executed once for every article searched (and a ridiculous performance hit).

Postgres treats the query correctly and is about twice as fast as a subquery compared to passing the field ids in as a string.

So its a trade off between easier to read code and a small performance hit in Postgres.

@GeraintEdwards

GeraintEdwards Nov 30, 2017

Contributor

In an ideal world the subquery approach should be the fastest since the database engine can do its own optimisations. Unfortunately MySQL had a bug where the independent sub query is treated as if it were dependent and gets executed once for every article searched (and a ridiculous performance hit).

Postgres treats the query correctly and is about twice as fast as a subquery compared to passing the field ids in as a string.

So its a trade off between easier to read code and a small performance hit in Postgres.

@alikon

This comment has been minimized.

Show comment
Hide comment
@alikon

alikon Nov 29, 2017

Contributor

i've made a quick test on mysql and after apply pr no need to

make a cup of coffee and then drink it, call your friends, have another cup of coffee

to get results 👍

i'll make some test on postgresql too

Contributor

alikon commented Nov 29, 2017

i've made a quick test on mysql and after apply pr no need to

make a cup of coffee and then drink it, call your friends, have another cup of coffee

to get results 👍

i'll make some test on postgresql too

Show outdated Hide outdated plugins/search/content/content.php Outdated
Show outdated Hide outdated plugins/search/content/content.php Outdated
Show outdated Hide outdated plugins/search/content/content.php Outdated
Show outdated Hide outdated plugins/search/content/content.php Outdated
code styling changes (as per comments)
removing commented out code
@angieradtke

This comment has been minimized.

Show comment
Hide comment
@angieradtke

angieradtke Nov 30, 2017

Contributor

I tested you changes at a real project: 8 Fields per article within 6000 articles in 4 different languages.
It seems to work .-)
Thanks Angie

Contributor

angieradtke commented Nov 30, 2017

I tested you changes at a real project: 8 Fields per article within 6000 articles in 4 different languages.
It seems to work .-)
Thanks Angie

@angieradtke

This comment has been minimized.

Show comment
Hide comment
@angieradtke

angieradtke Nov 30, 2017

Contributor

One more idea, maybe we should use a param; search in fields as well, so it is more flexible

Contributor

angieradtke commented Nov 30, 2017

One more idea, maybe we should use a param; search in fields as well, so it is more flexible

@laoneo

This comment has been minimized.

Show comment
Hide comment
@laoneo

laoneo Nov 30, 2017

Member

Thanks @GeraintEdwards for the proper test instructions. Now I could easily reproduce the issue on my dev server.
But is there really no way to do that server type independent and in a more easy way? I tried a different approach in #18929 to split up the queries. What do you guys think?

Member

laoneo commented Nov 30, 2017

Thanks @GeraintEdwards for the proper test instructions. Now I could easily reproduce the issue on my dev server.
But is there really no way to do that server type independent and in a more easy way? I tried a different approach in #18929 to split up the queries. What do you guys think?

@GeraintEdwards

This comment has been minimized.

Show comment
Hide comment
@GeraintEdwards

GeraintEdwards Nov 30, 2017

Contributor

Hi @laoneo - see my comments on your other PR.

Re the server type dependent code - as I explained in my comment in the code the subquery approach should be the fastest since the database engine can do its own optimisations. Unfortunately MySQL had a bug where the independent sub query is treated as if it were dependent and gets executed once for every article searched (and a ridiculous performance hit).

Postgres treats the query correctly and is about twice as fast as a subquery compared to passing the field ids in as a string.

So its a trade off between easier to read code and a small performance hit in Postgres.

Contributor

GeraintEdwards commented Nov 30, 2017

Hi @laoneo - see my comments on your other PR.

Re the server type dependent code - as I explained in my comment in the code the subquery approach should be the fastest since the database engine can do its own optimisations. Unfortunately MySQL had a bug where the independent sub query is treated as if it were dependent and gets executed once for every article searched (and a ridiculous performance hit).

Postgres treats the query correctly and is about twice as fast as a subquery compared to passing the field ids in as a string.

So its a trade off between easier to read code and a small performance hit in Postgres.

@GeraintEdwards

This comment has been minimized.

Show comment
Hide comment
@GeraintEdwards

GeraintEdwards Nov 30, 2017

Contributor

@angieradtke I suppose it does make sense to be able to mark specific fields as 'searchable' and other not. That would require a new config option for individual fields - @laoneo what do you think?

Contributor

GeraintEdwards commented Nov 30, 2017

@angieradtke I suppose it does make sense to be able to mark specific fields as 'searchable' and other not. That would require a new config option for individual fields - @laoneo what do you think?

@laoneo

This comment has been minimized.

Show comment
Hide comment
@laoneo

laoneo Nov 30, 2017

Member

What's the reason to not search in a field? If you want such an option, then I would add it as a parameter in the search plugin in which fields to search.

Member

laoneo commented Nov 30, 2017

What's the reason to not search in a field? If you want such an option, then I would add it as a parameter in the search plugin in which fields to search.

@laoneo

This comment has been minimized.

Show comment
Hide comment
@laoneo

laoneo Nov 30, 2017

Member

So its a trade off between easier to read code and a small performance hit in Postgres.

I'm more worried about other extension developers who want to implement a search plugin. Mostly the content search plugin acts as example. It is already very complex.

Member

laoneo commented Nov 30, 2017

So its a trade off between easier to read code and a small performance hit in Postgres.

I'm more worried about other extension developers who want to implement a search plugin. Mostly the content search plugin acts as example. It is already very complex.

@alikon

This comment has been minimized.

Show comment
Hide comment
@alikon

alikon Dec 2, 2017

Contributor

the offending statement is when we have a.id IN (cfv.item_id)
cause we are comparing the id field of #__content table wich is a SERIAL with the item_id field of #__fields_values table wich is a VARCHAR

we need an explicit cast

from
$wheres2[] = 'a.id IN( ' . (string) $subQuery . ')';
to
$wheres2[] = 'CAST(a.id AS VARCHAR) IN(' . (string) $subQuery . ')';

lines 125,177,216

Contributor

alikon commented Dec 2, 2017

the offending statement is when we have a.id IN (cfv.item_id)
cause we are comparing the id field of #__content table wich is a SERIAL with the item_id field of #__fields_values table wich is a VARCHAR

we need an explicit cast

from
$wheres2[] = 'a.id IN( ' . (string) $subQuery . ')';
to
$wheres2[] = 'CAST(a.id AS VARCHAR) IN(' . (string) $subQuery . ')';

lines 125,177,216

@alikon

use CAST for postgresql

Show outdated Hide outdated plugins/search/content/content.php Outdated
Show outdated Hide outdated plugins/search/content/content.php Outdated
Show outdated Hide outdated plugins/search/content/content.php Outdated
@alikon

This comment has been minimized.

Show comment
Hide comment
@alikon

alikon Dec 2, 2017

Contributor

with this changes in place even on postgresql the performance are better

Contributor

alikon commented Dec 2, 2017

with this changes in place even on postgresql the performance are better

@wilsonge wilsonge removed the RTC label Dec 2, 2017

@wilsonge

This comment has been minimized.

Show comment
Hide comment
@wilsonge

wilsonge Dec 2, 2017

Contributor

Removing RTC so we can fix postgres :)


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

Contributor

wilsonge commented Dec 2, 2017

Removing RTC so we can fix postgres :)


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

@GeraintEdwards

This comment has been minimized.

Show comment
Hide comment
@GeraintEdwards

GeraintEdwards Dec 3, 2017

Contributor

Really sorry about the Postgres - I had 2 development sites running and fixed the casting in the postgres version but hadn't committed from there.

JDatabaseQuery doesn't implement castAsInteger which would be the most efficient casting/matching so I'm using castAsChar or a.id for the match instead.

Contributor

GeraintEdwards commented Dec 3, 2017

Really sorry about the Postgres - I had 2 development sites running and fixed the casting in the postgres version but hadn't committed from there.

JDatabaseQuery doesn't implement castAsInteger which would be the most efficient casting/matching so I'm using castAsChar or a.id for the match instead.

@alikon

This comment has been minimized.

Show comment
Hide comment
@alikon

alikon Dec 3, 2017

Contributor

JDatabaseQuery doesn't implement castAsInteger which would be the most efficient casting/matching so I'm using castAsChar or a.id for the match instead.

good catch, i hope #18961 will fill the gap

we should remeber to use it, when and if #18961 will be merged, but for now we should go even without it 😍

Contributor

alikon commented Dec 3, 2017

JDatabaseQuery doesn't implement castAsInteger which would be the most efficient casting/matching so I'm using castAsChar or a.id for the match instead.

good catch, i hope #18961 will fill the gap

we should remeber to use it, when and if #18961 will be merged, but for now we should go even without it 😍

@alikon

This comment has been minimized.

Show comment
Hide comment
@alikon

alikon Dec 3, 2017

Contributor

I have tested this item successfully on 351b0c6

mysql & postrgesql


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

Contributor

alikon commented Dec 3, 2017

I have tested this item successfully on 351b0c6

mysql & postrgesql


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

@nibra

This comment has been minimized.

Show comment
Hide comment
@nibra

nibra Dec 3, 2017

Member

@angieradtke, @tonypartridge, could you please re-test?

Member

nibra commented Dec 3, 2017

@angieradtke, @tonypartridge, could you please re-test?

@angieradtke

This comment has been minimized.

Show comment
Hide comment
@angieradtke

angieradtke Dec 3, 2017

Contributor

Second retest succesful on a live project .-)

Contributor

angieradtke commented Dec 3, 2017

Second retest succesful on a live project .-)

@nibra

Only one CS issue left: Please add spaces around concat operators. Then all checks will pass and this is ready for commit.

Show outdated Hide outdated plugins/search/content/content.php Outdated
Show outdated Hide outdated plugins/search/content/content.php Outdated
Show outdated Hide outdated plugins/search/content/content.php Outdated
CS
@franz-wohlkoenig

This comment has been minimized.

Show comment
Hide comment
@franz-wohlkoenig

franz-wohlkoenig Dec 3, 2017

@angieradtke please mark your Test as successfully:

  • open Issue Tracker
  • Login with your github-Account
  • Click on blue "Test this"-Button above Authors-Picture
  • mark your Test as successfully
  • hit "submit test result"

franz-wohlkoenig commented Dec 3, 2017

@angieradtke please mark your Test as successfully:

  • open Issue Tracker
  • Login with your github-Account
  • Click on blue "Test this"-Button above Authors-Picture
  • mark your Test as successfully
  • hit "submit test result"
@angieradtke

This comment has been minimized.

Show comment
Hide comment
@angieradtke

angieradtke Dec 3, 2017

Contributor

I have tested this item successfully on 6e1c893


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

Contributor

angieradtke commented Dec 3, 2017

I have tested this item successfully on 6e1c893


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/18915.

@nibra

nibra approved these changes Dec 3, 2017

Thank you!

@wilsonge wilsonge merged commit 50e1d61 into joomla:staging Dec 3, 2017

1 of 4 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/drone/pr this build is pending
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
hound No violations found. Woof!
@wilsonge

This comment has been minimized.

Show comment
Hide comment
@wilsonge

wilsonge Dec 3, 2017

Contributor

I tweaked a couple of small c/s issues and merged. Thankyou so much for this :)

Contributor

wilsonge commented Dec 3, 2017

I tweaked a couple of small c/s issues and merged. Thankyou so much for this :)

@wilsonge wilsonge added this to the Joomla 3.8.3 milestone Dec 3, 2017

alikon added a commit to alikon/joomla-cms that referenced this pull request Dec 5, 2017

@joomlabeat

This comment has been minimized.

Show comment
Hide comment
@joomlabeat

joomlabeat Jan 18, 2018

Contributor

Do we actually need the GROUP BY statement in the plugin's query? It doesn't seem it's needful here and taking it off it results to an extra ~30% gain in performance.

Contributor

joomlabeat commented Jan 18, 2018

Do we actually need the GROUP BY statement in the plugin's query? It doesn't seem it's needful here and taking it off it results to an extra ~30% gain in performance.

@zero-24

This comment has been minimized.

Show comment
Hide comment
@zero-24

zero-24 Jan 18, 2018

Contributor

@joomlabeat as you can see this is a old issue / merged PR. Can you please open a new issue linking to this one? Else this get missed fast. Thanks!

Contributor

zero-24 commented Jan 18, 2018

@joomlabeat as you can see this is a old issue / merged PR. Can you please open a new issue linking to this one? Else this get missed fast. Thanks!

@nibra

This comment has been minimized.

Show comment
Hide comment
@nibra

nibra Jan 18, 2018

Member

Yes, the GROUP BY is required by the SQL standard (MySQL can deal without, others can't). The drivers are AFAIK not building the GROUP BY causes transparently if needed (which is what they should do).

Member

nibra commented Jan 18, 2018

Yes, the GROUP BY is required by the SQL standard (MySQL can deal without, others can't). The drivers are AFAIK not building the GROUP BY causes transparently if needed (which is what they should do).

@csthomas

This comment has been minimized.

Show comment
Hide comment
@csthomas

csthomas Jan 19, 2018

Contributor

@joomlabeat is right, line 288 can be completely removed.

There is only (#__content INNER JOIN #__categories), there are no duplicates and the query has no aggregate function.

Contributor

csthomas commented Jan 19, 2018

@joomlabeat is right, line 288 can be completely removed.

There is only (#__content INNER JOIN #__categories), there are no duplicates and the query has no aggregate function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment