when aggregating non-day periods, only store one datatable and blob row in memory at a time #20332
FYI, I haven't fully reviewed it yet but did some profiling.
Here's the result for this PR: peak memory was 163MB.
And here's the result for the 4.x-dev branch, where it used 347MB peak.
It also executes slightly faster with this PR 👍 🎉
This was for a date range of 2022-09-06,2022-11-26.
I can't test it over a longer period yet, but I think that would give a similar result.
The new method of querying archive data is only used for aggregating data. This is to avoid unintended performance regressions when just querying data from the API. Additionally, using a cursor would only be more memory efficient when aggregating datatables together, not when just querying them.
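To illustrate the cursor idea described above, here is a minimal sketch in Python rather than Matomo's PHP. It assumes a hypothetical `archive_blob` table with `name` and `value` columns and plain integer values instead of serialized datatables; none of these names are taken from the PR. Because the rows arrive ordered by name, each group can be aggregated and released before the next one is read.

```python
import sqlite3
from itertools import groupby

# Hypothetical, simplified stand-in for an archive blob table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive_blob (idarchive INTEGER, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO archive_blob VALUES (?, ?, ?)",
    [(1, "report_b", 4), (2, "report_a", 1), (1, "report_a", 2), (2, "report_b", 6)],
)


def aggregate_by_name(connection):
    """Stream rows ordered by name and yield one aggregated total per name."""
    # Iterating the cursor reads rows lazily instead of calling fetchall().
    cursor = connection.execute("SELECT name, value FROM archive_blob ORDER BY name")
    # groupby only works here because the query orders rows by the grouping key.
    for name, rows in groupby(cursor, key=lambda row: row[0]):
        yield name, sum(value for _, value in rows)


totals = list(aggregate_by_name(conn))
```

The ordering is what makes this work: without `ORDER BY`, rows for the same name could be interleaved and every partial result would have to be kept until the end.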
Looking at the test result there, maybe it could also be used in the API? That's where we're currently having problems. Or was the new implementation slower for you?
I didn't actually test it, but it shouldn't make a difference there; it should only make a difference when aggregating (browser triggered or core:archive triggered). In that case, we fetch datatables for multiple dates and combine them, so the peak memory use can be lowered to one datatable tree. For just querying through the API, though, all the data has to be returned, so it should make no difference whether we use a cursor or fetch everything at once. Everything has to be in memory no matter what (unless we modify the Matomo renderers to stream data out instead of writing it all at once, which would be an enormous change). Anyway, the SELECT query is far more complicated now and I don't know what that would do under load, so I thought it was better to constrain the effects of this change for now.
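The difference between the two cases can be sketched roughly as follows. This is a Python illustration, not Matomo code: `DAILY_REPORTS`, `iter_daily_tables` and `aggregate` are hypothetical names, and a report is reduced to a flat label-to-visits mapping. When aggregating, only the running total and the table currently being merged need to be alive; an API response, by contrast, would need every table at once.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical per-day reports: label -> visit count.
DAILY_REPORTS = {
    "2022-09-06": {"/home": 3, "/blog": 1},
    "2022-09-07": {"/home": 2, "/pricing": 4},
    "2022-09-08": {"/blog": 5},
}


def iter_daily_tables(days: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield one day's table at a time, the way a database cursor would."""
    for day in days:
        yield DAILY_REPORTS[day]


def aggregate(tables: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum rows by label, touching one input table at a time."""
    total: Dict[str, int] = {}
    for table in tables:
        for label, visits in table.items():
            total[label] = total.get(label, 0) + visits
    return total


result = aggregate(iter_daily_tables(sorted(DAILY_REPORTS)))
```

In this toy version the source data is already in memory, so nothing is saved; the point is only the shape of the loop, where each yielded table can be discarded as soon as it has been merged.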
@diosmosis can you maybe send a link to where this different behaviour is implemented? Looking at the PR, I can't find different behaviour for the API vs. others. Or maybe I don't fully understand what is meant by API and why we need to fetch all records there anyway.
matomo/core/DataAccess/ArchiveSelector.php Lines 142 to 244 in 46679d1
matomo/core/DataAccess/ArchiveSelector.php Lines 452 to 519 in 46679d1
matomo/core/DataAccess/ArchiveSelector.php Lines 559 to 565 in 46679d1
Hi @diosmosis. I understand now. I debugged the code and now see what's happening and understand the extra queries, and also the many more queries you mean. I was actually hoping there could be only one query but then in the generator itself we only call
This should be what happens? There should only be one query per archive table. Can you elaborate? EDIT: when I said 'under load' I meant 'in production when used for every API blob query, not just when aggregating', and I was referring to the complicated ORDER BY that's used to order by idSubtable, which isn't actually needed for API requests that just fetch data without doing any aggregation.
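To make the ordering concern concrete, here is a rough Python sketch of the kind of key such an ORDER BY has to compute. It assumes, purely for illustration, that a root blob is named after its record and a subtable blob is named `<record>_<idSubtable>`; chunked blobs are ignored, and `subtable_sort_key` is a hypothetical helper, not something from the PR. An empty extracted ID (the root table) is ordered before everything else, and the remaining IDs are compared as numbers rather than as strings.

```python
from typing import List, Tuple


def subtable_sort_key(blob_name: str, record_name: str) -> Tuple[int, int]:
    """Order the root blob first, then subtables by numeric ID."""
    # Assumed naming scheme: "<record>" or "<record>_<idSubtable>".
    suffix = blob_name[len(record_name):].lstrip("_")
    if suffix == "":
        return (0, 0)  # empty extraction means the root table: sort it first
    return (1, int(suffix))


names: List[str] = ["r_10", "r_2", "r", "r_1"]
ordered = sorted(names, key=lambda name: subtable_sort_key(name, "r"))
```

A plain string sort would place `r_10` before `r_2`, which is why the ID has to be extracted and treated as a number.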
@diosmosis I'm so sorry. It actually does work like that. It only looked otherwise because one of the date ranges had no data, so it would always start again at the top of the function and issue the query again, but this is not the case when there are matching rows. I completely missed it. From my perspective this all looks good 💯 Not sure if @matomo-org/core-reviewers have any other thoughts? Ideally, we could even merge this into a 4.14.0 release, as it would be great to have this change on the Cloud earlier, and I believe there isn't a breaking change.
Interesting, I think that probably shouldn't happen, I'll take a look there. EDIT: Oh I see, never mind, I thought you meant the loop would go on forever in that case, but it shouldn't :)
@diosmosis it doesn't. It's all good. It was just querying for different record names, which is to be expected. I didn't realise the record names changed while debugging 👍
Had a look through the code. To me the changes look reasonable.
@tsteur I didn't debug the code or check any memory usage yet, as you already did that. Or shall I invest some additional time to do that as well?
tests/PHPUnit/System/expected/test_reportLimiting__Resolution.getConfiguration_month.xml
I did check memory usage, and it definitely improved (it went down). I also debugged the code, etc.
…ow in memory at a time (#20332)
* proof of concept one blob table row at a time method of aggregating week, month, year, range data
* sort blobs by subtable ID when chunk is being read
* simplify code w/ generators
* make sure single blob query is ordered by name correctly
* REGEXP_SUBSTR() is only available in mysql 8 :/
* fix a couple test failures
* by default when aggregating tables in ArchiveProcessor sort by visits if the table contains visits
* try fixing random test failures
* debug ci only error
* undo debugging change
* fixing some system tests
* refactor ArchiveSelector code for more code reuse
* add some code documentation
* remove DataCollection::forEachBlobExpanded() since it is no longer used + a couple other small changes
* try debugging ci only random failure
* remove previous debugging code
* more debugging
* more ci debugging
* trigger build again and try to get more information for random failure
* fix convoluted sql replacement for REGEXP_SUBSTRING
* fix idsubtable extraction, need to check if extracted value is an empty string and order it before everything else if so
* add log in case blob table order is incorrect
* add tests for subtable extraction sql
* remove unused import
Co-authored-by: Stefan Giehl <stefan@matomo.org>
…ow in memory at a time (#20332) (#20512)
* proof of concept one blob table row at a time method of aggregating week, month, year, range data
* sort blobs by subtable ID when chunk is being read
* simplify code w/ generators
* make sure single blob query is ordered by name correctly
* REGEXP_SUBSTR() is only available in mysql 8 :/
* fix a couple test failures
* by default when aggregating tables in ArchiveProcessor sort by visits if the table contains visits
* try fixing random test failures
* debug ci only error
* undo debugging change
* fixing some system tests
* refactor ArchiveSelector code for more code reuse
* add some code documentation
* remove DataCollection::forEachBlobExpanded() since it is no longer used + a couple other small changes
* try debugging ci only random failure
* remove previous debugging code
* more debugging
* more ci debugging
* trigger build again and try to get more information for random failure
* fix convoluted sql replacement for REGEXP_SUBSTRING
* fix idsubtable extraction, need to check if extracted value is an empty string and order it before everything else if so
* add log in case blob table order is incorrect
* add tests for subtable extraction sql
* remove unused import
Co-authored-by: dizzy <diosmosis@users.noreply.github.com>
Co-authored-by: Stefan Giehl <stefan@matomo.org>
Description:
Fixes #18295
Changes:
* … Db::fetchAll(). The result is ordered so the datatables can be aggregated w/o needing the entire datatable tree in memory.
* … Archive::querySingleBlob() to query blob data and aggregate the result w/o loading entire trees in memory.
Notes: