
when aggregating non-day periods, only store one datatable and blob row in memory at a time #20332

Merged
merged 29 commits into 5.x-dev from 18295-blob-querying-cursor on Mar 20, 2023

Conversation

@diosmosis (Member) commented Feb 7, 2023

Description:

Fixes #18295

Changes:

  • Add an Archive::querySingleBlob() method that queries archive data for a single blob using a cursor instead of Db::fetchAll(). The result is ordered so the datatables can be aggregated without needing the entire datatable tree in memory.
  • In ArchiveProcessor::aggregateDataTableRecords(), provide a default sort column if none is specified: if the table includes nb_visits, we sort by it before truncation. This ensures predictable results for blob report limiting (see the sketch after this list).
  • In ArchiveProcessor, use Archive::querySingleBlob() to query blob data and aggregate the results without loading entire trees into memory.
  • Remove DataCollection::forEachBlobExpanded(), since it is no longer used.
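
A minimal sketch of the default-sort idea from the second item above, assuming hypothetical variable names rather than the exact aggregateDataTableRecords() body (the 'Sort' and 'Truncate' filters are existing Matomo DataTable filters):

// Sketch only: pick nb_visits as the sort column when none was specified,
// so the rows dropped by truncation are the least-visited ones.
if ($columnToSortByBeforeTruncation === null
    && $table->getRowsCount() > 0
    && $table->getFirstRow()->getColumn('nb_visits') !== false
) {
    $columnToSortByBeforeTruncation = 'nb_visits';
}

if ($columnToSortByBeforeTruncation !== null) {
    $table->filter('Sort', [$columnToSortByBeforeTruncation, 'desc']);
}
$table->filter('Truncate', [$maximumRowsInDataTable]); // assumed row-limit variable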

Notes:

  • The new method of querying archive data is only used when aggregating data. This avoids unintended performance regressions when simply querying data through the API; additionally, a cursor is only more memory efficient when aggregating datatables together, not when merely querying them.

Review

@diosmosis marked this pull request as draft February 7, 2023 19:27
@diosmosis changed the title from "proof of concept one blob table row at a time method of aggregating w…" to "when aggregating non-day periods, only store one datatable and blob row in memory at a time" Feb 18, 2023
@diosmosis added the Needs Review label Feb 18, 2023
@diosmosis added this to the 5.0.0 milestone Feb 18, 2023
@diosmosis marked this pull request as ready for review February 18, 2023 21:53
@tsteur (Member) left a comment

FYI I haven't fully reviewed it yet, but I did some profiling.

Here's the result for this PR; the peak memory was 163MB:

[profiler screenshots]

And here's the result for the 4.x-dev branch, where the peak was 347MB:

[profiler screenshots]

It also executes slightly faster with this PR 👍 🎉

This was for a date range of 2022-09-06,2022-11-26. I can't test it over a longer period yet, but I expect the result would be similar.

> The new method of querying archive data is only used when aggregating data. This avoids unintended performance regressions when simply querying data through the API; additionally, a cursor is only more memory efficient when aggregating datatables together, not when merely querying them.

Looking at the test result there, maybe it could also be used in the API? That's where we're currently having problems. Or was the new implementation slower for you?

@diosmosis (Member, Author)

@tsteur

> Looking at the test result there, maybe it could also be used in the API? That's where we're currently having problems. Or was the new implementation slower for you?

I didn't actually test it, but it shouldn't make a difference: a cursor only helps when aggregating (whether browser-triggered or core:archive-triggered). In that case we fetch datatables for multiple dates and combine them, so peak memory use can be lowered to a single datatable tree. When just querying through the API, all of the data has to be returned, so it makes no difference whether we use a cursor or fetch everything at once; it all has to be in memory regardless (unless we modify the Matomo renderers to stream data out instead of writing it all at once, which would be an enormous change). Also, the SELECT query is far more complicated now and I don't know what it would do under load, so I thought it was better to constrain the effects of this change for now.
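
A hedged sketch contrasting the two access patterns described above (Archive::querySingleBlob() is this PR's new method, and DataTable::fromSerializedArray()/addDataTable() are existing Matomo helpers; the surrounding variables are illustrative only):

// API-style querying: every row must survive until the response is rendered,
// so a cursor cannot lower the peak; all rows are retained either way.
$allRows = [];
foreach ($cursor as $row) {
    $allRows[] = $row; // everything kept => same peak memory as Db::fetchAll()
}

// Aggregation: each blob row is folded into one destination table and can be
// discarded immediately, so the peak is roughly one datatable tree plus one row.
$merged = new DataTable();
foreach (Archive::querySingleBlob($archiveIds, $recordName) as $row) {
    $subTable = DataTable::fromSerializedArray($row['value']);
    $merged->addDataTable($subTable); // sum into the destination tree
    unset($subTable);                 // free the source tree before the next row
}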

@tsteur (Member) commented Mar 8, 2023

@diosmosis can you maybe send a link to where this different behaviour is implemented? Looking at the PR, I can't find different behaviour for the API vs. others. Or maybe I don't fully understand what is meant by the API and how we need to fetch all records there anyway.

@diosmosis (Member, Author)

@tsteur

public static function getArchiveIds($siteIds, $periods, $segment, $plugins, $includeInvalidated = true, $_skipSetGroupConcatMaxLen = false)
{
    $logger = StaticContainer::get(LoggerInterface::class);

    if (!$_skipSetGroupConcatMaxLen) {
        try {
            Db::get()->query('SET SESSION group_concat_max_len=' . (128 * 1024));
        } catch (\Exception $ex) {
            $logger->info("Could not set group_concat_max_len MySQL session variable.");
        }
    }

    if (empty($siteIds)) {
        throw new \Exception("Website IDs could not be read from the request, ie. idSite=");
    }

    foreach ($siteIds as $index => $siteId) {
        $siteIds[$index] = (int) $siteId;
    }

    $getArchiveIdsSql = "SELECT idsite, date1, date2,
                                GROUP_CONCAT(CONCAT(idarchive,'|',`name`,'|',`value`) ORDER BY idarchive DESC SEPARATOR ',') AS archives
                           FROM %s
                          WHERE idsite IN (" . implode(',', $siteIds) . ")
                            AND " . self::getNameCondition($plugins, $segment, $includeInvalidated) . "
                            AND %s
                       GROUP BY idsite, date1, date2";

    $monthToPeriods = array();
    foreach ($periods as $period) {
        /** @var Period $period */
        if ($period->getDateStart()->isLater(Date::now()->addDay(2))) {
            continue; // avoid creating any archive tables in the future
        }
        $table = ArchiveTableCreator::getNumericTable($period->getDateStart());
        $monthToPeriods[$table][] = $period;
    }

    $db = Db::get();

    // for every month within the archive query, select from numeric table
    $result = array();
    foreach ($monthToPeriods as $table => $periods) {
        $firstPeriod = reset($periods);

        $bind = array();
        if ($firstPeriod instanceof Range) {
            $dateCondition = "date1 = ? AND date2 = ?";
            $bind[] = $firstPeriod->getDateStart()->toString('Y-m-d');
            $bind[] = $firstPeriod->getDateEnd()->toString('Y-m-d');
        } else {
            // we assume there is no range date in $periods
            $dateCondition = '(';
            foreach ($periods as $period) {
                if (strlen($dateCondition) > 1) {
                    $dateCondition .= ' OR ';
                }
                $dateCondition .= "(period = ? AND date1 = ? AND date2 = ?)";
                $bind[] = $period->getId();
                $bind[] = $period->getDateStart()->toString('Y-m-d');
                $bind[] = $period->getDateEnd()->toString('Y-m-d');
            }
            $dateCondition .= ')';
        }

        $sql = sprintf($getArchiveIdsSql, $table, $dateCondition);
        $archiveIds = $db->fetchAll($sql, $bind);

        // get the archive IDs. we keep all archives until the first all plugins archive.
        // everything older than that one is discarded.
        foreach ($archiveIds as $row) {
            $dateStr = $row['date1'] . ',' . $row['date2'];

            $archives = $row['archives'];
            $pairs = explode(',', $archives);
            foreach ($pairs as $pair) {
                $parts = explode('|', $pair);
                if (count($parts) != 3) { // GROUP_CONCAT got cut off, have to ignore the rest
                    // note: in this edge case, we end up not selecting the all plugins archive because it will be older than the partials.
                    // not ideal, but it avoids an exception.
                    $logger->info("GROUP_CONCAT got cut off in ArchiveSelector." . __FUNCTION__ . ' for idsite = ' . $row['idsite'] . ', period = ' . $dateStr);
                    continue;
                }

                list($idarchive, $doneFlag, $value) = $parts;

                $result[$doneFlag][$dateStr][] = $idarchive;
                if (strpos($doneFlag, '.') === false // all plugins archive
                    // sanity check: DONE_PARTIAL shouldn't be used w/ done archives, but in case we see one,
                    // don't treat it like an all plugins archive
                    && $value != ArchiveWriter::DONE_PARTIAL
                ) {
                    break; // found the all plugins archive, don't need to look in older archives since we have everything here
                }
            }
        }
    }

    return $result;
}
(querying for archives to just return them) vs.
public static function querySingleBlob(array $archiveIds, string $recordName)
{
    $chunk = new Chunk();

    [$getValuesSql, $bind] = self::getSqlTemplateToFetchArchiveData(
        [$recordName], Archive::ID_SUBTABLE_LOAD_ALL_SUBTABLES, true);

    $archiveIdsPerMonth = self::getArchiveIdsByYearMonth($archiveIds);

    $periodsSeen = [];

    // $yearMonth = "2022-11",
    foreach ($archiveIdsPerMonth as $yearMonth => $ids) {
        $date = Date::factory($yearMonth . '-01');
        $table = ArchiveTableCreator::getBlobTable($date);

        $ids = array_map('intval', $ids);
        $sql = sprintf($getValuesSql, $table, implode(',', $ids));

        $cursor = Db::get()->query($sql, $bind);
        while ($row = $cursor->fetch()) {
            $period = $row['date1'] . ',' . $row['date2'];
            $recordName = $row['name'];

            // FIXME: This hack works around a strange bug that occurs when getting
            //        archive IDs through ArchiveProcessing instances. When a table
            //        does not already exist, for some reason the archive ID for
            //        today (or from two days ago) will be added to the Archive
            //        instances list. The Archive instance will then select data
            //        for periods outside of the requested set.
            //        We work around the bug here, but ideally we need to figure
            //        out why incorrect idarchives are being selected.
            if (empty($archiveIds[$period])) {
                continue;
            }

            // only use the first period/blob name combination seen (since we order by ts_archived descending)
            if (!empty($periodsSeen[$period][$recordName])) {
                continue;
            }

            $periodsSeen[$period][$recordName] = true;

            $row['value'] = ArchiveSelector::uncompress($row['value']);

            if ($chunk->isRecordNameAChunk($row['name'])) {
                // $blobs = array([subtableID] = [blob of subtableId])
                $blobs = Common::safe_unserialize($row['value']);
                if (!is_array($blobs)) {
                    yield $row;
                    continue; // not the expected chunk payload; skip the chunk handling below
                }

                ksort($blobs);

                // $rawName = eg 'PluginName_ArchiveName'
                $rawName = $chunk->getRecordNameWithoutChunkAppendix($row['name']);

                foreach ($blobs as $subtableId => $blob) {
                    yield array_merge($row, [
                        'value' => $blob,
                        'name' => ArchiveSelector::appendIdSubtable($rawName, $subtableId),
                    ]);
                }
            } else {
                yield $row;
            }
        }
    }
}
(querying for a single blob to aggregate them). The difference in the generated SQL is determined by:

if ($orderBySubtableId && count($recordNames) == 1) {
    $idSubtableAsInt = self::getExtractIdSubtableFromBlobNameSql($chunk, $name);
    $orderBy = "ORDER BY date1 ASC, " . // ordering by date just so column order in tests will be predictable
               " $idSubtableAsInt ASC,
                 ts_archived DESC"; // descending, so the latest archived data comes first and is used
}
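
For context, a hedged sketch of how getExtractIdSubtableFromBlobNameSql() could derive an integer idSubtable without REGEXP_SUBSTR() (which, per the commit notes, requires MySQL 8). The helper name comes from the diff above, but this body is illustrative only and omits the chunk-name and empty-string handling the real code needs:

// Illustrative only: blob names look like 'Actions_actions_url' for the root
// table and 'Actions_actions_url_5' for subtable 5. Stripping the record-name
// prefix and casting the remainder to an integer yields a sortable id; the
// root row yields '', which MySQL casts to 0, so it sorts before all subtables.
function getExtractIdSubtableFromBlobNameSqlSketch(string $recordName): string
{
    $suffixStart = strlen($recordName) + 2; // SUBSTRING() is 1-based; +1 more skips the '_'
    return "CAST(SUBSTRING(`name`, $suffixStart) AS UNSIGNED)";
}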

@tsteur (Member) commented Mar 14, 2023

Hi @diosmosis. I understand now. I debugged the code and can now see what's happening, and I understand the extra queries, including the many more queries you mean. I was actually hoping there could be only one query, with the generator then only calling $cursor->fetch() to get the next result. I don't actually know whether that would reduce memory, or if PHP would still hold everything in memory in the background. I'm just thinking this could improve speed and reduce the number of DB queries quite significantly, as these queries can get slow in production (although we recently improved their speed with a different index).
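
An aside on the "would PHP still hold everything in memory" question: with PDO's MySQL driver this depends on the buffering mode. By default queries are buffered, meaning the entire result set is copied to the client before the first fetch(); streaming row by row requires an unbuffered statement. A minimal sketch in plain PDO, outside Matomo's Db layer:

// Unbuffered mode streams rows from the server as fetch() is called, at the
// cost of not being able to issue other queries on this connection until the
// cursor is exhausted. In the default buffered mode, fetch() merely walks a
// result set already held in client memory.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query('SELECT `name`, `value` FROM matomo_archive_blob_2022_11'); // example table
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // process one row at a time; peak memory stays around one row
}
$stmt->closeCursor();

$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, true); // restore the default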


@diosmosis (Member, Author) commented Mar 14, 2023

@tsteur

> I was actually hoping there could be only one query, with the generator then only calling $cursor->fetch() to get the next result.

This should already be what happens: there should only be one query per archive table. Can you elaborate?

EDIT: when I said 'under load' I meant 'in production, when used for every API blob query, not just when aggregating', and I was referring to the complicated ORDER BY that's used to order by idSubtable, which isn't actually needed for API requests that just fetch data without doing any aggregation.

@tsteur (Member) commented Mar 14, 2023

@diosmosis I'm so sorry. It actually does work like that indeed. It only looked otherwise because one of the date ranges had no data, so execution would start again at the top of the function and issue the query again; that's not the case when there are matching rows. I completely missed it. From my perspective this all looks good 💯

Not sure if @matomo-org/core-reviewers have any other thoughts? Ideally we could even merge this into a 4.14.0 release, as it would be great to have this change on the Cloud earlier, and I believe there isn't a breaking change.

@diosmosis (Member, Author) commented Mar 14, 2023

> It only looked otherwise because one of the date ranges had no data, so execution would start again at the top of the function and issue the query again; that's not the case when there are matching rows.

Interesting; I think that probably shouldn't happen, I'll take a look.

EDIT: Oh I see, never mind. I thought you meant the loop would go on forever in that case, but it shouldn't :)

@tsteur (Member) commented Mar 15, 2023

@diosmosis it doesn't; it's all good. It was just querying for different record names, which is to be expected. I didn't realise the record names changed while debugging 👍

@sgiehl (Member) left a comment

Had a look through the code. To me the changes look reasonable.
@tsteur I didn't debug the code or check memory usage yet, as you already did that. Or shall I invest some additional time to do that as well?

@tsteur (Member) commented Mar 20, 2023

> @tsteur I didn't debug the code or check memory usage yet, as you already did that. Or shall I invest some additional time to do that as well?

I did check memory usage and it definitely went down, and I also debugged the code etc.

@sgiehl merged commit e3188be into 5.x-dev Mar 20, 2023 (17 of 19 checks passed)
@sgiehl deleted the 18295-blob-querying-cursor branch March 20, 2023 09:32
tsteur pushed a commit that referenced this pull request Mar 22, 2023
when aggregating non-day periods, only store one datatable and blob row in memory at a time (#20332)

* proof of concept one blob table row at a time method of aggregating week, month, year, range data

* sort blobs by subtable ID when chunk is being read

* simplify code w/ generators

* make sure single blob query is ordered by name correctly

* REGEXP_SUBSTR() is only available in mysql 8 :/

* fix a couple test failures

* by default when aggregating tables in ArchiveProcessor sort by visits if the table contains visits

* try fixing random test failures

* debug ci only error

* undo debugging change

* fixing some system tests

* refactor ArchiveSelector code for more code reuse

* add some code documentation

* remove DataCollection::forEachBlobExpanded() since it is no longer used + a couple other small changes

* try debugging ci only random failure

* remove previous debugging code

* more debugging

* more ci debugging

* trigger build again and try to get more information for random failure

* fix convoluted sql replacement for REGEXP_SUBSTRING

* fix idsubtable extraction, need to check if extracted value is an empty string and order it before everything else if so

* add log in case blob table order is incorrect

* add tests for subtable extraction sql

* remove unused import

---------

Co-authored-by: Stefan Giehl <stefan@matomo.org>
sgiehl added a commit that referenced this pull request Apr 12, 2023
when aggregating non-day periods, only store one datatable and blob row in memory at a time (#20332)
@sgiehl added the not-in-changelog label Apr 12, 2023
sgiehl added a commit that referenced this pull request Apr 13, 2023
when aggregating non-day periods, only store one datatable and blob row in memory at a time (#20332) (#20512)

Co-authored-by: dizzy <diosmosis@users.noreply.github.com>
Co-authored-by: Stefan Giehl <stefan@matomo.org>