
when aggregating non-day periods, only store one datatable and blob row in memory at a time #20332

Merged
merged 29 commits into 5.x-dev from 18295-blob-querying-cursor on Mar 20, 2023

Conversation

@diosmosis (Member) commented Feb 7, 2023

Description:

Fixes #18295

Changes:

  • Add an Archive::querySingleBlob() method that queries archive data for a single blob using a cursor instead of Db::fetchAll(). The result is ordered so the datatables can be aggregated without needing the entire datatable tree in memory.
  • In ArchiveProcessor::aggregateDataTableRecords(), provide a default sort column if none is specified: if the table includes nb_visits, we sort by it before truncation. This ensures predictable results for blob report limiting (see the sketch after this list).
  • In ArchiveProcessor, use Archive::querySingleBlob() to query blob data and aggregate the results without loading entire trees into memory.
  • Remove DataCollection::forEachBlobExpanded(), since it is no longer used.
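
A minimal sketch of the default-sort idea from the second item above, assuming hypothetical variable names rather than the exact aggregateDataTableRecords() body (the 'Sort' and 'Truncate' filters are existing Matomo DataTable filters):

// Sketch only: pick nb_visits as the sort column when none was specified,
// so the rows dropped by truncation are the least-visited ones.
if ($columnToSortByBeforeTruncation === null
    && $table->getRowsCount() > 0
    && $table->getFirstRow()->getColumn('nb_visits') !== false
) {
    $columnToSortByBeforeTruncation = 'nb_visits';
}

if ($columnToSortByBeforeTruncation !== null) {
    $table->filter('Sort', [$columnToSortByBeforeTruncation, 'desc']);
}
$table->filter('Truncate', [$maximumRowsInDataTable]); // assumed row-limit variable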

Notes:

  • The new method of querying archive data is only used when aggregating data. This avoids unintended performance regressions when simply querying data through the API; additionally, a cursor is only more memory efficient when aggregating datatables together, not when merely querying them.

Review

@diosmosis marked this pull request as draft February 7, 2023 19:27
@diosmosis changed the title from "proof of concept one blob table row at a time method of aggregating w…" to "when aggregating non-day periods, only store one datatable and blob row in memory at a time" Feb 18, 2023
@diosmosis added the Needs Review label Feb 18, 2023
@diosmosis added this to the 5.0.0 milestone Feb 18, 2023
@diosmosis marked this pull request as ready for review February 18, 2023 21:53
@tsteur (Member) left a comment

FYI I haven't fully reviewed it yet, but I did some profiling.

Here's the result for this PR; the peak memory was 163MB:

[profiler screenshots]

And here's the result for the 4.x-dev branch, where the peak was 347MB:

[profiler screenshots]

It also executes slightly faster with this PR 👍 🎉

This was for a date range of 2022-09-06,2022-11-26. I can't test it over a longer period yet, but I expect the result would be similar.

> The new method of querying archive data is only used when aggregating data. This avoids unintended performance regressions when simply querying data through the API; additionally, a cursor is only more memory efficient when aggregating datatables together, not when merely querying them.

Looking at the test result there, maybe it could also be used in the API? That's where we're currently having problems. Or was the new implementation slower for you?

@diosmosis (Member, Author)

@tsteur

> Looking at the test result there, maybe it could also be used in the API? That's where we're currently having problems. Or was the new implementation slower for you?

I didn't actually test it, but it shouldn't make a difference: a cursor only helps when aggregating (whether browser-triggered or core:archive-triggered). In that case we fetch datatables for multiple dates and combine them, so peak memory use can be lowered to a single datatable tree. When just querying through the API, all of the data has to be returned, so it makes no difference whether we use a cursor or fetch everything at once; it all has to be in memory regardless (unless we modify the Matomo renderers to stream data out instead of writing it all at once, which would be an enormous change). Also, the SELECT query is far more complicated now and I don't know what it would do under load, so I thought it was better to constrain the effects of this change for now.
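
A hedged sketch contrasting the two access patterns described above (Archive::querySingleBlob() is this PR's new method, and DataTable::fromSerializedArray()/addDataTable() are existing Matomo helpers; the surrounding variables are illustrative only):

// API-style querying: every row must survive until the response is rendered,
// so a cursor cannot lower the peak; all rows are retained either way.
$allRows = [];
foreach ($cursor as $row) {
    $allRows[] = $row; // everything kept => same peak memory as Db::fetchAll()
}

// Aggregation: each blob row is folded into one destination table and can be
// discarded immediately, so the peak is roughly one datatable tree plus one row.
$merged = new DataTable();
foreach (Archive::querySingleBlob($archiveIds, $recordName) as $row) {
    $subTable = DataTable::fromSerializedArray($row['value']);
    $merged->addDataTable($subTable); // sum into the destination tree
    unset($subTable);                 // free the source tree before the next row
}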

@tsteur (Member) commented Mar 8, 2023

@diosmosis can you maybe send a link to where this different behaviour is implemented? Looking at the PR, I can't find different behaviour for the API vs. others. Or maybe I don't fully understand what is meant by the API and how we need to fetch all records there anyway.

@diosmosis (Member, Author)

@tsteur

public static function getArchiveIds($siteIds, $periods, $segment, $plugins, $includeInvalidated = true, $_skipSetGroupConcatMaxLen = false)
{
    $logger = StaticContainer::get(LoggerInterface::class);

    if (!$_skipSetGroupConcatMaxLen) {
        try {
            Db::get()->query('SET SESSION group_concat_max_len=' . (128 * 1024));
        } catch (\Exception $ex) {
            $logger->info("Could not set group_concat_max_len MySQL session variable.");
        }
    }

    if (empty($siteIds)) {
        throw new \Exception("Website IDs could not be read from the request, ie. idSite=");
    }

    foreach ($siteIds as $index => $siteId) {
        $siteIds[$index] = (int) $siteId;
    }

    $getArchiveIdsSql = "SELECT idsite, date1, date2,
                                GROUP_CONCAT(CONCAT(idarchive,'|',`name`,'|',`value`) ORDER BY idarchive DESC SEPARATOR ',') AS archives
                           FROM %s
                          WHERE idsite IN (" . implode(',', $siteIds) . ")
                            AND " . self::getNameCondition($plugins, $segment, $includeInvalidated) . "
                            AND %s
                       GROUP BY idsite, date1, date2";

    $monthToPeriods = array();
    foreach ($periods as $period) {
        /** @var Period $period */
        if ($period->getDateStart()->isLater(Date::now()->addDay(2))) {
            continue; // avoid creating any archive tables in the future
        }
        $table = ArchiveTableCreator::getNumericTable($period->getDateStart());
        $monthToPeriods[$table][] = $period;
    }

    $db = Db::get();

    // for every month within the archive query, select from numeric table
    $result = array();
    foreach ($monthToPeriods as $table => $periods) {
        $firstPeriod = reset($periods);

        $bind = array();
        if ($firstPeriod instanceof Range) {
            $dateCondition = "date1 = ? AND date2 = ?";
            $bind[] = $firstPeriod->getDateStart()->toString('Y-m-d');
            $bind[] = $firstPeriod->getDateEnd()->toString('Y-m-d');
        } else {
            // we assume there is no range date in $periods
            $dateCondition = '(';
            foreach ($periods as $period) {
                if (strlen($dateCondition) > 1) {
                    $dateCondition .= ' OR ';
                }
                $dateCondition .= "(period = ? AND date1 = ? AND date2 = ?)";
                $bind[] = $period->getId();
                $bind[] = $period->getDateStart()->toString('Y-m-d');
                $bind[] = $period->getDateEnd()->toString('Y-m-d');
            }
            $dateCondition .= ')';
        }

        $sql = sprintf($getArchiveIdsSql, $table, $dateCondition);
        $archiveIds = $db->fetchAll($sql, $bind);

        // get the archive IDs. we keep all archives until the first all plugins archive.
        // everything older than that one is discarded.
        foreach ($archiveIds as $row) {
            $dateStr = $row['date1'] . ',' . $row['date2'];

            $archives = $row['archives'];
            $pairs = explode(',', $archives);
            foreach ($pairs as $pair) {
                $parts = explode('|', $pair);
                if (count($parts) != 3) { // GROUP_CONCAT got cut off, have to ignore the rest
                    // note: in this edge case, we end up not selecting the all plugins archive because it will be older than the partials.
                    // not ideal, but it avoids an exception.
                    $logger->info("GROUP_CONCAT got cut off in ArchiveSelector." . __FUNCTION__ . ' for idsite = ' . $row['idsite'] . ', period = ' . $dateStr);
                    continue;
                }

                list($idarchive, $doneFlag, $value) = $parts;

                $result[$doneFlag][$dateStr][] = $idarchive;
                if (strpos($doneFlag, '.') === false // all plugins archive
                    // sanity check: DONE_PARTIAL shouldn't be used w/ done archives, but in case we see one,
                    // don't treat it like an all plugins archive
                    && $value != ArchiveWriter::DONE_PARTIAL
                ) {
                    break; // found the all plugins archive, don't need to look in older archives since we have everything here
                }
            }
        }
    }

    return $result;
}
(querying for archives to just return them) vs.
public static function querySingleBlob(array $archiveIds, string $recordName)
{
    $chunk = new Chunk();

    [$getValuesSql, $bind] = self::getSqlTemplateToFetchArchiveData(
        [$recordName], Archive::ID_SUBTABLE_LOAD_ALL_SUBTABLES, true);

    $archiveIdsPerMonth = self::getArchiveIdsByYearMonth($archiveIds);

    $periodsSeen = [];

    // $yearMonth = "2022-11",
    foreach ($archiveIdsPerMonth as $yearMonth => $ids) {
        $date = Date::factory($yearMonth . '-01');
        $table = ArchiveTableCreator::getBlobTable($date);

        $ids = array_map('intval', $ids);
        $sql = sprintf($getValuesSql, $table, implode(',', $ids));

        $cursor = Db::get()->query($sql, $bind);
        while ($row = $cursor->fetch()) {
            $period = $row['date1'] . ',' . $row['date2'];
            $recordName = $row['name'];

            // FIXME: This hack works around a strange bug that occurs when getting
            //        archive IDs through ArchiveProcessing instances. When a table
            //        does not already exist, for some reason the archive ID for
            //        today (or from two days ago) will be added to the Archive
            //        instances list. The Archive instance will then select data
            //        for periods outside of the requested set.
            //        We work around the bug here, but ideally we need to figure
            //        out why incorrect idarchives are being selected.
            if (empty($archiveIds[$period])) {
                continue;
            }

            // only use the first period/blob name combination seen (since we order by ts_archived descending)
            if (!empty($periodsSeen[$period][$recordName])) {
                continue;
            }

            $periodsSeen[$period][$recordName] = true;

            $row['value'] = ArchiveSelector::uncompress($row['value']);

            if ($chunk->isRecordNameAChunk($row['name'])) {
                // $blobs = array([subtableID] = [blob of subtableId])
                $blobs = Common::safe_unserialize($row['value']);
                if (!is_array($blobs)) {
                    yield $row;
                    continue; // not the expected chunk payload; skip the chunk handling below
                }

                ksort($blobs);

                // $rawName = eg 'PluginName_ArchiveName'
                $rawName = $chunk->getRecordNameWithoutChunkAppendix($row['name']);

                foreach ($blobs as $subtableId => $blob) {
                    yield array_merge($row, [
                        'value' => $blob,
                        'name' => ArchiveSelector::appendIdSubtable($rawName, $subtableId),
                    ]);
                }
            } else {
                yield $row;
            }
        }
    }
}
(querying for a single blob to aggregate them). The difference in the generated SQL is determined by:

if ($orderBySubtableId && count($recordNames) == 1) {
    $idSubtableAsInt = self::getExtractIdSubtableFromBlobNameSql($chunk, $name);
    $orderBy = "ORDER BY date1 ASC, " . // ordering by date just so column order in tests will be predictable
               " $idSubtableAsInt ASC,
                 ts_archived DESC"; // descending, so the latest archived data comes first and is used
}
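
For context, a hedged sketch of how getExtractIdSubtableFromBlobNameSql() could derive an integer idSubtable without REGEXP_SUBSTR() (which, per the commit notes, requires MySQL 8). The helper name comes from the diff above, but this body is illustrative only and omits the chunk-name and empty-string handling the real code needs:

// Illustrative only: blob names look like 'Actions_actions_url' for the root
// table and 'Actions_actions_url_5' for subtable 5. Stripping the record-name
// prefix and casting the remainder to an integer yields a sortable id; the
// root row yields '', which MySQL casts to 0, so it sorts before all subtables.
function getExtractIdSubtableFromBlobNameSqlSketch(string $recordName): string
{
    $suffixStart = strlen($recordName) + 2; // SUBSTRING() is 1-based; +1 more skips the '_'
    return "CAST(SUBSTRING(`name`, $suffixStart) AS UNSIGNED)";
}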

@tsteur (Member) commented Mar 14, 2023

Hi @diosmosis. I understand now. I debugged the code and can now see what's happening, and I understand the extra queries, including the many more queries you mean. I was actually hoping there could be only one query, with the generator then only calling $cursor->fetch() to get the next result. I don't actually know whether that would reduce memory, or if PHP would still hold everything in memory in the background. I'm just thinking this could improve speed and reduce the number of DB queries quite significantly, as these queries can get slow in production (although we recently improved their speed with a different index).
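
An aside on the "would PHP still hold everything in memory" question: with PDO's MySQL driver this depends on the buffering mode. By default queries are buffered, meaning the entire result set is copied to the client before the first fetch(); streaming row by row requires an unbuffered statement. A minimal sketch in plain PDO, outside Matomo's Db layer:

// Unbuffered mode streams rows from the server as fetch() is called, at the
// cost of not being able to issue other queries on this connection until the
// cursor is exhausted. In the default buffered mode, fetch() merely walks a
// result set already held in client memory.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query('SELECT `name`, `value` FROM matomo_archive_blob_2022_11'); // example table
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // process one row at a time; peak memory stays around one row
}
$stmt->closeCursor();

$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, true); // restore the default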


@diosmosis (Member, Author) commented Mar 14, 2023

@tsteur

> I was actually hoping there could be only one query, with the generator then only calling $cursor->fetch() to get the next result.

This should already be what happens: there should only be one query per archive table. Can you elaborate?

EDIT: when I said 'under load' I meant 'in production, when used for every API blob query, not just when aggregating', and I was referring to the complicated ORDER BY that's used to order by idSubtable, which isn't actually needed for API requests that just fetch data without doing any aggregation.

@tsteur (Member) commented Mar 14, 2023

@diosmosis I'm so sorry. It actually does work like that indeed. It only looked otherwise because one of the date ranges had no data, so execution would start again at the top of the function and issue the query again; that's not the case when there are matching rows. I completely missed it. From my perspective this all looks good 💯

Not sure if @matomo-org/core-reviewers have any other thoughts? Ideally we could even merge this into a 4.14.0 release, as it would be great to have this change on the Cloud earlier, and I believe there isn't a breaking change.

@diosmosis (Member, Author) commented Mar 14, 2023

> It only looked otherwise because one of the date ranges had no data, so execution would start again at the top of the function and issue the query again; that's not the case when there are matching rows.

Interesting; I think that probably shouldn't happen, I'll take a look.

EDIT: Oh I see, never mind. I thought you meant the loop would go on forever in that case, but it shouldn't :)

@tsteur (Member) commented Mar 15, 2023

@diosmosis it doesn't; it's all good. It was just querying for different record names, which is to be expected. I didn't realise the record names changed while debugging 👍

@sgiehl (Member) left a comment

Had a look through the code. To me the changes look reasonable.
@tsteur I didn't debug the code or check memory usage yet, as you already did that. Or shall I invest some additional time to do that as well?

@tsteur (Member) commented Mar 20, 2023

> @tsteur I didn't debug the code or check memory usage yet, as you already did that. Or shall I invest some additional time to do that as well?

I did check memory usage and it definitely went down, and I also debugged the code etc.

@sgiehl merged commit e3188be into 5.x-dev Mar 20, 2023 (17 of 19 checks passed)
@sgiehl deleted the 18295-blob-querying-cursor branch March 20, 2023 09:32
tsteur pushed a commit that referenced this pull request Mar 22, 2023
when aggregating non-day periods, only store one datatable and blob row in memory at a time (#20332)

* proof of concept one blob table row at a time method of aggregating week, month, year, range data

* sort blobs by subtable ID when chunk is being read

* simplify code w/ generators

* make sure single blob query is ordered by name correctly

* REGEXP_SUBSTR() is only available in mysql 8 :/

* fix a couple test failures

* by default when aggregating tables in ArchiveProcessor sort by visits if the table contains visits

* try fixing random test failures

* debug ci only error

* undo debugging change

* fixing some system tests

* refactor ArchiveSelector code for more code reuse

* add some code documentation

* remove DataCollection::forEachBlobExpanded() since it is no longer used + a couple other small changes

* try debugging ci only random failure

* remove previous debugging code

* more debugging

* more ci debugging

* trigger build again and try to get more information for random failure

* fix convoluted sql replacement for REGEXP_SUBSTRING

* fix idsubtable extraction, need to check if extracted value is an empty string and order it before everything else if so

* add log in case blob table order is incorrect

* add tests for subtable extraction sql

* remove unused import

---------

Co-authored-by: Stefan Giehl <stefan@matomo.org>
sgiehl added a commit that referenced this pull request Apr 12, 2023
when aggregating non-day periods, only store one datatable and blob row in memory at a time (#20332)
@sgiehl added the not-in-changelog label Apr 12, 2023
sgiehl added a commit that referenced this pull request Apr 13, 2023
when aggregating non-day periods, only store one datatable and blob row in memory at a time (#20332) (#20512)

Co-authored-by: dizzy <diosmosis@users.noreply.github.com>
Co-authored-by: Stefan Giehl <stefan@matomo.org>