log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk #14535

airblag · 2019-06-17T15:01:23Z

Hi,

My matomo installation is "big" (log_link_visit_action is ~40GB big for 55 days retention, log_visit is ~4GB), and we see over 97% of temporary tables created on disk in our mysql monitoring.

After trying to tune mysql with increasing tmp_table_size without any sucess I noticed that the log_visit table has the referer_url field set as TEXT.

it blocks the use of the memory engine of mysql for tmp tables writing all of them to disk :

https://dev.mysql.com/doc/refman/5.7/en/blob.html

"[...]
Instances of BLOB or TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types (see Section 8.4.4, “Internal Temporary Table Use in MySQL”). Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.
[...]"

I'm not familiar with the code of matomo to look for the exact query producing the tmp_tables...

Is it thinkable to somehow convert it to varchar(65535) to avoid having all tmp_tables on disk ? Or would it have a lot of side effects ?

I tried to move the tmpdir of mysql to ramdisk, but since I try to optimize the DB once a week, and optimizing innodb needs a lot of place on /tmp it's always failing to optimize log_link_action_visit except I give more 40GB for my ramdisk...

tsteur · 2019-06-18T06:19:41Z

@mattab that's interesting. Maybe would also help with log_action.name?

mattab · 2019-06-21T08:04:09Z

Thanks for the report @airblag

Is it thinkable to somehow convert it to varchar(65535) to avoid having all tmp_tables on disk ? Or would it have a lot of side effects ?

Reckon that it should be fine for more than 99.9% of cases to have only 65535 character long URLs.
@tsteur Maybe something worth investigating?

tsteur · 2019-06-21T08:05:35Z

@mattab text also has only 65,535 characters

tsteur · 2019-06-21T08:08:39Z

We will schedule this for Matomo 4. Thanks for this @airblag

tsteur · 2019-06-21T15:48:35Z

@mattab I'm thinking we could actually already apply this change for new installs and only include the update in Matomo 4. This will save many thousands of migrations later.

airblag · 2019-06-24T08:56:20Z

I think we should not make a varchar(65535) but something a bit smaller. The whole row in mysql can be maximum 65535 bytes long, so we need to let some place for the other fields in the row. We should make something like varchar(65535 - size(all other columns)) at maximum, and maybe let some place to extend the table with some extra rows for future schema changes.
In my installation, the biggest url was around 3500 chars. Since the RFC doesn't set any limit to an URL (as far as I understand), there might always be some special cases where the referer doesn't fit fully in the referer_url field. But i think that chomping a referer would almost never happen and should not break the stats that much...

tsteur · 2019-06-24T09:18:12Z

Yeah I reckon be better to set to something like 50 or 60K just to be safe.

tsteur · 2019-06-24T09:18:58Z

Let's change the type now for new installs, and already add an update for Matomo 4.0 which will be executed once they update to Matomo 4.

airblag · 2019-06-26T12:48:20Z

We converted both TEXT fields into varchar in our production setup. But it still doesn't seem to reduce anyhow the amount of tables on disk created.
seeing the rate of how the tables are created on disk, it seems to be produced not by archiving jobs but by every request the server receives.

I just checked all the tables of my piwik DB and grepped for text|blob and there are a lot more !

Maybe it would be good to check in the requests updating/inserting log data which text/blobs are used, and first replace only thoses fields with varchar.

I'll try to have a look, but since I never looked into the code of matomo, it might be easier for someone who is familiar with it :)

tsteur · 2019-06-26T13:02:55Z

Let us know what you find. Wouldn't think it creates a temporary table during tracking but never know.

airblag · 2019-06-27T17:34:16Z

So, it took my day of work, but I found out which query is producing the temporary table on disk !

+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 6     |
| Created_tmp_tables      | 8     |
+-------------------------+-------+
2 rows in set (0.00 sec)

mysql:piwik> SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  GROUP BY type, hash, name;
+----------+------+--------------------------+
| idaction | type | name                     |
+----------+------+--------------------------+
|  3869297 |   10 | taz.de/!5603531          |
|  1562941 |   10 | TZI                      |
|  1562939 |   11 | displayed                |
|  3131298 |   12 | ARTIKELAUFRUF_ohne_Layer |
+----------+------+--------------------------+
4 rows in set (0.00 sec)

mysql:piwik> SHOW STATUS LIKE 'Created_tmp%tables';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 7     |
| Created_tmp_tables      | 9     |
+-------------------------+-------+
2 rows in set (0.00 sec)

A few notes/questions about this request :

why is the crc32() of the name checked in addition of the name ? I removed it (and the GROUP BY hash part), and I get the same result.
why are we selecting MIN(idaction) instead of idaction ? selecting directly idaction gives me the same result and DON'T PRODUCE TEMPORARY TABLE !!!.
removing the name from GROUP BY gives also the same result and also DON'T PRODUCE TEMPORARY TABLE !!!.

I grepped through the code and it seems to be produced in core/Tracker/Model.php

@tsteur do you have an idea if 2. or 3. can be done without breaking something ?

Some notes about debugging

For getting it, I tried first to activate debugging and sql_profiling in matomo, but I hit the bug of logging to file which didn't worked. I applied manualy the patch for the https://patch-diff.githubusercontent.com/raw/matomo-org/matomo/pull/14296.patch on plugins/Monolog/config/config.php.

But I couldn't find the correct requests in the log file. Maybe because I use the QueuedTracking plugin ?

Then I just logged every query of mysql for a few seconds (just enough to let QueuedTracking write in the DB):

SET GLOBAL general_log_file =  '/var/log/mysql/general.log';
SET GLOBAL general_log = 'ON';
SELECT sleep(30);
SET GLOBAL general_log = 'OFF';

tsteur · 2019-06-28T05:02:40Z

why is the crc32() of the name checked in addition of the name ? I removed it (and the GROUP BY hash part), and I get the same result.

There is an index on hash but not on name (cause too long and starts with too similar values usually). The hash is only 10 characters so potentially it could generate the same value which is why it probably also adds the name into the query.

why are we selecting MIN(idaction) instead of idaction ? selecting directly idaction gives me the same result and DON'T PRODUCE TEMPORARY TABLE !!!.

There used to be cases where the same action was added twice due to some race condition issues or something. See #6436 . The PR was done in #7112 . Looking at this, it still seems needed. Can you confirm @diosmosis ? Simply because there is no unique key on the DB and duplicates can happen. Wouldn't really know how to avoid it but maybe it be faster to change the createAction() method to do INSERT INTO ACTION... ; and then SELECT where hash = ? type = ? name = ? If two results are there, we would need to remove the one with the highest ID, and return the lowest ID. As actions are inserted more rarely, and the needed select is fast, this may be an option?

Actually... seeing now this is already implemented. So the only reason the min(idaction) is still there because there could be older records from before this fix was applied that could have duplicate actions. Maybe it could be removed now? I think we had some logic to remove duplicate actions in a task or so? @diosmosis

removing the name from GROUP BY gives also the same result and also DON'T PRODUCE TEMPORARY TABLE !!!.

The group by is there for the min to work I reckon.

diosmosis · 2019-06-28T05:40:31Z

@tsteur We could just select the highest ID in PHP instead of in MySQL, would that work?

tsteur · 2019-06-28T05:42:42Z

@diosmosis that should work too. Considering there shouldn't be too many entries

tsteur · 2019-06-28T05:43:01Z

Does that mean we could remove the min() and the group by maybe?

diosmosis · 2019-06-28T06:02:31Z

I don't think we can just remove it w/o something that deals w/ duplicates (that come from old bugs and from any new bugs), but if it causes problems we shouldn't have it.

tsteur · 2019-06-28T06:12:32Z

I meant if we change

SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  GROUP BY type, hash, name;

to something like

SELECT idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )

then we could do some logic like

$idActions = array();
foreach ($rows as $row) {
   $idAction = $row['idaction'];
   $name = $row['name'];
   if (!isset($idActions[$name])) {
      $idActions[$name] = $row;
   } else if ($idAction < $idActions[$name]['idaction']) {
       $idActions[$name] = $row;
   }
}
return array_values($idActions);

Not tested or so... I simply meant if we don't fetch like crazy huge amount of values there, which we never do maybe I think, this could be faster and avoid tmp tables?

diosmosis · 2019-06-28T06:14:45Z

Yes that's what I meant too.

airblag · 2019-06-28T08:20:30Z

Ok, so the crc32 is there to make the index work better, but the probability of collision is to high, so we check both hash and name in the where clause ?
the min() is needed because entries could be duplicated.

What I was meaning in 3. was only to remove name from the group by part. It would look like this and not create tmp tables

SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  GROUP BY type, hash;

since we filtered out possible collisions with the WHERE clause, I guess only grouping by type,hash should work too, or I am still missing something ?

tsteur · 2019-06-28T08:23:07Z

If you group by type, hash only, then two entries with different names, but the same hash will be grouped together. Of course that would be only the case if in the same where we select two different names that result in the same hash which would be rather unlikely.

tsteur · 2019-06-28T09:39:45Z

Considering the chances are small that we are fetching a few values that result in the same hash we could apply the quick fix to simply remove the name from the group by. The type is still in the group by as well. So it would need to result in the same hash for a name with the same type. Just looked at a log_action table with 11M entries and there we had 7000 duplicate hashes with 2 different names for the same type. So while the changes that this happens is very low, it can still happen and be better to apply the proper fix mentioned in #14535 (comment) to avoid getting reports for weird issues later.

…d tmp tables refs #14535

tsteur · 2019-06-28T10:02:42Z

created PR #14584

@airblag can you otherwise confirm that changing the action name to varchar did not help?

airblag · 2019-06-28T10:09:52Z

created PR #14584

@airblag can you otherwise confirm that changing the action name to varchar did not help?

maybe it helps for other queries maybe in the archiving process, but it's not really to observe in our graphs with an average of 30 tmp tables/second.

tsteur · 2019-06-28T10:14:05Z

it be great if you could patch your Matomo with the referenced PR. Feel free to wait though until there was a review.

airblag · 2019-06-28T12:48:37Z

I just patched with the PR, and the tmp_tables dropped from 30 per seconds to around 0. Looks like it works for me :)

EDIT:
in my installation log_action.name is varchar(2048). Not sure if it would reacts differently with TEXT.

airblag · 2019-07-02T09:56:02Z

Just to confirm after a few days, it does the job:

tsteur · 2019-07-02T11:14:30Z

cheers 👍 thanks for letting us know. I reckon it will be maybe still a good idea to migrate these from text to varchar @mattab @airblag ?

airblag · 2019-07-02T14:54:41Z

When I look at this page, there are still a plenty of cases where temporary tables might be created : https://dev.mysql.com/doc/refman/5.6/en/internal-temporary-tables.html

But there are only 3 cases where they have to be created on disk :

Presence of a BLOB or TEXT column in the table. This includes user-defined variables having a string value because they are treated as BLOB or TEXT columns, depending on whether their value is a binary or nonbinary string, respectively.
Presence of any string column in a GROUP BY or DISTINCT clause larger than 512 bytes for binary strings or 512 characters for nonbinary strings.
Presence of any string column with a maximum length larger than 512 (bytes for binary strings, characters for nonbinary strings) in the SELECT list, if UNION or UNION ALL is used.

So converting TEXT to VARCHAR might avoid also creating temporary tables on disk in queries done by future features/plugins, even if it's solved for the current release.

In this case it sounds like it was the "GROUP BY name"

…d tmp tables refs #14535 (#14584)

tsteur · 2019-08-15T02:16:38Z

Another reason to use varchar instead of text as it may prevent creating temporary tables in memory:

mattab · 2019-09-02T08:05:06Z

Be great to migrate these columns to VARCHAR:

log_action.name
log_visit.referer_url

and there is also:

log_conversion.url

refs #14535

tsteur · 2019-09-05T00:51:05Z

See #14859
I had to change to column size to 20000 as 65000 was too big and reverted the change for log_conversion.url

* Change column type for referer_url from text to varchar refs #14535 * Update ReferrerUrl.php

tsteur · 2020-06-19T06:31:28Z

This seems done actually.

mattab · 2020-10-05T01:28:39Z

FYI we decided to truncate URLs after 4096 characters. Hopefully it won't break people's analytics. If you're concerned about this change, let us know!

we added As of Matomo 4.0.0, only the first 4096 characters of a page title / URL / any action name will be stored, even if page_maximum_length is set to a higher value. to https://matomo.org/faq/troubleshooting/faq_21239/ as part of #14859

tsteur · 2020-11-29T22:13:24Z

Note: Referrer urls will be truncated to 1500 characters

tsteur added the c: Performance For when we could improve the performance / speed of Matomo. label Jun 18, 2019

tsteur mentioned this issue Jun 18, 2019

Enable segment subquery cache (queries with Contains/Does not contain on log_action) #8867

Closed

2 tasks

tsteur added this to the 4.0.0 milestone Jun 21, 2019

tsteur modified the milestones: 4.0.0, 3.11.0 Jun 24, 2019

tsteur added the Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. label Jun 28, 2019

tsteur self-assigned this Jun 28, 2019

tsteur added a commit that referenced this issue Jun 28, 2019

avoid using group by and select min for tracker actions query to avoi…

235138f

…d tmp tables refs #14535

tsteur mentioned this issue Jun 28, 2019

avoid using group by and select min for tracker actions query to avoid tmp tables refs #14535 #14584

Merged

tsteur added a commit that referenced this issue Jul 5, 2019

avoid using group by and select min for tracker actions query to avoi…

aed8c10

…d tmp tables refs #14535 (#14584)

mattab modified the milestones: 3.11.0, 3.12.0 Jul 23, 2019

mattab changed the title ~~log_visit.referer_url as TEXT makes a lot of tmp table on disk~~ log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk Sep 2, 2019

tsteur added a commit that referenced this issue Sep 3, 2019

Change column type for referer_url from text to varchar

69c193c

refs #14535

This was referenced Sep 3, 2019

Change column type for referer_url from text to varchar #14847

Merged

Change mysql column type for new installs for log_action.name and log_conversion.url #14848

Merged

tsteur modified the milestones: 3.12.0, 4.0.0 Sep 3, 2019

tsteur mentioned this issue Sep 5, 2019

Smaller varchar size for log_action.name #14859

Merged

tsteur added a commit that referenced this issue Jan 15, 2020

Change column type for referer_url from text to varchar (#14847)

fcafc03

* Change column type for referer_url from text to varchar refs #14535 * Update ReferrerUrl.php

tsteur closed this as completed Jun 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk #14535

log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk #14535

airblag commented Jun 17, 2019 •

edited

tsteur commented Jun 18, 2019

mattab commented Jun 21, 2019

tsteur commented Jun 21, 2019

tsteur commented Jun 21, 2019

tsteur commented Jun 21, 2019

airblag commented Jun 24, 2019

tsteur commented Jun 24, 2019

tsteur commented Jun 24, 2019

airblag commented Jun 26, 2019

tsteur commented Jun 26, 2019

airblag commented Jun 27, 2019

tsteur commented Jun 28, 2019

diosmosis commented Jun 28, 2019

tsteur commented Jun 28, 2019

tsteur commented Jun 28, 2019

diosmosis commented Jun 28, 2019

tsteur commented Jun 28, 2019

diosmosis commented Jun 28, 2019

airblag commented Jun 28, 2019

tsteur commented Jun 28, 2019

tsteur commented Jun 28, 2019

tsteur commented Jun 28, 2019

airblag commented Jun 28, 2019

tsteur commented Jun 28, 2019

airblag commented Jun 28, 2019 •

edited

airblag commented Jul 2, 2019

tsteur commented Jul 2, 2019

airblag commented Jul 2, 2019

tsteur commented Aug 15, 2019

mattab commented Sep 2, 2019

tsteur commented Sep 5, 2019

tsteur commented Jun 19, 2020

mattab commented Oct 5, 2020

tsteur commented Nov 29, 2020

log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk #14535

log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk #14535

Comments

airblag commented Jun 17, 2019 • edited

tsteur commented Jun 18, 2019

mattab commented Jun 21, 2019

tsteur commented Jun 21, 2019

tsteur commented Jun 21, 2019

tsteur commented Jun 21, 2019

airblag commented Jun 24, 2019

tsteur commented Jun 24, 2019

tsteur commented Jun 24, 2019

airblag commented Jun 26, 2019

tsteur commented Jun 26, 2019

airblag commented Jun 27, 2019

A few notes/questions about this request :

Some notes about debugging

tsteur commented Jun 28, 2019

diosmosis commented Jun 28, 2019

tsteur commented Jun 28, 2019

tsteur commented Jun 28, 2019

diosmosis commented Jun 28, 2019

tsteur commented Jun 28, 2019

diosmosis commented Jun 28, 2019

airblag commented Jun 28, 2019

tsteur commented Jun 28, 2019

tsteur commented Jun 28, 2019

tsteur commented Jun 28, 2019

airblag commented Jun 28, 2019

tsteur commented Jun 28, 2019

airblag commented Jun 28, 2019 • edited

airblag commented Jul 2, 2019

tsteur commented Jul 2, 2019

airblag commented Jul 2, 2019

tsteur commented Aug 15, 2019

mattab commented Sep 2, 2019

tsteur commented Sep 5, 2019

tsteur commented Jun 19, 2020

mattab commented Oct 5, 2020

tsteur commented Nov 29, 2020

airblag commented Jun 17, 2019 •

edited

airblag commented Jun 28, 2019 •

edited