Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk #14535

Open
airblag opened this issue Jun 17, 2019 · 31 comments

Comments

@airblag
Copy link

commented Jun 17, 2019

Hi,

My matomo installation is "big" (log_link_visit_action is ~40GB big for 55 days retention, log_visit is ~4GB), and we see over 97% of temporary tables created on disk in our mysql monitoring.

After trying to tune mysql with increasing tmp_table_size without any sucess I noticed that the log_visit table has the referer_url field set as TEXT.

it blocks the use of the memory engine of mysql for tmp tables writing all of them to disk :

https://dev.mysql.com/doc/refman/5.7/en/blob.html

"[...]
Instances of BLOB or TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types (see Section 8.4.4, “Internal Temporary Table Use in MySQL”). Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.
[...]"

I'm not familiar with the code of matomo to look for the exact query producing the tmp_tables...

Is it thinkable to somehow convert it to varchar(65535) to avoid having all tmp_tables on disk ? Or would it have a lot of side effects ?

I tried to move the tmpdir of mysql to ramdisk, but since I try to optimize the DB once a week, and optimizing innodb needs a lot of place on /tmp it's always failing to optimize log_link_action_visit except I give more 40GB for my ramdisk...

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 18, 2019

@mattab that's interesting. Maybe would also help with log_action.name?

@mattab

This comment has been minimized.

Copy link
Member

commented Jun 21, 2019

Thanks for the report @airblag

Is it thinkable to somehow convert it to varchar(65535) to avoid having all tmp_tables on disk ? Or would it have a lot of side effects ?

Reckon that it should be fine for more than 99.9% of cases to have only 65535 character long URLs.
@tsteur Maybe something worth investigating?

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 21, 2019

@mattab text also has only 65,535 characters

@tsteur tsteur added this to the 4.0.0 milestone Jun 21, 2019

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 21, 2019

We will schedule this for Matomo 4. Thanks for this @airblag

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 21, 2019

@mattab I'm thinking we could actually already apply this change for new installs and only include the update in Matomo 4. This will save many thousands of migrations later.

@airblag

This comment has been minimized.

Copy link
Author

commented Jun 24, 2019

I think we should not make a varchar(65535) but something a bit smaller. The whole row in mysql can be maximum 65535 bytes long, so we need to let some place for the other fields in the row. We should make something like varchar(65535 - size(all other columns)) at maximum, and maybe let some place to extend the table with some extra rows for future schema changes.
In my installation, the biggest url was around 3500 chars. Since the RFC doesn't set any limit to an URL (as far as I understand), there might always be some special cases where the referer doesn't fit fully in the referer_url field. But i think that chomping a referer would almost never happen and should not break the stats that much...

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 24, 2019

Yeah I reckon be better to set to something like 50 or 60K just to be safe.

@tsteur tsteur modified the milestones: 4.0.0, 3.11.0 Jun 24, 2019

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 24, 2019

Let's change the type now for new installs, and already add an update for Matomo 4.0 which will be executed once they update to Matomo 4.

@airblag

This comment has been minimized.

Copy link
Author

commented Jun 26, 2019

We converted both TEXT fields into varchar in our production setup. But it still doesn't seem to reduce anyhow the amount of tables on disk created.
seeing the rate of how the tables are created on disk, it seems to be produced not by archiving jobs but by every request the server receives.

I just checked all the tables of my piwik DB and grepped for text|blob and there are a lot more !

My dirty way of finding out :
echo show tables|mysql piwik|while read table ; do echo "desc $table;"|mysql piwik|grep -iE --color '(TEXT|BLOB)' && echo "======> $table";done

Maybe it would be good to check in the requests updating/inserting log data which text/blobs are used, and first replace only thoses fields with varchar.

I'll try to have a look, but since I never looked into the code of matomo, it might be easier for someone who is familiar with it :)

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 26, 2019

Let us know what you find. Wouldn't think it creates a temporary table during tracking but never know.

@airblag

This comment has been minimized.

Copy link
Author

commented Jun 27, 2019

So, it took my day of work, but I found out which query is producing the temporary table on disk !

+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 6     |
| Created_tmp_tables      | 8     |
+-------------------------+-------+
2 rows in set (0.00 sec)

mysql:piwik> SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  GROUP BY type, hash, name;
+----------+------+--------------------------+
| idaction | type | name                     |
+----------+------+--------------------------+
|  3869297 |   10 | taz.de/!5603531          |
|  1562941 |   10 | TZI                      |
|  1562939 |   11 | displayed                |
|  3131298 |   12 | ARTIKELAUFRUF_ohne_Layer |
+----------+------+--------------------------+
4 rows in set (0.00 sec)

mysql:piwik> SHOW STATUS LIKE 'Created_tmp%tables';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 7     |
| Created_tmp_tables      | 9     |
+-------------------------+-------+
2 rows in set (0.00 sec)

A few notes/questions about this request :

  1. why is the crc32() of the name checked in addition of the name ? I removed it (and the GROUP BY hash part), and I get the same result.
  2. why are we selecting MIN(idaction) instead of idaction ? selecting directly idaction gives me the same result and DON'T PRODUCE TEMPORARY TABLE !!!.
  3. removing the name from GROUP BY gives also the same result and also DON'T PRODUCE TEMPORARY TABLE !!!.

I grepped through the code and it seems to be produced in core/Tracker/Model.php

@tsteur do you have an idea if 2. or 3. can be done without breaking something ?

Some notes about debugging

For getting it, I tried first to activate debugging and sql_profiling in matomo, but I hit the bug of logging to file which didn't worked. I applied manualy the patch for the https://patch-diff.githubusercontent.com/raw/matomo-org/matomo/pull/14296.patch on plugins/Monolog/config/config.php.

But I couldn't find the correct requests in the log file. Maybe because I use the QueuedTracking plugin ?

Then I just logged every query of mysql for a few seconds (just enough to let QueuedTracking write in the DB):

SET GLOBAL general_log_file =  '/var/log/mysql/general.log';
SET GLOBAL general_log = 'ON';
SELECT sleep(30);
SET GLOBAL general_log = 'OFF';
@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

why is the crc32() of the name checked in addition of the name ? I removed it (and the GROUP BY hash part), and I get the same result.

There is an index on hash but not on name (cause too long and starts with too similar values usually). The hash is only 10 characters so potentially it could generate the same value which is why it probably also adds the name into the query.

why are we selecting MIN(idaction) instead of idaction ? selecting directly idaction gives me the same result and DON'T PRODUCE TEMPORARY TABLE !!!.

There used to be cases where the same action was added twice due to some race condition issues or something. See #6436 . The PR was done in #7112 . Looking at this, it still seems needed. Can you confirm @diosmosis ? Simply because there is no unique key on the DB and duplicates can happen. Wouldn't really know how to avoid it but maybe it be faster to change the createAction() method to do INSERT INTO ACTION... ; and then SELECT where hash = ? type = ? name = ? If two results are there, we would need to remove the one with the highest ID, and return the lowest ID. As actions are inserted more rarely, and the needed select is fast, this may be an option?

Actually... seeing now this is already implemented. So the only reason the min(idaction) is still there because there could be older records from before this fix was applied that could have duplicate actions. Maybe it could be removed now? I think we had some logic to remove duplicate actions in a task or so? @diosmosis

removing the name from GROUP BY gives also the same result and also DON'T PRODUCE TEMPORARY TABLE !!!.

The group by is there for the min to work I reckon.

@diosmosis

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

@tsteur We could just select the highest ID in PHP instead of in MySQL, would that work?

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

@diosmosis that should work too. Considering there shouldn't be too many entries

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

Does that mean we could remove the min() and the group by maybe?

@tsteur tsteur added the Major label Jun 28, 2019

@diosmosis

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

I don't think we can just remove it w/o something that deals w/ duplicates (that come from old bugs and from any new bugs), but if it causes problems we shouldn't have it.

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

I meant if we change

SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  GROUP BY type, hash, name;

to something like

SELECT idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  

then we could do some logic like

$idActions = array();
foreach ($rows as $row) {
   $idAction = $row['idaction'];
   $name = $row['name'];
   if (!isset($idActions[$name])) {
      $idActions[$name] = $row;
   } else if ($idAction < $idActions[$name]['idaction']) {
       $idActions[$name] = $row;
   }
}
return array_values($idActions);

Not tested or so... I simply meant if we don't fetch like crazy huge amount of values there, which we never do maybe I think, this could be faster and avoid tmp tables?

@diosmosis

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

Yes that's what I meant too.

@airblag

This comment has been minimized.

Copy link
Author

commented Jun 28, 2019

Ok, so the crc32 is there to make the index work better, but the probability of collision is to high, so we check both hash and name in the where clause ?
the min() is needed because entries could be duplicated.

What I was meaning in 3. was only to remove name from the group by part. It would look like this and not create tmp tables

SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  GROUP BY type, hash;

since we filtered out possible collisions with the WHERE clause, I guess only grouping by type,hash should work too, or I am still missing something ?

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

If you group by type, hash only, then two entries with different names, but the same hash will be grouped together. Of course that would be only the case if in the same where we select two different names that result in the same hash which would be rather unlikely.

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

Considering the chances are small that we are fetching a few values that result in the same hash we could apply the quick fix to simply remove the name from the group by. The type is still in the group by as well. So it would need to result in the same hash for a name with the same type. Just looked at a log_action table with 11M entries and there we had 7000 duplicate hashes with 2 different names for the same type. So while the changes that this happens is very low, it can still happen and be better to apply the proper fix mentioned in #14535 (comment) to avoid getting reports for weird issues later.

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

created PR #14584

@airblag can you otherwise confirm that changing the action name to varchar did not help?

@airblag

This comment has been minimized.

Copy link
Author

commented Jun 28, 2019

created PR #14584

@airblag can you otherwise confirm that changing the action name to varchar did not help?

maybe it helps for other queries maybe in the archiving process, but it's not really to observe in our graphs with an average of 30 tmp tables/second.

@tsteur

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

it be great if you could patch your Matomo with the referenced PR. Feel free to wait though until there was a review.

@airblag

This comment has been minimized.

Copy link
Author

commented Jun 28, 2019

I just patched with the PR, and the tmp_tables dropped from 30 per seconds to around 0. Looks like it works for me :)

EDIT:
in my installation log_action.name is varchar(2048). Not sure if it would reacts differently with TEXT.

@airblag

This comment has been minimized.

Copy link
Author

commented Jul 2, 2019

Just to confirm after a few days, it does the job:

mysql_tmp_tables-week

@tsteur

This comment has been minimized.

Copy link
Member

commented Jul 2, 2019

cheers 👍 thanks for letting us know. I reckon it will be maybe still a good idea to migrate these from text to varchar @mattab @airblag ?

@airblag

This comment has been minimized.

Copy link
Author

commented Jul 2, 2019

When I look at this page, there are still a plenty of cases where temporary tables might be created : https://dev.mysql.com/doc/refman/5.6/en/internal-temporary-tables.html

But there are only 3 cases where they have to be created on disk :

Presence of a BLOB or TEXT column in the table. This includes user-defined variables having a string value because they are treated as BLOB or TEXT columns, depending on whether their value is a binary or nonbinary string, respectively.
Presence of any string column in a GROUP BY or DISTINCT clause larger than 512 bytes for binary strings or 512 characters for nonbinary strings.
Presence of any string column with a maximum length larger than 512 (bytes for binary strings, characters for nonbinary strings) in the SELECT list, if UNION or UNION ALL is used.

So converting TEXT to VARCHAR might avoid also creating temporary tables on disk in queries done by future features/plugins, even if it's solved for the current release.

In this case it sounds like it was the "GROUP BY name"

tsteur added a commit that referenced this issue Jul 5, 2019

@mattab mattab modified the milestones: 3.11.0, 3.12.0 Jul 23, 2019

@tsteur

This comment has been minimized.

Copy link
Member

commented Aug 15, 2019

Another reason to use varchar instead of text as it may prevent creating temporary tables in memory:
image

@mattab mattab changed the title log_visit.referer_url as TEXT makes a lot of tmp table on disk log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk Sep 2, 2019

@mattab

This comment has been minimized.

Copy link
Member

commented Sep 2, 2019

Be great to migrate these columns to VARCHAR:

  • log_action.name
  • log_visit.referer_url

and there is also:

  • log_conversion.url
tsteur added a commit that referenced this issue Sep 3, 2019

@tsteur tsteur modified the milestones: 3.12.0, 4.0.0 Sep 3, 2019

@tsteur

This comment has been minimized.

Copy link
Member

commented Sep 5, 2019

See #14859
I had to change to column size to 20000 as 65000 was too big and reverted the change for log_conversion.url

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.