Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Performance: don't store md5 as strings in the log tables #2007

Closed
mattab opened this Issue · 16 comments

2 participants

@mattab
Owner

MD5 are 128bit hashes commonly represented as 32 char strings. In the log_visit and log_link_visit_action tables, we store these hashs as string which is unfortunate since it is using a lot of space for nothing and makes inserting / querying less efficient. We are not doing cryptography so don't need something as complex as md5. We are just trying to avoid collisions, most of the time.

we should shorten the visitor ID to a much smaller value, at most unsigned BIGINT, and saves a ton of disk space (index space + data), as well as increase performance.

unsigned BIGINT goes from 0 to 18446744073709551615 - a long shot before we get collisions.

To summarize

  • Using only 8 bytes for these values, We would save 48bytes per visit and 24bytes per page view.
  • I'm not sure what would be the best hash function for 64bits uuid. In any case let's be careful with 64 bits manipulation in php
  • in the cookie, the uuid should be displayed in ascii.
  • I understand we will need the hash function to appear in both the JS (with 1st party cookie improvement) and in php (in case a user doesn't support/delete cookies)

Please comment if you have any idea on how to make this work well, thx

@robocoder
Collaborator

Developing a 64-bit, collision resistant hash is really out-of-scope.

That said, since the avalanche effect means a small change in the content often results in a significantly different hash, you might be able to simply truncate the md5 hash to 64 bits. (You can probably run a query on the database to see how many collisions result, if any.)

I'm currently using a sha1 hash in piwik.js because the code is shorter than for md5. If you don't want to use the full value on the server-side, I believe you can truncate it.

@mattab
Owner
mysql> select count(distinct(SUBSTRING(visitor_idcookie FROM 1 FOR 8))) as truncated, count(distinct visitor_idcookie) as cookie
    -> FRom piwik_log_visit
    -> ;
+-----------+---------+
| truncated | cookie  |
+-----------+---------+
|   7571288 | 7577884 |
+-----------+---------+

So we have a collision rate of .0008% or 0.08% (on a 4GB piwik_log_visit setup)

I'd say it's very much a good enough number

plus at the moment we only do uniques per month, which gives lower collision rates of 0.01%:

mysql> select *, 1-truncated/cookie from (select MONTH(visit_server_date),count(distinct(SUBSTRING(visitor_idcookie FROM 1 FOR 8))) as truncated, count(distinct visitor_idcookie) as cookie FRom piwik_log_visit group by MONTH(visit_server_date)) t1;
+--------------------------+-----------+---------+--------------------+
| MONTH(visit_server_date) | truncated | cookie  | 1-truncated/cookie |
+--------------------------+-----------+---------+--------------------+
|                        1 |    234263 |  234272 |             0.0000 |
|                        2 |    300337 |  300345 |             0.0000 |
|                        3 |    445592 |  445611 |             0.0000 |
|                        4 |    627290 |  627336 |             0.0001 |
|                        5 |    742757 |  742822 |             0.0001 |
|                        6 |    745563 |  745635 |             0.0001 |
|                        7 |    827379 |  827452 |             0.0001 |
|                        8 |    835817 |  835887 |             0.0001 |
|                        9 |    869653 |  869719 |             0.0001 |
|                       10 |    889817 |  889917 |             0.0001 |
|                       11 |   1010989 | 1011109 |             0.0001 |
|                       12 |    891054 |  891149 |             0.0001 |
+--------------------------+-----------+---------+--------------------+
12 rows in set (6 min 31.77 sec)

so yes I'd agree with the suggestion of keeping first 8 chars of a md5 hash. how much bigger is md5 in JS VS sha1?

@mattab
Owner

I'm not sure how we can convert a 8 byte string into a bigint to feed mysql... maybe we have to use CONV() ? http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_conv

I'm not sure if we can do it in php itself, since it can't handle 64 bits math operation by default, but maybe i'm missing something

@robocoder
Collaborator

Why not use a BINARY (or VARBINARY) column?

Given that the hash value has no sequence and there's no math, then there's no reason to treat it as an int. (plus, no type conversion or MySQL-isms would be required.)

@mattab
Owner

I think it might be easier to work with bigint VS binary? funny, i was researching this and stumbled upon an interesting report: http://bugs.mysql.com/bug.php?id=44478 ;)

@robocoder
Collaborator

Not really. As my "invalid" bug report shows, there are implied signed-ness in operations even when dealing with unsigned bigint.

Using a BINARY column and avoiding MySQL-specific functions in the queries will make it easier to support non-MySQL databases in the future.

@mattab
Owner

(In [3729]) Fixes #2007

  • stores idvisitor, config_id as BIGINT rather than char(32)
  • updates Tracking and Archiving code
@mattab
Owner

Anthon your concerns are valid, there are 2 functions that are mysql specific in Piwik_Tracker now, we can abstract these when need be. I already wrote the code when you suggested to use binary field and wasn't keen to go through it again, knowing that we don't support other than mysql at this time.

As a result of this change, as well as other performance change in trunk, I note a 33% performance improvement in Archiving time between 1.1 and 1.2 with the same dataset.

@robocoder
Collaborator

This really should be BINARY and use php's built-in pack/unpack to convert the 16 hexits to/from an 8 byte binary string. This wouldn't require any future abstraction.

@robocoder
Collaborator

Re-opening to investigate the spurious (but recurring) build failures.

983, 985, 987

Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_day.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(83) "ERROR FOR Actions.getPageTitles_day.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4240) " first page view 2 1 2 360 2 2 3 360 1 1 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 1 0 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 1 0 1 1 a new visit 1 1 1 360 1 1 2 360 0 Website 2 page view 1 1 1 0 1 1 1 0 1 1 1 0 100% 100% "
string(4240) " first page view 2 2 2 360 2 2 3 360 1 1 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 1 0 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 1 0 1 1 a new visit 1 1 1 360 1 1 2 360 0 Website 2 page view 1 1 1 0 1 1 1 0 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_week.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(84) "ERROR FOR Actions.getPageTitles_week.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4676) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4676) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_month.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(85) "ERROR FOR Actions.getPageTitles_month.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4428) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4428) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_year.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(84) "ERROR FOR Actions.getPageTitles_year.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4386) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4386) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% " 
@robocoder
Collaborator

Sorry... last commit message was incomplete (trying to edit while holding sleeping toddler).

In [3732], refs #2007 - use BINARY columns in database and in-memory to avoid MySQL-isms in the queries (and special cases); convert to/from hex for cookie; fix typo in ALTER statement

@mattab
Owner

(In [3741]) Refs #2007 Style & debug changes

Maybe the bug is caused by a failed BINARY comparison of config_id (see bottom of http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html )
investigating...

@mattab
Owner

Closing for now... if the build starts to randomly failing again, we will reopen

@robocoder
Collaborator

Replying to matt:

Maybe the bug is caused by a failed BINARY comparison of config_id (see bottom of http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html )

The failures were occuring before the BINARY change. I don't think my commit fixed whatever was causing it.

@mattab
Owner

Correct, and thanks for putting the binary change in!!

it seems the failed build happened in Build #1007 but I can't see why from the output: http://qa.piwik.org:8080/hudson/job/Piwik/1007/consoleFull

@robocoder
Collaborator

I looked in the artifacts .zip; it failed in the auto-update. I'm guessing the problem is because we disable JavaScript before clicking on the 'update now' link, but interrupting HtmlUnit's JavaScript thread while it's manipulating the DOM. (Disabling JavaScript and/or sleep()'ing are unreliable, so we're going to get spurious failures like these. If it becomes a more frequent problem, I'm going to switch to Selenium.)

@mattab mattab added this to the Piwik 1.2 milestone
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.