
Integer overflow: how to track more than 2 billion actions with Piwik? #3288

Closed
mattab opened this Issue Jul 30, 2012 · 15 comments

@mattab (Member) commented Jul 30, 2012

Piwik's MySQL tables use INT primary keys, which cap at 2^31 - 1 = 2,147,483,647 when signed.

There was an omission in log_link_visit_action, where the primary key is not unsigned, causing the overflow to happen at 2 billion rows instead of 4 billion.

But, as part of this ticket, we should also figure out a strategy for going beyond 4,294,967,295 page views.

  • Should we recommend users ALTER the tables to BIGINT fields?
  • Or to rename the tables and let Piwik re-create fresh tables?
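
The first option could be sketched roughly as follows. This is a hypothetical example, not a tested procedure: it assumes the default `piwik_` table prefix, and an ALTER TABLE of this kind rewrites the whole table, which can lock it for a long time on large datasets.

```sql
-- Hypothetical sketch: widen the primary key that currently overflows.
-- Assumes the default `piwik_` table prefix; back up the table first.
ALTER TABLE piwik_log_link_visit_action
    MODIFY idlink_va BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
```
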
@anonymous-piwik-user commented Aug 23, 2012

Replying to matt:

I think this is the same issue I have run into.
What is the impact of dropping the table, or of your option to "rename the tables and let Piwik re-create fresh tables"?

Which tracking stats will be lost if I drop and recreate the corrupt table? I am attempting a repair now, but if it's the row limit, won't it continue to fail?

@mattab (Member, Author) commented Sep 10, 2012

Replying to Davidj:

Sorry I missed your earlier comment!

How did you go with repairing the table?

I haven't tested the procedure myself, but I think this should work:

Have you tested it, or something else? I'd really like to hear about your experience, cheers.

@anonymous-piwik-user commented Jun 26, 2014

I think you should make it an option in the installer whether a person wants INT or BIGINT primary keys. With an INT primary key, I will be limited to about 10 years of data. I can update it myself, but other people should know about the limitation.

@mattab added this to the 2.x - The Great Piwik 2.x Backlog milestone Jul 8, 2014

@mattab removed the P: normal label Aug 3, 2014

@theyosh (Contributor) commented Oct 23, 2014

The Piwik database can't handle more than 4,294,967,295 log lines in total, because the numeric fields are integers and that number is the maximum. I have now hit it for the second time, and the complete Piwik setup becomes useless: the reports show no page views anymore, nothing updates, and everything stays at zero.

Please consider changing all numeric fields to big integers (BIGINT), so that the limit is theoretically 18,446,744,073,709,551,615 log lines.

I am doing about 10,000,000 log lines a day, and at that rate it takes about 430 days to hit the maximum. That is almost 1.5 years. It cannot be that I have to delete the complete Piwik install every 1.5 years. With BIGINT the period would be about 1,844,674,400,000 days.
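
A quick back-of-envelope check of these figures (plain arithmetic, runnable in any MySQL client):

```sql
-- Days until overflow at roughly 10,000,000 log lines per day:
SELECT 4294967295 / 10000000;           -- ~429.5 days with INT UNSIGNED
SELECT 18446744073709551615 / 10000000; -- ~1,844,674,407,371 days with BIGINT UNSIGNED
```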

The solution of making a backup, dropping the table, and creating it again is not practical, because you lose the option to regenerate the reports due to the missing hits. Changing to BIGINT looks better; see the numbers in this post.

I am now trying to change the fields to BIGINT in order to preserve my history, but it takes a very long time with 800,000,000+ log lines in the table.

I am also surprised that this has been known for two years and is still not fixed...

@mattab (Member, Author) commented Aug 13, 2015

Maybe we could only apply the new schema for new users, and not run the ALTER for existing users?

@theyosh (Contributor) commented Aug 13, 2015

Yeah, I think an upgrade takes too much time if you have a lot of data. On the other hand, you could write a FAQ item about which fields to change from INT to BIGINT, so it is up to the administrator to decide whether to change the table structure, and warn about the downtime.

For new installs, I would suggest using BIGINT for the auto-increment fields that could grow large. I am not sure which fields those are, so I have changed almost every INT field to BIGINT :) Maybe a bit too much. The overhead of BIGINT compared to INT is, for me, not a reason to avoid it; I have enough disk space.

But a code change is also needed. The code for cleaning up the log hits uses a temporary table, and that temporary table is created with an INT field. If the auto-increment fields are BIGINT, the temporary table should also use a BIGINT field. This can be done regardless of the log table's auto-increment field type: putting data from an INT field into a BIGINT field is no problem, but the other way around is not possible.

So for now, I would suggest altering the code that creates the temporary table in core/DataAccess/RawLogDao.php:

private function createTempTableForStoringUsedActions()
{
    $sql = "CREATE TEMPORARY TABLE " . Common::prefixTable(self::DELETE_UNUSED_ACTIONS_TEMP_TABLE_NAME) . " (
                idaction BIGINT,
                PRIMARY KEY (idaction)
            )";
    Db::query($sql);
}

I changed INT(11) to BIGINT.

@mattab modified the milestones: 3.0.0, Mid term Aug 13, 2015

@mattab (Member, Author) commented Aug 13, 2015

Adding to the 3.0.0 milestone.

Steps:

  • For new & all future users: make BIGINT the field type for new installs (could even be in LTS 2.15.1?)
  • For existing users not yet using BIGINT: provide a console command that converts the needed database table columns to BIGINT so they can store 2 billion+ rows
  • Add automated tests that verify tracking + reporting work with very high id values in log_visit, log_link_visit_action, log_conversion, etc., and that values are stored correctly and the reporting API data is accurate

Optional:

  • For existing users: a new System Check entry showing whether the database supports 2+ billion records in one table: Yes/No

@mattab added the Major label Oct 23, 2015

@mattab modified the milestones: 3.0.0-b1, 3.0.0 Nov 25, 2015

@mattab (Member, Author) commented Nov 25, 2015

Adding to 3.0.0-b1: my suggestion is to make Piwik use BIGINT fields by default, to prevent any issues in the future for all users. The downside is that more disk and index space will be used, but I think this should be acceptable?

@theyosh (Contributor) commented Nov 25, 2015

I think the extra space it needs is less of an issue than recreating the tables when the INT maximum has been hit.

I already changed to BIGINT more than a year ago, and I have not noticed any downsides. Also, if you need BIGINT because of the number of records, you can assume a heavy server is already in place: a big CPU and SSD drives should already be in use, and in that case the overhead is small in my opinion. Especially for me, where I had to update the database structure every 6 months due to the INT maximum, the extra overhead of BIGINT is nothing compared to the downtime and work it would take to fix.

At this time, my auto-increment value is 11,763,937,301 for the piwik_log_link_visit_action table :)

A side note: make the INT unsigned, and you can store double the amount of data. Auto-increment fields are never negative, so there is no need for a sign bit; this raises the maximum from 2+ billion to 4+ billion. So maybe add an extra option: use unsigned integers or big integers for auto-increment fields.
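
The unsigned variant suggested above could look something like this (a hypothetical sketch, again assuming the default `piwik_` prefix; any columns referencing the id would need the same type):

```sql
-- MySQL integer ranges for reference:
--   INT             : -2,147,483,648 .. 2,147,483,647
--   INT UNSIGNED    :  0 .. 4,294,967,295
--   BIGINT UNSIGNED :  0 .. 18,446,744,073,709,551,615
ALTER TABLE piwik_log_visit
    MODIFY idvisit INT UNSIGNED NOT NULL AUTO_INCREMENT;
```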

@mattab (Member, Author) commented Nov 25, 2015

At this time, my auto increment value is 11,763,937,301 for the piwik_log_link_visit_action table :)

We are very impressed @theyosh - this validates the power and scalability of Piwik 🚀

@theyosh (Contributor) commented Nov 28, 2015

Extra info: this is on a TokuDB-enabled MySQL server, which is faster than the regular MySQL server.
And the database is now 1 year old, so that's about 11 billion hits a year :D

@mattab (Member, Author) commented Sep 23, 2016

Currently in Piwik we have many fields set to auto-increment, but the auto-increment fields likely to hit the 4 billion unsigned INT limit are located in the log_* tables, and are the following:

  • log_visit.idvisit
  • log_link_visit_action.idlink_va
  • log_action.idaction
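
A conversion of the three columns above might look like this. It is a hypothetical sketch, assuming the default `piwik_` prefix; foreign-key columns that reference these ids (e.g. log_link_visit_action.idvisit) would need the same change, and each ALTER can take hours on large tables.

```sql
ALTER TABLE piwik_log_visit
    MODIFY idvisit BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE piwik_log_link_visit_action
    MODIFY idlink_va BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE piwik_log_action
    MODIFY idaction BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
```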

I propose that in Piwik 3 we actually change the fields even for existing users, so we have a clean slate. It will take forever to run the upgrade, but there are also many other schema changes that will take a while, so it may not add much overhead to the upgrade process.

I'll work on this improvement next week 👍

@mattab (Member, Author) commented Sep 23, 2016

Fixed in PR: #10548

@mattab closed this Sep 26, 2016

@mattab (Member, Author) commented Sep 26, 2016

We just realised that InnoDB will reserve the row space (i.e. 8 bytes per BIGINT) even for NULLable columns. We therefore decided it is not needed for now to make log_action.idaction BIGINT, as it would add a large overhead per action: 10 idaction* fields x 4 extra bytes = 40 bytes per action. We are not willing to add such overhead for all users when fewer than 0.01% will ever have a log_action table with more than 4 billion entries. So I'll partially revert the changes and only make BIGINT the primary/foreign keys log_visit.idvisit and log_link_visit_action.idlink_va.

@mattab reopened this Sep 26, 2016

@mattab (Member, Author) commented Sep 27, 2016

Second PR, partially reverting the first PR: #10569
