Improved archive cleanup time by 60%! #11988

Open
wants to merge 1 commit into base: 3.x-dev

Conversation

@theyosh

theyosh commented Aug 30, 2017

Changed only the function insertActionsToKeep

Speed improvements:

  • Get all the needed ID field values in a single pass per table, instead of one pass per ID field (3-5 times faster); see the sketch after this list
  • Start with the lowest ID instead of zero. With auto-increment IDs and regular cleanup, the first valid ID keeps getting higher, so why start at zero when the first ID is around 106644353? This saves a lot of empty selects
  • Use extra temporary tables for data pivoting
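
For illustration, here is a minimal sketch of the batched flow described above. It assumes a plain PDO connection; the column list, batch sizes, and the tmp_actions_to_keep table name are placeholders, not the actual patch:

<?php
// Rough sketch of the new insertActionsToKeep idea: one pass per ID range, all ID columns at once.
$db = new PDO('mysql:host=localhost;dbname=piwik', 'piwik', 'secret');

$idColumns   = ['idaction_url_ref', 'idaction_name_ref', 'idaction_url', 'idaction_name'];
$batchSize   = 100000; // rows scanned per SELECT (placeholder)
$insertChunk = 1000;   // values per multi-row INSERT IGNORE

// 1. Start at the lowest existing ID instead of 0, so the already-purged ID range is skipped.
$minId = (int) $db->query("SELECT MIN(idlink_va) FROM piwik_log_link_visit_action")->fetchColumn();
$maxId = (int) $db->query("SELECT MAX(idlink_va) FROM piwik_log_link_visit_action")->fetchColumn();

for ($start = $minId; $start <= $maxId; $start += $batchSize) {
    // 2. A single SELECT fetches all ID columns for this range, instead of one SELECT per column.
    $sql = sprintf(
        "SELECT %s FROM piwik_log_link_visit_action WHERE idlink_va BETWEEN %d AND %d",
        implode(', ', $idColumns),
        $start,
        $start + $batchSize - 1
    );
    $rows = $db->query($sql)->fetchAll(PDO::FETCH_ASSOC);

    // 3. Pivot the rows into one flat list of numeric IDs (drops NULLs and non-numeric values).
    $keepValues = [];
    foreach ($rows as $row) {
        $keepValues = array_merge($keepValues, array_filter(array_values($row), 'is_numeric'));
    }

    // 4. Write the IDs to keep into the temporary table, $insertChunk values per INSERT IGNORE.
    foreach (array_chunk(array_unique($keepValues), $insertChunk) as $chunk) {
        $db->exec("INSERT IGNORE INTO tmp_actions_to_keep VALUES (" . implode('),(', $chunk) . ")");
    }
}
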
@sgiehl


Member

sgiehl commented Aug 30, 2017

Note: Test failures are unrelated to these changes

@sgiehl sgiehl added the Needs Review label Aug 30, 2017

@tsteur


Member

tsteur commented Sep 13, 2017

We will need to have a good look at it and test it etc. I noticed the current version is super slow and takes ages to purge data even for a fairly small site. cc @mattab

@mattab mattab added this to the 3.2.0 milestone Sep 13, 2017

@theyosh


theyosh commented Sep 14, 2017

If you need more info, I am happy to provide it.

// Now we loop over the result set of max $insertIntoTempIterationStep rows and create insert queries
$keep_values = [];
foreach ($result as $row) {
    // Collect only numeric ID values; NULLs and anything non-numeric are dropped here
    $keep_values = array_merge($keep_values, array_filter(array_values($row), "is_numeric"));
}


@tsteur

tsteur Sep 22, 2017

Member

is_numeric is important here to avoid SQL injection 👍 Maybe we could also add a little comment for this.


@theyosh

theyosh Sep 22, 2017

That was not the original intention, because the data is coming from the database, so no SQL injection should be possible. The data handled here should already have been checked for SQL injection (by the tracker or log importer).

My idea was that there are NULL values in the database, so with is_numeric I filter them out, which reduces the number of inserts that are not needed.
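
As a tiny standalone illustration of that NULL filtering (the row values below are made up):

<?php
// A hypothetical row as it might come back from the log table: some ID columns are NULL.
$row = ['idaction_url_ref' => 42, 'idaction_name_ref' => null, 'idaction_url' => 7, 'idaction_name' => null];

// is_numeric drops the NULLs (and anything else non-numeric), so only real IDs reach the INSERT.
$kept = array_values(array_filter(array_values($row), 'is_numeric'));

print_r($kept); // Array ( [0] => 42 [1] => 7 )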

@tsteur


Member

tsteur commented Sep 22, 2017

Code looks good, but we would need to test it and then also compare performance. Biggest performance problem is the start from min(id), I reckon 👍

Did you test whether the insert you did is faster than the INSERT ... SELECT?

@theyosh


theyosh commented Sep 22, 2017

@tsteur Selecting min(id) by itself takes multiple minutes. I do not know why, but somehow it is hard to get the minimum value from a table. I thought it was indexed, so it should be fast. I have not investigated yet, as I do not see this as an issue yet. What I have seen is that selecting the minimum value took 12 minutes on production. I do not know why; it is a long time. But still, when the minimum value is something like 1200000, it saves about 12 times the number of ID fields in queries that run through the whole database without finding anything. So the min(id) is there to reduce the number of 'empty' result queries caused by the ever-increasing auto numbering field idlink_va

But let's take the following example. We have the table piwik_log_link_visit_action, where all the ID fields are queried. So there is a loop over the fields idsite, idvisitor, idvisit, idaction_url_ref, etc. Each iteration requests only one field from the table and then walks through the complete table (which for us has about 556970018 records). The code will run (number of ID fields) x (total records in table / query limit) loops. This will take at least 1 week to process, almost 2 weeks.

So what I did was remove the loop over the fields idsite, idvisitor, idvisit, idaction_url_ref, etc. and create a single query that requests all those fields at once. As the data is (big)int, it cannot be that big and should not give memory errors. Now you only have 1 x (total records in table / query limit) queries. This is the big time saver.

So I get back a result set that contains all the data that should be saved in the temporary table. To store it, a query of the form 'INSERT IGNORE INTO [TEMPTABLE] VALUES (idsite),(idvisitor),(idvisit),(idaction_url_ref)' needs to be created. So I create a single query that inserts 1000 records at once. This 1000 is just a number; 10000 could also be possible, but that depends on the amount of available memory.

My insert query is not faster than INSERT ... SELECT, but that is not the point here. The point is that I have reduced the total number of queries that run during the cleanup. Because you are using INSERT ... SELECT, you are basically forced to select only one field at a time, and that forces multiple loops over the same data set, which is a killer.

We have noticed that this fix uses a lot of disk. Our SSD showed 70% utilization when running this new cleanup code. So it takes more resources to clean up, but now it only takes about 2 days instead of 2 weeks, so we are happy with it.
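
To put rough numbers on the query-count argument above (a back-of-the-envelope sketch; the query limit and the number of ID fields are assumptions, not values from the patch):

<?php
// Rough query-count comparison for piwik_log_link_visit_action, using the figures above.
$totalRecords = 556970018; // rows in the table (from the comment)
$queryLimit   = 100000;    // assumed rows scanned per SELECT
$idFields     = 4;         // assumed number of ID columns looped over by the old code

$batches = (int) ceil($totalRecords / $queryLimit); // ~5570 passes over the ID range

$oldSelects = $idFields * $batches; // old: one full table walk per ID column
$newSelects = $batches;             // new: one walk fetching all ID columns at once

printf("old: %d selects, new: %d selects (%dx fewer)\n", $oldSelects, $newSelects, $idFields);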

@tsteur


Member

tsteur commented Sep 24, 2017

> Selecting min(id) by itself takes multiple minutes. I do not know why, but somehow it is hard to get the minimum value from a table. I thought it was indexed, so it should be fast. I have not investigated yet, as I do not see this as an issue yet. What I have seen is that selecting the minimum value took 12 minutes on production.

Interesting, as there is an index on this field (the first column, and on the log_visit table even the primary key) I would also have expected this query to be quite fast.
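
If someone wants to dig into why that MIN() is slow, a quick check could look like the following (assumed PDO connection; illustrative only). MySQL can resolve MIN()/MAX() on an indexed leading column from the index alone, which EXPLAIN reports as "Select tables optimized away":

<?php
$db = new PDO('mysql:host=localhost;dbname=piwik', 'piwik', 'secret');

// If the index is usable, the Extra column should say "Select tables optimized away".
$plan = $db->query("EXPLAIN SELECT MIN(idlink_va) FROM piwik_log_link_visit_action")
           ->fetchAll(PDO::FETCH_ASSOC);
print_r($plan);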

@theyosh


theyosh commented Nov 12, 2017

Ping? Any updates on this?
