New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL query to find websites with traffic since last successful archiving can take 6+ hours #8066

Closed
mattab opened this Issue Jun 9, 2015 · 14 comments

Comments

Projects
None yet
3 participants
@mattab
Member

mattab commented Jun 9, 2015

Here is an example query from one of our Enterprise Piwik instances:

   Id: 57460643
   User: piwik
   Host: localhost:xxx
     db: piwikdb
Command: Query
   Time: 23731
  State: Sending data
   Info: SELECT idsite FROM piwik_site s
            WHERE EXISTS (
                SELECT 1
                FROM piwik_log_visit v
                WHERE v.idsite = s.idsite
                AND visit_last_action_time > '2015-06-03 14:16:44'
                AND visit_last_action_time <= '2015-06-08 14:39:08'
                LIMIT 1)
  • Got: it takes more than 6 hours to find websites with traffic since last successful archiving
  • Expected: it should take a few seconds or minutes to find websites with traffic since last archiving

The goal of this issue is to change this implementation so the core:archive is fast, even when there are thousands of websites. For example we can use a more efficient SQL query or break this query down to thousands of smaller ones for each website.

@mattab

This comment has been minimized.

Show comment
Hide comment
@mattab

mattab Jun 9, 2015

Member

Temptatively adding to 2.14.0 in case something could be quickly done about this issue.

Member

mattab commented Jun 9, 2015

Temptatively adding to 2.14.0 in case something could be quickly done about this issue.

@mattab mattab added this to the 2.14.0 milestone Jun 9, 2015

@mattab mattab modified the milestones: 2.14.1, 2.14.0 Jun 16, 2015

@mattab

This comment has been minimized.

Show comment
Hide comment
@mattab

mattab Jun 18, 2015

Member

mysqldumpslow output for this query on one client instance, shows how slow this is (on average the query takes one hour)

Count: 291  Time=3351.59s (975313s)  Lock=197.57s (57492s)  Rows=1686.1 (490643), piwik[piwik]@localhost
  SELECT idsite FROM piwik_site s
  WHERE EXISTS (
  SELECT N
  FROM piwik_log_visit v
  WHERE v.idsite = s.idsite
  AND visit_last_action_time > 'S'
  AND visit_last_action_time <= 'S'
  LIMIT N)

(AFAIK the query was executed 291 times over a 3 months period ie. 3 times a day)

Member

mattab commented Jun 18, 2015

mysqldumpslow output for this query on one client instance, shows how slow this is (on average the query takes one hour)

Count: 291  Time=3351.59s (975313s)  Lock=197.57s (57492s)  Rows=1686.1 (490643), piwik[piwik]@localhost
  SELECT idsite FROM piwik_site s
  WHERE EXISTS (
  SELECT N
  FROM piwik_log_visit v
  WHERE v.idsite = s.idsite
  AND visit_last_action_time > 'S'
  AND visit_last_action_time <= 'S'
  LIMIT N)

(AFAIK the query was executed 291 times over a 3 months period ie. 3 times a day)

@mattab mattab modified the milestones: 2.14.1, 2.15.0 Jul 10, 2015

@tsteur

This comment has been minimized.

Show comment
Hide comment
@tsteur

tsteur Aug 7, 2015

Member

Is there an instance to reproduce this and to test different queries / possible fixes?

Member

tsteur commented Aug 7, 2015

Is there an instance to reproduce this and to test different queries / possible fixes?

@mattab

This comment has been minimized.

Show comment
Hide comment
@mattab

mattab Aug 7, 2015

Member

@tsteur Let's figure this out next week in the office, where we will have access to the instance to reproduce

Member

mattab commented Aug 7, 2015

@tsteur Let's figure this out next week in the office, where we will have access to the instance to reproduce

@tsteur

This comment has been minimized.

Show comment
Hide comment
@tsteur

tsteur Aug 10, 2015

Member

It seems to be using the correct index etc. and couldn't figure out how to make it faster. It might be faster to do a separate query for each site before starting the archiver for a site instead of one big query to get them all. This would be also more accurate as there might be new visits by the time the archiver actually gets to archive a specific site

Member

tsteur commented Aug 10, 2015

It seems to be using the correct index etc. and couldn't figure out how to make it faster. It might be faster to do a separate query for each site before starting the archiver for a site instead of one big query to get them all. This would be also more accurate as there might be new visits by the time the archiver actually gets to archive a specific site

@quba

This comment has been minimized.

Show comment
Hide comment
@quba

quba Aug 10, 2015

Contributor

@tsteur +1

I wonder if it's really needed to go x days back? I noticed that if there has been no archiving process running for a few days, this query goes back to the timestamp saved in piwik_option table. Maybe it's possible to simplify it a little bit, e.g. doing a select with limit 1 or check only current day (in other cases it always makes sense to archive - just to make sure that daily archives are marked as valid).

Contributor

quba commented Aug 10, 2015

@tsteur +1

I wonder if it's really needed to go x days back? I noticed that if there has been no archiving process running for a few days, this query goes back to the timestamp saved in piwik_option table. Maybe it's possible to simplify it a little bit, e.g. doing a select with limit 1 or check only current day (in other cases it always makes sense to archive - just to make sure that daily archives are marked as valid).

@quba

This comment has been minimized.

Show comment
Hide comment
@quba

quba Aug 10, 2015

Contributor

But this approach requires some investigation, e.g. how will this one work with multi-thread archiving? Should we assume that it iterates through all the websites (same as --force-all-websites)?

Contributor

quba commented Aug 10, 2015

But this approach requires some investigation, e.g. how will this one work with multi-thread archiving? Should we assume that it iterates through all the websites (same as --force-all-websites)?

@tsteur

This comment has been minimized.

Show comment
Hide comment
@tsteur

tsteur Aug 10, 2015

Member

limit 1 is already there or not?

are you saying if piwik_option is set to yesterday this check is basically not needed as we have to run archiver anyway to mark older archives as valid? I'm not so much into this topic.

Member

tsteur commented Aug 10, 2015

limit 1 is already there or not?

are you saying if piwik_option is set to yesterday this check is basically not needed as we have to run archiver anyway to mark older archives as valid? I'm not so much into this topic.

@tsteur

This comment has been minimized.

Show comment
Hide comment
@tsteur

tsteur Aug 10, 2015

Member

Re multi-threading: If --force-all-websites is set, it does not perform this query and just uses all websites. If this option is not used, we could just add all websites to the "to be archived queue" and before actually archiving a specific site execute above query but just for one site eg SELECT 1 FROM piwik_log_visit Where visit_last_action_time > '2015-06-03 14:16:44' AND visit_last_action_time <= '2015-06-08 14:39:08' and idsite = 1 LIMIT 1;

Member

tsteur commented Aug 10, 2015

Re multi-threading: If --force-all-websites is set, it does not perform this query and just uses all websites. If this option is not used, we could just add all websites to the "to be archived queue" and before actually archiving a specific site execute above query but just for one site eg SELECT 1 FROM piwik_log_visit Where visit_last_action_time > '2015-06-03 14:16:44' AND visit_last_action_time <= '2015-06-08 14:39:08' and idsite = 1 LIMIT 1;

@quba

This comment has been minimized.

Show comment
Hide comment
@quba

quba Aug 10, 2015

Contributor

We have limit 1 but it's slow. It wouldn't be the case when using standard query for one single site (which uses idsite-last_action_time index).

Regarding second question - in the query provided by Matt, there is a WHERE clause checking a few days back. I'm not an expert but I think that it makes no sense in this case.

Contributor

quba commented Aug 10, 2015

We have limit 1 but it's slow. It wouldn't be the case when using standard query for one single site (which uses idsite-last_action_time index).

Regarding second question - in the query provided by Matt, there is a WHERE clause checking a few days back. I'm not an expert but I think that it makes no sense in this case.

@tsteur

This comment has been minimized.

Show comment
Hide comment
@tsteur

tsteur Aug 10, 2015

Member

We have limit 1 but it's slow. It wouldn't be the case when using standard query for one single site (which uses idsite-last_action_time index).

Just FYI: It should use this index no matter if it's for a single site or for all.

Whether we have to look a few days back or not can be probably only answered by @mattab

Member

tsteur commented Aug 10, 2015

We have limit 1 but it's slow. It wouldn't be the case when using standard query for one single site (which uses idsite-last_action_time index).

Just FYI: It should use this index no matter if it's for a single site or for all.

Whether we have to look a few days back or not can be probably only answered by @mattab

@mattab

This comment has been minimized.

Show comment
Hide comment
@mattab

mattab Aug 10, 2015

Member

Whether we have to look a few days back or not can be probably only answered by @mattab

I thought about this and I think that our current logic can be improved: this is OPTION A)

The current logic is to select all websites that had visits since the last run of archiving. AFAIK this is not needed to select all sites with traffic "since the last run". Instead we could "simply" select all websites with traffic "today" in the website's timezone. This would be good enough because all websites with traffic "yesterday or earlier" would anyway be processed by core:archive, because the command will always process all websites where the day in the website timezone's was ended since the last run (see output eg. - Will process 99 other websites because the last time they were archived was on a different day (in the website's timezone)).

It might be faster to do a separate query for each site before starting the archiver for a site instead of one big query to get them all. This would be also more accurate as there might be new visits by the time the archiver actually gets to archive a specific site

If this will be faster than +1 to try this approach. This is OPTION B)

So I guess we have two options to try:

A) Restrict the SQL query and make the timestamp range smaller. (AND visit_last_action_time > 'S' AND visit_last_action_time <= 'S'). I guess this won't fix the issue and SQL query will still be slow when thousands of websites.

B) Change the logic and instead of addWebsiteIdsWithVisitsSinceLastRun we would loop through all websites and check whether this website had data today. I wonder if such request limited to one site (instead of currently all websites) would be thousands times faster?

Member

mattab commented Aug 10, 2015

Whether we have to look a few days back or not can be probably only answered by @mattab

I thought about this and I think that our current logic can be improved: this is OPTION A)

The current logic is to select all websites that had visits since the last run of archiving. AFAIK this is not needed to select all sites with traffic "since the last run". Instead we could "simply" select all websites with traffic "today" in the website's timezone. This would be good enough because all websites with traffic "yesterday or earlier" would anyway be processed by core:archive, because the command will always process all websites where the day in the website timezone's was ended since the last run (see output eg. - Will process 99 other websites because the last time they were archived was on a different day (in the website's timezone)).

It might be faster to do a separate query for each site before starting the archiver for a site instead of one big query to get them all. This would be also more accurate as there might be new visits by the time the archiver actually gets to archive a specific site

If this will be faster than +1 to try this approach. This is OPTION B)

So I guess we have two options to try:

A) Restrict the SQL query and make the timestamp range smaller. (AND visit_last_action_time > 'S' AND visit_last_action_time <= 'S'). I guess this won't fix the issue and SQL query will still be slow when thousands of websites.

B) Change the logic and instead of addWebsiteIdsWithVisitsSinceLastRun we would loop through all websites and check whether this website had data today. I wonder if such request limited to one site (instead of currently all websites) would be thousands times faster?

@quba

This comment has been minimized.

Show comment
Hide comment
@quba

quba Aug 10, 2015

Contributor

Option A will be still slow. I checked this one and it was slow even when going back by 3 hours.

So I'd personally try option B.

Contributor

quba commented Aug 10, 2015

Option A will be still slow. I checked this one and it was slow even when going back by 3 hours.

So I'd personally try option B.

@tsteur tsteur self-assigned this Aug 12, 2015

tsteur added a commit that referenced this issue Aug 12, 2015

@tsteur

This comment has been minimized.

Show comment
Hide comment
@tsteur

tsteur Aug 12, 2015

Member

I will go for option A and B

A) should make the query a bit faster
B) has couple of advantages. Eg if we have to archive a site anyway because of invalidation or ran past midnight, then we do not have to run the query for checking for sites. Also usually each thread would perform the query to find all the websites with traffic but actually we need to execute it only before archiving the site. So if there were 8 threads, we performed this slow query 8 times. Now we will do it at most once.

I made some changes so far here but it is really hard to not regress / to make sure everything still works: https://github.com/piwik/piwik/compare/8066?expand=1

Member

tsteur commented Aug 12, 2015

I will go for option A and B

A) should make the query a bit faster
B) has couple of advantages. Eg if we have to archive a site anyway because of invalidation or ran past midnight, then we do not have to run the query for checking for sites. Also usually each thread would perform the query to find all the websites with traffic but actually we need to execute it only before archiving the site. So if there were 8 threads, we performed this slow query 8 times. Now we will do it at most once.

I made some changes so far here but it is really hard to not regress / to make sure everything still works: https://github.com/piwik/piwik/compare/8066?expand=1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment