New optimized archive.php script for faster and optimized archiving when hundreds/thousands of websites #2327

Closed
mattab opened this Issue · 52 comments

3 participants

@mattab
Owner

If you run archive.sh with a lot of empty sites, each request takes 200 ms on average. Archiving 1000 empty sites for the day/week/month/year periods and N segments therefore already takes 1000 × 4 × 0.2 s × N = 800 × N seconds.

The problem is that it then takes a long time to reach the websites that do have traffic when most sites have none.
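
As a quick check of that estimate, here is a minimal sketch of the arithmetic. The function name and its parameters are made up for illustration; only the figures (1000 sites, 4 periods, 200 ms per request) come from the description above.

```python
def empty_site_archiving_seconds(sites, periods, segments, ms_per_request=200):
    """Total time spent on requests that return no data, in seconds."""
    requests = sites * periods * segments
    return requests * ms_per_request / 1000


# 1000 empty sites, day/week/month/year, one segment
print(empty_site_archiving_seconds(1000, 4, 1))
# the same sites with three segments to pre-process
print(empty_site_archiving_seconds(1000, 4, 3))
```

With three segments the script would spend 40 minutes on sites that have no data at all before it finishes.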

I am not sure what the best solution is, but some ideas are:

  • profile the code and make an empty site archiving request faster (most of the time is spent in PHP, not SQL, so there is probably optimization there)

Archive.sh modifications:

  • could remember the last time archive.sh ran to completion, then on the next run replace "last52" with "last2", for example
  • could run it multithreaded, triggering archiving for multiple sites on each core #2563
  • could process the websites that have traffic first (requires a modification in the SitesManager API, or a new API that returns sites "in order of importance")
  • could run archiving only for websites that received some data since the last archiving run
    • when there are segments to pre-process (see [Segments] in the config file for more info): we could process the list of segments only if the request without a segment returned some visits (otherwise we know in advance that there is no data for the segments)
  • could first archive the sites that have been queried via the API recently (add a new "set flag" in the API Proxy to say "this site's data was requested")
  • ...?
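
The segment idea in the list above can be sketched as follows. This is only an illustration: `plan_segment_requests` and the placeholder segment names are hypothetical and are not part of archive.sh or the Piwik API.

```python
def plan_segment_requests(visits_without_segment, segments):
    """Return the segments worth requesting for one site and one period.

    If the request without a segment found no visits, no segment can
    contain data either, so every segment request is skipped.
    """
    if visits_without_segment == 0:
        return []
    return list(segments)


segments = ["segment_a", "segment_b"]  # placeholder segment definitions
print(plan_segment_requests(0, segments))   # empty site: nothing to request
print(plan_segment_requests(12, segments))  # site with traffic: all segments
```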
@robocoder
Collaborator

Maybe we could preselect the sites to archive, e.g.,

SELECT idsite, num_visits FROM
    (
        SELECT idsite, COUNT(idvisit) AS num_visits FROM piwik_log_visit
            GROUP BY idsite
    ) AS t
    ORDER BY num_visits DESC;

(Implicitly, num_visits > 0.)
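
To see what that query returns, here is a small self-contained sketch that runs it against an in-memory SQLite table. SQLite stands in for MySQL here, and the two-column table is a heavily reduced assumption about piwik_log_visit, kept only to show the ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in for piwik_log_visit: only the two columns the query uses.
conn.execute("CREATE TABLE piwik_log_visit (idvisit INTEGER PRIMARY KEY, idsite INTEGER)")
# Site 1 has one visit, site 2 has three, site 3 has none (no rows at all).
conn.executemany("INSERT INTO piwik_log_visit (idsite) VALUES (?)", [(1,), (2,), (2,), (2,)])

rows = conn.execute(
    """
    SELECT idsite, num_visits FROM
        (
            SELECT idsite, COUNT(idvisit) AS num_visits FROM piwik_log_visit
                GROUP BY idsite
        ) AS t
        ORDER BY num_visits DESC
    """
).fetchall()
print(rows)  # [(2, 3), (1, 1)]
```

Site 3 never appears in the result, which is why the num_visits > 0 condition is implicit.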

@mattab
Owner

(In [5087]) Refs #2327

  • Adding new archive.php optimized rewrite of archive.sh - see description below
  • Adding new API to return only active website ID with visits since $timestamp (which is used to get all sites with visits since last archive execution)
 * Description
 * This script is a much optimized rewrite of archive.sh in PHP 
 * allowing for more flexibility and better performance when Piwik tracks thousands of websites.
 * 
 * What this script does:
 * - Fetches Super User token_auth from config file
 * - Calls API to get the list of all websites Ids with new visits since the last archive.php successful run
 * - Calls API to get the list of segments to pre-process
 * The script then loops over these websites & segments and calls the API to pre-process these reports.
 * It does try to achieve Near real time for "daily" reports, processing them as often as possible.
 * 
 * Notes about the algorithm:
 * - To improve performance, API is called with date=last1 whenever possible, instead of last52 
 * - The script will only process (or re-process) reports for Current week / Current month  
 *   or Current year at most once per hour. 
 *   You can change this timeout as a parameter of the archive.php script.
 *   The concept is to archive daily report as often as possible, to stay near real time on "daily" reports,
 *   while allowing less real time weekly/monthly/yearly reporting. 
 */

/**
 * TODO/Ideas
 * - Process first all period=day, then all other periods (less important)
 * - Ensure script can only run once at a time
 * - Add "report last processed X s ago" in UI grey box "About"
 * - piwik.org update FAQ / online doc
 * - check that when run from crontab, it emails errors if there are any
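
The "at most once per hour" rule from the algorithm notes above could look roughly like this. It is a sketch under assumptions: `periods_to_process` is an invented name, and the real script may track the last run differently.

```python
def periods_to_process(now, last_periods_run, timeout=3600):
    """Daily reports are always processed; week/month/year only after `timeout` seconds.

    `last_periods_run` is the Unix timestamp of the last week/month/year
    processing for this website, or None if it never ran.
    """
    periods = ["day"]
    if last_periods_run is None or now - last_periods_run >= timeout:
        periods += ["week", "month", "year"]
    return periods


print(periods_to_process(now=10_000, last_periods_run=9_000))  # only the daily report
print(periods_to_process(now=10_000, last_periods_run=6_400))  # an hour has passed
```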
@mattab
Owner

(In [5090]) Refs #2327

  • Tested & fixed behavior for bad use cases: wrong piwik server, apache shutdown, mysql shutdown
  • Now exceptions thrown during init() (DB connection errors etc) are thrown properly from Piwik when Piwik files are used internally and PIWIK_ENABLE_DISPATCH is off
  • Archiving requests from this archive.php script are now authorized even when "browser triggered archiving" is disabled: these HTTP requests still work when the Super User is authenticated and &trigger=archivephp is passed in the HTTP request as a flag
  • added help option which displays the doc
  • Ensures that script can only be executed from CLI

TODO

  • Check error handling in cron
  • Fix bug when initial run of the day and ignores inactive websites
  • Add sample output link
@mattab
Owner

(In [5095]) Refs #2327
Archive.php improvements

  • Added strong error handling, covering SQL/PHP/network errors from the script itself or returned by the HTTP requests
    If there is a critical error during script execution, such as a wrong token_auth or a MySQL shutdown, a fatal error is thrown (as a PHP error too) and the script exits immediately. If there are non-critical errors during execution, the script simply logs them on screen. At the end, it logs them all again on screen as a summary, then exits (and triggers a PHP error to ensure we trigger cron error handling & email message)
  • Added summary error logs at end of script output + other improvements in the output metrics and messaging
  • Added flags (a different one for days and periods, one per website) to record a website archiving as successful and not re-trigger the HTTP request when not necessary. Flags are maintained via the piwik_option lookup table.
  • archive.php is now consistently using direct calls to some internal APIs (those that are not processing data) rather than calling over HTTP
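
The handling of non-critical errors described above (log immediately, repeat everything in a summary, then signal failure) can be sketched like this. The class and method names are hypothetical; the log wording is borrowed from the script output quoted later in this thread.

```python
class ErrorCollector:
    """Collects non-critical errors so the run can continue and report them at the end."""

    def __init__(self):
        self.errors = []

    def log(self, message):
        print(message)

    def log_error(self, message):
        # Log right away, but keep going with the next website.
        self.errors.append(message)
        self.log("ERROR: " + message)

    def finish(self):
        """Print the summary and return a process exit status (0 = no error)."""
        if not self.errors:
            self.log("done, no error")
            return 0
        self.log("SUMMARY OF ERRORS")
        for error in self.errors:
            self.log("Error: " + error)
        self.log("%d total errors during this script execution" % len(self.errors))
        return 1


collector = ErrorCollector()
collector.log_error("Empty or invalid response for website id 1")
collector.log_error("Empty or invalid response for website id 7")
exit_status = collector.finish()
```

A caller would pass `exit_status` to `sys.exit()` so that the failure is visible to whatever runs the script.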
@mattab
Owner

(In [5098]) Refs #2327

  • Added parameter <reset|forceall>: you can either specify
    • reset: the script will run as if it was never executed before, therefore will trigger archiving on all websites with some traffic in the last 7 days.
    • forceall: the script will trigger archiving on all websites for all periods, sequentially
@mattab
Owner

(In [5101]) When the archive.php script runs from the CLI and the Piwik files have been upgraded, fail gracefully and report a critical error. Refs #2327

@mattab
Owner

(In [5102]) Refs #2327

  • adding option forceall+reset, which closely imitates the current archive.sh behavior (still with the added bonus). Fixes #1938: added segment in lock name. I have tested the code path but haven't actually verified that this improves performance
@mattab
Owner

(In [5110]) Refs #2327

  • Adding parameter to reset:
    <reset[window_back_seconds]|forceall>: you can either specify

    • reset: the script will run as if it was never executed before, therefore will trigger archiving on all websites with some traffic in the last 7 days. You can specify a number of seconds to use instead of 7 days window, for example call archive.php 1 reset 86400 to archive all reports for all sites that had visits in the last 24 hours
  • Also hopefully fixing the client timeout error at 15 s (the default with file_get_contents). Now using Piwik_Http with a timeout of 300 seconds, which should leave enough time for websites to process.
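
A minimal sketch of how the reset window might translate into the "visits since" timestamp. It assumes that a normal run uses the timestamp of the last successful run; the function name is invented for this example.

```python
SEVEN_DAYS = 7 * 86400  # default window used by "reset"


def visits_since_timestamp(now, last_successful_run, reset=False,
                           window_back_seconds=SEVEN_DAYS):
    """Timestamp from which websites with visits are selected for archiving."""
    if reset or last_successful_run is None:
        # Behave as if the script had never been executed before.
        return now - window_back_seconds
    return last_successful_run


now = 1_000_000
print(visits_since_timestamp(now, last_successful_run=999_000))
print(visits_since_timestamp(now, 999_000, reset=True))                             # last 7 days
print(visits_since_timestamp(now, 999_000, reset=True, window_back_seconds=86400))  # last 24 hours
```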

@mattab
Owner

(In [5111]) Refs #2327

  • Now catching exception thrown by Piwik_Http and simply reporting as network error
  • Added one line output summary for easy grep

    Example output:
    10:11:12 [6.38 Mb done: 2/2 100%, 16 v, 2 wtoday, 2 wperiods, 24 req, 24019 ms, no error

    Example with error:
    10:25:16 [6.39 Mb done: 18/21 86%, 248 v, 18 wtoday, 18 wperiods, 216 req, 122250 ms, 9 errors. eg. 'Got invalid response f
    e=1&period=month&date=last52&format=php&token_auth=0b809661490&trigger=archivephp. Response was ' '
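
Since the summary line is meant for grep, it is also easy to parse. The sketch below pulls the counters out of the two example lines above with a regular expression. The pattern is derived only from those two examples, so treat it as an assumption about the format rather than a specification.

```python
import re

SUMMARY = re.compile(
    r"done: (?P<done>\d+)/(?P<total>\d+) (?P<percent>\d+)%, "
    r"(?P<v>\d+) v, (?P<wtoday>\d+) wtoday, (?P<wperiods>\d+) wperiods, "
    r"(?P<req>\d+) req, (?P<ms>\d+) ms, "
    r"(?:no error|(?P<errors>\d+) errors?)"
)


def parse_summary(line):
    """Return the counters of a summary line as integers, or None if it does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["errors"] = fields["errors"] or "0"  # "no error" carries no number
    return {name: int(value) for name, value in fields.items()}


ok = parse_summary("10:11:12 [6.38 Mb done: 2/2 100%, 16 v, 2 wtoday, 2 wperiods, "
                   "24 req, 24019 ms, no error")
failed = parse_summary("10:25:16 [6.39 Mb done: 18/21 86%, 248 v, 18 wtoday, 18 wperiods, "
                       "216 req, 122250 ms, 9 errors. eg. 'Got invalid response f")
print(ok["errors"], failed["errors"])
```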

@mattab
Owner

(In [5185]) Refs #2327

  • BUG: one hour bug: Archiving was last executed without error 59 min 53s ago
  • BUG: noreply@localhost instead of proper domain in email from: in scheduled tasks
@mattab
Owner

(In [5186]) Refs #2327 last fix to noreply@localhost instead of proper domain in email from: in scheduled tasks

@anonymous-piwik-user

As I said in http://forum.piwik.org/read.php?2,82544, archive.php uses the same timestamp for the last daily archive and the last periods archive. I made a small patch (beware: I'm not an experienced programmer, so it very likely has something wrong in it)

@anonymous-piwik-user

Ok, I tested my modification and seems to work. I also added some lines to create a better output for scheduled tasks execution (show what is executed like the old archive.sh), feel free to use it if you need.

@anonymous-piwik-user

May I ask why this script uses a http call to do the actual archiving?
This IMO shares the same issues as auto archiving triggered by user access: it hits the same memory and time limits set for FastCGI or the web server, while CLI PHP usually has far higher limits, or at least can be configured separately.

Is there a way to just use cli php for the actual processing?

I don't really want to give my webaccessible php a memory limit of 2gb and max execution time of many minutes just for archiving with the new archive.php.

@mattab
Owner

When running via the archive.php script (only in this case!), Piwik will try to increase memory more than normal. It is set by the config parameter under section [General] called minimum_memory_limit_when_archiving set to 768M by default.

It requires that your PHP setup allows running
ini_set('memory_limit', $minimumMemoryLimit.'M')

@anonymous-piwik-user

You should change the execution timeout too ;-).
In any case I'd say that the webserver or curl (does it have a timeout too?) timeout will kick in first.

Just as an example for the maximum execution timeout:

[2011-11-08 12:36:54] [05a7e3d7] [12.57 Mb] ERROR: Got invalid response from API request: http://et-test.xxxxx.de/index.php?module=API&method=VisitsSummary.getVisits&idSite=1&period=day&date=last52&format=php&token_auth=xxxxxxxx&trigger=archivephp. Response was '<br /> <b>Fatal error</b>:  Maximum execution time of 30 seconds exceeded in <b>/home/xxxxx/tracking-host-test/www/core/DataTable/Row.php</b> on line <b>247</b><br />'
[2011-11-08 12:36:54] [05a7e3d7] [12.56 Mb] WARNING: Empty or invalid response for website id 1, Time elapsed: 37.310s, skipping

I really don't know how long that will take for that large site; it's the largest of my 6k sites, and the full processing (of all sites) with archive.sh takes 280 minutes each day.
Maybe archive.php would be overall faster but it will hit more limits this way.

Couldn't there be a simple call to cli php instead of a web call, maybe even with forking and running a couple of processes in parallel?

@mattab
Owner

(In [5429]) Refs #2327 ts77, thanks for the tip. please try this patch. Does it work after?

@anonymous-piwik-user

I seem to get further now but am hitting my memory limit again (and from the error message it seems to be my original memory limit from php.ini, 512M, and not the minimum archiving memory limit from Piwik, 768M). I even tried a test script to see whether ini_set for the memory limit takes effect for me: it does.

Any ideas? I'm still in favor of an (alternative?) cli version for large sites and I'm worried that it will give more support issues with larger sites.
Couldn't there be a commandline switch to do a command line call to php instead of a curl request but otherwise keeping the logic the same?

@mattab
Owner

what's the exact error message?

in your config.ini.php add under [General] minimum_memory_limit_when_archiving=1024

we would prefer to keep it http only, since it allows to use the multithread easily which makes the script much faster...

@anonymous-piwik-user

No dice, it's not taking effect, while the max execution timeout is working.

[2011-11-14 07:32:05] [07774e39] [12.75 Mb] Archived website id = 1, period = week, 4861859 visits, Time elapsed: 46.602s
[2011-11-14 07:32:29] [07774e39] [12.74 Mb] ERROR: Got invalid response from API request: http://et-test.xxx.de/index.php?module=API&method=VisitsSummary.getVisits&idSite=1&period=month&date=last52&format=php&token_auth=xxx&trigger=archivephp. Response was '<br /> <b>Fatal error</b>:  Allowed memory size of 536870912 bytes exhausted (tried to allocate 8208 bytes) in <b>/home/xxx/tracking-host-test/www/core/DataTable.php</b> on line <b>1022</b><br />

What can I do for debugging it further?
We can continue this by email if you like, you know the address ;-).
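
For anyone checking which limit is actually in effect, the byte count in the fatal error can be converted back to megabytes. A small sketch (the helper name is made up; the message format is the one shown in the log above):

```python
import re


def memory_limit_mb_from_error(response):
    """Extract the limit from a PHP 'Allowed memory size' fatal error, in MB (1024 * 1024 bytes)."""
    match = re.search(r"Allowed memory size of (\d+) bytes exhausted", response)
    if match is None:
        return None
    return int(match.group(1)) / (1024 * 1024)


error = ("<b>Fatal error</b>:  Allowed memory size of 536870912 bytes exhausted "
         "(tried to allocate 8208 bytes)")
print(memory_limit_mb_from_error(error))  # 512.0
```

512 MB matches the php.ini value rather than the 768M or 1024M archiving minimum, which suggests the higher limit was never applied on this request.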

@anonymous-piwik-user

I'm wondering if the error message is from a code path that is just NOT setting the memory limit?

@mattab
Owner

Ok, please try the patch:

Index: core/Piwik.php
===================================================================
--- core/Piwik.php  (revision 5455)
+++ core/Piwik.php  (working copy)
@@ -984,7 +984,10 @@
            $minimumMemoryLimitWhenArchiving = Zend_Registry::get('config')->General->minimum_memory_limit_when_archiving;
            if($memoryLimit < $minimumMemoryLimitWhenArchiving)
            {
-               return self::setMemoryLimit($minimumMemoryLimitWhenArchiving);
+               $return = self::setMemoryLimit($minimumMemoryLimitWhenArchiving);
+               echo "Memory limit status:" . $return;
+               echo " - Current memory value: ". Piwik::getMemoryLimitValue() . "M";
+               return $return;             
            }
            return false;
        }

What does it output now in archive.php run? Thanks for your tests!

@anonymous-piwik-user

Thanks. I had a similar code some lines above, before the condition and got no output for the error cases but I'm gonna try again with your patch and let you know.

@mattab
Owner

If you get no output, try putting some debug code outside the IFs in this same function; maybe the code path isn't triggered? (which would explain your issue)

@mattab
Owner

saldsl, what problem is your patch trying to fix? please explain

@mattab
Owner

(In [5467]) Refs #2327 Thanks for your tests, indeed one call was missing! please check with this patch if the script now executes ?

@anonymous-piwik-user

Replying to matt:

saldsl, what problem is your patch trying to fix? please explain
Archive.php is set to run the daily archiving "at most every 1800 seconds" (30 minutes) and the weekly/monthly/yearly archiving "at most every 6200 seconds" (103 minutes).
The problem is that both archiving operations check the last execution against the same timestamp. If you run it from cron every hour, the daily archiving (if executed) updates the timestamp every hour, so the weekly/monthly/yearly archiving is not run in the second hour because the last execution was less than 103 minutes ago. My patch creates two timestamps, one for the last execution of the daily archiving and one for the weekly/monthly/yearly archiving.

It also adds some lines so that the end of the output lists which scheduled jobs were executed (if any), rather than the current "executing scheduled jobs.... done!" output.
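
The effect of the shared timestamp can be reproduced with a small simulation. This is a sketch of the logic as described in this comment, not the code of archive.php: the function, the key names and the hourly cron schedule are assumptions.

```python
DAY_INTERVAL = 1800      # daily archiving: at most every 1800 seconds
PERIODS_INTERVAL = 6200  # week/month/year archiving: at most every 6200 seconds


def simulate(run_times, separate_keys):
    """Return the times at which week/month/year archiving actually runs."""
    last_run = {}
    day_key = "day" if separate_keys else "shared"
    periods_key = "periods" if separate_keys else "shared"
    periods_runs = []
    for now in run_times:
        # Both decisions are taken before any timestamp is updated.
        run_day = day_key not in last_run or now - last_run[day_key] >= DAY_INTERVAL
        run_periods = (periods_key not in last_run
                       or now - last_run[periods_key] >= PERIODS_INTERVAL)
        if run_day:
            last_run[day_key] = now
        if run_periods:
            last_run[periods_key] = now
            periods_runs.append(now)
    return periods_runs


hourly = [0, 3600, 7200, 10800, 14400]  # cron fires once per hour
print(simulate(hourly, separate_keys=False))  # [0]
print(simulate(hourly, separate_keys=True))   # [0, 7200, 14400]
```

With one shared timestamp the hourly daily run keeps resetting the clock, so the periods never reach the 6200 second threshold after the first run. With two timestamps they run every second hour.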

@anonymous-piwik-user

Far better now, thanks!
The unserialization warning is probably from the debugging output added earlier which I removed now after the first site.

[2011-11-23 09:08:30] [aafd7be9] [12.71 Mb] Starting Piwik reports archiving...

Notice: unserialize(): Error at offset 0 of 174 bytes in /home/xxx/tracking-host-test/www/misc/cron/archive.php on line 486

Warning: end(): Passed variable is not an array or object in /home/xxx/tracking-host-test/www/misc/cron/archive.php on line 487

Warning: array_sum(): The argument should be an array in /home/xxx/tracking-host-test/www/misc/cron/archive.php on line 500
[2011-11-23 09:09:30] [aafd7be9] [12.72 Mb] Archived website id = 1, period = day, Time elapsed: 60.022s

Notice: unserialize(): Error at offset 0 of 176 bytes in /home/xxx/tracking-host-test/www/misc/cron/archive.php on line 620
[2011-11-23 09:10:30] [aafd7be9] [12.73 Mb] Archived website id = 1, period = week, 0 visits, Time elapsed: 60.035s

Notice: unserialize(): Error at offset 0 of 176 bytes in /home/xxx/tracking-host-test/www/misc/cron/archive.php on line 620
[2011-11-23 09:11:30] [aafd7be9] [12.73 Mb] Archived website id = 1, period = month, 0 visits, Time elapsed: 60.027s

Notice: unserialize(): Error at offset 0 of 176 bytes in /home/xxx/tracking-host-test/www/misc/cron/archive.php on line 620
[2011-11-23 09:12:30] [aafd7be9] [12.73 Mb] Archived website id = 1, period = year, 0 visits, Time elapsed: 60.030s
[2011-11-23 09:12:30] [aafd7be9] [12.72 Mb] Archived website id = 1, today =  visits, 4 API requests, Time elapsed: 240.117s [1/568 done]
@mattab
Owner

ts77 is it working on all sites after both patches?

@mattab
Owner

saldsl thanks for your patch & explanations!
Am I right that the bug fix can be summarized to this one line change? Changeset [5470]

@anonymous-piwik-user

Yeah, it has run through all sites now without error (just had to wait the hour it takes to run through them all ;)).

@anonymous-piwik-user

I just get two time-outs

09:39:58 [6.21 Mb Time elapsed: 315.415s
09:39:58 [6.21 Mb ---------------------------
09:39:58 [6.21 Mb SCHEDULED TASKS
09:39:58 [6.21 Mb Starting Scheduled tasks...
09:39:58 [6.21 Mb done
09:39:58 [6.21 Mb ---------------------------
09:39:58 [6.21 Mb SUMMARY OF ERRORS
09:39:58 [6.21 Mb Error: Got invalid response from API request: http://xxx/stats//index.php?module=API&method=VisitsSummary.getVisits&idSite=5&period=month&date=last2&format=php&token_auth=b81973cfe5c887599faf426971867e13&trigger=archivephp. Response was '<br /> <b>Fatal error</b>: Maximum execution time of 60 seconds exceeded in <b>xxx\piwik\core\DataTable.php</b> on line <b>1022</b><br />
09:39:58 [6.21 Mb Error: Got invalid response from API request: http://xxx/stats//index.php?module=API&method=VisitsSummary.getVisits&idSite=5&period=year&date=last2&format=php&token_auth=b81973cfe5c887599faf426971867e13&trigger=archivephp. Response was ''
09:39:58 [6.21 Mb 2 total errors during this script execution, please investigate and try and fix these errors
09:39:58 [6.21 Mb ERROR: 2 total errors during this script execution, please investigate and try and fix these errors. First error was: Got invalid response from API request: http://xxx/stats//index.php?module=API&method=VisitsSummary.getVisits&idSite=5&period=month&date=last2&format=php&token_auth=b81973cfe5c887599faf426971867e13&trigger=archivephp. Response was '<br /> <b>Fatal error</b>: Maximum execution time of 60 seconds exceeded in <b>xxx\piwik\core\DataTable.php</b> on line <b>1022</b><br />

Fatal error: 2 total errors during this script execution, please investigate and try and fix these errors. First error was: Got invalid response from API request: http://xxx/stats//index.php?module=API&method=VisitsSummary.getVisits&idSite=5&period=month&date=last2&format=php&token_auth=b81973cfe5c887599faf426971867e13&trigger=archivephp. Response was '<br />
<b>Fatal error</b>: Maximum execution time of 60 seconds exceeded in <b>xxx\piwik\core\DataTable.php</b> on line <b>1022</b><br />
in xxx\piwik\misc\cron\archive.php on line 179

@anonymous-piwik-user

did you add the patches discussed in this thread?

@anonymous-piwik-user

Replying to matt:

saldsl thanks for your patch & explanations!
Am I right that the bug fix can be summarized to this one line change? Changeset [5470]

Yes... the other changes are not strictly necessary.

@anonymous-piwik-user

I've changed this code to

protected function lastRunKey($idsite, $period)
{ return "lastRunArchive". $period ."_". $idsite; }

and now I get many

Notice: Undefined variable: period in xxx\piwik\misc\cron\archive.php on line 407

is it more than a 2 line update? Have I missed something else?

@anonymous-piwik-user

That part is not relevant to your timeout issues anyway.
You need
http://dev.piwik.org/trac/changeset/5429
http://dev.piwik.org/trac/changeset/5467
to hopefully fix the timeout issues for you.

@anonymous-piwik-user

ok taken the whole new file -will see how it runs tonight..thanks

@anonymous-piwik-user

Its two files btw. ;)

@anonymous-piwik-user

argghh -thanks, have patched Archive.php and Piwik.php

will report back any problems -thanks guys

@anonymous-piwik-user

Success! No errors. Am I right in thinking that running it every 24 hours with -86400 will do the job? With these changes, can it be executed more frequently without impact?

@anonymous-piwik-user

Replying to matt:

saldsl thanks for your patch & explanations!
Am I right that the bug fix can be summarized to this one line change? Changeset [5470]
matt, is it possible to add some more output to the scheduled tasks part? With archive.sh the output showed what tasks were executed, but archive.php doesn't show anything. In my patch I also tried to extract the tasks performed and put the list in the output, maybe there's a cleaner way to achieve that...

@anonymous-piwik-user

Replying to saldsl:

Replying to matt:

saldsl thanks for your patch & explanations!
Am I right that the bug fix can be summarized to this one line change? Changeset [5470]
matt, is it possible to add some more output to the scheduled tasks part? With archive.sh the output showed what tasks were executed, but archive.php doesn't show anything. In my patch I also tried to extract the tasks performed and put the list in the output, maybe there's a cleaner way to achieve that...
Oops... I didn't see that you had already added this in rev 5474. Great!
To format the task results better, may I propose this patch:

--- archive_orig.php    2011-11-26 16:17:59.229298023 +0100
+++ archive.php 2011-11-26 16:18:59.291871401 +0100
@@ -571,8 +571,18 @@
        if($tasksOutput == "No data available")
        {
            $tasksOutput = " No task to run";
+           $this->log($tasksOutput);
        }
-       $this->log($tasksOutput);
+       else
+       {
+           $tasksOutput = trim(str_replace("task,output","",$tasksOutput));
+           $tasksOutput = mb_split("\n",$tasksOutput);
+           foreach ($tasksOutput as $taskResult) 
+           {
+                     $this->log($taskResult);
+           }
+       }
+ 
        $this->log("done");

    }

This patch turns the output string into an array so that the tasks performed are displayed on separate lines (and removes the "task,output" header from the first line).
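
For readers who do not want to apply the diff to see what it does, here is the same transformation as a standalone sketch. It mirrors the patch logic only; the sample task lines are invented, and the real output depends on which scheduled tasks ran.

```python
def format_tasks_output(tasks_output):
    """Return the lines to log for the scheduled tasks section."""
    if tasks_output == "No data available":
        return [" No task to run"]
    # Drop the "task,output" header, then log one task per line.
    body = tasks_output.replace("task,output", "").strip()
    return body.split("\n")


sample = "task,output\nexample_task_one,ok\nexample_task_two,ok"  # invented sample
for line in format_tasks_output(sample):
    print(line)
```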

@mattab
Owner

Left TODO before 1.7 release:

  • archive.php should work without any argument by default (for ease of use)
    • detect the piwik URL automatically since we know it in piwik
  • Once an hour max, and on request: run archiving for previousN for websites whose days have just finished in the last 2 hours in their timezones
    • then uncomment "TODO when implemented full archiving"
  • Update documentation and this faq
    • The goal would be that all new piwik users use this script from 1.7 onwards
  • The script should send an email to super user every time it is finished IF there is an error. Otherwise, only send if --email-superuser-summary
  • Allow to trigger from non CLI when SU token_auth is specified

Also, I will clean up the parameters and add named parameter. Currently it is a mess since the parameters are not named and must be ordered. Very confusing. So, it will break backward compatibility for those of you who are already using this script, but it won't be that bad ;)

@mattab
Owner

(In [5820])
Work in progress

  • refactored code & rewrote the command line parameter handling code
  • renamed parameters & updated doc
  • auto detect piwik URL (and use HTTPS URL if force_ssl is set)
  • Do not display the memory usage in the log output, easier on the eyes

Refs #2327

@mattab
Owner

(In [5822]) Refs #2327

  • we now check all websites that were last processed on a different day, and will process those. This ensures that even websites with no visits recently, will still have the week/month/year archives kept up to date. Use case: visit on Jan 5th/6th. Then no visit. Processing on Feb 1st: before it would not trigger January monthly archive, because there was no visit since last script run. Now it will trigger monthly archiving.
  • Added new API in SitesManager to fetch all websites which are set to specified timezones +tests
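
The "last processed on a different day" check can be sketched as below. To keep the example short it compares calendar days in UTC, which is a simplification; the function name is hypothetical.

```python
from datetime import datetime, timezone


def needs_archiving(last_processed_ts, now_ts, had_new_visits):
    """Archive when there are new visits, or when the last run was on another day."""
    if had_new_visits:
        return True
    last_day = datetime.fromtimestamp(last_processed_ts, tz=timezone.utc).date()
    today = datetime.fromtimestamp(now_ts, tz=timezone.utc).date()
    return last_day != today


jan_31 = datetime(2012, 1, 31, 23, 0, tzinfo=timezone.utc).timestamp()
feb_1 = datetime(2012, 2, 1, 0, 30, tzinfo=timezone.utc).timestamp()

print(needs_archiving(jan_31, feb_1, had_new_visits=False))       # True: the day changed
print(needs_archiving(feb_1, feb_1 + 600, had_new_visits=False))  # False: same day, no visits
```

This covers the use case in the commit message: a site with no visits since early January still gets its January monthly archive when the script runs on February 1st.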
@mattab
Owner

(In [5823]) Refs #2327

  • archive.php can now be executed through the browser (i.e. "WEB CRON") if the Super User token_auth is passed as a parameter. This makes it possible to run this script on hosts / shared hosts where cron is not allowed but web cron is.
@mattab
Owner

(In [5824]) Fixes #2327

  • AFAIK this is fixed. BOOM! Will test on demo.

Anyone listening here, testing archive.php from SVN trunk would be very appreciated :)

@mattab
Owner

(In [5860]) Refs #2327
Fixing bug ensuring all periods are processed for low traffic websites

@mattab
Owner

I updated the documentation for archive.php script cron piwik setup -- if you have any suggestion please comment here or on the form at the bottom of the page.

@anonymous-piwik-user

I am not sure whether I should comment here now that the ticket is closed or start a new one, but we have hit quite a big problem.

We have more than 3000 Piwik sites in one installation, many of them with 100-500k views a day.

The archive process for a single site is not our biggest concern, but error handling is.

When the archive runs after 00:00, all websites are processed, but the problem arises when an error occurs on any of those 3k+ sites.

So when archive.php hits a problem at siteID 2999, all sites are reprocessed on the next archive.php run, even though there are no new visits and everything else was OK.

@mattab
Owner

Hi John, why do errors occur on your websites? In general we try to prevent these errors as they are usually memory/CPU/misconfig. if you have further requirements let us know... or contact Piwik experts: http://piwik.org/consulting/#contact-consultant

@mattab mattab added this to the 1.7 Piwik 1.7 milestone
This issue was closed.