Description
Currently, data can be pushed into the database using the piwik.php script, which is called from the piwik.js tag.
For example, an HTTP request that stores a page view in Piwik looks like:
```
http://piwik.org/demo/piwik.php?url=http%3A%2F%2Fpiwik.org%2F&action_name=&idsite=1&res=1440x900&h=16&m=22&s=20&fla=1&dir=0&qt=0&realp=0&pdf=1&wma=1&java=1&cookie=1&title=Piwik%20-%20Web%20analytics%20-%20Open%20source&urlref=
```
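To make the parameters of such a request easier to read, here is a minimal Python sketch that decodes the example URL using only the standard library. The function name `parse_tracking_request` is made up for illustration and is not part of Piwik.

```python
from urllib.parse import urlsplit, parse_qs

# The example tracking request from above (resolution written with a plain "x").
TRACKING_URL = (
    "http://piwik.org/demo/piwik.php?url=http%3A%2F%2Fpiwik.org%2F&action_name="
    "&idsite=1&res=1440x900&h=16&m=22&s=20&fla=1&dir=0&qt=0&realp=0&pdf=1"
    "&wma=1&java=1&cookie=1"
    "&title=Piwik%20-%20Web%20analytics%20-%20Open%20source&urlref="
)


def parse_tracking_request(url):
    """Return the query parameters of a tracking request as a flat dict."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps empty parameters such as action_name= and urlref=
    parsed = parse_qs(query, keep_blank_values=True)
    # Each parameter appears once in this request, so keep the first value only.
    return {name: values[0] for name, values in parsed.items()}


params = parse_tracking_request(TRACKING_URL)
print(params["url"])    # the decoded page URL
print(params["title"])  # the decoded page title
```

All values come back as strings, so numeric fields such as `idsite` would still need converting before use.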
We want to improve the way Piwik logs data at tracking time.
- Piwik will now create ‘access logs’, similar to Apache access logs, containing all the REQUEST details (url, the ‘piwik’ cookie which depends on #409, user agent, referer url, IP, language, POST data, other $_SERVER values, etc.).
- Every 10s (or 30s or 1min) the ‘Tracking Bulkloader’ will be triggered by the Maintenance task (see #1184).
– it will connect to the DB once, then read all log lines and process visits in memory,
– creating flat files for the DB updates,
– possibly using memcache as a data store for visits/pages/options/cookies,
– then bulk insert/update visits, pages, conversions and cookies
– The process would log reports: `2010-09-08 04:03:02 – Loaded 1405 visits, 45000 page views, 345 goals – Duration 32s – Logs since 2010-09-08 04:02:52`. How do we handle timeout errors? Memory errors?
– Persist the status of ‘bulk logs loaded’ in piwik_option
– Disconnect from the DB
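The bulk-load steps above can be sketched roughly as follows. This is only an illustration under several assumptions: each access-log line is a JSON object with `visitor_id`, `url` and `timestamp` fields, SQLite stands in for the real database, and the `visit`, `page_view` and `loader_status` tables are hypothetical, simplified stand-ins rather than Piwik's schema. Flat files and memcache are left out.

```python
import json
import sqlite3
import time
from collections import defaultdict

# Hypothetical, simplified tables; Piwik's real schema has many more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS visit (visitor_id TEXT, first_seen TEXT, page_count INTEGER);
CREATE TABLE IF NOT EXISTS page_view (visitor_id TEXT, url TEXT, seen_at TEXT);
CREATE TABLE IF NOT EXISTS loader_status (name TEXT PRIMARY KEY, value TEXT);
"""


def bulk_load(conn, log_lines):
    """Process a batch of JSON log lines in memory, then write them in one transaction."""
    started = time.perf_counter()
    pages_by_visitor = defaultdict(list)
    for line in log_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        pages_by_visitor[record["visitor_id"]].append((record["url"], record["timestamp"]))

    # Simplification: one visit per visitor per batch.
    visit_rows = [
        (visitor_id, min(ts for _, ts in pages), len(pages))
        for visitor_id, pages in pages_by_visitor.items()
    ]
    page_rows = [
        (visitor_id, url, ts)
        for visitor_id, pages in pages_by_visitor.items()
        for url, ts in pages
    ]

    conn.executescript(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO visit VALUES (?, ?, ?)", visit_rows)
        conn.executemany("INSERT INTO page_view VALUES (?, ?, ?)", page_rows)
        if page_rows:
            # Persist how far the logs have been loaded.
            last_seen = max(ts for _, _, ts in page_rows)
            conn.execute(
                "INSERT OR REPLACE INTO loader_status VALUES (?, ?)",
                ("bulk_logs_loaded", last_seen),
            )
    return {
        "visits": len(visit_rows),
        "page_views": len(page_rows),
        "duration_s": time.perf_counter() - started,
    }


lines = [
    '{"visitor_id": "a1", "url": "http://piwik.org/", "timestamp": "2010-09-08 04:02:53"}',
    '{"visitor_id": "a1", "url": "http://piwik.org/docs/", "timestamp": "2010-09-08 04:02:58"}',
    '{"visitor_id": "b2", "url": "http://piwik.org/", "timestamp": "2010-09-08 04:03:00"}',
]
conn = sqlite3.connect(":memory:")  # connect once for the whole batch
summary = bulk_load(conn, lines)
status = conn.execute(
    "SELECT value FROM loader_status WHERE name = 'bulk_logs_loaded'"
).fetchone()[0]
conn.close()  # disconnect when the batch is done
print(f"Loaded {summary['visits']} visits, {summary['page_views']} page views")
```

Writing everything in one transaction means a failure part-way through rolls the whole batch back, which is one possible answer to the timeout and memory-error question: the stored status would still point at the last batch that fully loaded.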
The speed of tracking data loading would be greatly improved, and this would compensate for the performance loss resulting from the cookie loss in #409.
The log replay script would work in several modes:
- Replay a single visit using an HTTP request containing the input data (the content of one log line from the log file).
– The script API to record stats would become public and be documented. This will allow any user to record stats in Piwik from any source (mobile phones, php apps, desktop apps, etc.).
- Replay a set of visits given a log filename: the script will read the log file and load it into the DB.
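As a rough illustration of the two modes, the sketch below shows one function per mode. It assumes each log line is a JSON object. `replay_line`, `replay_file` and the `record_visit` callback are hypothetical names, not an existing Piwik API.

```python
import json
import os
import tempfile


def replay_line(line, record_visit):
    """Mode 1: replay a single visit from the content of one log line."""
    record_visit(json.loads(line))
    return 1


def replay_file(path, record_visit):
    """Mode 2: replay every non-empty line of a log file; return how many were replayed."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                count += replay_line(line, record_visit)
    return count


# Demo: collect the replayed records in a list instead of writing to a DB.
recorded = []
replay_line('{"url": "http://piwik.org/", "idsite": 1}', recorded.append)

with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"url": "http://piwik.org/a", "idsite": 1}\n')
    tmp.write("\n")
    tmp.write('{"url": "http://piwik.org/b", "idsite": 1}\n')
replayed = replay_file(tmp.name, recorded.append)
os.remove(tmp.name)
print(replayed, len(recorded))
```

Having the file mode call the single-line mode keeps one code path for recording a visit, whatever the source of the data.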
Use cases for tracking logs loader
This way, we can use the log replay script in the following use cases:
- Tracker performance improvement
As described above, every 10s (or 30s or 1min) the ‘Tracking Bulkloader’ loads all logs at once, which is much faster than connecting to the DB and running updates/inserts at every page view.
– New ‘Super user’ settings will also be needed:
– Enable tracking logs bulk loader
– Record visits in DB every 10s
– Enable Memcached wrapper
- Performance testing
Replaying existing real logs makes it easier to test performance changes across releases and after code updates. See #2000
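For the performance-testing use case, a small timing harness along these lines could replay the same log lines several times and compare durations between releases. This is a sketch only: `time_replay` is a hypothetical helper, and `fake_replay` is a stand-in for whatever loader is being measured.

```python
import time


def time_replay(replay, log_lines, runs=3):
    """Run replay(log_lines) several times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        replay(log_lines)
        durations.append(time.perf_counter() - start)
    return durations


def fake_replay(lines):
    # Stand-in for the real loader: it only counts the lines.
    return sum(1 for _ in lines)


durations = time_replay(fake_replay, ["line"] * 1000, runs=3)
print(f"best of {len(durations)} runs: {min(durations):.6f}s")
```

Reporting the best of several runs reduces noise from other activity on the machine. A real comparison would also reset the database between runs.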
Notes
- Maybe the tracking code has to be modified to remove the logic that ‘selects’ or tries to ‘match a visitor md5config to a previous visit’ as it can be done using the cookie store (#409) I believe.
- Can we use existing code to parse Apache logs efficiently, in a way that works with several log formats automatically (including Windows IIS formats)?
- See also the [DB schema](http://dev.piwik.org/trac/wiki/DatabaseSchema): even if it’s outdated, most fields and tables are unchanged.
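On the log-parsing question in the notes, here is a minimal sketch of parsing one line in the Apache ‘combined’ log format with a regular expression. It does not handle other formats (IIS included) or escaped quotes inside fields, and the sample line is made up for illustration.

```python
import re
from datetime import datetime

# Apache "combined" format: host ident user [time] "request" status size "referer" "user agent"
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_combined(line):
    """Return the fields of one combined-format line as a dict, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes timestamps like 08/Sep/2010:04:03:02 +0000
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


# Made-up sample line for illustration.
SAMPLE = (
    '127.0.0.1 - - [08/Sep/2010:04:03:02 +0000] '
    '"GET /piwik.php?idsite=1 HTTP/1.1" 200 43 '
    '"http://piwik.org/" "Mozilla/5.0"'
)
entry = parse_combined(SAMPLE)
print(entry["host"], entry["status"], entry["request"])
```

Supporting several formats automatically would probably mean trying a list of such patterns in turn, or reusing an existing parser library if a suitable one exists.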