Log analytics list of improvements #3163
Comments
Could I just re-ask an unanswered problem from ticket #703? If instead of specifying a file you do cat /path/to/log | import_logs.py [options] -, does it work for you, or do you just get 0 lines imported? With the latest version I'm getting 0 lines imported, which means I can't log straight from Apache (and hence the README is wrong too).
oliverhumpage: I couldn't reproduce this issue. Do you get it with --dry-run too? Could you send a minimal log file?
Counting downloads: in a podcast project I want to count only the downloads of file types "mp3" and "ogg". In another project it would be nice to count only the PDF downloads. Another topic in this area is: how are downloads counted? Not every occurrence of the file in the logs is a download. For instance, I am using an HTML5 player. Users might hear one part of the podcast on their first visit and other parts on succeeding visits. All together, that would be one download. A possible "solution" (or maybe a workaround): sum up all the "bytes transferred" and divide by the largest "bytes transferred" for a certain file.
Feature request: support Icecast logs. Currently we use AWStats, but it would be great to be able to move to Piwik.
Having spent some time looking into it, and working out exactly which revision caused the problem, I think it's down to the regex I used in --log-format-regex not working any more. It turns out the regex format in import_logs.py has had the group <timezone> added to it, which seems to be required by code further down the script. Could you update the README so the middle of the regex changes from \[(?P<date>.*?)\] to \[(?P<date>.*?) (?P<timezone>.*?)\]? This will then make it all work. Thanks, Oliver.
(In [6471]) Refs #3163 Fixed regexp in README. |
Oliver: indeed, I've just fixed it, thanks. |
I've been fiddling with this tool and it looks really nice. The biggest issue I've found is when using --add-sites-new-hosts. In the current situation, launching this:
Just produces this:
Having a --hostname example.com option (the same as the filename, in my case) that fixed the hostname (similar to --idsite-fallback=) would fix my issues.
I'm not a piwik dev, but what I think you're trying to do is: for every logfile, get its filename (which is also the hostname), then check whether a site with that hostname exists in piwik; if it does exist, import the logfile to it; if it doesn't, create it, then import the logfile to it. The way I'd do this is to write an import script which:
http://piwik.org/docs/analytics-api/reference/ gives the various API calls; it looks like SitesManager.getAllSites and SitesManager.addSite will do the job (e.g. call http://your.piwik.install/?module=API&method=SitesManager.getAllSites&format=xml&token_auth=xxxxxxxxxx to get all current sites, etc). HTH (a real piwik person might have a better idea) Oliver.
Thanks for your answer Oliver, your process is perfectly fine, but I'd rather avoid having to code something that could be avoided by extending the functionality of --add-sites-new-hosts just a little.
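Not part of the original thread, but here is a minimal sketch of the wrapper script Oliver outlines. The Piwik URL, token, and log directory are hypothetical placeholders; only SitesManager.getAllSites and SitesManager.addSite come from the API reference above, and the exact JSON response shapes are worth verifying:

```python
#!/usr/bin/env python3
# Sketch: for each logfile named after its hostname, find or create the
# matching Piwik site, then hand the file to import_logs.py.
import json
import os
import subprocess
from urllib.parse import urlencode
from urllib.request import urlopen

PIWIK_URL = 'http://your.piwik.install/'   # hypothetical
TOKEN = 'xxxxxxxxxx'                       # hypothetical
LOG_DIR = '/var/log/apache2/vhosts'        # one logfile per host, named after it

def api(method, **params):
    params.update(module='API', method=method, format='json', token_auth=TOKEN)
    with urlopen(PIWIK_URL + 'index.php?' + urlencode(params)) as resp:
        return json.load(resp)

# Build a hostname -> idsite map from the existing sites.
all_sites = api('SitesManager.getAllSites')
if isinstance(all_sites, dict):            # the JSON may be keyed by idsite
    all_sites = list(all_sites.values())
sites = {s['main_url'].split('//')[-1].strip('/'): s['idsite']
         for s in all_sites}

for logfile in sorted(os.listdir(LOG_DIR)):
    hostname = logfile                     # filename doubles as hostname here
    idsite = sites.get(hostname)
    if idsite is None:                     # unknown host: create the site first
        idsite = api('SitesManager.addSite', siteName=hostname,
                     urls='http://' + hostname)['value']
    subprocess.check_call(['./import_logs.py', '--url=' + PIWIK_URL,
                           '--idsite={}'.format(idsite),
                           os.path.join(LOG_DIR, logfile)])
```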
Attachment: Document the debian vhost_combined format |
It would be nice to document the standard format provided (at the moment only on Debian/Ubuntu) that would give piwik the hostname it requires. The format is this:
You can see the latest version in debian's apache2.conf [http://anonscm.debian.org/gitweb/?p=pkg-apache/apache2.git;a=blob;f=debian/config-dir/apache2.conf;h=50545671cbaeb1f170d5f3f1acd20ad3978f36ea;hb=HEAD]. See the attached small change to the README file.
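(The format itself is missing here. For reference, the vhost_combined definition in Debian's apache2.conf looks like the following; double-check it against the file linked below:)

```apache
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
```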
Attachment: Force hostname patch |
After looking at the code, I created a patch that adds a new option called --force-hostname, which expects a string with the hostname.
(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques. |
(In [6475]) Refs #3163 Added a reference in README to the Debian/Ubuntu default vhost_combined, thanks aseques. |
Thanks aseques, both your feature request and your patch were fine; I've just committed them. Note: I renamed the option to --log-hostname for consistency with the --log prefix.
Great, this will be so useful for me :) |
Hi, I'm not sure whether this is the right place for a ticket or problem? I'm on a shared webspace with SSH support. I tried your import script to analyse my Apache logs. Example:

4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
Fatal error: Forbidden
You can restart the import of "/home/log/access_log_piwik" from the point it failed by specifying --skip=326 on the command line.

I tried to figure out on which line the script ends with that fatal error, but I can't. If I restart it with "--skip=327" it runs to the end and everything works fine. The same problem occurs on some other access logs ("access_log_1.gz" and so on), but I'm not sure why it stops. Is it a misconfigured line in the access log? Which line should I check? Regards
Hexxer: you're getting an HTTP Forbidden from your Piwik install when importing the logs; you need to find out why.
How do you know that? Regards
Do you know the exact line that causes the problem? If you import only that line, does it also fail directly? Thanks!
Benaka is implementing bulk tracking in ticket #3134 - the python script will simply have to send a JSON array:
I suppose we can do some basic tests to see which value works best?
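(The example array is missing here. The bulk-tracking payload that #3134 eventually settled on looks roughly like the following, each entry being an ordinary tracking query string; check the exact shape against that ticket:)

```json
{
  "requests": [
    "?idsite=1&url=http://example.org/page1&rec=1",
    "?idsite=1&url=http://example.org/page2&rec=1"
  ],
  "token_auth": "xxxxxxxxxx"
}
```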
Hi. No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others. Replying to matt:
Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening, but, sorry, I have a 5-month-old young lady who needs my love and attention :-)
Attachment: |
Could I submit a request for an alteration to the README? I've just had a massive spike in traffic, and --recorders=1 just doesn't cut it when piping directly from Apache's CustomLog :) Because each Apache process hangs around waiting to log its request before moving on to the next request, it started jamming the server. Setting a higher --recorders seems to have eased it, and there are no side effects that I can see so far. Suggested patch attached to this ticket.
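For context, piping the access log straight into the importer is done with an Apache CustomLog directive along these lines (path, URL, site ID, and recorder count are illustrative; the trailing "-" tells the script to read from stdin):

```apache
CustomLog "|/path/to/piwik/misc/log-analytics/import_logs.py --url=http://piwik.example.com --idsite=1 --recorders=4 -" combined
```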
Hi, is there a doc about the regex format for import_logs.py? We would like to import a file with the awstats logFormat: %time2 %other %cluster %other %method %url %query %other %logname %host %other %ua %referer %virtualname %code %other %other %bytesd %other %other Ludovic
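There doesn't appear to be a dedicated doc; --log-format-regex takes a Python regex with named groups. As a sketch, here is one for a standard combined log line; the group names (ip, date, timezone, path, status, length, referrer, user_agent) follow the patterns visible in import_logs.py itself, but verify them against the script's source, and map each awstats field to a named group (or a throwaway pattern) accordingly:

```
(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"
```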
I am trying to set up a daily log import of the previous day's traffic. My issue is that my host date-stamps the log file; how can I set it to import the log file with yesterday's date on it? Here is the format of my log files
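(The filename format is missing above. Assuming a hypothetical pattern like access_YYYY-MM-DD.log, a cron-driven shell wrapper can compute yesterday's name with date:)

```sh
#!/bin/sh
# Hypothetical filename pattern; adjust to the host's actual naming.
# "date -d yesterday" is GNU date; BSD date uses "date -v-1d" instead.
LOG="/path/to/logs/access_$(date -d yesterday +%Y-%m-%d).log"
/path/to/import_logs.py --url=http://piwik.example.com --idsite=1 "$LOG"
```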
Thanks a lot for all your great work! I am using a lighttpd server and added the Accept-Language header to accesslog.format: accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept-Language}i\"" I wonder if it would be possible to add support for the Accept-Language header to import_logs.py?
Replying to Cyril:
Thanks for the possibility to import logs, and also thanks for the log-hostname patch.
Jadeham, try setting --recorder-max-payload-size=1. I remember having issues myself when testing with very small data sets (e.g. just 1 line).
I have a similar problem to Jadeham. I have configured nginx to log in JSON format and created the following script that reads from access.log (in JSON format) and passes every line via stdin:
The problem I'm having is that every record seems to be saved, but if I go to the main panel, today's history is not shown. This is the output when saving every line:
Besides that, when running archive.php, it's slower than when parsing the default nginx log format, and a lot of lines are marked as invalid:
Is there any way to know why these records are not shown, and which records are being marked as invalid?
OK, I figured out the reason for the invalid requests: the user_agent had a strange character. So maybe the script should be aware of unicode characters.
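(The script itself is missing here; it was presumably something along the lines of this hypothetical reconstruction. The nginx_json format name exists in current versions of import_logs.py, but verify it against your version:)

```sh
#!/bin/sh
# Hypothetical: stream the JSON access log to import_logs.py via stdin ("-").
cat /var/log/nginx/access.log | \
  /path/to/import_logs.py --url=http://piwik.example.com --idsite=1 \
    --log-format-name=nginx_json -
```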
To see the data in the dashboard, execute the piwik/misc/cron/archive.php script, or see http://piwik.org/setup-auto-archiving/ for more info.
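For example (path and URL are illustrative):

```sh
php /path/to/piwik/misc/cron/archive.php --url=http://piwik.example.com
```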
Sure, please create a new ticket for this bug and attach a log file with 1 line that showcases the bug. Thanks |
Replying to Hexxer:
Wow. 23 months have passed, and still no solution to this problem??? I'm getting the same error, and there's no docco anywhere to tell me how to fix it: The URL is correct (I copy and paste it into my browser, and it gives me the Piwik login screen), and the Apache error logs show nothing from today. Here's my console output: $ ./import_logs.py --url=https://www.mysite.com/pathto/piwik/ /var/log/apache/access.log --debug And of course, trying with --skip=5 produces the same error. I have googled, I have searched the archives, and the bug tracker contains no clue. Would really appreciate some kind soul taking mercy on me here.
Please check your webserver error logs; there should be a 403 error logged in there that may tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).
Replying to matt:
The Apache error log shows only a restart once every hour. I am unable to configure Apache directly, as I am running Piwik on Gandi.net's "Simple Hosting" service. I have repeatedly begged Gandi support to look into this matter, but their attitude is (not unreasonably) that their job is not to support user installation issues like this. If you can give me ammunition showing it really is Gandi's fault, then maybe we can move forward here. Or maybe it's just a Piwik bug. Or I'm doing something wrong. I don't know. f
@foobard I suggest you create a new ticket for your particular issue, and we will try to help you troubleshoot it (maybe we need access to the server to reproduce and investigate). Cheers!
Please do not comment on this ticket anymore. Instead, create a new ticket and assign it to the component 'Log Analytics (import_logs.py)'. Here is the list of all tickets related to Log Analytics improvements: http://dev.piwik.org/trac/query?status=!closed&component=Log+Analytics+(import_logs.py)
Issue was moved to the new repository for Piwik Log Analytics: https://github.com/piwik/piwik-log-analytics/issues refs #7163 |
In Piwik 1.8 we released a great new feature: importing access logs to generate statistics.
The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!
New features
Track non-bot activity only. When --enable-bots is not specified, it would be a nice improvement if we:
After that, bot & crawler detection would be much better.
Support the Accept-Language header and forward it to piwik via the &lang= parameter. That might also be useful to users who need this data in a custom plugin.
We could make it easy to delete the logs for one day, so that one log file can be re-imported.
This would be a new option to the python script. It would reuse the code from the Log Delete feature, but would only delete one day. The python script would call the CoreAdmin API, for example, deleting this single day for a given website. This would make it easy to re-import data that didn't work the first time or was bogus.
Detect when log-lines are re-imported and only import them once.
By default this feature would be enabled only for "Log import" script,
Performance
How to debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed. It should typically be between 2,000 and 5,000. When you don't do a dry run, it will insert new pageviews and visits by calling the Piwik API.
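For example, a dry run over an existing log (URL and path are illustrative):

```sh
./import_logs.py --url=http://piwik.example.com --dry-run /var/log/apache2/access.log
```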
Other tickets