
Log analytics list of improvements #3163

Closed
mattab opened this Issue · 162 comments

9 participants

@mattab
Owner

In Piwik 1.8 we released a great new feature: importing access logs to generate statistics.

The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!

New features

  • Track non-bot activity only. When --enable-bots is not specified, it would be a nice improvement if we:

    • exclude visits with more than 150 actions per visitorID to block crawlers (detected at the python level by counting requests for that IP in the queue)
    • exclude visits that have no User Agent, or only one of the very basic ones used by bots
    • exclude all requests when one of the first ones is for /robots.txt -- if we see a robots.txt request in the middle, we could stop tracking subsequent requests
    • check that /index.php?minimize_js=file.js is counted as a static file since it ends in .js

    After that, bot and crawler detection would be much better.

  • Support the Accept-Language header and forward it to Piwik via the &lang= parameter. That might also be useful to users who need this data in a custom plugin.

  • We could make it easy to delete the logs for one day, so as to re-import a single log file.

    • This would be a new option for the python script. It would reuse the code from the Log Delete feature, but would only delete one day. The python script would call the CoreAdmin API, for example, deleting this single day for a given website. This would make it easy to re-import data that didn't import correctly the first time or was bogus.
  • Detect when log-lines are re-imported and only import them once.

    • Implementation: add a new table piwik_log_lines (hash_tracking_request, day)
    • In the Piwik Tracker, before looping over the bulk requests, SELECT all the log lines that have already been processed on this day (WHERE hash_tracking_request IN (a,b,c,d) AND day=?) and skip these requests during the import
    • After the bulk requests are processed in piwik.php, INSERT the (hash, day) pairs in bulk (see the sketch after this list)
  • By default this feature would be enabled only for "Log import" script,
    • via a parameter that we know is the log import (&li=1 /import_logs=1)
    • but may be later useful to all users of Tracking API for general deduping service.
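
A minimal sketch of the dedup flow described above, written in Python for illustration (the real tracker side is PHP): the table and column names come from this ticket, while the MD5 hashing and the MySQLdb usage are assumptions.

    # Sketch only: filter out log lines already recorded for this day,
    # then record the newly processed hashes in bulk.
    import hashlib
    import MySQLdb  # assumption: MySQL-python is available

    def filter_already_imported(db, day, raw_lines):
        hashes = [hashlib.md5(line).hexdigest() for line in raw_lines]
        cursor = db.cursor()
        placeholders = ', '.join(['%s'] * len(hashes))
        cursor.execute(
            'SELECT hash_tracking_request FROM piwik_log_lines '
            'WHERE hash_tracking_request IN (%s) AND day = %%s' % placeholders,
            hashes + [day])
        seen = set(row[0] for row in cursor.fetchall())
        return [(h, line) for h, line in zip(hashes, raw_lines) if h not in seen]

    def record_imported(db, day, hashes):
        # After the bulk requests are processed, INSERT the (hash, day) pairs.
        cursor = db.cursor()
        cursor.executemany(
            'INSERT INTO piwik_log_lines (hash_tracking_request, day) '
            'VALUES (%s, %s)', [(h, day) for h in hashes])
        db.commit()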

Performance

How to debug performance? First, run the script with --dry-run to see how many log lines per second are parsed; it should typically be between 2,000 and 5,000. Without a dry run, the script inserts the new pageviews and visits by calling the Piwik API.
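
For example (URL and log file are placeholders):

    python /path/to/piwik/misc/log-analytics/import_logs.py --url=http://your.piwik.install/ --dry-run access.log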

Other tickets

  • #3867 cannot resume with line number reported by skip for ncsa_extended log format
  • #4045 autodetection hangs on a weirdly formatted line
@oliverhumpage

Could I just re-ask an unanswered problem from ticket #703? If instead of specifying a file you do

cat /path/to/log | import_logs.py [options] -

then does it work for you, or do you just get 0 lines imported? With the latest version I'm getting 0 lines imported, which means I can't log straight from Apache (and hence the README is wrong too).

@cbay
Collaborator

oliverhumpage: I couldn't reproduce this issue. Do you get it with --dry-run too? Could you send a minimal log file?

@ddeimeke

Counting Downloads:

In a podcast project I want to count only downloads of the file types "mp3" and "ogg". In another project it would be nice to count only the PDF downloads.

Another topic in this area is how downloads are counted. Not every occurrence of the file in the logs is a download. For instance, I am using an HTML5 player. Users might hear one part of the podcast on their first visit and other parts on subsequent visits. Altogether that would be one download.

A possible "solution" (or may be a workaround): Sum up all the "bytes transferred" and divide it by the largest "bytes transferred" for a certain file.

@anonymous-piwik-user

Feature request: support Icecast logs. Currently we use AWStats, but it would be great if we could move to Piwik.

@oliverhumpage

@Cyril

Having spent some time looking into it, and working out exactly which revision caused the problem, I think it's down to the regex I used in --log-format-regex not working any more. It turns out the regex format in import_logs.py has had the group <timezone> added to it, which is now required by code further down the script.

Could you update the readme so the middle of the regex changes from:

\\[(?P<date>.*?)\\]

to

\\[(?P<date>.*?) (?P<timezone>.*?)\\]

This will then make it all work.

Thanks,

Oliver.
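
For anyone adapting a custom --log-format-regex, the full common-log pattern with the new <timezone> group looks roughly like this (approximate; check import_logs.py for the authoritative version):

    (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+)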

@cbay
Collaborator

(In [6471]) Refs #3163 Fixed regexp in README.

@cbay
Collaborator

Oliver: indeed, I've just fixed it, thanks.

@anonymous-piwik-user

I've been fiddling with this tool and it looks really nice. The biggest issue I've found is when using --add-sites-new-hosts:
it's quite difficult in my case (using a control panel) to add the required %v:%p fields to the custom log format.
What I do have is a log for every domain, so being able to specify the hostname manually would do the trick for me.

In the current situation launching this:

python /var/www/piwik/misc/log-analytics/import_logs.py \
  --url=https://server.example.com/tools/piwik --recorders=4 --enable-http-errors \
  --enable-http-redirects --enable-static --enable-bots --add-sites-new-hosts \
  /var/log/apache2/example.com-combined.log

Just produces this:

Fatal error: the selected log format doesn't include the hostname: you must specify the Piwik site ID with the --idsite argument

Having a --hostname example.com option (the same as the filename, in my case) that fixes the hostname (something like an --idsite-fallback=) would solve my issue.

@oliverhumpage

I'm not a piwik dev, but what I think you're trying to do is:

For every logfile, get its filename (which is also the hostname) and check whether a site with that hostname exists in Piwik: if it does, import the logfile into it; if it doesn't, create it, then import the logfile into it.

The way I'd do this is to write an import script which:

  • looks at all logfile names
  • accesses the piwik API to get a list of all existing websites (URLs and IDs)
  • for any logfile which doesn't appear in the list uses another API call to create the site and get the ID of the newly created site
  • imports all the logfiles with --idsite set to the right value for the logfile

http://piwik.org/docs/analytics-api/reference/ gives the various API calls; it looks like SitesManager.getAllSites and SitesManager.addSite will do the job (e.g. call http://your.piwik.install/?module=API&method=SitesManager.getAllSites&format=xml&token_auth=xxxxxxxxxx to get all current sites, etc).

HTH (a real piwik person might have a better idea)

Oliver.
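
A rough sketch of that workflow (Python 2, same era as import_logs.py; the URL, token, log directory, and the filename-equals-hostname convention are placeholders, and the JSON shape of getAllSites may vary between Piwik versions):

    import json
    import os
    import subprocess
    import urllib
    import urllib2

    PIWIK = 'http://your.piwik.install/'
    TOKEN = 'xxxxxxxxxx'  # placeholder token_auth
    LOG_DIR = '/var/log/apache2'

    def api(method, **params):
        params.update(module='API', method=method, format='json', token_auth=TOKEN)
        return json.load(urllib2.urlopen(PIWIK + '?' + urllib.urlencode(params)))

    # Map the main URL of each existing site to its ID.
    existing = dict((site['main_url'], site['idsite'])
                    for site in api('SitesManager.getAllSites'))

    for logfile in os.listdir(LOG_DIR):
        if not logfile.endswith('.log'):
            continue
        host = logfile[:-len('.log')]       # filename is the hostname
        url = 'http://' + host
        idsite = existing.get(url)
        if idsite is None:
            # Create the missing site and use the newly assigned ID.
            idsite = api('SitesManager.addSite', siteName=host, urls=url)['value']
        subprocess.call(['python', 'import_logs.py', '--url=' + PIWIK,
                         '--idsite=%s' % idsite, os.path.join(LOG_DIR, logfile)])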

@anonymous-piwik-user

Thanks for your answer Oliver. Your process is perfectly fine, but I'd rather avoid having to code something that could be handled by extending the functionality of --add-sites-new-hosts just a little.
And thanks for the links too, I'll have a look.

@anonymous-piwik-user

Attachment: Document the debian vhost_combined format
vhost_combined.patch

@anonymous-piwik-user

It would be nice to document the standard format (at the moment only Debian/Ubuntu provide it) that gives Piwik the hostname it requires.

The format is this:

LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined

You can see the latest version from debian's apache2.conf [http://anonscm.debian.org/gitweb/?p=pkg-apache/apache2.git;a=blob;f=debian/config-dir/apache2.conf;h=50545671cbaeb1f170d5f3f1acd20ad3978f36ea;hb=HEAD]

See attached a small change to the README file.

@anonymous-piwik-user

Attachment: Force hostname patch
force_hostname.patch

@anonymous-piwik-user

After looking at the code I created a patch to add a new option called --force-hostname that expects a string with the hostname.
When it's set, the value of host will ALWAYS be the one passed to --force-hostname.
This makes it possible to treat logfiles in the ncsa_extended or common formats as if they were complete formats (creating idsites when needed and so on).

@cbay
Collaborator

(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.

@cbay
Collaborator

(In [6475]) Refs #3163 Added a reference in README to the Debian/Ubuntu default vhost_combined, thanks aseques.

@cbay
Collaborator

Thanks aseques, both your feature request and your patch were fine; I've just committed it. Attention: I renamed the option to --log-hostname to keep it consistent with the --log prefix.

@anonymous-piwik-user

Great, this will be so useful for me :)

@anonymous-piwik-user

Hi,

I'm not sure if this is the right place for a ticket or problem?
I have a problem importing access logs from my shared webspace. I've copied the text from here: http://forum.piwik.org/read.php?2,90313


Hi,

I'm on a shared webspace with SSH support. I'm trying your import script to analyse my Apache logs.
I got it to work, but sometimes there are "Fatal errors" and I have no idea why. If I restart it without "skip", it fails every time at the same "skip line".

Example:

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

Fatal error: Forbidden

You can restart the import of "/home/log/access_log_piwik" from the point it failed by specifying --skip=326 on the command line.


I tried to figure out on which line the script ends with that fatal error, but I can't. If I restart it at "skip=327", it runs to the end and all works fine. The same problem occurs on some other access logs ("access_log_1.gz" and so on), but I'm not sure why it stops. Is it a malformed line in the access log? Which line should I check?

Regards

@cbay
Collaborator

Hexxer: you're getting a HTTP Forbidden from your Piwik install when importing the logs, you need to find out why.

@anonymous-piwik-user

How do you know that?
It stops every time at the same line, and if I skip it, it runs for 10 or 15 minutes without a problem (it takes 2 minutes or so to reach this line).

Regards

@mattab
Owner

Do you know the exact line that causes the problem? If you import only this line, does it also fail directly? Thanks!

@mattab
Owner

Benaka is implementing bulk tracking in ticket #3134 - the python script will simply have to send a JSON payload:

["requests":[url1,url2,url3],"token_auth":"xyz"]

I suppose we can do some basic tests to see which value works best?
Maybe 50 or 100 tracking requests at once? :)
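
For illustration, sending such a payload could look like this (a sketch only: piwik.php as the endpoint and the field names follow the comment above, but the final #3134 implementation may differ; the token and request strings are placeholders):

    # Minimal sketch of posting a bulk tracking payload (Python 2 era).
    import json
    import urllib2

    payload = json.dumps({
        'requests': [
            '?idsite=1&rec=1&url=http%3A%2F%2Fexample.com%2Fpage1',
            '?idsite=1&rec=1&url=http%3A%2F%2Fexample.com%2Fpage2',
        ],
        'token_auth': 'xyz',  # placeholder
    })
    request = urllib2.Request('http://your.piwik.install/piwik.php', payload,
                              {'Content-Type': 'application/json'})
    print urllib2.urlopen(request).read()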

@anonymous-piwik-user

Hi,

.............
Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!
.............

No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others.

Replying to matt:

I suppose we can do some basic test to see which value works best?
Maybe 50 or 100 tracking requests at once? :)

Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening - but, sorry, I have a 5-month-old young lady who needs my love and attention :-)

@oliverhumpage

Could I submit a request for an alteration to the README? I've just had a massive spike in traffic, and --recorders=1 just doesn't cut it when piping directly from Apache's CustomLog :) Because each Apache process hangs around waiting to log its request before moving on to the next request, it started jamming the server.

Setting a higher --recorders seems to have eased it, and there are no side effects that I can see so far.

Suggested patch attached to this ticket.

@anonymous-piwik-user

Hi,

Is there a doc about the regex format for import_logs.py ?

We would like to import a file with the following awstats LogFormat:

%time2 %other %cluster %other %method %url %query %other %logname %host %other %ua %referer %virtualname %code %other %other %bytesd %other %other
Thanks for your help,

Ludovic

@anonymous-piwik-user

I am trying to set up a daily import of the previous day's log. My issue is that my host date-stamps the log file; how can I import the log file with yesterday's date in its name?

Here is the format of my log files
access.log.%Y-%m-%d.log
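
One way to handle this from cron (a sketch using GNU date; adjust the pattern to your actual filenames):

    python /path/to/import_logs.py --url=http://your.piwik.install/ --idsite=1 "access.log.$(date -d yesterday +%Y-%m-%d).log"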

@anonymous-piwik-user

Thanks a lot for all your great work!
The server log file analytics works great on my server.

I am using a lighttpd server and added the Accept-Language header to accesslog.format:

accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept-Language}i\""
(see http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs:ModAccessLog)

I wonder if it would be possible to add support for the Accept-Language header to import_logs.py?
So that the country could then be guessed from the Accept-Language header when GeoIP isn't installed.

@anonymous-piwik-user

Replying to Cyril:

(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.

Thanks for possibilities to import logs and also thanks for the log-hostname patch.
I'm not sure whether it's the patch or whether it's caused by using --recorders > 1, but on the first run with --add-sites-new-hosts I got 13 sites created for the same hostname.

@anonymous-piwik-user

I'm having a similar problem to Hexxer. When I do a --dry-run I get no errors, but when adding to Piwik it falls over at about the same spot. It's not one offending log file or line that's causing it. I'll attach the output with debugging on below. I've run the script multiple times, removing the line where the script fell over, removing the log file where it fell over, etc. It always dies around line 9000-10000 in the 3rd log file.

I'm not sure if this is of interest, but when doing a dry run the script does ~600 lines/sec; when importing into Piwik it does ~16.

@anonymous-piwik-user

The output file is here. Akismet was marking the attachment as spam

@cbay
Collaborator

(In [6509]) Refs #3163 updated README to suggest increasing the --recorders value.

@cbay
Collaborator

oliverhumpage: thanks, I've committed your diff.

ludopaquet: no doc yet, I suggest you take a look at the code, taking _COMMON_LOG_FORMAT as example.

lewmat21: I suppose each log line has its own date anyway, so it doesn't matter what the filename is.

sc_: I don't think using Accept-Language to guess the country is a good idea. As the header name says, it's about languages (locales), not countries. First, many languages are spoken in several countries: if the Accept-Language says you accept English, which country would you pick? Second, people can have an Accept-Language that doesn't match their country. I personally surf with English as my Accept-Language, whereas I'm French and live in France.

law: can you reproduce the issue? If so, can you give me the access log as well as the full command line you used?

andrewc: can you edit line 741 and increase the 200 value to something like 10000? It will print the full error message instead of only the first 200 characters, which is not enough to get the Piwik error.

@anonymous-piwik-user

@Cyril:
As far as I know Piwik does the same when the GeoIP plugin isn't used:
http://piwik.org/faq/troubleshooting/#faq_65
The location is then guessed from en-us, fr-fr etc.

But the more important point is that it would be useful for website development to know what languages the people who visit my website use. So it would be great if support for the Accept-Language header could be added.

@anonymous-piwik-user

Sorry for the wrong formatting (the preview didn't work).
Here is the correct link:

http://piwik.org/faq/troubleshooting/#faq_65

@anonymous-piwik-user

@Cyril:
Here's the output file with the full error messages.

@anonymous-piwik-user

Replying to andrewc:

@Cyril:
Here's the output file with the full error messages.
Sorry, this is the link: https://www.dropbox.com/sh/zat1m6lqphndpny/wH6n4mDaD6/output0907.txt

@cbay
Collaborator

sc_: OK, I didn't know about this. Considering GeoIP will be integrated into Piwik soon (see #1823), which is a much better solution, I don't think we should modify the import script to use Accept-Language headers.

andrewc: your Piwik install (the PHP part) is returning errors:
Only one usage of each socket address (protocol/network address/port) is normally permitted

You need to find out why and fix it. It's unrelated to the import script.

@anonymous-piwik-user

Thanks for your great work. We've given log import some time now and have a few ideas/problems.

I don't know whether I should open new tickets or write here?

One major thing is how to bring the number of visitors/unique visitors down, to make it more similar to JavaScript tracking and Google Analytics.

I understand that we don't have cookies and other config information to identify the visitor.

We've managed to bring the number of pageviews/actions down severalfold (from 5 times more than JavaScript tracking to 2 times more), or much more in a few cases (like from 100 times more than JavaScript).

Our ideas and changes include (we assumed that we should get numbers as close as possible to JavaScript tracking):

  • counting only GET requests (not HEAD and POST)

  • not counting visits without OS data (we've tracked 5 websites with 10k visits and 30k views per day, and there is a very small number of real users without an OS)

a few workarounds :)

  • limiting to 100 actions per visitorID to block crawlers that are not on the list (that also works for Ajax websites that fetch parts of the site with GET and normal PHP files)

  • changing the static-file check to block all sorts of minimizers, for example /index.php?minimize_js=file.js (very common)

  • custom code for image thumbnails /img_thumb.php?file=picture.png&w=100&h=300

We ended up with the number of actions (pageviews) about twice that of JavaScript, without influencing the number of visitors (about 50% higher than JavaScript).

Our extreme case was 300 views (JavaScript tracking) versus 30,000 views with the import script; after the changes, about 570 views with the import script.

@cbay
Collaborator

fjohn:

  • why shouldn't we count POST requests? HEAD, I can agree, but POST are legitimate requests made by regular browsers

  • what kind of user-agent doesn't have OS data? Aren't they bots anyway?

  • limiting actions: that's on the PHP-side, I'll let matt answer this

Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:

  • it would be a cumbersome list to maintain, and people could argue what paths deserve to be included or not, depending on how popular the script is

  • there would be false positives (what if I have a legitimate img_thumb.php that should be included in page views?)

  • most importantly, such a list would be quite large, and that would really slow down the importing process (as each hit would have to be compared with all excluded paths).

So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.
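
For example (hypothetical patterns; both options use fnmatch-style wildcards, and --exclude-path-from reads one pattern per line from a file):

    python import_logs.py --url=http://your.piwik.install/ --idsite=1 --exclude-path='/img_thumb.php*' --exclude-path-from=excluded_paths.txt access.log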

What we may do is create such a list in Piwik (in an external file), but not enable it by default. People who want to use it could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?

@anonymous-piwik-user

Replying to Cyril:

fjohn:

  • why shouldn't we count POST requests? HEAD, I can agree, but POST are legitimate requests made by regular browsers

But POST is also used by Ajax requests all the time (and that's not what we would count with JS). We've simply dropped anything other than GET.

  • what kind of user-agent doesn't have OS data? Aren't they bots anyway?

for me the question is: does a "real user" always send OS data? In our logs there were, for example, curl, Python libraries, XRumer, scrapers and many more odd requests that weren't on the bot list.

  • limiting actions: that's on the PHP-side, I'll let matt answer this

Yes, it is. But we have a lot of bots that were not on the list; I don't know how that works, but they showed up in the log-import profile and not in the JavaScript profile.

Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:

  • it would be a cumbersome list to maintain, and people could argue what paths deserve to be included or not, depending on how popular the script is

I agree with you. We identified 2 of them (thumbnails and minimizers) and we have very generic code for them - for example, (if picture and &w and &h) identifies the 3 most popular thumbnail scripts (including those in WordPress and osCommerce).

We did it because on an osCommerce shop we had 1000 more page views than with JavaScript - should we accept that?

  • there would be false positives (what if I have a legitimate img_thumb.php that should be included in page views?)

Does that happen with JavaScript tracking? From our tests, no.

  • most importantly, such a list would be quite large, and that would really slow down the importing process (as each hit would have to be compared with all excluded paths).

We have only 2 more "if statements" in the current for loops. Still, you're right, that can grow :)

So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.

What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?

That could be a good idea; it would be nice to test this on a larger number of websites/scripts. We've tested 5 regular websites and a few other scripts.

@cbay
Collaborator

Ajax requests do not always use POST. For instance, jQuery (the most popular Javascript library) uses GET by default:
http://api.jquery.com/jQuery.ajax/

Regarding the rest of your comments, just to make things clear: I wasn't advocating against what you did for your specific site, but against doing this by default in the script. I much prefer to add options to the import script (it already has quite a few) to let users customize it for their own needs, rather than trying to have sane defaults, which we really can't do as there's too much diversity on the Web :)

@anonymous-piwik-user

Cyril:

About Ajax - that is why we set a limit of 100 page views per visitor. We found a case where one user generated 700 to 1000 views through Ajax GET requests.

About the whole thing: sure, I understand that. But we want to use it for a hosting company, and we are not making any "special case"; we are trying to test log import on as many websites as we can.

So we just wanted to share some of our tests and ideas. In most cases everything works well, but WordPress and osCommerce are very popular.

Showing customers 30k views instead of 300 is not the best way to prove that log import is working fine. On an IPB forum we had 5 times more pageviews; now it's less than twice JS.

@mattab
Owner

@oliverhumpage and to all listening in this ticket: is there any other pending bug or important missing feature in this script?

Are you all happy with it? Note: we are working on performance next.

@anonymous-piwik-user

My Apache log gives hostnames rather than IP addresses. It looks like the import script sends the hostname, which the server side tries to interpret as a numeric IP value, with the result that all hostnames translate to 0.0.0.0. I added a call to socket.gethostbyname() in the import script, but it has undone all the performance gains I got from the bulk request patch.

Is there some simple fix that I'm missing here?
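
One likely fix (a sketch, not something the script ships with): memoize the lookups so each distinct hostname is resolved only once per run:

    # Resolve each hostname once and reuse the result, so the import
    # doesn't repeat a DNS round-trip for every hit.
    import socket

    _dns_cache = {}

    def resolve_ip(hostname):
        if hostname not in _dns_cache:
            try:
                _dns_cache[hostname] = socket.gethostbyname(hostname)
            except socket.error:
                _dns_cache[hostname] = '0.0.0.0'  # unresolvable fallback
        return _dns_cache[hostname]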

@anonymous-piwik-user

Some IIS logs do the same as bjrubble mentioned in the comment above - in their c-ip section, a host name may be found instead of an IP address.

This causes the regex (which only accepts digits) to fail when parsing that line, and I believe the line gets thrown out, resulting in a bad import.

@anonymous-piwik-user

Because Piwik lacks the capability to track news feed subscribers (and I don't want to use FeedBurner), I would like to import that particular information from the Apache logs. All other web requests are tracked successfully by Piwik, and I want the feed-user information merged into the same Piwik website. For instance, my news feed is located at www.domain.com/rss.xml; how can I import only that particular information into Piwik?

@anonymous-piwik-user

Hi guys,

We found one odd case.

On 2 servers (one dedicated and one VPS), each new visit = new idvisitor (despite the same configId).

BUT with the same log file and the same Piwik (fresh download and installation) on localhost on Mac OS X, unique visitors are counted correctly.

Do you have any idea why, and how is it supposed to work? I've spent some time in Visit.php: when there is no cookie and the visit is less than 30 minutes old, a new idvisitor is created.

@mattab
Owner

BUT with the same log file and the same Piwik (fresh download and installation) on localhost on Mac OS X, unique visitors are counted correctly.

Could you somehow find an example log file showing the problem on both installations, with a few lines (3 or 4), to replicate the bug? This would help in finding a fix. Thanks.

@anonymous-piwik-user

Yes matt, I will have them tomorrow (day off today), but how should it work? Should log parsing count unique visitors or not?

@geosone

I have activated log import via an Apache macro to get live stats, but we have 20 sites with high load, and the problem we have now is that access via the URL is blocking (30 or more import_logs.py processes accessing Piwik).
Could we get some direct log import that does not go through the HTTP interface, and instead goes directly through a console PHP load?

thx Mario
and keep up the great work

@aspectra

Hi @all

We are testing the python import_logs.py script. Currently we are not able to import IIS log files that are compressed with WinZip or 7-Zip. If we unzip the archive before running the script, it works quite well.

It seems the python script is not able to uncompress the files...

Attached is an example archive.

@aspectra

Attachment: WinZip compressed file
u_ex120813.zip

@diosmosis
Collaborator

(In [6734]) Refs #3163, add integration tests (in PHP) for log importer.

@diosmosis
Collaborator

(In [6737]) Refs #3163, modified log importer to use bulk tracking capability.

Notes:

  • Added 'ua' & 'lang' tracker parameters to override user agent & language present in HTTP header.
  • Modified the tracker so if there's an error when doing bulk tracking, the number of succeeded requests is returned.
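
For instance (placeholder values; idsite, rec and url are the standard tracking parameters), a tracking request can now carry the header overrides explicitly:

    http://your.piwik.install/piwik.php?idsite=1&rec=1&url=http%3A%2F%2Fexample.com%2F&ua=Mozilla%2F5.0&lang=en-us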
@mattab
Owner

(In [6739]) Refs #3163 - clarifying this option shouldn't be used by default

@diosmosis
Collaborator

(In [6740]) Refs #3163, made size of parsing chunk == to max payload size * recorder count.

@mattab
Owner

(In [6743]) Refs #3163

  • Fixing Log Analytics integration - adding a new index.php proxy to force to use test DB
  • Refactored call to get browser language forgotten earlier

TODO:

  • Benaka, could you please remove --tracker-url feature from the import_logs.py ? it's not used anymore
@mattab
Owner

(In [6745]) Fixing build? Refs #3163

@mattab
Owner

(In [6749]) Refs #3163

  • removing tracker-url param
  • fixing build?
@diosmosis
Collaborator

(In [6756]) Refs #3163, show average records/s along w/ current records/s in log importer.

@mattab
Owner

Replying to jamesvl011:

Some IIS logs do the same as bjrubble mentioned in the comment above - for their c-ip section, a host name may be found instead of just an IP address.

This causes the regex (which only accept digits) to fail when parsing that line, and I believe the line gets thrown out, resulting in a bad import.

@bjrubble and james, could you please submit the correct REGEX? We would be glad to commit the fix, thanks.

@mattab
Owner

Adding "Heuristics to not track bot visits" in the ticket description.

If you have a suggestion or request for the script - or any problem or bug, please post a new comment here.

@mattab
Owner
  • Adding "Support Accept-language" as feature request, since Piwik allows to define user language with the parameter &lang= so this should be easy and useful for some users.
@anonymous-piwik-user

Replying to matt:

@jrbubble and james, could you please submit the correct REGEX? we would be glad to commit the fix, thanks.

Matt -

The regex for c-ip (line 134 of import_logs.py when I looked at svn) ought to be like the line for User-Agent, allowing any text string without spaces:

'c-ip': '(?P<ip>\S+)'

I'm assuming the Piwik API can handle a hostname passed in place of an IP address? If not, Python will have to do hostname lookups (preferably with its own mini-cache) as it parses the file.

I'll attach a file to this ticket with an example IIS log file that you can use for testing - it will have four rows, three with host names in the c-ip field and one with an IP address.

@anonymous-piwik-user

Attachment: Sample IIS file for testing variations of c-ip field
test_c-ip_iis_log.log

@oliverhumpage

I've just tried a fresh install of 1.8.3 (to make sure it works before I move everything over from my current 1.7.2rc4 install).

When I import a sample log (for just one vhost) using --add-sites-new-hosts, I get the same "website" created multiple times. It seems that if you set --recorders to something greater than 1, then several recorders will independently create the new vhost's website for you. Changing --recorder-max-payload-size doesn't seem to affect this behaviour, it's just --recorders.

I'm sure this didn't happen in the older 1.7.2 version.

Can you replicate, and if so, is there an easy fix?

Thanks.

@diosmosis
Collaborator

(In [6824]) Refs #3163, fix concurrency bug in import script where sites get created more than once when --add-sites-new-hosts is used.

@diosmosis
Collaborator

Replying to oliverhumpage:

I've just tried a fresh install of 1.8.3 (to make sure it works before I move everything over from my current 1.7.2rc4 install).

When I import a sample log (for just one vhost) using --add-sites-new-hosts, I get the same "website" created multiple times. It seems that if you set --recorders to something greater than 1, then several recorders will independently create the new vhost's website for you. Changing --recorder-max-payload-size doesn't seem to affect this behaviour, it's just --recorders.

I'm sure this didn't happen in the older 1.7.2 version.

Can you replicate, and if so, is there an easy fix?

Just committed a fix for this bug. Can you use the file in svn?

@diosmosis
Collaborator

(In [6826]) Refs #3163, added more integration tests for log importer & removed some unnecessary xml files.

@oliverhumpage

Replying to capedfuzz:

Just committed a fix for this bug. Can you use the file in svn?

Perfect, that's fixed it - thank you.

Oliver.

@diosmosis
Collaborator

(In [...]) Refs #3163, #3227: make sure no exception is thrown in the tracker when there is no 'ua' parameter and no HTTP_USER_AGENT (fix for bug in [6737]).

@anonymous-piwik-user

I'm trying to import our IIS logs using import_logs.py but it keeps hitting a snag somewhere in the middle. The message simply says:

Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215201 on the command line.

When I restart it with the skip parameter, it does not record any more lines and fails again a few lines further down (see output below).

C:\Python27>python "d:\websites\piwik\misc\log-analytics\import_logs.py" --url=http://piwikpre.unaids.org/ "d:\tmp\logfiles\ex120803.log" --idsite=2 --skip=215201
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log d:\tmp\logfiles\ex120803.log...
182921 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
218630 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
222550 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
227111 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
231539 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
235666 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
240261 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
244780 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215225 on the command line.

The format we are using is the W3C Extended Log File Format, and we are tracking extended properties such as Host, Cookie, and Referer. I'd like to send the log file I used for this example, but it's too big to attach (20 MB even when zipped). Can I send it by some other means?

Thanks a lot!
-Jo

@anonymous-piwik-user

Hi,

Nice module; we're currently assessing it.
I have 2 questions:

1/ We have several load-balanced servers. Each server generates its own log files, but for the same FQDN. How can we process and aggregate the log files into the same website, given that the log files need to be ordered by date?

2/ The log files contain consumed bandwidth. Would it be conceivable to enhance this module to parse and record this information? Or, if we need this information, should we consider creating a plugin?

Thanks for your feedback.

@anonymous-piwik-user

The import_logs.py script should be able to handle and order the dates of your different logs when computing statistics. That's the main purpose of the "invalidate" function within the script.

The best approach would be to import all your logs at once and then run the archive job so that it can compute statistics for the "invalidated" dates.

@anonymous-piwik-user

Hi,

I'm trying to use "import_logs.py" to parse Java Play's log; a log file sample follows:

15.185.97.217 127.0.0.1 - - Sep 04 18:28:38 PDT 2012 "/facedetect?url_pic=http%3A%2F%2Ffarm4.staticflickr.com%2F3047%2F2699553168_325fb5509b.jpg" 200 345 "" "Jakarta Commons-HttpClient/3.1" 5683 ""

But Python throws: "invalid log lines"

Actually, the Java Play log file is similar to Lighttpd's access.log. Is there an easy way to adapt this Python file to parse other log files?

@anonymous-piwik-user

It was suggested by Matt that I add my issue to this ticket:

I'm running Piwik 1.8.3 on IIS 7. I've installed the GeoIP plugin, and also tweaked based on http://forum.piwik.org/read.php?2,71788. It is working. However, my installation is only tracking US-based visits.

My IIS instance archives its log hourly. I've attached one recent log for review, on the chance that it will contain clues as to why I'm only seeing US-based visits.

@anonymous-piwik-user

Attached log file is named u_ex12091212.log.

@diosmosis
Collaborator

[7030] refs this ticket.

@anonymous-piwik-user

I have a log with the following format, where www.website.com represents the hostname of the web hosts on the server. I get an error that the log format doesn't include the hostname.

188.165.230.147 www.website.com - -0400 "GET / HTTP/1.1" 200 10341 "http://www.orangeask.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" "-"

I have tried a series of tests with --log-format-regex= and I can't get it to work. Any help would be greatly appreciated.

Thanks

@mattab
Owner

To everyone with questions in this ticket, thank you for your bug reports. You can try to modify the python script to make it work for your log files. It's really simple code at the start of the script.

If you are stuck and need help, Piwik experts can help with any issue related to the log import script. Contact them at: http://piwik.org/consulting/

Otherwise, we may fix some of these requests posted here, but it might take a while..

We hope you enjoy Log Analytics!

@anonymous-piwik-user

Replying to jason:

I have a log with the following format, where www.website.com represents the hostname of the web hosts on the server. I get an error that the log format doesn't include the hostname.

188.165.230.147 www.website.com - -0400 "GET / HTTP/1.1" 200 10341 "http://www.orangeask.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" "-"

I have tried a series of tests with --log-format-regex= and I can't get it to work. Any help would be greatly appreciated.

Thanks

Last time, I successfully adapted the code base of "import_logs.py", especially for Java Play log file parsing. I think you could hard-code removal of the hostname pattern with the "http://" prefix, or string-replace it.

@anonymous-piwik-user

Had a very minor problem with the script today:
I have daily log rotation enabled, and when no user visits a site on a given day, the log file for that day is empty. This means the log format guessing fails, leading to an error.
Preferably, when a log file is empty, one would like to skip it without throwing an error. This is easily achieved by changing the line that checks for log file existence to also check whether the log file has contents:

        `if not os.path.exists(filename) or os.path.getsize(filename) == 0:`
@anonymous-piwik-user

Attachment: Log Parser README Update with Nginx Log Format for Common Complete
README_nginx_log_format.diff

@mattab
Owner

@cyril in the next update, can you please include this patch from @phikai: "Log Parser README Update with Nginx Log Format for Common Complete"

To everyone else: please consider submitting patches, README improvements, or new log formats for the script; we will make an update in a few days.

@mattab
Owner

(In [7313]) Refs #3163 Adding libwww in excluded user agents, since libwww-perl is a common bot
As reported in: http://forum.piwik.org/read.php?3,95844

@cbay
Collaborator

(In [7382]) Refs #3163: Log Parser README Update with Nginx Log Format for Common Complete, thanks to phikai.

@cbay
Collaborator

(In [7383]) Refs #3163: don't fail to autodetect the format for empty files.

@mattab
Owner

Hey guys, there have been many updates in trunk on the script, please let us know if your suggestion or report hasn't yet been committed.

Kudos to Cyril for the updates!

edit: Check also this ticket: #3558

@cbay
Collaborator

For the record, with the current trunk, I can sustain 2000 requests/second in dry-run mode on a Xeon 2.7 GHz, and 1000 requests/second without dry-run, with --recorders=10 and the default payload (Piwik is installed on another server, 4 cores).

That's not to say that you should get the same numbers, as it depends on a LOT of factors (raw processing power, number of recorders, payload, PHP configuration, log files, network, etc.), but if you only get 50 requests/second on a strong machine, something is probably wrong.

Running with --dry-run is a good way to know how fast the Python script can go without really importing to Piwik, which already excludes many factors.

@anonymous-piwik-user

I am running Piwik 1.9.2 on a RHEL 5.7 server running Apache.

I am trying to implement the Apache CustomLog that directly imports into Piwik, as described in the README (https://github.com/piwik/piwik/blob/master/misc/log-analytics/README). I am not sure if I have a problem with my configuration or if there is a potential bug in the Piwik import_logs.py script. After some poking around on the command line, it seems that the script works perfectly when given an entire file, but crashes when you feed it a single line from a log file. I have included my command output below. Any help would be greatly appreciated. Also, if you need any additional information, please let me know!

First, let me pull the first line of my logfile to show its syntax:

[katonj@mimir2:log-analytics ] $ head -1 boarddev-beta.teradyne.com.log
boarddev-beta.teradyne.com 131.101.52.31 - - [12/Nov/2012:11:16:24 -0500] "GET /boarddev/ HTTP/1.1" 200 10541 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"

Now, when I run the file as the Apache configuration suggests, I get the following (note: if I do not put the "-" at the end of the command, the line from the logfile is ignored and the script simply outputs the README):

[katonj@mimir2:log-analytics ] $ head -1 boarddev-beta.teradyne.com.log | ./import_logs.py  --add-sites-new-hosts --config=../../config/config.ini.php --url='http://boarddev-beta.teradyne.com/analytics/' -
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
Traceback (most recent call last):
  File "./import_logs.py", line 1462, in <module>
    main()
  File "./import_logs.py", line 1426, in main
    parser.parse(filename)
  File "./import_logs.py", line 1299, in parse
    file.seek(0)
IOError: [Errno 29] Illegal seek

And finally, if I run the file itself through the script, I get the following, showing that it happily processes the logfile as long as it is fed an entire file at once:

[katonj@mimir2:log-analytics ] $ ./import_logs.py  --add-sites-new-hosts --config=../../config/config.ini.php --url='http://boarddev-beta.teradyne.com/analytics/' boarddev-beta.teradyne.com.log
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log boarddev-beta.teradyne.com.log...
Purging Piwik archives for dates: 2012-11-12
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.


Logs import summary
-------------------

    8 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    8 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 24.01 requests per second
@cbay
Collaborator

ottodude125: format autodetection + reading from stdin is actually not supported; you have to pick one. I'll fix the bug later on, though.

@anonymous-piwik-user

When you set up the Apache CustomLog, you are piping the log messages into the script as soon as they appear. This is the same as stdin, right? I was just trying to simulate that process by running head -1 on a log file and piping the log message into the script.

@oliverhumpage

Since auto format detection relies on having several lines to decode, it doesn't work on stdin (it tries to seek to points in the file, hence the "bug" - seek obviously fails on stdin).

When using stdin as the log source you have to use either --log-format-name or --log-format-regex flags on the command line to force a particular format. You might find --log-format-name="common_vhost" is what you want.

@anonymous-piwik-user

You are completely right. Adding --log-format-name='common_vhost' to the command now allows a logfile to be read from stdin. So the following command works great from the command line:

[katonj@mimir2:applications ] $ head -8 babyfat | /hwnet/dtg_devel/web/beta/applications/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --url='http://mimir2.icd.teradyne.com/analytics' --log-format-name='common_vhost' --output=/tmp/junk.log -

As a side note, I've tried the common_complete name and the --log-format-regex included in the README, and neither had any magical side effects either.

Unfortunately, porting that exact same thing into the Apache httpd.conf does not work. I have the configuration below, and while the logfile "babyfat" gets populated, Piwik doesn't seem to process any input.

LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" baby

CustomLog "|/hwnet/dtg_devel/web/beta/applications/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --url='http://mimir2.icd.teradyne.com/analytics' --log-format-name='common_vhost' --output=/tmp/junk.log -" baby

CustomLog logs/babyfat baby

Lastly, the output logfile junk.log gets data when the command is run from the command line, but the only time it gets populated from Apache is when you add several -d flags to the CustomLog command and restart Apache; then you get:

2012-11-13 15:44:12,517: [DEBUG] Accepted hostnames: all
2012-11-13 15:44:12,517: [DEBUG] Piwik URL is: http://mimir2.icd.teradyne.com/analytics
2012-11-13 15:44:12,517: [DEBUG] No token-auth specified
2012-11-13 15:44:12,517: [DEBUG] No credentials specified, reading them from "/hwnet/dtg_devel/web/beta/applications/piwik/config/config.ini.php"
2012-11-13 15:44:12,520: [DEBUG] Using credentials: (login = piwik, password = a0a582ec5eda9c506a6f30dc8b2bbcf3)
2012-11-13 15:44:13,249: [DEBUG] Accepted hostnames: all
2012-11-13 15:44:13,249: [DEBUG] Piwik URL is: http://mimir2.icd.teradyne.com/analytics
2012-11-13 15:44:13,249: [DEBUG] No token-auth specified
2012-11-13 15:44:13,249: [DEBUG] No credentials specified, reading them from "/hwnet/dtg_devel/web/beta/applications/piwik/config/config.ini.php"
2012-11-13 15:44:13,251: [DEBUG] Using credentials: (login = piwik, password = a0a582ec5eda9c506a6f30dc8b2bbcf3)
2012-11-13 15:44:14,341: [DEBUG] Authentication token token_auth is: 582b588b9568840fa6f1e208a8702b93
2012-11-13 15:44:14,342: [DEBUG] Resolver: dynamic
2012-11-13 15:44:14,342: [DEBUG] Launched recorder
@mattab
Owner

(In [7490]) Fixes #3548 Refs #3163
Any visitor with a user agent containing "spider" will be classified as a bot

@anonymous-piwik-user

I have the same issue as ottodude125. Piping a single line from the access.log into import_logs.py works, but with the same command run directly from Apache, nothing gets logged.

EDIT: I noticed that the log messages appear in the import_logs log when I restart Apache. So it seems this triggers either Apache to send the messages to stdin, or import_logs to read from stdin.

2nd EDIT: CustomLog with rotatelogs works. So the issue must be in import_logs.py.

@oliverhumpage

@elm @ottodude125

I noticed that in ottodude125's CustomLog there's no path to the config file and no auth token: that would explain the errors shown in junk.log. You need to specify one or the other so that import_logs.py can authenticate itself to the Piwik PHP scripts.

I'm wondering if the same problem is happening for elm's logs too? @elm, if that doesn't fix it, could you paste your customlog section here too?
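
For example, something like this (paths and token are placeholders; --token-auth is the relevant flag):

    CustomLog "|/path/to/piwik/misc/log-analytics/import_logs.py --url=http://your.piwik.install/ --token-auth=xxxxxxxxxx --log-format-name=common_vhost --output=/tmp/piwik-import.log -" vhost_combined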

@mattab
Owner

There was another user in the forums reporting an error: view post

Could we explain the bug when it happens, and fail with a relevant error/notice message?

@anonymous-piwik-user

Here is my CustomLog line (line breaks for better reading):

CustomLog "|/var/www/piwik.skweez.net/piwik/misc/log-analytics/import_logs.py
--url=http://piwik.skweez.net/ --add-sites-new-hosts
--output=/var/www/update.skweez.net/logs/piwik.log --recorders=4
--log-format-name=common_vhost -dd -" vhost_combined

Here is the log that is generated:

...
2012-11-23 22:35:07,759: [DEBUG] Launched recorder
2012-11-23 22:35:07,761: [DEBUG] Launched recorder
2012-11-23 22:35:07,762: [DEBUG] Launched recorder
2012-11-23 22:35:07,763: [DEBUG] Launched recorder
2012-11-24 06:30:01,375: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-24 06:30:01,378: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-24 06:30:01,633: [DEBUG] Accepted hostnames: all
2012-11-24 06:30:01,633: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-24 06:30:01,633: [DEBUG] No token-auth specified
2012-11-24 06:30:01,633: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-24 06:30:01,648: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-24 06:30:02,065: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-24 06:30:02,709: [DEBUG] Site ID for hostname update.skweez.net: 7
Purging Piwik archives for dates: 2012-11-23 2012-11-24
2012-11-24 06:30:02,935: [DEBUG] Authentication token token_auth is: ...
2012-11-24 06:30:02,935: [DEBUG] Resolver: dynamic
2012-11-24 06:30:02,936: [DEBUG] Launched recorder
2012-11-24 06:30:02,938: [DEBUG] Launched recorder
2012-11-24 06:30:02,940: [DEBUG] Launched recorder
2012-11-24 06:30:02,941: [DEBUG] Launched recorder

Logs import summary
-------------------

    5 requests imported successfully
    14 requests were downloads
    15 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        1 HTTP errors
        0 HTTP redirects
        14 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    5 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 28495 seconds
    Requests imported per second: 0.0 requests per second

2012-11-25 06:33:02,723: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:02,723: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:02,724: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:03,104: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,136: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,141: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,372: [DEBUG] Accepted hostnames: all
2012-11-25 06:33:03,372: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-25 06:33:03,372: [DEBUG] No token-auth specified
2012-11-25 06:33:03,372: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-25 06:33:03,373: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-25 06:33:03,492: [DEBUG] Authentication token token_auth is: ...
2012-11-25 06:33:03,492: [DEBUG] Resolver: dynamic
2012-11-25 06:33:03,493: [DEBUG] Launched recorder
2012-11-25 06:33:03,494: [DEBUG] Launched recorder
2012-11-25 06:33:03,495: [DEBUG] Launched recorder
2012-11-25 06:33:03,495: [DEBUG] Launched recorder
Purging Piwik archives for dates: 2012-11-25 2012-11-24

Logs import summary
-------------------

    9 requests imported successfully
    42 requests were downloads
    42 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        3 HTTP errors
        0 HTTP redirects
        39 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    9 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 86580 seconds
    Requests imported per second: 0.0 requests per second


Logs import summary
-------------------

    0 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    0 requests imported to 0 sites
        0 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 12 seconds
    Requests imported per second: 0.0 requests per second

2012-11-25 06:33:16,016: [DEBUG] Accepted hostnames: all
2012-11-25 06:33:16,016: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-25 06:33:16,016: [DEBUG] No token-auth specified
2012-11-25 06:33:16,016: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-25 06:33:16,017: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-25 06:33:16,156: [DEBUG] Authentication token token_auth is: ...
2012-11-25 06:33:16,156: [DEBUG] Resolver: dynamic
2012-11-25 06:33:16,157: [DEBUG] Launched recorder
2012-11-25 06:33:16,157: [DEBUG] Launched recorder
2012-11-25 06:33:16,159: [DEBUG] Launched recorder
2012-11-25 06:33:16,159: [DEBUG] Launched recorder

So it only gets the logs when Apache reloads, which it does at night after logrotate.

@aspectra

Hi,
I would be glad if you could add a new option to the script. It should import only the log lines that include a specified path - exactly the opposite of the --exclude-path-from option. As far as I understand, we could just copy/paste the check_path function and swap the "True" and "False" return values. I've posted the part with the changes below.

    def check_path(self, hit):
        for included_path in config.options.included_paths:
            if fnmatch.fnmatch(hit.path, included_path):
                return True
        return False

Unfortunately I don't know where to modify the script to add this option.

Many thanks for your help.
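
A hedged sketch of where such an option could be wired in, assuming the script's optparse-based option handling (the parser variable and destination names here are illustrative; check how --exclude-path is registered in import_logs.py):

    # Illustrative only: register an --include-path option alongside the
    # existing --exclude-path one.
    option_parser.add_option(
        '--include-path', dest='included_paths', action='append', default=[],
        help="Import only URLs whose path matches this fnmatch pattern")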

@anonymous-piwik-user

Hi all,
I am new to Piwik. I installed Piwik on an Apache webserver and tried to import a log file from a Tomcat webserver, but I get the following error:
Fatal error: Cannot guess the logs format. Please give one using either the --log-format-name or --log-format-regex option
This is the command that I used:
python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://192.168.1.100/piwik/ /home/user/app1/catalina.2012-12-10.log --idsite=1 --recorders=1 --enable-http-errors --enable-http-redirects --enable-static --enable-bots
And this is what the log file contains:
Dec 10, 2012 12:02:50 AM org.apache.catalina.core.StandardWrapperValve invoke
INFO: 2012-12-10 00:02:50,000 - DEBUG InOutCallableStatementCreator#<init> - Call: AdminReports.GETAPPLICATIONINFO(?)

I tried googling it but didn't find much. I also tried the Piwik forum, with the same result. Can you help me? What parameter should I use with the --log-format-name or --log-format-regex option?

@mattab
Owner

In trunk, when I CTRL+C the script, it does not exit directly; it takes 5-10 seconds before the software stops running and then outputs the log. I think this is a recent regression?

@anonymous-piwik-user

Suggestion - Bandwidth Usage

I used to see it in AWStats...
http://forum.piwik.org/read.php?2,98279,98330#msg-98330

There is no size information in the logs, but I guess AWStats checks the files accessed in the logs and counts their sizes.

@mattab
Owner

For piwik.php performance improvements and asynchronous data imports, see #3632

@anonymous-piwik-user

Has anyone found a solution to this yet? I'm having the same problem with my IIS logs not importing.

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log Z:\logs\W3SVC14\u_ex121218.log...
1648 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1648 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1648 lines parsed, 43 lines recorded, 14 records/sec (avg), 43 records/sec (current)
1648 lines parsed, 43 lines recorded, 10 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "Z:\logs\W3SVC14\u_ex121218.log" from the point it failed by specifying --skip=3 on the command line.

Replying to unaidswebmaster:

I'm trying to import our IIS logs using import_logs.py but it keeps hitting a snag somewhere in the middle. The message simply says:

Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215201 on the command line.

When I restart it with the skip parameter, it would not record any more lines and fail again a few lines down (see output below)

C:\Python27>python "d:\websites\piwik\misc\log-analytics\import_logs.py" --url=http://piwikpre.unaids.org/ "d:\tmp\logfiles\ex120803.log" --idsite=2 --skip=215201
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log d:\tmp\logfiles\ex120803.log...
182921 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
218630 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
222550 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
227111 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
231539 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
235666 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
240261 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
244780 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215225 on the command line.

The format we are using is W3C Extended Log File Format and we are tracking extended properties, such as Host, Cookie, and Referer. I'd like to send the log file that I used for this example, but it's too big to be attached (20Mb even when zipped). Can I send it by some other means?

Thanks a lot!
-Jo

@anonymous-piwik-user

Checking in on the IIS logs not importing issue. I'm having the same issue as Jo reported here. The errors are the same.

@anonymous-piwik-user

I am running into the same problem as Jo as well. Please let me know if there are any suggestions or possible solutions. We have been trying to diagnose the problem for a couple days but still have not found a solution. Thanks.

@anonymous-piwik-user

Replying to wpballard:

Checking in on the IIS logs not importing issue. I'm having the same issue as Jo reported here. The errors are the same.

One thing I've noticed is that --dry-run works perfectly. That might help narrow down where the problem is. Likely in the code that commits the changes to the DB.

@anonymous-piwik-user

Hey Folks,

Glad to see there is good interest in the log file processing.

The first feature I would like to see added is the opposite of --exclude-path: an --include-path option.

In our architecture we have MANY web assets under a single domain, and weblogs are done by domain. This is out of our control. A single domain can include multiple applications, APIs, and web services, so it would be nice to process the log files by including only the paths we want. The exclusion route is just cumbersome, as each call would require 5-10 excludes instead of a single include.

@anonymous-piwik-user

The second feature I would like to see is support for the XFERLOG format (http://www.castaglia.org/proftpd/doc/xferlog.html) for handling FTP logs.

Much of our business is based on the downloading of data and files via FTP, so these types of stats and analysis are valuable.

@anonymous-piwik-user

The third feature I would like to see added is the ability to process log files rotated on a monthly basis. I know this goes contrary to the recommendations; however, in our business we do not manage the IT infrastructure, only the line-of-business services and apps on top of that infrastructure.

Currently I am handling this with a BASH script. Before I process the log file I count the number of lines (using wc -l) and store that in a loglines.log file. The next time I run the script I use tail on loglines.log to grab the last line count and use that to populate the --skip param.

To capture the monthly log rotation: if the current wc -l is less than the stored count, I set --skip to zero (0).

It is crude, but works. Having this built in native Python would be fairly straightforward and would allow support for monthly rotation (see the sketch below).

The added bonus is that the same log file can be processed multiple times in a day, even for daily rotated logs. This is a happy compromise between real-time JavaScript tracking and daily log processing, especially for high-volume sites with huge log files.

Cron is handy for this.
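
A minimal Python sketch of that same bookkeeping, for reference (the paths, Piwik URL, and site id are placeholders):

    # Remember how many lines were imported last run; pass the difference as
    # --skip, and start from zero when the file shrinks (i.e. was rotated).
    import os
    import subprocess

    LOG_FILE = '/var/log/apache2/access.log'
    STATE_FILE = '/var/tmp/loglines.state'

    last = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            last = int(f.read().strip() or '0')

    with open(LOG_FILE) as f:
        current = sum(1 for _ in f)

    skip = last if current >= last else 0

    subprocess.check_call([
        'python', 'misc/log-analytics/import_logs.py',
        '--url=http://example.com/piwik/', '--idsite=1',
        '--skip=%d' % skip, LOG_FILE,
    ])

    with open(STATE_FILE, 'w') as f:
        f.write(str(current))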

@cbay
Collaborator

Those having errors with IIS: please upload a log file with lines causing the error. A single line is probably causing it, so it'd be better to upload that single line(s) rather than a big file. The skip value will help you find that line.

@cbay
Collaborator

dsampson: agree for the --include-path suggestion. I'll add it later.

FTP logs: that's definitely not something that should be included in Piwik. You can define your own log format with a regexp, have you tried?

Log rotating: not easy. Right now, the Python script has no memory, so it can't store data (such as the latest position for log files). Besides, how would the script know when the log file has been rotated and we must reset the position?

The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.

@anonymous-piwik-user

See comments inline...

Replying to Cyril:

dsampson: agree for the --include-path suggestion. I'll add it later.

Thanks for this. Appreciated

FTP logs: that's definitely not something that should be included to Piwik. You can define your own log format with a regexp, have you tried?

For those of us in the big data business, a FOSS solution offering all the features of Piwik for FTP would be great. An unlikely fork, so I thought it could be a possible feature.

Working on the regex for XFERLOG; having trouble re-casting a new regex group based on the values of other groups. For instance, the date field is not a clean YYYY-MM-DD, so I need to figure out how to create a regex group based on the values of three other regex groups. I am a regex greenhorn for sure.

Log rotating: not easy. Right now, the Python script has no memory, so it can't store data (such as the latest position for log files). Besides, how would the script know when the log file has been rotated and we must reset the position?

I do it by comparing the last line count to the new one: for instance, #linesyesterday will be greater than #linestoday if the logfile has been rotated. I have done log files in Python using just regular text files in the past. They get big, but the head can be severed when it gets too big. A NoSQL DB approach or data object could also work.

The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.

This would be a good alternative with some hit on performance.

Thanks again for the reply

@anonymous-piwik-user

Did either of these features make it into the latest 1.10.1 release?

Replying to dsampson:

See comments inline...

Replying to Cyril:

dsampson: agree for the --include-path suggestion. I'll add it later.

Log Rotation: The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.

@anonymous-piwik-user

Working on the regex for XFERLOG.

Here is my first cut, however the DATE field will not be recognized. Dates in XFERLOG are not like those in Apache logs. I am not sure how to concatenate these groups based on other named groups.

I included some test strings. Yes, I used the public Google DNS IPs for privacy reasons.

I captured everything I could according to the XFERLOG documentation. Perhaps overkill, but it was the best way I knew to work through the expression. Manpage for XFERLOG here (http://www.castaglia.org/proftpd/doc/xferlog.html).

I also provided the example script call and the output from the script.

Looks like the issue is the DATE group, no surprise. But again, I am not sure how to construct it based on the input.

Any thoughts are appreciated

--------------TEST STRINGS-------------------
Mon Nov 1 04:18:56 2012 4 8.8.4.4 1628134 /pub/geobase/official/cded/250k_dem/026/026a.zip b _ o a User@ ftp 0 *
Thu Nov 10 04:18:56 2012 4 8.8.4.4 1628134 /pub/geobase/official/cded/250k_dem/026/026a.zip b _ o a User@ ftp 0 * c
Tue Jan 1 14:12:36 2013 1 8.8.4.4 88048 /pub/cantopo/250k_tif/MCR2010_01.tif b _ o a ftp@example.com ftp 0 * i
Tue Jan 1 14:15:57 2013 4 8.8.4.4 8769852 /pub/geott/ess_pubs/211/211354/gscof_3759r_b_2000_mn01.pdf b _ o a googlebot@google.com ftp 0 * c
Tue Jan 1 16:06:49 2013 11 8.8.4.4 7198877 /pub/toporama/50k_geo_tif/095/d/toporama_095d02_geo.zip b _ o a user@server.com ftp 0 * c
Tue Jan 1 17:10:54 2013 1 8.8.4.4 168502 /pub/geott/eo_imagery/gcdb/W102/N49/N49d50mW102d12m_2.tif b _ o a googlebot@google.com ftp 0 * c
Tue Jan 1 17:10:54 2013 1 8.8.4.4 168502 /pub/geott/eo_imagery/gcdb/W102/N49/N49d50mW102d12m_2.tif b _ o a googlebot@google.com ftp 0 * c
Tue Jan 1 06:59:59 2013 1 8.8.4.4 1679 /pub/geott/eo_imagery/gcdb/W073/N60/N60d50mW073d40m_1.summary b _ o a googlebot@google.com ftp 0 * c
Tue Jan 1 07:02:53 2013 1 8.8.4.4 168087 /pub/geott/eo_imagery/gcdb/W108/N50/N50d58mW108d28m_3.tif b _ o a googlebot@google.com ftp 0 * c
Tue Jan 1 07:04:39 2013 1 8.8.4.4 16958 /pub/geott/cli_1m/e00_pro/fcomfins.gif b _ o a googlebot@google.com ftp 0 * c

--------------REGEX Expression-----------------
(?x)
(?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s
(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\s
(?P<day>[\d]{1,})\s
(?P<time>[\d+:]+)\s
(?P<year>[\d]{4})\s

(?P<unknown>[\d]+)\s
(?P<ip>[\d]{1,3}.[\d]{1,3}.[\d]{1,3}.[\d]{1,3})\s
(?P<length>[\d]{,})\s
(?P<path>/[\w+/]+)/
(?P<file>[\w\d-]+.\w+)\s
(?P<type>[a|b])\s
(?P<action>[C|U|T|_])\s
(?P<direction>[o|i|d])\s
(?P<mode>[a|g|r])\s
(?P<user>[\w\d]+@|[\w\d]+@[\w\d.]+)\s
(?P<service>[\w]+)\s
(?P<auth>[0|1])\s
(?P<userid>[*])\s
(?P<status>[c|i])
(?P<stuff>)

----------------Script Call----------------
./misc/log-analytics/import_logs.py --url=http://PIWIKSERVER --token-auth=AUTHSTRING --output=proclogs/procFtpPiwik.log --enable-reverse-dns --idsite=17 --skip=0 --dry-run --log-format-regex="(?x)(?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\s(?P<day>[\d]{1,})\s(?P<time>[\d+:]+)\s(?P<year>[\d]{4})\s(?P<unknown>[\d]+)\s(?P<ip>[\d]{1,3}.[\d]{1,3}.[\d]{1,3}.[\d]{1,3})\s(?P<length>[\d]{,})\s(?P<path>/[\w+/]+)/(?P<file>[\w\d-]+.\w+)\s(?P<type>[a|b])\s(?P<action>[C|U|T|_])\s(?P<direction>[o|i|d])\s(?P<mode>[a|g|r])\s(?P<user>[\w\d]+@|[\w\d]+@[\w\d.]+)\s(?P<service>[\w]+)\s(?P<auth>[0|1])\s(?P<userid>[*])\s(?P<status>[c|i])(?P<stuff>)"-
-------------Script output ----------------------
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log logs/ftpLogsJunco/xferlog2...
Traceback (most recent call last):
File "./misc/log-analytics/import_logs.py", line 1411, in <module>
main()
File "./misc/log-analytics/import_logs.py", line 1375, in main
parser.parse(filename)
File "./misc/log-analytics/import_logs.py", line 1299, in parse
date_string = match.group('date')
IndexError: no such group
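
One hedged way around the missing group: import_logs.py expects a single (?P<date>...) group and, as far as I can tell, parses it with the Apache %d/%b/%Y:%H:%M:%S layout when a custom regex is used, so the xferlog timestamp prefix can be rewritten before the file reaches the importer. A minimal sketch under those assumptions:

    # Rewrite xferlog's "Tue Jan  1 14:12:36 2013" prefix into Apache-style
    # "01/Jan/2013:14:12:36" so a custom regex can expose a single date group.
    import re
    import sys
    from datetime import datetime

    PREFIX = re.compile(r'^\w{3} (\w{3}) +(\d+) ([\d:]{8}) (\d{4}) ')

    for line in sys.stdin:
        m = PREFIX.match(line)
        if not m:
            continue  # not a transfer line
        stamp = datetime.strptime(' '.join(m.groups()), '%b %d %H:%M:%S %Y')
        sys.stdout.write(stamp.strftime('%d/%b/%Y:%H:%M:%S') + ' ' + line[m.end():])

The rewritten prefix can then be captured with a leading (?P<date>\S+) in the custom regex.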

@anonymous-piwik-user

@ottodude125 and @elm: I have the same issue and reported it as a separate ticket here: #3757#ticket

@anonymous-piwik-user

How do I exclude visits with more than 150 actions from a site's stats?

@anonymous-piwik-user

Replying to Cyril:

Those having errors with IIS: please upload a log file with lines causing the error. A single line is probably causing it, so it'd be better to upload that single line(s) rather than a big file. The skip value will help you find that line.

My web logs have additional fields logged. Some of these resolve/transfer over when using AWStats; others are excluded in AWStats with %other% values. I tried to exclude the additional field data by creating new but unused entries in the importer's IIS format section, but was not able to get past the error "'IisFormat' object has no attribute 'regex'". Forum/web searches bring this up as a common problem, but I haven't found a fix. Any suggestions? Sample log file inline.

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-02-23 00:00:01
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-02-23 00:00:01 192.168.1.202 GET /pages/AllItems.aspx - 443 DOMAIN\username 2.3.4.5 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+chromeframe/24.0.1312.57;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) 200 0 0 499
2013-02-23 00:00:01 192.168.1.202 GET /pages/logo.jpg - 443 DOMAIN\username 2.3.4.5 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+chromeframe/24.0.1312.57;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) 304 0 0 312

@mattab
Owner

Piwik Log Analytics is now being used by hundreds of users and seems to be working well! We are always interested in new feature requests and suggestions. You can post them here, and if you are a developer, please consider opening a pull request.

@anonymous-piwik-user

Hi,

The log analytics script does not accept any time argument.
Is it therefore assumed that the log files to be processed have already been filtered (by timestamp range) in order to avoid duplicate processing?

Thanks.

@anonymous-piwik-user

Hi

I've been trying to import some logs from a tomcat/valve access log.

According to this http://tomcat.apache.org/tomcat-5.5-doc/config/valve.html, my app's server.xml defines

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="/sillage/logs/performances" pattern="%h %l %u %t %r %s %b %D Referer=[%{Referer}i]" prefix="access." resolveHosts="false" suffix=".log"/>

Here are a couple of lines from one of my access-datetime.log files:

10.10.40.85 - - [08/Apr/2013:11:02:49 +0200] POST /...t.do HTTP/1.1 200 39060 629 Referer=[http://.....jsp]
10.10.40.60 - - [08/Apr/2013:11:02:49 +0200] GET /...e&typ_appel=json HTTP/1.1 200 2895 2 Referer=[-]
10.10.40.85 - - [08/Apr/2013:11:02:48 +0200] POST /...r.jsp?cmd=tracer HTTP/1.1 200 90 63 Referer=[http://....jsp]

In short, trying to get the proper --log-format-regex has been a nightmarish failure. Improving the documentation on this complex but sometimes unavoidable option is necessary. Having a simple table matching the usual

%h => (?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)?

(guessed from reading the README example...) would help. Maybe...

@oliverhumpage

Replying to lyrrr:

Shortly said, trying to get the proper --log-format-regex has been a nightmarish failure. Improving the documentation on this complex but sometime unavoidable option is necessary. Having a simple array matching the usual

%h => (?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)?

(guess reading README exemple...) would help. Maybe...

If you're using --log-format-regex on the command line then I don't think the escaping is necessary. It's only if you're piping directly to piwik via (in my case) apache's ability to send logs to programmes that you need to work out how to do the multiple-escape thing.

@anonymous-piwik-user

I'll try tomorrow, but I'm skeptical: I copied the \\ stuff from the README.md example.

@oliverhumpage

I've just double-checked the README.md, and the only time I can see that weird escaping is in the bit I wrote called "Apache configuration source code". It's meant to be apache config, not CLI - apologies if that's not clear.

You may need to put a bit of escaping in depending on your shell, but nowhere near the amount that apache requires (since you've got to escape the initial parsing of the config file, then the shell escaping as it runs the command, and still be left with backslashes).

I think if you single quote it's mostly OK, i.e. with tcsh or bash

--log-format-regex='(?P<host>[\w...])'

would pass the regex in unscathed, or with my copy of ancient sh you just need one extra backslash, i.e.

--log-format-regex='(?P<host>[\\w...])'

etc.

HTH

@mattab
Owner

Maybe we are missing a few examples in the doc for how to call the script. Would you mind sharing your examples, if you're reading this?

We will add such help text to the README.

@anonymous-piwik-user

Okay, finally this worked:

python misc/log-analytics/import_logs.py --url=http://localhost/piwik log_analysis/access.2013-04-02.log --idsite=1 --log-format-regex='(?P<ip>\S+) (?P<host>\S+) (?P<user_agent>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<query_string>\S*) (?P<path>\S+) HTTP\/1\.1 (?P<status>\S+) (?P<length>\S+) (?P<time>.*?) (?P<referrer>.*?)'

This would be an interesting example for your docs, I guess:

  • matching a regex to a valve format
  • considering %l (remote hostname) and %u (remote user auth name), which default to - if unavailable
  • dealing with %r, the request's first line combining POST/GET + path + protocol, matched here with query_string + path + a hardcoded value to ignore

I now have to play with piwik to ponder the relevancy of the tool for my use case (analyzing clients' calls to a server managing schedules, client information, etc., to get a better idea, a big picture, of topics like network/database/CPU).
I guess I'm not being very clear, and I'm twisting piwik away from its intended "web analysis" usage. Any suggestion on this topic is welcome.

Last technical thing for this post: my time field is in milliseconds, not seconds. How do I specify that?

Thanks for the help!

@anonymous-piwik-user

I have set this up on a varnish server that is logging through varnishncsa. However, the requests that varnish logs include the host name as the "request."

123.456.78.9 - - [23/Apr/2013:07:05:51 -0400] "GET http://asite.org/thing/471 HTTP/1.1" 200 13970 "http://www.google.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

When I import this with import_logs.py, piwik was registering hits at http://asite.org/http://asite.org/thing/471 so I was able to work around this by using the log-format-regex parameter.

--log-format-regex='(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ https?://asite\.org(?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

It would be great if this were more directly supported and documented (varnishncsa tracking through import_logs.py). I suspect my method isn't ideal for situations where more than one site is being cached with varnish and also if visitors to those sites are being logged by piwik. This method probably only works with one domain.

@oliverhumpage

Hi bangpound,

I'm not a piwik dev so I can't comment on including varnishncsa support in import_logs.py itself, but if you change your regex slightly to replace

https?://asite\.org

with

(?P<host>https?://[^/]+)

then that will pick up the hostname of the site and therefore work well with multiple vhosts (either define them in piwik in advance, or use --add-sites-new-hosts to add them automatically).

Hope that helps.

@anonymous-piwik-user

Similar to cdgraff's request:
Feature Request: Support WMS (Windows Media Services) logs. Currently we use Awstats but it would be great to be able to move it to PIWIK.

I have attached a sample of WMS version 9.0 log file: WMS_20130523.log

@anonymous-piwik-user

Attachment: Log for WMS 9.0
WMS_20130523.log

@anonymous-piwik-user

I've noticed that the local time for imported logs is not set correctly. Is this correct or am I doing something wrong?

It seems as if Piwik is using the timezone of the web server that created the logs to set the local visitor time. I don't know if this is part of the importer or part of Piwik itself, but I would like the local visitor time to reflect the timezone the visitor is actually in, based on GeoIP. It should be possible either by approximation from longitude and latitude or by using a database like GeoNames.

@anonymous-piwik-user

Hey Folks,

Thought I would inform this thread that I have been working on a batch-loading script for those of us who require some extra features, such as remembering how many lines of a log were processed. The major use case is people running scripts through cron jobs on log files rotated monthly who want to run the stats daily, or at least more frequently than monthly.

You can check out the branch development of batch-loader.py for piwik here:

https://github.com/drsampson/piwik/tree/batch-loader/misc/log-analytics/batch-loader

I would love some testers and feedback. Read the readme here for an overview:
https://github.com/drsampson/piwik/blob/batch-loader/misc/log-analytics/batch-loader/readme.md

Developer notes:
This work is a branch of a forked version of piwik. My goal is to someday make a pull request to integrate in piwik. So piwik developers are encouraged to comment so I can prepare.

@cbay
Collaborator

dsampson: I've had a very quick look at your script. The core feature, which is keeping track of already imported log lines, should be done in Piwik itself, as detailed by Matt on this ticket. Using a local SQLite database is an inferior solution.

Your Python code could be better. A few suggestions:

  • follow the PEP8, as it's the de-facto standard in the Python world
  • do not concatenate multiple strings with +. Instead, store your strings in a list and use ''.join()
  • even better, when you can, use string formatting
  • count = sum(1 for line in open(logFile)) : use len() with xreadlines() instead.
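
For illustration, the two string tips above in miniature (the values are made up):

    parts = ['Parsed ', str(1648), ' lines, ', str(43), ' recorded']
    msg = ''.join(parts)                               # join a list, not repeated '+'
    msg = 'Parsed %d lines, %d recorded' % (1648, 43)  # or use string formatting
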
@anonymous-piwik-user

Thanks for the feedback.

  • I will review pep8, it has been a while
  • interesting approach with strings, I can give that a shot
  • do you mean string formatting using %s? For DB access the community advised against that, but for running the external command it might make sense.

As for developing in Piwik: Python is the extent of this geographer's hacking skills. Since this was not being done within Piwik, I created a homebrew solution, then convinced myself to offer it back to the community for those who could use it.

Perhaps it will inspire someone to do it the right way within piwik, which would be awesome. Right now it keeps me out of the piwik internals, which is probably best for everyone (smile).

@cbay
Collaborator

String formatting was a general tip to avoid multiple concatenations. Indeed, it should NOT be used for SQL requests with unfiltered input.

As for having a proper solution to your problem, you might try harassing Matt so that he implements it into Piwik :) Just kidding, but I would LOVE to have it!

@mattab
Owner

Thanks for your submission of this tool that extends the log analytics use cases.

For the particular "log line skip" feature: why in core? Because if several servers call Piwik, you are in trouble with a local SQLite database. Better to re-use the Piwik datastore to keep track of dupes :)

Here is my updated implementation proposal.

  • Detect when log-lines are re-imported and only import them once.
    • Implementation: add new table piwik_log_lines (hash_tracking_request, day ))
    • In Piwik Tracker, before looping on the bulk requests, SELECT all the log lines that have already been processed on this day (WHERE hash_tracking_request IN (a,b,c,d) AND day=?) & Skip these requests from import
    • After bulk requests are processed in piwik.php process, INSERT in bulk (hash, day)
  • By default this feature would be enabled only for "Log import" script,
    • via a parameter that we know is the log import (&li=1 /import_logs=1)
    • but may be later useful to all users of Tracking API for general deduping service.
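
For illustration only, here is a sketch of that dedup flow. The table and column names come from the proposal above; SQLite stands in for Piwik's MySQL datastore, and everything else is illustrative rather than actual Piwik code:

    import hashlib
    import sqlite3  # stand-in for Piwik's MySQL datastore

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE piwik_log_lines ('
                 'hash_tracking_request TEXT, day TEXT, '
                 'PRIMARY KEY (hash_tracking_request, day))')

    def new_requests_only(requests, day):
        # SELECT the hashes already processed on this day and skip those requests.
        if not requests:
            return []
        hashes = dict((hashlib.sha1(r.encode('utf-8')).hexdigest(), r) for r in requests)
        placeholders = ','.join('?' * len(hashes))
        seen = set(row[0] for row in conn.execute(
            'SELECT hash_tracking_request FROM piwik_log_lines '
            'WHERE hash_tracking_request IN (%s) AND day = ?' % placeholders,
            list(hashes) + [day]))
        return [r for h, r in hashes.items() if h not in seen]

    def mark_imported(requests, day):
        # After the bulk requests are processed, INSERT the hashes in bulk.
        conn.executemany('INSERT OR IGNORE INTO piwik_log_lines VALUES (?, ?)',
                         [(hashlib.sha1(r.encode('utf-8')).hexdigest(), day)
                          for r in requests])
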
@anonymous-piwik-user

Matt,

I agree with you that getting it into core would be best. Having this solution would mean I could possibly dissolve my forked project. Again, if I were a PHP and MySQL developer I would love to help; as a geographer, scripting is done on the side to handle special use cases.

For clarification of the use case for this script: it is launched independently of Piwik. By that I mean the script will likely reside on a log server somewhere, not the Piwik server, and will likely be called through a cron job. Since there will only be a single instance of the script running on any server, you won't run into collisions with multiple servers using it. If you need multiple instances, each will have an independent SQLite DB. That is why I used SQLite: only one client accesses the database at any one time.

Let me know when these features are added to core and I will dissolve my fork.

Good luck.

@mattab
Owner

Updated description, adding:

  • #3867 cannot resume with line number reported by skip for ncsa_extended log format
  • #4045 autodetection hangs on a weird formatted line
@anonymous-piwik-user

Request for support of "X-Forwarded-For" when importing logs, for cases where a load balancer is placed in front of the web server.

The Apache log format is as follows:

LogFormat "%v %{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" cplus

Sample log:
smartstore.oomph.co.id 10.159.117.216, 202.70.56.129 - - +0700 "GET /index.php/nav/get_menu/1/ HTTP/1.1" 200 2391 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"

As you can see, there are 2 IPs in the remote host field (in this case the X-Forwarded-For parameter): the 1st IP is the "virtual IP/local IP" and the second is the proxy used on the mobile network.

The regular expression used when importing the log is as follows:

--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

This works for regular log lines where there is only 1 IP address...

The current workaround is to modify import_logs.py to accept an additional proxy field and run the import again with a new regex:

--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?P<proxy>\S+), (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

It would be nice to have built-in support for X-Forwarded-For instead.

@cbay
Collaborator

If you're using a reverse proxy, you really should use something like mod_rpaf so that the recorded IP address for Apache is the correct one (the client, not the proxy). And then you can use the standard log formats.

@anonymous-piwik-user

Correct me if I am wrong... I'm pretty new to piwik... had used awstats previously.

That would work if these weren't older logs. We are talking about handling IMPORTED logs, not live ones; it makes little sense to ask users to use mod_rpaf when their aim is to import older logs produced before it was in place.

The aim of the import is to import older logs; for current tracking, this can already be done by piwik itself.

Replying to Cyril:

If you're using a reverse proxy, you really should use something like mod_rpaf so that the recorded IP address for Apache is the correct one (the client, not the proxy). And then you can use the standard log formats.

@cbay
Collaborator

I don't get why that wouldn't work with a custom regexp?

@oliverhumpage

Assuming you want the last IP in the list (and also that you trust the last IP in the list - this is why mod_rpaf is the best idea since you can prevent clients spoofing IPs):

--log-format-regex='(?P<host>[\w\-\.]+)(?::\d+)? (?:\S+, )*(?P<ip>\S+)'

If you want to capture proxy information, I don't think piwik supports that, so you'd need to set up a separate site with an import regex that captures the first IP in the list instead.
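
For anyone scripting around this in the meantime, a tiny illustrative helper (not part of import_logs.py):

    # Pick the client IP out of an X-Forwarded-For list. Taking the last
    # entry is only safe if you trust the proxies that appended it.
    def client_ip(xff_field):
        ips = [p.strip() for p in xff_field.split(',') if p.strip()]
        return ips[-1] if ips else None

    assert client_ip('10.159.117.216, 202.70.56.129') == '202.70.56.129'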

@anonymous-piwik-user

I think the main point here is to IMPORT existing logs. For new logs this can be handled easily, as it is all done via the JavaScript tracker.

As for "I don't get why that won't work with a custom regexp?" Any idea how/what the regexp can be...sorry I am no expert for regex...which is why I ended up having to process the log twice... and modifying the python script.

@anonymous-piwik-user

Hi,
I'm testing the import, and ran the python script twice on the same log file.
It looks like the same log file was processed twice.

Does it mean I have to handle the log file history on my own?
In other words, can you confirm the piwik log processor does not remember the start and end dates of the log files?

Thanks,
Axel

@mattab
Owner

In other words, can you confirm the piwik log processor does not remember the start and end dates of the log files?

Correct. we would like to add this feature at some point. If you can sponsor it, get in touch!

@mattab
Owner

There was a patch submitted to keep track of imported files.

@anonymous-piwik-user

Hi,

My box won't properly process log entries passed to stdin of import_logs.py. When I read the exact same entries from a file, everything works great. I am using nginx_json-formatted entries. I have tried dry-run mode and normal mode; each time I read from stdin I get the following output (nothing imported). Can anyone get this setup to work via stdin?

Thank you for your help!

Test data:

{"ip": "41.11.12.41","host": "www.mywebsite.com","path": "/","status": "200","referrer": "http://"www.mywebsite.com/previous","user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/32.0.1700.107 Chrome/32.0.1700.107 Safari/537.36","length": 3593,"generation_time_milli": 0.275,"date": "2014-03-12T22:41:23+01:00"}

Python script parameters:
--url=http://piwik.mywebsite.com
--idsite=1
--recorders=1
--enable-http-errors
--enable-reverse-dns
--enable-bots
--log-format-name=nginx_json

--output
2014-03-12 23:29:37,251: [DEBUG] Accepted hostnames: all
2014-03-12 23:29:37,252: [DEBUG] Piwik URL is: http://piwik.mywebsite.com
2014-03-12 23:29:37,252: [DEBUG] No token-auth specified
2014-03-12 23:29:37,252: [DEBUG] No credentials specified, reading them from "the config file"
2014-03-12 23:29:37,374: [DEBUG] Authentication token token_auth is: a really beautiful token :)
2014-03-12 23:29:37,375: [DEBUG] Resolver: static
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-03-12 23:29:37,532: [DEBUG] Launched recorder
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)

Logs import summary

0 requests imported successfully
0 requests were downloads
0 requests ignored:
    0 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Website import summary

0 requests imported to 1 sites
    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 10 seconds
Requests imported per second: 0.0 requests per second
@oliverhumpage

Jadeham,

Try setting --recorder-max-payload-size=1. I remember having issues myself when testing with very small data sets (e.g. just 1 line).

@estemendoza

I have a similar problem to Jadeham.

I have configured nginx to log in JSON format and created the following script, which reads from access.log (in JSON format) and passes every line to the importer's stdin:

import sh
from sh import tail

# Build the import command once; each .bake() appends a fixed argument.
run = sh.Command("/usr/bin/python")
run = run.bake("/var/www/piwik/misc/log-analytics/import_logs.py")
run = run.bake("--output=/home/XXX/piwik_live_importer/piwik.log")
run = run.bake("--url=http://X.X.X.X:8081/piwik/")
run = run.bake("--idsite=1")
run = run.bake("--recorders=1")
run = run.bake("--recorder-max-payload-size=1")
run = run.bake("--enable-http-errors")
run = run.bake("--enable-http-redirects")
run = run.bake("--enable-static")
run = run.bake("--enable-bots")
run = run.bake("--log-format-name=nginx_json")
run = run.bake("-")

# Follow the JSON access log and feed each new line to the importer's stdin
# (note: this launches a full import_logs.py run per line).
for line in tail("-f", "/var/log/nginx/access_json.log", _iter=True):
    run(_in=line)

The problem I'm having is that every record seems to be saved, but if I go to the main panel, today's history is not shown. This is the output when saving each line:

Parsing log (stdin)...
Purging Piwik archives for dates: 2014-04-16
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Logs import summary
-------------------

    1 requests imported successfully
    2 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    1 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 44.04 requests per second

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)

Besides that, when running archive.php, it's slower than parsing the default nginx log format, and a lot of lines are marked as invalid:

Logs import summary
-------------------

    94299 requests imported successfully
    145340 requests were downloads
    84140 requests ignored:
        84140 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    94299 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 1147 seconds
    Requests imported per second: 82.21 requests per second

Is there any way to know why these records are not shown, and which records are being marked as invalid?

@estemendoza

Ok, I figured out the cause of the invalid requests: the user_agent contained a strange character. So maybe the script should be aware of unicode characters.
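
A hedged sketch of what "aware of unicode characters" could mean in the importer (Python 2 era, so lines arrive as raw bytes; this is illustrative, not the script's actual code):

    # Decode a raw log line defensively: try UTF-8 first, then fall back to
    # latin-1, which maps every byte and therefore never raises.
    def decode_line(raw):
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            return raw.decode('latin-1')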

@mattab
Owner

To see the data in the dashboard, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Ok, I figured out why the invalid requests. It was because the user_agent had a strange character. So, maybe the script should be aware of unicode characters

Sure, please create a new ticket for this bug and attach a log file with 1 line that showcases the bug. Thanks

@anonymous-piwik-user

Replying to Hexxer:

Hi,

.............
Do you know the exact line that causes the problem? If you put only this line, does it also fail directly? Thanks!
.............

No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others.

Replying to matt:

I suppose we can do some basic test to see which value works best?
Maybe 50 or 100 tracking requests at once? :)

Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening - but, sorry, I have a 5-month-old young lady who needs my love and attention :-)

Wow. 23 months have passed, and still no solution to this problem???

I'm getting the same error, and there's no documentation anywhere to tell me how to fix it.

The URL is correct (I copy and paste it into my browser, and it gives me the Piwik login screen), and the Apache error logs show nothing from today. Here's my console output:

$./import_logs.py --url=https://www.mysite.com/pathto/piwik/ /var/log/apache/access.log --debug
2014-04-28 00:10:29,205: [DEBUG] Accepted hostnames: all
2014-04-28 00:10:29,205: [DEBUG] Piwik URL is: http://www.mysite.com/piwik/
2014-04-28 00:10:29,205: [DEBUG] No token-auth specified
2014-04-28 00:10:29,205: [DEBUG] No credentials specified, reading them from ".../config/config.ini.php"
2014-04-28 00:10:29,347: [DEBUG] Authentication token token_auth is: REDACTED
2014-04-28 00:10:29,347: [DEBUG] Resolver: dynamic
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:29,349: [DEBUG] Launched recorder
Parsing log [...]/log/apache/access.log...
2014-04-28 00:10:29,350: [DEBUG] Detecting the log format
2014-04-28 00:10:29,350: [DEBUG] Check format icecast2
2014-04-28 00:10:29,350: [DEBUG] Format icecast2 does not match
2014-04-28 00:10:29,350: [DEBUG] Check format iis
2014-04-28 00:10:29,350: [DEBUG] Format iis does not match
2014-04-28 00:10:29,351: [DEBUG] Check format common
2014-04-28 00:10:29,351: [DEBUG] Format common does not match
2014-04-28 00:10:29,351: [DEBUG] Check format common_vhost
2014-04-28 00:10:29,351: [DEBUG] Format common_vhost matches
2014-04-28 00:10:29,351: [DEBUG] Check format nginx_json
2014-04-28 00:10:29,351: [DEBUG] Format nginx_json does not match
2014-04-28 00:10:29,351: [DEBUG] Check format s3
2014-04-28 00:10:29,352: [DEBUG] Format s3 does not match
2014-04-28 00:10:29,352: [DEBUG] Check format ncsa_extended
2014-04-28 00:10:29,352: [DEBUG] Format ncsa_extended does not match
2014-04-28 00:10:29,352: [DEBUG] Check format common_complete
2014-04-28 00:10:29,352: [DEBUG] Format common_complete does not match
2014-04-28 00:10:29,352: [DEBUG] Format common_vhost is the best match
2014-04-28 00:10:29,424: [DEBUG] Site ID for hostname www.mysite.com not in cache
2014-04-28 00:10:29,563: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:31,612: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:33,657: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
Fatal error: Forbidden
You can restart the import of "[...]/var/log/apache/access.log" from the point it failed by specifying --skip=5 on the command line.

And of course, trying with --skip=5 produces the same error.

I have googled, I have searched the archives, the bug tracker contains no clue. Would really appreciate some kind soul taking mercy on me here.

@mattab
Owner

Piwik: HTTP Error 403: Forbidden

Please check your webserver error logs, there should be an error 403 logged in there that will maybe tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).

@anonymous-piwik-user

Replying to matt:

Piwik: HTTP Error 403: Forbidden

Please check your webserver error logs, there should be an error 403 logged in there that will maybe tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).

Apache error log shows only a restart once every hour. I am unable to configure Apache directly, as I am running Piwik on Gandi.net's "Simple Hosting" service. I have repeatedly begged gandi support to look into this matter, but their attitude is (and not unreasonably) that their job is not to support user installation issues like this. If you can give me ammunition that shows it really is Gandi's fault, then maybe we can move forward here.

Or maybe it's just a Piwik bug. Or I'm doing something wrong. I don't know.


@mattab
Owner

@foobard I suggest you create a new ticket for your particular issue, and we will try to help you troubleshoot it (maybe we need access to the server to reproduce and investigate). Cheers!

@mattab
Owner

Please do not comment on this ticket anymore. Instead, create a new ticket and assign it to the component "Log Analytics (import_logs.py)".

Here is the list of all tickets related to Log Analytics improvements: http://dev.piwik.org/trac/query?status=!closed&component=Log+Analytics+(import_logs.py)

@mattab mattab added this to the Future releases milestone
@mattab mattab removed the P: normal label
@mattab
Owner

Issue was moved to the new repository for Piwik Log Analytics: https://github.com/piwik/piwik-log-analytics/issues

refs #7163

@mattab mattab closed this
@mattab mattab added the duplicate label