Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

import_logs.py crashes on invalid requests #4352

Closed
anonymous-piwik-user opened this Issue · 10 comments

3 participants

@anonymous-piwik-user

We have lots of log lines with invalid requests, for example:

www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "\x80w\x01\x03\x01" 400 226 "-" "-"

(these are drive-by exploit attempts).

It looks like the script wants both the request method and the HTTP version, since if I modify the line above it works:

www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "GET \x80w\x01\x03\x01 HTTP/1.1" 400 226 "-" "-"

We can't stop these requests, so I guess the parser should at least parse any old garbage that shows up in the path?

@mattab
Owner

Thanks for the report. We'll try to reproduce and add a test case showing the failure and that it's fixed.

@anonymous-piwik-user

Thanks, it looks like the problem is that the path regex was expecting two spaces that weren't present, between where e.g. "GET" and "HTTP/1.1". I came up with a patch that works locally and made a pull request:

#159

I confess I couldn't get the test suite to run, so it might need some work still.

@mattab
Owner

(these are drive-by exploit attempts).

Piwik puts them as "invalid" which I think they really are invalid request. Marking as wontfix. If you think it really should be fixed let us know why (since they appear to be invalid requests)

@anonymous-piwik-user

Replying to matt:

(these are drive-by exploit attempts).

Piwik puts them as "invalid"

It looks like I left out something important -- these are often the first requests in the logs!

$ git clone -q https://github.com/piwik/piwik.git
$ cat bad.log 
www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "\x80w\x01\x03\x01" 400 226 "-" "-"
$ cp $MYCFG piwik/config/config.ini.php
$ python2.7 ./piwik/misc/log-analytics/import_logs.py --url $MYURL bad.log 
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log bad.log...
Fatal error: cannot determine the log format using the first line of the log file. Try removing it or specifying the format with the --log-format-name command line argument.

You could search through the entire file looking for a line that matches the regex, but that can waste tons of time on log files with an unknown format. The fix in the pull request instead allows the regex to match even when the request is garbage.

I am planning to add a test case to the pull request, but will be super busy for another 1.5 weeks.

@mattab
Owner

Sorry I missed the "Crashed" keyword!

@diosmosis
Collaborator

Fixed in 3572ef7.

@anonymous-piwik-user

Replying to capedfuzz:

Fixed in 3572ef7.

Thanks for the quick fix! I just ran it on last night's logs and it made it all the way through.

I would beware of a potential Heisenbug, though: we have a ton of sites that basically no real human ever visits. It's totally possible that one day we could have a log containing nothing but invalid requests (e.g. the bad.log case above). In that case, the entire import will still crash.

For a workaround in a similar vein, maybe if you run through the entire file and find no valid log lines, it shouldn't error out? I suppose to be reliable, the number of log lines (1000) would need to be adjustable. I would personally let it check the whole file for a match.

Does Piwik record the invalid requests that it finds once it knows the log format? I ask because you should get the same results from one file containing,

invalid
invalid
invalid
invalid
valid

as you do if you split that into two files,

invalid
invalid
invalid
invalid
valid

Only one invalid line would be parsed with the files split, since no format would be found in the first file. But if the invalid lines are dropped anyway, it's a moot point.

Thanks again.

@mattab
Owner

In bb477d7: Refs #4352 add comments

@mattab
Owner

In d549d2c: Refs #4352 increase to 100,000, it takes ~ 6 seconds on my box on a file with 100k lines, so still fast enough to fail. If a file has more than 100k wrong lines there's defnitely something wrong with it.

@anonymous-piwik-user

Agreed, thanks a million guys.

@anonymous-piwik-user anonymous-piwik-user added this to the 2.0 - Piwik 2.0 milestone
@sabl0r sabl0r referenced this issue from a commit in sabl0r/piwik
@mattab mattab Refs #4352 add comments bb477d7
@sabl0r sabl0r referenced this issue from a commit in sabl0r/piwik
@mattab mattab Refs #4352 increase to 100,000, it takes ~ 6 seconds on my box on a f…
…ile with 100k lines, so still fast enough to fail. If a file has more than 100k wrong lines there's defnitely something wrong with it.
d549d2c
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.