-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
import_logs.py crashes on invalid requests #4352
Comments
Thanks for the report. We'll try to reproduce and add a test case showing the failure and that it's fixed. |
Thanks, it looks like the problem is that the path regex was expecting two spaces that weren't present, between where e.g. "GET" and "HTTP/1.1". I came up with a patch that works locally and made a pull request: I confess I couldn't get the test suite to run, so it might need some work still. |
Piwik puts them as "invalid" which I think they really are invalid request. Marking as wontfix. If you think it really should be fixed let us know why (since they appear to be invalid requests) |
Replying to matt:
It looks like I left out something important -- these are often the first requests in the logs!
You could search through the entire file looking for a line that matches the regex, but that can waste tons of time on log files with an unknown format. The fix in the pull request instead allows the regex to match even when the request is garbage. I am planning to add a test case to the pull request, but will be super busy for another 1.5 weeks. |
Sorry I missed the "Crashed" keyword! |
Fixed in 3572ef7. |
Replying to capedfuzz:
Thanks for the quick fix! I just ran it on last night's logs and it made it all the way through. I would beware of a potential Heisenbug, though: we have a ton of sites that basically no real human ever visits. It's totally possible that one day we could have a log containing nothing but invalid requests (e.g. the bad.log case above). In that case, the entire import will still crash. For a workaround in a similar vein, maybe if you run through the entire file and find no valid log lines, it shouldn't error out? I suppose to be reliable, the number of log lines (1000) would need to be adjustable. I would personally let it check the whole file for a match. Does Piwik record the invalid requests that it finds once it knows the log format? I ask because you should get the same results from one file containing,
as you do if you split that into two files,
Only one invalid line would be parsed with the files split, since no format would be found in the first file. But if the invalid lines are dropped anyway, it's a moot point. Thanks again. |
Agreed, thanks a million guys. |
…box on a file with 100k lines, so still fast enough to fail. If a file has more than 100k wrong lines there's defnitely something wrong with it.
We have lots of log lines with invalid requests, for example:
(these are drive-by exploit attempts).
It looks like the script wants both the request method and the HTTP version, since if I modify the line above it works:
We can't stop these requests, so I guess the parser should at least parse any old garbage that shows up in the path?
The text was updated successfully, but these errors were encountered: