Optimizations for import_logs.py #300
Conversation
@@ -1704,13 +1723,11 @@ def invalid_line(line, reason):
invalid_line(line, 'invalid encoding')
continue

# Check if the hit must be excluded.
if all((method(hit) for method in self.check_methods)):
Looks like this if statement that checks for valid requests was removed in your pull request; maybe this could explain some of the parsing CPU improvements?
Hi, most of the CPU improvements come from the other commits. The diff you're quoting is from commit 63d000f; I may be missing something, but I think it's redundant with the same (inverted) check made before the date check.
OK, you are right, your commit is valid; I missed the other redundant check!
The rest of the pull request looks fine to me.
FYI I've asked Travis if we can run the build on Python 2.6 instead of the current 2.7.3. I'll let you know when they reply.
OK, thanks. I think I still have some work to do in order not to break upgrades on servers with Python 2.6. Maybe one solution would be to fall back to no cache if we're on Python 2.6 and the library is not installed, and add some documentation about that requirement (README.md + a non-fatal warning?).
@cbonte Thanks for the details. What's the point of running a dry-run on a 100 million hits file, though? I'm genuinely intrigued by your use case. Considering you have a very specific use case IMHO, we could not merge this change BUT create a parse_date function that you could easily override locally. The import_logs script was specifically designed to be easy to override. @mattab, what do you want to do? PS: why concatenate the date and timezone to create the key rather than use a tuple?
I vote for removing the date optimization if it's really a small difference when not using --dry-run (especially if it helps make it run on Python 2.6?).
Honestly, in our usage, this is not a small difference. We are currently studying the migration from Urchin to Piwik for all of our websites, most of them generating more than 100 million hits per day (the logs I used were from a low-traffic day).
@cbay dry-run is used on some websites to profile them with Piwik. We are currently in a study phase to migrate some high-traffic websites from Urchin to Piwik. Dry-run is heavily used to get quick statistics and compare them with Urchin's. In a migration process this is quite helpful, at least in our case; I hope it can be helpful to others and to the Piwik team in order to promote it.
@cbonte I understand your use case, but since it's far from being typical IMHO and it adds complexity, I'm reluctant to merge it. Regarding performance, have you tried PyPy?
Hi again cbay. Concerning performance with PyPy, I didn't have time today to reproduce it on the same hardware, but launching the benchmark on an OpenVZ container with 4 vCPUs, running Debian Squeeze and Python 2.6.6 (the hardware is built on 2 physical Intel(R) Xeon(R) X5670 CPUs @ 2.93GHz / 6 cores / 12 threads), I obtained the following results:
…upported version (best practice is to test at least the minimum supported version) Refs #300
@cbonte before you make the script compatible with Python 2.6, could you please re-merge with master and push a trivial change, just to trigger the build again? I expect the build to fail, since the code is not compatible with Python 2.6 and the builds should now use Python 2.6.
@mattab I rebased the branch, which triggered a build (see commit 9b6b1f6). Then I could add the option to disable the cache for Python 2.6 without OrderedDict. I think it's ready now; feel free to review the code.
Would such an option have any benefit at all? We try to avoid adding options or settings unless they are absolutely necessary. The PR looks good to me.
Hi, sorry, I didn't mean a new command-line option; it was about making the cache optional (not active if OrderedDict is not present) instead of raising an exception, as is done in the latest commits.
Do you know if some users don't have OrderedDict present in their Python setup? If so, then it would be best to make it optional (instead of raising an exception, silently ignore the cache and use no cache).
@mattab I'd say that most users with Python 2.6 are affected; that's why I already made it optional (no exception is raised, but a message is displayed; I can make it completely silent if you prefer).
Because it's optional and does not change anything for 99.99% of users, silence is definitely better.
check_methods are called twice for each hit. The first call is sufficient to decide whether the hit should be excluded or not.
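For context, here is a minimal sketch of the pattern this commit keeps (hypothetical function and attribute names, not the actual import_logs.py code): the exclusion decision is made in a single pass over the check methods, before the more expensive work.

```python
# Hypothetical sketch: each check method returns True when the hit should
# be kept, and evaluating the checks once per hit is enough to decide.
def process_hit(hit, check_methods, parse_date, record):
    # Single exclusion check, done before the date parsing.
    if not all(method(hit) for method in check_methods):
        return

    hit.date = parse_date(hit.raw_date)

    # The commit drops a second, redundant pass over the same check_methods
    # that used to run again just before recording the hit.
    record(hit)
```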
cbay reported that the set comprehension syntax is available only in Python 2.7+. This patch fixes the syntax to keep backward compatibility with Python 2.6.
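As an illustration (hypothetical variable names, not the exact line from the script), the Python 2.7-only syntax and its 2.6-compatible equivalent:

```python
hits = ['/a', '/b', '/a']

# Python 2.7+ only: set comprehension syntax.
#   unique_paths = {h for h in hits}

# Works on Python 2.6 as well: pass a generator expression to set().
unique_paths = set(h for h in hits)
```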
Fall back to non-cached date parsing when OrderedDict is not available. This can occur with Python < 2.7 when the OrderedDict backport from PyPI is not installed.
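A minimal sketch of that fallback, assuming a hypothetical parse_date helper, an arbitrary cache bound, and ordereddict as the name of the PyPI backport (none of these names are taken from the actual script):

```python
import datetime

try:
    from collections import OrderedDict  # standard in Python 2.7+
except ImportError:
    try:
        from ordereddict import OrderedDict  # assumed PyPI backport name
    except ImportError:
        OrderedDict = None  # no cache available: fall back to plain parsing

_DATE_CACHE = OrderedDict() if OrderedDict is not None else None
_DATE_CACHE_SIZE = 3600  # arbitrary bound for this sketch

def parse_date(date_string, fmt='%d/%b/%Y:%H:%M:%S'):
    """Parse a log timestamp, caching results only when OrderedDict exists."""
    if _DATE_CACHE is None:
        return datetime.datetime.strptime(date_string, fmt)
    try:
        return _DATE_CACHE[date_string]
    except KeyError:
        value = datetime.datetime.strptime(date_string, fmt)
        _DATE_CACHE[date_string] = value
        if len(_DATE_CACHE) > _DATE_CACHE_SIZE:
            # Evict the oldest entry (FIFO) to keep memory bounded.
            _DATE_CACHE.popitem(last=False)
        return value
```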
As suggested by cbay, the cache key can be a tuple instead of a string concatenation.
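For illustration (hypothetical values, not the actual code), the difference between the two key styles:

```python
date_string = '10/Feb/2012:16:42:07'
timezone = '-0500'

# Before: build a throwaway concatenated string just to use it as a dict key.
key_concat = date_string + timezone

# After: a tuple is hashable, cheaper to build, and keeps the parts distinct.
key_tuple = (date_string, timezone)
```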
Modification done, and I rebased the branch from master, hoping it would fix the Travis tests, but they still fail and I see that master is also failing.
Optimizations for import_logs.py Fixes #5314 Kudos for the nice pull request! We hope to see more contributions from you in the future :)
…pageviews to be lost when importing big log files. This particular log file I'm testing on is for an intranet with thousands of times the same IP address. Not sure if it's related, but the same IP address will have many visits at the same second, for different users (different _id=X in the piwik.php requests). refs #300
This set of patches addresses some optimizations in the log-analytics script.
Tests were made on an extract of 1 million lines from logs in NCSA format (dry-run).
Before the optimizations:
Total time: 219 seconds
Requests imported per second: 4554.54
After:
Total time: 72 seconds
Requests imported per second: 13717.18