Quiet / Minimal verbosity flag #184
Comments
Thanks for the suggestion! On a multi-core machine the stdout buffering is not going to be the limiting factor on ArchiveBox performance until it's running at least 2 or 3 orders of magnitude faster than right now. If you're interested in the performance breakdown, there are several major performance tickets that are going to be cleared up in the next ~6 months before stdout buffering becomes worth investigating:
I would be open to adding a [2019-03-12 07:24:53] ArchiveBox started. Importing 30 new links from output/sources/sharli-example.txt (Parsed as Plain Text)...
[2019-03-12 07:24:53] ArchiveBox finished. Imported 30 new links: 2 failed, 28 succeeded. Saved index (2,945 links) to output/index.html.
So I was running into weird issues trying to pipe to Current ArchiveBox
Optimization flags on Chromium
While optimized flags helped a bit I noticed the major bottleneck in archival speed was with My hope is that the bottleneck in processing links will be alleviated and will add to the performance gains expected once the project is moved to Django. The wget2 project home is https://gitlab.com/gnuwget/wget2 in case others are interested.
We're already planning on moving to If it ends up being IO bound and not CPU bound we can always stick an event loop in each worker and do simultaneous async archiving of multiple links on each core.
Going to close this for now because the speed concerns have been addressed in other ways (v0.6 is 10-100x faster than v0.5 in many operations), and much of ArchiveBox's CLI output is now split between stderr and stdout, so if you want less verbose output you can always pipe stdout to /dev/null and just read stderr. I thought about adding a
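For anyone driving the CLI from a script, the "pipe stdout to /dev/null and just read stderr" approach can be sketched in Python roughly as below. This is only an illustration: `run_quiet` is a hypothetical helper, the child process in the demo is a stand-in rather than ArchiveBox itself, and it makes no assumption about which messages ArchiveBox actually sends to which stream.

```python
import subprocess
import sys


def run_quiet(cmd):
    """Run cmd with stdout discarded; return (exit code, captured stderr text).

    Hypothetical helper for illustration; not part of ArchiveBox.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,  # equivalent to `> /dev/null`
        stderr=subprocess.PIPE,     # keep whatever the tool writes to stderr
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    # Stand-in child process that writes one line to each stream.
    demo = [
        sys.executable,
        "-c",
        "import sys; print('progress'); print('summary', file=sys.stderr)",
    ]
    code, err = run_quiet(demo)
    print(code, err.strip())
```

Replacing `demo` with the real command line would discard everything the tool prints to stdout while still surfacing its stderr output.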
Type:
What is the problem that your feature request solves
ArchiveBox could see a performance increase by allowing users to minimize or completely disable the mostly informational/debug messages during archiving.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
ArchiveBox is great at what it does, and the multi-output approach makes for an extremely robust archival solution, but it is pretty slow, as I'm experiencing while archiving 4.3k URLs. I know part of this is because the URLs are not processed in parallel (which was being discussed elsewhere), but another big contributor to many command-line applications taking a performance hit is the I/O blocking from logging to the console. Terminal output for every individual phase of archiving a URL adds to the I/O footprint of ArchiveBox because it forces the terminal to wait for input, draw to the screen, update its view, refresh the screen, etc. for every message that is posted.
I'm interested in knowing whether anyone has run benchmarks on ./archive comparing three cases: stdout redirected to /dev/null, standard output with no progress bar, and standard output with progress bars. The results would help assess whether to introduce a flag/option that either silences output entirely or limits archival status messages to the [+] [2019-03-19 13:46:40] ... message printed when a URL begins to be archived.
What hacks or alternative solutions have you tried to solve the problem?
Tests could be done by running
cat inputFile.txt | ./archive > /dev/null
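Independently of ArchiveBox, the raw cost of console logging can be estimated with a small Python sketch like the one below. It is a rough micro-benchmark under stated assumptions, not a measurement of ArchiveBox: `time_writes` is a hypothetical helper, and the message format and line count are made up for illustration.

```python
import os
import sys
import time


def time_writes(stream, n_lines=10_000, flush_each=True):
    """Return the seconds taken to write n_lines status-like lines to stream."""
    start = time.perf_counter()
    for i in range(n_lines):
        stream.write(f"[+] archiving link {i}\n")
        if flush_each:
            stream.flush()  # mimic unbuffered, line-by-line console logging
    stream.flush()
    return time.perf_counter() - start


if __name__ == "__main__":
    with open(os.devnull, "w") as devnull:
        quiet = time_writes(devnull)
    loud = time_writes(sys.stdout)
    # Report on stderr so the numbers survive a stdout redirect.
    print(f"/dev/null: {quiet:.4f}s  stdout: {loud:.4f}s", file=sys.stderr)
```

Running the script once in a terminal and once with stdout redirected to a file or /dev/null gives a sense of how much of the difference comes from the terminal itself rather than from the writes.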
How badly do you want this new feature?
Here are some links further detailing the need for buffered I/O when dealing with applications that are heavily impacted in performance by blocking I/O operations:
https://stackoverflow.com/questions/3857052/why-is-printing-to-stdout-so-slow-can-it-be-sped-up
https://eklitzke.org/stdout-buffering