Quiet / Minimal verbosity flag #184

Closed
2 of 7 tasks
n0ncetonic opened this issue Mar 19, 2019 · 4 comments
Labels
size: easy status: idea-phase Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet touches: configuration

Comments

@n0ncetonic
Contributor

Type:

  • General Question or Discussion
  • Propose a brand new feature
  • Request modification of existing behavior or design

What is the problem that your feature request solves
ArchiveBox could see a performance increase by allowing users to minimize or completely disable the mostly informational/debug messages during archiving.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

ArchiveBox is great at what it does, and the multi-output approach makes for an extremely robust archival solution, but it is pretty slow, as I'm finding while archiving 4.3k URLs. I know part of this is because the URLs are not processed in parallel (which was being discussed elsewhere), but another big contributor to many command-line applications taking a performance hit is blocking I/O from logging to the console. Printing every individual phase of archiving a URL adds to ArchiveBox's I/O footprint, because each message makes the process wait while the terminal draws to the screen, updates its view, refreshes, and so on.

I'm interested in knowing if anyone has run benchmarks on ./archive comparing three cases: stdout redirected to /dev/null, standard output with no progress bar, and standard output with progress bars. The results would help assess whether to introduce a flag/option that either silences output entirely or limits archival status messages to the [+] [2019-03-19 13:46:40] ... message printed when a URL begins to be archived.

What hacks or alternative solutions have you tried to solve the problem?
Tests could be done by running cat inputFile.txt | ./archive > /dev/null
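As a rough way to isolate the cost of the writes themselves, here is a minimal Python sketch that times the same status lines against different sinks. The helper name `time_writes` and the line count are assumptions for illustration; this is not part of ArchiveBox, and it does not measure a real terminal unless you pass `sys.stdout` yourself while running in one.

```python
import io
import os
import time


def time_writes(stream, n_lines=100_000):
    """Return the seconds taken to write n_lines short status lines to stream."""
    start = time.perf_counter()
    for i in range(n_lines):
        stream.write(f"[+] archiving link {i}\n")
    stream.flush()
    return time.perf_counter() - start


# Discarded output: roughly what `> /dev/null` would measure.
with open(os.devnull, "w") as sink:
    devnull_seconds = time_writes(sink)

# In-memory buffer: no terminal and no file descriptor involved.
buffer_seconds = time_writes(io.StringIO())

print(f"os.devnull: {devnull_seconds:.3f}s, in-memory: {buffer_seconds:.3f}s")
```

Comparing these numbers against a run that writes to an interactive terminal would show how much of the per-link time is actually spent on console output.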

How badly do you want this new feature?

  • It's an urgent deal-breaker, I can't live without it
  • It's important to add it in the near-mid term future
  • It would be nice to have eventually
  • I'm willing to contribute to development

Here are some links further detailing the need for buffered I/O when dealing with applications that are heavily impacted in performance by blocking I/O operations:

https://stackoverflow.com/questions/3857052/why-is-printing-to-stdout-so-slow-can-it-be-sped-up
https://eklitzke.org/stdout-buffering

@pirate
Member

pirate commented Mar 19, 2019

Thanks for the suggestion!

On a multi-core machine the stdout buffering is not going to be the limiting factor on ArchiveBox performance until it's running at least 2 or 3 orders of magnitude faster than right now.

If you're interested in the performance breakdown, there are several major performance tickets that are going to be cleared up in the next ~6 months before stdout buffering becomes worth investigating:

I would be open to adding a -q option for quieter output, but only if it does something more than just > /dev/null.
Maybe it could output just the status lines at the start and end, in a format suitable for logging, e.g.:

[2019-03-12 07:24:53] ArchiveBox started. Importing 30 new links from output/sources/sharli-example.txt (Parsed as Plain Text)...
[2019-03-12 07:24:53] ArchiveBox finished. Imported 30 new links: 2 failed, 28 succeeded. Saved index (2,945 links) to output/index.html.
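A minimal sketch of how those two status lines could be produced, purely as an illustration of the proposed format. The function names `start_line` and `finish_line` and their parameters are hypothetical and do not come from the ArchiveBox code base.

```python
from datetime import datetime


def _stamp(now):
    """Format a datetime the way the example log lines do."""
    return now.strftime("%Y-%m-%d %H:%M:%S")


def start_line(now, num_links, source, parser):
    """Build the single 'started' line for quiet mode (hypothetical helper)."""
    return (
        f"[{_stamp(now)}] ArchiveBox started. Importing {num_links} new links "
        f"from {source} (Parsed as {parser})..."
    )


def finish_line(now, num_links, failed, total_links, index_path):
    """Build the single 'finished' line for quiet mode (hypothetical helper)."""
    succeeded = num_links - failed
    return (
        f"[{_stamp(now)}] ArchiveBox finished. Imported {num_links} new links: "
        f"{failed} failed, {succeeded} succeeded. "
        f"Saved index ({total_links:,} links) to {index_path}."
    )


now = datetime(2019, 3, 12, 7, 24, 53)
print(start_line(now, 30, "output/sources/sharli-example.txt", "Plain Text"))
print(finish_line(now, 30, 2, 2945, "output/index.html"))
```

Keeping each event on one timestamped line means the output can be appended to a log file and filtered by line.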

@pirate pirate added status: idea-phase Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet touches: configuration size: easy labels Mar 19, 2019
@n0ncetonic
Contributor Author

I ran into weird issues trying to redirect output to /dev/null, so that experiment is on hold. In the meantime, here are some logs of runs with no progress bar or color on master vs. my branch, which adds some flags to the headless Chromium invocation.

Current ArchiveBox master branch

[+] [2019-03-19 10:04:15] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourUserInterface/InternationalizingYourUserInterface.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourUserInterface/InternationalizingYourUserInterface.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1280 (new)
      <snip>
[+] [2019-03-19 10:04:34] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1281 (new)
    <snip>
[+] [2019-03-19 10:04:49] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingLocaleData/InternationalizingLocaleData.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingLocaleData/InternationalizingLocaleData.html
   <snip>
[+] [2019-03-19 10:05:05] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/Glossary/Glossary.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/Glossary/Glossary.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1283 (new)
  <snip>
[+] [2019-03-19 10:05:18] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Tasks/InstallingFrameworks.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Tasks/InstallingFrameworks.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1284 (new)

Optimization flags on Chromium

[+] [2019-03-19 15:56:07] "https://developer.apple.com/library/archive/documentation/General/Conceptual/MOSXAppProgrammingGuide/AppRuntime/AppRuntime.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/MOSXAppProgrammingGuide/AppRuntime/AppRuntime.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1909 (new)
<snip>
[+] [2019-03-19 15:56:16] "https://developer.apple.com/library/archive/documentation/General/Conceptual/GameplayKit_Guide/index.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/GameplayKit_Guide/index.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1910 (new)
<snip>
[+] [2019-03-19 15:56:25] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ExtensibilityPG/index.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ExtensibilityPG/index.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1911 (new)
<snip>
[+] [2019-03-19 15:56:34] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/ThreadMigration/ThreadMigration.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/ThreadMigration/ThreadMigration.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1912 (new)
<snip>
[+] [2019-03-19 15:56:43] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/RevisionHistory.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/RevisionHistory.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1913 (new)
 <snip>
[+] [2019-03-19 15:56:52] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1914 (new)

On the master branch, the average archival time per URL, using all archival methods except Google Favicon, is about 12 seconds, with a range of 10 to 13 seconds. With some added flags to help optimize headless Chromium and the same archival methods, I saw an average of about 9 seconds, with a range of 8 to 10 seconds. That is a slight improvement of only a few seconds per URL. At 12 seconds per URL I am getting approximately 5 links archived a minute, vs. 6 to 7 links a minute with optimized headless Chromium.
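The throughput figures follow directly from the per-URL averages. A small sketch of the arithmetic, assuming the averages quoted above and a batch of 4,300 URLs (the "4.3k" from the opening post, rounded); the helper names are illustrative only.

```python
def links_per_minute(seconds_per_link):
    """Sequential throughput implied by an average per-link time."""
    return 60 / seconds_per_link


def total_hours(num_links, seconds_per_link):
    """Wall-clock hours to archive num_links one after another."""
    return num_links * seconds_per_link / 3600


for label, secs in [("master", 12), ("optimized Chromium flags", 9)]:
    print(
        f"{label}: {links_per_minute(secs):.1f} links/min, "
        f"{total_hours(4300, secs):.1f} h for 4,300 links"
    )
```

Over a batch of that size, a 3 second difference per URL adds up to several hours of wall-clock time.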

While the optimized flags helped a bit, I noticed the major bottleneck in archival speed was wget, and there isn't a whole lot that can be done about it: wget's functionality is missing from many other command-line download utilities and is almost non-existent in faster tools such as axel or aria2. I did stumble upon GNU wget2, a spiritual successor to GNU wget with more modern, performance-focused features such as HTTP/2 support, parallel connections, and If-Modified-Since header support, as well as support for processing RSS feeds. I am going to test wget2 both for performance and to determine whether it can be dropped into the master branch with minimal changes to the overall code base.

My hope is that the bottleneck in processing links will be alleviated and will add to the performance gains expected once the project is moved to Django.

wget2 project home is https://gitlab.com/gnuwget/wget2 in case others are interested

@pirate
Member

pirate commented Mar 20, 2019

We're already planning on moving to wpull to replace wget in the near future, see here for more info: #177

If it ends up being IO bound and not CPU bound we can always stick an event loop in each worker and do simultaneous async archiving of multiple links on each core.
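A minimal sketch of what an event loop inside one worker could look like, assuming the archive step is IO-bound. `archive_link` is a hypothetical placeholder that only sleeps; it is not an ArchiveBox function, and the concurrency limit of 4 is an arbitrary choice.

```python
import asyncio


async def archive_link(url, semaphore):
    """Hypothetical stand-in for one IO-bound archive step."""
    async with semaphore:
        await asyncio.sleep(0.01)  # placeholder for network/disk IO
        return url, "succeeded"


async def archive_all(urls, max_concurrent=4):
    """Archive many links concurrently inside a single worker's event loop."""
    semaphore = asyncio.Semaphore(max_concurrent)  # cap in-flight links
    tasks = [archive_link(url, semaphore) for url in urls]
    return await asyncio.gather(*tasks)  # results keep the input order


urls = [f"https://example.com/{i}" for i in range(10)]
results = asyncio.run(archive_all(urls))
print(results)
```

The semaphore keeps one worker from opening too many connections at once, while `asyncio.gather` lets the waits overlap instead of running back to back.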

@pirate
Member

pirate commented Apr 10, 2021

Going to close this for now because the speed concerns have been addressed in other ways (v0.6 is 10-100x faster than v0.5 in many operations), and much of archivebox's CLI output is now split between stderr and stdout, so if you want less verbose output you can always pipe stdout to /dev/null and just read stderr.
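For anyone scripting around this, a small sketch of discarding stdout while still reading stderr from Python. The child process here is a stand-in that prints one line to each stream, not the archivebox CLI, and which messages ArchiveBox sends to which stream is not something this sketch asserts.

```python
import subprocess
import sys

# Stand-in child process (an assumption, not the archivebox CLI):
# it writes one line to stdout and one line to stderr.
child_code = (
    "import sys; print('verbose detail'); "
    "print('status line', file=sys.stderr)"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.DEVNULL,  # same effect as `> /dev/null`
    stderr=subprocess.PIPE,     # keep stderr so it can be read or logged
    text=True,
)
print(result.stderr, end="")
```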

I thought about adding a SHOW_HINTS=True/False config flag to further reduce verbosity but decided against it, as it's not worth the overhead of another config option.
Instead I'm just hiding most hints once you have more than 25 snapshots in your archive, as it assumes you know how to use it by then.
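The rule amounts to a single threshold check. A sketch of that logic, where the names `HINT_SNAPSHOT_LIMIT` and `should_show_hints` are hypothetical and only the number 25 comes from the comment above:

```python
HINT_SNAPSHOT_LIMIT = 25  # threshold mentioned above; the constant name is hypothetical


def should_show_hints(snapshot_count, limit=HINT_SNAPSHOT_LIMIT):
    """Show hints only while the archive holds at most `limit` snapshots."""
    return snapshot_count <= limit


print(should_show_hints(3), should_show_hints(26))
```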

@pirate pirate closed this as completed Apr 10, 2021