Quiet / Minimal verbosity flag #184

n0ncetonic · 2019-03-19T20:50:45Z

Type:

General Question or Discussion
Propose a brand new feature
Request modification of existing behavior or design

What is the problem that your feature request solves
ArchiveBox could see a performance increase by allowing users to minimize or completely disable the mostly informational/debug messages during archiving.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

ArchiveBox is great at what it does, and the multi-output makes for an extremely robust archival solution, but it is pretty slow, as I'm experiencing while archiving 4.3k URLs. I know part of this is because the URLs are not processed in parallel (which was being discussed elsewhere), but another big contributor to many command-line applications taking a performance hit is I/O blocking from logging to the console. Terminal output for every individual phase of archiving a URL adds to the I/O footprint of ArchiveBox, because the program has to wait while the terminal receives the text, draws it to the screen, updates its view, and refreshes for every message that is posted.

I'm interested in knowing if anyone has run benchmarks on ./archive comparing three cases: stdout redirected to /dev/null, standard output with no progress bar, and standard output with progress bars. The results would help assess whether to introduce a flag/option that either silences output entirely or limits archival status messages to the [+] [2019-03-19 13:46:40] ... message printed when a URL begins to be archived.

What hacks or alternative solutions have you tried to solve the problem?
Tests could be done by running cat inputFile.txt | ./archive > /dev/null
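
A rough way to measure the output cost in isolation, before benchmarking ./archive itself, is to time the same writes against different destinations. The sketch below is only an illustration: `time_writes` and the sample status line are hypothetical, and it measures raw write cost, not ArchiveBox's real logging path.

```python
import io
import os
import sys
import time


def time_writes(stream, n_lines=10_000):
    """Return the seconds spent writing n_lines status-style lines to stream."""
    line = "[+] [2019-03-19 13:46:40] https://example.com/page\n"
    start = time.perf_counter()
    for _ in range(n_lines):
        stream.write(line)
    stream.flush()
    return time.perf_counter() - start


if __name__ == "__main__":
    # In-memory baseline: no file descriptor is involved at all.
    memory_secs = time_writes(io.StringIO())
    # Equivalent of redirecting output to /dev/null.
    with open(os.devnull, "w") as devnull:
        null_secs = time_writes(devnull)
    # Whatever stdout is attached to (a terminal, a pipe, or a file).
    stdout_secs = time_writes(sys.stdout)
    # Report on stderr so the results are not mixed into the measured stream.
    print(
        f"memory: {memory_secs:.4f}s  devnull: {null_secs:.4f}s  "
        f"stdout: {stdout_secs:.4f}s",
        file=sys.stderr,
    )
```

Comparing these timings with the total runtime of an archive run gives a first estimate of how much a quiet flag could save.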

How badly do you want this new feature?

It's an urgent deal-breaker, I can't live without it
It's important to add it in the near-mid term future
It would be nice to have eventually
I'm willing to contribute to development

Here are some links further detailing the need for buffered I/O when dealing with applications that are heavily impacted in performance by blocking I/O operations:

https://stackoverflow.com/questions/3857052/why-is-printing-to-stdout-so-slow-can-it-be-sped-up
https://eklitzke.org/stdout-buffering

pirate · 2019-03-19T23:29:16Z

Thanks for the suggestion!

On a multi-core machine the stdout buffering is not going to be the limiting factor on ArchiveBox performance until it's running at least 2 or 3 orders of magnitude faster than right now.

If you're interested in the performance breakdown, there are several major performance tickets that are going to be cleared up in the next ~6 months before stdout buffering becomes worth investigating:

an entire instance of Chrome is launched and killed 3 times for every link, this will be fixed by moving to pyppeteer: Switch all dependencies to pure python and release ArchiveBox pip package #177 (the current design is so ridiculously inefficient, I'm amazed that no one has opened a ticket complaining about it yet)
we create child processes to call out to wget, youtube-dl, and curl multiple times for each link, this will also be fixed by moving to pure python versions: Switch all dependencies to pure python and release ArchiveBox pip package #177
we load and rewrite the entire main index file on every link as a hack to get semi-realtime index UI updates during the archive process
it's single-threaded; pyppeteer will fix this as well by allowing us to create 1 headless browser per core and archive up to n links at a time
it uses static HTML and JSON files to store the data instead of SQLite with indexes, so everything constantly has to be read, parsed, iterated+filtered, and dumped back to disk, this will be fixed by: New index single-source-of-truth instead of JSON: SQL Database w/ migrations #57

I would be open to adding a -q option for quieter output, but only if it does something more than just > /dev/null.
Maybe it could output just the status lines at the start and end, in a format suitable for logging, e.g.:

[2019-03-12 07:24:53] ArchiveBox started. Importing 30 new links from output/sources/sharli-example.txt (Parsed as Plain Text)...
[2019-03-12 07:24:53] ArchiveBox finished. Imported 30 new links: 2 failed, 28 succeeded. Saved index (2,945 links) to output/index.html.
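
A minimal sketch of how such a quiet mode could behave, assuming a hypothetical `run_archive` loop and a hypothetical `archive_one` callable that returns True on success. None of these names come from the ArchiveBox code base; only the timestamped line format follows the example above.

```python
import sys
from datetime import datetime


def log_status(message, stream=sys.stdout, now=None):
    """Write one timestamped status line in the format shown above."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    stream.write(f"[{stamp}] {message}\n")


def run_archive(links, archive_one, quiet=False, stream=sys.stdout):
    """Archive every link; in quiet mode print only the start and end lines.

    archive_one is a hypothetical callable that returns True on success.
    """
    log_status(f"ArchiveBox started. Importing {len(links)} new links...", stream)
    failed = 0
    for link in links:
        if not quiet:
            # Per-link progress is the output that quiet mode suppresses.
            log_status(f"Archiving {link}", stream)
        if not archive_one(link):
            failed += 1
    succeeded = len(links) - failed
    log_status(
        f"ArchiveBox finished. Imported {len(links)} new links: "
        f"{failed} failed, {succeeded} succeeded.",
        stream,
    )
    return succeeded, failed
```

With quiet=True the run emits exactly two lines regardless of how many links are processed, which keeps the output suitable for a log file.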

n0ncetonic · 2019-03-20T03:58:34Z

So I was running into weird issues trying to pipe to /dev/null and so that experiment was put on hold. Here I'm posting some logs of runs with no progress bar or color vs my branch with some flags added to the chromium headless execution.

Current ArchiveBox master branch

[+] [2019-03-19 10:04:15] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourUserInterface/InternationalizingYourUserInterface.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourUserInterface/InternationalizingYourUserInterface.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1280 (new)
      <snip>
[+] [2019-03-19 10:04:34] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1281 (new)
    <snip>
[+] [2019-03-19 10:04:49] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingLocaleData/InternationalizingLocaleData.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingLocaleData/InternationalizingLocaleData.html
   <snip>
[+] [2019-03-19 10:05:05] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/Glossary/Glossary.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/Glossary/Glossary.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1283 (new)
  <snip>
[+] [2019-03-19 10:05:18] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Tasks/InstallingFrameworks.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Tasks/InstallingFrameworks.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1284 (new)

Optimization flags on Chromium

[+] [2019-03-19 15:56:07] "https://developer.apple.com/library/archive/documentation/General/Conceptual/MOSXAppProgrammingGuide/AppRuntime/AppRuntime.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/MOSXAppProgrammingGuide/AppRuntime/AppRuntime.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1909 (new)
<snip>
[+] [2019-03-19 15:56:16] "https://developer.apple.com/library/archive/documentation/General/Conceptual/GameplayKit_Guide/index.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/GameplayKit_Guide/index.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1910 (new)
<snip>
[+] [2019-03-19 15:56:25] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ExtensibilityPG/index.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ExtensibilityPG/index.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1911 (new)
<snip>
[+] [2019-03-19 15:56:34] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/ThreadMigration/ThreadMigration.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/ThreadMigration/ThreadMigration.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1912 (new)
<snip>
[+] [2019-03-19 15:56:43] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/RevisionHistory.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/RevisionHistory.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1913 (new)
 <snip>
[+] [2019-03-19 15:56:52] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1914 (new)

The master branch has an average archival time per URL, using all archival methods except Google Favicon, of about 12 seconds, with a range of 10 to 13 seconds. With some added flags to help optimize headless Chromium and the same archival methods, I saw an average of about 9 seconds, with a range of 8 to 10 seconds. That is a slight improvement of only a few seconds per link. At 12 seconds per link I am getting approximately 5 links archived per minute, vs. 6 to 7 links per minute with optimized headless Chromium.
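
The throughput figures follow directly from the per-link averages. As a quick worked check, the sketch below converts seconds per link into links per minute and into total hours for the 4.3k URLs mentioned earlier. It assumes every link takes the average time, which is a simplification, and the helper names are illustrative.

```python
def links_per_minute(seconds_per_link):
    """Convert an average per-link time into throughput."""
    return 60 / seconds_per_link


def total_hours(n_links, seconds_per_link):
    """Estimate the total runtime if every link takes the average time."""
    return n_links * seconds_per_link / 3600


if __name__ == "__main__":
    for label, secs in (("master", 12), ("optimized Chromium flags", 9)):
        print(
            f"{label}: {links_per_minute(secs):.1f} links/min, "
            f"{total_hours(4300, secs):.1f} h for 4,300 links"
        )
```

At these averages the full run drops from roughly 14 hours to roughly 11 hours.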

While the optimized flags helped a bit, I noticed the major bottleneck in archival speed was wget, and there isn't a whole lot that can be done in this regard, as wget's functionality is not present in many other command-line download utilities and is almost non-existent in faster tools such as axel or aria2. I did stumble upon GNU Wget2, a spiritual successor to GNU Wget with more modern, performance-focused features such as support for HTTP/2, parallel connections, and the If-Modified-Since header, as well as support for processing RSS feeds. I am going to test wget2 both for performance and to determine whether it can be dropped into the master branch with minimal changes to the overall code base.

My hope is that the bottleneck in processing links will be alleviated and will add to the performance gains expected once the project is moved to Django.

wget2 project home is https://gitlab.com/gnuwget/wget2 in case others are interested

pirate · 2019-03-20T23:31:21Z

We're already planning on moving to wpull to replace wget in the near future, see here for more info: #177

If it ends up being IO bound and not CPU bound we can always stick an event loop in each worker and do simultaneous async archiving of multiple links on each core.
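
As a rough illustration of that idea, the sketch below runs several I/O-bound tasks concurrently inside a single event loop and caps how many are in flight at once. `archive_link` is a hypothetical placeholder that just sleeps; it is not ArchiveBox code, and a real worker would await network and subprocess I/O there instead.

```python
import asyncio


async def archive_link(url):
    """Hypothetical stand-in for one link's I/O-bound archive work."""
    await asyncio.sleep(0.01)  # placeholder for real network I/O
    return url, True


async def archive_all(urls, max_concurrent=4):
    """Archive urls concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await archive_link(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    results = asyncio.run(archive_all(urls))
    print(f"{sum(ok for _, ok in results)} of {len(results)} links archived")
```

One such loop per worker process would combine per-core parallelism with overlapping I/O inside each worker.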

pirate · 2021-04-10T09:01:10Z

Going to close this for now because the speed concerns have been addressed in other ways (v0.6 is 10-100x faster than v0.5 in many operations), and much of archivebox's CLI output is now split between stderr and stdout, so if you want less verbose output you can always pipe stdout to /dev/null and just read stderr.
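
For anyone scripting this, the same redirection can be done from Python. The sketch below uses a small stand-in child process rather than the real archivebox command, and which messages the child sends to which stream is an assumption made for the demonstration.

```python
import subprocess
import sys

# Hypothetical child: verbose progress on stdout, a short summary on stderr.
CHILD = (
    "import sys\n"
    "print('verbose per-link progress')\n"
    "print('summary: 1 link archived', file=sys.stderr)\n"
)


def run_quietly(argv):
    """Run argv with stdout discarded and return what it wrote to stderr."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,  # same effect as redirecting to /dev/null
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )
    return proc.stderr


if __name__ == "__main__":
    print(run_quietly([sys.executable, "-c", CHILD]), end="")
```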

I thought about adding a SHOW_HINTS=True/False config flag to further reduce verbosity but decided against it, as it's not worth the overhead of another config option.
Instead I'm just hiding most hints once you have more than 25 snapshots in your archive, as it assumes you know how to use it by then.
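
The rule amounts to a single threshold check. A minimal sketch, where the constant and function names are hypothetical and only the threshold of 25 comes from the comment above:

```python
HINT_SNAPSHOT_LIMIT = 25  # threshold taken from the comment above


def should_show_hints(snapshot_count, limit=HINT_SNAPSHOT_LIMIT):
    """Show hints only while the archive has at most `limit` snapshots."""
    return snapshot_count <= limit
```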

pirate added the labels status: idea-phase, touches: configuration, and size: easy on Mar 19, 2019

pirate closed this as completed Apr 10, 2021
