Nick Sweeting edited this page Jan 29, 2019 · 16 revisions

ArchiveBox Documentation

(Recently renamed from Bookmark Archiver)

▶️ If you need help or have a question, you can open an issue or reach out on Twitter.

Use the sidebar on the right to browse documentation topics ->


Can import links from:

  • Browser history or bookmarks (Chrome, Firefox, Safari, IE, Opera)
  • RSS or plain text lists
  • Pocket, Pinboard, Instapaper
  • Shaarli, Delicious, Reddit Saved Posts, Wallabag, Unmark.it, and any text with links in it!
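As a rough illustration of what "any text with links in it" can mean, here is a minimal sketch that pulls URLs out of arbitrary text. It is not ArchiveBox's own parser; `URL_PATTERN` and `extract_links` are hypothetical names, and the regex is a deliberately simple assumption that will miss some edge cases.

```python
import re

# Hypothetical helper, not ArchiveBox's parser: match http(s) URLs up to the
# next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_links(text):
    """Return the unique URLs found in `text`, in order of first appearance."""
    seen = set()
    links = []
    for match in URL_PATTERN.finditer(text):
        # Drop trailing sentence punctuation that is unlikely to be part of the URL.
        url = match.group(0).rstrip(".,;:!?)")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

The resulting list could be written one URL per line and fed to the archiver as a plain text list.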

Can save these things for each site:

  • favicon.ico: favicon of the site
  • example.com/page-name.html: wget clone of the site, with .html appended if not present
  • output.pdf: printed PDF of the site using headless Chrome
  • screenshot.png: 1440x900 screenshot of the site using headless Chrome
  • output.html: DOM dump of the HTML after rendering using headless Chrome
  • archive.org.txt: a link to the saved site on archive.org
  • warc/: the HTML plus a gzipped WARC file (.gz)
  • media/: any mp4, mp3, subtitles, and metadata found using youtube-dl
  • git/: clone of any repository for GitHub, Bitbucket, or GitLab links
  • index.html & index.json: HTML and JSON index files containing metadata and details

By default it does everything; visit the Configuration page for details on how to disable or fine-tune certain methods.
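To see which of the outputs listed above exist for one archived site, a small check like the following may help. It is only a sketch: `missing_outputs` is a hypothetical helper, the folder you pass in is assumed to be a single site's archive folder, and a missing entry may simply mean that the method was disabled or did not apply to that link (for example, git/ for a non-repository URL).

```python
from pathlib import Path

# Output names taken from the list above. The wget clone is left out because
# its path depends on the archived URL.
EXPECTED_OUTPUTS = [
    "favicon.ico", "output.pdf", "screenshot.png", "output.html",
    "archive.org.txt", "warc", "media", "git", "index.html", "index.json",
]


def missing_outputs(site_dir):
    """Return the expected output names that are absent from `site_dir`."""
    folder = Path(site_dir)
    return [name for name in EXPECTED_OUTPUTS if not (folder / name).exists()]
```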

The archiving is additive, so you can schedule ./archive to run regularly and pull new links into the index. All the saved content is static and indexed with JSON files, so it lives forever and is easily parseable; it requires no always-running backend.
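Because the index is plain JSON, it can be read with a few lines of code. The sketch below assumes a layout this page does not document: a top-level object with a "links" list whose entries carry "timestamp" and "url" keys. Check your own index.json before relying on it; `load_links` is a hypothetical helper.

```python
import json


def load_links(index_path):
    """Read an index file and return (timestamp, url) pairs, oldest first.

    ASSUMPTION: the file holds an object with a "links" list whose entries
    have "timestamp" and "url" keys. Adjust the keys to match your index.
    """
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    pairs = [(link["timestamp"], link["url"]) for link in index.get("links", [])]
    # Timestamps are compared numerically so that string values sort correctly.
    return sorted(pairs, key=lambda pair: float(pair[0]))
```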

DEMO: archive.sweeting.me

Screenshots: desktop and mobile

Details

ArchiveBox/archive is the script that takes a Pocket-format, JSON-format, Netscape-format, RSS, or plain-text-formatted list of links and downloads a clone of each linked website, producing a browsable archive that you can store locally or host online.

The archiver produces an output folder, output/, containing an index.html, an index.json, and archived copies of all the sites, organized by the timestamp at which each was bookmarked. It's powered by headless Chromium and good ol' wget.

Wget doesn't work on sites you need to be logged into, but headless Chrome does; see the Configuration section for CHROME_USER_DATA_DIR.

Large Exports & Estimated Runtime:

I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1 GB.
Those numbers are from running it single-threaded on my i5 machine with a 50 Mbps downlink. YMMV.
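For a back-of-the-envelope estimate, the figures above can be scaled linearly. This is only a sketch: linear scaling is an assumption, and `estimate` is a hypothetical helper, so treat the result as a rough guide for a single-threaded run.

```python
ARTICLES_PER_HOUR = 1000  # from the figures above (single-threaded)
ARTICLES_PER_GB = 1000    # roughly 1 GB per 1000 articles


def estimate(num_links):
    """Return (hours, gigabytes) for `num_links`, assuming linear scaling."""
    return num_links / ARTICLES_PER_HOUR, num_links / ARTICLES_PER_GB
```

Under these assumptions, an export of 50,000 links works out to about 50 hours and 50 GB.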

You can run it in parallel by using the resume feature, or by manually splitting export.html into multiple files:

./archive export.html 1498800000 &  # second argument is timestamp to resume downloading from
./archive export.html 1498810000 &
./archive export.html 1498820000 &
./archive export.html 1498830000 &
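If you would rather generate those lines than type them, the following sketch reproduces the same pattern. `resume_commands` is a hypothetical helper; the starting timestamp and the step between workers are values you choose based on your own export.

```python
def resume_commands(export_file, first_timestamp, step, workers):
    """Build one `./archive <file> <timestamp> &` line per worker."""
    return [
        f"./archive {export_file} {first_timestamp + i * step} &"
        for i in range(workers)
    ]


# Prints the four commands shown above.
for line in resume_commands("export.html", 1498800000, 10000, 4):
    print(line)
```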

Users have reported successfully running it with 50k+ bookmarks (though it will take more RAM while running).

If you already imported a huge list of bookmarks and want to import only new bookmarks, you can use the ONLY_NEW environment variable. This is useful if you want to import a bookmark dump periodically and want to skip broken links which are already in the index.
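The idea behind importing only new bookmarks can be sketched as a simple filter. This illustrates the concept only and is not ArchiveBox's implementation; `only_new_links` is a hypothetical helper that compares plain URL strings.

```python
def only_new_links(indexed_urls, imported_urls):
    """Return imported URLs that are not already in the index, keeping order."""
    known = set(indexed_urls)
    fresh = []
    for url in imported_urls:
        if url not in known:
            known.add(url)  # also skip duplicates within the import itself
            fresh.append(url)
    return fresh
```

Links that are already indexed, including broken ones, are left out of the result and so would not be retried.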

