Can import links from:
- Browser history or bookmarks (Chrome, Firefox, Safari, IE, Opera)
- RSS or plain text lists
- Pocket, Pinboard, Instapaper
- Shaarli, Delicious, Reddit Saved Posts, Wallabag, Unmark.it, and any text with links in it!
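The "any text with links in it" case works because URLs can be pulled out of arbitrary text with a regular expression. A minimal sketch of that idea in Python (the pattern and function name here are illustrative, not ArchiveBox's actual parser):

```python
import re

# Naive URL pattern: scheme, then everything up to whitespace or a closing
# quote/bracket. Real parsers are more careful about trailing punctuation;
# this is just a sketch.
URL_RE = re.compile(r'https?://[^\s<>"\')]+')

def extract_links(text: str) -> list:
    """Return all http(s) URLs found anywhere in a blob of text."""
    return URL_RE.findall(text)

links = extract_links('Notes: see https://example.com/page and (https://example.org/a.html).')
print(links)  # → ['https://example.com/page', 'https://example.org/a.html']
```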
Can save these things for each site:
- `favicon.ico`: favicon of the site
- `example.com/page-name.html`: wget clone of the site, with `.html` appended if not present
- `output.pdf`: printed PDF of the site using headless Chrome
- `screenshot.png`: 1440x900 screenshot of the site using headless Chrome
- `output.html`: DOM dump of the HTML after rendering using headless Chrome
- `archive.org.txt`: a link to the saved site on archive.org
- `warc/`: gzipped WARC file (`.gz`) of the HTML
- `media/`: any mp4, mp3, subtitles, and metadata found using youtube-dl
- `git/`: clone of any repository from GitHub, Bitbucket, or GitLab links
- `index.json`: HTML and JSON index files containing metadata and details
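Putting those pieces together, the folder for a single archived site might look something like this (a sketch based on the file list above; the `archive/` subfolder and the timestamp folder name are illustrative assumptions, not guaranteed layout):

```
output/
├── index.json
└── archive/
    └── 1498800000/          # one folder per link, named by bookmark timestamp
        ├── favicon.ico
        ├── example.com/page-name.html
        ├── output.pdf
        ├── screenshot.png
        ├── output.html
        ├── archive.org.txt
        ├── warc/
        ├── media/
        └── git/
```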
By default it does everything; visit the Configuration page for details on how to disable or fine-tune certain methods.
The archiving is additive, so you can schedule `./archive` to run regularly and pull new links into the index.
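For example, a scheduled run could be set up with cron (the schedule and paths below are illustrative; adjust them to wherever your install and export file live):

```shell
# crontab entry: run the archiver nightly at 02:00, pulling any new links
# from export.html into the existing index (additive, so safe to repeat)
0 2 * * * cd /home/user/bookmark-archiver && ./archive export.html
```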
All the saved content is static and indexed with JSON files, so it lives forever and is easily parseable; it requires no always-running backend.
`ArchiveBox/archive` is the script that takes a Pocket-format, JSON-format, Netscape-format, RSS, or plain-text-formatted list of links, and downloads a clone of each linked website to turn into a browsable archive that you can store locally or host online.
The archiver produces an output folder `output/` containing an `index.json` and archived copies of all the sites, organized by the timestamp each was bookmarked. It's powered by headless Chromium and good ol' Wget.
Wget doesn't work on sites you need to be logged into, but headless Chrome does; see the Configuration section for details.
Large Exports & Estimated Runtime:
I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
You can run it in parallel by using the `resume` feature, or by manually splitting `export.html` into multiple files:
```shell
./archive export.html 1498800000 &  # second argument is the timestamp to resume downloading from
./archive export.html 1498810000 &
./archive export.html 1498820000 &
./archive export.html 1498830000 &
```
Users have reported running it with 50k+ bookmarks with success (though it will take more RAM while running).
If you already imported a huge list of bookmarks and want to import only new
bookmarks, you can use the
ONLY_NEW environment variable. This is useful if
you want to import a bookmark dump periodically and want to skip broken links
which are already in the index.
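A periodic import with `ONLY_NEW` might look like this (the `1` value is an illustrative way of enabling the flag; check the Configuration page for the accepted values):

```shell
# Skip links already present in the index; only archive newly added bookmarks
env ONLY_NEW=1 ./archive export.html
```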