Can import links from:
- Browser history or bookmarks (Chrome, Firefox, Safari, IE, Opera)
- RSS or plain text lists
- Pocket, Pinboard, Instapaper
- Shaarli, Delicious, Reddit Saved Posts, Wallabag, Unmark.it, and any text with links in it!
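The "any text with links in it" case works because URLs can be pulled out of arbitrary text with a regular expression. A minimal sketch of that idea in Python (the pattern and function name here are illustrative, not ArchiveBox's actual parser):

```python
import re

# Naive URL pattern: scheme, then everything up to whitespace or a closing
# quote/bracket. Real parsers are more careful about trailing punctuation;
# this is just a sketch.
URL_RE = re.compile(r'https?://[^\s<>"\')]+')

def extract_links(text: str) -> list:
    """Return all http(s) URLs found anywhere in a blob of text."""
    return URL_RE.findall(text)

links = extract_links('Notes: see https://example.com/page and (https://example.org/a.html).')
print(links)  # → ['https://example.com/page', 'https://example.org/a.html']
```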
Can save these things for each site:
- `favicon.ico`: favicon of the site
- `example.com/page-name.html`: wget clone of the site, with `.html` appended if not present
- `output.pdf`: printed PDF of the site using headless Chrome
- `screenshot.png`: 1440x900 screenshot of the site using headless Chrome
- `output.html`: DOM dump of the HTML after rendering using headless Chrome
- `archive.org.txt`: a link to the saved site on archive.org
- `warc/`: gzipped WARC file (`.gz`) of the HTML
- `media/`: any mp4, mp3, subtitles, and metadata found using youtube-dl
- `git/`: clone of any repository from GitHub, Bitbucket, or GitLab links
- `index.json`: HTML and JSON index files containing metadata and details
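Putting those pieces together, the folder for a single archived site might look something like this (a sketch based on the file list above; the `archive/` subfolder and the timestamp folder name are illustrative assumptions, not guaranteed layout):

```
output/
├── index.json
└── archive/
    └── 1498800000/          # one folder per link, named by bookmark timestamp
        ├── favicon.ico
        ├── example.com/page-name.html
        ├── output.pdf
        ├── screenshot.png
        ├── output.html
        ├── archive.org.txt
        ├── warc/
        ├── media/
        └── git/
```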
By default it does everything; visit the Configuration page for details on how to disable or fine-tune certain methods.
The archiving is additive, so you can schedule `./archive` to run regularly and pull new links into the index.
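For example, a scheduled run could be set up with cron (the schedule and paths below are illustrative; adjust them to wherever your install and export file live):

```shell
# crontab entry: run the archiver nightly at 02:00, pulling any new links
# from export.html into the existing index (additive, so safe to repeat)
0 2 * * * cd /home/user/bookmark-archiver && ./archive export.html
```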
All the saved content is static and indexed with JSON files, so it lives forever and is easily parseable; it requires no always-running backend.
`ArchiveBox/archive` is the script that takes a Pocket-format, JSON-format, Netscape-format, RSS, or plain-text-formatted list of links, and downloads a clone of each linked website to turn into a browsable archive that you can store locally or host online.
The archiver produces an output folder `output/` containing an `index.json` and archived copies of all the sites, organized by the timestamp each was bookmarked. It's powered by headless Chromium and good ol' Wget.
Wget doesn't work on sites you need to be logged into, but headless Chrome does; see the Configuration section for details.
Large Exports & Estimated Runtime:
I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
You can run it in parallel by using the `resume` feature, or by manually splitting `export.html` into multiple files:
```shell
./archive export.html 1498800000 &  # second argument is the timestamp to resume downloading from
./archive export.html 1498810000 &
./archive export.html 1498820000 &
./archive export.html 1498830000 &
```
Users have reported running it with 50k+ bookmarks with success (though it will take more RAM while running).
If you already imported a huge list of bookmarks and want to import only new
bookmarks, you can use the
ONLY_NEW environment variable. This is useful if
you want to import a bookmark dump periodically and want to skip broken links
which are already in the index.
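A periodic import with `ONLY_NEW` might look like this (the `1` value is an illustrative way of enabling the flag; check the Configuration page for the accepted values):

```shell
# Skip links already present in the index; only archive newly added bookmarks
env ONLY_NEW=1 ./archive export.html
```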