brozzler - distributed browser-based web crawler
Internet Archive's public Wayback Machine (moved from SourceForge)
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
WARC-writing MITM HTTP/S proxy
One webpage for every book ever published!
A Chrome browser extension
Python client library for the Archive.org OpenLibrary API
Sort-friendly URI Reordering Transform (SURT) Python module
Cache stampede test harness. Code accompanies the presentation given at RedisConf 2017, 30 May to 1 June 2017, in San Francisco.
RethinkDB Python library
The Internet Archive Book Reader
Python script to create CDX index files of WARC data
Reduce annoying 404 pages by automatically checking for an archived copy in the Wayback Machine. Learn more about this Test Pilot experiment at https://testpilot.firefox.com/
A queue-controlled browser automation tool for improving web crawl quality
Python library for reading and writing WARC files
Liveweb proxy of the Wayback Machine project
Repo collecting tools that support Internet Archive activities
Code related to making ePub files
Web access control (exclusion oracle) tools for optional use with the Wayback Machine