An Apache Spark framework for easy data processing, extraction, and derivation for Web archives and archival collections, developed by the Internet Archive and L3S Research Center.
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
Scripts to transfer archive.org collections, using https://github.com/jjjake/internetarchive
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Don't write specs anymore, just save 'em while testing your code interactively. Specs will become a byproduct.
Analyze digitized books from the Internet Archive remotely with ArchiveSpark