ArchiveSpark Documentation

ArchiveSpark is a JVM library, written in Scala and based on Apache Spark, that can be used as an API in any Java, Scala, or other JVM program for easy and efficient access to Web archives and other supported datasets. It can also be used stand-alone from Scala's interactive shell or from notebook tools such as Jupyter.

To get familiar with ArchiveSpark, and for most common use cases, we recommend using it with Jupyter. To help you get started, we provide a pre-packaged, pre-configured Docker container with ArchiveSpark and Jupyter ready to run, just one command away:

To learn more about ArchiveSpark, have a look at our GitHub repository.

Getting Started

API Docs

Developer Documentation


ArchiveSpark is described in two published research papers, which you should cite when you use ArchiveSpark in your work:

In addition to these publications, ArchiveSpark was used as a major component in the following works:
