
ibnesayeed/docker-archivespark

ArchiveSpark

ArchiveSpark is a framework that facilitates access to Web archives and enables easy data extraction and derivation. This repository makes ArchiveSpark available as a Docker image, so it can run as a Docker container without a Hadoop cluster and without installing and configuring its various dependencies. To get started, run the following command (assuming Docker is installed on the machine):

$ docker run --rm -it -p 8888:8888 ibnesayeed/archivespark

Then open a browser and access the Notebook at http://localhost:8888/. An example notebook (example.ipynb) should be available for experimenting with the included example CDX and WARC files.

To run the container with CDX and WARC files other than the included examples, mount the corresponding directories into the container at run time. Additionally, to keep notebooks available outside the container, mount a notebook directory from the host machine with write permissions. The following command illustrates the usage:

$ docker run --rm -it -p 8888:8888 -v /path/to/cdx:/cdx -v /path/to/warc:/warc -v /path/to/notebooks:/notebooks ibnesayeed/archivespark
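
The mount command above can be wrapped in a small helper. This is only a sketch, not part of this repository: the function name `archivespark_cmd` is hypothetical, and it assumes a POSIX shell. It checks that the three host directories exist, resolves them to absolute paths (the form used in the command above), and prints the resulting `docker run` command instead of executing it, so the command can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the image): print the docker run
# command for the given host directories without executing it.
# Usage: archivespark_cmd CDX_DIR WARC_DIR NOTEBOOK_DIR
archivespark_cmd() {
    cdx_dir=$1
    warc_dir=$2
    nb_dir=$3

    # Refuse to continue if any of the host directories is missing.
    for d in "$cdx_dir" "$warc_dir" "$nb_dir"; do
        if [ ! -d "$d" ]; then
            echo "missing directory: $d" >&2
            return 1
        fi
    done

    # Resolve each directory to an absolute path and fill in the same
    # mount points (/cdx, /warc, /notebooks) as the command above.
    printf 'docker run --rm -it -p 8888:8888 -v %s:/cdx -v %s:/warc -v %s:/notebooks ibnesayeed/archivespark\n' \
        "$(cd "$cdx_dir" && pwd)" \
        "$(cd "$warc_dir" && pwd)" \
        "$(cd "$nb_dir" && pwd)"
}
```

For example, `archivespark_cmd ./cdx ./warc ./notebooks` prints a command with the three directories expanded to absolute paths. The sketch does not quote the paths in its output, so directories whose names contain spaces would need extra handling before the printed command is run.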
