ArchiveSpark

ArchiveSpark is a framework that facilitates access to Web Archives and enables easy data extraction and derivation. This repository makes ArchiveSpark available as a Docker image, so it can run as a Docker container without a Hadoop cluster and without installing and configuring its various dependencies. To get started, run the following command (assuming Docker is installed on the machine):

$ docker run --rm -it -p 8888:8888 ibnesayeed/archivespark

Then open a browser and access the Notebook at http://localhost:8888/. There should be an example notebook (example.ipynb) available to play with the included example CDX and WARC files.
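
If port 8888 is already taken on the host, Docker's `-p HOST:CONTAINER` mapping allows a different host port while the container side stays at 8888. The following is a minimal sketch under that assumption; the helper name `archivespark_port_cmd` and the port 9999 are illustrative and not part of this repository. The function only prints the command so it can be inspected before it is run.

```shell
# Hypothetical helper (not part of this repository): print a docker run
# command that publishes the Notebook on a chosen host port.
# The container side stays at 8888, as in the command above.
archivespark_port_cmd() {
    host_port="${1:-8888}"   # default to the port used in this README
    printf 'docker run --rm -it -p %s:8888 ibnesayeed/archivespark\n' "$host_port"
}

# Example: print the command for host port 9999 (an arbitrary choice).
archivespark_port_cmd 9999
```

With that mapping, the Notebook would be reachable at http://localhost:9999/ instead of port 8888.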

To run the container with CDX and WARC files other than the included examples, mount the corresponding directories in the container at run time. Additionally, to make the notebooks available outside of the container, mount a notebook directory of the host machine with write permissions. The following command illustrates the usage:

$ docker run --rm -it -p 8888:8888 -v /path/to/cdx:/cdx -v /path/to/warc:/warc -v /path/to/notebooks:/notebooks ibnesayeed/archivespark
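
Docker bind mounts given with `-v` expect absolute host paths, so it can help to derive all three mounts from one base directory. The sketch below assumes a layout of `BASE/cdx`, `BASE/warc`, and `BASE/notebooks`; that layout, the helper name `archivespark_mount_cmd`, and the path `/data/archive` are illustrative assumptions, not something this repository prescribes. The function prints the command rather than running it, and it does not quote paths, so it assumes the base directory contains no spaces.

```shell
# Hypothetical helper (not part of this repository): print a docker run
# command that mounts three subdirectories of one base directory.
# Assumed layout: BASE/cdx, BASE/warc, BASE/notebooks.
archivespark_mount_cmd() {
    base="${1:-$PWD}"   # bind mounts need an absolute host path; $PWD is absolute
    printf 'docker run --rm -it -p 8888:8888 -v %s/cdx:/cdx -v %s/warc:/warc -v %s/notebooks:/notebooks ibnesayeed/archivespark\n' \
        "$base" "$base" "$base"
}

# Example: print the command for data kept under /data/archive (illustrative path).
archivespark_mount_cmd /data/archive
```

Called without an argument, the helper falls back to the current working directory as the base.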

