Common Crawl Index Server

This project is a deployment of the pywb web archive replay and index server. It provides an index query mechanism for the datasets published by CommonCrawl.

Usage & Installation

To run locally, first install the dependencies with pip install -r requirements.txt

CommonCrawl stores data on Amazon S3 and the index is publicly accessible from S3.

Currently, individual indexes for each crawl can be accessed under: s3://aws-publicdatasets/common-crawl/cc-index/collections/[CC-MAIN-YYYY-WW]
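As a small illustration of the naming scheme above, the following sketch builds the S3 path for a given crawl ID. The helper name collection_s3_path and the validation pattern are assumptions made for this example and are not part of this project.

```python
import re

# Base prefix copied from the path given above.
S3_PREFIX = "s3://aws-publicdatasets/common-crawl/cc-index/collections/"

# Assumed pattern for IDs of the form CC-MAIN-YYYY-WW (hypothetical check).
_CRAWL_ID = re.compile(r"^CC-MAIN-\d{4}-\d{2}$")


def collection_s3_path(crawl_id: str) -> str:
    """Return the S3 path of the index for one crawl, e.g. CC-MAIN-2015-06."""
    if not _CRAWL_ID.match(crawl_id):
        raise ValueError(f"unexpected crawl ID: {crawl_id!r}")
    return S3_PREFIX + crawl_id


if __name__ == "__main__":
    print(collection_s3_path("CC-MAIN-2015-06"))
```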

Most of the index is served directly from S3. However, a smaller secondary index must be installed locally for each collection.

This can be done automatically by running install-collections.sh, which installs all available collections locally.

The script uses the s3cmd tool to sync the index.

If it succeeds, there should be a collections directory containing at least one index.
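One way to confirm the result is to list what ended up under the collections directory. The sketch below assumes that each installed collection appears as a subdirectory of collections; the function name installed_collections is hypothetical and the actual layout produced by the script may differ.

```python
from pathlib import Path


def installed_collections(root: str = "collections") -> list[str]:
    """Return the names of subdirectories under root, sorted alphabetically.

    Assumes one subdirectory per installed collection (an assumption,
    not something the install script is documented to guarantee).
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing has been installed yet, or the script was run elsewhere.
        return []
    return sorted(p.name for p in root_path.iterdir() if p.is_dir())


if __name__ == "__main__":
    names = installed_collections()
    if names:
        print("Installed collections:", ", ".join(names))
    else:
        print("No collections found; try running install-collections.sh")
```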

To start the index server, run cdx-server. Alternatively, run wayback to start the pywb replay system along with the CDX server.

CDX Server API

The API endpoints correspond to the index collections present in the collections directory.

For example, one currently available index is CC-MAIN-2015-06, and it can be queried via:

http://localhost:8080/CC-MAIN-2015-06-index?url=commoncrawl.org
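The same query URL can be assembled programmatically. The following is a minimal sketch that only reproduces the URL pattern shown above; the function name cdx_query_url and the default host and port are assumptions taken from the example, and only the url parameter from the example is used.

```python
from urllib.parse import urlencode


def cdx_query_url(collection: str, url: str,
                  host: str = "localhost", port: int = 8080) -> str:
    """Build a query URL of the form http://host:port/<collection>-index?url=..."""
    # urlencode escapes characters that are not safe in a query string.
    query = urlencode({"url": url})
    return f"http://{host}:{port}/{collection}-index?{query}"


if __name__ == "__main__":
    print(cdx_query_url("CC-MAIN-2015-06", "commoncrawl.org"))
```

Once the server is running locally, the resulting URL can be opened in a browser or fetched with any HTTP client.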

Refer to CDX Server API for more detailed instructions on the API itself.

The pywb README provides additional information about pywb.

Building the Index

Please see the webarchive-indexing repository for more info on how the index is built.
