
Update README (move pywb configuration section to wiki),

recommend running pywb.apps.wayback
make uWSGI optional (but included in Vagrant)
rename ->
1 parent fe1fa43 commit 25a851435218e15a67bfe1cf95c42ca5bbb8f2b8 @ikreymer committed Mar 5, 2014
Showing with 52 additions and 102 deletions.
  1. +48 −94
  2. +2 −1 Vagrantfile
  3. +2 −7 →
@@ -4,22 +4,23 @@ PyWb 0.2 Beta
[![Build Status](](
[![Coverage Status](](
-pywb is a Python re-implementation of the Wayback Machine software.
+pywb is a new Python implementation of the Wayback Machine software and tools.
-The goal is to provide a brand new, clean implementation of the Wayback Machine.
+At its core, it provides a web app which 'replays' archived web data stored in ARC and WARC files and provides metadata about the archived
-The 0.2 architecture includes a seperation of the project into distinct packages, which have
-their own tests and may be used seperately if needed.
-The focus is to focus on providing the best/accurate replay of archival web content (usually in WARC or ARC files),
-and new ways of handling dynamic and difficult content.
+The basic feature set of web replay is nearly complete.
-pywb should also be easy to deploy and modify!
+pywb features new domain specific rules which can be applied to certain difficult and dynamic content in order to make
+web replay work.
+The rule set will be under constant iteration to deal with new challenges as the web evolves.
### Wayback Machine
-A typical Wayback Machine serves archival content in the following form:
+pywb is compatible with the standard Wayback Machine url format:
`http://<host>/<collection>/<timestamp>/<original url>`
@@ -31,52 +32,70 @@ Ex: The [Internet Archive Wayback Machine](https// has urls of
A listing of archived content, often in calendar form, is available when a `*` is used instead of timestamp.
The Wayback Machine uses an html parser to rewrite relative and absolute links, as well as absolute links found in javascript, css and some xml.
-pywb uses this interface as a starting point.
+pywb provides these features as a starting point.
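As a rough illustration of the url format described above, the sketch below assembles a replay url and a capture-listing url. The helper name `wayback_url`, the host, the original url `http://example.com/` and the 14-digit timestamp are all hypothetical values chosen for the example; none of this is part of the pywb API.

```python
def wayback_url(host, collection, timestamp, original_url):
    """Assemble a url of the form http://<host>/<collection>/<timestamp>/<original url>.

    Hypothetical helper for illustration only; it is not a pywb function.
    Pass "*" as the timestamp to request the listing of all captures.
    """
    return "http://{0}/{1}/{2}/{3}".format(host, collection, timestamp, original_url)


# Hypothetical values: a local server, the /pywb collection, an example timestamp.
replay = wayback_url("localhost:8080", "pywb", "20140127171238", "http://example.com/")
listing = wayback_url("localhost:8080", "pywb", "*", "http://example.com/")

print(replay)
print(listing)
```

The original url is appended unescaped, which matches the form shown above; a real client may need to handle query strings and encoding differently.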
### Requirements
-pywb currently works best with 2.7.x
-It should run in a standard WSGI container, although currently
-tested primarily with uWSGI 1.9 and 2.0
+pywb has been tested with Python 2.6, 2.7 and PyPy.
-Support for Python 3 is planned.
+It currently runs best in Python 2.7.
+The pywb tool suite provides several WSGI applications, which have been tested under
+*wsgiref* and *uWSGI*.
-### Installation
+For best results, the *uWSGI* container is recommended.
+Support for Python 3 is planned.
+### Sample Data
-pywb comes with sample archived content, also used
-for unit testing the app.
+pywb comes with a set of sample archived content, also used by the test suite.
The data can be found in `sample_archive` and contains
`warc` and `cdx` files. The sample archive contains
recent captures from `` and ``
+### Installation
-To start a pywb with sample data
+To start a pywb with sample data:
1. Clone this repo
2. Install with `python install`
-3. Run pywb by via script `` (script currently assumes a default python and uwsgi install, feel free to edit as needed)
+3. Run pywb via `python -m pywb.apps.wayback` to start the server in implementation.
+ OR run `` to start with uWSGI (see below for more info).
-4. Test pywb in your browser! (pywb is set to run on port 8080 by default.)
+4. Test pywb in your browser! (pywb is set to run on port 8080 by default).
If everything worked, the following pages should be loading (served from *sample_archive* dir):
| Original Url | Latest Capture | List of All Captures |
| ------------- | ------------- | ----------------------- |
| `` | [http://localhost:8080/pywb/](http://localhost:8080/pywb/ | [http://localhost:8080/pywb/*/](http://localhost:8080/pywb/*/ |
| `` | [http://localhost:8080/pywb/](http://localhost:8080/pywb/ | [http://localhost:8080/pywb/*/](http://localhost:8080/pywb/*/ |
+#### uWSGI startup script
+A sample uWSGI startup script, ``, which assumes a default uWSGI installation, is provided as well.
+Currently, uWSGI is not installed automatically with this distribution, but it is recommended for production environments.
+Please see [uWSGI Installation][1] for more details on installing uWSGI.
### Vagrant
-pywb comes with a Vagrantfile to help you set up a VM quickly for testing.
+pywb comes with a Vagrantfile to help you quickly set up a VM for testing and deploying pywb
+with uWSGI.
If you have [Vagrant]( and [VirtualBox](
installed, then you can start a test instance of pywb like so:
@@ -86,7 +105,7 @@ cd pywb
vagrant up
-After pywb and all its dependencies are installed, the uwsgi server will start up and you should see:
+After pywb and all its dependencies are installed, the uWSGI server will start up and you should see:
spawned uWSGI worker 1 (and the only) (pid: 123, cores: 1)
@@ -107,7 +126,7 @@ The full set of tests can be run by executing:
-which will run the tests using py.test
+which will run the tests using py.test.
### Sample Setup
@@ -129,88 +148,22 @@ archive_paths: ./sample_archive/warcs/
This sets up pywb with a single route for collection /pywb
-(The [full version of config.yaml](config.yaml) contains additional documentation and specifies
+(The latest version of [config.yaml](config.yaml) contains additional documentation and specifies
all the optional properties, such as ui filenames for Jinja2/html template files.)
For more advanced use, the pywb init path can be customized further:
-* The `PYWB_CONFIG` env can be used to set a different yaml file.
-* The `PYWB_CONFIG_MODULE` env variable can be used to set a different init module, for implementing a custom init
-(or for extensions not yet supported via yaml)
-See `` for more details
-### Running with Existing CDX/WARCs
-If you have existing .warc/.arc and .cdx files, you can adjust the `index_paths` and `archive_paths` to point to
-the location of those files.
-#### SURT
-By default, pywb expects the cdx files to be Sort-friendly URL Reordering Transform (SURT) ordering.
-This is an ordering that transforms: `` -> `com,example)/` to faciliate better search.
-It is recommended for future indexing, but is not required.
-Non-SURT ordered cdx indexs will work as well, but be sure to specify:
-`surt_ordered: False` in the [config.yaml](config.yaml)
-### Creating CDX from WARCs
-If you have warc files without cdxs, the following steps can be taken to create the indexs.
-cdx indexs are sorted plain text files indexing the contents of archival records in one or more WARC/ARC files.
-(The cdx_writer tool creates SURT ordered keys by default)
-pywb does not currently generate indexs automatically, but this may be added in the future.
-For production purposes, it is recommended that the cdx indexs be generated ahead of time.
-** Note: these recommendations are subject to change as the external libraries are being cleaned up **
-The directions are for running in a shell:
-1. Clone
-2. Clone to get ****
-3. Copy **** from `CDX_Writer` into **warctools/hanzo** in `warctools`
-4. Ensure sort order set to byte-order `export LC_ALL=C` to ensure proper sorting.
-5. From the directory of the warc(s), run `<FULL PATH>/warctools/hanzo/cdx_writer mypath/warcs/mywarc.gz | sort > mypath/cdx/mywarc.cdx`
- This will create a sorted `mywarc.cdx` for `mywarc.gz`. Then point `pywb` to the `mypath/warcs` and `mypath/cdx` directories in the yaml config.
-6. pywb sort merges all specified cdx files on the fly. However, if dealing with larger number of small cdxs, there will be performance benefit
- from sort-merging them into a larger cdx file before running pywb. This is recommended for production.
- An example sort merge post process can be done as follows:
- ```
- export LC_ALL=C
- sort -m mypath/cdx/*.cdx | sort -c > mypath/merged_cdx/merge_1.cdx
- ```
+* The `PYWB_CONFIG_FILE` env variable can be used to set a different yaml file.
- (The merged cdx will start with several ` CDX` headers due to the merge. These headers indicate the cdx format and should be all the same!
- They are always first and pywb ignores them)
+* A custom init app (with or without yaml) can be created. See [bin/] and [pywb/core/] for examples
+ of bootstrapping.
- In the yaml config, set `index_paths` to point to `mypath/merged_cdx/merged_1.cdx`
+### Configuring PyWb With Archived Data
+Please see the [PyWb Configuration]( for the latest instructions on how to set up pywb to run with your existing WARC/ARC collections.
### Additional Documentation
@@ -225,3 +178,4 @@ You are encouraged to fork and contribute to this project to improve web archivi
Please take a look at list of current [issues]( and feel free to open new ones
@@ -7,14 +7,15 @@ apt-get install -y python-dev
apt-get install -y git
apt-get install -y python-pip
pip install virtualenv
+pip install uwsgi
sudo -u vagrant virtualenv pywb_env
echo Installing pywb and dependencies via pip... This may take a while.
if [ ! -d pywb ]; then
git clone;
cd pywb
sudo -u vagrant ../pywb_env/bin/pip install .
-sudo -u vagrant -H sh -c ". ../pywb_env/bin/activate; ./"
+sudo -u vagrant -H sh -c ". ../pywb_env/bin/activate; ./"
# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
@@ -3,12 +3,7 @@
mypath=$(cd `dirname $0` && pwd)
# Set a different config file
-#export 'PYWB_CONFIG=myconfig.yaml'
-# Set alternate init module
-# The modules pywb_config()
-# ex: my_pywb.pywb_config()
-#export 'PYWB_CONFIG=my_pywb'
+#export 'PYWB_CONFIG_FILE=myconfig.yaml'
@@ -19,7 +14,7 @@ if [ -z "$1" ]; then
# Standard root config
params="$params --wsgi $app"
- # run with --mount
+ # run with --mount to specify a non-root context
# requires a file not a package, so creating a to load the package
echo "#!/bin/python\n" > $mypath/
echo "import $app\napplication = $app.application" >> $mypath/
