
PROPOSAL: Web Archive Replay Test Suite

Andy Jackson edited this page Feb 12, 2014 · 12 revisions

Summary

An automated test suite for code regression and integration testing, index generation, and APIs in support of web archive replay functionality.

Introduction

One of the critical parts of any production-quality software project is a strong regression test suite. This allows automated testing to ensure that the critical functionality of the system has not been compromised by changes to the software, and brings a higher level of confidence to the release process than unit testing alone. This kind of deployment/system-integration/service-level testing is one area where the current OpenWayback source code is recognised as lacking.

Therefore, the idea is to create example WARC and ARC files with known properties, covering the diversity of deployment patterns of the OpenWayback community. Then, whenever a developer suggests a change to OpenWayback, we can automatically fire up an OpenWayback instance pointing at the test data, and run a series of tests that probe whether the expected behaviour has been met. If any behaviour has been broken by the change, the change will be rejected (or, if it reflects new functionality, the test suite will be updated). Thus, by creating a corpus of web archiving test data and using it to build a service-level test suite for OpenWayback, we can proceed to develop the code with far more confidence.

Test corpora and test suites are valuable things for web archives to invest in. As well as bringing stability to current tools, they also provide a benchmark against which other tools can be judged and so, with luck, will outlive the tools they are built for. Also, as they are built around the functionality we wish the tools to perform, rather than around specific details of the tools themselves, creation of the tests does not require extremely deep knowledge of specific code-bases (e.g. Heritrix3, OpenWayback).

Test suites have the further advantage of being something that can be grown over time – i.e. they do not necessarily need massive up-front investment, but rather can be built over time, adding in new features and coverage as resources allow.

Benefits

  • To provide consistency of testing and secure the OpenWayback release process.
  • The IIPC/OpenWayback community would not need to invest significant amounts of time testing OpenWayback for each release.
  • Test suite can be built up over time – small initial investment which can be built on if successful.
  • As a test suite for the total system functionality, this test suite might outlive any specific replay implementation.

As a quantitative measure of the benefits, it is envisaged that this tool would save each institution deploying OpenWayback in production at least one round of testing (deployment to the test environment) per OpenWayback release.

Approach

  • Generate a fixed body of ARC and WARC files from example web resources.
    • I've started collecting some example content here.
    • To enable very permissive licensing, e.g. CC0, this would mean creating a test site rather than crawling an existing site.
  • Collect metadata describing the expected behaviour.
  • Use a build management tool to automatically start up OpenWayback instances under various configurations, and for each one verify the API-level behaviour by comparing it against the expected behaviour described in the metadata.
    • An example of firing up OpenWayback from Maven can be found here, and of firing up a server during Maven integration testing here. These two could be combined to spin up an OpenWayback instance and then run tests against it.
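The expected-behaviour metadata mentioned above could be stored alongside each test WARC and compared against what the replay instance actually returns. A minimal sketch, assuming a hypothetical JSON schema (the field names here are illustrative, not an agreed format):

```python
import json

# Hypothetical expected-behaviour metadata for one archived resource.
# The schema is an assumption for illustration, not part of the proposal.
expected = json.loads("""
{
  "original_url": "http://example.org/page.html",
  "timestamp": "20140212000000",
  "expectations": {
    "status": 200,
    "content_type": "text/html",
    "body_contains": "Example Domain"
  }
}
""")

def check_replay(metadata, status, content_type, body):
    """Compare an observed replay response against the expected behaviour."""
    exp = metadata["expectations"]
    failures = []
    if status != exp["status"]:
        failures.append("status: expected %s, got %s" % (exp["status"], status))
    if not content_type.startswith(exp["content_type"]):
        failures.append("content-type: expected %s, got %s"
                        % (exp["content_type"], content_type))
    if exp["body_contains"] not in body:
        failures.append("body does not contain %r" % exp["body_contains"])
    return failures

# A passing response yields no failures:
print(check_replay(expected, 200, "text/html; charset=utf-8",
                   "<h1>Example Domain</h1>"))  # -> []
```

In a real harness, the `status`, `content_type`, and `body` arguments would come from an HTTP request made against the freshly started OpenWayback instance during the Maven integration-test phase.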

Cases

  • Inputs
    • Can load ARC/WARC/ARC.GZ/WARC.GZ created by H1, H3, wget 1.14 (c.f. that issue), warcprox, LAP, etc.
    • Can cope with de-duplicated WARCs, of various kinds:
      • URL-sensitive or URL-agnostic with or without headers in the de-dup record, so four combinations.
      • Plus WARC-Refers-To: pointing for filename variations?
      • TODO Add list of different WARC structures for doing this that are 'in the wild', not just the spec. ones.
    • IDN Support? (c.f. #27)
    • Robust XML inserts? (c.f. #60)
    • Can cope with WARC records in various orders. Some recent tests by @machawk indicated that OpenWayback might only be happy if the response record comes before the request or the metadata.
    • Note that NZNL recently built a test case that included a faux viral payload, in order to test on-access virus checking. That could also be included here if that's of broader interest.
    • Could also consider adding deliberately 'bad' WARCs that contain errors that a decent parser should NOT try to recover from (if there are any - Postel's Law usually wins here).
  • Outputs
    • CDX files
    • API behaviours:
      • Proxy replay mode
      • Archival URL (re-write) replay mode resolution
      • DomainPrefix replay mode (TODO Does anyone use that?!)
      • XMLQuery API
      • CDX-server API
      • Memento API (perhaps integrating http://www.mementoweb.org/tools/validator/)

That is, for each combination of inputs, we would like to automatically verify each of the various outputs.
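To illustrate how one input/output combination might be verified, here is a sketch that parses a CDX line (using the common 11-field layout: urlkey, timestamp, original, mimetype, statuscode, digest, redirect, robotflags, length, offset, filename) and derives the corresponding archival replay URL. The wayback prefix and the sample CDX line are placeholders, not real deployment values:

```python
WAYBACK_PREFIX = "http://localhost:8080/wayback"  # assumed deployment URL

def parse_cdx_line(line):
    """Split one CDX line into named fields (11-field layout assumed)."""
    fields = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
              "digest", "redirect", "robotflags", "length", "offset", "filename"]
    return dict(zip(fields, line.split()))

def archival_url(capture):
    # Archival URL (re-write) replay mode addresses a capture as
    # <prefix>/<14-digit timestamp>/<original URL>.
    return "%s/%s/%s" % (WAYBACK_PREFIX, capture["timestamp"], capture["original"])

cdx = parse_cdx_line(
    "org,example)/page 20140212000000 http://example.org/page "
    "text/html 200 AAAA - - 1043 0 test.warc.gz"
)
assert cdx["statuscode"] == "200"
print(archival_url(cdx))
# -> http://localhost:8080/wayback/20140212000000/http://example.org/page
```

A test runner would iterate this over every test WARC and every replay mode, requesting each derived URL from the running instance and checking the response against the expected-behaviour metadata.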

Ideas

Once the basic functionality has been covered, we might consider extending this approach to cover other areas.

  • Large Scale Collection Replay
    • Come up with clever ways to test the performance of some of the difficult cases like single pages with very large numbers of archived instances.
  • Web Page Replay Quality Control
    • The test web resources can each be designed to exercise particular web standard features, and marked up appropriately. While creating the WARC and ARC files, we can also collect screenshots of the original resources being served using an automated browser. The expected output could be stored alongside the other test data, and we could try using image comparison techniques (either simple ones or more sophisticated ones) to spot significant regressions.
  • Full Web Archive Life-cycle Integration Testing
    • Move to spinning up hosted files with known properties, and actually run H3 (or other crawlers) on them first to make the (W)ARCs, and then push them through OpenWayback too.
  • Obsolete/Difficult Features & Formats
    • We could deliberately collect a more diverse range of formats, covering older/obsolete/rare features or behaviour, e.g. the blink tag.
    • We could use this to monitor how replay should change to cope with these cases.
    • If extended to crawling, this could also cover difficult dependency extraction cases, for example.
    • This is related to the OPF format corpus.
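For the replay quality-control idea above, even a very simple image-comparison technique may be enough to flag gross regressions. A minimal sketch, modelling screenshots as 2D lists of (R, G, B) tuples (a real harness would load the captured screenshots with an imaging library, and the 10% threshold is an arbitrary assumption):

```python
def pixel_diff_ratio(img_a, img_b):
    """Fraction of pixels that differ between two same-sized images."""
    total = 0
    differing = 0
    for row_a, row_b in zip(img_a, img_b):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            if px_a != px_b:
                differing += 1
    return differing / total

original = [[(255, 255, 255)] * 4 for _ in range(4)]   # 4x4 white "screenshot"
replayed = [row[:] for row in original]
replayed[0][0] = (0, 0, 0)                             # one regressed pixel

ratio = pixel_diff_ratio(original, replayed)
print(ratio)  # 1 differing pixel out of 16 -> 0.0625
assert ratio < 0.1, "replay regression: too many pixels differ"
```

More sophisticated approaches (perceptual hashing, structural similarity) could replace the raw pixel count once the basic pipeline works.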