CDX Server requirements

Mohamed Elsayed edited this page Jan 31, 2016 · 10 revisions

As part of making the CDX-Server the default index engine for OpenWayback, we need to clean up and formally define the API for the CDX-Server. This document is meant as a workspace for defining that API.

The CDX-Server API, as it is today, is characterized by a relatively close link to how the underlying CDX format is implemented. Functionality varies depending on whether you are using traditional flat CDX files or compressed zipnum clusters. One of the benefits of having a CDX Server is that it separates the API from the underlying implementation. This would make it relatively easy to implement indexes based on other technologies in the future. As a consequence, we should avoid implementing features just because they are easy to do with a certain format if there is no real need for them. The same feature might be hard to implement on other technologies.

The API should also try to avoid giving the user conflicting options. For example, in the current API it is possible to indicate the match type both with a parameter and with a wildcard. It is therefore possible to set matchType=prefix and at the same time use a wildcard indicating matchType=domain.
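To illustrate the kind of conflict described above, here is a minimal sketch of a check that compares an explicit matchType with the match type implied by a wildcard. The wildcard conventions used here (a leading `*.` implying domain, a trailing `*` implying prefix) and the function names are assumptions made for illustration, not a definition of the API.

```python
from typing import Optional


def wildcard_match_type(url: str) -> Optional[str]:
    """Return the match type implied by a wildcard in the url, or None.

    Assumed conventions (hypothetical): '*.example.com' implies domain,
    'example.com/path/*' implies prefix.
    """
    if url.startswith("*."):
        return "domain"
    if url.endswith("*"):
        return "prefix"
    return None


def find_match_type_conflict(url: str, match_type: Optional[str]) -> Optional[str]:
    """Return a description of the conflict, or None if the options agree."""
    implied = wildcard_match_type(url)
    if implied is not None and match_type is not None and implied != match_type:
        return (
            f"matchType={match_type} conflicts with a wildcard "
            f"implying matchType={implied}"
        )
    return None


print(find_match_type_conflict("*.example.com", "prefix"))
```

A cleaned-up API could avoid the need for such a check entirely by offering only one way to express the match type.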

The following is a list of use-cases seen from the perspective of a user. Many of the use-cases are described as expectations of the OpenWayback GUI, but they are meant to help in understanding the CDX-Server's role. For each use-case we need to understand what functionality the CDX-Server is required to support. CDX-Server functionality with no supporting use-case should not be implemented in OpenWayback 3.0.0.

This is a work in progress. Edits and comments are highly appreciated.

Use-cases

1. The user has a link to a particular version of a document

This case could be a user referencing a document from a thesis. It is important that the capture referenced is exactly the one the user used when writing the thesis. In this case the user should get the capture that exactly matches both the url and timestamp.

The digest also needs to be considered to actually guarantee that the user gets the same version. In addition, you need to know that all embeds are also the same versions the user originally requested. Achieving all this might be hard or impossible.

2. The user selects one particular capture in the calendar

Similar to the above, but it may be acceptable to return a capture close in time if the requested capture is missing, i.e. the requirement for getting the same version is slightly loosened.

3. Get the best matching page when following a link

The user is looking at a page and wants to follow a link by clicking it. The user then expects to be brought to the capture of the new page that is closest in time.

4. Get the best match for embedded resources

Similar to the above, but the user is not involved. This is for loading embedded images and so on.
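Use-cases 3 and 4 both depend on selecting the capture closest in time to a reference timestamp. A minimal sketch of that selection follows, assuming 14-digit timestamps of the form YYYYMMDDhhmmss and a plain list of candidate timestamps for one url. The function name and the data layout are hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout: YYYYMMDDhhmmss
TS_FORMAT = "%Y%m%d%H%M%S"


def closest_capture(timestamps, target):
    """Return the timestamp nearest to target, or None if there are no captures.

    On a tie, the capture that appears first in the input list is returned.
    """
    if not timestamps:
        return None
    target_dt = datetime.strptime(target, TS_FORMAT)
    return min(
        timestamps,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - target_dt),
    )


captures = ["20150101000000", "20160131120000", "20170601000000"]
print(closest_capture(captures, "20160201000000"))
```

Scanning every capture is only suitable for illustration. A real index would more likely exploit sorted order to find the neighbours of the target directly.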

5. User requests/searches for an exact url without any timestamp, expecting to get a summary of captures for the url over time

The summary of captures might be presented in different ways, for example a list or a calendar.

6. User looks up a domain expecting a summary of captures over time

7. User searches with a truncated path expecting the results to show up as matching paths regardless of time

8. User searches with a truncated path expecting the results to show up as matching paths regardless of time and subdomain

9. User navigates back and forth in the calendar

10. User wants to see when content of a page has changed
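One possible way to support this use-case is to compare the digests of consecutive captures. The sketch below assumes each capture is available as a (timestamp, digest) pair and that an unchanged digest means unchanged content. The names and the data layout are hypothetical.

```python
def change_points(captures):
    """Return timestamps at which the digest differs from the previous capture.

    The earliest capture is always reported, since it is the first
    appearance of the content.
    """
    changes = []
    previous_digest = None
    for timestamp, digest in sorted(captures):
        if digest != previous_digest:
            changes.append(timestamp)
            previous_digest = digest
    return changes


history = [
    ("20150101000000", "AAA"),
    ("20160131120000", "AAA"),
    ("20170601000000", "BBB"),
]
print(change_points(history))
```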

11. User requests/searches for an exact url with a partial timestamp, expecting to get a summary of captures for the url over time

12. Get a random page within a partial timestamp

Possibly add a "go to random page" feature.
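A partial timestamp can be read as a prefix of a full timestamp, so that for example 2016 covers a whole year and 201601 a single month. Under that assumption, a random pick could be sketched as below. The (timestamp, url) layout and the function name are hypothetical.

```python
import random


def random_capture(captures, partial_timestamp, rng=None):
    """Pick a random capture whose timestamp starts with partial_timestamp.

    Returns None when no capture falls within the partial timestamp.
    """
    rng = rng or random
    candidates = [c for c in captures if c[0].startswith(partial_timestamp)]
    return rng.choice(candidates) if candidates else None


captures = [
    ("20150101000000", "http://example.com/a"),
    ("20160115000000", "http://example.com/b"),
    ("20160131120000", "http://example.com/c"),
]
print(random_capture(captures, "201601"))
```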

13. Get number of snapshots taken per year in BubbleCalendar
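The yearly counts for this use-case could be derived by grouping captures on the first four digits of the timestamp. This is a minimal sketch assuming timestamps of the form YYYYMMDDhhmmss; the function name is hypothetical.

```python
from collections import Counter


def captures_per_year(timestamps):
    """Count captures per year, keyed by the four-digit year string."""
    return dict(Counter(ts[:4] for ts in timestamps))


print(captures_per_year(["20150101000000", "20150601000000", "20160131120000"]))
```

Returning aggregated counts rather than the full list of captures would keep the response small for urls with many captures.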