Experimental proxy and wrapper boilerplate for safely and efficiently embedding Web Archives (.warc, .warc.gz, .wacz) into web pages.
This implementation:
- Wraps Webrecorder's replayweb.page client-side playback technology.
- Serves, proxies and caches web archive files using NGINX.
- Allows for two-way communication between the embedding website and the embedded archive using post messages.
```html
<!-- Embedding a playback of archive.wacz on https://example.com -->
<iframe
  src="https://wacz.example.com/?source=archive.wacz&url=https://what-was-archived.ext/path"
  allow="allow-scripts allow-forms allow-same-origin"
>
</iframe>
```

See also: Live Demo, Blog post
wacz-exhibitor serves an HTML document containing a pre-configured instance of replayweb.page, Webrecorder's client-side web archive playback system, pointing at a proxied version of the requested WARC/WACZ file.
For security reasons, playback only starts if this HTML document is embedded in a cross-origin <iframe>. This prevents XSS in a context where the <iframe> needs both allow-scripts and allow-same-origin.
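To illustrate the cross-origin requirement, here is a minimal sketch of one common way a page can detect that it is framed by a document from another origin. It is not taken from wacz-exhibitor's source: the helper name isEmbeddedCrossOrigin is hypothetical, and the check relies only on the standard browser behavior that reading a cross-origin parent's location throws.

```javascript
// Hypothetical helper (not part of wacz-exhibitor): returns true when `win`
// is framed by a document served from a different origin.
function isEmbeddedCrossOrigin(win) {
  // A top-level window is its own parent, so it is not embedded at all.
  if (win.parent === win) {
    return false;
  }
  try {
    // Same-origin parents expose their location; browsers throw a
    // SecurityError when the parent belongs to another origin.
    void win.parent.location.href;
    return false;
  } catch (err) {
    return true;
  }
}

// In a browser: if (isEmbeddedCrossOrigin(window)) { /* start playback */ }
```

The function takes the window as an argument so that it can be exercised outside a browser with stand-in objects.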
We recommend hosting wacz-exhibitor on a subdomain of the embedding website to avoid third-party cookie limitations:
www.example.com -> Has iframes pointing at wacz.example.com
wacz.example.com -> Hosts wacz-exhibitor
wacz-exhibitor pulls and serves the requested archive file in the form required by <replay-web-page>: correct Content-Type, support for range requests, CORS resolution and Content Security Policy.
The requested web archive file can be sourced from either:
- The local /archives/ folder. This is where the server will look first.
- A remote location the server will proxy from, defined in nginx.conf.
Serves an HTML document containing an instance of <replay-web-page>, pointing at a proxied archive file.
Must be embedded in a cross-origin <iframe>, preferably on the same parent domain to avoid third-party cookie limitations.
Supported HTTP methods: GET, HEAD
| Name | Required? | Description |
|---|---|---|
| source | Yes | Filename of the .warc, .warc.gz or .wacz. Can contain a path, but cannot be a URL. The file must either be present in the /archives/ folder or on the remote server defined in nginx.conf. |
| url | No | URL of a page within the archive to display. |
| ts | No | Timestamp of the page to retrieve. Can be either a YYYYMMDDHHMMSS-formatted string or a millisecond timestamp. |
| embed | No | <replay-web-page>'s embed mode. Can be set to replayonly to hide its UI. |
| deepLink | No | <replay-web-page>'s deepLink mode. |
| noSandbox | No | If set, will remove the sandbox from the <replay-web-page> iframe. May be necessary for certain playbacks; e.g., cross-browser compatible playbacks of PDFs. |
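The parameters above can be assembled programmatically. The following is a minimal sketch, assuming wacz-exhibitor is hosted at https://wacz.example.com/ and reads its options as standard, percent-encoded query parameters. The helper name buildExhibitorUrl and the placeholder value used for noSandbox are illustrative assumptions, not part of the project.

```javascript
// Hypothetical helper: builds the `src` of a wacz-exhibitor <iframe>.
// `base` is wherever wacz-exhibitor is hosted (an assumption of this sketch).
function buildExhibitorUrl(base, params) {
  const { source, url, ts, embed, deepLink, noSandbox } = params;

  if (typeof source !== "string" || source.length === 0) {
    throw new TypeError("source is required");
  }
  // `source` may contain a path, but must not be a URL.
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(source)) {
    throw new TypeError("source must be a filename or path, not a URL");
  }

  const result = new URL(base);
  result.searchParams.set("source", source);
  if (url !== undefined) result.searchParams.set("url", url);
  if (ts !== undefined) result.searchParams.set("ts", String(ts));
  if (embed !== undefined) result.searchParams.set("embed", embed);
  if (deepLink !== undefined) result.searchParams.set("deepLink", String(deepLink));
  // The table only says "if set": the value "1" is an arbitrary placeholder.
  if (noSandbox) result.searchParams.set("noSandbox", "1");

  return result.toString();
}

const src = buildExhibitorUrl("https://wacz.example.com/", {
  source: "archive.wacz",
  url: "https://what-was-archived.ext/path",
  embed: "replayonly",
});
console.log(src);
```

Building the query string with URLSearchParams percent-encodes the archived page's URL, which keeps any query string it contains from being mistaken for extra wacz-exhibitor parameters.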
```html
<!-- On https://*.domain.ext: -->
<iframe
  src="https://wacz.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path"
  allow="allow-scripts allow-forms allow-same-origin allow-downloads"
>
</iframe>
```

Pulls, caches and serves a given .warc, .warc.gz or .wacz file, with full support for range requests.
It first looks for the given path and file in the local /archives/ folder. If the file is not found there, it tries to proxy it from the remote server defined in nginx.conf.
This project consists of a single Dockerfile derived from the official NGINX Docker image, which can be deployed on any Docker-compatible machine.
The following example describes the process of deploying wacz-exhibitor on fly.io, a platform-as-a-service provider.
- nginx.conf needs to be edited. See comments starting with EDIT: in the document for instructions.
- Install the flyctl client and sign in, if not already done.
- Initialize and deploy the project by running the flyctl launch command (use flyctl deploy for subsequent deploys).
- wacz-exhibitor is now live and visible on the fly.io dashboard.
- We highly recommend setting up a custom domain and SSL certificate. This can be done directly from the fly.io dashboard. Ideally, the target domain should be a subdomain of the website on which wacz-exhibitor iframes are going to be embedded: for example, www.domain.ext embedding an <iframe> from wacz.domain.ext.
```bash
docker build . -t wacz-exhibitor-local
docker run --rm -p 8080:8080 wacz-exhibitor-local
# wacz-exhibitor is now accessible at http://localhost:8080
```

Shortcut: start-dev.sh
A minimal sandbox is available to test embedding wacz-exhibitor <iframe>s in webpages.
You may edit sandbox/index.html to make it point to a specific web archive file and run the following command to start the sandbox:
```bash
# Assuming: wacz-exhibitor is running on port 8080 ...
bash start-sandbox.sh
# The sandbox is now accessible at http://localhost:8000
```

wacz-exhibitor allows the embedding website to communicate with the embedded archive playback using post messages.
All messages coming from a wacz-exhibitor <iframe> come with a waczExhibitorHref property, helping identify the sender.
This feature can be used to build interactive experiences using web archive files.
wacz-exhibitor will look for the following properties in messages coming from the embedding website and react accordingly:
| Property name | Expected value | Description |
|---|---|---|
| updateUrl | String | If provided, will replace the current url parameter of <replay-web-page>. |
| updateTs | Number | If provided, will replace the current ts parameter of <replay-web-page>. |
| getCollInfo | Boolean | If provided, will send a post message back with <replay-web-page>'s collInfo object, containing meta information about the currently-loaded archive. |
| getInited | Boolean | If provided, will send a post message back with the current value of <replay-web-page>'s inited property, indicating whether or not the service worker is ready. |
| overrideElementAttribute | HTMLAttributeOverride | If provided, will look for the element with the specified CSS selector inside <replay-web-page> and, if found, apply the requested HTML attribute to it. If the element is not found, will send a post message back reporting "status": "timed out", along with a copy of the original message's data. |
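Because each property in the table expects a specific type, it can help to check a message before posting it. The following is a minimal sketch of such a check; the helper name checkExhibitorMessage is hypothetical, and overrideElementAttribute is deliberately left out because the shape of HTMLAttributeOverride is not described here.

```javascript
// Expected JavaScript types, taken from the "Expected value" column above.
const EXPECTED_TYPES = new Map([
  ["updateUrl", "string"],
  ["updateTs", "number"],
  ["getCollInfo", "boolean"],
  ["getInited", "boolean"],
]);

// Hypothetical helper: returns a shallow copy of `message` after checking that
// every property is known to this sketch and has the expected type.
function checkExhibitorMessage(message) {
  for (const [name, value] of Object.entries(message)) {
    const expected = EXPECTED_TYPES.get(name);
    if (expected === undefined) {
      throw new TypeError(`Unknown property: ${name}`);
    }
    if (typeof value !== expected) {
      throw new TypeError(`${name} must be a ${expected}`);
    }
  }
  return { ...message };
}

// updateTs expects a Number, e.g. a millisecond timestamp.
const message = checkExhibitorMessage({
  updateUrl: "https://what-was-archived.ext/new-path",
  updateTs: Date.UTC(2022, 7, 16, 16, 25, 27),
});
console.log(message);
```

In a browser, the checked object would then be passed to the playback <iframe>'s contentWindow.postMessage(), with the iframe's origin as the target origin.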
wacz-exhibitor will forward to the embedding website every post message sent by <replay-web-page>'s service worker.
The most common example is the following, which is sent during navigation within an archive:
```json
{
  "waczExhibitorHref": "https://wacz.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path",
  "url": "https://what-was-archived.ext/new-path/",
  "view": "pages",
  "ts": "20220816162527"
}
```

```javascript
// Assuming: there's only 1 <iframe class="wacz-exhibitor">
const playback = document.querySelector("iframe.wacz-exhibitor");

window.addEventListener("message", (event) => {
  // This message bears data and comes from the `wacz-exhibitor` <iframe>
  if (event?.data && event.source === playback.contentWindow) {
    console.log(event);
  }
});
```

```javascript
// Assuming: there's only 1 <iframe class="wacz-exhibitor">
const playback = document.querySelector("iframe.wacz-exhibitor");
const playbackOrigin = new URL(playback.src).origin;

playback.contentWindow.postMessage(
  {"updateUrl": "https://what-was-archived.ext/new-path"},
  playbackOrigin
);
```
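Once a navigation message has been received, its fields can be turned into something easier to work with. The following is a minimal sketch that reads the waczExhibitorHref, url and ts properties of the navigation message documented above. The helper names parseArchiveTimestamp and readNavigation are hypothetical, and the sketch assumes that the 14-digit ts value is expressed in UTC.

```javascript
// Converts a 14-digit YYYYMMDDHHMMSS timestamp into a Date.
// Assumption: the timestamp is expressed in UTC.
function parseArchiveTimestamp(ts) {
  const match = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
  if (!match) {
    return null;
  }
  const [year, month, day, hour, minute, second] = match.slice(1).map(Number);
  // Date.UTC() expects a zero-based month.
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

// Hypothetical helper: extracts navigation details from a message's data.
// Returns null for anything that does not look like a navigation message.
function readNavigation(data) {
  if (!data || typeof data.waczExhibitorHref !== "string" || typeof data.url !== "string") {
    return null;
  }
  return {
    exhibitor: data.waczExhibitorHref,
    url: data.url,
    capturedAt: typeof data.ts === "string" ? parseArchiveTimestamp(data.ts) : null,
  };
}

const navigation = readNavigation({
  waczExhibitorHref: "https://wacz.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path",
  url: "https://what-was-archived.ext/new-path/",
  view: "pages",
  ts: "20220816162527",
});
console.log(navigation);
```

Inside a message listener, readNavigation(event.data) could be called after checking event.source, so that messages unrelated to navigation are ignored.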