Merge pull request #3 from harvard-lil/url-postmessage
v0.0.1
matteocargnelutti committed Oct 5, 2022
2 parents 5e1a7c2 + 9b9eac3 commit 0a33318
Showing 6 changed files with 821 additions and 302 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,12 @@
## v0.0.1 - Oct 5 2022
- Implemented two-way communication system using post messages (see readme).
- Upgraded to `<replay-web-page>` 1.7.0
- Aligned parameter names with `<replay-web-page>` attribute names:
  - `?archive-file` > `?source`
  - `?archived-url` > `?url`
- Made `?url` parameter optional. Will fall back to `page:0` if not provided.
- Added support for `?ts`, `?embed` and `?deepLink` parameters.
- Removed logic to automatically append `noCache` for browsers that do not support the `StorageManager.estimate` API, now handled by `<replay-web-page>` directly.
- Removed logic to automatically append `noWebWorker` for older versions of Safari (< 16), now handled by `<replay-web-page>` directly.
- Removed logic for checking if `<replay-web-page>` is embedded in a cross-origin `<iframe>`, now handled by `<replay-web-page requireSubDomainIframe>`.
- Used `<replay-web-page>`'s `sandbox` attribute as a replacement for `noSandbox`.
93 changes: 78 additions & 15 deletions README.md
@@ -1,8 +1,9 @@
# warc-embed 🏛️
Experimental proxy and wrapper for safely embedding Web Archives (`.warc.gz`, `.wacz`) into web pages.
Experimental proxy and wrapper boilerplate for safely and efficiently embedding Web Archives (`.warc.gz`, `.wacz`) into web pages.

This particular implementation, based on [NGINX](https://www.nginx.com/), consists in a [docker-compose setup](https://docs.docker.com/compose/) allowing for quick and easy deployment on a VPS.<br>
It also benefits from [NGINX's advanced range request caching features](https://www.nginx.com/blog/smart-efficient-byte-range-caching-nginx/).
This particular implementation:
- Wraps [Webrecorder's `<replay-web-page>`](https://replayweb.page/docs/embedding) client-side playback technology.
- Serves, proxies and [caches](https://www.nginx.com/blog/smart-efficient-byte-range-caching-nginx/) web archive files using [NGINX](https://www.nginx.com/). The implementation consists of a [docker-compose setup](https://docs.docker.com/compose/), allowing for quick and easy deployment on a VPS.

🖼️ [Live Demo](https://warcembed-demo.lil.tools)

@@ -14,7 +15,9 @@ It also benefits from [NGINX's advanced range request caching features](https://
- [Concept](#concept)
- [Environment Variables](#environment-variables)
- [Routes](#routes)
- [Communicating with the embedded archive](#communicating-with-the-embedded-archive)
- [Deployment](#deployment)
- [Changelog](/CHANGELOG.md)

---

@@ -36,7 +39,7 @@ The playback will only start when said document is embedded in a cross-origin `<
```html
<!-- On https://*.domain.ext: -->
<iframe
src="https://warcembed.domain.ext/?archive-file=archive.warc.gz&archived-url=https://what-was-archived.ext/path"
src="https://warcembed.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path"
allow="allow-scripts allow-modals allow-forms allow-same-origin"
>
</iframe>
@@ -55,11 +58,13 @@ These environment variables are used by `docker-compose` to replace values in `n
| `HOST_NAME` | Yes | Host name of the deployed instance of `warc-embed`. Ex: `warcembed.example.com`. |
| `REMOTE_ARCHIVES_SERVER` | Yes | Remote location to fetch archives from when not present locally. Ex: `https://warcserver.example.com` |

[☝️ Back to summary](#summary)

---

## Routes

### /?archive-file=X&archived-url=Y
### /?source=X&url=Y

#### Role
Serves [an HTML document containing an instance of `<replay-web-page>`](/html/embed/index.html), pointing at a proxied archive file.
@@ -76,21 +81,17 @@ www.example.com: Has iframes pointing at warcembed.example.com
#### Query parameters
| Name | Required ? | Description |
| --- | --- | --- |
| `archive-file` | Yes | Path + filename of the `.warc.gz` or `.wacz`. Can contain a path. <br>Must either be present in the [`/archives/` folder](/html/archives/) or on the remote server defined by [the `REMOTE_ARCHIVES_SERVER` environment variable](#environment-variables). |
| `archived-url` | Yes | Url of the page that was archived. |
| `show-location-bar` | No | If set, will show `<replay-web-page>`'s address bar. <br>Particularly useful for multi-page archives.|
| `source` | Yes | Path + filename of the `.warc.gz` or `.wacz`. Can contain a path. <br>Must either be present in the [`/archives/` folder](/html/archives/) or on the remote server defined by [the `REMOTE_ARCHIVES_SERVER` environment variable](#environment-variables). |
| `url` | No | URL of a page within the archive to display. If not set, will try to open the first page available. |
| `ts`| No | Timestamp of the page to retrieve. Can be either a YYYYMMDDHHMMSS-formatted string or a millisecond timestamp. |
| `embed` | No | `<replay-web-page>`'s [embed mode](https://replayweb.page/docs/embedding). Can be set to `replayonly` to hide its UI. |
| `deepLink` | No | `<replay-web-page>`'s [`deepLink` mode](https://replayweb.page/docs/embedding). |
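
The query parameters above can also be assembled programmatically. The following is a minimal sketch, not part of `warc-embed`: `buildEmbedUrl()` is a hypothetical helper, and passing `"1"` for `deepLink` is an assumption (any non-empty value appears to enable it). Note that `URLSearchParams` percent-encodes the `url` value, which decodes back to the same string on the receiving side.

```javascript
// Hypothetical helper (not part of warc-embed): builds the `src` of a warc-embed <iframe>.
function buildEmbedUrl(base, { source, url, ts, embed, deepLink } = {}) {
  // `source` is the only required parameter.
  if (!source) {
    throw new Error("`source` is required.");
  }

  const embedUrl = new URL(base);
  embedUrl.searchParams.set("source", source);

  // Optional parameters are only appended when provided.
  if (url) embedUrl.searchParams.set("url", url);
  if (ts) embedUrl.searchParams.set("ts", String(ts));
  if (embed) embedUrl.searchParams.set("embed", embed);
  if (deepLink) embedUrl.searchParams.set("deepLink", "1"); // Assumption: any non-empty value works

  return embedUrl.href;
}
```

The returned string can then be assigned to the `src` attribute of the `<iframe>`.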

#### Examples
```html
<!-- On https://*.domain.ext: -->
<iframe
src="https://warcembed.domain.ext/?archive-file=archive.warc.gz&archived-url=https://what-was-archived.ext/path"
allow="allow-scripts allow-modals allow-forms allow-same-origin"
>
</iframe>

<iframe
src="https://warcembed.domain.ext/?archive-file=/some/folder/archive.warc.gz&archived-url=https://what-was-archived.ext/path&show-location-bar=1"
src="https://warcembed.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path"
allow="allow-scripts allow-modals allow-forms allow-same-origin"
>
</iframe>
@@ -103,6 +104,66 @@ Pulls, caches and serves a given `.warc.gz` or `.wacz` file, with full support f

Will first look for the path + file given in the local [`/archives/` folder](/html/archives/), and try to proxy it from the remote server defined by [the `REMOTE_ARCHIVES_SERVER` environment variable](#environment-variables).

[☝️ Back to summary](#summary)

---

## Communicating with the embedded archive

`warc-embed` allows the embedding website to communicate with the embedded archive playback using [post messages](https://developer.mozilla.org/en-US/docs/Web/API/Window/postMessage).
All messages coming _from_ a `warc-embed` `<iframe>` come with a `warcEmbedHref` property, helping identify the sender.

### Messages interpreted by the `warc-embed` `<iframe>`
`warc-embed` will look for the following properties in messages coming from the embedding website and react accordingly:

| Property name | Expected value | Description |
| --- | --- | --- |
| `updateUrl` | String | If provided, will replace the current `url` parameter of `<replay-web-page>`. |
| `updateTs` | Number | If provided, will replace the current `ts` parameter of `<replay-web-page>`. |
| `getCollInfo` | Boolean | If provided, will send a post message back with `<replay-web-page>`'s `collInfo` object, containing meta information about the currently-loaded archive. |
| `getInited` | Boolean | If provided, will send a post message back with the current value of `<replay-web-page>`'s `inited` property, indicating whether or not the service worker is ready. |
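
`updateTs` accepts a millisecond timestamp, which `warc-embed` appears to convert to a UTC `YYYYMMDDHHMMSS` string before handing it to `<replay-web-page>`. The sketch below illustrates that conversion for embedders who want to compare timestamps on their side. `toReplayTimestamp()` is a hypothetical name used for illustration only.

```javascript
// Hypothetical helper: converts a millisecond timestamp to a UTC YYYYMMDDHHMMSS string.
function toReplayTimestamp(ms) {
  const date = new Date(ms);
  const pad = (value, length = 2) => String(value).padStart(length, "0");

  // All components are read in UTC so that the result does not depend on the local timezone.
  return (
    pad(date.getUTCFullYear(), 4) +
    pad(date.getUTCMonth() + 1) + // Months are zero-based
    pad(date.getUTCDate()) +
    pad(date.getUTCHours()) +
    pad(date.getUTCMinutes()) +
    pad(date.getUTCSeconds())
  );
}
```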

### Messages hoisted from `<replay-web-page>`
`warc-embed` will forward to the embedding website every post message sent by `<replay-web-page>`'s service worker.

The most common example is the following, which is sent during navigation within an archive:

```json
{
"warcEmbedHref": "https://warcembed.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path",
"url": "https://what-was-archived.ext/new-path/",
"view": "pages",
"ts": "20220816162527"
}
```
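
A navigation message like the one above could be turned into a small state object on the embedding side. This is a sketch under assumptions: `readNavigation()` is a hypothetical helper, and it only relies on the `warcEmbedHref`, `url`, `view` and `ts` properties shown in the example.

```javascript
// Hypothetical helper: extracts navigation info from a hoisted message.
// Returns `null` if the message does not come from the expected embed or bears no `url`.
function readNavigation(data, embedHref) {
  if (!data || typeof data !== "object") {
    return null;
  }

  // `warcEmbedHref` identifies which `warc-embed` <iframe> sent the message.
  if (data.warcEmbedHref !== embedHref) {
    return null;
  }

  if (typeof data.url !== "string") {
    return null;
  }

  return { url: data.url, ts: data.ts ?? null, view: data.view ?? null };
}
```

In a `message` event listener, this would typically be called with `event.data` and the `src` of the `<iframe>`.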

### Example: Intercepting messages from a `warc-embed` `<iframe>`
```javascript
// Assuming: there's only 1 <iframe class="warc-embed">
const playback = document.querySelector("iframe.warc-embed");

window.addEventListener("message", (event) => {
  // This message bears data and comes from the `warc-embed` <iframe>
  if (event?.data && event.source === playback.contentWindow) {
    console.log(event);
  }
});
```
### Example: Sending a message to a `warc-embed` `<iframe>`
```javascript
// Assuming: there's only 1 <iframe class="warc-embed">
const playback = document.querySelector("iframe.warc-embed");
const playbackOrigin = new URL(playback.src).origin;

playback.contentWindow.postMessage(
  {"updateUrl": "https://lil.law.harvard.edu/projects"},
  playbackOrigin
);
```
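
### Example: Waiting for a reply from a `warc-embed` `<iframe>`
The two examples above can be combined for `getCollInfo` and `getInited`, which both trigger a reply. The following is a minimal sketch, not part of `warc-embed`: `isReplyFrom()` and `requestFromEmbed()` are hypothetical helpers, and the 5 second default timeout is an arbitrary choice. It assumes replies carry a `collInfo` or `inited` property, as described in the table of supported messages.

```javascript
// Hypothetical helper: true when `event` is a reply from `frameWindow` bearing `replyKey`.
function isReplyFrom(event, frameWindow, frameOrigin, replyKey) {
  return Boolean(
    event &&
    event.source === frameWindow &&
    event.origin === frameOrigin &&
    event.data &&
    typeof event.data === "object" &&
    replyKey in event.data
  );
}

// Hypothetical helper: posts `request` to the <iframe> and resolves with the first matching reply.
function requestFromEmbed(iframe, request, replyKey, timeoutMs = 5000) {
  const frameOrigin = new URL(iframe.src).origin;

  return new Promise((resolve, reject) => {
    const onMessage = (event) => {
      if (!isReplyFrom(event, iframe.contentWindow, frameOrigin, replyKey)) {
        return;
      }
      clearTimeout(timer);
      window.removeEventListener("message", onMessage);
      resolve(event.data[replyKey]);
    };

    // Give up if no matching reply comes back in time.
    const timer = setTimeout(() => {
      window.removeEventListener("message", onMessage);
      reject(new Error(`No "${replyKey}" reply after ${timeoutMs}ms.`));
    }, timeoutMs);

    window.addEventListener("message", onMessage);
    iframe.contentWindow.postMessage(request, frameOrigin);
  });
}
```

For instance, `await requestFromEmbed(playback, { getCollInfo: true }, "collInfo")` would resolve with the collection info object, and `await requestFromEmbed(playback, { getInited: true }, "inited")` with the current `inited` value.
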
[☝️ Back to summary](#summary)

---

## Deployment
@@ -122,3 +183,5 @@ The following quick start checklist will describe one of the many ways this setu
**Note:** Although it doesn't contain any non-public / sensitive information, we encourage you to avoid keeping `.env` around in a production setting.<br>
After initial setup, it may be safely discarded if replaced by actual environment variables.
[☝️ Back to summary](#summary)
66 changes: 2 additions & 64 deletions html/embed/index.html
@@ -17,72 +17,10 @@
width: 100%;
height: 100%;
}

replay-web-page {
display: block;
}
</style>

<script defer src="/replay-web-page/ui.js"></script>
<script type="module">
//
// Don't start playback unless we're in a cross origin <iframe>
//
let isInIframe = true;
let isCrossOrigin = true;
let isLocalhost = window.location.host.startsWith("localhost:");

try {
if (parent.window.location === window.location) {
isInIframe = false;
}

if (parent.window.location.host === window.location.host) {
isCrossOrigin = false;
}
}
catch(err) { // If we don't have access to `parent.window.x`, we're likely ok.
isInIframe = true;
isCrossOrigin = true;
}

if (!isLocalhost && (!isInIframe || !isCrossOrigin)) {
throw new Error("This page should be embedded in a cross origin iframe.");
}

//
// Grab playback info from search params
//
const params = new URLSearchParams(window.location.search);

if (params.get("archive-file") === null || params.get("archived-url") === null) {
throw new Error("`archive-file` and `archived-url` search params must be provided.");
}

//
// Inject `<replay-web-page>`
//
const replay = document.createElement("replay-web-page");
replay.setAttribute("source", `/${params.get("archive-file")}`);
replay.setAttribute("url", `${params.get("archived-url")}`);
replay.setAttribute("replayBase", "/replay-web-page/");
replay.setAttribute("noCache", "");

// Option: &show-location-bar: overrides embed mode to show the location bar.
if (params.get("show-location-bar")) {
replay.setAttribute("embed", "default");
}
else {
replay.setAttribute("embed", "replayonly");
}

// `noWebWorker` for Safari < 16
if (window.GestureEvent !== undefined && window.SharedWorker === undefined){
replay.setAttribute("noWebWorker", "");
}

document.querySelector("body").appendChild(replay);
</script>
<script type="module" src="/replay-web-page/ui.js"></script>
<script type="module" src="/index.js"></script>
</head>

<body>
143 changes: 143 additions & 0 deletions html/embed/index.js
@@ -0,0 +1,143 @@
//------------------------------------------------------------------------------
// Module-level variables
//------------------------------------------------------------------------------
const params = new URLSearchParams(window.location.search);
const player = document.createElement("replay-web-page");

//------------------------------------------------------------------------------
// Check for required params
//------------------------------------------------------------------------------
if (params.get("source") === null) {
throw new Error("`source` search param must be provided.");
}

//------------------------------------------------------------------------------
// Prepare and inject `<replay-web-page>`
//------------------------------------------------------------------------------
player.setAttribute("source", `/${params.get("source")}`);
player.setAttribute("url", `page:0`);
player.setAttribute("replayBase", "/replay-web-page/");
player.setAttribute("embed", "default");
player.setAttribute("requireSubDomainIframe", "");
player.setAttribute("sandbox", "");

// Param: `url` (see: https://replayweb.page/docs/embedding)
if (params.get("url")) {
player.setAttribute("url", params.get("url"));
}

// Param: `ts` (see: https://replayweb.page/docs/embedding)
if (params.get("ts")) {
player.setAttribute("ts", handleTsParam(params.get("ts")));
}

// Param: `embed` (see: https://replayweb.page/docs/embedding)
if (["default", "full", "replayonly", "replay-with-info"].includes(params.get("embed"))) {
player.setAttribute("embed", params.get("embed"));
}

// Param: `deepLink` (see: https://replayweb.page/docs/embedding)
if (params.get("deepLink")) {
player.setAttribute("deepLink", "");
}

document.querySelector("body").appendChild(player);

//------------------------------------------------------------------------------
// Two-way communication between embedder and embedded
//------------------------------------------------------------------------------
window.addEventListener("message", (event) => {
//
// Forward messages coming from the service worker
//
try {
if (event.source.location.pathname === player.getAttribute("replayBase")) {
parent.window.postMessage(
{ warcEmbedHref: window.location.href, ...event.data },
"*"
);
}
}
catch(err) {
// Will fail on cross-origin messages
}

//
// Handle messages coming from parent
//
if (event.source === parent.window && event.data) {

// `updateUrl`: Updates `<replay-web-page>`'s "url" attribute
if (event.data["updateUrl"]) {
player.setAttribute("url", event.data.updateUrl);
}

// `updateTs`: Updates `<replay-web-page>`'s "ts" attribute
if (event.data["updateTs"]) {
player.setAttribute("ts", handleTsParam(event.data.updateTs));
}

// `getInited`: Hoists current value of `<replay-web-page>.__inited`.
// This value indicates whether or not the service worker is ready.
if (event.data["getInited"]) {
parent.window.postMessage(
{ inited: player.__inited, warcEmbedHref: window.location.href },
event.origin
);
}

// `getCollInfo`
// Pries into `<replay-web-page>` to hoist `wr-coll.__collInfo`, which contains useful collection-related data.
if (event.data["getCollInfo"]) {
let collInfo = {};

try {
collInfo = player.shadowRoot
.querySelector("iframe")
.contentDocument
.querySelector("replay-app-main")
.shadowRoot
.querySelector("wr-coll")
.__collInfo;
}
catch(err) {
// console.log(err); // Not blocking | Just not ready.
}

parent.window.postMessage(
{ collInfo: collInfo, warcEmbedHref: window.location.href },
event.origin
);
}

}

}, false);

//------------------------------------------------------------------------------
// Utils
//------------------------------------------------------------------------------
/**
 * Converts `ts` from a millisecond timestamp to YYYYMMDDHHMMSS if necessary.
 * In `<replay-web-page>`, `ts` can be either depending on context, which can lead to confusion.
 * This function brings support for `ts` as either a timestamp OR a formatted date.
 *
 * @param {Number|String} ts
 * @returns {Number|String} A YYYYMMDDHHMMSS string if converted, the parsed number otherwise.
 */
function handleTsParam(ts) {
  ts = parseInt(ts, 10);

  // Up to 13 digits: treat as a millisecond timestamp and convert it, using UTC throughout.
  if (ts <= 9999999999999) {
    const date = new Date(ts);
    let newTs = `${date.getUTCFullYear()}`;
    newTs += `${(date.getUTCMonth() + 1).toString().padStart(2, "0")}`;
    newTs += `${date.getUTCDate().toString().padStart(2, "0")}`;
    newTs += `${date.getUTCHours().toString().padStart(2, "0")}`;
    newTs += `${date.getUTCMinutes().toString().padStart(2, "0")}`;
    newTs += `${date.getUTCSeconds().toString().padStart(2, "0")}`;
    ts = newTs;
  }

  return ts;
}
120 changes: 114 additions & 6 deletions html/replay-web-page/sw.js

Large diffs are not rendered by default.

689 changes: 472 additions & 217 deletions html/replay-web-page/ui.js

Large diffs are not rendered by default.
