This repository has been archived by the owner on Feb 24, 2023. It is now read-only.

Experimental proxy and wrapper for safely embedding Web Archives (warc.gz, wacz) into web pages.


harvard-lil/warc-embed-netlify


warc-embed-netlify 🏛️

Experimental proxy and wrapper for safely embedding Web Archives (.warc.gz, .wacz) into web pages.

This particular implementation uses Netlify and its Edge Functions as its backbone.

See also: warc-embed (Self-hosted + NGINX version)


Summary

  • Concept
  • Deployment
  • Routes
  • Local development


Concept

"It's a wrapper"

warc-embed-netlify serves an HTML document containing a pre-configured instance of <replay-web-page>, webrecorder's front-end archive playback system, pointing at a proxied version of the requested archive.

For security reasons, playback only starts when this document is embedded in a cross-origin <iframe>. This prevents XSS in a context where the <iframe> needs both allow-scripts and allow-same-origin.
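A minimal sketch of how such a check could look on the client side. This is not taken from embed.js: the function name isEmbeddedCrossOrigin is hypothetical, and the approach (probing the parent's location, which browsers refuse to expose across origins) is an assumption about one reasonable way to do it.

```javascript
// Hypothetical helper: returns true only when `win` is framed by a document
// served from a different origin.
function isEmbeddedCrossOrigin(win) {
  // Not framed at all: a top-level window is its own parent.
  if (win.parent === win) {
    return false;
  }

  try {
    // Reading location.href of a cross-origin parent throws a SecurityError.
    void win.parent.location.href;
    return false; // Readable, so the parent is same-origin.
  } catch (err) {
    return true; // Access denied, so the parent is cross-origin.
  }
}
```

In a browser this would be called as isEmbeddedCrossOrigin(window) before mounting the player.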

See details for the /embed route.

"It's a proxy"

warc-embed-netlify pulls the requested archive file and adds the HTTP headers <replay-web-page> requires in order to download and interpret the file, such as access-control-allow-origin and content-type.

It also offers a very basic polyfill for range requests, required for playing back .wacz files, if the server hosting the archive file does not support this feature.

See details for the /archive.warc.gz and /archive.wacz routes.

Example

<!-- On https://*.domain.ext: -->
<iframe
  src="https://warcembed.domain.ext/embed/?archive-url=https://otherdomain.ext/archive.warc.gz&original-url=https://what-was-archived.ext/path"
  sandbox="allow-scripts allow-modals allow-forms allow-same-origin"
>
</iframe>

☝️ Back to summary


Deployment

Allowlist

The proxy will only pull archive files from hosts listed in allowlist.js.

Edit this file to determine which domains a specific instance of the proxy can pull files from.
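As an illustration of the kind of check the allowlist enables, here is a minimal sketch. The actual shape of allowlist.js is not shown in this README: the ALLOWLIST array of hostnames and the isAllowedArchiveUrl function below are assumptions made for the example.

```javascript
// Hypothetical allowlist shape: a plain array of hostnames.
const ALLOWLIST = ["otherdomain.ext"];

// Returns true if `archiveUrl` is a valid http(s) URL whose host is allowlisted.
function isAllowedArchiveUrl(archiveUrl, allowlist = ALLOWLIST) {
  let parsed;
  try {
    parsed = new URL(archiveUrl);
  } catch {
    return false; // Not a parsable absolute URL.
  }

  if (parsed.protocol !== "https:" && parsed.protocol !== "http:") {
    return false;
  }

  // Compare the exact hostname, so "otherdomain.ext.evil.com" does not match.
  return allowlist.includes(parsed.hostname);
}
```

Matching on the parsed hostname rather than on a substring of the URL avoids accepting URLs that merely contain an allowed domain somewhere in their path.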

Updating <replay-web-page>

This project hosts its own copy of replayweb.page.

You may update it to the latest version by running ./update-replay-web-page.sh and pushing changes.

Deploy on Netlify

At the time of writing this README, Netlify's free plan includes 3 million Edge Function invocations per month, per account.

See Netlify's pricing.

To attach a subdomain to this deployment, see Netlify's documentation on domain management.

☝️ Back to summary


Routes

/embed

Role

Serves an HTML document containing an instance of <replay-web-page>, pointing at a proxied archive file.

Must be embedded in a cross-origin <iframe>, preferably on the same parent domain to avoid third-party cookie limitations:

  • warcembed.domain.ext: Hosts warc-embed-netlify
  • www.domain.ext: Has iframes pointing to warcembed.domain.ext/embed

Methods

GET, HEAD

Source

embed.js

Query parameters

| Name | Required? | Description |
| --- | --- | --- |
| archive-url | Yes | Full URL to the .warc.gz or .wacz file to embed. Must point to a host listed in the allowlist. |
| original-url | Yes | URL of the page that was archived. |
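Because both parameters are themselves URLs, they are safest passed percent-encoded, especially when the archived page's URL carries its own query string. A minimal sketch of building the src value; buildEmbedUrl is a hypothetical helper, not part of this project:

```javascript
// Hypothetical helper: builds the /embed URL for a given deployment.
function buildEmbedUrl(embedBase, archiveUrl, originalUrl) {
  const url = new URL("/embed/", embedBase);

  // searchParams.set() percent-encodes the values, so "&" and "=" inside
  // the archived page's URL cannot be mistaken for extra parameters.
  url.searchParams.set("archive-url", archiveUrl);
  url.searchParams.set("original-url", originalUrl);

  return url.toString();
}

const src = buildEmbedUrl(
  "https://warcembed.domain.ext",
  "https://otherdomain.ext/archive.warc.gz",
  "https://what-was-archived.ext/path?a=1&b=2"
);
console.log(src);
```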

Example

<!-- On https://*.domain.ext: -->
<iframe
  src="https://warcembed.domain.ext/embed/?archive-url=https://otherdomain.ext/archive.warc.gz&original-url=https://what-was-archived.ext/path"
  sandbox="allow-scripts allow-modals allow-forms allow-same-origin"
>
</iframe>

/archive.[wacz|warc.gz]

Role

Pulls a given .wacz or .warc.gz file from the URL given by ?archive-url and serves it with the headers needed for playback, including:

  • access-control-allow-origin
  • accept-ranges
  • content-type
  • content-disposition

The <replay-web-page> instance in the document generated by /embed points to this route.

Files should preferably be hosted on a server that supports range requests: archive.js tries to detect that support and falls back to a basic polyfill when it is missing.
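To illustrate what a basic range-request polyfill involves, here is a minimal sketch. It is not the code from archive.js: parseRange and sliceForRange are hypothetical names, the whole file is assumed to already be in memory as a Uint8Array, and only single ranges are handled. For simplicity, malformed or unsatisfiable ranges fall back to a full 200 response rather than a 416.

```javascript
// Parses a single-range "Range" header against a body of `size` bytes.
// Returns { start, end } (inclusive), or null if absent, malformed or unsatisfiable.
function parseRange(rangeHeader, size) {
  const match = /^bytes=(\d*)-(\d*)$/.exec(rangeHeader || "");
  if (!match || (match[1] === "" && match[2] === "")) return null;

  let start;
  let end;
  if (match[1] === "") {
    // Suffix form "bytes=-N": the last N bytes of the file.
    const suffix = Number(match[2]);
    if (suffix === 0) return null;
    start = Math.max(size - suffix, 0);
    end = size - 1;
  } else {
    start = Number(match[1]);
    end = match[2] === "" ? size - 1 : Math.min(Number(match[2]), size - 1);
  }

  if (start > end || start >= size) return null;
  return { start, end };
}

// Hypothetical polyfill: describes the response to send for `rangeHeader`.
function sliceForRange(fullBody, rangeHeader) {
  const range = parseRange(rangeHeader, fullBody.length);
  if (!range) {
    return {
      status: 200,
      headers: { "accept-ranges": "bytes", "content-length": String(fullBody.length) },
      body: fullBody,
    };
  }

  const body = fullBody.subarray(range.start, range.end + 1);
  return {
    status: 206,
    headers: {
      "accept-ranges": "bytes",
      "content-range": `bytes ${range.start}-${range.end}/${fullBody.length}`,
      "content-length": String(body.length),
    },
    body,
  };
}
```

Buffering the full file keeps the sketch short, but it costs memory and bandwidth on every request, which is one reason hosting archives on a server with native range support remains preferable.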

Methods

GET, HEAD

Source

archive.js

Query parameters

Name Required ? Description
archive-url Yes Full URL to the .wacz or .warc.gz file to embed. Must point to a host listed in the allowlist.

☝️ Back to summary


Local development

This project can be run locally using the Netlify CLI. No account is needed.

In your terminal:

# Install netlify-cli globally 
npm install netlify-cli -g

# Start the development server (should run on port 8888 by default)
netlify dev

☝️ Back to summary
