Save entire Web page to IPFS #91

chpio · 2016-03-02T10:45:23Z

Degraded user experience example when the user is just shown raw HTML content upon trying to "Import to IPFS" a web page.

(This is a part of meta-issue: Mirroring Web to IPFS tracked at ipfs/in-web-browsers#94)

in addition to #59

lidel · 2016-03-02T12:27:35Z

Sounds good.
The way I see this feature:

- Save and wrap all assets (images, JS, CSS, HTML) in a directory object.
- Additionally, create a pixel-perfect screenshot of the entire page.
- Wrap both the HTML and PNG snapshots in a shareable landing page that lets you pick whether you want the screenshot or the HTML version of the snapshot.
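
The directory layout described above could be sketched roughly as follows. This is only an illustration: the function name, the file paths, and the `{ path, content }` entry shape are assumptions made for the example, not something Companion implements.

```javascript
// Hypothetical helper: builds the list of files for one snapshot directory.
// `assets` is an array of { path, content } objects collected elsewhere.
function buildSnapshotEntries (html, screenshotPng, assets) {
  // Minimal landing page that links to both versions of the snapshot.
  const landingPage = [
    '<!doctype html>',
    '<meta charset="utf-8">',
    '<title>Page snapshot</title>',
    '<ul>',
    '  <li><a href="snapshot/index.html">HTML version</a></li>',
    '  <li><a href="screenshot.png">Screenshot (PNG)</a></li>',
    '</ul>'
  ].join('\n')

  return [
    { path: 'index.html', content: landingPage },
    { path: 'screenshot.png', content: screenshotPng },
    { path: 'snapshot/index.html', content: html },
    // Assets live next to the saved HTML so relative links keep working.
    ...assets.map(a => ({ path: `snapshot/${a.path}`, content: a.content }))
  ]
}
```

Entries in this shape could then be handed to an IPFS client call that wraps them in a single directory (for example, js-ipfs `addAll` with the `wrapWithDirectory` option), so that one CID addresses the landing page, the screenshot, and the HTML snapshot together.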

Update: 2read extension is a great poc!

lidel · 2018-07-13T22:26:39Z

@victorbjelkholm do you think websaver (or its parts responsible for saving DOM) could be re-used for this?

victorb · 2018-07-24T11:11:49Z

@lidel for sure! I think the hard part is replicating the DOM into a string that can be used to render it again. Websaver kind of works, but it's held together with hacks, as I couldn't find a clean way of serializing the DOM. The best way I found was this:

    // Serialize the whole document into a string
    const serializer = new window.XMLSerializer()
    // Scripts come out with html-like elements being escaped
    let str = serializer.serializeToString(document)
    str = unescape(str)
    const url = document.location.toString()
    // `preview` is not defined in this snippet; presumably the screenshot captured earlier
    const item = { type: 'archive', url, content: str, preview }
    browser.runtime.sendMessage(item)

Then the second part (edit: this is actually what happens first, then the serialization happens) is going through all link, style and script tags and properly downloading and inlining them. Basically, that part downloads them (via a background script to avoid CORS restrictions) if there is a href/src attribute, and turns them into blobs that can be inlined. It also takes a screenshot with the tabs.captureVisibleTab API and finally saves the object to local storage + IPFS.
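
A minimal sketch of the inlining step, for illustration only. It swaps in data URLs rather than blobs, and uses a regular expression instead of walking the DOM so that it stays self-contained; `toDataUrl`, `inlineAssets`, and the shape of the `assets` map are hypothetical names, not websaver's actual code. Fetching the bytes (e.g. from a background script) is assumed to have happened already.

```javascript
// Encode raw bytes (Uint8Array) as a data: URL with the given MIME type.
function toDataUrl (mimeType, bytes) {
  let binary = ''
  for (const byte of bytes) binary += String.fromCharCode(byte)
  return `data:${mimeType};base64,${btoa(binary)}`
}

// Replace src/href attribute values that match an already-fetched asset
// with a data URL. `assets` is a Map from the attribute value exactly as
// written in the HTML to { mimeType, bytes }. Unknown URLs are left alone.
function inlineAssets (html, assets) {
  return html.replace(/\b(src|href)="([^"]+)"/g, (match, attr, url) => {
    const asset = assets.get(url)
    return asset ? `${attr}="${toDataUrl(asset.mimeType, asset.bytes)}"` : match
  })
}
```

A real content script would more likely iterate over `document.querySelectorAll('link, script, img')` and rewrite attributes on the elements themselves; the string-based version only shows the substitution.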

I also hit another tricky issue that I'm unsure how to solve. The current implementation is naive in that it assumes a URL mostly serves the same content to all users, which is not true. Sometimes web applications render data into JS files (hurr) that are served as normal scripts.

lidel · 2018-07-24T13:02:41Z

I know the pain: we can't even use MHTML as it is not supported by browser vendors (anymore).

My hope is that in the long term something like webpackage will gain adoption. It aims to address (among other things) the website snapshotting use case in a safe and reproducible manner that is aware of HTTP semantics. See webpackage: Save and share a web page (Use Case), which sounds super relevant to what we want as the endgame here and for ipfs/in-web-browsers#94 in general.

But for now, doing a rough snapshot via the inlining + serialization you described, along with a screenshot, could cover ~80% of use cases and is something we could do with today's tools.

@victorb Is the websaver repo available somewhere, or is it just a quick hack distilled into the snippet above?

lidel · 2019-03-22T17:23:57Z

Reopening: I believe "Add to IPFS" via right-click on a page will only save HTML alone.

Mirroring a full page with assets (images, CSS and JS) requires additional work integrating something like websaver by @victorb.

Mikaela · 2020-02-25T16:14:16Z

Is this issue the reason why this import of the PrivacyTools homepage looks broken? I took it as a demo and was going to open a duplicate or ask about it as part of #850, as I stored three pages to IPFS and only encountered this issue with that one.

jessicaschilling · 2020-12-02T18:11:48Z

Bumping this due to its being mentioned again in #948.

lidel · 2020-12-10T00:29:54Z

@Gozala do you think there are lessons from https://github.com/inkandswitch/xcrpt that we could apply here without spinning our wheels too much? Sounds like a similar problem space and a really useful feature to have.

Gozala · 2020-12-10T19:05:33Z

Thanks for asking @lidel, indeed the goals are very similar here. I think there are a couple of things I'd share from building that:

- freeze-dry is a great library that does some of the heavy lifting in terms of actually extracting all the resources from the page.
- It does, however, inline everything into HTML via data URLs, which isn't ideal. I made some efforts in the past to redesign the crawler so that individual resources could be saved as separate files, and used it to prototype a proxy server that live-archives a page to IPFS as it's loaded.
- For the end-user experience it doesn't really matter whether content is inlined as data URLs or files are interlinked. So I'd start with the simpler approach and improve internals later on.
- Archived pages that I've been focusing on are more like PDFs (snapshots of the DOM tree at the moment of archival) than interactive pages. That means clicking buttons doesn't work, etc. That was, however, a deliberate choice, as some of the past work and user research led us to believe that it better addresses actual user needs and is technically simpler. Here are some relevant links:
  - https://gozala.io/work/web-clips
  - https://gozala.io/work/web-highlighter
- All that led me to believe that a stripped-down version of the page snapshot (e.g. a markdownified version with standard rendering) plus a switch to see the full-fledged version is the most interesting approach.
- @ikreymer's work on https://conifer.rhizome.org/ is worth taking a look at as well. It takes a different approach: recording network requests with response headers and bodies so that the live page can be reenacted. I believe there is some effort to allow storing recordings in IPFS as well.
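
To make the contrast with DOM snapshots concrete, the record-and-replay idea from the last point can be sketched as a tiny in-memory store. This is a toy illustration under assumed names (`Recording`, `record`, `replay`); it is not how Conifer or the Webrecorder tools are implemented, and real archives use formats such as WARC.

```javascript
// Hypothetical in-memory store of recorded HTTP exchanges, keyed by method and URL.
class Recording {
  constructor () {
    this.exchanges = new Map()
  }

  // Store the response (status, headers, body) observed for a request.
  record (method, url, response) {
    this.exchanges.set(`${method} ${url}`, response)
  }

  // Return the recorded response, or a 404 placeholder if nothing was captured.
  replay (method, url) {
    return this.exchanges.get(`${method} ${url}`) ||
      { status: 404, headers: {}, body: '' }
  }
}
```

Because the page's own scripts run again against the recorded responses, this style of archive can stay interactive, whereas a DOM snapshot freezes one rendered state.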

One other, more meta point I'd like to make: I think it would be better to see more interoperable tools working in concert with each other than to attempt to build a whole suite of tools into one product, in this case ipfs-companion. That is to suggest that it would be more desirable to seek out other projects in the space that are already working on archiving pages in some form and figure out how they could collectively enhance each other. E.g. Brave's approach of suggesting to enable an IPFS extension when navigating to a resource on IPFS is a good example: the functionality may not be built in, but it provides a good way to add it.

Hope some of this is helpful here.

ikreymer · 2020-12-10T20:47:08Z

@Gozala thanks for the mention!
Yes, Webrecorder tools have focused on IPFS support lately.
The https://replayweb.page/ system (https://github.com/webrecorder/replayweb.page) now actually supports replay of archived pages from IPFS directly and can itself be run from IPFS.

Here are just two examples:

Loading from IPFS via a proxy: https://ipfs.io/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/?source=./examples/netpreserve-twitter.warc
Here's the latest version of ReplayWeb.page itself loading from IPFS directly:
https://replayweb.page/?source=ipfs%3A%2F%2FQmYvsdJt7ji8bqBFLLjRAcAPgcqFMfb7WGsbXzr6TFk6yM%2Fissue-02.wacz

This archive (of Parametric Press Issue 02) is ~100MB, but everything is loaded on demand. You can link to specific pages in the archive and to search queries, and text search is available as well.

The archives are serialized as standard WARCs or WACZ (a new zip based format that supports random access).

I think this approach works much better than saving static DOM, which generally is limited to pages that are self-contained or fully static, and is not necessarily simpler to create. If you really only want static content, another extension that works really well for that is: https://github.com/gildas-lormeau/SingleFile

Depending on what the goals of this issue are, I suppose you could close this as resolved :)

I'll have more updates on this work soon, and would be happy to chat more about collaborations if anyone is interested!

Gozala · 2020-12-17T00:53:12Z

@ikreymer that is really impressive work, I'm blown away!

> I think this approach works much better than saving static DOM, which generally is limited to pages that are self-contained or fully static, and is not necessarily simpler to create.

I think it really depends on what the goals are. That said, your work got me thinking that it may indeed be better to capture all you can (like you do) and just have different ways to view that archive. That way you could view a stripped-down version like markdown or a full replica.

> I'll have more updates on this work soon, and would be happy to chat more about collaborations if anyone is interested!

That is exciting! I think it might be great to demo your work on one of the community calls. @autonome might be a good person to chat with about possible collaborations.

jessicaschilling · 2020-12-17T01:13:22Z

@ikreymer This could be a great lightning talk at one of the upcoming IPFS virtual meetups 😊 If you'd be interested in talking, here's the speaker form: https://protocollabs.typeform.com/to/hLGfKhxn

ikreymer · 2020-12-19T18:56:26Z

> @ikreymer that is really impressive work, I'm blown away!

Thanks!

> I think it really depends on what the goals are. That said, your work got me thinking that it may indeed be better to capture all you can (like you do) and just have different ways to view that archive. That way you could view a stripped-down version like markdown or a full replica.

Yes, I agree, it does depend on what the goals are. And starting with a 'high-fidelity' archive, you can always 'downsample' to just getting the static DOM later, or a more limited view, etc...

> @ikreymer This could be a great lightning talk at one of the upcoming IPFS virtual meetups 😊 If you'd be interested in talking, here's the speaker form: https://protocollabs.typeform.com/to/hLGfKhxn

Thanks, I'll sign up for a future call, would be happy to share this!

Gozala · 2020-12-29T05:13:07Z

> > I think it really depends on what the goals are. That said, your work got me thinking that it may indeed be better to capture all you can (like you do) and just have different ways to view that archive. That way you could view a stripped-down version like markdown or a full replica.
>
> Yes, I agree, it does depend on what the goals are. And starting with a 'high-fidelity' archive, you can always 'downsample' to just getting the static DOM later, or a more limited view, etc...

This has been on my mind once again, and I recalled how I had settled on DOM snapshots as the preferred option. We were exploring an alternative medium for the web where we fused the notions of browser history and tabs:

https://www.freecodecamp.org/news/lossless-web-navigation-with-trails-9cd48c0abb56/
https://www.freecodecamp.org/news/lossless-web-navigation-spatial-model-37f83438201d/

We ended up wanting to unload unused tabs and replace them with card renderings that could, on switch, be brought back to life in the exact same state the tab had been in, without a Safari-like flush and reload that loses scroll position, etc. Saving a frozen JS-less DOM provided a reasonable experience; we would just overlay it with a control that told the user it was a snapshot of the page from a specific date and allowed the user to go to the current version.

That is to say, capturing an archive will not be able to fulfill that use case. So ideally I think a tool would capture both a DOM snapshot and a web archive, and provide a way to go from snapshot to archive to current page and vice versa.

ikreymer · 2021-01-23T22:07:41Z

> This has been on my mind once again, and I recalled how I had settled on DOM snapshots as the preferred option. We were exploring an alternative medium for the web where we fused the notions of browser history and tabs:
>
> https://www.freecodecamp.org/news/lossless-web-navigation-with-trails-9cd48c0abb56/
> https://www.freecodecamp.org/news/lossless-web-navigation-spatial-model-37f83438201d/
>
> We ended up wanting to unload unused tabs and replace them with card renderings that could, on switch, be brought back to life in the exact same state the tab had been in, without a Safari-like flush and reload that loses scroll position, etc. Saving a frozen JS-less DOM provided a reasonable experience; we would just overlay it with a control that told the user it was a snapshot of the page from a specific date and allowed the user to go to the current version.
>
> That is to say, capturing an archive will not be able to fulfill that use case. So ideally I think a tool would capture both a DOM snapshot and a web archive, and provide a way to go from snapshot to archive to current page and vice versa.

Thanks for sharing these! Yeah, I think it comes down to whether the 'JS-less DOM' is a reasonable experience or not, which of course depends on the content.

But the trails idea and the various spatial navigations are definitely cool ideas, would be happy to chat about it at some point!
We considered trying to do something like the trails idea with webrecorder.io (now conifer.rhizome.org) a few years ago, but didn't have a chance to implement anything of the sort.

One other thing I've experimented with is saving the window.history stack, and then recreating it via history.pushState. On a few sites, this can actually give you a way to replay a dynamic page that was reached through several history navigations, and allows you to 'go back'. But it only works well on sites that 'play nice' with the History API. Hope to revisit this idea at some point.
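
A rough sketch of that history experiment, under stated assumptions: the History API does not expose past entries, so the sketch records them as the page pushes them and replays them later. `recordPushState` and `replayHistory` are hypothetical names, and the `history` object is passed in as a parameter so the code can also be exercised outside a browser.

```javascript
// Wrap history.pushState so every pushed entry is also appended to `log`.
// `history` is passed in so the sketch can be tried with a stand-in object.
function recordPushState (history, log) {
  const original = history.pushState.bind(history)
  history.pushState = (state, title, url) => {
    log.push({ state, url })
    return original(state, title, url)
  }
  return log
}

// Recreate the recorded entries, in order, on a (fresh) history object.
function replayHistory (history, log) {
  for (const { state, url } of log) {
    history.pushState(state, '', url)
  }
}
```

In a browser one would call `recordPushState(window.history, [])` while archiving and `replayHistory(window.history, savedLog)` when replaying; the recorded URLs have to be same-origin with the replayed page and the state objects serializable, which is part of why this only suits sites that cooperate with the History API.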

ikreymer · 2021-01-23T22:22:05Z

I wanted to finally share: we've just launched the ArchiveWeb.page Chrome extension, which allows for archiving any page in a Chrome-based browser (via the debug protocol, to get full fidelity).

The ArchiveWeb.page extension also includes experimental IPFS support, so users can archive a page (or several), then share via IPFS, and send a link to load via replayweb.page or via a regular. Here's more info in our guide

There was some work to optimize the system for on-demand loading of a web archive. A web archive may get quite large, and to avoid pulling all the blocks in the multihash over to the preload node at once, when loading via ReplayWeb.page, the system tries to pull only the blocks needed for a particular page.
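
As an illustration of why on-demand loading helps, the arithmetic for "which blocks does one page need" can be sketched like this. The function name and the fixed block size are assumptions made for the example; the actual chunking and lookup used by ReplayWeb.page may differ.

```javascript
// Return the indices of the fixed-size blocks that cover bytes
// [offset, offset + length) of a file. `blockSize` is an assumed chunk size.
function blocksForRange (offset, length, blockSize) {
  if (length <= 0) return []
  const first = Math.floor(offset / blockSize)
  const last = Math.floor((offset + length - 1) / blockSize)
  const indices = []
  for (let i = first; i <= last; i++) indices.push(i)
  return indices
}
```

For example, with an assumed 256 KiB block size, a 100,000-byte record starting at byte 200,000 touches only blocks 0 and 1, regardless of how large the whole archive is.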

The main issue for now is that if the websocket connection to the preload node is lost and is not reestablished, the sharing is stopped (I understand this is being addressed via: libp2p/js-libp2p#744)

Also looking to see how this can work better in Brave using the native IPFS node.

I'll sign up for the virtual meetup and am happy to talk more about this work!

jessicaschilling · 2021-01-24T01:39:42Z

Thanks for the update! If you'd be interested in presenting at one of the virtual meetups, send a note to ipfs-community@protocol.ai with some details about your extension and what you'd specifically be interested in presenting.

Also, consider submitting to Awesome IPFS? https://github.com/ipfs/awesome-ipfs/blob/master/CONTRIBUTING.md

lidel · 2021-03-04T14:25:19Z

For drive-by readers, this video was recorded at a recent IPFS Community Meetup and gives a good overview and demo of the system created by @ikreymer.

I think for the time being we will aim at making single-page snapshots created by Companion a bit more useful (e.g. inline images and CSS so things look decent) so Firefox users have something, but for advanced archiving we will point at https://replayweb.page + a separate extension.

UI TBD.

lidel added the kind/enhancement A net-new feature or improvement to an existing feature label Mar 2, 2016

lidel changed the title ~~Save whole page to ipfs [feature request]~~ Save whole page to ipfs Mar 2, 2016

lidel added the help wanted Seeking public contribution on this issue label Mar 21, 2016

lidel mentioned this issue Mar 26, 2016

Mirroring Web to IPFS #96

Closed

3 tasks

lidel added status/blocked/missing-api Blocked by missing API and removed help wanted Seeking public contribution on this issue labels Aug 3, 2016

lidel added help wanted Seeking public contribution on this issue and removed status/blocked/missing-api Blocked by missing API labels Jul 22, 2017

lidel added the exp/expert Having worked on the specific codebase is important label Oct 15, 2017

lidel added the status/ready Ready to be worked label Mar 7, 2018

lidel mentioned this issue Jul 24, 2018

Mirroring Web to IPFS ipfs/in-web-browsers#94

Open

2 tasks

chpio closed this as completed Jan 9, 2019

ghost removed the status/ready Ready to be worked label Jan 9, 2019

lidel reopened this Mar 22, 2019

lidel mentioned this issue Mar 22, 2019

add to ipfs dont get html images and css #699

Closed

lidel mentioned this issue Aug 23, 2019

Add initial SPEC draft ipfs-shipyard/cohosting#2

Merged

Mikaela mentioned this issue Feb 25, 2020

This page -> Import to IPFS workflow is confusing #850

Closed

jessicaschilling changed the title ~~Save whole page to ipfs~~ Save entire Web page to IPFS Apr 7, 2020

jessicaschilling added effort/weeks Estimated to take multiple weeks P1 High: Likely tackled by core team if no one steps up status/ready Ready to be worked topic/design-ux UX strategy, research, not solely visual design labels Apr 7, 2020

jessicaschilling mentioned this issue Dec 2, 2020

Include assets when we do "right click > import on ipfs" on a website #948

Closed
