Save entire Web page to IPFS #91

Open
chpio opened this issue Mar 2, 2016 · 18 comments
Labels
  • effort/weeks: Estimated to take multiple weeks
  • exp/expert: Having worked on the specific codebase is important
  • help wanted: Seeking public contribution on this issue
  • kind/enhancement: A net-new feature or improvement to an existing feature
  • P1: High: Likely tackled by core team if no one steps up
  • status/ready: Ready to be worked
  • topic/design-ux: UX strategy, research, not solely visual design

Comments

@chpio

chpio commented Mar 2, 2016

Degraded user experience example when the user is just shown raw HTML content upon trying to "Import to IPFS" a web page.

(This is a part of meta-issue: Mirroring Web to IPFS tracked at ipfs/in-web-browsers#94)


in addition to #59

@lidel
Member

lidel commented Mar 2, 2016

Sounds good.
The way I see this feature:

  • save and wrap all assets (images, js, css, html) with a directory object
  • additionally, create a pixel-perfect screenshot of the entire page
  • wrap both HTML and PNG snapshots in a shareable landing page that lets you pick if you want screenshot or HTML version of the snapshot
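A minimal sketch of the layout those three bullets describe, assuming a hypothetical `buildSnapshotEntries` helper and a made-up directory structure (`index.html` landing page, `page/` for the HTML snapshot, `screenshot.png`). None of these names come from Companion; the `{ path, content }` entry shape is also an assumption about what an IPFS "add with wrapping directory" call would take.

```javascript
// Hypothetical layout: one entry per file, paths relative to the snapshot root.
// The caller is assumed to have already collected the HTML, screenshot and assets.
function buildSnapshotEntries({ html, screenshotPng, assets }) {
  // Landing page that lets the viewer pick the HTML or the screenshot version
  const landingPage = [
    '<!doctype html>',
    '<title>Page snapshot</title>',
    '<ul>',
    '  <li><a href="page/index.html">HTML version</a></li>',
    '  <li><a href="screenshot.png">Screenshot</a></li>',
    '</ul>'
  ].join('\n')

  const entries = [
    { path: 'index.html', content: landingPage },
    { path: 'page/index.html', content: html },
    { path: 'screenshot.png', content: screenshotPng }
  ]
  // Assets live next to the HTML snapshot so it can reference them as assets/<name>
  for (const [name, content] of Object.entries(assets)) {
    entries.push({ path: `page/assets/${name}`, content })
  }
  return entries
}
```

The saved HTML would need its references rewritten to the relative `assets/<name>` paths for this layout to render correctly.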

Update: the 2read extension is a great PoC!

@lidel lidel added the kind/enhancement A net-new feature or improvement to an existing feature label Mar 2, 2016
@lidel lidel changed the title Save whole page to ipfs [feature request] Save whole page to ipfs Mar 2, 2016
@lidel lidel added the help wanted Seeking public contribution on this issue label Mar 21, 2016
@lidel lidel mentioned this issue Mar 26, 2016
3 tasks
@lidel lidel added status/blocked/missing-api Blocked by missing API and removed help wanted Seeking public contribution on this issue labels Aug 3, 2016
@lidel lidel added help wanted Seeking public contribution on this issue and removed status/blocked/missing-api Blocked by missing API labels Jul 22, 2017
@lidel lidel added the exp/expert Having worked on the specific codebase is important label Oct 15, 2017
@lidel lidel added the status/ready Ready to be worked label Mar 7, 2018
@lidel
Member

lidel commented Jul 13, 2018

@victorbjelkholm do you think websaver (or its parts responsible for saving DOM) could be re-used for this?

@victorb
Member

victorb commented Jul 24, 2018

@lidel for sure! I think the hard part is replicating the DOM into a string that can be used to render it again. Websaver kind of works, but it's held together with hacks, as I couldn't find a clean way to serialize the DOM. The best way I found was this:

    // Serialize the live document into a string
    const serializer = new window.XMLSerializer()
    // Scripts come out with html-like elements being escaped
    let str = serializer.serializeToString(document)
    str = unescape(str)
    const url = document.location.toString()
    // Note: `preview` is not defined in this snippet
    const item = {type: 'archive', url, content: str, preview}
    browser.runtime.sendMessage(item)

Then the second part (edit: this is actually what happens first, then the serialization happens) is going through all link, style and script tags and properly downloading and inlining them. Basically, that part downloads them (via a background script, to avoid CORS restrictions) if there is a href/src attribute, and turns them into blobs that can be inlined. It also takes a screenshot with the tabs.captureVisibleTab API and finally saves the object to local storage + IPFS.
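For drive-by readers, the inlining step described above could look roughly like the sketch below. This is not websaver's code: `toDataUrl` and `inlineResources` are hypothetical helpers, the `resources` map (absolute URL to `{ mimeType, bytes }`) is assumed to have been filled by a background script, and it uses data: URLs and a naive regex instead of blobs and real DOM traversal.

```javascript
// Hypothetical helper: encode already-fetched bytes as a data: URL.
function toDataUrl(mimeType, bytes) {
  let binary = ''
  for (const byte of bytes) binary += String.fromCharCode(byte)
  return `data:${mimeType};base64,${btoa(binary)}`
}

// Replace each src="..." / href="..." value that has a fetched resource
// with an inlined data: URL. Values without a matching resource are left as-is.
// Naive on purpose: a regex is no substitute for walking the DOM.
function inlineResources(html, baseUrl, resources) {
  return html.replace(/\b(src|href)=(["'])(.*?)\2/g, (match, attr, quote, value) => {
    let absolute
    try {
      // Resolve relative references against the page URL
      absolute = new URL(value, baseUrl).toString()
    } catch {
      return match // unparsable value, keep the original attribute
    }
    const resource = resources.get(absolute)
    if (!resource) return match
    return `${attr}=${quote}${toDataUrl(resource.mimeType, resource.bytes)}${quote}`
  })
}
```

Ordinary links such as `<a href="/about">` stay untouched as long as their target is not in `resources`.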

I also hit another tricky issue that I'm unsure how to solve. The current implementation is naive in that it assumes that a URL mostly returns the same content for all users, which is not true. Sometimes web applications render data into JS files (hurr) that are served as normal scripts.

@lidel
Member

lidel commented Jul 24, 2018

I know the pain: we can't even use MHTML as it is not supported by browser vendors (anymore).

My hope is that in the long term something like webpackage will gain adoption. It aims to address (among other things) the website snapshotting use case in a safe and reproducible manner that is aware of HTTP semantics: webpackage: Save and share a web page (Use Case) – sounds super relevant to what we want as the endgame here and for ipfs/in-web-browsers#94 in general.

But for now, doing a rough snapshot via the inlining+serialization you described, along with a screenshot, could cover ~80% of use cases and is something we could do with today's tools.

@victorb Is the websaver repo available somewhere, or is it just a quick hack distilled into the snippet above?

@chpio chpio closed this as completed Jan 9, 2019
@ghost ghost removed the status/ready Ready to be worked label Jan 9, 2019
@lidel
Member

lidel commented Mar 22, 2019

Reopening: I believe "Add to IPFS" via right-click on a page will only save HTML alone.

Mirroring a full page with assets (images, CSS and JS) requires additional work integrating something like websaver by @victorb.

@Mikaela
Contributor

Mikaela commented Feb 25, 2020

Is this issue the reason why this import of the PrivacyTools homepage looks broken? I took it as a demo and was going to open a duplicate or ask about it as part of #850, as I stored three pages to IPFS and only encountered this issue with it.

@jessicaschilling jessicaschilling changed the title Save whole page to ipfs Save entire Web page to IPFS Apr 7, 2020
@jessicaschilling jessicaschilling added effort/weeks Estimated to take multiple weeks P1 High: Likely tackled by core team if no one steps up status/ready Ready to be worked topic/design-ux UX strategy, research, not solely visual design labels Apr 7, 2020
@jessicaschilling
Contributor

Bumping this due to its being mentioned again in #948.

@lidel
Member

lidel commented Dec 10, 2020

@Gozala do you think there are lessons from https://github.com/inkandswitch/xcrpt that we could apply here without spinning our wheels too much? Sounds like a similar problem space and a really useful feature to have.

@Gozala

Gozala commented Dec 10, 2020

Thanks for asking @lidel, indeed the goals are very similar here. I think there are a couple of things I'd share from building that:

  1. freeze-dry is a great library that does some of the heavy lifting in terms of actually extracting all the resources from the page.
  2. It does, however, inline everything into HTML via data URLs, which isn't ideal. I made some efforts in the past to redesign the crawler so that individual resources could be saved as separate files, and used it to prototype a proxy server that live-archives a page to IPFS as it's loaded.
  3. For the end user experience it doesn't really matter if content is inlined as data URLs or if files are interlinked. So I'd start with the simpler approach and improve internals later on.
  4. The archived pages that I've been focusing on are more like PDFs (snapshots of the DOM tree at the moment of archival) than interactive pages. That means clicking buttons doesn't work, etc. That was, however, a deliberate choice, as some of the past work & user research led us to believe that it better addresses actual user needs and is technically simpler. Here are some relevant links here
  5. @ikreymer's work on https://conifer.rhizome.org/ is worth taking a look at as well. It takes a different approach of recording network requests with response headers and bodies so that the live page can be reenacted. I believe there is some effort to allow storing recordings in IPFS as well.
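To make the contrast in point 5 concrete, here is a toy sketch of the record-and-replay idea: store each response keyed by request, then answer the same request from the store later. `RequestRecording` and its methods are hypothetical names for illustration; this is not how Conifer or any Webrecorder tool is implemented, and it ignores request headers, bodies and timing.

```javascript
// Toy recording store: maps "METHOD url" to the captured response.
class RequestRecording {
  constructor() {
    this.records = new Map()
  }

  static key(method, url) {
    return `${method.toUpperCase()} ${url}`
  }

  // Called while the live page loads: keep status, headers and body
  record(method, url, response) {
    this.records.set(RequestRecording.key(method, url), response)
  }

  // Called when the archived page is viewed: answer from the recording,
  // or report "not found" for anything that was never captured
  replay(method, url) {
    const hit = this.records.get(RequestRecording.key(method, url))
    return hit ?? { status: 404, headers: {}, body: '' }
  }
}
```

Because scripts get the same responses they got at capture time, they can run again on replay, which a frozen DOM snapshot does not allow.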

One other, more meta point I'd like to make is that I think it would be better to see more interoperable tools working in concert with each other than attempting to build a whole suite of tools into a product, in this case ipfs-companion. That is to suggest that it would be more desirable to seek out other projects in the space that are already working on archiving pages in some form & figure out ways they could collectively enhance each other. E.g. Brave's approach of suggesting to enable an IPFS extension when navigating to a resource on IPFS is a good example where functionality may not be built in, but there is a good way to add it.

Hope some of this is helpful here.

@ikreymer

ikreymer commented Dec 10, 2020

@Gozala thanks for the mention!
Yes, Webrecorder tools have focused on IPFS support lately.
The https://replayweb.page/ system (https://github.com/webrecorder/replayweb.page) now actually supports replay of archived pages from IPFS directly and can itself be run from IPFS.

Here's just two examples:

This archive (of Parametric Press Issue 02) is ~100MB, but everything is loaded on-demand. You can link to specific pages in the archive and search queries, and text search

The archives are serialized as standard WARCs or WACZ (a new zip based format that supports random access).

I think this approach works much better than saving static DOM, which generally is limited to pages that are self-contained or fully static, and is not necessarily simpler to create. If you really only want static content, another extension that works really well for that is: https://github.com/gildas-lormeau/SingleFile

Depending on what the goals of this issue are, I suppose you could close this as resolved :)

I'll have more updates on this work soon, and would be happy to chat more about collaborations if anyone is interested!

@Gozala

Gozala commented Dec 17, 2020

@ikreymer that is really impressive work, I'm blown away!

> I think this approach works much better than saving static DOM, which generally is limited to pages that are self-contained or fully static, and is not necessarily simpler to create.

I think it really depends on what the goals are. That said, your work got me thinking that it may indeed be better to capture all you can (like you do) and just have different ways to view that archive. That way you could view a stripped-down version like markdown or a full replica.

> I'll have more updates on this work soon, and would happy to chat more about collaborations if anyone is interested!

That is exciting! I think it might be great to demo your work on one of the community calls. @autonome might be a good person to chat with about possible collaborations.

@jessicaschilling
Contributor

@ikreymer This could be a great lightning talk at one of the upcoming IPFS virtual meetups 😊 If you'd be interested in talking, here's the speaker form: https://protocollabs.typeform.com/to/hLGfKhxn

@ikreymer

> @ikreymer that is really impressive work, I'm blown away!

Thanks!

> I think it really depends on what the goals are. That said your work got me thinking that it may indeed be better to capture all you can (like you do) and just have a different ways to view that archive. That way you could view a stripped down version like markdown or a full replica.

Yes, I agree, it does depend on what the goals are. And starting with a 'high-fidelity' archive, you can always 'downsample' to just getting the static DOM later, or a more limited view, etc...

> @ikreymer This could be a great lightning talk at one of the upcoming IPFS virtual meetups 😊 If you'd be interested in talking, here's the speaker form: https://protocollabs.typeform.com/to/hLGfKhxn

Thanks, I'll sign up for a future call, would be happy to share this!

@Gozala

Gozala commented Dec 29, 2020

> I think it really depends on what the goals are. That said your work got me thinking that it may indeed be better to capture all you can (like you do) and just have a different ways to view that archive. That way you could view a stripped down version like markdown or a full replica.
>
> Yes, I agree, it does depend on what the goals are. And starting with a 'high-fidelity' archive, you can always 'downsample' to just getting the static DOM later, or a more limited view, etc...

This had been on my mind once again, and I recalled how I had settled on DOM snapshots as the preferred option. We were exploring an alternative medium for the web where we fused the notions of browser history and tabs:

https://www.freecodecamp.org/news/lossless-web-navigation-with-trails-9cd48c0abb56/
https://www.freecodecamp.org/news/lossless-web-navigation-spatial-model-37f83438201d/

We ended up wanting to unload unused tabs and replace them with card renderings that could, on switch, be brought back to life in the exact same state the tab was in after it loaded, without a Safari-like flush-and-reload action that loses scroll position, etc. Saving a frozen, JS-less DOM provided a reasonable experience: we would just overlay it with a control that told the user it was a snapshot of the page from a specific date and allowed them to go to the current version.

That is to say, capturing an archive will not be able to fulfill that use case. So ideally, I think, the tool would capture both a DOM snapshot and a web archive, and provide a way to go from snapshot to archive to current page and vice versa.

@ikreymer

> This had been on my mind once again, and I recalled how I had settled on DOM snapshots as the preferred option. We were exploring an alternative medium for the web where we fused the notions of browser history and tabs:
>
> https://www.freecodecamp.org/news/lossless-web-navigation-with-trails-9cd48c0abb56/
> https://www.freecodecamp.org/news/lossless-web-navigation-spatial-model-37f83438201d/
>
> We ended up wanting to unload unused tabs and replace them with card renderings that could, on switch, be brought back to life in the exact same state the tab was in after it loaded, without a Safari-like flush-and-reload action that loses scroll position, etc. Saving a frozen, JS-less DOM provided a reasonable experience: we would just overlay it with a control that told the user it was a snapshot of the page from a specific date and allowed them to go to the current version.
>
> That is to say, capturing an archive will not be able to fulfill that use case. So ideally, I think, the tool would capture both a DOM snapshot and a web archive, and provide a way to go from snapshot to archive to current page and vice versa.

Thanks for sharing these! Yeah, I think it comes down to whether the 'JS-less DOM' is a reasonable experience or not, which of course depends on the content.

But the trails idea and various spatial navigations are definitely cool ideas, would be happy to chat about it at some point!
We considered trying to do something like the trails idea with webrecorder.io (now conifer.rhizome.org) a few years ago, but didn't have a chance to implement anything of the sort.

One other thing I've experimented with is saving the window.history stack and then recreating it via history.pushState. On a few sites, this can actually give you a way to replay a dynamic page that was achieved by several history navigations, and allow you to 'go back'. But it only works well on sites that 'play nice' with the history API. Hope to revisit this idea at some point.
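A rough sketch of that history idea, under a few assumptions: `window.history` does not expose past entries, so they have to be recorded as navigations happen; `createHistoryRecorder` and `replayHistory` are hypothetical names; and state objects are assumed to be JSON-serializable. This is an illustration, not code from any Webrecorder tool.

```javascript
// Record (state, url) pairs as the page navigates, since the browser
// offers no way to read earlier history entries back afterwards.
function createHistoryRecorder() {
  const entries = []
  return {
    record(state, url) {
      entries.push({ state, url })
    },
    // Deep copy, so the saved snapshot can be stored alongside the archive
    snapshot() {
      return JSON.parse(JSON.stringify(entries))
    }
  }
}

// Rebuild the stack on replay by pushing each saved entry in order.
// `historyApi` is window.history in a browser; anything with a
// pushState(state, unused, url) method works for testing.
function replayHistory(entries, historyApi) {
  for (const { state, url } of entries) {
    historyApi.pushState(state, '', url)
  }
}
```

In a page this would be called as `replayHistory(saved, window.history)`. Note that pushState only accepts same-origin URLs, so the replay has to run in a context where the recorded URLs are valid.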

@ikreymer

I wanted to finally share: we've just launched the ArchiveWeb.page Chrome extension, which allows for archiving any page in a Chrome-based browser (via the debug protocol, to get full fidelity).

The ArchiveWeb.page extension also includes experimental IPFS support, so users can archive a page (or several), then share via IPFS, and send a link to load via replayweb.page or via a regular. Here's more info in our guide

There was some work to optimize the system for on-demand loading of a web archive. A web archive may get sufficiently large that, to avoid pulling all the blocks in the multihash over to the preload node at once when loading via ReplayWeb.page, the system tries to pull only the blocks needed for a particular page.

The main issue for now is that if the websocket connection to the preload node is lost and is not reestablished, the sharing is stopped (I understand this is being addressed via: libp2p/js-libp2p#744)

Also looking to see how this can work better in Brave using the native IPFS node.

I'll sign up for the virtual meetup and am happy to talk more about this work!

@jessicaschilling
Contributor

Thanks for the update! If you'd be interested in presenting at one of the virtual meetups, send a note to ipfs-community@protocol.ai with some details about your extension and what you'd specifically be interested in presenting.

Also, consider submitting to Awesome IPFS? https://github.com/ipfs/awesome-ipfs/blob/master/CONTRIBUTING.md

@lidel
Member

lidel commented Mar 4, 2021

For drive-by readers, this video was recorded at a recent IPFS Community Meetup and gives a good overview and demo of the system created by @ikreymer.

I think for the time being we will aim at making single-page snapshots created by Companion a bit more useful (e.g. inline images and CSS so things look decent) so Firefox users have something, but for advanced archiving point at https://replayweb.page + a separate extension.

UI TBD.

Projects
No open projects
Status: Needs Grooming
Development

No branches or pull requests

7 participants