Mirroring Web to IPFS #94

Open
1 of 2 tasks
lidel opened this issue Jul 24, 2018 · 7 comments

Comments

@lidel
Member

lidel commented Jul 24, 2018

This is a meta-issue tracking related work and discussions (moved from ipfs/ipfs-companion#96).

Feasible

More Design Work Required

Saving reproducible snapshot of entire page load

Automatic mirroring of standard websites to IPFS as you browse them (ipfs/ipfs-companion#535)


Related Discussions

2016-03-26

IRC log about mirroring SRI2IPFS
165958           geir_ │ lgierth: The web sites would have to link to ipfs content for this plugin to work. What i propose is a proxy that works like a transparent proxy and puts content into ipfs if it's not already there
170124            ed_t │ anyone know anything about ipfs-boards
170141            ed_t │ it keeps telling me I am in limited mode
170202            ed_t │ a full ipfs 0.40-rc3 node is running on localhost:5001
170217            ed_t │ but it does not seem to see it using the demo link
170228        +lgierth │ geir_: ah got what you wanna do -- i'm not sure you can easily just rewrite anything
170253        +lgierth │ for completely static pages, yes, but for slightly more dynamic stuff?
170303        +lgierth │ i'll be back in a bit, getting some coffee
170422           geir_ │ lgierth: I mean only for the static stuff like images, libs and so on. Should be pretty straightforward to implement. And a big bandwidth save for big networks
171542           lidel │ geir_, we are planning to add "host to ipfs" feature to the addon
171614           lidel │ when that is done, it should be easy to add option to automatically add every visited page
171634           lidel │ not sure how addon would do lookups tho
171734           lidel │ (meaning, how do i know the multihash of the page, how do we handle ipfs-cache expiration when page gets updated, etc)
171831           geir_ │ lidel: I see, thanks for the info. I still like the idea of a transparent proxy so every user/device on the network will use the "cdn" automatically
171852           lidel │ perhaps we could start with mirroring static assets that have SRI hash (https://www.srihash.org/)
171920           lidel │ and come up with a way for doing SRI2IPFS lookups
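
The SRI2IPFS lookup floated at the end of the log can be illustrated with a minimal sketch. It assumes the asset was added to IPFS as a single raw block with a sha2-256 multihash. Only in that case is the SRI digest the same hash the CID carries. Larger files are chunked into a DAG, so their root CID cannot be derived from the SRI value alone and would need a separate lookup table. The function names `sri_to_raw_cid` and `sri_for` are hypothetical and not part of any IPFS library.

```python
import base64
import hashlib

# Constants from the multiformats tables.
CID_V1 = 0x01        # CID version 1
RAW_CODEC = 0x55     # "raw" multicodec: the block is the file bytes themselves
SHA2_256 = 0x12      # multihash code for sha2-256
SHA2_256_LEN = 0x20  # digest length in bytes (32)


def sri_to_raw_cid(integrity: str) -> str:
    """Turn a single 'sha256-<base64>' SRI value into a CIDv1 (raw, base32).

    Assumption: the content lives in IPFS as one raw block. For chunked
    files the resulting CID will not match what `ipfs add` produced.
    """
    algo, _, b64_digest = integrity.strip().partition("-")
    if algo != "sha256":
        raise ValueError("this sketch only handles sha256 SRI values")
    digest = base64.b64decode(b64_digest)
    if len(digest) != SHA2_256_LEN:
        raise ValueError("unexpected digest length for sha256")
    cid_bytes = bytes([CID_V1, RAW_CODEC, SHA2_256, SHA2_256_LEN]) + digest
    # Multibase base32: lowercase, no padding, prefixed with 'b'.
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32


def sri_for(data: bytes) -> str:
    """Compute the sha256 SRI value a page would put in integrity="..."."""
    digest = hashlib.sha256(data).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    asset = b"console.log('hello');\n"
    print(sri_to_raw_cid(sri_for(asset)))
```

This covers only the "how do I know the multihash" half of the question raised in the log. Cache expiration when a page changes is not addressed, although an SRI-pinned asset is immutable by definition, which is what makes it a reasonable starting point.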

2015+

2018-01-14

2018-03-08

2018-07-09

2018-07-23

@LoveIsGrief

It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.

@mitra42

mitra42 commented Feb 21, 2019

Sure, I'd be happy to talk. dweb.archive.org doesn't do it for web pages (yet), but it does mirror some of the content accessed through dweb-gateway to the IPFS HTTP API. (Not all of it, because IPFS loses data and gives no error result or fallback when it can't find something.)

Note that we also use urlstore as our primary mirroring mechanism, because we have the opposite concern to you: we can't replicate 50 petabytes, so we just push the reference, and the most-used items get mirrored by IPFS. An upcoming version will also pull items via IPFS as an alternative to a direct fetch from the archive.

I also wrote dweb.mirror, a crawler specialized for archive.org items (not the Wayback Machine yet) that mirrors everything to IPFS.

@jimpick

jimpick commented May 3, 2019

I'll be going to csv,conf next week. It will be another chance to talk more with @ikreymer, who is giving a talk on WARC files: https://csvconf.com/speakers/#ilya-kreymer

@RubenKelevra

> It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.

How about asking archive.org whether we could help them by cooperating? I'm sure they have issues with crawling capacity.

Archive.org could provide data in IPFS when a given URL has been captured. If the capture is some days old, we could ask the user whether they would like to capture the URL (asking first because they might be logged in, or personal information might currently be entered in a form or similar). If they agree, we share the snapshot in IPFS (somehow; I have no idea how this would technically work to make it locatable by URL and timestamp). archive.org could pin it or download it for display on their website.
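
One way to picture "locatable by URL and timestamp" is a small index that maps each URL to its captures and returns the most recent one at or before a requested time. The sketch below is purely illustrative: the `SnapshotIndex` class, its methods, and the placeholder CID strings are hypothetical, and it says nothing about how archive.org or IPFS would actually publish or distribute such an index. Timestamps are assumed to be 14-digit `YYYYMMDDhhmmss` strings, so string order matches chronological order.

```python
import bisect
from collections import defaultdict
from typing import Optional, Tuple


class SnapshotIndex:
    """In-memory map: URL -> captures sorted by timestamp (hypothetical)."""

    def __init__(self) -> None:
        # url -> sorted list of (timestamp, cid) tuples
        self._by_url = defaultdict(list)

    def add(self, url: str, timestamp: str, cid: str) -> None:
        """Record that `url` was captured at `timestamp` and stored as `cid`."""
        bisect.insort(self._by_url[url], (timestamp, cid))

    def latest_at_or_before(self, url: str, timestamp: str) -> Optional[Tuple[str, str]]:
        """Return the newest (timestamp, cid) not later than `timestamp`."""
        captures = self._by_url.get(url, [])
        stamps = [ts for ts, _ in captures]
        i = bisect.bisect_right(stamps, timestamp)
        return captures[i - 1] if i else None


if __name__ == "__main__":
    index = SnapshotIndex()
    # The CIDs below are placeholders, not real content identifiers.
    index.add("https://example.org/", "20190101000000", "placeholder-cid-a")
    index.add("https://example.org/", "20190603053135", "placeholder-cid-b")
    print(index.latest_at_or_before("https://example.org/", "20190401000000"))
```

A lookup that returns `None`, or a capture older than some threshold, would be the point at which the user is asked whether they want to contribute a fresh snapshot.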

@ikreymer

Hi, I've just recently launched https://replayweb.page/ (https://github.com/webrecorder/replayweb.page) which is a full browser-based web archive replay system ('wayback machine'), using service workers. The system can load web archives from a variety of locations, and could be expanded to support IPFS.

In fact, it can trivially work using an IPFS gateway already:
https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/?source=https%3A%2F%2Fgateway.pinata.cloud%2Fipfs%2FQmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX%2Fexamples%2Fnetpreserve-twitter.warc#view=replay&url=https%3A%2F%2Ftwitter.com%2Fnetpreserve&ts=20190603053135
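
The structure of that link can be spelled out in a short sketch: the viewer is served from a gateway path, the archive location goes in the `source` query parameter, and the page to replay goes in the fragment as `view`, `url`, and `ts`. The parameter names are read off the example link above rather than taken from ReplayWeb.page documentation, and the helper name `replay_link` is hypothetical.

```python
from urllib.parse import quote


def replay_link(viewer_base: str, archive_url: str, page_url: str, timestamp: str) -> str:
    """Build a replay link shaped like the example above.

    viewer_base: gateway URL where the viewer is hosted (ends with '/').
    archive_url: location of the WARC file to load.
    page_url:    original URL of the captured page.
    timestamp:   capture time as YYYYMMDDhhmmss.
    """
    # safe="" percent-encodes ':' and '/' too, matching the example link.
    source = quote(archive_url, safe="")
    target = quote(page_url, safe="")
    return f"{viewer_base}?source={source}#view=replay&url={target}&ts={timestamp}"


if __name__ == "__main__":
    base = "https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/"
    print(
        replay_link(
            base,
            base + "examples/netpreserve-twitter.warc",
            "https://twitter.com/netpreserve",
            "20190603053135",
        )
    )
```

With these inputs the function reproduces the link above, which shows that both the viewer and the archive are fetched through the same gateway path.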

It should be possible to extend it to support ipfs:// URLs, or perhaps going through a gateway could work as well (though Cloudflare specifically does not allow service workers).

ReplayWeb.page is the latest tool from Webrecorder, here's also a blog post announcing it:
https://webrecorder.net/2020/06/11/webrecorder-conifer-and-replayweb-page.html#introducing-replaywebpage

@lidel
Member Author

lidel commented Feb 26, 2021

Relevant demo/status update of @ikreymer's work: https://www.youtube.com/watch?v=evcSETnTBf0
