Wikipedia #20

Open
davidar opened this Issue Sep 16, 2015 · 57 comments

@davidar
Member

davidar commented Sep 16, 2015

In terms of being able to view this on the web, I'm tempted to push Pandoc through a Haskell-to-JS compiler like Haste.

CC: @jbenet

@rht

rht commented Sep 17, 2015

In this case, why does the xml -> html have to be done client-side?

On the archiver's machine:

get-dump dump/  # using any of the tools in https://meta.wikimedia.org/wiki/Data_dumps/Download_tools; there is one with rsync
dump2html -r dump/
ipfs add -r dump/ # and ipns it

(although, yes, it'd be much more convenient to just use pandoc as a universal markup viewer)
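
A minimal sketch of that archiver-side pipeline, assuming the public dumps.wikimedia.org URL layout; get-dump and dump2html above are pseudocode, with dump2html standing in for whichever XML-to-HTML converter ends up being used:

  # Fetch the latest English Wikipedia articles dump (several GB compressed).
  mkdir -p dump
  wget -c -P dump https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

  # Convert the XML dump into a tree of static HTML pages (hypothetical tool).
  dump2html -r dump/

  # Add the result to IPFS; `ipfs add -q -r` prints hashes, the last one being
  # the root directory, which can then be published under IPNS.
  ROOT=$(ipfs add -q -r dump/ | tail -n1)
  ipfs name publish "$ROOT"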

@davidar

Member

davidar commented Sep 17, 2015

That's also a possibility, but more time-consuming and inflexible.


@DataWraith


DataWraith commented Sep 18, 2015

I actually started on this a while ago, but stopped because it seemed silly for a single person to attempt it. Now that I see this issue, I think it might not have been such a bad idea after all:

I've been experimenting with a 15GiB dump of the English Wikipedia (compressed, without images), extracting HTML files using gozim and wget. This gave me a folder full of HTML pages that interlink nicely using relative links.

It took a couple of hours to extract every page reachable from 'Internet' within 2 hops, which amounted to about 1% of the articles in the dump, so it would take at least a week to create HTML pages for the entire dump. And since these HTML files are uncompressed, I'm not sure I have enough disk space available to do the complete dump, but I could repeat my initial trial and make it available on IPFS.

One problem I see with this approach is that the Creative Commons license requires attribution, which is not embedded in the HTML files gozim creates. If this way of doing it turns out to be worth pursuing, it might be possible to alter gozim to embed such license information. Or maybe we can simply put a LICENSE file in the top-most directory.


@davidar


Member

davidar commented Sep 19, 2015

@DataWraith Just had a look at the gozim demo, looks really cool. In the short term, this does seem like the best option (apologies for my terse reply earlier @rht :). Would it also be possible to do client-side search with something like https://github.com/cebe/js-search?

I'm not sure I have enough disk space available to do the complete dump

If you can give me a script, and an estimate of the storage requirements, I can run this on one of the storage nodes for you :)

One problem I see with this approach, is that the Creative Commons License requires attribution, which is not embedded in the HTML files gozim creates.

Are you sure? I can see:

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.

in the footer of http://scaleway.nobugware.com/zim/A/Wikipedia.html

Or maybe we can simply put a LICENSE-file in the top-most directory.

Definitely. See #25

@davidar davidar added the help wanted label Sep 19, 2015

@DataWraith


DataWraith commented Sep 19, 2015

@DataWraith Just had a look at the gozim demo, looks really cool. In the short term, this does seem like the best option (apologies for my terse reply earlier @rht :). Would it also be possible to do client-side search with something like https://github.com/cebe/js-search?

I'm no JavaScript expert, but I don't see why not. We could pre-compile a search index and store it alongside the static files. However, resource usage on the client may or may not be prohibitively large.

I'm not sure I have enough disk space available to do the complete dump

If you can give me a script, and an estimate of the storage requirements, I can run this on one of the storage nodes for you :)

There is no real script. It's literally:

  1. gozimhttpd -path <wikipedia-dump> -port 8080 -mmap
  2. wget -e robots=off -m -k http://localhost:8080/zim/A/Internet.html

This will crawl everything reachable from 'Internet'. It may be possible to directly crawl the index of pages itself, but I haven't tried that yet.

You probably need to wrap gozimhttpd in a while loop, because it tends to crash once in a while. As for storage requirements: the 60,000 articles I extracted take up 5GiB of storage, so a full dump of the 5,000,000 articles is probably on the order of 500GiB.
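
A minimal sketch of that crawl, with gozimhttpd wrapped in a restart loop as suggested (the dump filename is illustrative):

  # Keep gozimhttpd running even when it crashes.
  (while true; do
    gozimhttpd -path wikipedia_en_all_2015-05.zim -port 8080 -mmap
    echo "gozimhttpd exited; restarting..." >&2
    sleep 1
  done) &

  # Mirror everything reachable from the 'Internet' article, rewriting links
  # so the saved pages interlink as local files.
  wget -e robots=off -m -k http://localhost:8080/zim/A/Internet.html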

One problem I see with this approach, is that the Creative Commons License requires attribution, which is not embedded in the HTML files gozim creates.

Are you sure? I can see:

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.

in the footer of http://scaleway.nobugware.com/zim/A/Wikipedia.html

Hm. Maybe that's because they are using a different dump, or a newer version of gozim (though the latter seems unlikely); the pages I extracted don't have that footer.

I'm currently running ipfs add on the pages I have extracted, to get a proof-of-concept going. It's inserting the pages alphabetically, but it tends to crash around the 'D's, with an unhelpful 'killed' message. Possibly ran out of memory.


@davidar


Member

davidar commented Sep 19, 2015

I'm no JavaScript expert, but I don't see why not. We could pre-compile a search index and store it alongside the static files.

For context, this is what @brewsterkahle uses for his IPFS-hosted blog

However, resource usage on the client may or may not be prohibitively large.

Yeah, that was my concern too. If so, it might have to wait until #8

There is no real script. It's literally:

gozimhttpd -path <wikipedia-dump> -port 8080 -mmap
wget -e robots=off -m -k http://localhost:8080/zim/A/Internet.html

Too easy

a full dump of the 5,000,000 articles is probably on the order of 500GiB.

Ok, we'll have to wait until we get some more storage then.

I'm currently running ipfs add on the pages I have extracted, to get a proof-of-concept going. It's inserting the pages alphabetically, but it tends to crash around the 'D's, with an unhelpful 'killed' message. Possibly ran out of memory.

Thanks. Ping me on http://chat.ipfs.io to help debug.

@DataWraith


DataWraith commented Sep 19, 2015

Short progress update: I'm now feeding files to ipfs add in batches of 25; that seems to have solved the memory issue for now. I hope that feeding in the files piecemeal will prevent the crash that occurs when adding the entire directory at once. I'll probably be able to try adding the entire thing again tomorrow.
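
A minimal sketch of that batching, assuming the extracted pages live under dump/ (the path is illustrative; the batch size of 25 is the one mentioned above):

  # Hand the HTML pages to `ipfs add` 25 files at a time instead of passing
  # the whole tree to a single invocation; -q records just the hashes.
  find dump/ -type f -name '*.html' -print0 \
    | xargs -0 -n 25 ipfs add -q >> added-hashes.txt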

I also took another look at gozim. It is relatively easy to extract the HTML files without going through wget first -- I should've thought of that before coming up with the wget scheme. That way we won't miss any articles; I'll have to do more research on redirects though.

Quick & dirty dumping program here.


@DataWraith


DataWraith commented Sep 20, 2015

I had no luck getting ipfs add to ingest the HTML files; pre-adding the files in batches didn't help. ipfs (without the daemon running) consumed enough RAM to fill a 100GB swap file and then crashed with a runtime: out of memory error. A script I wrote to add files one by one using the object patch subcommand was too slow, taking 3 to 5 seconds per page, so I abandoned that approach.

There are two related issues describing problems with ipfs add. I'll try again once those are resolved.
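
For reference, the one-file-at-a-time approach described above looks roughly like this (a sketch; the dump/ path is illustrative):

  # Build a unixfs directory object and graft each page onto it with
  # `ipfs object patch add-link`; correct, but 3-5 seconds per page.
  ROOT=$(ipfs object new unixfs-dir)
  for f in dump/*.html; do
    HASH=$(ipfs add -q "$f")
    ROOT=$(ipfs object patch "$ROOT" add-link "$(basename "$f")" "$HASH")
  done
  echo "root: $ROOT"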


@davidar


Member

davidar commented Sep 23, 2015

@DataWraith Hmm, that's no good 😕. For the moment, could you tar/zip all the files together and add that?

CC: @whyrusleeping

@DataWraith


DataWraith commented Sep 23, 2015

Hi.

I've decided to delete the trial files obtained using wget, go all out, and try to dump the entire most recent English Wikipedia snapshot (with images) with my program. It's currently in the 'D's (1.3 million articles done) and I estimate it will finish in another 60 to 70 hours. I'll try adding the dump using the undocumented ipfs tar add, which did not seem to blow up memory-wise in the small trial I did. I'm not sure why that would be different from the normal ipfs add, but apparently it is. If that still fails, I'll run the tar archive through lrzip and upload that.

My initial estimate of space required was off, because the article sample I obtained using wget did not contain the small stub articles, of which there are many. The 1.3 million articles I have now add up to 40GiB, so, assuming that the distribution of article sizes is not skewed, we are looking at an overall size of about 160GiB plus maybe another 40GiB for the images. In addition, I'm using btrfs to store the dump, and its built-in compression support halves the actual amount of data stored, so size should not be a problem.

Edit: ipfs tar add is not much faster than the custom script I had cobbled together earlier. At 3 to 5 seconds per file, it'd take the better part of a year to add the entire dump. :/


@davidar


Member

davidar commented Sep 24, 2015

@DataWraith Awesome, can't wait to see it :)

Edit: ipfs tar add is not much faster than the custom script I had cobbled together earlier. At 3 to 5 seconds per file, it'd take the better part of a year to add the entire dump. :/

@whyrusleeping Please make ipfs add faster 🙏

@rht


rht commented Sep 24, 2015

@whyrusleeping

For scale (foo/ is 11 MB, 10 files of 1.1 MB each):

  • cp: cp -r foo bar 0.00s user 0.01s system 86% cpu 0.008 total
  • master: ipfs add -q -r foo >actual 0.13s user 0.04s system 10% cpu 1.582 total
  • master (no sync on flatfs): ipfs add -q -r foo > actual 0.11s user 0.03s system 102% cpu 0.136 total (the remaining time bloat comes from leveldb)
  • git: git add foo 0.00s user 0.00s system 84% cpu 0.006 total
  • rsync: rsync -r foo bar 5.16s user 1.18s system 108% cpu 5.840 total
  • tar: tar cvf foo.tar foo 0.00s user 0.01s system 95% cpu 0.013 total
  • ipfs tar add: ipfs tar add foo.tar 0.25s user 0.05s system 35% cpu 0.857 total

It appears that cp doesn't have an explicit call to fsync in its implementation: https://github.com/coreutils/coreutils/search?utf8=%E2%9C%93&q=fsync
(I think it's fine to not have an explicit sync call?)
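
For anyone who wants to reproduce the comparison, a sketch of the setup (file contents are random; only the sizes matter):

  # Recreate the benchmark input: foo/ holding 10 files of 1.1 MB each (~11 MB).
  mkdir -p foo
  for i in $(seq 1 10); do
    dd if=/dev/urandom of="foo/file$i" bs=1100k count=1 2>/dev/null
  done

  # Time a plain copy against the two ipfs paths measured above.
  time cp -r foo bar
  time ipfs add -q -r foo > actual
  tar cf foo.tar foo
  time ipfs tar add foo.tar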

@whyrusleeping


Member

whyrusleeping commented Sep 24, 2015

@davidar @rht okay, I'll make that top priority after UDT and ipns land.

@rht


rht commented Sep 24, 2015

(git does an explicit sync: https://github.com/git/git/blob/master/pack-write.c#L277
edit: but only on pack updates)

@rht


rht commented Sep 24, 2015

@davidar I get your point, which means either 1. "if someone can put the kernel in the browser, why not pandoc", or 2. "we need to be able to do more than just view a static simulated piece of paper" (more of what a "document"/"book" should be).
Though it is currently slow (e.g. pandoc pdf-to-html << (or maybe ~) pdf.js << browser plugin for PDF).

As for the client-side search, it works for small sites, but for huge sites (Wikipedia?), transporting the index files to the client seems like too much.

@rht


rht commented Sep 24, 2015

I wonder if some of the critical operations should be offloaded to FPGA.

@davidar


Member

davidar commented Sep 25, 2015

  1. "if someone can put the kernel on the browser, why not pandoc", or 2. "we need to be able to do more than just viewing static simulated piece of paper" (more of what a "document"/"book" should be).

Uh oh, which side of this argument am I on now? #25 @jbenet

with the client-side search, it works for small sites, but for huge sites (wikipedia?), transporting the index files to the client seems to be too much.

The idea is that you'd encode the index as a trie and dump it into IPLD, so the client would only have to download small parts of the index to answer a query.
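
One way to picture that (a sketch, assuming the index is laid out as nested directories with one node per prefix character and a postings file at each node; all names here are hypothetical):

  # Resolving the prefix "int" only fetches the few small objects along the
  # path, not the whole index.
  ipfs cat /ipfs/<index-root>/i/n/t/postings.json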

@rht


rht commented Sep 25, 2015

The idea is that you'd encode the index as a trie and dump it into IPLD, so the client would only have to download small parts of the index to answer a query.

And this can be repurposed for any 'pre-computed' stuff, not just search indexes? E.g. content sorted/filtered by paramX, or entire SQL queries (ipfs/ipfs#82)?

@davidar


Member

davidar commented Sep 26, 2015

@rht Yes, I would think so; I don't see any reason why it wouldn't be possible to build a SQL database format on top of IPLD (albeit non-trivial).

@davidar


Member

davidar commented Sep 26, 2015

@rht looks like someone already beat me to it: http://markup.rocks

@rht


rht commented Sep 27, 2015

@davidar by a few months. Very useful to know that it is fast.
Currently imagining the possibilities.

Also, found this http://git.kernel.org/cgit/git/git.git/tree/Documentation/config.txt#n693:

This is a total waste of time and effort on a filesystem that orders data writes properly, but can be useful for filesystems that do not use journalling (traditional UNIX filesystems) or that only journal metadata and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").

@whyrusleeping disable fsync by default and add a config flag to enable it? (wanted to close the gap with git, which is still 2 orders of magnitude away).

@davidar


Member

davidar commented Sep 27, 2015

Very useful to know that it is fast.

Yeah, Haskell is high-level enough that it tends to compile to JS reasonably well. The FP Complete IDE is also written in a subset of Haskell.

Currently imagining the possibilities.

Something like the ipfs markdown viewer but using pandoc would be cool.

rht referenced this issue in ipfs/go-datastore on Sep 27, 2015: Add sync flag to flatfs #30 (closed)

@davidar

Member

davidar commented Sep 27, 2015

@rht


rht commented Sep 27, 2015

@davidar Saw it, neat. I.e. it's pandoc, but without the huge GHC stuff, the cabal-install ritual, etc.
It's a pandoc.

Yeah, Haskell is high-level enough that it tends to compile to JS reasonably well.

But so do Python, Ruby, ... You mean a sane type system?
https://github.com/faylang/fay/wiki says Fay doesn't have GHC's STM or concurrency, which is fine.

This has nice things like:

Additionally, because all Fay code is Haskell code, certain modules can be shared between the ‘native’ Haskell and ‘web’ Haskell, most interestingly the types module of your project. This enables two things:
The enforced (by GHC) coherence of client-side and server-side data types. The transparent serializing and deserializing of data types between these two entities (e.g. over AJAX).

(I haven't actually looked at a minimalist typed λ-calculus metacircular evaluator (the kind people write (or chant) every day for the untyped ones).)

@davidar


Member

davidar commented Sep 28, 2015

... You mean sane type system?

Yeah, I meant among the languages with a type system strong enough to be able to produce optimised code.

@jamescarlyle


jamescarlyle commented Sep 29, 2015

The source of the v.basic wiki editor referenced by David is at https://github.com/jamescarlyle/ipfs-wiki


@rht


rht commented Sep 29, 2015

How do I make this work? The text I typed didn't show up.

@jamescarlyle


jamescarlyle commented Sep 29, 2015

@rht, sorry, I posted it without any public testing. I've added the briefest of READMEs to the GH repo - specifically, "There is a current dependency on a local daemon listening on port 5001 (this is the default port for the IPFS daemon), in order to both fetch content and save changes. This means that the IPFS gateway used to serve the js also needs to use the same protocol, i.e. http rather than https." So running a daemon and serving locally should be fine. Will get to running via a public gateway in due course; sorry about that.


@DataWraith


DataWraith commented Sep 29, 2015

Hi all,

The Wikipedia dump is finished. I packed it into a single .tar file weighing in at 176GB, which lrzip then compressed down to 42GB. My internet connection, while decent, will still take its time to upload that much data; I'll edit this post with a Dropbox link to the file once the upload is done.

Edit: Dropbox link: https://www.dropbox.com/s/7ut0g1mdbwuq393/wikipedia_en_all_2015-05.tar.lrz?dl=0
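
A sketch of that packing step, for anyone repeating it (directory and file names are illustrative; lrzip's long-range matching is what exploits the redundancy across articles):

  # Pack the extracted dump into one archive, then compress it with lrzip;
  # this produces wikipedia_en_all_2015-05.tar.lrz alongside the .tar.
  tar cf wikipedia_en_all_2015-05.tar wikipedia_en_all_2015-05/
  lrzip wikipedia_en_all_2015-05.tar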


@jbenet


Member

jbenet commented Sep 30, 2015

Maybe should be put directly to one of our storage nodes with scp.

@davidar


Member

davidar commented Sep 30, 2015

Maybe should be put directly to one of our storage nodes with scp.

@DataWraith let me know if there's anything I can do to help with this

CC: @lgierth

@rht


rht commented Sep 30, 2015

(I see; for the ipfs-wiki, I was blocked by CORS...)
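
For anyone else hitting that, the usual workaround is to allow the page's origin in the daemon's API CORS headers (a sketch; the wildcard is permissive and only sensible for local testing):

  # Let browser pages talk to the local API on port 5001, then restart the daemon.
  ipfs config --json API.HTTPHeaders.Access-Control-Allow-Origin '["*"]'
  ipfs config --json API.HTTPHeaders.Access-Control-Allow-Methods '["GET", "POST", "PUT"]'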

@davidar


Member

davidar commented Oct 4, 2015

@DataWraith Awesome, downloading now :)

@davidar


Member

davidar commented Oct 5, 2015

@DataWraith And now it's on IPFS 🎈

@whyrusleeping Looking forward to ipfs add being fast enough to handle the extracted version ;)

@DataWraith


@davidar Awesome!

@whyrusleeping


Member

whyrusleeping commented Oct 5, 2015

@davidar It's very high on my todo list.

@davidar

Member

davidar commented Oct 6, 2015

@rht


rht commented Nov 28, 2015

This can proceed once ipfs/go-ipfs#1964 + ipfs/go-ipfs#1973 are merged (pending @jbenet's CR).
nosync is still not sufficient.

@davidar


Member

davidar commented Nov 28, 2015

@rht That's awesome :). Are you also testing perf on spinning disks (not just SSDs)? It seems to be the random access latency that really kills perf.

Edit: also make sure the test files are created in a random order (not in lexicographical order).

@rht


rht commented Nov 28, 2015

The first reduces the number of operations needed (including disk IO), so it will make add on HDD faster. For the second, channel iterators in Go have been reported to be slow (though I'm not sure of their direct impact on disk IO), so it should also make add on HDD faster.

@jbenet


Member

jbenet commented Dec 1, 2015

on it! (cr)

@DataWraith


DataWraith commented Dec 2, 2015

I'm trying out those pull requests on the Wikipedia dump right now. ipfs tar add still crashed with an out-of-memory error, but plain ipfs add -r -H -p . is chugging along nicely. It's been running for almost 12 hours now, so hopefully it's not going to crash.

It has added the articles starting with numbers, and is now working on the articles starting with A, so it'll be a while until the whole dump is processed.


@jbenet


Member

jbenet commented Dec 2, 2015

@DataWraith thanks, good to hear -- btw, dev0.4.0 has many interesting perf upgrades, with flags like --no-sync which should make it much faster.

@dignifiedquire


Member

dignifiedquire commented Jan 10, 2016

ipfs add is much faster in 0.4; maybe we can revisit this and try to set up a script to constantly update the mirrored version in IPFS.

@eminence


Collaborator

eminence commented Jan 11, 2016

Instead of working with the massive Wikipedia, I've been playing with the smaller, but still sizable Wikispecies project. It has 439,460 articles, and is about 4.5 GB on disk.

I've imported the static HTML dumps from the Kiwix openZIM dump files. The dump to disk took less than 10 minutes, and the import into ipfs (with ipfs040 and Datastore.NoSync: true) took about 3 or 4 hours.

It's browsable on my local gateway, but I've not been able to get the site to load on the ipfs public gateways. Can any of you try?

http://localhost:8120/ipfs/QmbZp1H1mCbVSiD2K8xpFFhzRGoLJTU6E4keY9WQpyuxP1/A/index.htm

(edit Jan 14th -- after upgrading my nodes to master branch, I stopped running my dev040 node, so this hash is no longer available. Stay tuned for updates)
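
For reference, the NoSync setting mentioned above can be flipped with the config command before the import (a sketch; the wikispecies/ path is illustrative, and the setting trades crash-safety of the repo for add speed):

  # Disable per-write fsync in the flatfs datastore, restart the daemon,
  # then run the import with the same flags used earlier in this thread.
  ipfs config --json Datastore.NoSync true
  ipfs add -r -H -p wikispecies/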

lgierth referenced this issue in ipfs/pm on Jan 12, 2016: Sprint January 11th #79 (closed)

@davidar


Member

davidar commented Jan 12, 2016

I've not been able to get the site to load on the ipfs public gateways

Same :/

@eminence


Collaborator

eminence commented Jan 17, 2016

OK, here is my next iteration on this project:

http://v04x.ipfs.io/ipfs/QmV6H1quZ4VwzaaoY1zDxmrZEtXMTN1WLJHpPWY627dYVJ/A/20/8f/Main_Page.html

This is also an IPFS-hosted version of Wikispecies, but with one major change:

Instead of having every article in one massive folder, each article has been partitioned into sub-folders based on the hash of the filename. For articles, there are two levels of hashing, and for images there is one level of hashing.

The goal of this is to reduce the number of links in the A/ and I/m nodes, since they appeared to be too large to load via the public IPFS gateways. I think in this regard, this has been successful.
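
A sketch of that partitioning scheme, matching the /A/20/8f/Main_Page.html layout in the link above (the choice of md5 over the filename is an assumption; any stable hash of the name works):

  # Move each article into A/<h1>/<h2>/ where h1 and h2 are the first two
  # hex-byte pairs of a hash of its filename, keeping every directory small.
  partition() {
    local name="$1"
    local h
    h=$(printf '%s' "$name" | md5sum | cut -c1-4)
    local dir="A/${h:0:2}/${h:2:2}"
    mkdir -p "$dir"
    mv "$name" "$dir/"
  }

  partition "Main_Page.html"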

However, there still seem to be some issues. As I browse around the Main_Page.html link (see above), sometimes the page will load quickly and instantly. Other times, images will be missing, the page will load slowly, or maybe even not at all. This is true even for pages that I've visited already (and thus should be in the gateway's cache)

I can't really tell what's going on here. Running ipfs refs on these hashes from another node of mine works pretty flawlessly. So I conclude the problem might not be with my node. But I'm not sure what other debugging tricks I can use to get to the bottom of this. I think this is a fairly important issue to resolve.

Finally, here are the two tools I wrote in the process of working on this:

ZIM dumping takes a few minutes, wiki_rewriting takes less than an hour, and ipfs add -r probably took a few hours. In all cases, I appear to be disk-IO bound.

@whyrusleeping


Member

whyrusleeping commented Jan 17, 2016

@eminence this is great! It also further emphasizes the fact that we need to figure out directory sharding. I'll think on this today and see what I come up with.

Keep up the good work :)

@jbenet


Member

jbenet commented Jan 19, 2016

@whyrusleeping note that directory sharding will go on top of IPLD, and that it should work for arbitrary objects (not just unixfs directories). Take a look at the spec; we can use another directive there.

@rht


rht commented Feb 4, 2016

https://strategy.m.wikimedia.org/wiki/Proposal:Distributed_Wikipedia

(last updated ~3.5 years ago, but penned ~7 years ago)

@davidar


Member

davidar commented Feb 4, 2016

@rht yeah I know, but it might still be relevant

@donothesitate


donothesitate commented Jan 19, 2017

The question is whether we want a static-HTML-only version, a dynamic one, or both.
In the static case, the storage or filesystem where the data lives can use compression.

In the dynamic case, using a Service Worker, zlib compression with a dictionary, and XML entries stored compressed, one could quickly fetch an article, render it as HTML, and link it in a pre-determined way, with an optional fallback in the Service Worker to the real Wikipedia.

The XML wiki dump compressed with xz in 256k chunks, without a dictionary, equals the size of the bzip2 XML dump, and that is 13GB. Given English text and a pre-made zlib dictionary, I believe one can get to a nicer number.

As for search, a JS variant over the terms only, with suggestions of top terms, could work well.

Edit: I'm tempted by ZIM files, having each cluster as a (raw) block.
Edit: Extracted a 1/1000 sparse sample of the enwiki XML dump (105/13MB):
https://ipfs.io/ipfs/QmVYQwcq5jMnEjL1oXiFhED8Gp7S1um1wBHEjJrqWH3bzb/enwiki-20170101-pages-articles-1000th-sample.xml.7z

Edit: The only way to get good compression with widespread compression methods seems to be clustering; compressing per record results in 4-5x the size, which leads back to storage-level compression.
The only other way would be a purpose-crafted dictionary + Huffman coder.
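
A sketch of the chunked xz run described above (the dump filename is illustrative; independent blocks are what make per-article random access possible without decompressing the whole archive):

  # Compress the XML dump in independent 256 KiB blocks, keeping the original.
  xz -T0 --keep --block-size=256KiB enwiki-20170101-pages-articles.xml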

flyingzumwalt referenced this issue in ipfs/distributed-wikipedia-mirror on May 1, 2017: Gather background info from other repositories and add to this one #6 (closed)
