This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Notes on building and manipulating large IPLD graphs in JavaScript and LevelDB. #1416

Closed
mikeal opened this issue Jun 29, 2018 · 7 comments

@mikeal
Contributor

mikeal commented Jun 29, 2018

Because I need to use the data, and as a good project to learn the
ecosystem, I took on the task of loading gharchive into IPFS/IPLD.

A few things about gharchive.

  • A single gzipped JSON file for every hour of GitHub activity.
  • A typical hour in 2018 contains 20K-70K JSON objects.
  • Does not contain any code, just the activity metadata for a range
    of GitHub activities.
  • Files for 2018 range from 20-100MB compressed.

Because of what I need to do with the data, each event object needs to
live at 3 paths.

/repos/:owner/:repo/:year/:month/:day/:time
/actors/:login/:year/:month/:day/:time
/timestamps/:year/:month/:day/:hour/:minute/:hash

So an hour of data will translate into 60K-210K inserts deep
into the graph.
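
To make that concrete, here's a rough sketch (not the actual loader code) of how one event fans out into those three paths; the field names follow the gharchive JSON shape and hash stands in for the event's hash.

// Rough sketch (not the actual loader): how one gharchive event fans out
// into the three paths above. Field names (repo.name, actor.login,
// created_at) follow the gharchive JSON shape; `hash` stands in for the
// event's content hash.
function pathsForEvent (event, hash) {
  const [owner, repo] = event.repo.name.split('/')
  const d = new Date(event.created_at)
  const ymd = `${d.getUTCFullYear()}/${d.getUTCMonth() + 1}/${d.getUTCDate()}`
  const time = `${d.getUTCHours()}:${d.getUTCMinutes()}:${d.getUTCSeconds()}`
  return [
    `/repos/${owner}/${repo}/${ymd}/${time}`,
    `/actors/${event.actor.login}/${ymd}/${time}`,
    `/timestamps/${ymd}/${d.getUTCHours()}/${d.getUTCMinutes()}/${hash}`
  ]
}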

Because I'll eventually want to pull parts of this data into the browser,
I decided to compress the JSON objects with gzip. Each event becomes a
raw node of gzipped JSON.
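
A minimal sketch of that step, with a sha256 hex digest standing in for the real raw-block CID:

const zlib = require('zlib')
const crypto = require('crypto')

// Minimal sketch: gzip each event and key the resulting raw block by a
// hash of its bytes. The real pipeline keys blocks by CID (raw codec);
// a sha256 hex digest just shows the shape of it.
function eventToRawBlock (event) {
  const data = zlib.gzipSync(Buffer.from(JSON.stringify(event)))
  const key = crypto.createHash('sha256').update(data).digest('hex')
  return { key, data }
}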

I started at the top of the IPFS stack and worked my way down as each layer
couldn't satisfy the performance requirements (eventually each layer would
take longer than an hour to insert an hour of data from the archive).

Before I explain the libraries and methods I finally ended up at, here's
the current performance parsing through 2018 on my laptop.

{ filename: '2018-01-07-14.json.gz' }
{ cid: 'zdpuAt6Hs61V1NPn7sudZeXmRsHC6RcKP7XSer4BcDvP7T5ta',
  events: 46886,
  walked: 106402,
  totalTime: '173s',
  processTime: '78s',
  graphBuildTime: '80s' }

Note: the times for "processing" vs. "graph build" are not entirely accurate
since, as you'll read later, a lot of pre-optimizations for the graph build
phase happen during processing.

So, with 7 days of activity already in the graph it takes 173 seconds to load an
hour of data. The increase in time is also leveling off the more data that gets
inserted. This run wrote 46886 gzipped JSON blocks plus 106402 new dag-cbor
nodes, with probably ~70K reads of dag-cbor nodes.

Performance

While there are some performance limitations of merkle graphs compared to
other common database structures there's one key advantage:
predictability.

Typically in databases you take a set of writes for a transaction and have
to commit them in strict reverse order. Often, you need information from
each of those writes in order to perform the next write. But with a merkle
graph you can compute the entire graph before a single block is written
to disk. This means that optimizing write performance on disk comes down
to writing batches of key/value pairs as fast as the underlying store can
handle.
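
As a rough sketch of that idea: build every node bottom-up in memory, collecting key/value pairs as you go, then hand the whole thing to the store in one batch. Here JSON + sha256 stand in for dag-cbor + CIDs, and db is assumed to be a levelup-style store whose batch() returns a promise.

const crypto = require('crypto')

// Rough sketch of "compute the whole graph before writing a single block".
// JSON + sha256 stand in for dag-cbor + CIDs; `db` is assumed to be a
// levelup-style store whose batch() returns a promise.
function buildThenWrite (db, rootNode) {
  const ops = []
  const walk = (node) => {
    const out = {}
    for (const [name, value] of Object.entries(node)) {
      // treat nested objects as child nodes and replace them with links
      out[name] = (value && typeof value === 'object') ? { '/': walk(value) } : value
    }
    const data = Buffer.from(JSON.stringify(out))
    const key = crypto.createHash('sha256').update(data).digest('hex')
    ops.push({ type: 'put', key, value: data })
    return key
  }
  const rootKey = walk(rootNode)
  return db.batch(ops).then(() => rootKey)   // every block lands in one batch at the end
}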

The basic strategy I took for performance falls into 3 categories.

  • level batch writes.
  • Bulk graph building.
  • Plain old JS performance. For a minute there I forgot about this because
    I was too focused on IO, and my largest performance regression was in a
    simple function implemented poorly. Afterwards I got in the habit of
    perf profiling every few days.

ipld-store

I needed a new bulk interface closer to leveldb so I built a
simple key/value store for ipld.

The most important part here is the bulk interface it exposes. Rather
than wait to commit pending transactions until flush() is called, this
bulk interface is continuously writing a queue of transactions using the
level.batch() interface. I've spent enough time in the level ecosystem
to know that the batch interface is significantly more performant than trying
to send a lot of concurrent writes to leveldb. However, I still don't want
to queue too many transactions as Node.js will get very slow, so I have a 1GB
limit on the total size of the pending blocks. The bulk.put() interface
returns a promise that immediately resolves unless this limit is reached so
that a consumer can still implement some back-pressure.
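
A minimal sketch of that pattern (not the actual ipld-store code; the cap constant and class name are mine): puts go into an in-memory queue that is continuously drained with level's batch(), and put() only hands back an unresolved promise once the pending bytes cross the cap.

// Minimal sketch of the bulk-write pattern described above (not the real
// ipld-store code). `db` is assumed to be a levelup-style store whose
// batch() returns a promise. MAX_PENDING is my name for the 1GB cap.
const MAX_PENDING = 1024 * 1024 * 1024

class BulkWriter {
  constructor (db) {
    this.db = db
    this.queue = []
    this.pendingBytes = 0
    this.draining = null
  }
  put (key, value) {
    this.queue.push({ type: 'put', key, value })
    this.pendingBytes += value.length
    this._drain()
    // resolve immediately unless we're over the cap, so callers can
    // apply back-pressure by awaiting put()
    return this.pendingBytes < MAX_PENDING ? Promise.resolve() : this.draining
  }
  _drain () {
    if (this.draining) return
    this.draining = (async () => {
      while (this.queue.length) {
        const ops = this.queue
        this.queue = []
        await this.db.batch(ops)
        this.pendingBytes -= ops.reduce((n, op) => n + op.value.length, 0)
      }
      this.draining = null
    })()
  }
  flush () {
    return this.draining || Promise.resolve()
  }
}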

ipld-complex-graph-builder

I really liked js-ipld-graph-builder but I needed something
a little better suited to working with such an incredibly large and
complicated graph, and I needed something that would take advantage of
the ipld-store's bulk interface.

It currently only supports inserts, because that's all I need right now :)

All inserts into the graph have their block sent immediately to the bulk
writer, and the CID for the link is put into a nested Map object for the
graph build later on.

Additionally, all the reads the graph will need later when it builds the
graph on flush() are "primed". By the time the graph build actually happens,
all the dag-cbor nodes it needs to read are already in a Set() as a resolved
or unresolved promise (unresolved if the read hasn't returned yet).
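
Roughly, the insert path looks like this (a paraphrase, not the actual ipld-complex-graph-builder code; here a Map of promises plays the role of that Set):

// Paraphrased sketch of the insert path (not the real code). `bulk` is a
// bulk writer like the one sketched above; `getNode` reads an existing
// dag-cbor node and returns a promise.
class GraphBuilder {
  constructor (bulk, getNode) {
    this.bulk = bulk
    this.getNode = getNode
    this.links = new Map()    // nested Maps: path segment -> Map | CID
    this.primed = new Map()   // node key -> in-flight or resolved read promise
  }
  insert (pathSegments, cid, block) {
    // 1. send the block to the bulk writer immediately
    this.bulk.put(cid.toString(), block)
    // 2. record the link in nested Maps for the graph build on flush()
    let level = this.links
    for (const segment of pathSegments.slice(0, -1)) {
      if (!level.has(segment)) level.set(segment, new Map())
      level = level.get(segment)
    }
    level.set(pathSegments[pathSegments.length - 1], cid)
    // 3. "prime" the reads flush() will need, so they're already in flight
    const root = pathSegments[0]
    if (!this.primed.has(root)) this.primed.set(root, this.getNode(root))
  }
}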

I also needed to find an efficient way to shard a few parts of the graph.

Rather than using hamt, which would require occasional re-balancing, I
went with a "static" approach. I already have a rough idea of the total
number of keys I'll need at a few layers (a couple million actors and
owners), so I can just create a two-level shard up-front using simple
numeric hashing.

The final structure means, rather than the previous paths, you end up with
this:

/repos/:shard1/:shard2/:owner/:repo/:year/:month/:day/:time
/actors/:shard1/:shard2/:login/:year/:month/:day/:time
/timestamps/:year/:month/:day/:hour/:minute/:hash
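
A minimal sketch of that kind of static two-level shard (the hash function and fan-out here are illustrative, not the exact ones I used):

// Minimal sketch of a static two-level shard; the hash and the fan-out
// are illustrative. A couple million keys over 256 * 256 buckets keeps
// each bucket node comfortably small.
function shardPath (key, width = 256) {
  let n = 0
  for (const char of key) {
    n = (n * 31 + char.charCodeAt(0)) >>> 0   // simple numeric hash
  }
  return [String(n % width), String(Math.floor(n / width) % width)]
}

// e.g. /actors/:shard1/:shard2/:login/... where
// const [shard1, shard2] = shardPath('mikeal')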

Additionally, the occasional edge node ends up being larger than the current
serializers can handle. This is rare and the node doesn't mutate, so hamt
is a bit of overkill, and there isn't a simple off-the-shelf way to support it
with dag-cbor yet. This is my current workaround,
which transparently breaks up and re-assembles nodes, limited to 500 keys each.
It is not ideal and it complicates the block writer a bit, but it works fine for
now and doesn't break any of the linking, so the whole graph is still visible/pinnable
by any other IPLD implementation (IPFS).
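
The idea, roughly (an illustration, not the workaround itself): a node with more than 500 keys is split into chunks, each chunk is written as its own node, and a small parent node keeps links to the chunks, so the structure is still plain IPLD links that anything can walk.

// Illustration of the idea (not the actual workaround): split an
// oversized node into <=500-key chunks and link to them from a small
// parent node. `putNode` serializes a node, writes it, and returns its CID.
const LIMIT = 500

async function putLargeNode (node, putNode) {
  const keys = Object.keys(node)
  if (keys.length <= LIMIT) return putNode(node)
  const parent = {}
  for (let i = 0; i < keys.length; i += LIMIT) {
    const chunk = {}
    for (const k of keys.slice(i, i + LIMIT)) chunk[k] = node[k]
    parent[`chunk-${i / LIMIT}`] = { '/': await putNode(chunk) }   // link to chunk
  }
  return putNode(parent)
}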

Conclusions

I found this experiment incredibly validating of IPLD and the
primitives we're using. Without using any of the current implementations,
and without any code from IPFS, I was able to build a fully compatible graph.
Once I write a small library to share the ipld-store on bitswap I can
tell IPFS to pin it.

However, messaging is sort of a mess. IPFS has some history now and the
older something is in this ecosystem the faster you find it. I found the old
files API before MFS, I found the object API before the dag API. I
found dag-pb before dag-cbor. All of these translated into me spending
time using an interface or library I would eventually discard.

I'll be writing and thinking about this a lot more in the coming weeks, but
I'm starting to wonder if the concepts and features we present that are
familiar to users (like MFS) are actually making it more difficult for
people to learn our stack in the long run.

Most developers are used to client/server development. There are a lot of
assumptions they carry around as part of that which have to be unwound in
order to write decentralized applications. The more familiar the interface we
present, the less their mental model adapts as they use it.

I end up talking to developers all the time who have these assumptions.
In the coming months, as I talk to them about a decentralized web, I'm going
to do it through IPLD/CID and see if it clicks a little better than a
filesystem metaphor.

I don't think concepts like merkle trees are actually more difficult to
understand than many of the concepts developers have to learn for
client/server development; they're just different, and information about them
is less abundant.

I also think there are a lot of use cases these concepts solve, like offline, that we can
present for people to learn without needing to also explain the whole p2p
stack.


You can see the final code for pulling the data out of gharchive and
building the graph here.

@mikeal
Contributor Author

mikeal commented Jun 29, 2018

Also, the profiler is showing me a few places in dependencies where we're spending a bit too much time. These deps are used all over the IPFS/IPLD stack, so I'm digging in to see what I can do. Sent one PR already: tjmehta/is-circular#7

@daviddias
Member

Whoa, this sort of experimentation and feedback is incredibly useful! Thank you so much for giving IPLD a spin and writing all of this up! ❤️ ❤️ ❤️ ❤️ ❤️ ❤️

On Legacy APIs of Alpha Software

Very valid point and one that I've also noticed to be a big pain for the users. We've been trying to guide users by ordering the APIs by their current importance-- https://github.com/ipfs/js-ipfs#files -- and only have examples that demonstrate how to use the new ones (e.g. Traversing Graphs with DAG -- https://github.com/ipfs/js-ipfs/tree/master/examples).

Given that IPFS is still Alpha, it might make sense to eclipse the now legacy APIs and focus all the docs and examples on the ones we want users to check first.

There is still some work that needs to happen in the newer Files API and DAG API, namely:

Graph Building and IPLD APIs

I would rather support upgrading the IPLD API to enable your use case and others rather than trying to inject your ipld-store into bitswap :)

Flush, Patch and other controls have been discussed multiple times as good primitives to have in addition to dag.{get, put}. It is just missing a DRI that can lead the design and implementation.

Same goes for bulk/batch primitives for writing to disk. We currently follow what is defined in https://github.com/ipfs/interface-datastore and that is what https://github.com/ipfs/js-ipfs-repo expects for its multiple backends.
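
For reference, the batch primitive interface-datastore defines looks roughly like this in use (a minimal sketch, assuming a recent promise-based release; store is any compliant datastore):

// Minimal usage sketch of the interface-datastore batch primitive
// (assuming a recent, promise-based release; `store` is any compliant
// datastore, e.g. one of the backends js-ipfs-repo wraps).
const { Key } = require('interface-datastore')

async function writeMany (store, pairs) {
  const batch = store.batch()
  for (const [path, value] of pairs) {
    batch.put(new Key(path), value)   // value is a Uint8Array
  }
  await batch.commit()                // one commit for the whole batch
}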

We can upgrade all of these and serve all of our users 🙌🏽

Nevertheless, I agree, it is excellent that you can write code that generates the whole IPLD graph without touching IPFS at all and only later plumb it in ⚡️

@daviddias
Member

@vmx @pgte, pinging you to check out this thread too :)

@mikeal
Contributor Author

mikeal commented Jun 29, 2018

I would rather support upgrading the IPLD API to enable your use case and others rather than trying to inject your ipld-store into bitswap :)

These are optimized for a very specific use case. I made a lot of tradeoffs that I may not necessarily recommend for IPFS proper. This is so optimized for a single user building a graph on a single machine that it makes the experience basically unusable for more than one user. I would never want a multi-user daemon to allow a single user to consume 200MB-1GB for days at a time. But because the difference in performance changes the time to load this data from months to days, I'll gladly write a little library like this to do it locally.

That said, I would love to push this feedback into a more generic flush()/patch() API, but I probably won't recommend the exact approach I took here and will recommend building some constraints into the API to limit the amount of resources a single batch can use.

As for serving ipld-store on bitswap, I'm only going to write that so that I can pin the graph I'm building on an IPFS node and then close the ipld-store node. Like I said, this wasn't built around the same constraints you would want for a daemon so I have no interest in creating one beyond getting the content into IPFS :)

In fact, it could be written to serve not just ipld-store but any simple key/value store, since it wouldn't need any of the bulk interfaces; it doesn't need to do any writes at all.

@mikeal
Contributor Author

mikeal commented Jun 29, 2018

BTW, it's still running, this is the latest.

{ filename: '2018-01-11-14.json.gz' }
{ cid: 'zdpuAnLeDgAyb3D83uGAwDzZQyz7cgrUiR1GoxbocCWsBLvyi',
  events: 78820,
  walked: 156897,
  totalTime: '297s',
  processTime: '166s',
  graphBuildTime: '126s' }

Wow, that's 78820 raw nodes and 156897 cbor nodes processed and written in 297 seconds. Even with 11 days of data, and a lot of blocks that need to get gc'd clogging up leveldb, it's doing 793 nodes a second on my MacBook with the worst filesystem ever.

@daviddias daviddias added the status/ready Ready to be worked label Jul 2, 2018
@daviddias daviddias added the ipld label Oct 1, 2019
@SgtPooki SgtPooki self-assigned this May 17, 2023
@SgtPooki
Member

js-ipfs is being deprecated in favor of Helia. You can follow #4336 and read the migration guide.

Please feel free to reopen with any comments by 2023-06-02. We will do a final pass on reopened issues afterwards (see #4336).

@BigLep @achingbrain this is a really cool analysis put together by @mikeal that we should probably port to a blog post or other.

Note: getting this CID re-created and pinned somewhere so we can explore it in explore.ipld.io with the new updates would be really cool.

@BigLep
Contributor

BigLep commented Jun 7, 2023

This is a great historical issue. It's being closed for now as no immediate planned action is being taken, but I have added it as a "notable issue" in #4336

@BigLep BigLep closed this as completed Jun 7, 2023