Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

live joins? #3

Closed
dominictarr opened this Issue · 14 comments

2 participants

@dominictarr
Collaborator

Would it be possible to use level-live-stream on the joins
so that you can get a changes feed of the relations as they are added?

also, is it possible to do open ended joins, perhaps:

db.join([{
  subject: db.v('author1'),
  predicate: 'maintains',
  object: db.v('module'),
},
{
  subject: db.v('module'),
  predicate: 'depends',
  object: db.v('module2')
},
{
  subject: db.v('author2'),
  predicate: 'maintains',
  object: db.v('module2')
}
], function (err, join) {
  console.log(join)
})

which might return:

{author1: 'mcollina', module: 'levelgraph', author2: 'rvagg', module2: 'levelup'}

Is that correct?

@mcollina
Owner

These are two questions: 1) live stream joins and 2) open ended joins.

1) Live stream joins are not currently supported. Doing that is very different from plain old joins, and require some fairly new algorithms that I need to study. Moreover, as stuff are implemented at the moment everything will blow up in a huge memory leak. In the end, that's another library.

2) Yes. If it's not working it's a bug. Have you tried to pour NPM data into levelgraph? If you do, let me know!! :)

@dominictarr
Collaborator

I have a module for doing joins here that I use here https://github.com/dominictarr/level-inverted-index/blob/map-reduce/index.js#L99-L106

Yes, I have some levelup stuff for working with npm (which is a great dataset) over here https://github.com/dominictarr/npmd

will try doing this in levelgraph and see what happens.

@mcollina
Owner

Your module for joins seems more efficient than the current approach, which is quite naive (but it works).
At the moment I am opening a new stream for each of the query condition for each of the possible results.
it is as bad as it sound, but it allow multiple possible solutions to be discovered.
Moreover this implementation limits the subsequent queries only to the 'bounds' variables, i.e. if the first condition finds out that x = 42, then the second condition reuses that result.

Would you like to try an implementation of the join using the your module? I believe it may be possible to get better throughput at the expense of RAM (the big joinhash in your module).
The problem with this approach is that it cannot use the knowledge of the result of the other conditions until the reduce phase. How can you handle binding multiple variables, when your getKey method have to return a string?
Still, it may be very tricky to get it done right, but implementing it will also allow to do live joins.

Ideally these two approach might be complementary, as a proper database can use multiple join strategy. I bet a pluggable system will be nice.

It would also be cool to get some numbers with a real dataset, so we can know which one works better (the NPM one will be super).

@dominictarr
Collaborator

Hmm, so I think using a hash-join you will need to have multiple joins.
I'm not sure it applies in the case of graphs, but if you wanted to join on two values, you could just hash a combination of keys.

If the second condition refuses the result,
does this mean my query will not return modules that I have have written, that I also depend on?

Agree, we need more generic join modules. Also, we can use db.getApproximateSize(range) to estimate which sides of the join is smaller, and thus which should be in memory.

Another thing - leveldb is probably pretty smart about caching - so maybe reading from the database is not so bad.

@mcollina
Owner

Hmm, so I think using a hash-join you will need to have multiple joins.
I'm not sure it applies in the case of graphs, but if you wanted to join on two values, you could just hash a combination of keys.

The problem with hashing is that you can't hash what you do not know. In your query, the first condition is about variables author1 and module. So you can produce an hash on these two, but you must ignore module2 and author2.

If the second condition refuses the result,
does this mean my query will not return modules that I have have written, that I also depend on?

I'm not getting this question, but I'll try answering.
It will returns all the modules you have written that you depend on.
The first condition try binding some variables, but it cannot bind all of them. So it emits a "partial" solution, then the second condition search the store for triples matching the pattern and the already bound variables, and then it emits the matching context.

Another thing - leveldb is probably pretty smart about caching - so maybe reading from the database is not so bad.

It seems so, I think that having a real use case that performs badly can drive the optimization.

@dominictarr
Collaborator

Agree about the "real use case that performs badly". I'm gonna change to 0.10, and try sticking npm into levelgraph and running this query. Unfortunately, I have more pressing "real work" but I'll get to this!

@mcollina
Owner

No hurry!
Adding 0.8 support should be easy, if you need it just ping me :).

You might also want to use #4, I believe it's stable enough.

@dominictarr
Collaborator

it's time to move to 10 anyway. would have done this on sunday, but I encounterd a few bugs, in my stuff and fixed them instead.

@mcollina
Owner

I have been thinking about this live joining. It is doable, albeit not fast and/or optimizable on the first run.

How about something:

db.join([{
  stream: yourStream // note: this will only work for the very first condition.
  subject: db.v('author1'),
  predicate: 'maintains',
  object: db.v('module'),
},
{
  subject: db.v('module'),
  predicate: 'depends',
  object: db.v('module2')
},
{
  subject: db.v('author2'),
  predicate: 'maintains',
  object: db.v('module2')
}
], function (err, join) {
  console.log(join)
})

It should be easy to implement.
@dominictarr if you need it, I can give it a go.

@mcollina
Owner

Another possible syntax:

db.join(yourstream, [{
  subject: db.v('author1'),
  predicate: 'maintains',
  object: db.v('module'),
},
{
  subject: db.v('module'),
  predicate: 'depends',
  object: db.v('module2')
},
{
  subject: db.v('author2'),
  predicate: 'maintains',
  object: db.v('module2')
}
], function (err, join) {
  console.log(join)
})
@mcollina
Owner

The only take in this solution is that the live-stream will only be matched by the first condition, while the rest will be based on the internal graph's data.

Another solution might be:

db.join([{
  live: true, // works only on the first condition
  subject: db.v('author1'),
  predicate: 'maintains',
  object: db.v('module'),
},
{
  subject: db.v('module'),
  predicate: 'depends',
  object: db.v('module2')
},
{
  subject: db.v('author2'),
  predicate: 'maintains',
  object: db.v('module2')
}
], function (err, join) {
  console.log(join)
})

This is actually what I like most, it will need to depend on sublevel, but it is ok.

The real question is: should this be forwarding all "internal" matches first, or just the "live" results?

@dominictarr
Collaborator

by "internal", you mean it should match the current values in the database?

Most definately - that is extremely useful - I think the most useful is to be able to match the history, the live changes,
and either just the history or live changes.

However, since you are using streams for this, it is pretty straightforward to adapt the idea to both.

@mcollina
Owner

By 'internal' I mean the current values of the database. Not sure about how to support the history (if it is not what is currently in the db).

The Join are implemented like a pipeline. An empty solution start, then it is denied or augmented with variable bindings at each step/condition. Implementing a live join will be starting from a live stream of triples matching the first condition. How does it sound?

What about the interface?

@mcollina mcollina referenced this issue
Open

Live join support #24

3 of 5 tasks complete
@mcollina
Owner

Closing in favor of #24

@mcollina mcollina closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.