
cache queries #12

Closed
calvinmetcalf opened this issue Jan 5, 2014 · 85 comments

Comments

@calvinmetcalf
Member

Currently all non-HTTP queries are done from scratch each time. We could save the result of the map query to a _local document along with the sequence number, so subsequent queries avoid having to iterate through the whole database. We could even listen to the changes feed and update the cache every time a document is created/updated.
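A minimal sketch of that idea, assuming a hypothetical cachedQuery helper and a made-up _local/query-cache document name (neither exists in PouchDB); for simplicity the map function takes emit as an argument and the rebuild is not incremental:

    function cachedQuery(db, mapFun) {
      var cacheId = '_local/query-cache'; // assumed name, not reserved by PouchDB
      return Promise.all([
        db.info(),
        db.get(cacheId).catch(function () { return {_id: cacheId, seq: -1, rows: []}; })
      ]).then(function (res) {
        var info = res[0], cache = res[1];
        if (cache.seq === info.update_seq) {
          return cache.rows; // nothing has changed since the cache was built
        }
        // full rebuild for simplicity; an incremental version would read db.changes()
        return db.allDocs({include_docs: true}).then(function (all) {
          var rows = [];
          all.rows.forEach(function (row) {
            mapFun(row.doc, function emit(key, value) {
              rows.push({id: row.id, key: key, value: value});
            });
          });
          cache.seq = info.update_seq;
          cache.rows = rows;
          return db.put(cache).then(function () { return rows; });
        });
      });
    }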

@calvinmetcalf
Member Author

We can probably save a document with the change sequence, the keyLookup, and the results; this would mean we'd need to create the key lookup every time.

@daleharvey
Member

I don't think we should save the results as a whole, just the output of the mapped objects; then all view lookups will just be range queries against a natural index (i.e. fast)

https://github.com/daleharvey/pouchdb/issues/99?source=cc

@calvinmetcalf
Member Author

what do you mean by output of mapped objects?

@daleharvey
Member

the map will emit objects with a key, just store the objects against that key

@calvinmetcalf
Member Author

yeah looking back the key store doesn't give any benefits, but if I store them against their original doc ID instead of the mapped key, I can update them incrementally

@daleharvey
Member

Hmm, that reminds me: I'm not sure if the way indexeddb / couchdb handle duplicate keys is compatible. We can only have a single document per _id, but I think keys can be duplicated.


@calvinmetcalf
Member Author

They can; I was planning to do this in an adapter-neutral way so that kind of thing doesn't matter.

@daleharvey
Member

My original plan for that was to use separate pouch stores to store the results; duplicated keys make that slightly, but not a lot, more complicated. You will need 2 indexes, one by key and one by _id, either way (_id at the least for deletions). The important part is that query({startkey: ... turns into a linear range query
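A hedged sketch of that layout, using a separate PouchDB database as the view store. The 'myview-index' name and the key/ and id/ prefixes are made up for illustration, and JSON.stringify is only a stand-in: it does not sort keys the way CouchDB collation does, which is exactly the problem discussed later in this thread.

    var viewDb = new PouchDB('myview-index'); // hypothetical per-view store

    function saveRow(sourceDocId, emittedKey, value) {
      return viewDb.bulkDocs([
        // by-key index: query({startkey: ...}) becomes a linear range query,
        // i.e. viewDb.allDocs({startkey: 'key/...', endkey: 'key/...'})
        {_id: 'key/' + JSON.stringify(emittedKey) + '/' + sourceDocId, value: value},
        // by-id index: lets an update or deletion of the source doc find its old rows
        {_id: 'id/' + sourceDocId, keys: [emittedKey]}
      ]);
    }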

@neojski
Member

neojski commented Jan 6, 2014

I've also been thinking about this recently, but I still don't feel like implementing a B+tree in JS on top of indexeddb (and then leveldb and so on), and it looks like that's the only way to make reduce fast enough...

@calvinmetcalf
Member Author

Quick aside: have you seen @nolanlawson's merged pull #11 that adds key index lookup?

So we can save the results array, the key lookup, and an id lookup that records which keys are produced by which id, which should make deleting and updating easier, though you'd have to rebuild both the results and the key lookup if you add a document, or update one and it emits new keys.

@neojski
Member

neojski commented Jan 6, 2014

I'd have to take a deep look at mapreduce as a whole and in particular at #11, thanks. But it looks like without trees it is impossible to do it right. And implementing trees using indexeddb looks bad.

@nolanlawson
Member

Just piping in: I think indexeddb might actually offer a nice solution for building up the btree. It allows complex keys, and the sorting is pretty similar to how CouchDB does it. Scroll down from here to where it says:

For purposes of comparison, all Arrays are greater than all String, Date and Number values; all String values are greater than all Date and Number values; and all Date values are greater than all Number values.

There are a lot of edge cases to cover (e.g. CouchDB converts any appearance of null, undefined, NaN, Infinity (but not the empty string!) to just null, it converts dates to JSON, and idb doesn't support objects as keys), but it seems we can basically convert one complex key format to another.

@daleharvey Yeah, Couch does allow duplicate keys, but we could get around this by making each key a [key, docid] array, which is unique and as a bonus handles the case where multiple docs have the same key. I would even go one step further and make it [key, isDeleted, docid] so we could easily omit deleted docs.
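An IndexedDB sketch of that [key, isDeleted, docid] idea (the store and database names here are invented, and it relies on complex array keys, which, as noted a few comments below, turned out not to be supported in IE 10/11):

    var req = indexedDB.open('view-db', 1); // hypothetical database name
    req.onupgradeneeded = function (e) {
      // out-of-line keys: the [key, isDeleted, docid] array is passed to put()
      e.target.result.createObjectStore('by-key');
    };
    req.onsuccess = function (e) {
      var store = e.target.result
        .transaction('by-key', 'readwrite')
        .objectStore('by-key');
      store.put({value: 42}, ['someEmittedKey', 0, 'doc1']);
      // all non-deleted rows for 'someEmittedKey' become one linear range query
      var range = IDBKeyRange.bound(['someEmittedKey', 0], ['someEmittedKey', 0, '\uffff']);
      store.openCursor(range).onsuccess = function (ev) {
        var cursor = ev.target.result;
        if (cursor) { console.log(cursor.key, cursor.value); cursor.continue(); }
      };
    };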

That being said, I think Calvin's solution is the simplest, and we'll probably have to do it anyway for websql.

@calvinmetcalf
Member Author

@nolanlawson can't a doc emit the same key multiple times?

@daleharvey
Member

Yeah, I imagined it would be more like key: [{docid: doc1, emitted: object}, {docid: doc2, ...}]. We don't need a btree, just linear queries; a btree is just one way to implement them, but idb / websql / level gives us them anyway.

I can't think of any way that saving the results array as a whole is going to be much of an optimisation; it still requires loading the entire result set to be able to filter by key, which is the main part that needs removing.

@nolanlawson
Member

@calvinmetcalf oops, didn't think of that edge case! Testing it out, it seems CouchDB considers function(doc){emit(doc.foo, null);emit(doc.foo, null);} to be equivalent to function(doc){emit(doc.foo, null);}, but our local adapter is just doubling the results. I'll file a bug and write a fix.

@daleharvey that works too, but the only downside is that you'll need to do a secondary sort on docid when you pull the data out, and the logic for offset/limit could get pretty gnarly, whereas {key : [key,docid] ... } gives it to us for free.

You're right about saving the entire results array; the only optimization is in terms of coding time. ;)

@daleharvey
Member

If {key : [key,docid] ... } can work for the duplicate-keys-from-a-single-doc situation, then yeah, that would be best. I would check what happens with function(doc){emit(doc.foo, doc.bar);emit(doc.foo, doc.baz);} here as well, just in case the contents matter; we might be stepping into 'weird edge case territory that we don't need to support' though

@calvinmetcalf
Member Author

Check the cache branch for a super rough way of implementing this (how rough? It doesn't actually work).

The idea is to refactor it so that the buildIndices function builds a cacheable object and buildQuarry is the function that can use either a fresh or a cached version. Will take a fresh look in the morning.

@nolanlawson
Member

@daleharvey: sorry, I made a mistake; our implementation actually seems to be correct. I'll write a unit test to confirm, but I don't think we need to write any code to chase this edge case.

@nolanlawson
Member

Okay, the case of multiple emits with the same key was much less serious than I thought. It's just a one-line code change and doesn't really impact performance. See #18.

@neojski
Member

neojski commented Jan 7, 2014

@daleharvey, would it be possible to implement log(n) reduce without explicitly implementing a B+tree? I thought not. Maybe I missed something and we're just talking about map (not the reduce part).

@daleharvey
Member

Oh yes, sorry, I am talking about the map side of things. Reduce is harder to optimise, particularly without a btree, but I think getting map-only queries fast for people is still a pretty big win; they are used far more often than reduce queries as far as I can tell.

@nolanlawson
Member

Same here, I was only thinking of the map side, although I did look around and find this breakdown of CouchDB's map/reduce implementation if anyone's curious.

Unfortunately I also discovered that neither IE 10 nor 11 supports complex keys in IDB (SO question, a script to confirm). I also tried the spec's suggestion of using a Sequence<DOMString> for the keyPath, but assuming they mean an array of strings, it doesn't seem to be supported by Chrome/Firefox/Opera ("The keyPath option is not a valid key path"). So I guess there goes my suggestion of leveraging IDB's range queries.

@calvinmetcalf
Member Author

OK, so we need a storage design for the cache. We can't use a straight-up object because it will coerce all the keys to strings, but we can't use an ES6 Map either because it stores object keys by reference, not by value. We also need it sorted on key based on the pouchCollate algorithm, updatable based on the document _id (a string), and possibly a joined document.

I can start on that; it might end up living in another repo. Also, is there a data structure that is good for this that I'm not thinking of?
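To illustrate why neither built-in works as the key store (a plain illustration, not project code):

    var obj = {};
    obj[[1, 2]] = 'a';        // the array key is coerced to the string "1,2"
    obj['1,2'] === 'a';       // true: distinct keys collide once stringified

    var map = new Map();
    map.set([1, 2], 'a');
    map.get([1, 2]);          // undefined: Map compares object keys by reference,
                              // so an equal-but-different array misses the entry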

@nolanlawson
Member

After I learned we can't use complex keys in IE, I had a crazy idea to convert all the keys to a string representation that would sort lexicographically in pouchCollate order, which would allow us to sort in websql/idb rather than in memory. Obviously there would be an upper limit on the size of strings that we could use in it. And obviously, it's crazy. But other than that, I had no ideas.

@nolanlawson
Member

Oh yeah, and don't forget that docs in the mapreduce index are technically sorted by [key, docid, value], where key and value are sorted according to pouchCollate. It's a weird edge case I ran into in the unit tests.

@calvinmetcalf
Member Author

The problem there (and I considered it) is that it would consider [1]
and "[1]" the same key.

I have the beginnings of an idea which could work.

@nolanlawson
Member

I wasn't thinking of just sorting based on the stringified form; I'm thinking of going nuts and building up a base64-encoded string that sorts by pouchCollate. I've already started hacking up an example.

Basically it builds up a string that looks like this:

collateIndex1 valueEncoded1 collateIndex2 valueEncoded2 collateIndex3 valueEncoded3 [etc.]

where collateIndex is e.g. 1 for nulls, 2 for booleans, 3 for numbers, etc. The encoded value can just be btoa for strings (which is good enough), and for numbers it only has to be a little more complicated to make sure decimals/negatives sort correctly. I'm using a basic base64 encoding, but the base64 alphabet could vary based on how the database sorts strings (e.g. I think sqlite sorts case-insensitively, and idb sorts based on JavaScript string order, but I may be wrong).

As long as we choose an alphabet that agrees with the database, though, and as long as the user doesn't have super-long strings or crazy nested arrays/objects or anything like that, I think sorting in the database on a string could be an easy duct-tape solution. This algorithm, btw, is already sorting this list of heterogeneous values correctly, which is kind of neat.
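A much-simplified sketch of the kind of encoding being described: a collation-class prefix plus a padded value, so heterogeneous keys sort in CouchDB order under plain string comparison. The number handling here is deliberately naive (non-negative integers only), arrays/objects are omitted, and this is not the eventual toIndexableString implementation, just the shape of the idea:

    function pad(str, width) {
      while (str.length < width) { str = '0' + str; }
      return str;
    }

    function toSortableString(value) {
      if (value === null) { return '1'; }
      if (typeof value === 'boolean') { return '2' + (value ? '1' : '0'); }
      if (typeof value === 'number') {
        // fixed width so that 2 sorts before 10 ("...0002" < "...0010")
        return '3' + pad(String(value), 16);
      }
      if (typeof value === 'string') { return '4' + value; }
      throw new Error('arrays/objects omitted from this sketch');
    }

    // nulls < booleans < numbers < strings, matching CouchDB's collation classes:
    ['b', 10, true, null, 2].sort(function (a, b) {
      return toSortableString(a) < toSortableString(b) ? -1 : 1;
    });
    // => [null, true, 2, 10, 'b']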

@calvinmetcalf
Member Author

@nolanlawson fyi, that conditional in your example doesn't do what you think it does: process is always going to be truthy; process.browser is truthy only if you are in a browser. But btoa is only going to be available in some browsers (firefox yes, chrome no), so you might just want to check if global.btoa is a function.

I'm online most week days (irc) if you have javascript questions

@nolanlawson
Member

Good to know; there are lots of subtleties with browserify I was missing. I can add some fixes to make btoa work cross-platform.
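A cross-environment version might look something like this (a hedged sketch, not necessarily what the branch ends up doing; Buffer is the Node-side fallback):

    function encodeBase64(str) {
      if (typeof btoa === 'function') {
        return btoa(str);                                    // browsers that expose btoa
      }
      return Buffer.from(str, 'binary').toString('base64');  // node
    }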

@neojski
Member

neojski commented Jan 30, 2014

If we want to have this as separate stuff we'd need to have some eventEmitter for destroy, to clean up the additional view database. Any suggestions?

@calvinmetcalf
Member Author

I've been meaning to open a pull to make PouchDB databases event emitters.
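Something along these lines, assuming databases become event emitters and emit a 'destroyed' event (the event name and the view database name are assumptions here):

    var db = new PouchDB('mydb');
    var viewDb = new PouchDB('mydb-mrview-cache'); // hypothetical per-view store

    db.on('destroyed', function () {
      // throw away the cached view data when the source database goes away
      viewDb.destroy();
    });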


@neojski
Member

neojski commented Feb 3, 2014

For anyone interested: I have a first implementation of cached views. It passes the tests but it's not incremental at all. I plan to add the incremental update feature tomorrow.
If you want to run it you have to have the git version of pouchdb and https://github.com/neojski/collate/tree/indexable_string.

@nolanlawson
Member

About these lines
I'm a fan of CouchDB's system, where the views are named based on the md5sum of the contents of the map/reduce document, e.g. mrview/c29b82e2f86e926e869e0c343f1c1d46.view. That way, if anything changes (even whitespace), all the existing view data gets replaced.

I'm also +1 for using the mrview name scheme, if for no other reason than I always read it as "Mr. View" and I find that funny.
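A sketch of how that naming could look (Node's crypto is used here purely for illustration; the actual digest utility and name format are open questions):

    var crypto = require('crypto');

    function viewDbName(sourceDbName, mapFunSource) {
      var digest = crypto.createHash('md5').update(mapFunSource).digest('hex');
      return sourceDbName + '-mrview-' + digest;
    }

    // any change to the function source (even whitespace) yields a new name,
    // so stale view data is abandoned rather than silently reused:
    viewDbName('mydb', 'function (doc) { emit(doc.foo); }');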

@nolanlawson
Member

I'm also more inclined to think we should imitate CouchDB's API in terms of having an explicit split between "temporary views" and "saved views." Magically caching the views may be simpler from the user's point of view, but it's a leaky abstraction when dealing with the http adapter. I'd rather have something like

query() // temp view or lookup by name, i.e. status quo
createIndex() // create a new saved view, equivalent to POSTing a _design doc

@nolanlawson
Member

In general, though, awesome work, and I'm still psyched that you made that crazy indexable string actually work. I'm gonna keep posting comments here, but I think at this point you can also just make a PR and we'll proceed from there.

@nolanlawson
Member

About these lines
Good catch on adding i to the view key: [row.key, row.id, row.value, i]. Docs can emit the same key/value pair multiple times, so the ids for each row have to be uniquified. That's why you added it, right?

Question: how can we look up a view row by its document (e.g. when updating/deleting docs)? Per my spec it looks like you've implemented the first table type, but not the second one.

@neojski
Member

neojski commented Feb 10, 2014

Yeah, that i is exactly due to multiple emits for one document. I haven't implemented "incremental views" yet, but that's right, it's gonna be a second db for that.
I had some important work to do this weekend so I didn't manage to finish it. I'll make a PR once I make it incremental. The PR will work like this (simplest and least performant, but we'll see how good it is once it's written):

  1. "mutex" on query
  2. update view with db.changes
  3. allDocs to actually do the query

The main part is already written so I hope to do it soon!
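A rough sketch of those three steps (queue, getLastSeq/setLastSeq, and updateIndexFromChange are all hypothetical names; viewDb is the separate per-view database discussed above):

    var queue = Promise.resolve(); // step 1: the "mutex", queries run one at a time

    function query(db, viewDb, opts) {
      queue = queue.then(function () {
        // step 2: catch the view up with any changes since the last query
        return getLastSeq(viewDb).then(function (lastSeq) {
          return db.changes({since: lastSeq, include_docs: true});
        }).then(function (changes) {
          return Promise.all(changes.results.map(function (change) {
            return updateIndexFromChange(viewDb, change);
          })).then(function () {
            return setLastSeq(viewDb, changes.last_seq);
          });
        }).then(function () {
          // step 3: the actual lookup is just allDocs against the view db
          return viewDb.allDocs({
            startkey: opts.startkey,
            endkey: opts.endkey,
            include_docs: true
          });
        });
      });
      return queue;
    }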

@neojski
Member

neojski commented Feb 11, 2014

I faced the following problem:
When I get a db.query call I create a PouchDB db to store the things. Let's say we add PouchDB event-emitting capabilities to inform mapreduce that some db has been destroyed. But what about the following scenario:

  1. create db
  2. initialize mapreduce (do db.query)
  3. destroy db
  4. suppose the db got destroyed but we didn't get the notification of that. (js crashed, power down, whatever)
  5. db got recreated but with completely different data
  6. db.query - we still think that we have the same db because the name is the same

The only solution I came up with is adding some unique id to pouchdb so that someone can remember it and be sure that the dbs differ. It could be returned by db.info. Any ideas?

@calvinmetcalf
Member Author

Well, no, because there's nothing to prevent them from disabling mapreduce, futzing around with the data, and then enabling it again; the mapreduce cache should probably always assume that there may be one or more missed changes.


@neojski
Member

neojski commented Feb 11, 2014

My mapreduce implementation is using the changes feed, and it is updated only when db.query is issued. So there's no such thing as disabling mapreduce, because it's not continuously looking for changes.

The problem is that mapreduce has no mechanism for knowing whether it's talking to the same db or a db with the same name, because it missed the destroy command.

@calvinmetcalf
Member Author

You probably want to store a cache id of some sort in the db as a _local/name document so it knows it's the right db.
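A hedged sketch of that check: the view cache stamps the source db with a random token in a _local doc and refuses to reuse its index if the token it later reads back doesn't match. The _local/mapreduce-cache name, the token field, and rebuildFromScratch are all assumptions:

    function ensureSameDb(db, viewState) {
      var localId = '_local/mapreduce-cache'; // assumed name
      return db.get(localId).catch(function () {
        // first time we see this db: stamp it with a fresh token
        var token = Math.random().toString(36).slice(2);
        return db.put({_id: localId, token: token}).then(function () {
          return {token: token};
        });
      }).then(function (doc) {
        if (viewState.token && viewState.token !== doc.token) {
          // same name, different db: the cached index must be discarded
          return rebuildFromScratch(viewState); // hypothetical
        }
        viewState.token = doc.token;
      });
    }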


@neojski
Member

neojski commented Feb 11, 2014

Yeah, if using _local is safe (might someone be angry if I use their _local/_pouchdb_mapreduce key?) then it's probably a good idea.

@calvinmetcalf
Member Author

_local better be safe because we use it during replication


@nolanlawson
Member

@neojski Isn't there a global db seq property? I seem to remember seeing it in the WebSQL table. If we could expose that in the API, then you could know what "version" the database is on.

@neojski
Member

neojski commented Feb 12, 2014

@nolanlawson Are you talking about my use of setSeq and getSeq? I have to remember that myself because it's neither the seq of the db nor the seq of the view index; I remember it so that I can retrieve changes starting from that seq (incremental update).

If not, seq does not solve the problem, because it does not help me distinguish between two dbs with the same name but completely different contents (the problem described 5 comments above).

@nolanlawson
Member

Yeah, I forgot that seq is an integer and not related to the rev. Nevermind!

nolanlawson added a commit to pouchdb/collate that referenced this issue Mar 17, 2014

Following daleharvey/pouchdb#1658 and pouchdb/mapreduce#12, this adds the toIndexableString method, which allows us to emulate CouchDB's standard collation to a reasonable degree of fidelity (e.g. no ICU ordering for strings).

nolanlawson pushed further commits to pouchdb/collate that referenced this issue Mar 17 and Mar 24, 2014
daleharvey pushed a commit to pouchdb/collate that referenced this issue Mar 24, 2014
@neojski
Member

neojski commented Mar 24, 2014

#68. Congratulations @nolanlawson :-)

@neojski neojski closed this as completed Mar 24, 2014
@nolanlawson
Member

You too @neojski. After three months of debate and about a half-dozen false starts, we finally figured it out. I think the key breakthrough was your insight with \u0000 in the indexable string; I was convinced it was a dumb idea until you fixed it. 😃
