
cache queries #12

Closed
calvinmetcalf opened this issue Jan 5, 2014 · 85 comments

Comments

@calvinmetcalf
Member

Currently all non-HTTP queries are done from scratch each time. We could save the result of the map query to a _local document along with the sequence number, so subsequent queries avoid having to iterate through the whole database. We could even listen to the changes feed and update the cache every time a document is created/updated.
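A minimal sketch of that idea, assuming a hypothetical cachedQuery helper and a made-up _local/query-cache document name (neither exists in PouchDB); for simplicity the map function takes emit as an argument and the rebuild is not incremental:

    function cachedQuery(db, mapFun) {
      var cacheId = '_local/query-cache'; // assumed name, not reserved by PouchDB
      return Promise.all([
        db.info(),
        db.get(cacheId).catch(function () { return {_id: cacheId, seq: -1, rows: []}; })
      ]).then(function (res) {
        var info = res[0], cache = res[1];
        if (cache.seq === info.update_seq) {
          return cache.rows; // nothing has changed since the cache was built
        }
        // full rebuild for simplicity; an incremental version would read db.changes()
        return db.allDocs({include_docs: true}).then(function (all) {
          var rows = [];
          all.rows.forEach(function (row) {
            mapFun(row.doc, function emit(key, value) {
              rows.push({id: row.id, key: key, value: value});
            });
          });
          cache.seq = info.update_seq;
          cache.rows = rows;
          return db.put(cache).then(function () { return rows; });
        });
      });
    }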

@calvinmetcalf
Member Author

We can probably save a document with the change sequence, the keyLookup, and the results; this would mean we'd need to create the key lookup every time.

@daleharvey
Member

I don't think we should save the results as a whole, just the output of the mapped objects; then all view lookups will just be range queries against a natural index (i.e. fast)

https://github.com/daleharvey/pouchdb/issues/99?source=cc

@calvinmetcalf
Member Author

what do you mean by output of mapped objects?

@daleharvey
Member

the map will emit objects with a key, just store the objects against that key

@calvinmetcalf
Member Author

yeah looking back the key store doesn't give any benefits, but if I store them against their original doc ID instead of the mapped key, I can update them incrementally

@daleharvey
Member

Hmm, that reminds me: I'm not sure if the way indexeddb / couchdb handle duplicate keys is compatible. We can only have a single document per _id, but I think keys can be duplicated.


@calvinmetcalf
Member Author

They can; I was planning to do this in an adapter-neutral way so that kind of thing doesn't matter.

@daleharvey
Member

My original plan for that was to use separate pouch stores to store the results; duplicated keys make that slightly, but not a lot, more complicated. You will need 2 indexes, one by key and one by _id, either way (_id at the least for deletions). The important part is that query({startkey: ... turns into a linear range query
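A hedged sketch of that layout, using a separate PouchDB database as the view store. The 'myview-index' name and the key/ and id/ prefixes are made up for illustration, and JSON.stringify is only a stand-in: it does not sort keys the way CouchDB collation does, which is exactly the problem discussed later in this thread.

    var viewDb = new PouchDB('myview-index'); // hypothetical per-view store

    function saveRow(sourceDocId, emittedKey, value) {
      return viewDb.bulkDocs([
        // by-key index: query({startkey: ...}) becomes a linear range query,
        // i.e. viewDb.allDocs({startkey: 'key/...', endkey: 'key/...'})
        {_id: 'key/' + JSON.stringify(emittedKey) + '/' + sourceDocId, value: value},
        // by-id index: lets an update or deletion of the source doc find its old rows
        {_id: 'id/' + sourceDocId, keys: [emittedKey]}
      ]);
    }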

@neojski
Member

neojski commented Jan 6, 2014

I've also been thinking about this recently, but I still don't feel like implementing a B+tree in JS on top of indexeddb (and then leveldb and so on), and it looks like that's the only way to make reduce fast enough...

@calvinmetcalf
Member Author

Quick aside: have you seen @nolanlawson's merged pull #11 that adds key index lookup?

So we can save the results array, the key lookup, and an id lookup that records which keys are produced by which id, which should make deleting and updating easier, though you'd have to rebuild both the results and the key lookup if you add a document, or update one and it emits new keys.

@neojski
Member

neojski commented Jan 6, 2014

I'd have to take a deep look at mapreduce as a whole and in particular at #11, thanks. But it looks like without trees it is impossible to do it right. And implementing trees using indexeddb looks bad.

@nolanlawson
Member

Just piping in: I think indexeddb might actually offer a nice solution for building up the btree. It allows complex keys, and the sorting is pretty similar to how CouchDB does it. Scroll down from here to where it says:

For purposes of comparison, all Arrays are greater than all String, Date and Number values; all String values are greater than all Date and Number values; and all Date values are greater than all Number values.

There are a lot of edge cases to cover (e.g. CouchDB converts any appearance of null, undefined, NaN, Infinity (but not the empty string!) to just null, it converts dates to JSON, and idb doesn't support objects as keys), but it seems we can basically convert one complex key format to another.

@daleharvey Yeah, Couch does allow duplicate keys, but we could get around this by making each key a [key, docid] array, which is unique and as a bonus handles the case where multiple docs have the same key. I would even go one step further and make it [key, isDeleted, docid] so we could easily omit deleted docs.
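An IndexedDB sketch of that [key, isDeleted, docid] idea (the store and database names here are invented, and it relies on complex array keys, which, as noted a few comments below, turned out not to be supported in IE 10/11):

    var req = indexedDB.open('view-db', 1); // hypothetical database name
    req.onupgradeneeded = function (e) {
      // out-of-line keys: the [key, isDeleted, docid] array is passed to put()
      e.target.result.createObjectStore('by-key');
    };
    req.onsuccess = function (e) {
      var store = e.target.result
        .transaction('by-key', 'readwrite')
        .objectStore('by-key');
      store.put({value: 42}, ['someEmittedKey', 0, 'doc1']);
      // all non-deleted rows for 'someEmittedKey' become one linear range query
      var range = IDBKeyRange.bound(['someEmittedKey', 0], ['someEmittedKey', 0, '\uffff']);
      store.openCursor(range).onsuccess = function (ev) {
        var cursor = ev.target.result;
        if (cursor) { console.log(cursor.key, cursor.value); cursor.continue(); }
      };
    };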

That being said, I think Calvin's solution is the simplest, and we'll probably have to do it anyway for websql.

@calvinmetcalf
Member Author

@nolanlawson can't a doc emit the same key multiple times?

@daleharvey
Member

Yeah, I imagined it would be more like key: [{docid: doc1, emitted: object}, {docid: doc2, ...}]. We don't need a btree, just linear queries; a btree is just one way to implement them, but idb / websql / level gives us them anyway.

I can't think of any way that saving the results array as a whole is going to be much of an optimisation; it still requires loading the entire result set to be able to filter by key, which is the main part that needs removing.

@nolanlawson
Member

@calvinmetcalf oops, didn't think of that edge case! Testing it out, it seems CouchDB considers function(doc){emit(doc.foo, null);emit(doc.foo, null);} to be equivalent to function(doc){emit(doc.foo, null);}, but our local adapter is just doubling the results. I'll file a bug and write a fix.

@daleharvey that works too, but the only downside is that you'll need to do a secondary sort on docid when you pull the data out, and the logic for offset/limit could get pretty gnarly, whereas {key : [key,docid] ... } gives it to us for free.

You're right about saving the entire results array; the only optimization is in terms of coding time. ;)

@daleharvey
Member

If {key : [key,docid] ... } can work for the duplicate-keys-from-a-single-doc situation, then yeah, that would be best. I would check what happens with function(doc){emit(doc.foo, doc.bar);emit(doc.foo, doc.baz);} here as well, just in case the contents matter; we might be stepping into 'weird edge case territory that we don't need to support' though

@calvinmetcalf
Member Author

Check the cache branch for a super rough way of implementing this (how rough? It doesn't actually work).

The idea is to refactor it so that the buildIndices function builds a cacheable object and buildQuarry is the function that can use either a fresh or a cached version. Will take a fresh look in the morning.

@nolanlawson
Member

@daleharvey: sorry, I made a mistake; our implementation actually seems to be correct. I'll write a unit test to confirm, but I don't think we need to write any code to chase this edge case.

@nolanlawson
Member

Okay, the case of multiple emits with the same key was much less serious than I thought. It's just a one-line code change and doesn't really impact performance. See #18.

@neojski
Member

neojski commented Jan 7, 2014

@daleharvey, would it be possible to implement log(n) reduce without explicitly implementing a B+tree? I thought not. Maybe I missed something and we're just talking about map (not the reduce part).

@daleharvey
Member

Oh yes, sorry, I am talking about the map side of things. Reduce is harder to optimise, particularly without a btree, but I think getting map-only queries fast for people is still a pretty big win; they are used far more often than reduce queries as far as I can tell.

@nolanlawson
Member

Same here, I was only thinking of the map side, although I did look around and find this breakdown of CouchDB's map/reduce implementation if anyone's curious.

Unfortunately I also discovered that neither IE 10 nor 11 supports complex keys in IDB (SO question, a script to confirm). I also tried the spec's suggestion of using a Sequence<DOMString> for the keyPath, but assuming they mean an array of strings, it doesn't seem to be supported by Chrome/Firefox/Opera ("The keyPath option is not a valid key path"). So I guess there goes my suggestion of leveraging IDB's range queries.

@calvinmetcalf
Member Author

OK, so we need a storage design for the cache. We can't use a straight-up object because it will coerce all the keys to strings, but we can't use an ES6 Map either because it stores object keys by reference, not by value. We also need it sorted on key based on the pouchCollate algorithm, updatable based on the document _id (a string), and possibly a joined document.

I can start on that; it might end up living in another repo. Also, is there a data structure that is good for this that I'm not thinking of?
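To illustrate why neither built-in works as the key store (a plain illustration, not project code):

    var obj = {};
    obj[[1, 2]] = 'a';        // the array key is coerced to the string "1,2"
    obj['1,2'] === 'a';       // true: distinct keys collide once stringified

    var map = new Map();
    map.set([1, 2], 'a');
    map.get([1, 2]);          // undefined: Map compares object keys by reference,
                              // so an equal-but-different array misses the entry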

@nolanlawson
Member

After I learned we can't use complex keys in IE, I had a crazy idea to convert all the keys to a string representation that would sort lexicographically in pouchCollate order, which would allow us to sort in websql/idb rather than in memory. Obviously there would be an upper limit on the size of strings that we could use in it. And obviously, it's crazy. But other than that, I had no ideas.

@nolanlawson
Member

Oh yeah, and don't forget that docs in the mapreduce index are technically sorted by [key, docid, value], where key and value are sorted according to pouchCollate. It's a weird edge case I ran into in the unit tests.

@calvinmetcalf
Member Author

The problem there (and I considered it) is that it would consider [1]
and "[1]" the same key.

I have the beginnings of an idea which could work.

@nolanlawson
Member

I wasn't thinking of just sorting based on the stringified form; I'm thinking of going nuts and building up a base64-encoded string that sorts by pouchCollate. I've already started hacking up an example.

Basically it builds up a string that looks like this:

collateIndex1 valueEncoded1 collateIndex2 valueEncoded2 collateIndex3 valueEncoded3 [etc.]

where collateIndex is e.g. 1 for nulls, 2 for booleans, 3 for numbers, etc. The encoded value can just be btoa for strings (which is good enough), and for numbers it only has to be a little more complicated to make sure decimals/negatives sort correctly. I'm using a basic base64 encoding, but the base64 alphabet could vary based on how the database sorts strings (e.g. I think sqlite sorts case-insensitively, and idb sorts based on JavaScript string order, but I may be wrong).

As long as we choose an alphabet that agrees with the database, though, and as long as the user doesn't have super-long strings or crazy nested arrays/objects or anything like that, I think sorting in the database on a string could be an easy duct-tape solution. This algorithm, btw, is already sorting this list of heterogeneous values correctly, which is kind of neat.
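A much-simplified sketch of the kind of encoding being described: a collation-class prefix plus a padded value, so heterogeneous keys sort in CouchDB order under plain string comparison. The number handling here is deliberately naive (non-negative integers only), arrays/objects are omitted, and this is not the eventual toIndexableString implementation, just the shape of the idea:

    function pad(str, width) {
      while (str.length < width) { str = '0' + str; }
      return str;
    }

    function toSortableString(value) {
      if (value === null) { return '1'; }
      if (typeof value === 'boolean') { return '2' + (value ? '1' : '0'); }
      if (typeof value === 'number') {
        // fixed width so that 2 sorts before 10 ("...0002" < "...0010")
        return '3' + pad(String(value), 16);
      }
      if (typeof value === 'string') { return '4' + value; }
      throw new Error('arrays/objects omitted from this sketch');
    }

    // nulls < booleans < numbers < strings, matching CouchDB's collation classes:
    ['b', 10, true, null, 2].sort(function (a, b) {
      return toSortableString(a) < toSortableString(b) ? -1 : 1;
    });
    // => [null, true, 2, 10, 'b']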

@calvinmetcalf
Member Author

@nolanlawson fyi, that conditional in your example doesn't do what you think it does: process is always going to be truthy; process.browser is truthy only if you are in a browser. But btoa is only going to be available in some browsers (firefox yes, chrome no), so you might just want to check if global.btoa is a function.

I'm online most week days (irc) if you have javascript questions

@nolanlawson
Member

Good to know; there are lots of subtleties with browserify I was missing. I can add some fixes to make btoa work cross-platform.
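A cross-environment version might look something like this (a hedged sketch, not necessarily what the branch ends up doing; Buffer is the Node-side fallback):

    function encodeBase64(str) {
      if (typeof btoa === 'function') {
        return btoa(str);                                    // browsers that expose btoa
      }
      return Buffer.from(str, 'binary').toString('base64');  // node
    }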

@neojski
Member

neojski commented Jan 30, 2014

If we want to have this as separate stuff we'd need to have some eventEmitter for destroy, to clean up the additional view database. Any suggestions?

@calvinmetcalf
Member Author

I've been meaning to open a pull to make PouchDB databases event emitters.
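Something along these lines, assuming databases become event emitters and emit a 'destroyed' event (the event name and the view database name are assumptions here):

    var db = new PouchDB('mydb');
    var viewDb = new PouchDB('mydb-mrview-cache'); // hypothetical per-view store

    db.on('destroyed', function () {
      // throw away the cached view data when the source database goes away
      viewDb.destroy();
    });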


@neojski
Member

neojski commented Feb 3, 2014

For anyone interested: I have a first implementation of cached views. It passes the tests but it's not incremental at all. I plan to add the incremental update feature tomorrow.
If you want to run it you have to have the git version of pouchdb and https://github.com/neojski/collate/tree/indexable_string.

@nolanlawson
Member

About these lines
I'm a fan of CouchDB's system, where the views are named based on the md5sum of the contents of the map/reduce document, e.g. mrview/c29b82e2f86e926e869e0c343f1c1d46.view. That way, if anything changes (even whitespace), all the existing view data gets replaced.

I'm also +1 for using the mrview name scheme, if for no other reason than I always read it as "Mr. View" and I find that funny.
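A sketch of how that naming could look (Node's crypto is used here purely for illustration; the actual digest utility and name format are open questions):

    var crypto = require('crypto');

    function viewDbName(sourceDbName, mapFunSource) {
      var digest = crypto.createHash('md5').update(mapFunSource).digest('hex');
      return sourceDbName + '-mrview-' + digest;
    }

    // any change to the function source (even whitespace) yields a new name,
    // so stale view data is abandoned rather than silently reused:
    viewDbName('mydb', 'function (doc) { emit(doc.foo); }');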

@nolanlawson
Member

I'm also more inclined to think we should imitate CouchDB's API in terms of having an explicit split between "temporary views" and "saved views." Magically caching the views may be simpler from the user's point of view, but it's a leaky abstraction when dealing with the http adapter. I'd rather have something like

query() // temp view or lookup by name, i.e. status quo
createIndex() // create a new saved view, equivalent to POSTing a _design doc

@nolanlawson
Member

In general, though, awesome work, and I'm still psyched that you made that crazy indexable string actually work. I'm gonna keep posting comments here, but I think at this point you can also just make a PR and we'll proceed from there.

@nolanlawson
Member

About these lines
Good catch on adding i to the view key: [row.key, row.id, row.value, i]. Docs can emit the same key/value pair multiple times, so the ids for each row have to be uniquified. That's why you added it, right?

Question: how can we look up a view row by its document (e.g. when updating/deleting docs)? Per my spec it looks like you've implemented the first table type, but not the second one.

@neojski
Member

neojski commented Feb 10, 2014

Yeah, that i is exactly due to multiple emits for one document. I haven't implemented "incremental views" yet, but that's right, it's gonna be a second db for that.
I had some important work to do this weekend so I didn't manage to finish it. I'll make a PR once I make it incremental. The PR will work like this (simplest and least performant, but we'll see how good it is once it's written):

  1. "mutex" on query
  2. update view with db.changes
  3. allDocs to actually do the query

The main part is already written so I hope to do it soon!
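A rough sketch of those three steps (queue, getLastSeq/setLastSeq, and updateIndexFromChange are all hypothetical names; viewDb is the separate per-view database discussed above):

    var queue = Promise.resolve(); // step 1: the "mutex", queries run one at a time

    function query(db, viewDb, opts) {
      queue = queue.then(function () {
        // step 2: catch the view up with any changes since the last query
        return getLastSeq(viewDb).then(function (lastSeq) {
          return db.changes({since: lastSeq, include_docs: true});
        }).then(function (changes) {
          return Promise.all(changes.results.map(function (change) {
            return updateIndexFromChange(viewDb, change);
          })).then(function () {
            return setLastSeq(viewDb, changes.last_seq);
          });
        }).then(function () {
          // step 3: the actual lookup is just allDocs against the view db
          return viewDb.allDocs({
            startkey: opts.startkey,
            endkey: opts.endkey,
            include_docs: true
          });
        });
      });
      return queue;
    }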

@neojski
Member

neojski commented Feb 11, 2014

I faced the following problem:
When I get a db.query call I create a PouchDB db to store the things. Let's say we add PouchDB event-emitting capabilities to inform mapreduce that some db has been destroyed. But what about the following scenario:

  1. create db
  2. initialize mapreduce (do db.query)
  3. destroy db
  4. suppose the db got destroyed but we didn't get the notification of that. (js crashed, power down, whatever)
  5. db got recreated but with completely different data
  6. db.query - we still think that we have the same db because the name is the same

The only solution I came up with is adding some unique id to pouchdb so that someone can remember it and be sure that the dbs differ. It could be returned by db.info. Any ideas?

@calvinmetcalf
Member Author

Well, no, because there's nothing to prevent them from disabling mapreduce, futzing around with the data, and then enabling it again; the mapreduce cache should probably always assume that there may be one or more missed changes.


@neojski
Member

neojski commented Feb 11, 2014

My mapreduce implementation is using the changes feed, and it is updated only when db.query is issued. So there's no such thing as disabling mapreduce, because it's not continuously looking for changes.

The problem is that mapreduce has no mechanism for knowing whether it's talking to the same db or a db with the same name, because it missed the destroy command.

@calvinmetcalf
Member Author

You probably want to store a cache id of some sort in the db as a _local/name document so it knows it's the right db.
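A hedged sketch of that check: the view cache stamps the source db with a random token in a _local doc and refuses to reuse its index if the token it later reads back doesn't match. The _local/mapreduce-cache name, the token field, and rebuildFromScratch are all assumptions:

    function ensureSameDb(db, viewState) {
      var localId = '_local/mapreduce-cache'; // assumed name
      return db.get(localId).catch(function () {
        // first time we see this db: stamp it with a fresh token
        var token = Math.random().toString(36).slice(2);
        return db.put({_id: localId, token: token}).then(function () {
          return {token: token};
        });
      }).then(function (doc) {
        if (viewState.token && viewState.token !== doc.token) {
          // same name, different db: the cached index must be discarded
          return rebuildFromScratch(viewState); // hypothetical
        }
        viewState.token = doc.token;
      });
    }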


@neojski
Member

neojski commented Feb 11, 2014

Yeah, if using _local is safe (might someone be angry if I use their _local/_pouchdb_mapreduce key?) then it's probably a good idea.

@calvinmetcalf
Member Author

_local better be safe because we use it during replication


@nolanlawson
Member

@neojski Isn't there a global db seq property? I seem to remember seeing it in the WebSQL table. If we could expose that in the API, then you could know what "version" the database is on.

@neojski
Member

neojski commented Feb 12, 2014

@nolanlawson Are you talking about my use of setSeq and getSeq? I have to remember that myself because it's neither the seq of the db nor the seq of the view index; I remember it so that I can retrieve changes starting from that seq (incremental update).

If not, seq does not solve the problem, because it does not help me distinguish between two dbs with the same name but completely different contents (the problem described 5 comments above).

@nolanlawson
Member

Yeah, I forgot that seq is an integer and not related to the rev. Nevermind!

nolanlawson added a commit to pouchdb/collate that referenced this issue Mar 17, 2014

Following daleharvey/pouchdb#1658 and pouchdb/mapreduce#12, this adds the toIndexableString method, which allows us to emulate CouchDB's standard collation to a reasonable degree of fidelity (e.g. no ICU ordering for strings).

nolanlawson pushed further commits to pouchdb/collate that referenced this issue Mar 17 and Mar 24, 2014
daleharvey pushed a commit to pouchdb/collate that referenced this issue Mar 24, 2014
@neojski
Member

neojski commented Mar 24, 2014

#68. Congratulations @nolanlawson :-)

@neojski neojski closed this as completed Mar 24, 2014
@nolanlawson
Member

You too @neojski. After three months of debate and about a half-dozen false starts, we finally figured it out. I think the key breakthrough was your insight with \u0000 in the indexable string; I was convinced it was a dumb idea until you fixed it. 😃
