Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big amibitous plan for native secondary indexes and purge() #3775

Closed
nolanlawson opened this issue Apr 30, 2015 · 15 comments
Closed

Big amibitous plan for native secondary indexes and purge() #3775

nolanlawson opened this issue Apr 30, 2015 · 15 comments
Labels
enhancement Feature request

Comments

@nolanlawson
Copy link
Member

I've got a plan for how I'd like to proceed with solving both:

  1. faster, "native" secondary indexes (Native secondary indexes #2280), and
  2. the purge() feature (Feature Request - Method to Purge a document #802).

My proposal: move the _query() and _viewCleanup() implementations into the adapters themselves. Yes, this will require writing it three times for IDB/WebSQL/LevelDB, but it offers the benefit of vastly improved map/reduce performance. Which I think we sorely need, because as it stands, we're about 100x slower than native indexes, and 100 is a big number (source: #3039 (comment)).

What would these _query() and _viewCleanup() functions look like? Basically like this: here's the socket-pouch custom adapter defining its own implementations and here's the modest change to mapreduce required to pull off this trick. So basically we would just need to move the _query() and _viewCleanup() implementations into their respective adapters (http already having been written) and deprecate the mapreduce plugin.

How does this affect pouchdb-find, etc.? I think I've already demonstrated with pouchdb-abstract-mapreduce that you can build pouchdb-find on top of map/reduce and it works just peachy. I did the same thing for pouchdb-quick-search. Both codebases could be very easily rewritten to use native _query() and _viewCleanup() functions.

Doesn't it sacrifice modularity to re-integrate the mapreduce plugin? A bit, but it could still be possible to build PouchDB without the _query() and _viewCleanup() functions. We'd just have to have a separate build script that uses some browserify tricks to sub out those functions for stubs that throw "not implemented" errors, and include them in separate files: idb-mapreduce.js, websql-mapreduce.js, leveldb-mapreduce.js. Not a big deal.

What's the migration plan? None at the moment, except that existing secondary indexes would have to be rebuilt. I think this is the easiest fix and not really a big ask for users. Anyway, given that the schema would look different between the old and new implementations, such a migration is inevitable.

What are the other benefits? Plenty. No more awkward filtering of mrview databases in pouchdb-all-dbs, no more IndexedDB destroy() race conditions, no more having to manage "dependentDbs" at all. As well as perf improvements and making it vastly simpler to implement purge(), since you don't have to worry about coordinating the dependent databases.

In general, I think our current map/reduce implementation was a fine solution at the time. It gave us robust secondary indexes for all three adapters, without having to write a lot of adapter-specific code. And we now have tons of map/reduce tests that are going to serve us well going forward. But speed is a priority, and purge() needs to be implemented, and in my mind this is the best way to do it.

If y'all are okay with this proposal, I'll start implementing it and could hopefully have it done within a month or two.

@nolanlawson
Copy link
Member Author

Also, the adapter-specific _query() and _viewCleanup() functions don't have be perfectly user-facing (e.g. with evaling map functions and running reduce and all that jazz). They could be more modest and low-level, and look more like what abstract-mapreduce is doing.

@calvinmetcalf
Copy link
Member

we could dust off the old alternative plan we decided against when doing dependent dbs. The idea was to a striped down storage api for making space where an index can be stored, this would make it possible to make other types of indices that don't map well to map reduce (i.e. geopouch)

@nolanlawson
Copy link
Member Author

I think that's basically what the abstract-mapreduce plan would be. :) Although maybe a bit higher-level than what we originally wrote.

@nolanlawson
Copy link
Member Author

I think I'm going to move mapreduce back into the pouchdb repo, and then expose abstract-mapreduce under extras. It would make things a lot easier to manage if it were a single repo.

@daleharvey
Copy link
Member

So I am a little bit wary of the proposal, in general I think it may be ok to have implementations define their own viewCleanup etc, somewhat worried about query becoming a first class citizen in the adapter api.

My vague idea for secondary indexes is to change our underlying storage format, we currently have

|    BY_ID    |
  { _id: docId, 
    _rev: docRev,
    _winningSeq: etc }

alongside

|    BY_SEQ    |

{ id: seq, 
  data: '{some: json}' }

I suggest we do

|    BY_ID    |
{ _id: 'hello', 
  _winningRev: 'xxx',
  _seq: X, // may have to be an array?
  _rev_tree: [],
  winningDoc: {
    some: json
  },
  revs: {
     // the old docs
  }
}

My main worry with that is creating dynamic indexes with indexeddb doesnt seem particularly easy, I believe there is some movement to support that but its likely gonna take a while.

As for how that API is exposed to the adapters, we may just want to let them pass it through, but honestly I would probably skip implementing native query and focus on native pouchdb-find at this point

@daleharvey
Copy link
Member

Basically CouchDB implemented a fairly clever btree log format because it was dealing with the native filesystem, we are dealing with databases and by not using them as such we are at the least 2x slower

@nolanlawson
Copy link
Member Author

Folding both objectStores into a single one is a great idea, especially given that joins in IndexedDB are very slow.

However, the big problem is the lack of support for multiEntry in IE's IndexedDB. Yes, _seq would have to be an array, and so would a _secondaryIndexes field, but IE wouldn't support that, so we'd have to come up with some kind of workaround. I looked into iegap and asked @dfahlander about it, and it seems it would only add about 5KB and passes the W3C tests, so it's a reasonable solution.

In general, though, I'm not sure that we need to implement the "single objectStore" solution just to add native secondary indexes. And yeah, you can't dynamically add indexes in IndexedDB, so my idea was to create a single index and then prefix them.

Don't worry, though; it's looking more and more like I won't have time for this this summer, so most likely I will just work on stabilizing/speeding up the solution we already have. So there's no hurry. :)

@sgenoud
Copy link
Contributor

sgenoud commented Feb 16, 2016

I have spent a bit of time on the views. So I created this plugin: https://github.com/sgenoud/pouchdb-dr-view

What is in there? Ok, first I have refactored the update part of the query, trying to decouple the write from the read. I hoped it would improve some things, does not seem to do it too much, according to my little tests (perhaps 10% by tweeking the buffer parameters).

Note that it passes the mapreduce test suite.

Additionally it should make it "easier" to implement your another view engine, by exposing some methods that could be overridden (in the same way than the plugins do it for pouchdb).

I have thrown lots of stuff there - there is definitely more work to be done on the API exposed, as well as baking in some performance test to help for developing a more performant view. I'd love to get some feedback, I don't want to sunk too much more time in there is not much support to continue in this direction...

@nolanlawson
Copy link
Member Author

@sgenoud Cool - is there any way you could publish this as a separate plugin that is compatible with the PouchDB plugin ecosystem? In principle, map/reduce is a plugin, meaning that someone who wrote a better version could just publish it as a separate module (e.g. see pouchdb-find).

As for the "direction" we are going in, yeah, we are probably going to work to have the secondary indexes within the adapters themselves, rather than create an entirely separate database. However, if you have suggestions for improving performance in core map/reduce, then we'd be happy to accept them. :)

@sgenoud
Copy link
Contributor

sgenoud commented Feb 18, 2016

@nolanlawson actually it is a compatible pouchdb plugin (but I have not published it yet). I uses the _query hook you added to override the behaviour.

With this plugin I have come to realize (on a leveldb adapter - I have not looked how this translates in the browser) that the slowest part seems to be the changes reading – if I write one document for every change the writer tends to wait for the reader. I was thinking of changing the behaviour to something that actually works more like couch itself – compute all the views of one ddoc together as a group (one read, multiple writes).

I would also be interesting to see what happens when we put the reader in a web worker (or something similar).

On the other hand we could solve the "cold start" problem with something similar to the pouchdb dump tool you created. If the views are stored as pouch databases we can import them easily. You would need to bootstrap from scratch, both the main db and the views (i.e. to make sure the data is consistent). Obviously if the data changes a lot we loose all the wins – but it should cover some common cases.

Generally, I am not sure how you want to implement the map/reduce done more natively - as you need to run an arbitrary javascript function on every document anyway (and get outside of the C layer to do it). You could get wins for pouchdb-find though, because, as far as I understand, it creates map functions that "only" pull some fields (and do not do some crazy things like running a double metaphone).

@nrw
Copy link

nrw commented Apr 22, 2016

is speeding up secondary indexes still on the roadmap or have priorities changed after a year?

@daleharvey
Copy link
Member

My primary goal in #4984 is to dramatically speed up pouchdb-find queries

@nolanlawson
Copy link
Member Author

Seems we can close this issue now that idb-next is well underway

@gr2m
Copy link
Contributor

gr2m commented Dec 13, 2016

Is there another issue we can watch or a place that shows the progress on idb-next? Sorry if I missed something, and thanks for all the great work 🙌

@daleharvey
Copy link
Member

Its @ #5207, its basically the pouchdb-find implementation that needs done then to worry about migrations and verify performance etc. Been hammered at work recently but hoping to get on it soon, very happy if anyone else wants to take a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature request
Projects
None yet
Development

No branches or pull requests

6 participants