
Make #stmt.each async, and use stream/pipe #686

Open
Aminadav opened this issue Aug 10, 2016 · 17 comments

@Aminadav

Aminadav commented Aug 10, 2016

I want to handle 10,000,000 rows.
Each row will take 1 second to complete.

I have two questions:

  1. Does node-sqlite3 load all query results into memory, or does it stream them from disk row by row each time the callback is called?
  2. What do you think about implementing an async version of the callback? When there are async tasks, it needs to wait until the task is finished:

      stmt.each('SELECT * FROM very_big_table', function (err, row, next) {
          setTimeout(function doAsyncTask() {
              console.log('When the task completes (2 seconds), I want to move to the next record');
              next();
          }, 2000);
      });

By the way, congratulations:
I have permissions and created a new node-sqlite3 tag on Stack Overflow:

http://stackoverflow.com/questions/tagged/node-sqlite3

@tmcw
Contributor

tmcw commented Oct 3, 2016

Does node-sqlite3 load all query results into memory, or does it stream them from disk row by row each time the callback is called?

.each will give you rows one by one; it loads each row as it invokes the callback.

Happy to accept a PR implementing a next() callback.

@tracker1

tracker1 commented Oct 7, 2016

How about this: when .each calls the handler, if the handler returns a thenable (Promise), .each waits for that promise to resolve before calling the handler again?

db.each('...', function (err, item) {
  return new Promise((resolve, reject) => {
    // ... do async stuff and resolve at the end ...
  });
});

This would also work with an async function / fat-arrow handler:

db.each('...', async (err, item) => {
  if (err) throw err;
}).then(() => console.log('success'), (err) => console.error(err));

This assumes each returns a Promise and waits on the handler's promise for each iteration.
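
For contrast, a minimal sketch against the current, unmodified API that only wraps each's completion callback in a Promise (eachDone is a hypothetical helper name). It resolves when iteration finishes, but it cannot pause between rows; per-row waiting is exactly what the proposal above would add:

// Promise wrapper around the existing Database#each completion callback.
// Resolves with the row count once iteration finishes; it cannot make
// each wait on a per-row promise (that is the feature proposed above).
function eachDone(db, sql, onRow) {
  return new Promise((resolve, reject) => {
    db.each(sql, (err, row) => {
      if (err) return reject(err);
      onRow(row);
    }, (err, count) => (err ? reject(err) : resolve(count)));
  });
}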

@tmcw
Copy link
Contributor

tmcw commented Oct 7, 2016

Fine with either as a PR; the difference between expressing this as async, promises, or callbacks seems slim. node-sqlite3 currently supports Node v0.10 (which doesn't support promises), but we're dropping that compatibility at the end of the month.

@erikman

erikman commented Oct 31, 2016

Is anyone working on a PR for this? Otherwise I might have a go at it.

It hasn't been mentioned here, so I just wanted to point out that you can use Database#prepare and Statement#get to create your own each function that supports promises or a done callback (I have used a similar strategy for creating a stream). However, the performance of Statement#get seems to be about five times slower than Database#each (see https://gist.github.com/erikman/a494925ae6ce95869dd56076fb810831 for my tests), so hopefully an async each will be much faster. :)
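
A minimal sketch of that wrapper idea, using only the documented Database#prepare, Statement#get, and Statement#finalize calls (eachAsync is a hypothetical name, not part of the library). Each call to stmt.get fetches the next row, and undefined marks the end of the result set:

// Promise-aware "each" built on repeated Statement#get stepping.
function eachAsync(db, sql, handler) {
  return new Promise((resolve, reject) => {
    const stmt = db.prepare(sql);
    (function step() {
      stmt.get((err, row) => {
        if (err) return reject(err);
        if (row === undefined) return stmt.finalize(() => resolve()); // no more rows
        // wait for the handler (plain value or promise) before stepping again
        Promise.resolve(handler(row)).then(step, reject);
      });
    })();
  });
}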

@erikman

erikman commented Nov 1, 2016

I couldn't stop myself and had a look through the sources yesterday, and it seems more complicated to implement than I first anticipated. Here are my findings; please correct any mistakes in my reasoning (I haven't looked at libuv before either, but I do know my way around general threading issues):

The each function works by starting a worker thread that iterates through the results and posts them back to the main thread. While an each is executing, no other SQL statements will run; they are just scheduled to run later. If we make each asynchronous, we need to be able to stop the worker and resume it later; otherwise a callback that waits for another SQL statement can create a deadlock.

However, breaking the each worker into multiple worker segments may introduce other hazards, because then code like

stmt.each(....).finalize()

will have trouble scheduling all the each callbacks before the finalize call (I guess there are ways around it, but it feels error-prone).

So I don't know what your feelings are, but to remain backwards compatible I don't think modifying each is the best way forward (the behavior would differ quite a bit depending on whether the callback is async or not).

I created a new getMultiple(maxResults, cb) that works like a mix between get and all, sharing the synchronization cost across multiple rows: https://github.com/erikman/node-sqlite3/tree/get-multiple. Using it I can create a stream that is closer to the performance of each (but not quite there: 120 ms vs 150 ms for iterating over 50,000 rows from a memory backend). I'll put up the code as an example later; it needs some refactoring, as it has some Promise wrappers that are not really needed.
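
For illustration, a sketch of what such a stream could look like. getMultiple only exists in the fork linked above, and I'm assuming its callback receives (err, rows) with an empty array at the end of the result set:

// Hypothetical object-mode Readable over the fork's getMultiple(maxResults, cb).
const { Readable } = require('stream');

function rowStream(stmt, batchSize = 100) {
  let fetching = false;
  return new Readable({
    objectMode: true,
    read() {
      if (fetching) return; // avoid overlapping fetches
      fetching = true;
      stmt.getMultiple(batchSize, (err, rows) => {
        fetching = false;
        if (err) return this.destroy(err);
        if (rows.length === 0) return this.push(null); // end of results
        for (const row of rows) this.push(row);
      });
    }
  });
}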

What are your thoughts on this idea?

@ghost

ghost commented Dec 17, 2016

Does node-sqlite3 load all query results into memory, or does it stream them from disk row by row each time the callback is called?

From my own testing, stmt.each loads all results into memory. As soon as I step over stmt.each, my memory spikes to 2.2 GB (and then the process throws a fatal out-of-memory error). So while the API lets you "step" through the rows, AFAICT the data is loaded all at once.

@erikman

erikman commented Dec 18, 2016

@paolo-tanium It depends on how fast your JavaScript handler is and on the time allocation between the threads. each kicks off a worker thread that loads rows as fast as possible from the db storage and then sends the results back as (batched, asynchronous) messages to the JavaScript thread where the handler is running.

So if your handler is fast enough (faster than loading rows from the db), you shouldn't see a memory spike. On the other hand, if the handler is slow, the rows wait in memory until the handler can process them (and you see a memory spike).

At least in theory; in practice the OS might give the loader thread lots of time before the handler is invoked, and then everything gets loaded into memory even with a fast handler. So you are right: with a huge number of rows you can't use each without risking that all rows are loaded into memory before they are processed, especially with a memory backend.

@mhat

mhat commented Mar 3, 2017

@erikman was there any further progress on getMultiple? I'm experimenting with your fork, which seems to work.

@erikman

erikman commented Mar 15, 2017

@mhat I haven't received any comments on it, so I don't really know what to do with it. It works fine for me, and I get a performance boost compared to using get. I have used it to create both a stream and an each function that works with promises.

Performance is somewhat slower than each; my assumption is that some time is lost while getMultiple waits for the initial array to fill before returning it (I have run multiple tests, kicking off the next getMultiple immediately to minimize latency and varying the size of the fetched blocks).

So I don't know whether getMultiple is the best name, or whether (and where) I should add my JavaScript eachAsync implementation (or stream).

@pleerock

Are you guys talking about streaming support here? I can't believe this package doesn't support streaming.

@erikman

erikman commented Jun 26, 2017

@pleerock Yes, tangentially :) You can already make your own streams using the Statement#get method; the #getMultiple method discussed above improves the performance. I just published my repo for working with sqlite: https://github.com/erikman/sqlutil, which contains a sqlite stream (and some other stuff you probably aren't interested in ;). It uses #getMultiple if available and otherwise falls back to ordinary #get.

@pleerock

Sounds great. I would like it to be part of the official package's functionality. Maybe instead of a separate package you could open a PR that the author can accept. Why shouldn't streaming be part of the sqlite package?

@erikman

erikman commented Jun 30, 2017

I agree that this is functionality the official package should provide. When I started the conversation in this thread I was looking for architectural input from the maintainers and haven't received any yet, so I won't put time into a PR for now.

@jimmywarting

jimmywarting commented Jun 8, 2019

I would like an async iterator:

for await (let row of db.each(query)) {
  // code
}
// done

You would be in control of requesting more data only when it's needed if it returned an async generator:

class x {
  // async path: fetch the next row lazily, only when the consumer asks for it
  async * [Symbol.asyncIterator]() {
    while (have_more) {        // placeholder: is there another row?
      yield await get_next()   // placeholder: fetch the next row
    }
  }

  // sync path over results that are already in memory
  * [Symbol.iterator]() {
    yield * result_array       // placeholder: the materialized rows
  }
}
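
For a concrete version against today's API, this can be combined with the Statement#get stepping described earlier in the thread (rowIterator is a hypothetical helper, not library API):

// Hypothetical async iterable built on node-sqlite3's Statement#get,
// which returns undefined once the result set is exhausted.
function rowIterator(db, sql) {
  const stmt = db.prepare(sql);
  const getRow = () => new Promise((resolve, reject) =>
    stmt.get((err, row) => (err ? reject(err) : resolve(row))));
  return {
    async * [Symbol.asyncIterator]() {
      try {
        let row;
        while ((row = await getRow()) !== undefined) yield row;
      } finally {
        stmt.finalize(); // release the statement even on an early break
      }
    }
  };
}

// usage:
// for await (const row of rowIterator(db, 'SELECT * FROM very_big_table')) { ... }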

@wmertens
Contributor

wmertens commented Jun 8, 2019 via email

@sujatha97714

I came across the same scenario, where I needed to fetch a large number of rows, and went with db#each. Later I realized that it still accumulates rows, giving the same result as db#all performance-wise. After spending some time on exploration and experiments, I found a workaround using Statement#get: prepare the statement and call stmt#get to retrieve a row; once the callback has executed, call stmt#get again for the next row; repeat in a loop until it returns undefined (marking the end of the results). Of course, this is much slower than db#all and db#each, but it saves your code from crashing with an out-of-memory exception, especially when dealing with large data.
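
A minimal sketch of that loop (processRow stands in for your per-row work):

// Step through rows one at a time with Statement#get; each call fetches
// the next row, and undefined marks the end of the results.
const stmt = db.prepare('SELECT * FROM very_big_table');
(function nextRow() {
  stmt.get((err, row) => {
    if (err) throw err;
    if (row === undefined) return stmt.finalize(); // done
    processRow(row); // hypothetical per-row work
    nextRow();       // fetch the next row only after handling this one
  });
})();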

@wmertens
Contributor

@sujatha97714 If this is a problem for you, you'd be better off with better-sqlite3. It will be way faster, but it's not async.
