
Make #stmt.each async, and use stream/pipe #686

Open
Aminadav opened this issue Aug 10, 2016 · 17 comments

@Aminadav

Aminadav commented Aug 10, 2016

I want to handle 10,000,000 rows.
Each row will take 1 second to complete.

I have two questions:

  1. Does node-sqlite3 load all query results into memory, or does it stream them from disk row by row each time the callback is called?
  2. What do you think about implementing an async version of the callback? When there are async tasks, it needs to wait until the task is finished:

      stmt.each('SELECT * FROM very_big_table', function (err, row, next) {
          setTimeout(function doAsyncTask() {
              console.log('When the task completes (2 seconds), I want to move to the next record');
              next();
          }, 2000);
      });

By the way, congratulations:
I have permissions and created a new node-sqlite3 tag on Stack Overflow:

http://stackoverflow.com/questions/tagged/node-sqlite3

@tmcw
Contributor

tmcw commented Oct 3, 2016

Does node-sqlite3 load all query results into memory, or does it stream them from disk row by row each time the callback is called?

.each will give you rows one by one; it loads each row as it invokes the callback.

Happy to accept a PR implementing a next() callback.

@tracker1

tracker1 commented Oct 7, 2016

How about this: when .each calls the handler, if the handler returns a thenable (Promise), .each waits for that promise to resolve before calling the handler again?

db.each('...', function (err, item) {
  return new Promise((resolve, reject) => {
    // ... do async stuff and resolve at the end ...
  });
});

This would also work with an async function / fat-arrow handler:

db.each('...', async (err, item) => {
  if (err) throw err;
}).then(() => console.log('success'), (err) => console.error(err));

This assumes each returns a Promise and waits on the handler's promise for each iteration.
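
For contrast, a minimal sketch against the current, unmodified API that only wraps each's completion callback in a Promise (eachDone is a hypothetical helper name). It resolves when iteration finishes, but it cannot pause between rows; per-row waiting is exactly what the proposal above would add:

// Promise wrapper around the existing Database#each completion callback.
// Resolves with the row count once iteration finishes; it cannot make
// each wait on a per-row promise (that is the feature proposed above).
function eachDone(db, sql, onRow) {
  return new Promise((resolve, reject) => {
    db.each(sql, (err, row) => {
      if (err) return reject(err);
      onRow(row);
    }, (err, count) => (err ? reject(err) : resolve(count)));
  });
}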

@tmcw
Copy link
Contributor

tmcw commented Oct 7, 2016

Fine with either as a PR; the difference between expressing this as async, promises, or callbacks seems slim. node-sqlite3 currently supports Node v0.10 (which doesn't support promises), but we're dropping that compatibility at the end of the month.

@erikman

erikman commented Oct 31, 2016

Is anyone working on a PR for this? Otherwise I might have a go at it.

It hasn't been mentioned here, so I just wanted to point out that you can use Database#prepare and Statement#get to create your own each function that supports promises or a done callback (I have used a similar strategy for creating a stream). However, the performance of Statement#get seems to be about five times slower than Database#each (see https://gist.github.com/erikman/a494925ae6ce95869dd56076fb810831 for my tests), so hopefully an async each will be much faster. :)
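
A minimal sketch of that wrapper idea, using only the documented Database#prepare, Statement#get, and Statement#finalize calls (eachAsync is a hypothetical name, not part of the library). Each call to stmt.get fetches the next row, and undefined marks the end of the result set:

// Promise-aware "each" built on repeated Statement#get stepping.
function eachAsync(db, sql, handler) {
  return new Promise((resolve, reject) => {
    const stmt = db.prepare(sql);
    (function step() {
      stmt.get((err, row) => {
        if (err) return reject(err);
        if (row === undefined) return stmt.finalize(() => resolve()); // no more rows
        // wait for the handler (plain value or promise) before stepping again
        Promise.resolve(handler(row)).then(step, reject);
      });
    })();
  });
}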

@erikman

erikman commented Nov 1, 2016

I couldn't stop myself and had a look through the sources yesterday, and it seems more complicated to implement than I first anticipated. Here are my findings; please correct any mistakes in my reasoning (I haven't looked at libuv before either, but I do know my way around general threading issues):

The each function works by starting a worker thread that iterates through the results and posts them back to the main thread. While an each is executing, no other SQL statements will run; they are just scheduled to run later. If we make each asynchronous, we need to be able to stop the worker and resume it later; otherwise a callback that waits for another SQL statement can create a deadlock.

However, breaking the each worker into multiple worker segments may introduce other hazards, because then code like

stmt.each(....).finalize()

will have trouble scheduling all the each callbacks before the finalize call (I guess there are ways around it, but it feels error-prone).

So I don't know what your feelings are, but to remain backwards compatible I don't think modifying each is the best way forward (the behavior would differ quite a bit depending on whether the callback is async or not).

I created a new getMultiple(maxResults, cb) that works like a mix between get and all, sharing the synchronization cost across multiple rows: https://github.com/erikman/node-sqlite3/tree/get-multiple. Using it I can create a stream that is closer to the performance of each (but not quite there: 120 ms vs 150 ms for iterating over 50,000 rows from a memory backend). I'll put up the code as an example later; it needs some refactoring, as it has some Promise wrappers that are not really needed.
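
For illustration, a sketch of what such a stream could look like. getMultiple only exists in the fork linked above, and I'm assuming its callback receives (err, rows) with an empty array at the end of the result set:

// Hypothetical object-mode Readable over the fork's getMultiple(maxResults, cb).
const { Readable } = require('stream');

function rowStream(stmt, batchSize = 100) {
  let fetching = false;
  return new Readable({
    objectMode: true,
    read() {
      if (fetching) return; // avoid overlapping fetches
      fetching = true;
      stmt.getMultiple(batchSize, (err, rows) => {
        fetching = false;
        if (err) return this.destroy(err);
        if (rows.length === 0) return this.push(null); // end of results
        for (const row of rows) this.push(row);
      });
    }
  });
}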

What are your thoughts on this idea?

@ghost

ghost commented Dec 17, 2016

Does node-sqlite3 load all query results into memory, or does it stream them from disk row by row each time the callback is called?

From my own testing, stmt.each loads all results into memory. As soon as I step over stmt.each, my memory spikes to 2.2 GB (and then the process throws a fatal out-of-memory error). So while the API lets you "step" through the rows, AFAICT the data is loaded all at once.

@erikman

erikman commented Dec 18, 2016

@paolo-tanium It depends on how fast your JavaScript handler is and on the time allocation between the threads. each kicks off a worker thread that loads rows as fast as possible from the db storage and then sends the results back as (batched, asynchronous) messages to the JavaScript thread where the handler is running.

So if your handler is fast enough (faster than loading rows from the db), you shouldn't see a memory spike. On the other hand, if the handler is slow, the rows wait in memory until the handler can process them (and you see a memory spike).

At least in theory; in practice the OS might give the loader thread lots of time before the handler is invoked, and then everything gets loaded into memory even with a fast handler. So you are right: with a huge number of rows you can't use each without risking that all rows are loaded into memory before they are processed, especially with a memory backend.

@mhat

mhat commented Mar 3, 2017

@erikman was there any further progress on getMultiple? I'm experimenting with your fork, which seems to work.

@erikman

erikman commented Mar 15, 2017

@mhat I haven't received any comments on it, so I don't really know what to do with it. It works fine for me, and I get a performance boost compared to using get. I have used it to create both a stream and an each function that works with promises.

Performance is somewhat slower than each; my assumption is that some time is lost while getMultiple waits for the initial array to fill before returning it (I have run multiple tests, kicking off the next getMultiple immediately to minimize latency and varying the size of the fetched blocks).

So I don't know whether getMultiple is the best name, or whether (and where) I should add my JavaScript eachAsync implementation (or stream).

@pleerock

Are you guys talking about streaming support here? I can't believe this package doesn't support streaming.

@erikman

erikman commented Jun 26, 2017

@pleerock Yes, tangentially :) You can already make your own streams using the Statement#get method; the #getMultiple method discussed above improves the performance. I just published my repo for working with sqlite: https://github.com/erikman/sqlutil, which contains a sqlite stream (and some other stuff you probably aren't interested in ;). It uses #getMultiple if available and otherwise falls back to ordinary #get.

@pleerock

Sounds great. I would like it to be part of the official package's functionality. Maybe instead of a separate package you could open a PR that the author can accept. Why shouldn't streaming be part of the sqlite package?

@erikman

erikman commented Jun 30, 2017

I agree that this is functionality the official package should provide. When I started the conversation in this thread I was looking for architectural input from the maintainers and haven't received any yet, so I won't put time into a PR for now.

@jimmywarting

jimmywarting commented Jun 8, 2019

I would like an async iterator:

for await (let row of db.each(query)) {
  // code
}
// done

You would be in control of requesting more data only when it's needed if it returned an async generator:

class x {
  // async path: fetch the next row lazily, only when the consumer asks for it
  async * [Symbol.asyncIterator]() {
    while (have_more) {        // placeholder: is there another row?
      yield await get_next()   // placeholder: fetch the next row
    }
  }

  // sync path over results that are already in memory
  * [Symbol.iterator]() {
    yield * result_array       // placeholder: the materialized rows
  }
}
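
For a concrete version against today's API, this can be combined with the Statement#get stepping described earlier in the thread (rowIterator is a hypothetical helper, not library API):

// Hypothetical async iterable built on node-sqlite3's Statement#get,
// which returns undefined once the result set is exhausted.
function rowIterator(db, sql) {
  const stmt = db.prepare(sql);
  const getRow = () => new Promise((resolve, reject) =>
    stmt.get((err, row) => (err ? reject(err) : resolve(row))));
  return {
    async * [Symbol.asyncIterator]() {
      try {
        let row;
        while ((row = await getRow()) !== undefined) yield row;
      } finally {
        stmt.finalize(); // release the statement even on an early break
      }
    }
  };
}

// usage:
// for await (const row of rowIterator(db, 'SELECT * FROM very_big_table')) { ... }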

@wmertens
Contributor

wmertens commented Jun 8, 2019 via email

@sujatha97714

I came across the same scenario, where I needed to fetch a large number of rows, and went with db#each. Later I realized that it still accumulates rows, giving the same result as db#all performance-wise. After spending some time on exploration and experiments, I found a workaround using Statement#get: prepare the statement and call stmt#get to retrieve a row; once the callback has executed, call stmt#get again for the next row; repeat in a loop until it returns undefined (marking the end of the results). Of course, this is much slower than db#all and db#each, but it saves your code from crashing with an out-of-memory exception, especially when dealing with large data.
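
A minimal sketch of that loop (processRow stands in for your per-row work):

// Step through rows one at a time with Statement#get; each call fetches
// the next row, and undefined marks the end of the results.
const stmt = db.prepare('SELECT * FROM very_big_table');
(function nextRow() {
  stmt.get((err, row) => {
    if (err) throw err;
    if (row === undefined) return stmt.finalize(); // done
    processRow(row); // hypothetical per-row work
    nextRow();       // fetch the next row only after handling this one
  });
})();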

@wmertens
Contributor

@sujatha97714 If this is a problem for you, you'd be better off with better-sqlite3. It will be way faster, but it's not async.
