[stream] - callback based API additions #153

Closed

spion opened this issue Dec 12, 2014 · 15 comments
@spion

spion commented Dec 12, 2014

Streams don't have an API that adheres to the node callback convention. This makes them awkward to use with other control flow tools. The following two methods would be the most useful:

readAll([encoding], function(err, data) {}) - on readable streams it collects all data, or is called with the first error
ended(function(err) {}) - Like the previous one, except it doesn't collect any data - it's just called when everything ends (either successfully or at the first error). finished could be used for writable streams.

Both of the above would act as if an .on('error', cb) was attached.

The following would also be useful:
readNext(function(err, data) { ... }) - Read the next packet or object. The callback is called with the current packet on the "next tick", with the "current packet" changing for subsequent requests right before the callback is called.

Would act as if a .once('error', cb) was attached

And the last method would also be useful, but I'm unsure of the semantics:

writeNext(data, [enc], function(err) { ...}) - The callback would be called when the source is allowed to continue writing. Now I'm not entirely sure what this would mean, but I suppose that for streams with buffers, this would mean "called immediately" if the buffer is not full and called after flush if it is, and for streams without buffers this would be called after flush.

This should be taken as a rough sketch, as I'm not sure I understand enough about node streams internals to give a good proposal :)
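
To illustrate, here's a rough sketch of how the first two might look in use (both methods are hypothetical, obviously):

var fs = require('fs');

// hypothetical: collect everything, or get called with the first error
fs.createReadStream('index.js').readAll('utf8', function (err, data) {
  if (err) return console.error(err);
  console.log('read %d chars', data.length);
});

// hypothetical: no data collection, just end-or-first-error
fs.createReadStream('index.js').ended(function (err) {
  if (err) return console.error(err);
  console.log('stream ended cleanly');
});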

edit: sorry for the close/open noise, I accidentally hit the button.

@phpnode

phpnode commented Dec 12, 2014

@spion doesn't your first example negate many of the benefits of using streams? i.e. if you have to buffer the whole response, what is gained by using streams at all? Also, these things could easily be userland functions, so why not create a module and see how it works?

@spion
Author

spion commented Dec 12, 2014

Answering point 1, I don't think so. The reality is that there are valid use cases with relatively small data where the costs (in complexity, and in performance due to incremental processing) far outweigh the benefits. Just look at all the http request modules - the most popular ones always provide ways to collect all the data.
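
For reference, this is the pattern those modules wrap, hand-rolled on top of core (example.com is just a placeholder):

var http = require('http');

http.get('http://example.com/', function (res) {
  var chunks = [];
  res.on('data', function (chunk) { chunks.push(chunk); });
  res.on('end', function () {
    var body = Buffer.concat(chunks);
    console.log('collected %d bytes', body.length);
  });
  res.on('error', function (err) { console.error(err); });
});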

Regarding point 2, I'm experimenting with a module that does some of this (and more) with promises. However I still believe that core should at least be self-consistent.

edit: I'll try and make a module that implements the methods and test out some of the use cases I have in mind.

@phpnode

phpnode commented Dec 12, 2014

@spion cool, looking forward to seeing your module. I agree about the lack of consistency but I think node should be looking to (eventually) move to the whatwg streams spec which has pretty good promise integration baked in.

@spion
Author

spion commented Dec 12, 2014

@phpnode I agree. I'm just suggesting a relatively cheap (in terms of complexity / backward compatibility) ergonomic improvement that could be done in the meantime. (I want to be able to tell bluebird users to promisifyAll their stream prototypes, haha)

@benjamingr
Member

+1 this is an extremely common case that should be solved at the core level.

Most other platforms that provide streams have this method and it is immensely useful for a lot of use cases.

@zensh

zensh commented Dec 12, 2014

thunk-stream wraps a stream into a thunk function:

var thunkStream = require('thunk-stream'),
  stream = require('stream'),
  fs = require('fs');

var readableStream = fs.createReadStream('index.js');
var passStream = new stream.PassThrough();

thunkStream(readableStream)(function (error) {
  if (error) console.error('error', error);
  else console.log('read file end.');
});

thunkStream(passStream)(function (error) {
  if (error) console.error('error', error);
  else console.log('file pass through finished.');
});

readableStream.pipe(passStream);

@chrisdickinson
Contributor

Streams don't have an API that adheres to the node callback convention.

Streams use callbacks where the operation in question is representable by a callback. A callback can only represent a single operation that eventually passes or fails. For instance, writable.write(data, enc, callback) uses a callback to represent the operation of having flushed data to the underlying resource, which may only happen once per .write, pass or fail.

As a counterexample, .read(n) does not represent a single read-from-resource operation, due to the highwatermark buffering nature of streams2+ -- a single .read may correspond to zero or more "chunks" of data being passed through the stream. There's no way for consumers of streams to know when the _read resulting from read has stopped .push'ing data.
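
A quick sketch of that contrast (the file path is just a placeholder):

var fs = require('fs');

var ws = fs.createWriteStream('/tmp/example.txt');

// one .write, one callback: the chunk is flushed (or fails) exactly once
ws.write('hello', 'utf8', function (err) {
  if (err) return console.error('write failed', err);
  console.log('chunk flushed to the underlying resource');
});

// .read(n) has no callback form: what it returns comes out of the
// internal buffer and may span zero or more _read/.push cycles
var rs = fs.createReadStream('/tmp/example.txt');
rs.on('readable', function () {
  var chunk;
  while ((chunk = rs.read()) !== null) {
    console.log('got %d bytes', chunk.length);
  }
});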

readAll([encoding], function(err, data) {}) - on readable streams it collects all data, or is called with the first error

Because this can be done in userland with concat-stream, and there's currently no need to collect and buffer an entire stream internally in node, this is not a viable addition to the streams API. Additions to this API should either provide functionality unattainable through userland, or functionality that is directly needed by other core modules.
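
For reference, the userland equivalent with concat-stream would be something like this (note that concat-stream leaves error handling to the caller):

var concat = require('concat-stream');
var fs = require('fs');

var rs = fs.createReadStream('index.js');
rs.on('error', function (err) { console.error(err); });
rs.pipe(concat(function (data) {
  console.log('collected %d bytes', data.length);
}));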

ended(cb) is a more viable candidate, since it represents a result that may only happen once, pass or fail. However, it conflates two events that are currently emitted by readable streams. Since Streams are (for better or worse) implemented as event emitters, and their .pipe mechanism relies on those two events separately, it is unlikely that an alternate API to get to that state will be added.

Re: readNext and writeNext -- The above explanation of why .read lacks a callback should apply to readNext as well, but I confess I'm not sure what use case these methods serve.

I agree about the lack of consistency but I think node should be looking to (eventually) move to the whatwg streams spec which has pretty good promise integration baked in.

I'm definitely interested in that approach, but unfortunately the whatwg spec is built around promises-as-a-core-primitive, so whatever (non-trivial) amount of work necessary to properly support promises as the primitive representing single-operation-pass-or-fail in node core would have to happen first. i.e., adopting the streams spec (or even cribbing from it) would be great, but it is a very long way off, IMO.

@spion
Author

spion commented Dec 13, 2014

Because this can be done in userland with concat-stream, and there's currently no need to collect and buffer an entire stream internally in node, this is not a viable addition to the streams API. Additions to this API should either provide functionality unattainable through userland, or functionality that is directly needed by other core modules.

I disagree. As it is, the current node API is in a no-person's land. It's neither

  • completely low-level / unopinionated, nor
  • ergonomic, well-crafted and self-consistent.

Since the API can't be made more low-level and less complex without breaking backward compatibility, features that don't cost much yet add immense benefit for consumers should be considered even if they don't provide any benefit to core. Streams are already a "batteries included" feature - so why not provide decent batteries while we're at it? Otherwise I don't see the point.

That raises another interesting question: what is the actual strategy regarding the current core library? Will it be deprecated in favor of libuv.js, or will it be developed further?

Re: readNext and writeNext -- The above explanation of why .read lacks a callback should apply to readNext as well, but I confess I'm not sure what use case these methods serve.

readNext and writeNext aren't really useful for current use cases of streams, but I was thinking they could probably be useful to make streams usable as channels for CSP. On second thought, they're not really designed to do that, so it's probably a bad idea.

ended(cb) is a more viable candidate, since it represents a result that may only happen once, pass or fail. However, it conflates two events that are currently emitted by readable streams. Since Streams are (for better or worse) implemented as event emitters, and their .pipe mechanism relies on those two events separately, it is unlikely that an alternate API to get to that state will be added.

You could also argue that fs.readFile(..., (err, data) => ...) starts a reading operation and conflates two separate events, error and data arrival. That doesn't make the API format of readFile less useful or practical.

I don't understand how this would interfere with pipe. Afaik, anyone can query the stream for the error or end event. Is that wrong? If it's not, IMO this is the single biggest improvement that the streams API could get, at least for end-users. I'd like to be able to do this:

var s1 = stream1(opts);
var s2 = stream2(opts);
var s3 = stream3(opts);
s1.pipe(s2).pipe(s3);
async.parallel([s1.ended.bind(s1), s2.ended.bind(s2), s3.ended.bind(s3)], done);

@sonewman
Contributor

So do you mean from your example that each stream would need to define an ended method? Because having multiple things happening on ended would then become very manual, unless we still have the event listener pattern.

Also this would work very well for people using the async library! But what if you don't want to use that library? Would this not have sufficed:

var s1 = stream1(opts);
var s2 = stream2(opts);
var s3 = stream3(opts);
s1.pipe(s2).pipe(s3);
s3.on('finish', done);

Unless we inherently needed to inform the next stream that the previous one had ended (which it already knows about internally anyway).

It seems like we are missing the huge benefits of our stream abstractions. OK, so for a simple http server responding with a blob of html it's not so important, and we probably want to provide a simpler solution for that use case since it is much more common - but this has little to do with streams.

To reiterate, the point with streams is that the data is not dealt with as one whole chunk; otherwise one would just use or provide an async callback API.

A stream allows a way of handling a series of async events returning data - as described by many before me, an array through time. It doesn't really matter how big or small those chunks are; in some cases the stream may never end, because it is a socket and it is open forever.

Similarly, doing readable.on('data', data => { }) is a lot like doing [1, 2, 3, 4].forEach(no => { }). We say forEach takes a callback, but it doesn't just happen once. It's like saying we expect forEach's callback to be called only once, containing all the values in the array (isn't that what we started with?).

OK, so in other languages we can do the equivalent of readAll, but we also have threads, coroutines, forks, etc. We can afford to do infinite loops until the stream ends or errors because, although it appears linear when written, behind the scenes threads and schedulers are parallelising all of this for us.

Don't get me wrong, having a way to do light-weight threading (in our case an isolated, super light-weight JS runtime) would be awesome; however, the single-threaded runtime is what currently makes iojs / Node what it is!

This kind of defeats the whole flexibility of JavaScript's asynchronous model.

There are also multiple ways to buffer a whole stream, some of which are async and others which would block the rest of the runtime.

IMO it's like we are taking two steps backwards from everything that iojs / Node has given us with its evented (process.nextTick/setImmediate throw this onto the 'list') model.

What we need is to agree on an abstraction that handles asynchrony, and internals that deal with it.

with regards to:

I disagree. As it is, the current node API is in a no-person's land. It's neither

completely low-level / unopinionated, nor
ergonomic, well-crafted and self-consistent.

It may not be as simple as a callback, but it is pretty low-level, handling data in such granular pieces; that is the point.
It is not most people's use case, because their use case is at a much higher level.

Ergonomically, perhaps it is a little weird that we have to inherit from a stream and override an implementation detail which is then in turn called by the internals - not a pattern we would all be used to.

No one is saying that the streams API is perfect, but I strongly disagree that it is not well crafted. @isaacs et al. put a great deal of time and work into forming what we currently know as streams, and what we have now is a pattern that makes all streams following that pattern automatically compatible with one another, which IMO is awesome!

@chrisdickinson
Contributor

Streams are already a "batteries included" feature - so why not provide decent batteries while we're at it?

Streams are not a "batteries included" feature. They are an abstraction core uses internally to represent the /data*((end|error)?close)?/ lifecycle -- one that will always be required by core. The current implementation has features that appear to be "batteries included" at first glance: for instance, highwatermark buffering. However, their raison d'être is not to provide a "batteries included" API; they exist because other parts of the core API required that implementation for performance reasons.

That raises another interesting question: what is the actual strategy regarding the current core library?

libuv.js would become a dependency of node/iojs, one which the current core library is built on top of. Alternate implementations of core functionality could be built in pure JS.

However, since there's a massive package ecosystem built around the current set of abstractions, there's going to be a strong incentive to stay compatible with what exists at present. In other words, libuv.js won't solve the problem of "how to evolve core" so much as it will make maintaining core more straightforward, and open up the possibility of building different abstractions on top of libuv to more people. Those abstractions are not guaranteed to be "npm-ecosystem-compatible."

You could also argue that fs.readFile(..., (err, data) => ...) starts a reading operation and conflates two separate events, error and data arrival. That doesn't make the API format of readFile less useful or practical.

An errored stream is not the same as an errored callback operation. A stream may have a backing resource which may be inspected on error -- for example, to inspect the buffered items, or to look at a TCP socket's remote address -- before disposal. A callback operation can only represent the final state of the entire transaction.

For example:

     fs.createReadStream
                           finish/
     flow    data   data  end/error    close
      |       |      |       |          |
      V       V      V       V          V
 time ------------------------------------->
      ^                                 ^
      |                                 |
     fs.readFile                      (err, data)

* "finish" is the writable equivalent to the readable "end" event.

Callbacks can't represent the temporal difference between "end/error" and "close". .pipe's machinery relies on being able to differentiate between close, end, error, and finish. This is not to say that callbacks are not useful abstractions, but instead that they are not, by themselves, sufficient for all purposes (NB: s/callbacks/promises/g applies equally well here).

I don't understand how this would interfere with pipe.

It's not that .ended(cb) would interfere with .pipe, it's that it could not replace the operations that .pipe currently relies upon (notably, being able to differentiate close/error/end temporally) -- it could only be implemented as sugar on top of what's there.

One other consideration to keep in mind is that it's not trivial to expand the API surface area of streams reliably: for instance, imagine getting a stream from a package that has a pegged version of readable-stream. The .ended method will not be available on it. There's not a single streams interface -- there are four in core ([v0.10, v0.12].length * 2), and 40 (!) releases of readable-stream. Maintaining compatibility and consistency between these implementations is a delicate process.

Closing this issue for now:

  1. readAll is solved by userland's concat-stream package, and no core module currently requires that functionality.
  2. ended(cb) is API expansion that would be difficult to ensure across versions of readable-stream and builtin streams, and could not replace existing mechanisms used by .pipe.

If you'd like to discuss further, feel free to comment on this issue or find me in IRC on freenode (in #io.js) — I'm always happy to talk streams!

@sonewman
Contributor

@chrisdickinson That ascii diagram is a perfect explanation, and should be in the documentation somewhere, if it is not already!

@spion
Author

spion commented Dec 14, 2014

var s1 = stream1(opts);
var s2 = stream2(opts);
var s3 = stream3(opts);
s1.pipe(s2).pipe(s3);
s3.on('finish', done);

And this is precisely the problem with core streams. The example looks perfectly fine - after all, it's no different from the examples given on the main page. But it contains one major flaw: your process will crash and you will have absolutely no idea why, because the stack trace will contain no useful info whatsoever. Guess why? Because you forgot to add any error handling.

I've seen this happen over and over again when people use streams. I've done it several times before, just because it's awkward to handle all the errors. To me this indicates that streams are quite user-hostile in their current state. (By the way, the echo server example given on the main nodejs website will crash on error too.)
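
A minimal reproduction of that crash (any stream that emits error without a listener attached will do):

var fs = require('fs');

var rs = fs.createReadStream('no-such-file.txt');
rs.resume();
// no 'error' listener is attached, so the EventEmitter throws the
// ENOENT error and the process crashes with an unhelpful stack trace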

Let's write the proper stateful, ugly wrapper for this:

function pipeline(done_) {
    var isDone = false;
    function done(err) {
        if (isDone) return;
        isDone = true;
        done_(err); // call the original callback, at most once
    }

    var s1 = stream1(opts);
    var s2 = stream2(opts);
    var s3 = stream3(opts);
    s1.pipe(s2).pipe(s3);
    s3.on('finish', done);
    s1.on('error', done);
    s2.on('error', done);
    s3.on('error', done);
}

Of course, you have to be careful not to call that callback more than once. But you can't use the regular control flow abstractions from userland that ensure that for you, because the square pegs of streams won't fit the round holes of callbacks.

As far as I can see, the close event happening after the callback has no relation to this whatsoever, as this is just a basic usability enhancement feature.

But yeah sure, I suppose we can solve all this in userland.

@sonewman
Contributor

Either way the situation would contain the same complexity; the point is that what you are looking for is something to abstract all of that away from you.

The ended method call idea is a leaky abstraction, because even if the API did not warrant implementation by the developer, the developer would still need to know that the method will be called in the stream lifecycle. In addition, having that method be 'public' (not _-prefixed) suggests that it can be called manually. What would the stream do in that event?

The problem with the approach of having one callback to handle all the errors is that you have no visibility into where the error even was, in order to handle it programmatically. It could represent a connection or it could represent some other IO.

You might not want to clear up everything based on one part of your pipeline throwing an error.

The error could be in parsing some chunk of data; that chunk could have been corrupted somehow by an external service. There are more combinations and possibilities than there exist streams, and there always will be.

The fact is that our current abstraction allows the scope for implementing a solution to this one use case (even if it might seem like a common or obvious one).

If we had only that abstraction, we would have no way to retrofit the flexibility, which would be far more problematic. This is a further example of the fact that this is in fact a lower-level API, because there are more things to worry about.

The components are more raw and you have to consider handling their errors. IMO that is how it should be.

You have to deal with individual stream errors in most other languages as well, so it is no different here.

But yeah sure, I suppose we can solve all this in userland.

This is exactly where this problem should be solved if it is not already!

@spion
Author

spion commented Dec 14, 2014

@sonewman

The error could be in parsing some chunk of data; that chunk could have been corrupted somehow by an external service. There are more combinations and possibilities than there exist streams, and there always will be.

Then that is an error in usage of the stream library. There should be differentiation between known unrecoverable errors that must end the stream (e.g. socket/handle closed), and intermittent recoverable failures. A case in point is dnode. It uses the fail event for non-fatal errors rather than reusing error events.

Are you saying that since there are no clearly defined semantics (even though it's implied that error is fatal to the stream, node core doesn't state this anywhere) for userland, streams are basically unfixable and we'd have to start all over again on top of libuv if we wanted abstractions with saner, well defined behavior? If that is so, I agree.

But I don't think it's necessary to go that far. Forgetting to add an error handler already stops your stream entirely (together with maybe hundreds of other streams by other users, might I add). I think that defines the meaning of the error event quite clearly.

@sonewman
Contributor

Well, having a separate error for core protocol errors and one for stream-specific errors could be useful. IMO it is better to set a code attribute on the Error; that way it can be handled by the implementer.
This kinda links to #92, because we potentially need better implementation/understanding/documentation of errors for users.
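
For the code-attribute idea, something along these lines (parser and the EPARSE code are made up for illustration):

parser.on('error', function (err) {
  if (err.code === 'EPARSE') {
    // recoverable: a corrupt chunk from the external service;
    // skip it and keep the stream alive
  } else {
    // fatal: tear down the pipeline
  }
});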

Are you saying that since there are no clearly defined semantics (even though it's implied that error is fatal to the stream, node core doesn't state this anywhere) for userland, streams are basically unfixable and we'd have to start all over again on top of libuv if we wanted abstractions with saner, well defined behavior? If that is so, I agree.

I am not saying we have to scrap what we have at all. I don't think that what we have is inherently broken either. The internals of streams are a little complex to get your head around at first glance. But I have scoured through every line of the current streams3 implementation and I believe it is incredibly well thought out.

It is also very performant given what it does! I mean, if you have a stream and it does not apply back pressure, data literally just passes straight through it. Yet the backing up of data in the case of back pressure works incredibly well.

Some of the internal hooks are there to serve core, but again I stand by the fact that they exist for managing low-level behaviour.

Perhaps a redesign of the internals with a fresh slate and experimentation with different programming paradigms could make the internals simpler, and even more performant - particularly if they were based on simpler core / libuv.js APIs.

However in terms of streams in core and userland, these are specific implementations of particular streams, not the core abstraction.

The documentation is already good, with resources like @substack's https://github.com/substack/stream-handbook. However, since there have been no releases of v0.12, regular users have not had the benefit of what streams3 has to offer without using the readable-stream module (https://github.com/isaacs/readable-stream), and so documentation of the new streams3 features is not yet as widely available.

(For example, the writev method is awesome for regular http use cases, since we can define that method to handle all of the buffered data when end is called and send the lot in one TCP write. This is way more performant for the basic use case, and it is what most of the popular frameworks do internally.)
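
A rough sketch of defining _writev on a custom writable (streams3 / readable-stream; BatchWriter is made up for illustration):

var stream = require('stream');
var util = require('util');

function BatchWriter(options) {
  stream.Writable.call(this, options);
}
util.inherits(BatchWriter, stream.Writable);

// when several writes are buffered, _writev receives them all at once,
// so the lot can be flushed in a single underlying operation
BatchWriter.prototype._writev = function (chunks, callback) {
  var all = Buffer.concat(chunks.map(function (c) { return c.chunk; }));
  // e.g. one socket.write(all) instead of one write per chunk
  callback();
};

// still needed for the single-chunk case
BatchWriter.prototype._write = function (chunk, encoding, callback) {
  callback();
};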

It seems what you are looking for is a stream-like abstraction built on top of streams, which gives a similar yet broader API, allowing you to combine a stream-like API with more common callback API methods too. As said already, a lot of this functionality is scattered across third-party modules and does already exist.

In terms of the actual streams API for handling low-level chunks of data over time, I do think it deserves discussion about possible ideas and improvements, which will aid and define its use case. This could mean there is room for promises or alternative paradigms, or it could mean we stick with what we have.

To me it seems like there needs to be a clear distinction and discussion between a low level stream implementation with a flexible API and a higher-level abstraction for more common use-cases in application development.
