Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Rough new C++ streams API #16414

Closed
wants to merge 2 commits into from
Closed

[WIP] Rough new C++ streams API #16414

wants to merge 2 commits into from

Conversation

jasnell
Copy link
Member

@jasnell jasnell commented Oct 23, 2017

[DO NOT MERGE... This is a work in progress concept]

Consider everything in this mutable at this point... this is just a concept up for discussion

@mcollina @Fishrock123 @trevnorris @addaleax

This is an extremely rough WIP C/C++ level concept for the "new" pull-stream API we discussed.

@addaleax ... for context, this was something that @mcollina , @Fishrock123 , @trevnorris and I began brainstorming over dinner one night in Vancouver.

Look at the doc/io_wip.md to get a sense of how it works. This is extremely rough at this point but should cover most of the use cases. There is still a ton that would need to be worked through on this.

Please ask questions in the comments here and be as brutal as you'd like. If we want to move forward on something like this we need to make sure we get it right.

@nodejs-github-bot nodejs-github-bot added build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. labels Oct 23, 2017
@jasnell jasnell added the wip Issues and PRs that are still a work in progress. label Oct 23, 2017
@addaleax
Copy link
Member

One question upfront: This looks like it’s a mix between a C and a C++ API, which decreases readability a lot … can we pick one? 😄 (I’d prefer C++ – if we want something like public bindings for N-API, we can always do that the way N-API currently works.)

@jasnell
Copy link
Member Author

jasnell commented Oct 23, 2017

can we pick one?

awww ;-) ... yeah, we certainly can. I went with this simply because it was easy to hack together and I didn't want to bikeshed on stuff too much just yet. But yes, it would be better to go with one or the other.

@addaleax addaleax changed the title [WIP] Rough new streams API [WIP] Rough new C++ streams API Oct 23, 2017
IO_ERROR_EOF = -768,
// Source will not copy data into caller provided buffer
IO_ERROR_MUST_NO_COPY = -1024,
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are errors distinguished any further than this? I think you’d need at least a way to forward all errno errors…

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, not yet. This is just a rough error reporting mechanism right now, it would need to be significantly improved :-)

io_pull_set_pull_cb(&pull, AfterPull);

int status = mySource->Pull(&pull, &buffer.len, &buffer, 1,
IO_PULL_FLAG_MUST_COPY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea here that pull is copied into memory managed by the source object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull is just an async handle, really. Used to maintain context in the callback.

The IO_PULL_FLAG_MUST_COPY tells the Source that it must copy it's data into the provided buffer rather than providing pointers to it's own buffers in the callback.


### Example 5: Binding a Source and Sink

Binding is roughly the equivalent of `pipe()` in Readable, with the notable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I’d prefer to call this ‘piping’ then?

And just to be clear, I understand correctly that only one binding/piping operation is happening for any given source, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer to call this 'piping' then?

I considered that, but given that it is pull based rather than push based, I wanted to avoid accidentally conflating the two.

... only one binding/piping operation is happening for any given source, right?

That's the idea but we need to decide that for certain. When we discussed this over dinner in Vancouver, the idea was that data flow would be either One-Source-to-One-Sink, or One-Sink-Many-Sources, in order to keep things as simple as possible.

IO_PULL_FLAG_NONE = 0x0,
// Callback must be invoked synchronously (also sets the
// IO_PULL_FLAG_MUST_COPY and IO_PULL_FLAG_STRICT_LENGTH flags)
IO_PULL_FLAG_SYNC = 0xD,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this imply IO_PULL_FLAG_MUST_COPY?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the only way for the Source to provide pointers to it's own buffers is via the callback, which is not being used here. This is telling the Source: fill these buffers now and return immediately when you do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe Pull could yield its own buffers synchronously, if there are any? This sounds like something that would otherwise create unnecessary memcpy()s otherwise

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually the intent. The two approaches are:

  1. caller allocates the buffer, calls pull, Source memcpy's data into those
  2. caller calls pull, Source provides pointers to it's own buffers without memcpy

Currently, for option 2, the Source must use the callback to deliver it's own buffer pointers.

We could change this, however, so that Pull can yield buffers back to allow Source to provide it's own pointers without requiring the callback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw; one of my (and others’) pain points with streams is that they delay a lot of events, whether necessary or not.

It might be nice if we could drop this requirement, and calls would be allowed to call their callback synchronously or asynchronously, whichever fits more nicely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I agree. With the callbacks here, that's exactly what I'm allowing for unless the code calling Pull explicitly says that it needs sync processing. Unless that flag is set, the Source is largely free to invoke the callback whenever it wants, sync or async, once the data is available. Unless the code calling Pull sets that sync flag, it must not make any assumptions at all about when the callback is going to be invoked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(hopefully I didn't misunderstand your comment there tho :-) ...)

// Peek at what data is available, do not actually read.
IO_PULL_FLAG_PEEK = 0x2,
// Puller considers the length given to be strict,
// Source must not overflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overflow? i.e. return more data than requested?

Copy link
Member Author

@jasnell jasnell Oct 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There are a couple of options here:

  1. The caller allocates it's own io_buf_t with a specific size, tells the Source that it must fill those (this implies that overflowing is not allowed)

  2. The caller allocates it's own io_buf_t with a specific size, tells the Source to provide data up to length and no more. Source may choose to fill the provided io_buf_t, or provide pointers to it's own allocated buffers. The caller may or may not allow the Source provided buffers to be larger than what it asked for.

@mscdex
Copy link
Contributor

mscdex commented Oct 23, 2017

Does this increase the likelihood/possibility of a pull-based JS stream API or is this mainly benefiting C++ node core internals only?

@jasnell
Copy link
Member Author

jasnell commented Oct 23, 2017

@mscdex ... yes, the intent would be to extend a pull-based JS stream API also. I believe @Fishrock123 has been exploring that. The idea would be to make this and that sync up :-)

Another goal of this is to allow us to consistently bind at the native layer when possible... that is, if we have a JS Source and a JS Sink, both of which are backed by C/C++ Sink an C/C++ Source, we can do the Bind at the native layer to more efficiently transfer the data without pulling it into the JS layer... much like we do now with the TLS and http2 implementations writing to their underlying sockets.

@addaleax
Copy link
Member

has been exploring that. The idea would be to make this and that sync up :-)

Fwiw, we have that; .on('readable') + .read() is pretty much a pull stream.

@jasnell
Copy link
Member Author

jasnell commented Oct 23, 2017

Fwiw, we have that

Yes, I should have clarified: the idea would be a separate implementation that does not carry all of the existing Streams 1, 2, and 3 cruft along with it. It would be a separate API that is intentionally not backwards compatible with the existing stuff. Someone could write code that bridges the two, but that would be secondary.

@mscdex
Copy link
Contributor

mscdex commented Oct 23, 2017

Fwiw, we have that; .on('readable') + .read() is pretty much a pull stream.

That's not quite the same thing as far as this API goes (from what I can tell). This pull-based API lets you supply a Buffer to use for writing the data, whereas 'readable' and read() is returning a new Buffer every time, which can be expensive (for parsers and other modules that consume the entirety of the data received).

@jasnell
Copy link
Member Author

jasnell commented Oct 23, 2017

That's not quite the same thing as far as this API goes ...

Yep. This API actually supports both approaches. The caller can supply a buffer(s) to fill (and require that the Source use those), or may allow the Source to provide it's own buffer.

enum io_pull_status {
IO_PULL_STATUS_OK = 0x0,
// After this call, wait before asking again
IO_PULL_STATUS_WAIT = 0x1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it better to just wait in the source and return when data is available?

If the sink needed to I suppose it could peek or unbind on a timeout, but it seems better to wait in the source for data if there will be some rather than polling?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it better to just wait in the source and return when data is available?

Not necessarily. In the http/2 implementation, for instance, the callback from nghttp2 is sync. If data is not currently available, the stream is put into a deferred state until nghttp2 is explicitly told that data is now available. That is precisely the kind of use case this is meant to support.

The way Pull is defined, we have a simple set of primitives that allow the following scenarios, which can be used based on the needs of the Source and Sink:

  1. Sync Pull ... Source waits to return until data is available (Blocking-Pull)
  2. Async Pull ... Source waits to call callback until data is available (Nonblocking-Pull)
  3. Sync Pull With Deferred Polling ... Source returns immediately saying data is not yet available, tells caller to check back later
  4. Sync Pull With Deferred Signal ... Source returns immediately saying data is not yet available, will let caller know later when data is available (see Sink::Signal)
  5. Async Pull With Deferred Polling (like above but using callback)
  6. Async Pull With Deferred Signal (like above but using callback)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the stream is put into a deferred state until nghttp2 is explicitly told that data is now available. That is precisely the kind of use case this is meant to support.

So what you're saying is "4. Sync Pull With Deferred Signal"? i.e. a notification push to the sink? That would make sense but I was under the assumption we were avoiding that flow more entirely. I'll try it in the JS impl when I can.

@Fishrock123
Copy link
Member

Fwiw, we have that; .on('readable') + .read() is pretty much a pull stream.

Right, in a way. (Just way more complex than necessary.) And that's kinda a good thing. It means we should, with high certainty, be able to shim the ends of these "streams" to Streams3 - allowing for compatibility with the existing Streams3 ecosystem.

@Fishrock123
Copy link
Member

Aight, my work thus far is here: https://github.com/Fishrock123/bob

Rough, but works for fs. I'll update it to share some naming conventions with this soon-ish. I think some modifications are needed to make buffered transforms work correctly, going to try to do a zlib one soon.

"Isn't that just pull streams?" — in concept, yes. Error flow and buffer allocation is different than pull-stream though and should better suit our needs, I think.


@jasnell How does the bind structure work here to ferry re-calls to Pull()? Does it need to have it's own functions inbetween? Mine does binding more directly form the sink right now, allowing direct calls from sinks and sources.

@jasnell
Copy link
Member Author

jasnell commented Oct 24, 2017

Aight, my work thus far is here ...

Awesome :-) ... I'll take a look in detail tomorrow.

How does the bind structure work here to ferry re-calls to Pull

TBD at this point. I'm hoping to work up an implementation of that by end of week.

@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2017

@mcollina
Copy link
Member

I would prefer to land this into a separate Node repo, and prove a point by porting some bits of the previous API.

@jasnell
Copy link
Member Author

jasnell commented Oct 27, 2017

+1 on that @mcollina ... should I do like I did with the http2 repo and create a new nodejs/new-streams repo?

@mcollina
Copy link
Member

Yes go ahead!

@jasnell
Copy link
Member Author

jasnell commented Oct 27, 2017

I've got a few higher priorities items to clear out but will do that by tuesday of next week.

virtual int Pull(io_pull_t* handle,
size_t* length = 0,
io_buf_t* bufs = nullptr,
size_t count = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this?

virtual ~Source() {}

virtual int Pull(io_pull_t* handle,
size_t* length = 0,
Copy link
Member

@Fishrock123 Fishrock123 Nov 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this length written to? (Guaranteed to be?)

What does the passed-in length do? Set the maximum?

@Fishrock123
Copy link
Member

It seems like the two things my JS implementation is not currently capable of covering is peek and sync-required read. Maybe I'm missing something else?

Note that my impl currently requires callbacks and requires a buffer (either passed form the sink or directly from the source) be passed to the callback, and that the sink always uses that one.

@gishmel
Copy link
Contributor

gishmel commented Dec 1, 2017

I would like to help with this effort. Is there anything that is low hanging fruit to do in order to get involved in this effort? Thanks for all your help and hard work on node.js core and I look forward to working with you all more in the future.

Fishrock123 added a commit to Fishrock123/bob that referenced this pull request Dec 1, 2017
Updated naming to be more like James' C++ impl:
nodejs/node#16414

Fixed some bugs. stdout-source works again.
@BridgeAR
Copy link
Member

What is the status here? Any further progress?

@jasnell
Copy link
Member Author

jasnell commented Feb 10, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. wip Issues and PRs that are still a work in progress.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

8 participants