PDP 43: Large Events #5056

Closed

andreipaduroiu opened this issue Aug 10, 2020 · 8 comments

@andreipaduroiu
Member

andreipaduroiu commented Aug 10, 2020

Problem description
Issue to discuss PDP 43:

https://github.com/pravega/pravega/wiki/PDP-43-Large-Events

Problem location
Supporting Events larger than 8MB.

Suggestions for an improvement
Design and implement.

@andreipaduroiu andreipaduroiu self-assigned this Aug 10, 2020
@andreipaduroiu andreipaduroiu changed the title PDP 43: Large Appends PDP 43: Large Events Aug 11, 2020
@andreipaduroiu
Member Author

Comments from @claudiofahey. Answers inline below.

One nice-to-have add-on for this would be to allow EventStreamWriter.writeEvent to write events slightly larger than 8 MiB. Maybe a writer usually writes 1 MiB events, but sometimes an event serializes to 9 MiB. Your proposal as written would require the user to perform their own serialization. One possibility is for writeEvent to perform the serialization to a ByteBuffer (as normal), and if it is >8 MiB, then use a micro transaction. You could do something similar on the reader side.

This sounds like an interesting idea. However, it is unrelated to this PDP; it can be treated as a separate change even with today's API. If we choose to do this, we will need to be very careful about how we message it. I am concerned that if we allow a 1MB overflow, what's going to prevent a user from asking for more? For example, if we increase the limit to 9MB, someone can essentially repeat your ask and say "why can't we do it for 10MB? 11MB? ...".
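
To illustrate the suggestion, here is a minimal sketch of such a fallback. Everything in it (MAX_EVENT_SIZE, Writer, LargeEventWriter, beginAppend) is a placeholder based on this discussion, not an existing Pravega API:

import java.nio.ByteBuffer;

// Sketch only: all names below are placeholders for whatever the PDP ends up
// defining; none of them are existing Pravega APIs.
class WriteEventFallbackSketch {
    static final int MAX_EVENT_SIZE = 8 * 1024 * 1024; // today's limit

    interface LargeEventWriter extends AutoCloseable {
        void write(ByteBuffer data);
        void commit();
    }

    interface Writer {
        void writeEvent(String routingKey, ByteBuffer event); // existing single-call path
        LargeEventWriter beginAppend(String routingKey);      // proposed in this thread
    }

    static void writeWithFallback(Writer writer, String routingKey, ByteBuffer serialized) throws Exception {
        if (serialized.remaining() <= MAX_EVENT_SIZE) {
            writer.writeEvent(routingKey, serialized);        // small event: unchanged behavior
        } else {
            // Oversized event: transparently switch to the streaming/"micro
            // transaction" path instead of rejecting the write.
            try (LargeEventWriter large = writer.beginAppend(routingKey)) {
                large.write(serialized);
                large.commit();
            }
        }
    }
}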

I think a nice pattern is:

try (ByteStreamWriter byteStreamWriter = eventStreamWriter.beginAppend(routingKey)) {
    byteStreamWriter.write(...);
    somethingThatMightThrow();
    byteStreamWriter.write(...);
    byteStreamWriter.commit();
}
// If close() is called before commit(), it should abort.

Sounds good. This will eliminate the need for an explicit abort. We will need to properly document the behavior of close in this case. Note: if we were thinking of overloading flush with the semantics of commit, that wouldn't be appropriate IMO. A user could periodically call flush to make sure that whatever they have written so far is durably persisted, but that doesn't mean the entire Event is uploaded.
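
To make that contract concrete, here is a placeholder sketch (not actual Pravega code) in which close() aborts an uncommitted append while flush() remains a pure durability call:

class ByteStreamWriterSketch implements AutoCloseable {
    private boolean committed = false;

    void write(byte[] data) { /* buffer or transmit part of the pending large event */ }

    void flush() { /* durably persist everything written so far; does NOT commit the event */ }

    void commit() {
        // Seal the pending event and make it visible to readers.
        committed = true;
    }

    @Override
    public void close() {
        // Per the pattern above: closing without a prior commit() aborts.
        if (!committed) {
            abort();
        }
    }

    private void abort() { /* discard the partially written event */ }
}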

To avoid excessive aborts, should we recommend that such streams do not use dynamic scaling? Scaling could be performed by administrators during maintenance windows or periods of low activity.

The ordering guarantee should be based on the order of writeEvent() and ByteStreamWriter.commit() calls. Can we do this?

We can, but we need to make sure we use the same connection as the parent EventStreamWriter; otherwise we cannot guarantee ordering across different TCP connections.

Probably for a different PDP, but it would be nice to have the ability to write multiple large events to multiple routing keys in a distributed transaction. That should have a nearly identical client API.

There is nothing preventing you from doing this with the current proposal. We’ll likely need to wire this up in the Transactional writer. No need for changes on the reader side. We might choose to do it in this PDP if it proves not to be too much extra work.

@tkaitchuck
Member

Accept Events larger than 8MB. Maximum should be hardcoded to 1024MB.

Can we soften that language a bit? Maybe say the goal of this PDP is to increase the limit from 8MB to 1024MB. (Because it is not a goal to prevent future increases to this limit.)

TS will have a maximum of 4 Extended Attributes

Is 4 a special number here? Or is it just chosen to be a small value?

The proposal for the wire protocol and server-side features sounds good to me. However, I do not like the proposed client API changes. This would not be a simple drop-in upgrade for any existing client; they would be required to code to the new API on both the reader and the writer side.

I think we may need two APIs taking two different approaches: one approach for medium-sized events that is a drop-in solution for existing users, who don't have to think about what is happening. This of course has its limits, because it means the user application is neither streaming in the contents of the event nor processing it in a streaming fashion. For those applications, it makes the most sense to actually write and read back a stream object. This allows different stream objects to be retrieved in rapid succession and handed off to parallel processing threads.

@andreipaduroiu
Member Author

This would not be a simple drop in upgrade for any existing client.

Why not? Any existing code will continue to work as it currently does. Existing code should not attempt to write more than 8MB at once since that's forbidden by the API. This will not change.

The client application developer will need to write new code to handle events larger than 8MB.

@derekm
Contributor

derekm commented Aug 18, 2020

The proposal goes to certain lengths to honor memory consumption patterns imposed by previous limits, however intentional those limits were.

The addition of streaming APIs is great, but a developer should have the option of retaining access to the old APIs with the new large events. The defensive nature of the proposal may be a good safety net, but, as an informed user, I might want to turn off the safety net that prevents getEvent() from loading the large event into memory.

Existing code may not have been intentionally written with awareness of the existing 8MB limit, and payloads may be apt to grow beyond that limit one day. If this feature is delivered and I hit the old limit, I could assess the situation and decide that turning off the memory-pressure safeguard is the only breaking change needed to get my app working again. As my payloads grow into the hundreds of MB, I can then take the time to adopt the streaming APIs, as I should.
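
Hypothetically, such an opt-out might look like the sketch below. Every name in it is invented for illustration; nothing here is a Pravega API. The point is only that the safeguard becomes opt-out rather than hard-wired:

final class ReaderSettingsSketch {
    private final boolean allowInMemoryLargeEvents;

    private ReaderSettingsSketch(boolean allow) {
        this.allowInMemoryLargeEvents = allow;
    }

    // Informed users opt in to buffering whole large events in memory, so an
    // existing getEvent()-style call keeps working past the old 8MB limit.
    static ReaderSettingsSketch allowingInMemoryLargeEvents() {
        return new ReaderSettingsSketch(true);
    }

    boolean allowsInMemoryLargeEvents() {
        return allowInMemoryLargeEvents;
    }
}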

@tkaitchuck
Member

Why not? Any existing code will continue to work as it currently does. Existing code should not attempt to write more than 8MB at once since that's forbidden by the API. This will not change.

You aren't describing an upgrade. They would gain no new capabilities.

@derekm
Contributor

derekm commented Aug 18, 2020

I just want to reason a bit about the 1 GiB max event size.

On-disk event headers are the same as wire protocol request headers, with a message type of 0, according to:

https://pravega.io/docs/v0.7.1/wire-protocol/#protocol

For Length in the wire protocol, the upper bits remain zero and messages should be less than 2^24 bytes, which is 16 MiB. But now we're saying that on-disk wire protocol message type 0 can use a Length of up to 2^30. This PDP appears to leave in place the constraint on Length for the on-the-wire protocol (i.e., < 2^24) and proposes a new limit for the on-disk format.

Assuming the 4-byte Length integer is signed because we're in Java, the theoretical max is 2^31-1 (~2 GiB) anyway.
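
Spelling out the numbers (a small Java illustration; the constant names are mine, not from the protocol):

class LengthLimits {
    static final int WIRE_MAX   = 1 << 24;           // 16,777,216 bytes (16 MiB): on-the-wire cap
    static final int DISK_MAX   = 1 << 30;           // 1,073,741,824 bytes (1 GiB): proposed on-disk cap
    static final int SIGNED_MAX = Integer.MAX_VALUE; // 2^31 - 1 (~2 GiB): ceiling of a signed 4-byte Length
}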

Are there advantages to starting at half of the theoretical max length? Or can we use that last bit from the outset and raise the limit to its natural 2 GiB barrier?

When asked to break the new barrier, do we define a new message type that has a longer Length? Or do we use partial events of 2 GiB on disk? I realize those questions are out of scope for this PDP, but I'd rather those be the next questions, instead of the next question being when to raise the limit again.

Cf., https://en.wikipedia.org/wiki/Large-file_support & https://en.wikipedia.org/wiki/2_GB_limit

@shiveshr
Contributor

shiveshr commented Aug 27, 2020

To tackle the "segment sealed" situation -- since we are talking about "large" events, how about we actually create a regular transaction, which will create "txn segments"?

Everything in the approach stays the same except for one subtle difference: instead of shadowing stream segments, the client will "shadow" the transaction segments.
Basically, the client would first ask the controller to create a txn, and then it would create "transient" segments to shadow the transaction segments. It would write everything into the transient segment and attach a length header to it once the entire event has been written. When the event with its length header is ready, the client would merge it into the "transaction" segment.
The advantage is that the txn segments are stable and don't seal or scale,
and the controller can perform a rolling txn if the corresponding stream segment was sealed.

Using a txn has additional advantages too:
If the user wants, they can write more than one large event, and for different routing keys too.
For each large event there will be exactly one "transient segment" shadowing the txn segment, and all of these events will be merged into their respective txn segments when they are completely flushed.
After the user has written all events (which are effectively merged into "stable" transaction segments), the client simply issues a "commit" command to the controller for the transaction, which will merge the txn segments into stream segments.

If the stream had not scaled, the txn segments will be merged into the corresponding stream segments.
If the stream had scaled, the txn will simply be rolled over.

We get two benefits with this -

  1. We solve the "sealed segment" problem. No buffering, no replays.
  2. We also extend the scheme to "multiple" large events with different routing keys being written atomically.

And the cost of using a transaction for this, even for a single event, is going to be insignificant compared to the latency of writing the whole "large" event.
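
A rough sketch of this flow (every name below is invented for illustration; none are existing Pravega client or controller APIs):

import java.nio.ByteBuffer;
import java.util.Map;

class LargeEventTxnSketch {
    interface Segment {}

    interface Transaction {
        Segment segmentFor(String routingKey); // stable txn segment: never seals or scales
        void commit();                         // controller merges (or rolls over) txn segments into the stream
    }

    interface Controller {
        Transaction createTransaction();
    }

    void writeLargeEvents(Controller controller, Map<String, ByteBuffer> eventsByKey) {
        Transaction txn = controller.createTransaction();         // 1. regular txn creates txn segments
        for (Map.Entry<String, ByteBuffer> e : eventsByKey.entrySet()) {
            Segment txnSegment = txn.segmentFor(e.getKey());
            Segment shadow = createTransientSegment(txnSegment);  // 2. one transient segment per event
            writeInChunks(shadow, e.getValue());                  // 3. stream the payload in
            attachLengthHeader(shadow, e.getValue().remaining()); // 4. length header once fully written
            merge(shadow, txnSegment);                            // 5. merge into the stable txn segment
        }
        txn.commit();                                             // 6. atomic across all routing keys
    }

    Segment createTransientSegment(Segment shadowOf) { return new Segment() {}; }
    void writeInChunks(Segment target, ByteBuffer data) { /* elided */ }
    void attachLengthHeader(Segment target, int length) { /* elided */ }
    void merge(Segment source, Segment target) { /* elided */ }
}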

thoughts?

@andreipaduroiu
Member Author

Superseded by #6052
