ByteToMessageDecoder avoid using discardSomeReadBytes() #9850
Conversation
@Scottmitch there was a leak reported:
LGTM... only one comment which I think is also the reason for the leak that was reported (which is mainly caused by an error in the SnappyFrameDecoderTest but did not show up before because we did not allocate a new buffer if there is nothing left to read).
```java
// of this cumulation, and if we use discardSomeReadBytes the user code will see corrupted data. If we attempt
// to check the ReferenceCount == 1 to see if discardSomeReadBytes is safe to call this may result in the
// cumulation buffer growing very large in the case the asynchronous access pattern is persistent.
if (cumulation != null && cumulation.readerIndex() >= cumulation.capacity() >>> 1) {
```
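The criterion quoted above (compact once at least half the buffer's capacity is already-read "wasted" space) and the PR's allocate-and-swap compaction can be sketched with plain `java.nio.ByteBuffer` as an illustrative stand-in for netty's `ByteBuf` (the class and helper names here are hypothetical, not netty API):

```java
import java.nio.ByteBuffer;

public class CompactionSketch {
    // Stand-in for the PR's policy: compact when at least half of the
    // buffer's capacity is already-read ("wasted") space.
    static boolean shouldCompact(int readerIndex, int capacity) {
        return readerIndex >= (capacity >>> 1);
    }

    // Allocate-and-copy compaction: the old buffer is never mutated, so any
    // retained views of it (e.g. slices handed to another thread) stay valid.
    static ByteBuffer compact(ByteBuffer old) {
        ByteBuffer fresh = ByteBuffer.allocate(old.capacity());
        fresh.put(old.duplicate()); // copy only the unread bytes
        fresh.flip();
        return fresh;
    }

    public static void main(String[] args) {
        ByteBuffer cum = ByteBuffer.allocate(8);
        cum.put(new byte[] {1, 2, 3, 4, 5, 6});
        cum.flip();
        cum.get(); cum.get(); cum.get(); cum.get(); // read 4 of 6 bytes
        System.out.println(shouldCompact(cum.position(), cum.capacity())); // true
        System.out.println(compact(cum).remaining()); // 2 unread bytes carried over
    }
}
```

The key contrast with `discardSomeReadBytes()` is that no bytes move inside the old buffer; the decoder simply swaps its reference to the fresh copy.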
@Scottmitch should we just set `cumulation` to `null` if `cumulation.isReadable() == false`? I think there is no need to allocate anything in this case.
This condition is checked in `channelRead()` and the behavior would be different than what was done in `discardSomeReadBytes`, so I'm not convinced it is necessary here. This may benefit calls from `channelReadComplete()`, but presumably the channel read finally block should catch the condition.
I still think it can’t harm
Agreed it won't harm, but it is redundant (unless I'm missing something). `channelReadComplete()` generally doesn't consume data, so I'm assuming that's why it didn't previously need to account for `!cumulation.isReadable()` (because this condition would be caught in the prior `channelRead()` call, see below). I updated the javadoc to clarify the method may `null` out the `cumulation` to be clear it is allowed, but can you clarify which use case you had in mind where we would not clear/reset the `cumulation`?
```java
// channelRead() code:
if (cumulation != null && !cumulation.isReadable()) {
    numReads = 0;
    releaseCumulation(); // release the cumulation here if it has been fully drained
} else if (++numReads >= discardAfterReads) {
    // We did enough reads already; try to discard some bytes so we don't risk an OOME.
    // See https://github.com/netty/netty/issues/4275
    numReads = 0;
    tryDiscardSomeReadBytes(ctx.alloc());
}
```
Hey thanks @Scottmitch. I have coincidentally been looking at this area a lot lately and am iterating on some possibly-related streamlining (it's taken longer than anticipated). So I have quite a few questions/comments already, but probably still need to digest this proposal a bit more. For now though I'm struggling a bit to understand the premise. In cases where the cumulation buffer is being referenced/used elsewhere (e.g. retained slices passed to other threads in prior
…DecoderTest` Motivation: We did not correctly close the `EmbeddedChannel` which would lead to not have `handlerRemoved(...)` called. This can lead to leaks. Beside this we also did not correctly consume produced data which could also show up as a leak. Modifications: - Always call `EmbeddedChannel.finish()` - Ensure we consume all produced data and release it Result: No more leaks in test. This showed up in #9850 (comment).
Leak itself should be fixed by #9851
@njhill - Agreed if
The question is do we lose much by allocating a new buffer (considering the default allocator is pooled) in the (relatively rare in most supported protocols and only if carefully segmented data delivery) event where compaction is triggered?
When new data comes in, if there is data in the cumulation, then this new data will be copied to the cumulation (at the position of the writer index). Can you clarify the scenario you had in mind?
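The copy-at-writer-index behavior described above can be sketched with `java.nio.ByteBuffer` (a hedged stand-in for netty's merge cumulator; `cumulate` here is a hypothetical helper, not netty's `Cumulator` API):

```java
import java.nio.ByteBuffer;

public class MergeSketch {
    // Hypothetical merge-style cumulator: append the incoming chunk at the
    // cumulation's write position; grow by allocate-and-copy when needed
    // (mirroring the PR's "allocate a new buffer and swap" approach).
    static ByteBuffer cumulate(ByteBuffer cumulation, ByteBuffer in) {
        if (cumulation.remaining() < in.remaining()) {
            ByteBuffer bigger = ByteBuffer.allocate(
                    Math.max(cumulation.capacity() * 2,
                             cumulation.position() + in.remaining()));
            cumulation.flip();
            bigger.put(cumulation); // copy previously cumulated bytes
            cumulation = bigger;
        }
        cumulation.put(in); // new data lands at the write position
        return cumulation;
    }

    public static void main(String[] args) {
        ByteBuffer cum = ByteBuffer.allocate(4);
        cum = cumulate(cum, ByteBuffer.wrap(new byte[] {1, 2, 3}));
        cum = cumulate(cum, ByteBuffer.wrap(new byte[] {4, 5, 6}));
        System.out.println(cum.position()); // 6 bytes accumulated
    }
}
```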
@Scottmitch can you rebase?
Motivation:

ByteToMessageDecoder maintains a ByteBuf to aggregate data in the event a full message requires multiple socket read calls (aka the cumulator). The cumulator buffer may grow over time and so ByteToMessageDecoder periodically calls discardSomeReadBytes() in an attempt to reclaim unused space and compact the buffer. Calling discardSomeReadBytes() will modify the underlying buffer and if the application has any views into the buffer (e.g. slice().retain()) their view would be corrupted, and therefore the refCnt() must be 1 in order for discardSomeReadBytes() to be called on the buffer. However if an application habitually processes the data on another thread this strategy may mean that discardSomeReadBytes() cannot be called in a timely manner due to the reference count not being 1 and the buffer may grow relatively large.

Modifications:

- Instead of basing the buffer space reclamation on if the buffer is "shared", base the criteria on how much "wasted" space exists in the accumulation buffer. The default policy is if the buffer is at least half "wasted" space based upon its original capacity then reclaim space.
- Instead of using ByteBuf#discardSomeReadBytes() to reclaim space, allocate a new buffer and swap the old buffer. This avoids any risk of existing views on the data being corrupted.
- Trigger points are added (but made private) such that base classes can maintain indexes on the cumulation data and reset these indexes when the underlying cumulation buffer changes. This can be useful to avoid re-parsing data if partial data is parsed before it is all received. These methods can be made public later if/when they are used.

Result:

ByteToMessageDecoder reclaims cumulation buffer space according to wasted space instead of if the buffer is shared.
Force-pushed from b3209a5 to 0a6db96.
Thanks @Scottmitch
We are talking about the case where refCnt > 1, right? In this case the cumulator will always call
I'm not sure that using unreleasable buffers is the best way to achieve this though. As far as I can see, servicetalk has a thread safety exposure related to the above - if the cumulator writes beyond the buffer's current capacity then an internal realloc will be performed which won't be safely published to other threads. Would it not be simpler to either:
or
The first has the advantage of less custom code and allows for explicit release of (direct) buffers in cases where they aren't passed to user code. The second avoids retain/release volatile accesses. But both fix the mentioned issue and should avoid the need for this or any other change on the netty side. WDYT?
In this scenario reference count is always
This will not work. If ref counting isn't exposed to users you need to avoid the case where Netty releases the buffers (e.g. when writing). Otherwise you end up getting IllegalRefCountExceptions if you write a buffer and re-use it. So at some level unreleasable is necessary.
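A minimal model of the "unreleasable" wrapping being discussed (these are toy classes, not ServiceTalk's actual types; in netty itself the analogous helper is `Unpooled.unreleasableBuffer(...)`): `release()` calls from framework code are swallowed, so the memory is never freed out from under the user-facing wrapper and re-use after a write cannot throw:

```java
// Toy ref-counting model; the real types are netty's ReferenceCounted/ByteBuf.
class SimpleBuf {
    int refCnt = 1;
    boolean release() { return --refCnt == 0; } // true => deallocated
}

class UnreleasableBuf extends SimpleBuf {
    final SimpleBuf wrapped;
    UnreleasableBuf(SimpleBuf wrapped) { this.wrapped = wrapped; }
    @Override
    boolean release() { return false; } // swallowed: never deallocates
}

public class UnreleasableDemo {
    public static void main(String[] args) {
        SimpleBuf inner = new SimpleBuf();
        SimpleBuf buf = new UnreleasableBuf(inner);
        buf.release(); // e.g. netty releasing the buffer after a write
        System.out.println(inner.refCnt); // still 1: safe to re-use the buffer
    }
}
```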
Representing the ref cnt as
I'd like to dig into this to understand the scenario you had in mind. IIUC here is the scenario you are describing:
The AT has its own slice, so the indexes/capacity are not shared. The underlying data is shared, but can the AT see partial data or a corrupted form of the data during the copy? After the data is parsed, the parsed section of data is not modified. The resize operation involves allocating new memory, copying from old memory, and assigning the data reference to the new memory; the AT may see the old memory or the new memory. However the memory is reclaimed by GC (ref counting is not reclaiming memory, thanks for making me investigate in more detail apple/servicetalk#893 :) ), the relative indexes into the buffer from any slices should remain consistent because data isn’t moving (just extending the capacity), and the content of the parsed data will not be modified. IIUC the AT should see the same contents in memory either way; can you clarify if I’m missing something?
Right but this scenario involves a specific approach to a specific use-case (namely the particular way custom unreleasable bufs are being used to shield users from ref counting), whereas the motivation in the PR description and explanatory code comments imply the changes benefit standard netty (ref-counting based) usage. As far as I can see the changes are only in support of this niche case, to the possible detriment of other/existing cases, while introducing public API change that must then be supported going forward and might inhibit our ability to make other optimizations here (the class is quite central as you know)
I think in this case it's reasonable and semantically consistent to consider the buffer "permanently" shared, or to just think of netty's
Which optimization where? If this was done then the need for any change on the netty side (e.g. this PR) goes away
I have not looked deeply at the servicetalk code but can't imagine why this couldn't work in principle. When control of a ByteBuf passes to the user then it's wrapped in a ST buffer and so that ownership stake will remain indefinitely. If a ByteBuf is extracted from a ST buffer passed from the user, you can just call
The indices are fine, it's that the newly allocated/copied mem (e.g. byte array for heap buffer) is not safely published - specifically the AT might see the buffer's updated ref to the new array before it sees that array's newly copied contents. apple/servicetalk#893 by itself doesn't fix this, and wouldn't be necessary at all if one of the other approaches was taken.

Really sorry for the pushback, I just still can't see how this is a positive change on the netty side, especially given that there appear to be simpler ways to achieve the desired outcome outside of netty. And even more sorry if there's something I still have missed or misunderstood! FWIW I do like the included minor simplifications to the cumulators' logic and the added use of
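The publication hazard described here can be closed with an explicit happens-before edge. A minimal sketch (hypothetical field and method names) where the resized array is published through a volatile write, so per the Java memory model a reader thread that observes the new reference also observes its copied contents:

```java
import java.util.Arrays;

public class SafePublication {
    // The volatile write of 'data' is the publication point: a thread that
    // reads the new reference is guaranteed to see the copied bytes too.
    volatile byte[] data = {1, 2, 3, 4};

    void ensureCapacity(int needed) {
        byte[] old = data;
        if (old.length < needed) {
            data = Arrays.copyOf(old, needed); // copy first, then publish
        }
    }

    public static void main(String[] args) {
        SafePublication cum = new SafePublication();
        cum.ensureCapacity(8);
        System.out.println(cum.data.length); // 8
        System.out.println(cum.data[2]);     // 3: contents preserved
    }
}
```

Without `volatile` (or another synchronization mechanism), the reference assignment and the array contents may be observed out of order by another thread, which is exactly the race being pointed out.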
@Scottmitch I think independently of this it would be universally beneficial and safe (including to the servicetalk use case) to relax the cumulator check a bit to allow appending if it won't trigger a realloc. WDYT?

```java
int required = in.readableBytes();
if (required > cumulation.maxWritableBytes()
        || (required > cumulation.maxFastWritableBytes() && cumulation.refCnt() > 1)
        || cumulation.isReadOnly()) {
    // expand
```

I guess the only danger is if there is some decoder out there that retains a duplicate rather than a slice, and then proceeds to use the writable space for other purposes. Hard to think that this would be a real-life case though.
I'm referring to the "reclaim in place" behavior (existing Netty behavior) which is an optimization (assuming the buffer is known not to be shared) over the "allocate and copy" approach (proposed in this PR).
What I meant to highlight here is because we don't expose reference counting to the users we cannot allow the reference count to get to 0 and the underlying memory be released until the memory is no longer referenced. The initial suggestion of "increment the ref cnt when passing a buffer to users" and “call super.retain() in the constructor” may lead to this result (e.g. if a user writes the same underlying buffer and retains a reference for use later). An alternative approach to using unreleasable buffers, as you alluded to, is incrementing the reference count every time a buffer flows from Netty->ST[->User] and then also from [User->]ST->Netty (e.g. making sure all cases which may decrement the refcnt have a corresponding increment such that the count never reaches 0). Both are required, as if the ref cnt isn’t incremented when going from Netty->ST then we would run into a similar issue to the one this PR is attempting to address (e.g. optimizations which try to infer if a buffer is shared based upon ref count break).

There are pros/cons to each approach. For example “unreleasable buffers” reduce the surface area and remove reference counting from the equation, but non-memory-reclamation uses of the ref cnt API (e.g. is the buffer shared) may uncover less-exercised code paths (e.g. this discussion). The “manage ref counts on entry/exit points” approach uses refcounts in a more "traditional" fashion, but has a larger surface area and cases may be missed as code evolves. Independent of this PR and discussion, in Netty it is possible to generate “unreleasable buffers” from an
Thanks for clarifying! Yes, this is problematic.
No need to apologize, the discussion is helpful/welcome!
I guess the difference would be “ownership”; if we are assuming “refCnt() > 1” is a proxy for “this buffer is shared so ByteToMessageDecoder shouldn’t touch the data” then this may be problematic. Folks can always supply their own
Thanks @Scottmitch, I agree this is good discussion! Stand by for another wall of text... :)

My perspective on the unreleasable buffer thing is a bit different. From a netty pov I would not consider there to be such a thing as an "unreleasable buffer". Really the precise count itself has no meaning to anyone inspecting a
I actually don't think anything special is needed for the Netty->ST->User case. This is standard netty semantics - in general when you give a

So then all you need to be careful about is if/when a ByteBuf is extracted from a ST buffer. There are two cases - either the ByteBuf is accessed transiently (e.g. just to read/write some data), or it's kept/passed-along for some purpose. In the first case there is no problem; in the second you're essentially stealing ownership and so first need to call

This does not seem like much of a burden, especially as the conversion is always done via the

FWIW I did a quick search of the ST code and found that this is in fact already done in most places, with only two methods where it would need to be added (
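The "steal an ownership stake first" rule described above can be modeled with a plain `AtomicInteger` ref count (a toy stand-in for netty's `ByteBuf.retain()`/`release()`; `CountedBuf` is hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Each owner holds one stake; the memory is freed only when the last
// stake is released.
class CountedBuf {
    final AtomicInteger refCnt = new AtomicInteger(1);
    CountedBuf retain() { refCnt.incrementAndGet(); return this; }
    boolean release() { return refCnt.decrementAndGet() == 0; } // true => freed
}

public class OwnershipDemo {
    public static void main(String[] args) {
        CountedBuf buf = new CountedBuf();       // wrapper owns one stake
        CountedBuf stolen = buf.retain();        // take a stake before passing along
        System.out.println(stolen.release());    // false: wrapper still owns it
        System.out.println(buf.release());       // true: last stake released
    }
}
```

Extracting the buffer without the extra `retain()` would mean the downstream `release()` could free memory the wrapper still references, which is the bug the pattern prevents.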
Again my view is that using
I'm not in favour of this option but I still can't see why a dedicated compaction "allocate and copy" operation would ever be needed (e.g. the
This is actually part of what I've been iterating on for some time now (see #8931, #9843). I too think it makes complete sense for the cumulator to subsume responsibility for discarding, but don't see why a new subinterface would be needed for this - it's sufficient to only care about discarding when performing a

Finally, those referenced PRs are really precursors to a bigger optimization I have that should hopefully eliminate the need for any explicit discarding as a side effect. I had been trying to get it into at least a candidate "finished" state but maybe I'll share in a WIP state for feedback given how long it's taking.
@njhill @Scottmitch this is a very interesting discussion and I still need to make my mind up here... That said I think I agree with @Scottmitch that using

So whatever we do at the end I think we should get rid of the
Seems like we are talking past each other here ... "wrapping" is mixing in another approach as opposed to incrementing the reference count. I think we both understand that writing a
To be clear, I didn't say this. I'm just pointing out the pros/cons of the different approaches. This PR is meant to facilitate discussions and if the Netty community feels that the current
To be fair I expect the way ST handles refCnt is somewhat inconsistent due to the introduction of unreleasable buffers over time. It is certainly possible but more investigation is required.
Agreed, the buffer size doesn't need to be increased during compaction.
If there are other options that is great! I don't think #8931 will work "as is" because

Remaining Discussion (let's re-focus):
After thinking a bit more about this I would prefer to not depend on this as it is error-prone, especially if retain / release calls are done in different threads.
Are we talking about 4.x or the next major here?
Discussed offline with @njhill. In summary...
thanks for review/discussion @njhill / @normanmaurer!
Motivation:
ByteToMessageDecoder maintains a ByteBuf to aggregate data in the event
a full message requires multiple socket read calls (aka the cumulator).
The cumulator buffer may grow over time and so ByteToMessageDecoder
periodically calls discardSomeReadBytes() in an attempt to reclaim
unused space and compact the buffer. Calling discardSomeReadBytes()
will modify the underlying buffer and if the application has any
views into the buffer (e.g. slice().retain()) their view would be
corrupted, and therefore the refCnt() must be 1 in order for
discardSomeReadBytes() to be called on the buffer. However if an
application habitually processes the data on another thread this
strategy may mean that discardSomeReadBytes() cannot be called in a
timely manner due to the reference count not being 1 and the buffer may
grow relatively large.
Modifications:

- Instead of basing the buffer space reclamation on if the buffer is
"shared", base the criteria on how much "wasted" space exists in the
accumulation buffer. The default policy is if the buffer is at least
half "wasted" space based upon its original capacity then reclaim
space.
- Instead of using ByteBuf#discardSomeReadBytes() to reclaim space,
allocate a new buffer and swap the old buffer. This avoids any risk of
existing views on the data being corrupted.
- Trigger points are added (but made private) such that base classes
can maintain indexes on the cumulation data and reset these indexes
when the underlying cumulation buffer changes. This can be useful to
avoid re-parsing data if partial data is parsed before it is all
received. These methods can be made public later if/when they are used.
Result:
ByteToMessageDecoder reclaims cumulation buffer space according to
wasted space instead of if the buffer is shared.