
HSEARCH-2886 Use of BufferedWriter in GsonEntity may lead to MalformedInputException when input contains 4-byte unicode characters #1545

Closed
wants to merge 6 commits

Conversation
yrodiere commented Sep 21, 2017

https://hibernate.atlassian.net//browse/HSEARCH-2886

Was originally proposed in #1510, and feedback wasn't very enthusiastic. But the original goal was mainly to make code clearer, whereas this time it actually fixes a bug.

yrodiere commented Sep 21, 2017

Note: this should be backported to 5.8.

```diff
@@ -101,6 +101,7 @@ public void testContentIsRepeatable() {
 		list.add( buildEmptyJSON() );
 		try ( GsonHttpEntity entity = new GsonHttpEntity( gson, list ) ) {
 			final byte[] productionOne = produceContentWithCustomEncoder( entity );
+			entity.close();
```

Sanne (Member), Sep 21, 2017:

I'm confused about this commit. The entity is in an auto-closeable block so it should get closed?
Also, why close it here when it's still used on the next line?
Finally, if I revert this nothing seems to fail.

yrodiere (Author, Member), Sep 22, 2017:

We are re-using the entity on the next line, expecting it to "rewind". So we need to send a message telling it to rewind, and that message is the close() method. We can actually re-use it after calling close() because that's the definition of being "repeatable".
I know it works without closing it; it's just that it's not required that it does... So why enforce it? Besides, we also test for "repeatability" further down in the same file, and there we close the entity before re-using it.
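The contract being discussed can be sketched with a toy entity (all names below are hypothetical, not Hibernate Search or HttpCore API): close() rewinds the internal cursor, so a "repeatable" entity can produce identical content a second time.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal "repeatable entity": close() rewinds internal state
// so the content can be produced again from the start.
class RepeatableEntity implements AutoCloseable {
    private final List<String> bodyParts;
    private int nextPartIndex = 0;

    RepeatableEntity(List<String> bodyParts) {
        this.bodyParts = bodyParts;
    }

    void writeTo(ByteArrayOutputStream out) throws IOException {
        while ( nextPartIndex < bodyParts.size() ) {
            out.write( bodyParts.get( nextPartIndex++ ).getBytes( StandardCharsets.UTF_8 ) );
        }
    }

    @Override
    public void close() {
        nextPartIndex = 0; // "rewind": the entity may be consumed again
    }
}

public class RepeatableDemo {
    public static void main(String[] args) throws IOException {
        try ( RepeatableEntity entity = new RepeatableEntity( Arrays.asList( "{}", "{}" ) ) ) {
            ByteArrayOutputStream first = new ByteArrayOutputStream();
            entity.writeTo( first );
            entity.close(); // signal a rewind before re-use, as in the test above
            ByteArrayOutputStream second = new ByteArrayOutputStream();
            entity.writeTo( second );
            if ( !Arrays.equals( first.toByteArray(), second.toByteArray() ) ) {
                throw new AssertionError( "repeatable entity produced different content" );
            }
            System.out.println( "identical" );
        }
    }
}
```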

Sanne (Member), Sep 22, 2017:

Great point! Thanks, and sorry, that makes perfect sense; I probably reviewed this too late in the evening.

```diff
@@ -135,7 +135,8 @@ public GsonHttpEntity(Gson gson, List<JsonObject> bodyParts) {
 		Objects.requireNonNull( bodyParts );
 		this.gson = gson;
 		this.bodyParts = bodyParts;
-		this.contentLength = attemptOnePassEncoding();
+		this.contentLength = -1;
+		attemptOnePassEncoding();
```

Sanne (Member), Sep 21, 2017:

I can live with this change since you seem to love it, but IMO it's looking horrible :)
It's a function, and you're refactoring it into updating contentLength by side effect, maybe?

I do realize that several other state fields in this class are mutated as a side effect by working on the various buffers and progress pointers, but I'd still try to avoid such things whenever possible.

yrodiere (Author, Member), Sep 22, 2017:

it's a function, and you're refactoring it into updating contentLength by side effect, maybe?

Yeah, that's what "attempt" means ;)

I do realize that several other state fields in this class are mutated as a side effect by working on the various buffers and progress pointers, but I'd still try to avoid such things whenever possible.

Sure, if there was an easy way to avoid side effects completely, I would. But since there is not, I'd rather keep the side effects in the same method, so that the purpose of the method is clearer.

But I'm not sure we'll be able to agree on that one :)
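The pattern under discussion can be sketched roughly as follows. Only the names contentLength, attemptOnePassEncoding and hintContentLength come from the diff; everything else (page size, body handling) is made up for illustration: contentLength defaults to -1 (unknown length, implying chunked transfer), and attemptOnePassEncoding() overwrites it as a side effect only when the attempt succeeds.

```java
// Sketch (assumed internals) of the "attempt + side effect" pattern:
// contentLength starts at -1, and attemptOnePassEncoding() hints the
// real length only when the whole content fits in the first buffer page.
class EntitySketch {
    static final int PAGE_SIZE = 8; // arbitrary, for illustration
    long contentLength;
    private final String body;

    EntitySketch(String body) {
        this.body = body;
        this.contentLength = -1;  // default: unknown length, chunked encoding
        attemptOnePassEncoding(); // may hint the real length as a side effect
    }

    private void attemptOnePassEncoding() {
        if ( body.length() <= PAGE_SIZE ) {
            hintContentLength( body.length() );
        }
        // otherwise keep -1 and fall back to chunked encoding
    }

    private void hintContentLength(long length) {
        this.contentLength = length;
    }
}

public class AttemptDemo {
    public static void main(String[] args) {
        System.out.println( new EntitySketch( "{}" ).contentLength );              // prints 2
        System.out.println( new EntitySketch( "{\"a\":\"bcdef\"}" ).contentLength ); // prints -1
    }
}
```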

Sanne (Member), Sep 22, 2017:

Ok. I didn't mean it in a negative way; I'm just pointing out that such changes are a bit subjective, and I sometimes like to better understand the thought process.
I guess what puts me off is the explicit assignment to -1, which we might change our minds about on the next line. But I'll take it :)

```java
	 * Previously used pages that are available for reuse.
	 */
	private final Deque<ByteBuffer> reusablePages = new LinkedList<>();
```

Sanne (Member), Sep 21, 2017:

This patch is making significant changes to a commit in the same PR, could you squash them together?

Also the commit message doesn't make much sense :) I guess you simply mean you decided you prefer to keep it simple?

The JVM will always reuse blocks as necessary in some smart way, but that's not the same as reusing the same pages as we would have done here. You're allocating objects and buffers, let's not talk around it :D

+1 to keep it simple and I agree that allocating these should not be a problem but that's for testing to measure.

yrodiere (Author, Member), Sep 22, 2017:

Ok, I'll squash it. The point of keeping the commit separate was merely to discuss the matter.

```java
		if ( nextBodyToEncodeIndex == bodyParts.size() ) {
			// The buffer's current content size is the final content size,
			// as we know the entire content has been encoded already,
			// and we also know no content was consumed from the buffer yet.
			hintContentLength( writer.buffer.contentSize() );
```

Sanne (Member), Sep 21, 2017:

that's an awesome optimisation! Great idea

```java
	/**
	 * Filled pages to be written, in write order.
	 */
	private final Deque<ByteBuffer> needWritingPages = new LinkedList<>();
```

Sanne (Member), Sep 21, 2017:

I hate LinkedList in performance-sensitive code :)

I understand what you're thinking: we expect this to be very small and possibly never grow at all... Still, an ArrayList is always more efficient, even a small one. There has even been a proposal to deprecate LinkedList, as there's no reason to ever use it, but it was decided to keep it as it's actually interesting for students.

Sanne (Member), Sep 21, 2017:

In conclusion: a small ArrayList initialized at, for example, 5 should do the trick and still be somewhat better.

A point to remember is that while the size of allocations matters, the number of objects being allocated matters as well, and finally the physical layout in memory: in both cases you have just pointers going to some other random location (wherever the page is), but with the array at least you load all the pointers contiguously.

yrodiere (Author, Member), Sep 22, 2017:

Ok, thanks, I fixed this. I need a Deque though, so I'll go with an ArrayDeque.
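For reference, a minimal sketch of the ArrayDeque variant (the field name comes from the snippet above; capacities and usage are arbitrary): element pointers live in one contiguous, growable array, FIFO consumption is O(1) at both ends, and no per-node objects are allocated, unlike LinkedList.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// ArrayDeque keeps its element pointers in one contiguous array, so iterating
// the pending pages touches memory sequentially and allocates no per-node
// objects, unlike LinkedList which allocates one node object per element.
public class DequeDemo {
    public static void main(String[] args) {
        // Small initial capacity, since we expect the deque to stay small:
        Deque<ByteBuffer> needWritingPages = new ArrayDeque<>( 5 );
        needWritingPages.addLast( ByteBuffer.allocate( 1024 ) );
        needWritingPages.addLast( ByteBuffer.allocate( 1024 ) );
        // Consume in write order (FIFO), as the entity flushes pages:
        while ( !needWritingPages.isEmpty() ) {
            ByteBuffer page = needWritingPages.pollFirst();
            System.out.println( "flushing page of capacity " + page.capacity() );
        }
    }
}
```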

```java
	 *
	 * @author Yoann Rodiere
	 */
	class AutoGrowingPagedBuffer {
```

Sanne (Member), Sep 21, 2017:

I like the design idea behind this class, but I'm wondering if you're not just merely re-writing the inner class I had created within GsonHttpEntity into a separate file.

When one class exists purely to simplify another one, and every single line of its design is influenced by this other class, there's a very high degree of coupling so I generally think it should be an inner class of GsonHttpEntity.

Thinking about your design process: I feel like you created this in a new file to have a clean slate; that's understandable, as you were taking "ownership" in your mind of the twisted stuff I had created before :)
Still, don't you think it belongs back in there now that you've fleshed it out?

yrodiere (Author, Member), Sep 22, 2017:

I agree that ProgressiveCharBufferWriter heavily relies on AutoGrowingPagedBuffer, and I agree we could merge ProgressiveCharBufferWriter with AutoGrowingPagedBuffer.
But I'm not sure I agree there is a high degree of coupling between GsonHttpEntity and AutoGrowingPagedBuffer: ProgressiveCharBufferWriter does not depend on GsonHttpEntity at all, and theoretically, we could perfectly well use ProgressiveCharBufferWriter for another type of entity.
Also, there's the matter of code being easily understood: nested classes are nice and all, but when files grow in size it becomes harder to understand the full extent of what they do (at least it's more difficult for me).

I'll merge AutoGrowingPagedBuffer with ProgressiveCharBufferWriter, but leave them in a separate file. Would that be ok with you?

Sanne (Member), Sep 22, 2017:

Yes sounds good. I agree that excessively large classes need to be split.

@yrodiere yrodiere force-pushed the yrodiere:HSEARCH-2886 branch from 13a7bb9 to be099ed Sep 22, 2017

@yrodiere yrodiere force-pushed the yrodiere:HSEARCH-2886 branch 3 times, most recently from 30a54a5 to 9cca34c Sep 22, 2017

yrodiere added some commits Aug 11, 2017

HSEARCH-2886 Simplify GsonHttpEntity by only ever storing byte buffers
We used to have two levels of buffering, one before encoding (char
buffers) and one after encoding (byte buffers). This implied frequently
copying char buffers and, more critically, splitting the char stream
arbitrarily, which resulted in encoding errors when characters stored
over multiple chars (more than 2 bytes) were split between two buffers.

After this commit, we encode everything on-the-fly, and store the
resulting bytes in buffers whenever flow control pushes back.
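The failure mode described in this commit message can be reproduced in isolation. The standalone sketch below (not code from the patch) encodes a 4-byte UTF-8 character, stored in Java as a surrogate pair of two chars, in two halves, as an arbitrary char-buffer split could do: each lone surrogate is unencodable on its own, so the content is corrupted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Demonstrates the class of bug described above: a 4-byte UTF-8 character
// (here U+1F600, a surrogate pair of two Java chars) encoded in two halves
// produces replacement bytes instead of the real character.
public class SplitSurrogateDemo {
    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600: 2 Java chars, 4 bytes in UTF-8
        byte[] wholePass = emoji.getBytes( StandardCharsets.UTF_8 );
        // Arbitrary split between the two chars, as two-level buffering could do:
        byte[] firstHalf = emoji.substring( 0, 1 ).getBytes( StandardCharsets.UTF_8 );
        byte[] secondHalf = emoji.substring( 1 ).getBytes( StandardCharsets.UTF_8 );
        byte[] splitPass = new byte[firstHalf.length + secondHalf.length];
        System.arraycopy( firstHalf, 0, splitPass, 0, firstHalf.length );
        System.arraycopy( secondHalf, 0, splitPass, firstHalf.length, secondHalf.length );
        System.out.println( wholePass.length );                    // prints 4
        System.out.println( Arrays.equals( wholePass, splitPass ) ); // prints false: corrupted
    }
}
```

Encoding everything on the fly, as this commit does, avoids the problem because the encoder always sees complete surrogate pairs.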
HSEARCH-2886 Elasticsearch: do not switch to chunked encoding when the very last body part overflows to a second buffer page

We used to switch to chunked encoding when the very last body part
forced us to use a second buffer, but this is no longer necessary,
since we now know the content-size in that specific case (because we
encode input immediately).

@yrodiere yrodiere force-pushed the yrodiere:HSEARCH-2886 branch from 9cca34c to 4fb8434 Sep 22, 2017


yrodiere commented Sep 22, 2017

@Sanne Hopefully this will be better. I replaced the LinkedList with an ArrayDeque and merged AutoGrowingPagedBuffer with ProgressiveCharBufferWriter. You were right about those two; we'd better merge them.
By the way, I also added an optimization in the resumePendingWrites() method: we'll now try to flush the current page only if it's full. This may reduce page allocations in some situations where output is written in small chunks.

Our two remaining disagreements are now addressed in separate commits. If you don't want to merge those, well... As you wish.


Sanne commented Sep 22, 2017

Great work! Just because I'm a bit paranoid and I finally have a bit of time, I'll profile it a bit to see how it looks in practice, but consider it accepted.


Sanne commented Sep 22, 2017

So there's some bad news: this is now back to allocating quite a few buffers.
In a one-minute sample under load I'm measuring:

  • 1.7 GB of byte arrays, because of the necessary ByteBuffer.allocate
  • 1.4 GB of char buffers, because of the CharBuffer.wrap in org.hibernate.search.elasticsearch.util.impl.ProgressiveCharBufferWriter.write(char[], int, int)

This second one is puzzling, as it should just wrap! Wondering how to interpret it.

But I'll merge it, as correctness comes first and the bug makes this urgent!
We can explore improvements later, and optionally, as it's not very bad (this is an extreme test).
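One possible reading of the puzzling CharBuffer.wrap numbers (an assumption, not a profile of this code): wrap never copies the underlying char[], but every call does allocate a fresh heap CharBuffer wrapper object, so a per-call wrap in a hot write path shows up as buffer-object allocations even though no character data is duplicated.

```java
import java.nio.CharBuffer;

// CharBuffer.wrap does not copy the char[] it is given, but it does allocate
// a new heap CharBuffer wrapper object on every call, which can explain
// wrapper allocations in a hot path like write(char[], int, int).
public class WrapDemo {
    public static void main(String[] args) {
        char[] chars = "hello".toCharArray();
        CharBuffer a = CharBuffer.wrap( chars );
        CharBuffer b = CharBuffer.wrap( chars );
        System.out.println( a == b );      // prints false: two distinct wrapper objects
        chars[0] = 'H';
        System.out.println( a.charAt( 0 ) ); // prints H: the array itself was not copied
    }
}
```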


Sanne commented Sep 22, 2017

Merged to master and backported to 5.8

Sanne closed this on Sep 22, 2017
