Enhancement/performance optimizations #473

puzpuzpuz · 2019-04-12T13:49:47Z

Optimization 1. Increases start size of buffer for output network messages (see ObjectDataOutput class) to 1 KB. This eliminates most of buffer allocations and copying, thus improves write throughput for many scenarios, like benchmark/MapPutRunner.js.
Optimization 2. Replaces string serialization/deserialization with standard API by default. The old serialization can be enabled serializationConfig.stringSerializationPolicy as a fallback, as members and other clients may have issues with 4 byte UTF-8 chars (see the readme update for more details).
Adds a new benchmark, benchmark/MapGetRunner.js, that does lots of map.get() for 100 KB string values.
Also introduces usage of safe-buffer polyfill library for Buffer operations into the client library and tests. Thus, on newer versions of Node there will be no deprecated warnings. Once we decide to abandon Node <5.10, we can simply remove this library without changing the code, except for imports.
Updates Bluebird library to 3.5.4 as there were some important fixes in its recent versions (see https://github.com/petkaantonov/bluebird/releases/tag/v3.5.2).
Updated tests accordingly to all changes.

This PR is a part of the following PRD: https://hazelcast.atlassian.net/wiki/spaces/PM/pages/1630634069/Node.js+Client+Performance+Research

Note: there are some issues with SSL-related tests on Jenkins, so they're failing. I didn't investigate the reason, but it's not related with changes in this PR, as the same failure occur for other PRs. Tests run successfully on my machine.

puzpuzpuz · 2019-04-15T06:29:27Z

verify

puzpuzpuz · 2019-04-16T07:54:49Z

verify

puzpuzpuz · 2019-04-17T06:27:06Z

verify

src/ClientMessage.ts

puzpuzpuz · 2019-04-17T09:40:51Z

verify

src/serialization/ObjectData.ts

test/serialization/DefaultSerializersLiveTest.js

mustafaiman · 2019-04-23T17:47:32Z

Generally, I am ok with the implementation of the ideas here. However, native utf serialization is breaking compatibility with all the other native clients and members as far as i understand. So if I put some strings that happen to include 4 byte utf characters, then I will not be able to read those from Java or other native clients. It is so until string serialization is fixed in HZ4. Is that correct? Moreover, if I am understanding this correctly, since the size field describing the string is changed, this may cause other clients to read garbage and completely fail if they try to read what is written by Node.js client.

HZ4 is a long time away now and it is not clear that string serialization is going to be fixed on member and all other clients' side yet. Even if it is going to be fixed, the procedure in clients development is to do the changes on Java client first and replicate it to other clients after.

Is fixing utf8 serialization agreed upon by clients team? @sancar @mdumandag

puzpuzpuz · 2019-04-24T07:06:29Z

@mustafaiman

So if I put some strings that happen to include 4 byte utf characters, then I will not be able to read those from Java or other native clients. It is so until string serialization is fixed in HZ4. Is that correct?

Yes, that's correct.

But not all non-Java clients might be affected. There are some clients that already use standard APIs for UTF-8 strings serialization. For example, Python client is already using standard API calls for writes and reads of string values (although for reads it does additional checks for each byte before making the standard API call, so reads are a mixture of custom implementation and standard API usage). As Python client is using standard API for writes, it may already be sending corrupted (from the custom, non-standard serialization point of view, of course) string binaries to IMDG side. And potentially it may be not able to deserialize those strings. But I didn't do any experiments with the Python client, so that's just a hypothesis.

Moreover, if I am understanding this correctly, since the size field describing the string is changed, this may cause other clients to read garbage and completely fail if they try to read what is written by Node.js client.

I did some experiments and tried to deserialize invalid (from UTF-8 perspective) binaries with Buffer#toString. It never throws. Instead it produces strings with some garbage chars. So, we should be safe here.

I could add some tests that cover the case when ObjectDataInput reads "garbage", so we could be more confident about safety of this workaround in future. WDYT?

Is fixing utf8 serialization agreed upon by clients team?

It was discussed on the client channel between Jaromir and me, but I don't recall if Sancar and Metin participated in that discussion.

In general, that's true that we're loosing backwards compatibility and compatibility with other clients in 4 byte UTF-8 char cases. The safest way would be to wait until we change the client protocol properly. But on the other hand, the performance benefit of this optimization for string deserialization is very high and Node.js client still has 0.x version, which allows breaking changes. Also there is a properly documented config option for restoring the old behavior and the new behavior is enabled by the default.

src/ClientMessage.ts

sancar · 2019-04-24T07:39:46Z

src/config/StringSerializationPolicy.ts

+ * Using this policy, one can control the
+ * serialization type of strings.
+ */
+export enum StringSerializationPolicy {


We need to decide the config in 4.0 planning in java client first.
Then we can revisit this pr.

mustafaiman · 2019-04-24T12:36:58Z

@puzpuzpuz

I did some experiments and tried to deserialize invalid (from UTF-8 perspective) binaries with Buffer#toString. It never throws. Instead it produces strings with some garbage chars. So, we should be safe here.

This is more about what comes after garbage strings. If you are reading more bytes than you are supposed to read, then you are corrupting the buffer. This may not be a problem where string is the last item to read. However, if there are more items in the message, those are corrupted and may lead to arbitrary failures.

I see the urgent need to fix the serialization protocol. I am ok with the changes here technically. I think it is up to clients team to decide how to move further from here.

puzpuzpuz · 2019-04-24T13:40:24Z

@mustafaiman

This is more about what comes after garbage strings. If you are reading more bytes than you are supposed to read, then you are corrupting the buffer. This may not be a problem where string is the last item to read. However, if there are more items in the message, those are corrupted and may lead to arbitrary failures.

I don't see how this might be happening. Even if buffer#toString here reads more than the required string binary contents, nothing wrong will happen. The buffer contents aren't affected by the read and the pos field of ObjectDataInput is correctly updated once the string was deserialized.

If you still see a scenario when the buffer or something else gets corrupted, could you describe it in more details?

mustafaiman · 2019-04-24T14:17:23Z

I was talking about the scenario where Node.js client writes somethin and Java clients tries to read it. Though, I still think both cases are problematic.

String serialization writes number of characters and a series of bytes after. The length of the series of bytes after differs between the new and the old format. Take an arbitrary string that has 10 characters but requires 40 bytes in the standard UTF8 format. However, Java client and members still expects the old format. They read the string character by character. They are going to mess up reading characters and they will end up trying to read a different number of bytes than 40. That is because our broken UTF8 algorithm may think a character takes 3 bytes (or 6 bytes) whereas the character takes actually 4 bytes. Then the client will read less(more) number of bytes and its pointer is going to be corrupted. When it tries to read the next thing, it is going to start mid string which is structurally garbage. This may crash the client or the member.

Looks like you know the scenario. This is from the performance TDD:

e.g. they don't serialize 4 byte chars, like some less common CJK characters and emoji, correctly. For those chars existing serialization generates 6 bytes sequence which is correctly deserialized into the original string by the reverse algorithm. As an example, for this char existing serialization produces ed a0 bd ed b8 ad, while according to the standard it should be f0 9f 98 ad. For a string binary processed with the standard serialization members and other clients may throw a UTFDataFormatException (error code 64) when they deserialize the value. However, in most cases members don't deserialize string values at all, e.g. in a map.set() => map.get() scenario members store and return the binary data.

All I am saying is that if the member or client throws an exception because a string is malformed, then that member does not know the actual number of bytes that it should have read. Then, there is no chance that it updates the pointer correctly.

puzpuzpuz · 2019-04-24T15:05:59Z

@mustafaiman thanks for the clarification. Now I understand your concern.

And yes, I'm aware of it. I can see two problematic scenarios:

Node client reads something (with string values) that was serialized on Java side or another client.
Java client/member or another client reads something (with string values) that was serialized by Node client.

Note: both of them describe the case with 4 byte char only, which is not that common (but still it may happen in real world scenarios).

The first scenario is the worst one, because, as you've mentioned Node client might read corrupted string and continue with reading the buffer from a wrong position after that. In this case, there might be no error thrown and the end app will just get corrupted data.

In the second scenario, it should end up with a UTFDataFormatException as binary representation of those chars falls out of the range check (but this needs to be double checked). So, this should be a fail-fast scenario which is good.

mdumandag · 2019-04-30T07:41:02Z

verify

README.md

* Fix package-lock.json to match package.json * Remove deprecated Buffer constructor calls * Add polyfill Buffer library for Node 4 compatibility * Fix code style issues * Add greedy buffer allocation for ObjectDataOutput * Add map.get based benchmark * Upgrade Bluebird library (there were some important fixes) * Improve readUTF/writeUTF performance for ObjectData * Fix deserialization in ObjectDataInput#readUTF * Fix null string serialization in ObjectDataOutput#writeUTF * Add string serialization config setting (legacy/standard) * Revert changes in PortableTest * Add documentation for the new string serialization config option * Migrate tests to safe-buffer * Improve default serializers integration tests * Fix readme section for string serialization * Improve MapGetRunner benchmark * Fix code issues after code review * Fix readme issues for string serialization

puzpuzpuz added the Type: Enhancement label Apr 12, 2019

puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch 5 times, most recently from 77f685e to f16c2ab Compare April 13, 2019 11:47

puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch from ea0b2b3 to e69cbcf Compare April 16, 2019 07:53

puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch from a77ef28 to 84cc8c9 Compare April 16, 2019 09:13

puzpuzpuz changed the title ~~[WIP] Enhancement/reduce buffer allocations~~ [WIP] Enhancement/performance optimizations Apr 16, 2019

puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch 2 times, most recently from f6b82e6 to efa7a2d Compare April 17, 2019 08:49

puzpuzpuz commented Apr 17, 2019

View reviewed changes

src/ClientMessage.ts Show resolved Hide resolved

puzpuzpuz changed the title ~~[WIP] Enhancement/performance optimizations~~ Enhancement/performance optimizations Apr 17, 2019

puzpuzpuz requested a review from mustafaiman April 17, 2019 09:41

mustafaiman reviewed Apr 22, 2019

View reviewed changes

src/serialization/ObjectData.ts Outdated Show resolved Hide resolved

src/serialization/ObjectData.ts Show resolved Hide resolved

test/serialization/DefaultSerializersLiveTest.js Outdated Show resolved Hide resolved

sancar reviewed Apr 24, 2019

View reviewed changes

src/ClientMessage.ts Show resolved Hide resolved

sancar reviewed Apr 24, 2019

View reviewed changes

mdumandag reviewed Apr 30, 2019

View reviewed changes

README.md Outdated Show resolved Hide resolved

README.md Outdated Show resolved Hide resolved

puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch from b93c289 to 2e5a33e Compare April 30, 2019 08:59

puzpuzpuz added 19 commits April 30, 2019 11:59

Fix package-lock.json to match package.json

bcf2b11

Remove deprecated Buffer constructor calls

b8be631

Add polyfill Buffer library for Node 4 compatibility

9ebbeae

Fix code style issues

b727257

Add greedy buffer allocation for ObjectDataOutput

6cfd553

Add map.get based benchmark

0704863

Upgrade Bluebird library (there were some important fixes)

4ecc083

Improve readUTF/writeUTF performance for ObjectData

f3e2b69

Fix deserialization in ObjectDataInput#readUTF

0b11b45

Fix null string serialization in ObjectDataOutput#writeUTF

ac54e16

Add string serialization config setting (legacy/standard)

9a7ea5f

Revert changes in PortableTest

84c2009

Add documentation for the new string serialization config option

dea248d

Migrate tests to safe-buffer

58e84f2

Improve default serializers integration tests

f34c7fc

Fix readme section for string serialization

95b9ed6

Improve MapGetRunner benchmark

f1206c6

Fix code issues after code review

525c601

Fix readme issues for string serialization

df07a01

mdumandag approved these changes Apr 30, 2019

View reviewed changes

puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch from 2e5a33e to df07a01 Compare April 30, 2019 09:04

asimarslan approved these changes Apr 30, 2019

View reviewed changes

mdumandag merged commit 4d81f32 into hazelcast:master Apr 30, 2019

puzpuzpuz deleted the enhancement/reduce-buffer-allocations branch April 30, 2019 13:20

puzpuzpuz mentioned this pull request May 14, 2021

Memory and performance optimization [API-422] hazelcast/hazelcast-go-client#555

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement/performance optimizations #473

Enhancement/performance optimizations #473

puzpuzpuz commented Apr 12, 2019 •

edited

Loading

puzpuzpuz commented Apr 15, 2019

puzpuzpuz commented Apr 16, 2019

puzpuzpuz commented Apr 17, 2019

puzpuzpuz commented Apr 17, 2019

mustafaiman commented Apr 23, 2019

puzpuzpuz commented Apr 24, 2019 •

edited

Loading

sancar Apr 24, 2019

mustafaiman commented Apr 24, 2019

puzpuzpuz commented Apr 24, 2019 •

edited

Loading

mustafaiman commented Apr 24, 2019

puzpuzpuz commented Apr 24, 2019 •

edited

Loading

mdumandag commented Apr 30, 2019

Enhancement/performance optimizations #473

Enhancement/performance optimizations #473

Conversation

puzpuzpuz commented Apr 12, 2019 • edited Loading

puzpuzpuz commented Apr 15, 2019

puzpuzpuz commented Apr 16, 2019

puzpuzpuz commented Apr 17, 2019

puzpuzpuz commented Apr 17, 2019

mustafaiman commented Apr 23, 2019

puzpuzpuz commented Apr 24, 2019 • edited Loading

sancar Apr 24, 2019

Choose a reason for hiding this comment

mustafaiman commented Apr 24, 2019

puzpuzpuz commented Apr 24, 2019 • edited Loading

mustafaiman commented Apr 24, 2019

puzpuzpuz commented Apr 24, 2019 • edited Loading

mdumandag commented Apr 30, 2019

puzpuzpuz commented Apr 12, 2019 •

edited

Loading

puzpuzpuz commented Apr 24, 2019 •

edited

Loading

puzpuzpuz commented Apr 24, 2019 •

edited

Loading

puzpuzpuz commented Apr 24, 2019 •

edited

Loading