
Enhancement/performance optimizations #473

Merged

Conversation

@puzpuzpuz (Contributor) commented Apr 12, 2019

  • Optimization 1. Increases the initial buffer size for outgoing network messages (see the ObjectDataOutput class) to 1 KB. This eliminates most buffer allocations and copying, and thus improves write throughput in many scenarios, such as benchmark/MapPutRunner.js.
  • Optimization 2. Replaces string serialization/deserialization with the standard API by default. The old serialization can be re-enabled via serializationConfig.stringSerializationPolicy as a fallback, since members and other clients may have issues with 4-byte UTF-8 chars (see the readme update for more details and the config sketch after this list).
  • Adds a new benchmark, benchmark/MapGetRunner.js, that does lots of map.get() calls for 100 KB string values.
  • Also introduces the safe-buffer polyfill library for Buffer operations in the client library and tests, so there will be no deprecation warnings on newer versions of Node. Once we decide to drop support for Node < 5.10, we can simply remove this library without changing the code, except for imports.
  • Updates the Bluebird library to 3.5.4, as there were some important fixes in its recent versions (see https://github.com/petkaantonov/bluebird/releases/tag/v3.5.2).
  • Updates tests according to all of these changes.
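
For reference, here is a minimal sketch of how the Optimization 2 fallback could be enabled through the client config. The top-level export of StringSerializationPolicy and the LEGACY member name are assumptions for illustration; the readme update has the authoritative example.

```js
const { Client, Config, StringSerializationPolicy } = require('hazelcast-client');

const config = new Config.ClientConfig();
// Fall back to the old string serialization, e.g. when members or other
// clients cannot yet handle standard UTF-8 output for 4-byte chars.
config.serializationConfig.stringSerializationPolicy = StringSerializationPolicy.LEGACY;

Client.newHazelcastClient(config).then((client) => {
    // Strings written by this client now use the legacy algorithm.
    return client.shutdown();
});
```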

This PR is part of the following PRD: https://hazelcast.atlassian.net/wiki/spaces/PM/pages/1630634069/Node.js+Client+Performance+Research

Note: there are some issues with SSL-related tests on Jenkins, so they're failing. I didn't investigate the reason, but it's not related to the changes in this PR, as the same failures occur for other PRs. Tests run successfully on my machine.

@puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch 5 times, most recently from 77f685e to f16c2ab on April 13, 2019 at 11:47
@puzpuzpuz (Contributor, Author)

verify

@puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch from ea0b2b3 to e69cbcf on April 16, 2019 at 07:53
@puzpuzpuz (Contributor, Author)

verify

@puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch from a77ef28 to 84cc8c9 on April 16, 2019 at 09:13
@puzpuzpuz changed the title from "[WIP] Enhancement/reduce buffer allocations" to "[WIP] Enhancement/performance optimizations" on Apr 16, 2019
@puzpuzpuz (Contributor, Author)

verify

@puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch 2 times, most recently from f6b82e6 to efa7a2d on April 17, 2019 at 08:49
@puzpuzpuz changed the title from "[WIP] Enhancement/performance optimizations" to "Enhancement/performance optimizations" on Apr 17, 2019
@puzpuzpuz (Contributor, Author)

verify

src/serialization/ObjectData.ts (review thread: outdated, resolved)
src/serialization/ObjectData.ts (review thread: resolved)
test/serialization/DefaultSerializersLiveTest.js (review thread: outdated, resolved)
@mustafaiman (Contributor)

Generally, I am OK with the implementation of the ideas here. However, native UTF serialization breaks compatibility with all the other native clients and members, as far as I understand. So if I put some strings that happen to include 4-byte UTF characters, then I will not be able to read those from Java or other native clients. It is so until string serialization is fixed in HZ4. Is that correct? Moreover, if I am understanding this correctly, since the size field describing the string is changed, this may cause other clients to read garbage and completely fail if they try to read what is written by the Node.js client.

HZ4 is a long way off, and it is not yet clear whether string serialization is going to be fixed on the member and all the other clients' side. Even if it is, the procedure in client development is to make changes in the Java client first and replicate them to the other clients afterwards.

Is fixing UTF-8 serialization agreed upon by the clients team? @sancar @mdumandag

@puzpuzpuz (Contributor, Author) commented Apr 24, 2019

@mustafaiman

So if I put some strings that happen to include 4-byte UTF characters, then I will not be able to read those from Java or other native clients. It is so until string serialization is fixed in HZ4. Is that correct?

Yes, that's correct.

But not all non-Java clients may be affected. There are some clients that already use standard APIs for UTF-8 string serialization. For example, the Python client already uses standard API calls for writes and reads of string values (although for reads it does additional checks on each byte before making the standard API call, so reads are a mixture of a custom implementation and standard API usage). As the Python client uses the standard API for writes, it may already be sending corrupted (from the custom, non-standard serialization point of view, of course) string binaries to the IMDG side. And potentially it may not be able to deserialize those strings. But I didn't do any experiments with the Python client, so that's just a hypothesis.

Moreover, if I am understanding this correctly, since the size field describing the string is changed, this may cause other clients to read garbage and completely fail if they try to read what is written by the Node.js client.

I did some experiments and tried to deserialize invalid (from the UTF-8 perspective) binaries with Buffer#toString. It never throws; instead it produces strings with some garbage chars. So, we should be safe here.
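
For anyone who wants to reproduce this quickly, here is a tiny standalone check of that Buffer behavior (plain Node.js, not code from this PR):

```js
// A byte sequence that is not valid UTF-8: 'hi' followed by a lone
// continuation byte and a truncated 4-byte sequence.
const invalid = Buffer.from([0x68, 0x69, 0x80, 0xf0, 0x9f]);

// Buffer#toString('utf8') never throws on malformed input; it substitutes
// U+FFFD replacement characters for the bytes it cannot decode.
console.log(invalid.toString('utf8')); // prints something like 'hi\uFFFD\uFFFD'
```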

I could add some tests that cover the case when ObjectDataInput reads "garbage", so we could be more confident about the safety of this workaround in the future. WDYT?

Is fixing UTF-8 serialization agreed upon by the clients team?

It was discussed on the client channel between Jaromir and me, but I don't recall if Sancar and Metin participated in that discussion.

In general, it's true that we're losing backwards compatibility and compatibility with other clients in the 4-byte UTF-8 char cases. The safest way would be to wait until we change the client protocol properly. But on the other hand, the performance benefit of this optimization for string deserialization is very high, and the Node.js client still has a 0.x version, which allows breaking changes. Also, there is a properly documented config option for restoring the old behavior, and the new behavior is enabled by default.

* Using this policy, one can control the
* serialization type of strings.
*/
export enum StringSerializationPolicy {

We need to decide on the config during 4.0 planning in the Java client first.
Then we can revisit this PR.

@mustafaiman (Contributor)

@puzpuzpuz

I did some experiments and tried to deserialize invalid (from the UTF-8 perspective) binaries with Buffer#toString. It never throws; instead it produces strings with some garbage chars. So, we should be safe here.

This is more about what comes after garbage strings. If you are reading more bytes than you are supposed to read, then you are corrupting the buffer. This may not be a problem where the string is the last item to read. However, if there are more items in the message, those get corrupted, which may lead to arbitrary failures.

I see the urgent need to fix the serialization protocol. I am OK with the changes here technically. I think it is up to the clients team to decide how to move forward from here.

@puzpuzpuz (Contributor, Author) commented Apr 24, 2019

@mustafaiman

This is more about what comes after garbage strings. If you are reading more bytes than you are supposed to read, then you are corrupting the buffer. This may not be a problem where the string is the last item to read. However, if there are more items in the message, those get corrupted, which may lead to arbitrary failures.

I don't see how this might happen. Even if Buffer#toString here reads more than the required string binary contents, nothing wrong will happen. The buffer contents aren't affected by the read, and the pos field of ObjectDataInput is correctly updated once the string has been deserialized.
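
To illustrate the point, here is a simplified, hypothetical sketch of a standard-API read over a shared buffer. The names are made up for the example, and it assumes the length prefix stores the byte length of the encoded string; this is not the actual ObjectDataInput code.

```js
class SimpleDataInput {
    constructor(buffer) {
        this.buffer = buffer; // shared buffer, never mutated by reads
        this.pos = 0;
    }

    readString() {
        // Read the length prefix and advance past it.
        const len = this.buffer.readInt32BE(this.pos);
        this.pos += 4;
        // Decoding never mutates the buffer and never throws on bad UTF-8.
        const value = this.buffer.toString('utf8', this.pos, this.pos + len);
        // Advance by the prefixed byte count, not by whatever was decoded.
        this.pos += len;
        return value;
    }
}
```

As long as the prefix matches the number of bytes that were actually written, the next read starts at the right offset regardless of what the decoded string looks like.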

If you still see a scenario in which the buffer or something else gets corrupted, could you describe it in more detail?

@mustafaiman (Contributor)

I was talking about the scenario where the Node.js client writes something and the Java client tries to read it. Though, I still think both cases are problematic.

String serialization writes the number of characters followed by a series of bytes. The length of that byte series differs between the new and the old format. Take an arbitrary string that has 10 characters but requires 40 bytes in the standard UTF-8 format. However, the Java client and members still expect the old format. They read the string character by character. They are going to mess up reading characters and will end up trying to read a different number of bytes than 40. That is because our broken UTF-8 algorithm may think a character takes 3 bytes (or 6 bytes) whereas the character actually takes 4 bytes. Then the client will read fewer (or more) bytes and its pointer is going to be corrupted. When it tries to read the next thing, it is going to start mid-string, which is structurally garbage. This may crash the client or the member.

Looks like you know the scenario. This is from the performance TDD:

e.g. they don't serialize 4 byte chars, like some less common CJK characters and emoji, correctly. For those chars existing serialization generates 6 bytes sequence which is correctly deserialized into the original string by the reverse algorithm. As an example, for this char existing serialization produces ed a0 bd ed b8 ad, while according to the standard it should be f0 9f 98 ad. For a string binary processed with the standard serialization members and other clients may throw a UTFDataFormatException (error code 64) when they deserialize the value. However, in most cases members don't deserialize string values at all, e.g. in a map.set() => map.get() scenario members store and return the binary data.

All I am saying is that if the member or client throws an exception because a string is malformed, then that member does not know the actual number of bytes that it should have read. Then, there is no chance that it updates the pointer correctly.
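
To make the byte-level difference from the quoted TDD paragraph concrete, here is a small standalone illustration. The byte sequence f0 9f 98 ad corresponds to U+1F62D (😭); the legacyEncode helper is only a sketch of per-UTF-16-code-unit encoding, not the client's actual writeUTF code.

```js
const emoji = '\u{1F62D}'; // stored as the surrogate pair D83D DE2D in UTF-16

// Standard UTF-8: one 4-byte sequence.
console.log(Buffer.from(emoji, 'utf8')); // <Buffer f0 9f 98 ad>

// Legacy-style encoding: every UTF-16 code unit (including surrogate halves)
// is encoded on its own, so the pair becomes two 3-byte sequences.
function legacyEncode(str) {
    const bytes = [];
    for (let i = 0; i < str.length; i++) {
        const c = str.charCodeAt(i);
        if (c < 0x80) {
            bytes.push(c);
        } else if (c < 0x800) {
            bytes.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
        } else {
            bytes.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
        }
    }
    return Buffer.from(bytes);
}

console.log(legacyEncode(emoji)); // <Buffer ed a0 bd ed b8 ad>
```

A reader that expects one of these formats but receives the other counts characters and bytes differently, which is exactly the pointer-corruption scenario described above.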

@puzpuzpuz (Contributor, Author) commented Apr 24, 2019

@mustafaiman thanks for the clarification. Now I understand your concern.

And yes, I'm aware of it. I can see two problematic scenarios:

  1. The Node client reads something (with string values) that was serialized on the Java side or by another client.
  2. A Java client/member or another client reads something (with string values) that was serialized by the Node client.

Note: both of them describe the 4-byte char case only, which is not that common (but it may still happen in real-world scenarios).

The first scenario is the worst one because, as you've mentioned, the Node client might read a corrupted string and then continue reading the buffer from a wrong position. In this case, there might be no error thrown, and the end app will just get corrupted data.

In the second scenario, it should end up with a UTFDataFormatException, as the binary representation of those chars falls outside the range check (but this needs to be double-checked). So this should be a fail-fast scenario, which is good.

@mdumandag (Contributor)

verify

README.md (review thread: outdated, resolved)
README.md (review thread: outdated, resolved)
@puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch from b93c289 to 2e5a33e on April 30, 2019 at 08:59
@puzpuzpuz force-pushed the enhancement/reduce-buffer-allocations branch from 2e5a33e to df07a01 on April 30, 2019 at 09:04
@mdumandag merged commit 4d81f32 into hazelcast:master on Apr 30, 2019
@puzpuzpuz deleted the enhancement/reduce-buffer-allocations branch on April 30, 2019 at 13:20
harunalpak pushed a commit to harunalpak/hazelcast-nodejs-client that referenced this pull request Dec 8, 2022
* Fix package-lock.json to match package.json

* Remove deprecated Buffer constructor calls

* Add polyfill Buffer library for Node 4 compatibility

* Fix code style issues

* Add greedy buffer allocation for ObjectDataOutput

* Add map.get based benchmark

* Upgrade Bluebird library (there were some important fixes)

* Improve readUTF/writeUTF performance for ObjectData

* Fix deserialization in ObjectDataInput#readUTF

* Fix null string serialization in ObjectDataOutput#writeUTF

* Add string serialization config setting (legacy/standard)

* Revert changes in PortableTest

* Add documentation for the new string serialization config option

* Migrate tests to safe-buffer

* Improve default serializers integration tests

* Fix readme section for string serialization

* Improve MapGetRunner benchmark

* Fix code issues after code review

* Fix readme issues for string serialization