Enhancement/performance optimizations #473
Generally, I am OK with the implementation of the ideas here. However, as far as I understand, native UTF serialization breaks compatibility with all the other native clients and members. So if I put some strings that happen to include 4-byte UTF characters, I will not be able to read them from Java or other native clients, and this will remain so until string serialization is fixed in HZ4. Is that correct? Moreover, if I am understanding this correctly, since the size field describing the string is changed, other clients may read garbage and fail completely if they try to read what the Node.js client has written. HZ4 is a long way off, and it is not yet clear that string serialization is going to be fixed on the member side and in all the other clients. Even if it is going to be fixed, the procedure in client development is to make the changes in the Java client first and replicate them to the other clients afterwards. Has fixing UTF-8 serialization been agreed upon by the clients team? @sancar @mdumandag
Yes, that's correct. But not all non-Java clients might be affected. Some clients already use standard APIs for UTF-8 string serialization. For example, the Python client already uses standard API calls for writes and reads of string values (although for reads it does additional checks for each byte before making the standard API call, so reads are a mixture of custom implementation and standard API usage). As the Python client uses the standard API for writes, it may already be sending corrupted (from the custom, non-standard serialization point of view, of course) string binaries to the IMDG side. And potentially it may not be able to deserialize those strings. But I didn't do any experiments with the Python client, so that's just a hypothesis.
I did some experiments and tried to deserialize invalid (from UTF-8 perspective) binaries with I could add some tests that cover the case when
It was discussed on the client channel between Jaromir and me, but I don't recall if Sancar and Metin participated in that discussion. In general, it's true that we're losing backwards compatibility and compatibility with other clients in the 4-byte UTF-8 char cases. The safest way would be to wait until we change the client protocol properly. But on the other hand, the performance benefit of this optimization for string deserialization is very high, and the Node.js client still has a 0.x version, which allows breaking changes. Also, there is a properly documented config option for restoring the old behavior, and the new behavior is enabled by default.
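To make the fallback concrete, here is a minimal sketch of how such a setting could look. The enum name `StringSerializationPolicy` and the `stringSerializationPolicy` property come from this PR; the member names `STANDARD` and `LEGACY`, the `SerializationConfig` interface, and the `defaultSerializationConfig` helper are assumptions made for illustration and are declared locally rather than imported from the client.

```typescript
// Hypothetical stand-ins for the client's types. The member names are assumed
// from the "legacy/standard" wording used in this PR.
enum StringSerializationPolicy {
    STANDARD, // standard UTF-8, the new default
    LEGACY,   // the old custom format that members and other clients expect
}

// Hypothetical shape of the relevant part of the serialization config.
interface SerializationConfig {
    stringSerializationPolicy: StringSerializationPolicy;
}

// The new behavior is enabled by default, as stated above.
function defaultSerializationConfig(): SerializationConfig {
    return { stringSerializationPolicy: StringSerializationPolicy.STANDARD };
}

// Restore the old behavior when other clients or members must read the strings.
const serializationConfig = defaultSerializationConfig();
serializationConfig.stringSerializationPolicy = StringSerializationPolicy.LEGACY;
```

Keeping the switch as a single enum-valued property means an application that shares data with Java or other native clients only has to change one line to opt out.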
 * Using this policy, one can control the
 * serialization type of strings.
 */
export enum StringSerializationPolicy {
We need to decide on the config in the 4.0 planning for the Java client first.
Then we can revisit this PR.
This is more about what comes after garbage strings. If you are reading more bytes than you are supposed to read, then you are corrupting the buffer. This may not be a problem when the string is the last item to read. However, if there are more items in the message, those are corrupted and may lead to arbitrary failures. I see the urgent need to fix the serialization protocol. I am OK with the changes here technically. I think it is up to the clients team to decide how to move on from here.
I don't see how this might be happening. Even if If you still see a scenario when the buffer or something else gets corrupted, could you describe it in more detail?
I was talking about the scenario where the Node.js client writes something and a Java client tries to read it. Though, I still think both cases are problematic. String serialization writes the number of characters and then a series of bytes. The length of that series of bytes differs between the new and the old format. Take an arbitrary string that has 10 characters but requires 40 bytes in the standard UTF-8 format. The Java client and members still expect the old format. They read the string character by character, so they are going to mess up reading the characters and will end up trying to read a number of bytes different from 40. That is because our broken UTF-8 algorithm may think a character takes 3 bytes (or 6 bytes) whereas the character actually takes 4 bytes. Then the client will read fewer (or more) bytes, and its pointer is going to be corrupted. When it tries to read the next thing, it is going to start mid-string, which is structurally garbage. This may crash the client or the member. Looks like you know the scenario. This is from the performance TDD:
All I am saying is that if the member or client throws an exception because a string is malformed, then that member does not know the actual number of bytes that it should have read. Then there is no chance that it updates the pointer correctly.
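The mismatch described above can be sketched with two byte-length calculations. This is an illustration only: `standardUtf8Length` follows standard UTF-8, while `perCodeUnitLength` is a hypothetical model of a reader that treats every UTF-16 code unit as its own character of up to 3 bytes, which is assumed here to approximate the old format (it is not the members' actual code).

```typescript
// Byte length of `s` in standard UTF-8: a surrogate pair is one 4-byte character.
function standardUtf8Length(s: string): number {
    let bytes = 0;
    for (let i = 0; i < s.length; i++) {
        const unit = s.charCodeAt(i);
        if (unit < 0x80) {
            bytes += 1;
        } else if (unit < 0x800) {
            bytes += 2;
        } else if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length &&
                   s.charCodeAt(i + 1) >= 0xdc00 && s.charCodeAt(i + 1) <= 0xdfff) {
            bytes += 4; // high + low surrogate form one supplementary character
            i++;        // skip the low surrogate
        } else {
            bytes += 3;
        }
    }
    return bytes;
}

// Assumed legacy-style model: each UTF-16 code unit is encoded on its own,
// so a surrogate pair costs 3 + 3 = 6 bytes instead of 4.
function perCodeUnitLength(s: string): number {
    let bytes = 0;
    for (let i = 0; i < s.length; i++) {
        const unit = s.charCodeAt(i);
        bytes += unit < 0x80 ? 1 : unit < 0x800 ? 2 : 3;
    }
    return bytes;
}

// Ten characters that each need 4 bytes in standard UTF-8 (U+1F600).
const emoji = "\uD83D\uDE00";
let tenEmoji = "";
for (let i = 0; i < 10; i++) {
    tenEmoji += emoji;
}

const bytesWritten = standardUtf8Length(tenEmoji);          // 40
const bytesLegacyWouldUse = perCodeUnitLength(tenEmoji);    // 60
```

Under these assumptions the writer emits 40 bytes while a per-code-unit reader budgets 60, so whatever follows the string in the message is read from the wrong offset.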
@mustafaiman thanks for the clarification. Now I understand your concern. And yes, I'm aware of it. I can see two problematic scenarios:
Note: both of them describe the case with 4-byte chars only, which is not that common (but it may still happen in real-world scenarios). The first scenario is the worst one because, as you've mentioned, the Node client might read a corrupted string and then continue reading the buffer from a wrong position. In this case, there might be no error thrown and the end app will just get corrupted data. In the second scenario, it should end up with a
* Fix package-lock.json to match package.json
* Remove deprecated Buffer constructor calls
* Add polyfill Buffer library for Node 4 compatibility
* Fix code style issues
* Add greedy buffer allocation for ObjectDataOutput
* Add map.get based benchmark
* Upgrade Bluebird library (there were some important fixes)
* Improve readUTF/writeUTF performance for ObjectData
* Fix deserialization in ObjectDataInput#readUTF
* Fix null string serialization in ObjectDataOutput#writeUTF
* Add string serialization config setting (legacy/standard)
* Revert changes in PortableTest
* Add documentation for the new string serialization config option
* Migrate tests to safe-buffer
* Improve default serializers integration tests
* Fix readme section for string serialization
* Improve MapGetRunner benchmark
* Fix code issues after code review
* Fix readme issues for string serialization
`ObjectDataOutput` class) to 1 KB. This eliminates most buffer allocations and copying, and thus improves write throughput for many scenarios, like `benchmark/MapPutRunner.js`.
`serializationConfig.stringSerializationPolicy` as a fallback, as members and other clients may have issues with 4-byte UTF-8 chars (see the readme update for more details).
`benchmark/MapGetRunner.js`, that does lots of `map.get()` for 100 KB string values.
`safe-buffer` polyfill library for Buffer operations into the client library and tests. Thus, on newer versions of Node there will be no deprecation warnings. Once we decide to abandon Node <5.10, we can simply remove this library without changing the code, except for imports.

This PR is a part of the following PRD: https://hazelcast.atlassian.net/wiki/spaces/PM/pages/1630634069/Node.js+Client+Performance+Research
Note: there are some issues with SSL-related tests on Jenkins, so they're failing. I didn't investigate the reason, but it's not related to the changes in this PR, as the same failure occurs for other PRs. Tests run successfully on my machine.
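For readers unfamiliar with why a 1 KB initial buffer helps write throughput, the effect can be sketched with a small growable output. `GrowableOutput` is a hypothetical class written for this illustration, not the client's `ObjectDataOutput`; the doubling growth strategy and the use of `Uint8Array` are assumptions.

```typescript
// Hypothetical growable byte output; counts how many buffers it allocates.
class GrowableOutput {
    private buffer: Uint8Array;
    private pos = 0;
    allocations = 1; // the initial buffer counts as the first allocation

    constructor(initialSize: number = 1024) {
        this.buffer = new Uint8Array(initialSize);
    }

    writeByte(value: number): void {
        this.ensureAvailable(1);
        this.buffer[this.pos++] = value & 0xff;
    }

    // Grows by doubling (an assumed strategy) and copies the bytes written so far.
    private ensureAvailable(extra: number): void {
        const needed = this.pos + extra;
        if (needed <= this.buffer.length) {
            return;
        }
        let newSize = Math.max(this.buffer.length, 1);
        while (newSize < needed) {
            newSize *= 2;
        }
        const grown = new Uint8Array(newSize);
        grown.set(this.buffer.subarray(0, this.pos));
        this.buffer = grown;
        this.allocations++;
    }

    toBytes(): Uint8Array {
        return this.buffer.slice(0, this.pos);
    }
}

// Write the same 500-byte payload with a small and a 1 KB initial buffer.
const smallStart = new GrowableOutput(16);
const largeStart = new GrowableOutput(1024);
for (let i = 0; i < 500; i++) {
    smallStart.writeByte(i);
    largeStart.writeByte(i);
}
// smallStart.allocations is 6 (16, 32, 64, 128, 256, 512); largeStart.allocations is 1.
```

In this sketch every payload under 1 KB is written without a single reallocation or copy, which is the kind of saving the put benchmark is meant to expose.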