
High memory consumption while recovering damaged data file (in scenario in-file storage) #1255

Closed
Jerito-kun opened this issue Jun 23, 2022 · 7 comments · Fixed by #1259 or #1266

Comments

@Jerito-kun

When (for an unknown reason) one of the data files is damaged (for example, a bad EOF mark), the server makes a note in the log file:

{"level":"info","time":"2022-06-23T13:14:37+03:00","message":"Server is ready"}
{"level":"info","time":"2022-06-23T13:14:38+03:00","message":"STREAM: Recovering the state..."}
{"level":"error","time":"2022-06-23T13:14:43+03:00","message":"STREAM: Verification of last message for file "C:\\ProgramData\\Some_folder\\NSS\\data\\Some_file.Log\\msgs.1.dat" failed: unable to read last record: unexpected EOF"}
{"level":"error","time":"2022-06-23T13:14:43+03:00","message":"STREAM: Error with index file "C:\\ProgramData\\Some_folder\\NSS\\data\\Some_file.Log\\msgs.1.idx": Verification of last message for file "C:\\ProgramData\\Some_folder\\NSS\\data\\Some_file.Log\\msgs.1.dat" failed: unable to read last record: unexpected EOF. Truncating and recovering from data file"}

Memory usage exceeds 1.6 GB, instead of ~30 MB in normal operating mode.


The problem goes away after deleting the damaged file or restarting the server, but in that case the damaged file continues to exist, and other software that uses this file to operate cannot start and work.
This situation occurs on both Windows and Linux hosts, including software version 0.24.6.

@Jerito-kun
Author

Please let me know if I should provide any other information.

@kozlovic
Member

Do you have the store file that we could use to reproduce? It is possible that corruption causes the server to think that a payload, etc., is of a bigger size than it actually is, but this is just speculation at this point.

I am not clear on this:

The problem goes away after deleting the damaged file or restarting the server, but in that case the damaged file continues to exist, and other software that uses this file to operate cannot start and work.

If you delete the file, which then resolves the issue, how can the file "continue to exist"?

@Jerito-kun
Author

Do you have the store file that we could use to reproduce?

Yes, sure. Could you please tell me how I can send you these files?

If you delete the file, which then resolves the issue, how can the file "continue to exist"?

I mean the situation when someone kills the server during file repair and then starts it again; in that case the damaged file still exists but remains unrepaired.

@kozlovic
Member

Yes, sure. Could you please tell me, which way can I send you these files?

If you have the corruption, you could send the whole datastore directory to ivan@nats.io.

@kozlovic
Member

kozlovic commented Jul 6, 2022

@Jerito-kun As suspected, the high memory usage is due to a corrupted record that indicates that the message payload is ~1.6GB, which then causes a memory allocation of that size. Note that the buffer is then released, so the actual memory in use is not that much, and when garbage collection kicks in, the memory should be reclaimed (I have verified with -profile <port> that this memory was allocated but not currently in use).

Given that, I am not sure what the best course of action is here. I could make reading a record fail if it means that the server would have to create a buffer above a certain size, but which size? Hard-coded? A new option? Or leave it as-is, again knowing that the memory was allocated but not currently in use.
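
For illustration, here is a minimal Go sketch of the failure mode described above: a reader that trusts the size field in a record header and allocates a buffer of that size before reading the payload. The 4-byte little-endian size prefix, the readRecord helper, and the error wrapping are assumptions for this sketch, not the server's actual file-store code.

```go
// Minimal sketch (not nats-streaming-server code) of how a corrupted size
// header leads to a large allocation: the reader trusts the size field and
// allocates a buffer of that size before the read fails.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readRecord is a hypothetical helper: it reads a 4-byte size header and
// then allocates a buffer of that size to hold the payload.
func readRecord(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	size := binary.LittleEndian.Uint32(hdr[:])

	// A corrupted header claiming ~1.6GB makes this allocation alone spike
	// memory, even though the subsequent read fails almost immediately.
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, fmt.Errorf("unable to read last record: %w", err)
	}
	return buf, nil
}

func main() {
	// Header claiming a ~1.6GB payload, followed by a single payload byte.
	// Running this transiently allocates ~1.6GB before the error is returned.
	data := make([]byte, 5)
	binary.LittleEndian.PutUint32(data[:4], 1_600_000_000)
	_, err := readRecord(bytes.NewReader(data))
	fmt.Println(err) // unable to read last record: unexpected EOF
}
```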

@Jerito-kun
Author

I think a new option would be best.

kozlovic added a commit that referenced this issue Jul 15, 2022
In case of memory corruption, it is possible that the record size
is way greater than it should be, which would cause the server to
create a buffer of the wrong size in the attempt to read the
record. This new option will limit how big the buffer needed
to read the record from disk can be.

Resolves #1255

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
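
The guard described in this commit could look roughly like the sketch below: fail the read up front when the declared record size exceeds a configured limit, instead of allocating a buffer of that size. Only the option name record_size_limit comes from the commit messages; the recordReader type, its sizeLimit field, and the header layout are hypothetical.

```go
// Sketch of a record-size guard along the lines of record_size_limit: the
// reader refuses to allocate when the declared size exceeds the limit.
// Names and layout are illustrative, not the server's actual API.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type recordReader struct {
	r         io.Reader
	sizeLimit uint32 // hypothetical: 0 means "no limit"
}

func (rr *recordReader) readRecord() ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(rr.r, hdr[:]); err != nil {
		return nil, err
	}
	size := binary.LittleEndian.Uint32(hdr[:])
	if rr.sizeLimit > 0 && size > rr.sizeLimit {
		// Refuse to allocate: a declared size this large is treated as corruption.
		return nil, fmt.Errorf("record size %v exceeds limit %v", size, rr.sizeLimit)
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(rr.r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	data := make([]byte, 5)
	binary.LittleEndian.PutUint32(data[:4], 1_600_000_000) // corrupted size header
	rr := &recordReader{r: bytes.NewReader(data), sizeLimit: 8 << 20}
	_, err := rr.readRecord()
	fmt.Println(err) // record size 1600000000 exceeds limit 8388608
}
```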
@kozlovic
Member

@Jerito-kun I added that in PR 1259 and 1260. This will be part of the next release.

kozlovic added a commit that referenced this issue Jul 29, 2022
Reverted addition of record_size_limit

But still address the memory usage caused by a corrupted data message
on recovery.

By using the expected record size from the index file when checking
that the last message matches the index information, we would find
out that the record size stored in the index does not match the
record size in the ".dat" file, and would not allocate the memory
to read the rest of the message.

The record_size_limit option that was added to solve that issue
would likely have caused a lot of problems if misused.

Resolves #1255

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
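
The approach described in this revert commit can be illustrated with the sketch below: when verifying the last message during recovery, compare the record size recorded in the index entry with the size read from the ".dat" file header, and fail on a mismatch before allocating the payload buffer. The indexEntry layout, the verifyLastMsg helper, and the 4-byte header format are assumptions for illustration and do not reflect the server's actual on-disk format.

```go
// Sketch of detecting a corrupted record size by cross-checking the index:
// the expected size comes from the .idx entry, so a bogus size in the .dat
// header is caught before any large buffer is allocated. The types and
// layout here are illustrative only.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// indexEntry is a hypothetical stand-in for a per-message index record.
type indexEntry struct {
	offset  int64  // offset of the message record in the .dat file
	msgSize uint32 // record size as stored in the .idx file
}

// verifyLastMsg checks the last message in the data file against its index
// entry and only allocates a buffer once the two sizes agree.
func verifyLastMsg(dat io.ReaderAt, idx indexEntry) ([]byte, error) {
	var hdr [4]byte
	if _, err := dat.ReadAt(hdr[:], idx.offset); err != nil {
		return nil, err
	}
	size := binary.LittleEndian.Uint32(hdr[:])
	if size != idx.msgSize {
		return nil, fmt.Errorf("index reports record size %v but data file reports %v",
			idx.msgSize, size)
	}
	buf := make([]byte, size)
	if _, err := dat.ReadAt(buf, idx.offset+4); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	// A .dat-like buffer: corrupted 4-byte size header followed by 3 payload bytes.
	data := make([]byte, 7)
	binary.LittleEndian.PutUint32(data[:4], 1_600_000_000)
	copy(data[4:], "abc")
	_, err := verifyLastMsg(bytes.NewReader(data), indexEntry{offset: 0, msgSize: 3})
	fmt.Println(err) // index reports record size 3 but data file reports 1600000000
}
```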