
High memory consumption while recovering damaged data file (in scenario in-file storage) #1255

Closed
Jerito-kun opened this issue Jun 23, 2022 · 7 comments · Fixed by #1259 or #1266

Comments

@Jerito-kun

When (for an unknown reason) one of the data files is damaged (for example, a bad EOF mark), the server makes a note in the log file:

{"level":"info","time":"2022-06-23T13:14:37+03:00","message":"Server is ready"}
{"level":"info","time":"2022-06-23T13:14:38+03:00","message":"STREAM: Recovering the state..."}
{"level":"error","time":"2022-06-23T13:14:43+03:00","message":"STREAM: Verification of last message for file "C:\\ProgramData\\Some_folder\\NSS\\data\\Some_file.Log\\msgs.1.dat" failed: unable to read last record: unexpected EOF"}
{"level":"error","time":"2022-06-23T13:14:43+03:00","message":"STREAM: Error with index file "C:\\ProgramData\\Some_folder\\NSS\\data\\Some_file.Log\\msgs.1.idx": Verification of last message for file "C:\\ProgramData\\Some_folder\\NSS\\data\\Some_file.Log\\msgs.1.dat" failed: unable to read last record: unexpected EOF. Truncating and recovering from data file"}

Memory usage exceeds 1.6 GB, instead of ~30 MB in normal operating mode.


The problem goes away after deleting the damaged file or restarting the server, but in that case the damaged file continues to exist, and other software that uses this file to operate cannot start and work.
This situation occurs on both Windows and Linux hosts, including software version 0.24.6.

@Jerito-kun
Author

Please let me know if I should provide any other information.

@kozlovic
Member

Do you have the store file that we could use to reproduce? It is possible that corruption causes the server to think that a payload, etc., is of a bigger size than it actually is, but this is just speculation at this point.

I am not clear on this:

The problem goes away after deleting the damaged file or restarting the server, but in that case the damaged file continues to exist, and other software that uses this file to operate cannot start and work.

If you delete the file, which then resolves the issue, how can the file "continue to exist"?

@Jerito-kun
Author

Do you have the store file that we could use to reproduce?

Yes, sure. Could you please tell me how I can send you these files?

If you delete the file, which then resolves the issue, how can the file "continue to exist"?

I mean the situation when someone kills the server during file repair and then starts it again; in that case the damaged file still exists but remains unrepaired.

@kozlovic
Member

Yes, sure. Could you please tell me, which way can I send you these files?

If you have the corruption, you could send the whole datastore directory to ivan@nats.io.

@kozlovic
Member

kozlovic commented Jul 6, 2022

@Jerito-kun As suspected, the high memory usage is due to a corrupted record that indicates that the message payload is ~1.6GB, which then causes a memory allocation of that size. Note that the buffer is then released, so the actual memory in use is not that much, and when garbage collection kicks in, the memory should be reclaimed (I have verified with -profile <port> that this memory was allocated but not currently in use).

Given that, I am not sure what the best course of action is here. I could make reading a record fail if it means that the server would have to create a buffer above a certain size, but which size? Hard-coded? A new option? Or leave it as-is, again knowing that the memory was allocated but not currently in use.
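
For illustration, here is a minimal Go sketch of the failure mode described above: a reader that trusts the size field in a record header and allocates a buffer of that size before reading the payload. The 4-byte little-endian size prefix, the readRecord helper, and the error wrapping are assumptions for this sketch, not the server's actual file-store code.

```go
// Minimal sketch (not nats-streaming-server code) of how a corrupted size
// header leads to a large allocation: the reader trusts the size field and
// allocates a buffer of that size before the read fails.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readRecord is a hypothetical helper: it reads a 4-byte size header and
// then allocates a buffer of that size to hold the payload.
func readRecord(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	size := binary.LittleEndian.Uint32(hdr[:])

	// A corrupted header claiming ~1.6GB makes this allocation alone spike
	// memory, even though the subsequent read fails almost immediately.
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, fmt.Errorf("unable to read last record: %w", err)
	}
	return buf, nil
}

func main() {
	// Header claiming a ~1.6GB payload, followed by a single payload byte.
	// Running this transiently allocates ~1.6GB before the error is returned.
	data := make([]byte, 5)
	binary.LittleEndian.PutUint32(data[:4], 1_600_000_000)
	_, err := readRecord(bytes.NewReader(data))
	fmt.Println(err) // unable to read last record: unexpected EOF
}
```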

@Jerito-kun
Author

I think a new option would be best.

kozlovic added a commit that referenced this issue Jul 15, 2022
In case of memory corruption, it is possible that the record size
is way greater than it should be, which would cause the server to
create a buffer of the wrong size in the attempt to read the
record. This new option will limit how big the buffer needed
to read the record from disk can be.

Resolves #1255

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
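
The guard described in this commit could look roughly like the sketch below: fail the read up front when the declared record size exceeds a configured limit, instead of allocating a buffer of that size. Only the option name record_size_limit comes from the commit messages; the recordReader type, its sizeLimit field, and the header layout are hypothetical.

```go
// Sketch of a record-size guard along the lines of record_size_limit: the
// reader refuses to allocate when the declared size exceeds the limit.
// Names and layout are illustrative, not the server's actual API.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type recordReader struct {
	r         io.Reader
	sizeLimit uint32 // hypothetical: 0 means "no limit"
}

func (rr *recordReader) readRecord() ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(rr.r, hdr[:]); err != nil {
		return nil, err
	}
	size := binary.LittleEndian.Uint32(hdr[:])
	if rr.sizeLimit > 0 && size > rr.sizeLimit {
		// Refuse to allocate: a declared size this large is treated as corruption.
		return nil, fmt.Errorf("record size %v exceeds limit %v", size, rr.sizeLimit)
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(rr.r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	data := make([]byte, 5)
	binary.LittleEndian.PutUint32(data[:4], 1_600_000_000) // corrupted size header
	rr := &recordReader{r: bytes.NewReader(data), sizeLimit: 8 << 20}
	_, err := rr.readRecord()
	fmt.Println(err) // record size 1600000000 exceeds limit 8388608
}
```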
@kozlovic
Member

@Jerito-kun I added that in PR 1259 and 1260. This will be part of the next release.

kozlovic added a commit that referenced this issue Jul 29, 2022
Reverted addition of record_size_limit

But still address the memory usage caused by a corrupted data message
on recovery.

By using the expected record size from the index file when checking
that the last message matches the index information, we would find
out that the record size stored in the index does not match the
record size in the ".dat" file, and would not allocate the memory
to read the rest of the message.

The record_size_limit option that was added to solve that issue
would likely have caused a lot of problems if misused.

Resolves #1255

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
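
The approach described in this revert commit can be illustrated with the sketch below: when verifying the last message during recovery, compare the record size recorded in the index entry with the size read from the ".dat" file header, and fail on a mismatch before allocating the payload buffer. The indexEntry layout, the verifyLastMsg helper, and the 4-byte header format are assumptions for illustration and do not reflect the server's actual on-disk format.

```go
// Sketch of detecting a corrupted record size by cross-checking the index:
// the expected size comes from the .idx entry, so a bogus size in the .dat
// header is caught before any large buffer is allocated. The types and
// layout here are illustrative only.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// indexEntry is a hypothetical stand-in for a per-message index record.
type indexEntry struct {
	offset  int64  // offset of the message record in the .dat file
	msgSize uint32 // record size as stored in the .idx file
}

// verifyLastMsg checks the last message in the data file against its index
// entry and only allocates a buffer once the two sizes agree.
func verifyLastMsg(dat io.ReaderAt, idx indexEntry) ([]byte, error) {
	var hdr [4]byte
	if _, err := dat.ReadAt(hdr[:], idx.offset); err != nil {
		return nil, err
	}
	size := binary.LittleEndian.Uint32(hdr[:])
	if size != idx.msgSize {
		return nil, fmt.Errorf("index reports record size %v but data file reports %v",
			idx.msgSize, size)
	}
	buf := make([]byte, size)
	if _, err := dat.ReadAt(buf, idx.offset+4); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	// A .dat-like buffer: corrupted 4-byte size header followed by 3 payload bytes.
	data := make([]byte, 7)
	binary.LittleEndian.PutUint32(data[:4], 1_600_000_000)
	copy(data[4:], "abc")
	_, err := verifyLastMsg(bytes.NewReader(data), indexEntry{offset: 0, msgSize: 3})
	fmt.Println(err) // index reports record size 3 but data file reports 1600000000
}
```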