bug/limit with big queue #551
Replies: 10 comments 6 replies
-
|
Approximately how large is the queue? |
Beta Was this translation helpful? Give feedback.
-
|
A bit more digging on this... The issue might actually go a bit deeper than thought. In some cases nzbget did start and *arrs added some new stuff, then it crashs (some time later windows crashed occasionally, but at that point I didn't connect these two issues). Now, funny or not, I'm at a point where windows crashs reproducable with nzbget causing it. The minidump shows something like hits: `DRIVER_OVERRAN_STACK_BUFFER (f7) Debugging Details:KEY_VALUES_STRING: 1 BUGCHECK_CODE: f7 BUGCHECK_P1: ffff9c2abe8c2038 BUGCHECK_P2: 98cb21269ae1 BUGCHECK_P3: ffff6734ded9651e BUGCHECK_P4: 0 FILE_IN_CAB: 050225-18468-01.dmp TAG_NOT_DEFINED_202b: *** Unknown TAG in analysis list 202b DUMP_FILE_ATTRIBUTES: 0x21808 FAULTING_THREAD: ffffd081748cf080 SECURITY_COOKIE: Expected 000098cb21269ae1 found ffff9c2abe8c2038 BLACKBOXBSD: 1 (!blackboxbsd) BLACKBOXNTFS: 1 (!blackboxntfs) BLACKBOXPNP: 1 (!blackboxpnp) BLACKBOXWINLOGON: 1 CUSTOMER_CRASH_COUNT: 1 PROCESS_NAME: nzbget.exe STACK_TEXT: SYMBOL_NAME: nt!_report_gsfailure+25 MODULE_NAME: nt IMAGE_NAME: ntkrnlmp.exe IMAGE_VERSION: 10.0.26100.3912 STACK_COMMAND: .process /r /p 0xffffd08fa4bd5080; .thread 0xffffd081748cf080 ; kb BUCKET_ID_FUNC_OFFSET: 25 FAILURE_BUCKET_ID: 0xF7_MISSING_GSFRAME_nt!_report_gsfailure OSPLATFORM_TYPE: x64 OSNAME: Windows 10 FAILURE_ID_HASH: {82d2c1b5-b0cb-60a5-9a5d-78c8c4284f84} Followup: MachineOwner So, here are some thoughts. Please note, my coding-days are far behind, but I'm trying to get back at it. One thing that pops up in my thoughts is the size of that queue directory, especially the amount of state-files. It has grown to 2.671.827 files (no joke). That number alone already breaks (signed) int32. So if there is a counter or something in nzbget... It also puts the windows limit of files per directory (which is defined by an unsigned int32 it seems) within reach. While totally not the usual queue size, I'd say it's somewhat easy to reach if you let some *arrs go crazy and tend to use the queue as bookmark for nzbs. Just noting, that there is a technical limitation here and things can go even wilder than just having nzbget crash as it seems. Besides, the current queue concept shows some flaws when it grows. Such an amount of files, flat-saved within a single directory can become problematic to handle. It takes a while to open that directory alone, even from a fast ssd. Fun stuff to witness. It also wastes a lot of space in such cases, and nzbget takes ages to start up if it thinks it has to rebuild some files. A few ideas of different ideas do pop up, but changing how the queue is saved will of course break backward-compatibility, and my skills aren't there yet anyway. |
Beta Was this translation helpful? Give feedback.
-
|
Thanks for the investigation
This is exactly what came to mind. The problem lies in the line: DiskState.cpp:473:DiskState::LoadQueue
I completely agree. Addressing this efficiently requires a significant architectural change. I suppose, switching to a database like SQLite could solve the performance issues with large queues and huge amount of files, but it would indeed involve a lot of code refactorings. |
Beta Was this translation helpful? Give feedback.
-
|
Yes, agreed, sqlite would probably be the most elegant and future-proof solution. Putting these files into sub-dirs, just stupidly using their first digit (or a few) as directory-name, could be a doable quick and dirty fix that effectively eases the situation significantly. On the weekend I managed to get the dev-env running and nzbget compiling (thanks to the nice documentation here). I made some first attempts. 466 and 473 had my attention for a moment, but things didn't change and the frustration was high. "size()" depending on architecture was "THE" hint I needed, since I first tried to use uint32, but compiled for 64bit. Now the outcome makes sense. Thanks man! :) With uint64 I'm out of crashland, but It's not enough to make things work again. So I found a few more pieces that needed to be touched, LoadProgress and some loops. Now I'm halfway there I guess, it starts now, loads the queue, keeps running, but there is still something going wrong with attaching all the infos to a job. ...more hunting to do. |
Beta Was this translation helpful? Give feedback.
-
|
I could use a little idea/hint, where to look. With all my experimental upgrades of "a few" variables to 64bit i got the daemon running and loading/saving the queue just fine, running for hours.... BUT! as soon as I open the webui it crashes within a minute, bevore that, the WebUI works fine. I still don't have figured out/setup a usable way to actually debug whats going on there and what stuff gets called. Do you have an idea where I should look? |
Beta Was this translation helpful? Give feedback.
-
|
Since the webui causes the crash, I’d focus on XmlRpc.cpp It handles communication with the webui. |
Beta Was this translation helpful? Give feedback.
-
|
I'm making progress, but don't know if this is a direction that would be considered committable. After a lot of experimenting, back and forth, and a few moments of reconsiderations, I tend to move as much int to int64 as possible. There are those places where it's necessary to handle a big queue and lots of files. This did lead to NextParamAsInt, which I needed to output int64, so of course everything using this routine needed to move to int64 too, as I don't want to overcomplicate that routine or put in variants. This ended up being a "fun" rabbit-hole. In the end I don't think it hurts to upgrade int to int64, memory consumption didn't change in a noticable way. Having more headroom might also prevent from overflows here and there, we are just a malformed nzb-file or someone with a huge queue ;) away from that, how knows what else. Also, using just one type of int makes code easier to service in the future, which is one reason why I did throw away the idea of using uint32 here and there for the time being. Moving from int32 to int64 is the easiest and safest way I think, even if some stuff will never really need an int64 (as some of it never needed an int32 either). I also wanted to get rid of atoi/atoll and added a little replacement-routine using boost/spirt/qi, which noticeably speeds things up and should be more safe. This is something I just started testing. |
Beta Was this translation helpful? Give feedback.
-
This is exactly what I expected for. That’s why I don’t want to hurry with it. Your investigation is helpful, but we need to carefully weigh the advantages against the potential complications, particularly given our need to support a wide range of platforms and architectures. Changes like this can have unexpected consequences across different systems, so we need to be thorough.
I’ll keep an eye on your fork and address the issue as soon as possible, considering your changes. |
Beta Was this translation helpful? Give feedback.
-
|
Just a side-note, I did replace almost all atoi/atoll-calls with StrToNum, which works flawlessly so far. There are a few left that need more touching, and .value_or(0) could probably replaced with something smarter here and there, but one major gain is, the app doesn't crash windows anymore, when I get it to crash. ;) My current build runs smoothly, with heavy testing. But I'd like to take back int64 on a few places. For that, could we modify NextParamAsInt into a template-based function spitting out variants of ints, like StrToNUm? Maybe call it NextParamAsNum and keep the old one until the transition is complete. |
Beta Was this translation helpful? Give feedback.
-
Of course, it’s technically possible. You could modify the method’s declaration to directly accept an int64_t* or create a function overload: bool XmlCommand::NextParamAsInt(int64_t* value);And you could use |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Looks like I've hit a wall here. NZBget crashs after loading a heavy queue (btw, those tiny files can slow down the starting process when the state file is being rebuild).
This is what windbg puts out:
(4b3c.422c): Security check failure or stack buffer overrun - code c0000409 (!!! second chance !!!)
Subcode: 0x5 FAST_FAIL_INVALID_ARG
nzbget!iconv_canonicalize+0xfd31d2:
00007ff6`6fe5f65c cd29 int 29h
The log doesnt put out anything interesting, a dmp file hasnt been crated, even though it was enabled and i'm using the debug build.
Beta Was this translation helpful? Give feedback.
All reactions