Net use malloc #267
this all needs to be squashed down into one or two commits, in order to make the changes reversible if necessary...
it's working for me so far - really impressive! i did encounter one freeze: after spastically creating and deleting many operators, i made a new LOGIC somewhere in the middle and it froze when i used the INS function on it. no METROs or peripherals running. haven't been able to reproduce it though... maybe i'll try hooking up the serial net management stuff and use that to stress-test...
OK, it's as I suspected - 99% working. Is it quick/easy for you to chuck your jtag-ice on the thing? Do I understand right that a JTAG debugger would catch a segfault, get a stacktrace & allow me to inspect the stack after a fatal crash? That's a great point about serial for stress testing! Hadn't thought of it. The serial protocol needs extending to include arbitrary op insertion/deletion. Does slime still work on your emacs? Maybe I can even try to repeat the stress test on beekeep, using a named pipe to spoof a tty. If that catches it, I can root out the problem with gdb and then we're home & dry... As a parallel attack on this intermittent bug, I think some unit tests for the actual mempool implementation might show up an error that could be causing the undesired behaviour.
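(For illustration, a host-side unit test for the mempool could look roughly like the sketch below, compiled and linked against the pool code. The names op_pool_init / malloc_bigop / free_bigop, the pool size of 16 and the NULL-on-exhaustion behaviour are all assumptions about the interface, not the actual BEES symbols.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define BIG_OP_POOL_SIZE 16      /* assumed pool size */

/* assumed pool interface, to be linked against the real implementation */
void  op_pool_init(void);
void* malloc_bigop(void);        /* NULL when the pool is exhausted */
void  free_bigop(void* p);

int main(void) {
  void* chunks[BIG_OP_POOL_SIZE];
  op_pool_init();

  /* the pool should hand out exactly BIG_OP_POOL_SIZE distinct chunks... */
  for (int i = 0; i < BIG_OP_POOL_SIZE; i++) {
    chunks[i] = malloc_bigop();
    assert(chunks[i] != NULL);
    for (int j = 0; j < i; j++) {
      assert(chunks[i] != chunks[j]);
    }
  }
  /* ...and then report exhaustion */
  assert(malloc_bigop() == NULL);

  /* freeing one chunk should make exactly one more allocation possible */
  free_bigop(chunks[3]);
  assert(malloc_bigop() != NULL);
  assert(malloc_bigop() == NULL);

  printf("op pool tests passed\n");
  return 0;
}
```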
i'll dig out the jtag thing and fire it up today. IIRC it's of somewhat limited use for post-mortem crash analysis - the stack is likely to be corrupted - but of course it depends on the cause. ah... hm, i have to go back and find your instructions for setting up slime again (new system). it might even be easier for me to build out a C interface (which would also be useful as an extension for other scripting languages that i know better than lisp). and i should note that i literally only had this freeze once, on a debug build at that, and running on the kooky old prototype. i don't think i was instantiating any especially large operators. in other words, it could have had nothing to do with anything.
Realised I forgot to free the op memory in the case where an op gets alloc-ed and then we bail here: The serial stress test did uncover another tangible bug! In theory, we should be able to allocate 16 bigOps out of the opPool before failing over to malloc: Once again, great suggestion with the serial stress test - many years' experience of creative debugging solutions for embedded devices, I guess! And I'm pretty sure I have seen a similar kind of freeze in normal use, but only once or twice. Now that there's a known, repeatable bug, this will hopefully lead to the root of the problem... And yea - building a C interface is going to be very straightforward. Started a new issue for the aleph->OSC bridge over on the main repo with my notes on how to do that.
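(To make that concrete: a minimal sketch of pool-first allocation with malloc failover and a matching bail-out path. malloc_bigop, free_bigop and BIGOP_SIZE are placeholder names, not the actual BEES symbols; the point is just that the bail-out has to return the memory to whichever allocator it came from.)

```c
#include <stdlib.h>

#define BIGOP_SIZE 256                 /* illustrative chunk size */

extern void* malloc_bigop(void);       /* pool allocator; NULL when exhausted */
extern void  free_bigop(void* p);

/* try the static pool first, then fail over to the heap */
void* bigop_alloc(int* from_heap) {
  void* p = malloc_bigop();
  *from_heap = (p == NULL);
  return (p != NULL) ? p : malloc(BIGOP_SIZE);
}

/* bail-out path when op init fails: release through the same allocator */
void bigop_bail(void* p, int from_heap) {
  if (p == NULL) { return; }
  if (from_heap) { free(p); } else { free_bigop(p); }
}
```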
ok, so it's a bit like peeling back the layers of an onion here - this one was pretty boneheaded (did the cat jump on my keyboard!?): Dunno why the malloc failover doesn't work, but it blows up reliably - the first malloc-ed screen instance doesn't display correctly in the oplist... so with those two changes, we're at a point where the stress tests run 3 or 4 times before a crash:
Pretty sure this is either an existing bug with one of the grid-centric ops, or - the alternative theory - it could be due to the event queue not being flushed & a 'stale' event trying to access the op memory. Really, I think flushing the event queue after deleting an op would be best practice. Will try to figure out how to do that... Anyway, whenever the crash occurs, it's just after that warning about requesting focus, then op creation failed, then setting grid focus 0...
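(A rough illustration of what flushing stale events could look like, assuming a simple ring-buffer queue where each event carries a pointer to its target op. event_t, evQueue and ev_flush_op are made-up names for the sketch, not the real BEES event system.)

```c
#include <stddef.h>

#define EV_QUEUE_LEN 32

typedef struct {
  void* target;   /* op the event is addressed to */
  int   type;
  int   data;
} event_t;

static event_t evQueue[EV_QUEUE_LEN];
static int evHead  = 0;   /* index of the next event to be processed */
static int evCount = 0;   /* number of pending events */

/* neutralize any pending event aimed at the op being deleted,
   so the dispatcher can never touch freed op memory */
void ev_flush_op(void* op) {
  for (int i = 0; i < evCount; i++) {
    event_t* e = &evQueue[(evHead + i) % EV_QUEUE_LEN];
    if (e->target == op) {
      e->target = NULL;   /* dispatcher skips NULL-target events */
    }
  }
}
```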
hm! ok, that's not what i saw; i don't have grid hardware, and wasn't creating grid ops. but i absolutely believe you that there could easily be a focus-handling problem with one or more of the existing grid ops. the only one that i actually authored was the "raw" one - this is not to cast blame, rather to acknowledge my ignorance. also must admit that i have in no way totally grokked the new pool-allocation system, and can't extract much meaning from those diffs... bigOpPool head initially points at the top of the pool, and counts down as bigOps are allocated? i really wish i had a grid; i would of course hook up the jtag and break on the failure in... i'm assuming you have a grid device attached though. if not, the debug message makes perfect sense (a grid op was created and had focus set, but the monome driver hadn't detected a device... the result should just be that no-op dummy handlers are set for monome grid events... that shouldn't in itself cause a crash or anything, but maybe something has broken.)
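(To illustrate the dummy-handler fallback described above; the handler type and names here are assumptions for the sketch, not the actual monome driver API.)

```c
typedef void (*grid_handler_t)(int x, int y, int z);

/* harmless stub used when no grid device has been detected */
static void grid_handler_dummy(int x, int y, int z) {
  (void)x; (void)y; (void)z;
}

static grid_handler_t grid_key_handler = grid_handler_dummy;

/* focus requests fall back to the stub rather than leaving a dangling pointer */
void grid_set_focus(grid_handler_t h) {
  grid_key_handler = (h != NULL) ? h : grid_handler_dummy;
}

/* called by the driver when a key event arrives */
void grid_dispatch_key(int x, int y, int z) {
  grid_key_handler(x, y, z);
}
```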
Umm yea, maybe the following explanation should also go in a code comment! The op pool is simply a linked list of equal-size memory chunks. In lisp terminology, the 'car' (head) of each 'cons cell' (linked list element) is a pointer to a memory chunk within bigOpData (or smallOpData). The 'cdr' (tail) of each 'cons cell' is a pointer to the next 'cons cell'. The tail of the last 'cons cell' in the linked list points to NULL (i.e. opPool exhausted). Each member of the statically allocated 'cons cell' array shares an index with an array member of the dynamically allocated bigOpData. When memory is requested from the pool, the op_pool code pops the first 'cons cell' off the linked list and hands the chunk of memory over to the operator. When an operator is deleted & frees a region, the op_pool code examines the index of the region (relative to bigOpData or smallOpData), finds the corresponding cons cell, and pushes that memory region back onto the top of the linked list. The linked list structure will initially hand over memory regions to ops in 'array order'; however, once opFree has been called at least once, the most recently freed region will be the next to be allocated.
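(Since that explanation is basically a spec, here is a self-contained sketch of the same structure in C. The sizes and the names bigOpData, bigOpCells, op_pool_init, malloc_bigop and free_bigop are illustrative assumptions rather than the exact BEES symbols, and bigOpData is declared statically here for simplicity even though the note above says it is dynamically allocated.)

```c
#include <stddef.h>
#include <stdint.h>

#define BIGOP_SIZE  256
#define NUM_BIGOPS  16

/* the actual memory chunks handed to operators */
static uint8_t bigOpData[NUM_BIGOPS][BIGOP_SIZE];

/* the 'cons cells': car = pointer to a chunk, cdr = next cell */
typedef struct cell {
  void*        car;
  struct cell* cdr;
} cell_t;

static cell_t  bigOpCells[NUM_BIGOPS];
static cell_t* bigOpHead;   /* head of the free list; NULL == pool exhausted */

void op_pool_init(void) {
  for (int i = 0; i < NUM_BIGOPS; i++) {
    bigOpCells[i].car = bigOpData[i];
    bigOpCells[i].cdr = (i + 1 < NUM_BIGOPS) ? &bigOpCells[i + 1] : NULL;
  }
  bigOpHead = &bigOpCells[0];
}

void* malloc_bigop(void) {
  if (bigOpHead == NULL) { return NULL; }  /* exhausted; caller may fail over to malloc */
  cell_t* c = bigOpHead;
  bigOpHead = c->cdr;                      /* pop the first cell off the list */
  return c->car;
}

void free_bigop(void* region) {
  /* recover the cell index from the region's offset within bigOpData */
  ptrdiff_t idx = ((uint8_t*)region - &bigOpData[0][0]) / BIGOP_SIZE;
  cell_t* c = &bigOpCells[idx];
  c->cdr = bigOpHead;                      /* push it back onto the top of the list */
  bigOpHead = c;
}
```

The LIFO behaviour falls straight out of pushing freed cells onto the head: after the first free, the most recently freed chunk is the next one handed out, exactly as described above.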
Nope, my grid is attached to a USB hub attached to this monstrosity: and yea I wasn't surprised at the focus set debug messages, just the subsequent crash! Well tonight I'll try to run the stress test several times, saving the output each time. Pertinent questions after sleep:
Just to be clear, these three changes fix three distinct bugs revealed by the stress test (hopefully unrelated to the midiCC/cascades crash): 9987c19, 3acc1b9, c9f80ec. I'm guessing:
could conceivably have been triggering one or more of those three (now-fixed) bugs...
Added bounds checking to the serial op deletion/insertion commands (duh - see the sketch after this message), and 90% of the crazy behaviour went away (though just this minute I saw a crash after 10 minutes or so of constant stress testing). See here for the hardest stress test: In summary, the 'uzi' stress test:
I ran that on a loop for around 5-10 minutes before it died like this:
Gonna try again right now! (btw op 6 is opMidiNote)
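(As promised above, roughly the kind of bounds check meant for the serial delete command. net_op_count and net_remove_op are placeholder names standing in for whatever the network code actually exposes.)

```c
#include <stdint.h>

extern int  net_op_count(void);      /* number of ops currently in the network */
extern void net_remove_op(int idx);  /* delete the op at index idx */

/* reject out-of-range indices from the host instead of indexing past the op list */
int serial_delete_op(uint16_t idx) {
  if ((int)idx >= net_op_count()) {
    return 0;   /* ignored */
  }
  net_remove_op((int)idx);
  return 1;
}
```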
EDIT:
Ok, ran for 20 minutes that time before crapping out on op class 6 again. What are the chances!? Anyway, I'll try another stress test concentrating on opMidiNote tomorrow... and another long test ignoring opMidiNote.
Noting down some more results using the 'uzi' stress test (frantic op creation/deletion, frantic input pinging):
ok after a pretty epic soak in the tub, IT'S STILL RUNNING!!!!! It's now 21:50 - I will vouch for that test! wooooohoooo! time to finish ALL the beer & maybe find what's up with the midi/HID ops!
OK, after (hopefully) fixing this bug in the midi & hid ops I'm re-running the uzi stress test, with all the ops enabled! Start time 0:52 - hang on, is it nearly 1am!? EDIT: 1:38, time to put my laptop to bed for the night... I'm calling this working! If there are no objections to merging this, I will rebase & clean up the log a bit...
Just rebased these changes neatly on top of monome/dev. Another round of uzi stress testing to double-check nothing went haywire in the rebase, then I'll pull-request from the other branch...
New features for BEES:
Pretty sure all present & correct here. Obv it's a big change and demands quite a bit of testing. Guess I should squash this down into a single commit really so we can cleanly revert the change in case it breaks stuff.