Net use malloc #267

Closed
wants to merge 44 commits

Conversation

@ghost commented Oct 28, 2016

New features for BEES:

  • new memory management (can't remember now if it uses two memory pools for big/small ops, or malloc/free & fragmentation be damned)
  • arbitrary op deletion/insertion

Pretty sure it's all present & correct here. Obviously it's a big change and demands quite a bit of testing. I guess I should really squash this down into a single commit so we can cleanly revert the change in case it breaks stuff.

@ghost (Author) commented Oct 31, 2016

This all needs to be squashed down into one or two commits, to make the changes easy to revert if necessary...

@catfact (Collaborator) commented Oct 31, 2016

it's working for me so far - really impressive!

i did encounter one freeze. after spastically creating and deleting many operators, i made a new LOGIC somewhere in the middle and it froze when using the INS function on it. no METROs or peripherals running. haven't been able to reproduce it though...

maybe i'll try hooking up the serial net management stuff and use that to stress-test...

@ghost (Author) commented Oct 31, 2016

OK, it's as I suspected - 99% working. Is it quick/easy for you to chuck your jtag-ice on the thing? Do I understand right that a JTAG debugger would catch a segfault, get a stack trace & let us inspect the stack after a fatal crash?

That is a great point about serial for a stress test! Hadn't thought of it. The serial protocol needs extending to include the arbitrary op insertion/deletion. Does slime still work on your emacs? Maybe I can even try to repeat the stress test on beekeep, using a named pipe to spoof a tty. If that catches it, we can root out the problem with gdb and then we're home & dry...

As a parallel line of attack on this intermittent bug, I think some unit tests for the actual mempool implementation might expose an error that could be causing the undesired behaviour.

@catfact (Collaborator) commented Nov 2, 2016

i'll dig out the jtag thing and fire it up today. IIRC, it's of somewhat limited use for post-mortem crash analysis - the stack is likely to be corrupted - but of course it depends on the cause.

ah... hm, i have to go back and find your instructions for setting up slime again (new system). it might even be easier for me to build out a C interface (which would also be useful as an extension for other scripting languages that i know better than lisp).

and i should note that i literally only had this freeze once, on a debug build at that, and running on the kooky old prototype. i don't think i was instantiating any especially large operators. in other words, it could have had nothing to do with anything.

@ghost (Author) commented Nov 2, 2016

Realised I forgot to free the op memory in the case where an op gets alloc-ed and we then bail here:
https://github.com/rick-monster/aleph/blob/net-use-malloc/apps/bees/src/net.c#L522-L530
I've got those changes on my local machine, only just started testing them before bed so they're not pushed yet...
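
For reference, the shape of that fix is roughly this (a minimal sketch; op_pool_alloc / op_pool_free / op_deinit and the input limit are made-up stand-ins, not the actual net.c / op_pool API):

```c
#include <stddef.h>
#include <stdlib.h>

/* hypothetical stand-ins for the real op / op_pool API */
typedef struct op { int numInputs; } op_t;

static op_t *op_pool_alloc(size_t size) { return malloc(size); } /* stand-in */
static void  op_pool_free(op_t *o)      { free(o); }             /* stand-in */
static void  op_deinit(op_t *o)         { (void)o; }             /* stand-in */

#define NET_INS_MAX 256  /* illustrative limit, not the real constant */

/* returns 0 on success, -1 on bail */
int net_add_op_sketch(int netNumIns, size_t opSize, int opNumIns) {
  op_t *op = op_pool_alloc(opSize);
  if (op == NULL) { return -1; }

  /* ... op-class init happens here ... */

  if (netNumIns + opNumIns > NET_INS_MAX) {
    /* bail: "op creation failed; too many inputs in network."
       the bug was returning here WITHOUT the two cleanup calls below,
       leaking the op's memory chunk */
    op_deinit(op);
    op_pool_free(op);
    return -1;
  }

  /* ... wire the new op's inputs & outputs into the net ... */
  return 0;
}
```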

The serial stress test did uncover another tangible bug! In theory, it should be possible to allocate 16 bigOps out of the opPool before failing over to malloc:
https://github.com/rick-monster/aleph/blob/net-use-malloc/apps/bees/src/net.c#L501-L503
However the first 15 work great, whereas the very last one looks blank on the op screen. Something's wrong there, and it won't need a JTAG to figure out...

Once again, great suggestion with the serial stress test - many years' experience of creative debugging solutions for embedded devices, I guess!

And I'm pretty sure I have seen a similar kind of freeze in normal use, but only once or twice. Now that there's a known, repeatable bug, this will hopefully lead to the root of the problem...

And yeah - building a C interface is going to be very straightforward. I started a new issue for an aleph->OSC bridge over on the main repo with my notes on how to do that.

@ghost (Author) commented Nov 3, 2016

OK, so it's a bit like peeling back the layers of an onion here - this one was pretty boneheaded (did the cat jump on my keyboard!?):
3acc1b9
this is just a pragmatic starting point:
9987c19

Dunno why the malloc failover doesn't work, but it blows up reliably - the first malloc-ed screen instance doesn't display correctly in the oplist...
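
For anyone reading along, the failover strategy in question is basically the following shape (a rough, self-contained sketch; the names loosely follow the ones mentioned in this thread, but big_op_alloc / big_op_free, the sizes and the index-stack pool here are invented stand-ins, not the real op_pool code). The fiddly part is that the free path has to know whether a given chunk came from the pool or from malloc:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BIG_OP_CHUNKS     16
#define BIG_OP_CHUNK_SIZE 128

/* stand-in pool: fixed chunk array plus a stack of free chunk indices */
static uint8_t bigOpData[BIG_OP_CHUNKS][BIG_OP_CHUNK_SIZE];
static int16_t freeIdx[BIG_OP_CHUNKS];
static int16_t numFree = 0;

void big_op_pool_init(void) {
  for (int16_t i = 0; i < BIG_OP_CHUNKS; i++) { freeIdx[i] = i; }
  numFree = BIG_OP_CHUNKS;
}

static void *pool_alloc(void) {
  if (numFree == 0) { return NULL; }           /* pool exhausted */
  return bigOpData[freeIdx[--numFree]];
}

static int pool_owns(void *p) {
  uint8_t *q = (uint8_t *)p;
  return (q >= &bigOpData[0][0]) && (q < &bigOpData[0][0] + sizeof(bigOpData));
}

static void pool_free(void *p) {
  freeIdx[numFree++] = (int16_t)(((uint8_t *)p - &bigOpData[0][0]) / BIG_OP_CHUNK_SIZE);
}

/* failover: prefer the pool, fall back to malloc when it's empty (or the op is too big) */
void *big_op_alloc(size_t size) {
  void *p = (size <= BIG_OP_CHUNK_SIZE) ? pool_alloc() : NULL;
  if (p == NULL) { p = malloc(size); }         /* the failover path */
  if (p != NULL) { memset(p, 0, size); }       /* assumption: ops want zeroed memory */
  return p;
}

void big_op_free(void *p) {
  if (pool_owns(p)) { pool_free(p); }          /* chunk belongs to the pool */
  else              { free(p); }               /* chunk came from malloc */
}
```

If the malloc path skips some initialisation or bookkeeping that the pool path gets for free, that would be one way to end up with a first malloc-ed instance that doesn't display right - but that's speculation.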

So with those two changes, we're at a point where stress tests run 3 or 4 times before a crash:
https://github.com/rick-monster/aleph/blob/net-use-malloc/utils/serial-com-proto/test-harness.lisp#L377-L389
There's always some debug output like this before the aleph stops responding to serial commands and crashes hard:

Warning non-snapping chunk pointer (idx = 4294868984) passed to freeSmallOp
 chunks per message: 1 , op class: 22 , size: 96 ; allocating... 
 op creation failed; too many inputs in network.
 chunks per message: 1 , op class: 8 , size: 104 ; allocating... 
 chunks per message: 1 , op class: 36 , size: 96 ; allocating... 
 setting monome grid focus, op pointer: 0xD00412C4 , value: 1
 warning! requested focus, but no handler was set.  bad device type maybe?
 op creation failed; too many inputs in network.
 setting monome grid focus, op pointer: 0xD00412C4 , value: 0
 chunks per message: 1 , op class: 31 , size: 84 ; allocating... 

Pretty sure this is either an existing bug with one of the grid-centric ops, or - the alternative theory - the event queue not being flushed, so a 'stale' event tries to access the op's memory after deletion. Really, I think flushing the event queue after deleting an op would be best practice. Will try to figure out how to do that...

Anyway, whenever the crash occurs it's just after that warning about requesting focus, then 'op creation failed', then setting grid focus to 0...
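
Sketching what I mean by 'flushing the event queue' - assuming a plain ring buffer and assuming each event carries the index of the op it targets, which may not be how the bees event system actually stores things (event_queue_flush_op and the fields here are hypothetical):

```c
#include <stdint.h>

#define EVENT_QUEUE_SIZE 64

typedef struct {
  int16_t opIdx;   /* assumption: events carry an index of the op they target */
  int16_t type;
  int32_t data;
} event_t;

static event_t  evQueue[EVENT_QUEUE_SIZE];
static uint16_t evRead  = 0;   /* index of the next event to consume */
static uint16_t evCount = 0;   /* number of pending events */

/* drop any pending events that still reference a just-deleted op,
   so nothing dereferences its (now recycled) memory chunk */
void event_queue_flush_op(int16_t deletedOpIdx) {
  uint16_t kept = 0;
  for (uint16_t i = 0; i < evCount; i++) {
    event_t e = evQueue[(evRead + i) % EVENT_QUEUE_SIZE];
    if (e.opIdx != deletedOpIdx) {
      evQueue[(evRead + kept) % EVENT_QUEUE_SIZE] = e;  /* compact in place */
      kept++;
    }
  }
  evCount = kept;
}
```

The in-place compaction keeps the relative order of the surviving events.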

@catfact (Collaborator) commented Nov 3, 2016

hm! ok, that's not what i saw; i don't have grid hardware, and wasn't creating grid ops.

but i absolutely believe that there could easily be a focus-handling problem with one or more of the existing grid ops. the only one i actually authored was the "raw" one - this is not to cast blame, rather to acknowledge my ignorance.

also must admit that i have in no way totally grokked the new pool-allocation system, and can't extract much meaning from those diffs... bigOpPool head initially points at the top of the pool, and counts down as bigOps are allocated?

i really wish i had a grid; i would of course hook up the jtag and break on the failure in net_monome_set_focus...

i'm assuming you have a grid device attached though. if not, the debug message makes perfect sense (a grid op was created and had focus set, but the monome driver hadn't detected a device... the result should just be that no-op dummy handlers are set for monome grid events... shouldn't in itself cause a crash or anything, but maybe something has broken.)
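
roughly the pattern i mean, with invented names (just a sketch, not the actual monome driver code):

```c
#include <stdint.h>

typedef void (*grid_handler_t)(uint8_t x, uint8_t y, uint8_t z);

/* no-op dummy installed when no grid hardware has been detected */
static void grid_handler_dummy(uint8_t x, uint8_t y, uint8_t z) {
  (void)x; (void)y; (void)z;
}

static grid_handler_t gridHandler   = grid_handler_dummy;
static uint8_t        gridConnected = 0;

/* a grid op requests focus; without a device we just keep the dummy in place */
void net_monome_set_focus_sketch(grid_handler_t opHandler, uint8_t focus) {
  if (focus && gridConnected) {
    gridHandler = opHandler;
  } else {
    gridHandler = grid_handler_dummy;  /* should be harmless on its own */
  }
}

/* called from the monome event path; always safe to call */
void grid_event(uint8_t x, uint8_t y, uint8_t z) {
  gridHandler(x, y, z);
}
```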

@ghost (Author) commented Nov 3, 2016

also must admit that i have in no way totally grokked the new pool-allocation system

Umm, yeah - maybe the following explanation should also go in a code comment!

The op pool is simply a linked list of equal-size memory chunks. In Lisp terminology, the 'car' (head) of each 'cons cell' (linked-list element) is a pointer to a memory chunk within bigOpData (or smallOpData). The 'cdr' (tail) of each 'cons cell' is a pointer to the next 'cons cell'. The tail of the last 'cons cell' in the linked list points to NULL (i.e. opPool exhausted). Each member of the statically allocated 'cons cell' array shares an index with the corresponding member of the dynamically allocated bigOpData.

When memory is requested from the pool, the op_pool code pops the first 'cons cell' off the linked list and hands its chunk of memory over to the operator. When an operator is deleted & frees a region, the op_pool code examines the index of the region (relative to bigOpData or smallOpData), finds the corresponding cons cell, and pushes that memory region back onto the top of the linked list.

The linked-list structure will initially hand over memory regions to ops in 'array order'. However, once opFree has been called at least once, the most recently freed region will be the next one to be allocated.
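
In miniature, the scheme looks something like this (a self-contained sketch with simplified names and sizes, loosely following the names above - not a copy of the real op_pool code):

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_CHUNKS 16
#define CHUNK_SIZE 128

/* the backing store: one equal-size memory chunk per op slot */
static uint8_t bigOpData[NUM_CHUNKS][CHUNK_SIZE];

/* the 'cons cell' array: car = chunk pointer, cdr = next free cell.
   cell i always refers to chunk i, so a freed chunk maps back to its cell by index. */
typedef struct cons {
  void        *car;   /* memory chunk within bigOpData */
  struct cons *cdr;   /* next free cell, or NULL when the pool is exhausted */
} cons_t;

static cons_t  bigOpPool[NUM_CHUNKS];
static cons_t *bigOpHead = NULL;   /* first free cell */

void bigOpPoolInit(void) {
  for (int i = 0; i < NUM_CHUNKS; i++) {
    bigOpPool[i].car = bigOpData[i];
    bigOpPool[i].cdr = (i + 1 < NUM_CHUNKS) ? &bigOpPool[i + 1] : NULL;
  }
  bigOpHead = &bigOpPool[0];
}

/* pop the first cons cell off the list and hand its chunk to the operator */
void *allocBigOp(void) {
  if (bigOpHead == NULL) { return NULL; }   /* opPool exhausted: caller fails over */
  cons_t *cell = bigOpHead;
  bigOpHead = cell->cdr;
  return cell->car;
}

/* map the freed region back to its cons cell by index and push it on top,
   so the most recently freed region is the next one to be allocated */
void freeBigOp(void *chunk) {
  ptrdiff_t off = (uint8_t *)chunk - &bigOpData[0][0];
  if (off < 0 || off >= (ptrdiff_t)sizeof(bigOpData) || (off % CHUNK_SIZE) != 0) {
    return;   /* an out-of-range / non-snapping chunk pointer: refuse to free it */
  }
  cons_t *cell = &bigOpPool[off / CHUNK_SIZE];
  cell->cdr = bigOpHead;
  bigOpHead = cell;
}
```

With this structure, allocation initially hands chunks out in array order, and after a free the most recently freed chunk is the next one returned - the behaviour described above. The bounds/alignment check in freeBigOp is also the kind of place a "non-snapping chunk pointer" warning like the one in the earlier log would come from.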

i'm assuming you have a grid device attached though.

Nope, my grid is attached to a USB hub attached to this monstrosity:
http://llllllll.co/t/pictures/462/600
The current idea is that I run a rather complex but pretty canonical sequencer on the beaglebone, keeping everything synced with the boomerang, routing an FSR midi controller, etc., and just use the aleph for sound generation & maybe some generative patches if I can get my head round making music that way. I might add a full 'grid-forwarding' feature to the serial driver, enabling the aleph to shuttle the monome grid's raw serial protocol via the host interface & efficiently run ops like WW on the aleph using a grid attached to the host - somewhat cleverly circumventing the USB hub issue (if I do say so myself)...

And yeah, I wasn't surprised at the focus-set debug messages, just the subsequent crash! Well, tonight I'll try to run the stress test several times, saving the output each time. Pertinent questions after sleep:

  • Is this bug always occurring immediately after attempting to allocate 'op class 31'? (31 is midiCC)
  • If not, is it always occurring one allocation after 'op class 36'? (36 is cascades)

@ghost (Author) commented Nov 3, 2016

hm! ok, that's not what i saw;

Just to be clear, these three changes fix three distinct bugs revealed by the stress test (hopefully unrelated to the midiCC/cascades crash):

9987c19
(malloc opPool failover strategy results in glitchy opList)

3acc1b9
(the final opPool allocation was returning NULL instead of the final available memory chunk, with a subsequently glitchy opList)

c9f80ec
(ops weren't getting de-inited or freed on bail with a full netlist; side effects of this unknown/not understood)

I'm guessing:

spastically creating and deleting many operators

could conceivably have been triggering one or more of those three (now-fixed) bugs...

@ghost (Author) commented Nov 4, 2016

Added bounds checking to the serial op deletion/insertion commands (duh), and 90% of the crazy behaviour went away (just this minute saw a crash after 10 minutes or so of constant stress testing).
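
The checks are roughly this shape (a standalone sketch; the limits, the deletion message and the handler names are invented, though the two insertion warnings mirror the debug output further down the thread):

```c
#include <stdint.h>
#include <stdio.h>

#define NET_OPS_MAX  128   /* illustrative limit, not the real constant */
#define NUM_OP_TYPES 60    /* illustrative count of op classes */

static int16_t netNumOps = 0;   /* current op count (stand-in) */

/* reject out-of-range requests from the serial protocol before they
   ever reach the real insertion/deletion code (hypothetical names) */
int serial_insert_op(int16_t opType, int16_t pos) {
  if (opType < 0 || opType >= NUM_OP_TYPES) {
    printf("invalid opType requested: %d\n", opType);
    return -1;
  }
  if (pos < 0 || pos > netNumOps || netNumOps >= NET_OPS_MAX) {
    printf("Out of range op insertion requested: %d\n", pos);
    return -1;
  }
  /* ... hand off to the real insertion code ... */
  netNumOps++;
  return 0;
}

int serial_delete_op(int16_t pos) {
  if (pos < 0 || pos >= netNumOps) {
    printf("Out of range op deletion requested: %d\n", pos);
    return -1;
  }
  /* ... hand off to the real deletion code ... */
  netNumOps--;
  return 0;
}
```

The idea is that anything the harness throws at the serial port with an out-of-range opType or position just gets bounced back with a warning instead of touching the net.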

See here for the hardest stress test:

https://github.com/rick-monster/aleph/blob/net-use-malloc/utils/serial-com-proto/test-harness.lisp#L413-L429

In summary, the 'uzi' stress test:

  • creates 256 randomly selected ops (including out-of-range opIds) at random positions (including out-of-range positions)
  • pings 256 randomly selected input nodes with random data (including out-of-range nodes)
  • then finally deletes 256 randomly placed ops (including out-of-range positions)

I ran that in a loop for around 5-10 minutes before it died like this:

chunks per message: 1 , op class: 49 , size: 72 ; allocating...
chunks per message: 1 , op class: 32 , size: 72 ; allocating...
chunks per message: 1 , op class: 6 , size: 84 ; allocating...

Gonna try again right now! (btw op 6 is opMidiNote)

@ghost (Author) commented Nov 4, 2016

EDIT:
I miscounted - the problem op is opMidiNote, not monomeGridRaw as I originally thought. Running a long stress test again...

invalid opType requested: 65
chunks per message: 1
Out of range op insertion requested: 45
chunks per message: 1 , op class: 6 , size: 84 ; allocating...

OK, it ran for 20 minutes that time before crapping out on op class 6 again. What are the chances!?

Anyway I'll try another stress test concentrating on opMidiNote tomorrow...

And then another long test ignoring opMidiNote...

@ghost (Author) commented Nov 5, 2016

Noting down some more results using the 'uzi' stress test (frantic op creation/deletion, frantic input pinging):

  • haven't seen a crash or bad behaviour from hammering hard on the midiCC, midiNote or HID operators on their own, though I'm pretty sure those three are strongly correlated with the observed crashes in this stress test
  • currently running a 50ms uzi stress test excluding midiCC, midiNote & opHid
  • increased the system stress (roughly 2x) by reducing the range of random numbers (less probability of the serial protocol bouncing back out-of-range messages). Test started at 20:45, still running (19 minutes & counting).

@ghost (Author) commented Nov 5, 2016

ok after a pretty epic soak in the tub, IT'S STILL RUNNING!!!!! It's now 21:50 - I will vouch for that test!

wooooohoooo! time to finish ALL the beer & maybe find what's up with the midi/HID ops!

@ghost (Author) commented Nov 6, 2016

OK, after (hopefully) fixing this bug in the midi & hid ops, I'm re-running the uzi stress test with all the ops enabled! Start time 0:52 - hang on, is it nearly 1am!?

EDIT:

1:38, time to put my laptop to bed for the night...

I'm calling this working! If there are no objections to merging this, I will rebase & clean up the log a bit...

@ghost (Author) commented Nov 6, 2016

Just rebased these changes neatly on top of monome/dev. Another round of uzi stress testing to double-check nothing went haywire in the rebase, then I'll open a pull request from the other branch...

@ghost closed this Nov 6, 2016
@ghost deleted the net-use-malloc branch Apr 20, 2017