Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net_midi/net_hid linked list implementation bug? #272

Closed
ghost opened this issue Nov 5, 2016 · 9 comments
Closed

net_midi/net_hid linked list implementation bug? #272

ghost opened this issue Nov 5, 2016 · 9 comments

Comments

@ghost
Copy link

ghost commented Nov 5, 2016

Seeing a crash specific to operator insertion/deletion stress test with ops that use net_midi/net_hid (see https://github.com/rick-monster/aleph/issues/21 & #267).

Reading through the code now in net_midi.c - seems like a linked list-ish thing. Also seems possible that algorithm assumes operators are allocated last-in-first-out, and seems plausible to me this may blow up with the observed error message 'error unlinking midi operator' in case the allocation order is broken when de-allocating...

@catfact do you remember writing the code for midi ops? Haven't fully understood the code in net_midi.c yet - does the above explanation hold water?

Seems HID ops also have net_hid.c, similar-ish code - makes sense that both would be blowing up in the same way...

@ghost ghost changed the title net_midi linked list implementation net_midi/net_hid linked list implementation bug!? Nov 5, 2016
@ghost ghost changed the title net_midi/net_hid linked list implementation bug!? net_midi/net_hid linked list implementation bug? Nov 5, 2016
@ghost
Copy link
Author

ghost commented Nov 6, 2016

OK so as I understand it, net_midi.c is a doubly linked list. The active midi ops in there make a circular linked list structure. The error condition 'error unlinking midi operator' should only ever fire if net_midi_list_remove is called twice on the same midi op.

That should surely never happen - possibly gives another clue to what might be going on...

@catfact
Copy link
Collaborator

catfact commented Nov 6, 2016

oy sorry, i didn't see this earlier.
yeah it really looks like ops are being removed twice from the list... this only happens in the op deinit function. could deinit be called twice somehow?

the list managemetn stuff itself is awfully simple.
and yeah, it doesn't need to be doubly linked really.

@ghost
Copy link
Author

ghost commented Nov 6, 2016

Just added two commits with some more debug - now I'm getting a crash like this:

chunks per message: 1 , op class: 47 , size: 96 ; allocating...
op creation failed; too many inputs in network.
chunks per message: 1 , op class: 6 , size: 84 ; allocating...
chunks per message: 1 , op class: 46 , size: 68 ; allocating...
op creation failed; too many inputs in network.
chunks per message: 1 , op class: 6 , size: 84 ; allocating...
op creation failed; too many inputs in network.
chunks per message: 1 , op class: 20 , size: 128 ; allocating...
op creation failed; too many inputs in network.
chunks per message: 1 , op class: 6 , size: 84 ; allocating...
requested push of midi op already in list - bailing
op creation failed; too many inputs in network.

that 'bail on midi op already in list' was silent in the original code. Needs some more debug on entering/leaving net_midi_list_push & net_midi_list_remove...

@catfact
Copy link
Collaborator

catfact commented Nov 6, 2016

yeah.. something funny is happening when add_op fails... and sequence of events is something like

  • get a chunk for the op
  • call op_init , which calls net_midi_list_push
  • oops, too many ins/outs
  • call op_deinit which should unlink it from the list again
  • add_op should exit early before doing anything else
  • next time a midi op is made, it reuses the chunk address? and it's already linked? that's the part that's weird.

(sorry i know this is obvious, just going through it)

@ghost
Copy link
Author

ghost commented Nov 6, 2016

next time a midi op is made, it reuses the chunk address? and it's already linked? that's the part that's weird.

Very weird! In fact I'm going to try printing the hex address of new midi ops - most obvious possibility would be that the new op pool code is playing up, allocating the same memory chunk twice...

It doesn't look like deinit is called twice...

@ghost
Copy link
Author

ghost commented Nov 6, 2016

With the mem address printed on every entry to midi_list_push & midi_list_remove:

chunks per message: 1 , op class: 6 , size: 84 ; allocating...
entered net_midi_list_push D0042944
net_midi_list_push exited normally
op creation failed; too many inputs in network.
entered net_midi_list_remove D0042944
net_midi_list_remove exited normally
chunks per message: 1 , op class: 16 , size: 88 ; allocating...
op creation failed; too many inputs in network.
chunks per message: 1 , op class: 49 , size: 72 ; allocating...
op creation failed; too many inputs in network.
chunks per message: 1 , op class: 5 , size: 96 ; allocating...
setting monome grid focus, op pointer: 0xD0042944 , value: 1
warning! requested focus, but no handler was set. bad device type maybe?
op creation failed; too many inputs in network.
setting monome grid focus, op pointer: 0xD0042944 , value: 0
chunks per message: 1 , op class: 6 , size: 84 ; allocating...
entered net_midi_list_push D0042944
requested push of midi op already in list - bailing
op creation failed; too many inputs in network.
entered net_midi_list_remove D0042944

If I didn't know better, I'd say there must be a bug in these four lines:
https://github.com/rick-monster/aleph/blob/net-use-malloc/apps/bees/src/net_midi.c#L90-L94
The reason being (again going through it step by step)

  • D0042944 gets pushed onto the midi_op_list
  • net_add_op_at bails because not enough inputs
  • D0042944 gets freed
  • net_midi_list_remove exits normally (should pop D0042944 from midi_op_list)
  • next time net_midi_list_push gets called with that same memory chunk, D0042944 is still in the op list

Obviously this conclusion makes no sense, those 4 lines look correct!

Gonna write a debug function to print out the entire contents of ml, call it on entry/exit from push/remove midi_op, try & see what's going wrong there...

@ghost
Copy link
Author

ghost commented Nov 6, 2016

think I figured out the corner case that was causing this bug!

https://github.com/rick-monster/aleph/commit/0680f00bac21ca697b052d6e7c635310c5c90c19

So the problem with the original code is, what happens if you de-init ml.top when the list is not of length 1? the op to be deleted (ml.top) gets 'squashed out' of the doubly linked list, then ml.top is still left pointing to the disembodied op linked list element that just got deleted!

According to my reasoning on this - this bug must also have been present in the old code before arbitrary op deletion/insertion. (but not dead sure about that) Whatever - anyway now the bug seems to be fixed! This stress test:

https://github.com/rick-monster/aleph/blob/net-use-malloc/utils/serial-com-proto/test-harness.lisp#L434-L453

Has been running now for over 5 minutes, whereas it would crash out before https://github.com/rick-monster/aleph/commit/0680f00bac21ca697b052d6e7c635310c5c90c19 after only 30s!

@catfact
Copy link
Collaborator

catfact commented Nov 6, 2016

yep, that's a bug alright! sorrrry...

@ghost
Copy link
Author

ghost commented Nov 6, 2016

Well I somehow doubt that's the only remaining corner-case crash in BEES - there are a lot of lines in there!

@ghost ghost closed this as completed Nov 7, 2016
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant