version 181008: mlr crashes, then hardlocks system #603
Comments
|
thanks for the detailed report, i'll get to work on it |
i assume you mean 180707? on DSP side, changes are minimal but do affect wrapping logic for negative rates (removed some branches.) sounds like this is specifically affecting negative rates so i'll look there. odd because minimal script using softcut and reversed playback doesn't have any problems for me. since we've switched to supernova i need to do a full rewrite incorporating the resampling anyways. sorry it's been slow, day job is too intense to even play with norns let alone work on it.
OT, but this "bug" was never clearly communicated to me. the forum discussion is a muddle. i explained elsewhere that the voice regions really should have some pre-roll and post-roll (not start at zero, and not be totally jammed up against each other) if you don't want to hear unexpected things during the crossfades. but as you say that's another issue. |
voice regions have pre-roll, but i need to fix buffer loading to make sure adjacent "clips" don't get wiped out. easy fix i think. i'm generally unsure how a bug crept in re: crashing but i'll investigate |
correct; i typed a few too many 7s. edited original report.
mentioning reverse playback got me thinking, so i did some more thorough testing just now. the same hardlock occurs at <2min of letting it just play even without setting any clips to reverse. to reproduce: set 2 clips to play forward normal speeds, and the other 2 clips to play forward at 2x (or 4x). mlr still locks up and stops playing audio after less than 2 minutes of hands-off playback. curiously, the screen and controls of norns itself are working, and the grid itself partially works. that is, the mode/page-select buttons work, though the status/progress lights on each grid page are frozen as they were the moment the audio died. i can even press buttons in the rec and clip pages with the expected led activity, but they don't have any effect on the audio. loading a new sample into a clip doesn't make the cut/play page respond at all. since i can still access the norns menu, i was able to perform an audio system restart and reload the samples, though again that resulted in the same hardlock, after just 1 minute of all four clips playing. as before, each sample played at 1x and 2x forward. also as before, the norns menu and grid rec/clip pages are accessible, but inoperative. |
|
ok, many thanks for checking that! |
|
also, i made sure to ssh into norns before re-running my tests, and discovered the following messages filling up
there are dozens of these lines prior to/at the time the audio cutout, when mlr froze after 1 minute of untouched playback. again, the norns on-device controls work, the grid pages have limited response, and the system is still responsive over ssh. but mlr doesn't work; only restarting the system audio or rebooting norns lets me back in...until mlr crashes after 1 minute. i'm afraid that this is a complete blocker of a bug, as i can't find any way of stabilizing mlr long enough to get anything done. are there any other logfiles/messages i should examine, as long as i'm logged-in? |
|
since you asked, it would be very helpful for me to see the supercollider output. i think that maiden shows this if you select the appropriate REPL tab? (apologies that i can't actually remember this right now.) if not, you can get it on the terminal:
the first session should be showing SC output. NB: maiden won't work b/c we're bypassing the websocket wrappers. you still have a primitive lua REPL in the session running the matron process. (i'm not on a norns right now so apologies if i made any errors there but hopefully the idea is clear) |
|
oh and if you're feeling really ambitious you could go into https://github.com/monome/norns/blob/master/sc/core/CroneEngine.sc#L71 then restart then we'll see all the commands mlr is actually sending to the engine. |
|
thanks, i got it working within REPL as expected. here's the output with CroneEngine uncommented. (i can make my samples available via dropbox if necessary; they're 32bit 48khz .wav...i'm wondering if their format is what's breaking mlr. it looks like it thinks they're 16bit, as seen in the first debug line.) order of operation was: load mlr, select clip page, load samples into clips 1-4 sequentially, switch to rec page, toggle 2x speed on clips 2 & 4, switch to cut/play page, press a single button in each row, let loop. mlr crashed and hardlocked in less than 20sec. SC tab output:
|
|
ok thanks, that's extremely helpful. so, nothing in the actual error messages is showing up as a smoking gun exactly. but, i'm suspecting something to do with loading samples now. buffer frames allocated is higher than i expected - almost 12 minutes. (this is some stupid error on my part, incorrectly predicting the next power of two after applying hardware samplerate.) just as a sanity check, we could try reducing the buffer duration here: https://github.com/monome/dust/blob/master/lib/sc/Engine_SoftCut.sc#L3-L4 changing it to 300 (seconds) should give a frame count of 16777216, instead of 33554432. (or lower - seems like i suppose you could also try reducing the voice count back to 4: in the engine at line 3, and also in the mlr script - i think it's sufficient to change the value of besides that and the phase-wrapping method, the main difference from 180707 release (a really big one) is the switch to supernova. thanks again and i'll check this out myself shortly - blocking some dedicated time for norns development this week. |
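The frame counts quoted above can be checked with a little arithmetic. This is a minimal sketch, assuming a 48 kHz hardware sample rate and that the engine rounds the requested length up to the next power of two, as the comment describes. The function names are hypothetical and are not taken from the engine source.

```python
SAMPLE_RATE = 48000  # assumed hardware sample rate, in frames per second


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def buffer_frames(duration_s: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Frames allocated for a requested duration, rounded up to a power of two."""
    return next_power_of_two(int(duration_s * sample_rate))


def frames_to_minutes(frames: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a frame count back to minutes of audio."""
    return frames / sample_rate / 60


print(buffer_frames(300))                      # 16777216
print(round(frames_to_minutes(33554432), 2))   # 11.65, i.e. "almost 12 minutes"
```

Under these assumptions, any requested duration between roughly 350 and 699 seconds rounds up to the same 33554432 frames, so the original setting cannot be recovered from the frame count alone.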
|
[continuing] i'm not sure why the higher frame count would be a problem - it's ~134 MB and i thought we had plenty of RAM to spare. but loading a sample actually allocates a temp buffer on top of that, so i dunno. it's worth checking (surprised that supernova wouldn't give us a warning if it failed to reserve enough RAM, but checking it now on x86, appear to be able to request absurd buffer sizes on the sclang client without a peep from the server....) |
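The ~134 MB figure follows from the frame count if each frame is a single 32-bit float sample. A minimal sketch, assuming a mono buffer (the channel count is an assumption, and the helper name is hypothetical):

```python
BYTES_PER_SAMPLE = 4  # 32-bit float samples


def buffer_bytes(frames: int, channels: int = 1) -> int:
    """Memory needed for a buffer of the given size, in bytes."""
    return frames * channels * BYTES_PER_SAMPLE


big = buffer_bytes(33554432)    # frame count reported in the thread
small = buffer_bytes(16777216)  # frame count after reducing the duration to 300 s

print(big, round(big / 1e6, 1))      # 134217728 bytes, 134.2 MB
print(small, round(small / 1e6, 1))  # 67108864 bytes, 67.1 MB
```

A temporary buffer allocated while loading a sample would add to these totals, which is the concern raised in the comment above.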
that is correct.
i did exactly this, using my short only difference is our files? oh, my wifi is off. i'll turn it on and poke around with it still running |
current hypothesis is file length could be significant... |
|
FYI i intend to reduce the voice count of mlr to 4, but expand the "rows" and implement mute groups (as in mlrv). partially because i'd like there to be more overhead in softcut for interp recording, and also the possibility of other processing (ie if/when we make swappable aux/insert effects) |
i made these changes and re-ran the test. no joy. it hardlocked within 15sec, this time with the unpleasant buzzy audio cycling, and the norns and grid UIs are completely frozen. have to use the bottom-button-reset trick. (again, lemme know if you want me to send my samples to you for further analysis/reproduction.) SC output in REPL is also dead; the device locked so hard, it didn't write error messages. this is the last bit when starting play:
|
|
ok. loaded something isn't working where the full file didn't get loaded, and then when playing further into the non-loaded area it devolved to digital noise explosion. didn't crash the engine. was able to reload mlr. and then ran will get some actual debug output now |
|
ok then we're probably onto something there. at minimum @nightmorph reported buffer frame size is unchanged - are you sure the edit got saved or whatever? |
|
i verified in maiden that the files were correctly updated (as i'd edited them via ssh) before restarting the audio engine and mlr, but will check again now. |
|
just to be clear, any time you edit any |
|
verified, reloaded, and (ouch). i think i just got the same horrible digital noise-explosion that @tehn did after several seconds of untouched playback. engine also didn't seem to crash, was also able to reload mlr. is there a chance that the script needs more time to fully read the file from disk into ram? i've encountered that issue in the er301, in which i need to give long samples time to load from the microSD card into memory. there's no visual indicator there (or in norns) of how far along the operation is, so longer files might be getting truncated, or something. here's the SC output from REPL:
|
|
ok thanks. i'll have to dig into this and probably fix up seems to be some subtly different behavior in supernova and maybe less graceful bad-argument / concurrency handling for buffer stuff (not too surprising since it has a totally different threading model.) don't think this will be too hard and should have something to share late tonight. apologies for the difficulties. |
|
further testing note: while i reported earlier that i had a similar experience as @tehn in being able to reload mlr without the engine crash at the time...after letting it sit unattended for several minutes, without loading any samples, mlr in a blank state, i found that norns had still hardlocked itself. completely unresponsive, and nothing in REPL SC. |
|
ok thanks. made a stripped-down test of playback with multiple voices, reading same region, changing rates, using experience so far:
would be great to see what we can add from a minimal starting place that reproduces. or whether engine/buffer stuff is a total red herring and it's all about OSC and other network traffic density. @tehn alternatively, could try disabling some polls in mlr. |
|
(oh, will also try more aggressively triggering negative loop points, and maybe quick fix the negative wrap logic in ugen.) |
|
left the house and came back - audio still running, but UI is frozen and matron process is gone. in this case though, i see the problem - i was running matron from an ssh session, and the session timed out, and it killed the child process. somehow the session running sclang wasn't affected - sclang uses some kind of virtual terminal which may be the reason. anyway something to watch out for i guess... |
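One general way to keep a long-running test process alive after an ssh session drops is to start it in its own session, so the terminal hangup never reaches it. The sketch below uses Python's standard `subprocess` module on a POSIX system. The helper name and log path are hypothetical, and this is not how matron is normally launched on norns. A process started this way has no interactive stdin, so it would not give you the REPL mentioned earlier.

```python
import os
import subprocess
import sys


def launch_detached(cmd, log_path):
    """Start cmd in a new session so a terminal hangup does not kill it.

    Output is appended to log_path; stdin is disconnected.
    """
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # child calls setsid() (POSIX only)
        )
    finally:
        log.close()  # the child keeps its own copy of the file descriptor
```

For example, `launch_detached(["some-long-test"], "/tmp/test.log")` (both arguments are placeholders) returns immediately, and the child keeps running after the launching shell exits.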
|
anything i can help test on my end at this point? is the lua script something i can run directly on my norns, after changing the hardcoded filepath to point at one of my samples? |
|
sure, what i'm trying to do is build on that minimal example until it crashes reproducibly (in fact the rate change is probably not useful and just makes it less boring to listen to!)... not there yet and indeed i don't know at this point that the issue doesn't arise from mlr's greater complexity in general (if it is somehow correlated with network traffic). also, does this ever happen with wifi off?
yes; all the crashes first occurred while wifi was off, and the adapter not even plugged-in. that was what prompted me to file the initial report. i only turned on wifi so i could ssh in and poke at the logfiles. interestingly, on the harder crashes, in which the audio unpleasantly & noisily locks up, the ssh session also freezes. i can't interact with the OS at all. note that i've only ever run a single ssh terminal at a time, so norns isn't under heavy multi-term network load. |
i will see if i get the glitch explosion with polls disabled |
|
well, it's not a very good data point, but:
|
|
report on forum of lockups using curious to see if there are other reports. in any case, i'm currently applying the 181008 update. it looks like the update script should overwrite all my custom stuff and give me something like what users have. will report. still, it would be nice to have a full disk image. given that, i can just flash the whole filesystem on the white proto, which isn't a big deal, and see what happens. re: current / cm3 power supply, i would think that if there were RAM problems due to this, they would show up in the memtest. however another symptom of undercurrent is that the emmc would be intermittent, causing all kinds of mysterious issues which could definitely include kernel panic. i have to consider the strong possibility of a hardware issue, since so far the lockups seem so much more prevalent on some units than others, and the big-picture change since june is that we are using the hardware more fully across the board (engaging all cores, &c.) @nightmorph the listing of SC extensions shows that those files were created in july. that's weird, i'd think the update scripts should be overwriting them. i'll have a better idea after trying the update scripts here. in next update one thing i would like us to add is better logging for supernova process. i'll see what i can do. |
|
i applied the 181008 updates. this worked as far as i can tell. some interesting wrinkles:
anyways. i'm pretty confident i now have the same binaries and stuff as anyone who successfully ran the 181008 update. (supernova, sclang, sc3 plugins, matron, maiden, ugens, kernel overlays, boot modules.) and the end result is that MLR is happily running so far (same setup as before.) but i will update after a few hours. TL/DR (@tehn) |
|
MLR ran fine for an hour or so. (good thing the samples are pretty!) following up on forum reports, i tried running since this was a shutdown, i suspected a power thing. the first time it happened was on battery. then i plugged into a laptop usb port (macbook adapter + hub, actually) and got it a couple more times. i hooked up norns to mains power and tried again: result was audio/supernova death after several minutes, but grid and matron keep running. then i attached a current meter to the USB running between norns and grid. the grid current draw is peaking at 0.4 amps. (loom is an extremely bright and active script.) when i run MLR the USB current draw never exceeds 0.1A. this seems potentially significant. i think i'm using the standard norns power supply; anyways it's a 2A guy. (my "official" RPI3 power supplies are 2.5A, but have a hardwired micro-USB plug.) last test: running oh, last data point: i was able to immediately trigger a hard shutdown on battery by running |
|
per something that @catfact mentioned, it does indeed seem like the symlinked shared objects didn't get updated when migrating to 181008. (unless the date is wrong because the hwclock failed to keep itself synced when connecting to wifi.) is there a logfile for the update script, to see if something was silently failing? is there a set of manual steps i can run to make sure i get the updated libraries, and then re-test? also, do i need to undo/reset any of the various single-line script/lib changes i made in the earlier tests? to mlr, dust/sc, etc. as far as i know, tape/ is empty, so it's not a lack of space preventing a proper update. i only have 3-4 samples from a different project in dust/ -- there are no other samples or userscripts beyond what's installed by default. |
|
image still uploading (stalled overnight, cursed rural internet) |
|
i'm completely perplexed how some objects wouldn't have been updated. the update script: https://github.com/monome/norns/blob/master/update.sh_template
there is no logfile, it's just a script. we're working on completely overhauling the update method as it is too fragile. re: incomplete updates. this is a strong suspect. re: power. yes, this is disconcerting. it's just that i can't explain differences between people's hardware. one thing i am concerned about is battery quality-- i've seen a couple of them go bad. i'm wondering if random batteries just can't output current to spec. re @nightmorph, i'd be curious to first have you test a fresh image flash, and if that fails, replace the battery. if the image flash fails i'd be happy to look at your hardware myself and save you the trouble. i'm sorry for all of this. |
|
also re power adapter. i chose a 5.25V at 2A which is close to 5V @ 2.5A in terms of wattage, but the higher voltage allows for voltage drop over the usb cable with high current (i also chose a high quality high-gauge usb cable, for lower drop)... this part of the design stage was very intense last year. but in reality we shouldn't even be getting close to the limits of power draw. i also regularly monitor USB power draw and haven't seen anything too crazy. |
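The wattage comparison and the voltage-drop argument above can be written out with Ohm's law. A minimal sketch: the supply ratings come from the comment, while the 0.2 ohm cable resistance is a made-up illustrative value, not a measurement of the actual norns cable.

```python
def watts(volts: float, amps: float) -> float:
    """Power delivered by a supply at its rated voltage and current."""
    return volts * amps


def voltage_at_device(supply_v: float, current_a: float, cable_ohms: float) -> float:
    """Voltage left at the device after the drop across the cable (V = I * R)."""
    return supply_v - current_a * cable_ohms


CABLE_OHMS = 0.2  # hypothetical round-trip cable resistance

print(watts(5.25, 2.0))  # 10.5 W for the 5.25 V / 2 A adapter
print(watts(5.0, 2.5))   # 12.5 W for a 5 V / 2.5 A adapter

# At the same 2 A load, the higher-voltage supply leaves more headroom.
print(round(voltage_at_device(5.25, 2.0, CABLE_OHMS), 2))
print(round(voltage_at_device(5.0, 2.0, CABLE_OHMS), 2))
```

With a lower-resistance cable the drop term shrinks, which is the stated reason for choosing a high-quality cable.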
I'd be curious to hear exactly what steps you're taking when seeing the loom issues - I've been testing here and unable to reproduce so far. Perhaps we can open a new issue if it's reproducible and I can investigate. |
|
current disk image uploaded: https://monome.nyc3.digitaloceanspaces.com/norns181008.img.zip @nightmorph did you per chance try changing the number of softcut voices and reducing the mlr tracks? i know ezra suggested it but i may have missed the results in the thread here |
yes, i did that earlier, but it had no effect. just to verify--what's the procedure for loading this image on my norns? or should i just try to reinstall the regular release? edit: here's something curious: i ran the note: mlr ran just fine for about 10 minutes, but then hardlocked with the distinctive unpleasant buzz. still, that's greatly improved over the sub-15sec freezes using the old unsymlinked libs. |
|
@nightmorph don't bother flashing your disk, we're debugging other over-cpu causes... |
|
edit also, can you ssh in to your unit and do
while you run your test? |
|
@markwheeler i got a couple of hard crashes with loom yesterday, each after 10 minutes or so, not doing anything special except dialing up the number of voices to 10 or something. interestingly, i couldn't get any by running loom without the grid connected. other people have never had a crash. [ed: oh i should note: i had these after running the 181008 update. before that i was using sources+scripts from tip of git, and sclang/supernova/ugens/matron built on device.] i don't think this is a problem with your engine or script in particular, we're seeing stability problems across the board with high usage under supernova and are still trying to triage this. making it far more frustrating is the variability across units (or users, or climates, or houses of the zodiac, or who knows.) e.g., josh can get reliable crashes after 30s with mlr, but it takes me 5 days to get one. anyways, after spending lots of time on softcut-specific obvious culprits (like buffers and custom ugens) i think these are describing the same issue. so seems ok to discuss them both here. |
|
@nightmorph i can't explain why those files wouldn't have been properly symlinked. the update procedure is not working correctly and needs to be worked on. if you would indeed like to try a clean disk image: https://github.com/monome/norns-image/blob/master/readme-usbdisk.md @markwheeler i've been test-running loom for about 20 minutes now without problems, everything max'd out. |
changes made, temps tested. really interesting results when it did crash.
|
|
looks like kernel oops / near panic (meaningless |
I posted this in the lines update thread, but I'll mention again here: when I was manually running the sc/install.sh script on RasPi I would get some errors about failing to create symlinks. A friend helped me out with this and apparently adding |
|
that makes sense. sc install script assumes parents exist but i guess now we are nuking higher levels of SC extension dir tree |
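The failure mode described here, creating a symlink inside a directory that no longer exists, can be avoided by creating the parent directories first. The following is a minimal Python sketch of that idea. It is not the actual fix applied to sc/install.sh, and the helper name and example paths are hypothetical.

```python
import os


def link_with_parents(target: str, link_path: str) -> None:
    """Create a symlink at link_path, creating missing parent directories first."""
    parent = os.path.dirname(link_path)
    if parent:
        # No error if the directory tree already exists.
        os.makedirs(parent, exist_ok=True)
    if os.path.islink(link_path):
        # Replace a stale link left behind by an earlier install.
        os.remove(link_path)
    os.symlink(target, link_path)
```

Calling `link_with_parents("/path/to/source", "/path/to/extensions/name")` (placeholder paths) succeeds even when the extensions directory tree has been removed beforehand.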
|
unfortunately, this hardlock bug is still present, even after receiving my norns back from @tehn and updating to the latest november release. just about a minute of playing four unique samples in mlr (each in their own slot, ranging in size from 34sec to 46sec), and then the device hardlocked with the buzzing sound of death, while running with AC power plugged in, no wifi. controls on the unit and grid/mlr were completely unresponsive. i think i've managed to reproduce the freezes in a second test: while playing samples, go into mlr's parameters screen, and adjust volumes for each track (i only touched 2-4). any noticeable/rapid volume change, e.g. 0.85 -> 0.70 resulted in the freeze. |
|
ok thanks. we're getting close with some large changes especially to the MLR engine. in the meantime i'd love to know if you happen to get any crashes with any other scripts, and in general what the cpu/temp readings in system are looking like in typical use... |
|
when plugged in, temp is 65-66deg when letting all 4 tracks just loop. and what looks like 53%-56% cpu usage at the same time. i'll test more scripts soon as i can. so far, it's stable for a few minutes when just looping, not making any large adjustments to volume while tracks play. |
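For logging temperature over ssh while a test runs, one option on many Linux systems (including Raspberry Pi boards) is the sysfs thermal interface, which reports millidegrees Celsius as plain text. This is a sketch under the assumption that norns exposes the standard `thermal_zone0` node; the thread does not say which command or interface was actually used for these readings.

```python
import time

THERMAL_PATH = "/sys/class/thermal/thermal_zone0/temp"  # assumed to exist on the device


def parse_millidegrees(text: str) -> float:
    """Convert a sysfs reading such as '65500\\n' to degrees Celsius."""
    return int(text.strip()) / 1000.0


def read_cpu_temp_c(path: str = THERMAL_PATH) -> float:
    """Read the current CPU temperature in degrees Celsius."""
    with open(path) as f:
        return parse_millidegrees(f.read())


def log_temps(samples: int, interval_s: float = 1.0, path: str = THERMAL_PATH) -> list:
    """Collect a fixed number of readings, one every interval_s seconds."""
    readings = []
    for _ in range(samples):
        readings.append(read_cpu_temp_c(path))
        time.sleep(interval_s)
    return readings
```

Running `log_temps(60)` in a second ssh session while the script plays would give a minute of readings to compare against the 65-66 degree figure above.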
|
this is alarming to me. i performed similar looping tests for hours here and had no issue. i'm having a difficult time imagining that ambient temperature is our only variation, and will certainly replace hardware if need be. please report any further data points if you have time. |
|
i'm starting to suspect that there are just bugs in supernova and maybe particularly on the ARM/NEON branch of |
i also tested it in pretty cool ambient air; took it out to a nearby canyon late this afternoon. ran off the internal battery, only powering a grid and main audio outs (not the headphone jack). temps were a few degrees C lower, but the unit still locked up twice while running mlr -- both times when switching to the system audio/fx menu. the first time, the audio cut out immediately upon switching to that menu, even before i'd started adjusting the reverb send amount. the grid remained responsive, but didn't actually do anything. resetting the audio and relaunching mlr was ineffective; no sound came out, so i had to reboot. the second time, the audio cut out just after switching to the audio menu and quickly adjusting the reverb send. this time it locked up with the harsh buzz. after those two freezes and reboots, i got a good third take (i was filming a video). i had the proper fx screen already prepared and the setting highlighted, so i didn't need to menu dive the way i did the second take. it was a one-press switchover from mlr to fx settings. in this third try, i made a point of gradually adjusting the send amount, and that worked just fine. so one success outta three 3-minute tries ain't bad.
last time i played with arm and compilation/optimization/bootstrapping an OS for the architecture was on beagleboard circa 2009. at the time, the arch as a whole had major issues with neon and softfloat. wouldn't surprise me if they still linger in anything that does intense dsp. lemme know if either of you want me to upload my samples and mlr settings somewhere; it's mostly just track volumes and fx amounts. |
|
@nightmorph to fill you in, one thing we're doing is definitely getting rid of supernova. we still want to support people's supercollider patches so there is some work to do re-architecting things. this might include making the i'd welcome feedback if you want to give that a spin. it fixes write-head resampling, adds some weird new things (a filter), is still missing many other things (loading files, more voices, routing, param smoothing, &c) and is undergoing more optimization right now. it's nothing like a complete instrument and pretty much needs to be controlled from a remote computer. if you get random crashes with a standalone buffer-manipulating jack client like this then i think there is definitely something bad going on with your unit. |
|
@nightmorph i know you need the reverb for your performance, but i'd be interested if the crash occurs with AUX turned off? also for mlr specifically, i'd be curious to have you try commenting out this line (disable supernova): https://github.com/monome/norns/blob/master/sc/core/Crone.sc#L47 you'll need to |
|
2.0 is out; closing |
norns OS version 181008 (updated from 180707. this bug is not present in 180707.)
the latest version of mlr keeps crashing during normal play, repeatedly taking down the entire norns OS with it. sometimes it's just a "soft" crash, in which the screen and controls are still responsive, so system->reset audio might fix it. in that case, another crash will result very shortly after reloading mlr, reloading samples, and beginning to play.
but most of the time, mlr crashes so hard that it completely locks up norns; it's no longer possible to reset the audio engine or use any of the onscreen menus/controls. only the bottom reset button reboots into a working system.
steps to reproduce: