clean up / document component management methods #1014

Closed
catfact opened this issue Feb 17, 2020 · 15 comments

@catfact catfact commented Feb 17, 2020

recently there were some issues reported on the forum related to the way program lifecycle management works on norns (or doesn't work.)

it made me realize that the use cases and the final design concept are fuzzy to me, and i guess to most users.

here are a few things that can be used to reset parts of the norns stack:

    1. SYSTEM > SLEEP.
      • saves state
      • zeros audio output levels
      • performs sudo shutdown
    2. SYSTEM > RESET
      • saves state
      • resets systemd services norns-matron, norns-crone, norns-sclang
    3. issuing ;restart to the matron repl in maiden.
      • resets only norns-matron service (i think!)
      • immediately reconnects maiden to matron websocket, catching most startup output
      • this will cause a lot of runtime problems, since matron needs metadata about running engines/polls.
      • the only time i can think of needing this, is in the diagnosis of startup errors in matron sources or system lua (and for some reason you can't or don't want to use the shell.)
    4. issuing ;restart to the sc repl in maiden.
      • resets only norns-sclang service (i think!)
      • immediately reconnects maiden to sclang websocket, catching most startup output
      • as far as i know, things should basically keep working after doing this, but the architecture doesn't actually guarantee it.
      • i think this is useful in the diagnosis of startup errors in supercollider, which includes any errors in custom engine classes. (though again one can just use the shell also.)
    5. AUDIO > RESET. as far as i can tell, this is no longer exposed in the menu, but the underlying API calls (_norns.reset_audio in lua, and on down) are still available:
      • soft-resets supercollider environment by performing Interpreter.recompile, which also relaunches the scsynth process.
      • tells matron to reperform handshake with sclang.
      • so in a nutshell, this should guarantee that things keep working, but could of course be bugged at this point. (this was more relevant pre-2.0.)
    6. rude power button. should result in a false clean_shutdown flag, meaning next launch will not run any script, restore mix levels, or (importantly) restore device mapping. this is still the only way for a typical (non-commandline) user to get out of many bugged situations. (e.g. a bug in the default script, bug in our handling of vports with nonexistent devices, anything that locks up the UI.)
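
To make the comparison above concrete, here is a minimal Python sketch (not part of norns) that records which systemd services each software reset path restarts, as described in the list. The dictionary and helper names are made up for illustration, and the two ;restart entries carry the same "i think!" uncertainty as the list itself.

```python
# Service names are taken from the list above.
ALL_SERVICES = frozenset({"norns-matron", "norns-crone", "norns-sclang"})

# Which services each software reset path restarts, per the list above.
# The ";restart" entries are the author's best guess ("i think!").
RESTARTS = {
    "SYSTEM > RESET": frozenset(ALL_SERVICES),
    "matron ;restart": frozenset({"norns-matron"}),
    "sc ;restart": frozenset({"norns-sclang"}),
}


def left_running(method):
    """Return the services that keep running (and keep their old state)."""
    return ALL_SERVICES - RESTARTS[method]


for method in RESTARTS:
    print(method, "leaves running:", sorted(left_running(method)))
```

The matron ;restart row is the interesting one: norns-crone and norns-sclang stay up, which is the situation where a still-running engine keeps talking to a freshly started matron.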

in other words, i think we need to clearly inform users about which method to use in which circumstance (troubleshooting, recovering, updating, etc.), and we may want to refine the offerings. for example:

  • if the "audio reset" behavior is no longer relevant post-2.0, let's remove it completely.
  • we may want a way for users to perform a "clean boot" without the rude power button. ideally it would be implemented at a lower level than the main application loop, e.g.:
    • if all three keys are held (say), we detect this in hardware/gpio module and launch a special timeout thread.
    • if timeout thread detects a 6s hold of all keys (or whatever), matron immediately executes a soft-reset of the stack via systemd, without saving any application state or writing clean_shutdown.
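
The hold-detection idea in the bullets above can be sketched roughly as follows. This is an illustrative Python model only: in matron the logic would presumably live in the C GPIO module, and the class name, method names, and explicit timestamps are assumptions made so the example runs without hardware.

```python
class HoldDetector:
    """Fires once when all keys have been held together for hold_s seconds."""

    def __init__(self, n_keys=3, hold_s=6.0):
        self.down = [False] * n_keys
        self.hold_s = hold_s
        self.since = None   # time at which the last missing key went down
        self.fired = False  # ensures one reset per continuous hold

    def key_event(self, index, pressed, now):
        """Record a key press/release at time `now` (seconds)."""
        self.down[index] = pressed
        if all(self.down):
            if self.since is None:
                self.since = now
        else:
            # Any release cancels the hold and re-arms the detector.
            self.since = None
            self.fired = False

    def poll(self, now):
        """Called periodically by the timeout thread; True means 'reset now'."""
        if self.since is None or self.fired:
            return False
        if now - self.since >= self.hold_s:
            self.fired = True
            return True
        return False


detector = HoldDetector()
for key, t in ((0, 0.0), (1, 0.5), (2, 1.0)):
    detector.key_event(key, True, t)
print(detector.poll(6.5), detector.poll(7.0))
```

When poll returns True, the caller would trigger the systemd-level soft reset without saving application state or writing clean_shutdown, as proposed above.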

what do yall think?

@catfact catfact commented Feb 17, 2020

oh another thought, sorry

we tell users with locked-up norns that the rude button is a method of last resort (correct), and imply that matron repl ;restart is preferred. two issues with that:

  • we should qualify that this should get you back to the menu with a clean application state, but it should not be considered functional at that point and SYSTEM > RESET should be used right away. if we want to remove this qualification, we could put a probably-minor effort towards a more intelligent re-connection apparatus.
  • if matron's main loop is being overwhelmed, it may not respond to the restart request. and it's not uncommon for the maiden UI to be flooded and unresponsive at that point too.

@tehn tehn commented Feb 17, 2020

thanks for bringing this up.

SYSTEM > RESET should do a "clean" no-script restart, ie, delete the system.pset etc and reset to known working values. the current behaviour is misleading and doesn't solve all problematic use cases.

is it frequent that people somehow lock up their onboard UI? (ie knobs/screen?)

@catfact catfact commented Feb 18, 2020

yes, it is easy to lock up the UI with scripting errors or due to system bugs. for example if we print some error message at 1000hz then the main loop is borked.

hence the proposal to have a clean reset executed directly from GPIO module, which seems easy ( i can PR it if you like)

@simonvanderveldt simonvanderveldt commented Feb 18, 2020

Some quick questions

> issuing ;restart to the matron repl in maiden.
> this will cause a lot of runtime problems, since matron needs metadata about running engines/polls.

Why is this? Does this mean that if I restart matron I get into a broken state? It doesn't just get the required info when it starts?

Do we need four different types/ways (2-5 in the list) to reset stuff?

> hence the proposal to have a clean reset executed directly from GPIO module, which seems easy ( i can PR it if you like)

We could also use a watchdog for this, although that's of course quite invisible to the user/not user initiated.

@tehn tehn commented Feb 18, 2020

here's my proposal:

  • SYSTEM > RESET should delete all system settings (levels, vport assignments, current script) which will get the software in a "clean" state (these settings are pretty minimal so it's not a hassle)
  • a hard reset (white button) will have a subsequent boot in "not clean" state which means a script will not be loaded, so the user will not be locked out. hence i don't think it's necessary to have a GPIO detection method for clean boot. if we want a dirty boot to self-delete config data i'm fine with that, but it doesn't seem necessary as the user could then execute SYSTEM > RESET though that would only reset the levels, vports, etc.
  • in the case of a script fail (ie overloading prints disabling the hardware UI), there's a chance that ;restart via maiden will catch. if not, that's unfortunately what the hardware reset is for, and i am skeptical that watchdog is a good idea here: false positives are way more problematic than an occasional hard shutdown, which should seriously only happen if someone is developing a script, so they should be aware of what they're doing. the fact that people think hard shutdowns fix anything is a huge misunderstanding that must be addressed... and obviously having point 1 above reset correctly will solve most people's issues.
  • ;restart in maiden should perhaps always reset both matron and crone and supercollider to get the system in a known state. this of course may change when/if we push around the supercollider handshake requirement (for later discussion). i generally find the quick matron reset helpful when developing the core lua menu stuff, but that's a minor case.
  • i generally want to avoid special-case gestures, but a sort of "watchdog" could be a 6-second hold of all three keys, to issue a reset during a lua lockup. this of course takes away this option as a performance/user gesture, though i'm not sure it's a super useful one (unless you were playing the 3 keys like an organ or something... which is why i hesitate here)

@tehn tehn commented Feb 18, 2020

ugh. on another note, undocumented feature: K1 held + SELECT from the menu clears the current script. though like SYS > RESET it's not accessible if the UI is locked up

@tehn tehn commented Feb 18, 2020

partial fix by #1015

TODO: documentation

still open to conversation about watchdog/GPIO/etc

@simonvanderveldt simonvanderveldt commented Feb 18, 2020

> a hard reset (white button) will have a subsequent boot in "not clean" state which means a script will not be loaded, so the user will not be locked out. hence i don't think it's necessary to have a GPIO detection method for clean boot. if we want a dirty boot to self-delete config data i'm fine with that, but it doesn't seem necessary as the user could then execute SYSTEM > RESET though that would only reset the levels, vports, etc.

I think it would be nice to ensure that after startup everything is always guaranteed to be in a working state, no additional actions needed.
From @catfact's description I gather we already made sure everything was in the correct state when this happened based on the clean_shutdown flag he mentioned. Or is that not correct?

IMHO it would make sense if this would do the same thing as SYSTEM > RESET (maybe not restarting the services depending on when/where in the boot process we do this)

Also, being able to reset (or restart) outside of SYSTEM > RESET and the hardware reset button means that we should be able to shut down properly in more cases. Resets with the hardware reset button should only be necessary if the whole system (ie the kernel/Linux) hangs, not when components of the norns stack hang. This prevents potential issues with disk/filesystem corruption/broken journals.

And @tehn you're right, watchdogs can be tricky and false positives are something we definitely don't want. Just wanted to mention it as an option. Personally I like the suggestion of having a complex key combination, but I also see the issue with that combo then not being available for scripts to use, so I figured a watchdog might be a possible alternative solution.

@tehn tehn commented Feb 18, 2020

honestly there isn't much to break at this point:

  • someone has their levels wrong and is really confused (hence deleting system.pset restores levels to sane)
  • something is busted with vports (which is a whole other issue) and deleting system.state means that vports are re-initialized on usb plugin (ie, the user doesn't have to manually reassign vports)

neither of these are a Fix. they are a quick solution to user error. which is why this hasn't gotten much attention up until now. the clean_shutdown flag is really just to prevent re-loading a failed script. and that totally works 100% even prior to today's fix.

the case of users truly stuck is uncommon and should only happen if someone is writing a script and gets in trouble... which hopefully can be solved via maiden resets/etc.

generally there should not be a common case where normal use of a script should cause a full lockup for someone, but of course this is possible.

i also don't think we realistically have seen repeatable lockups of the kernel or underlying norns components--- just the lua environment. so any fix should mostly address that. key combination is an option.

@catfact catfact commented Feb 19, 2020

> Why is this? Does this means that if I restart matron I get into a broken state? It doesn't just get the required info when it starts?

well, it doesn't signal any change to the sclang process. so the running engine can be doing stuff, and in the case of engine polls it happens to be sending OSC that matron now doesn't understand.

we can patch this of course. i think reset_audio should just broadcast OSC to any/all connected processes, engine interfaces should just handle this by shutting down the current engine, and we should always broadcast this on startup just in case.
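
As a rough illustration of that patch idea, here is an in-memory Python sketch with no real OSC involved. The address string, class names, and method names are all hypothetical; the point is only the shape of the behaviour: every connected engine interface drops its current engine when a reset is broadcast, and matron broadcasts unconditionally on startup.

```python
RESET_ADDRESS = "/norns/reset"  # hypothetical OSC address, not a real norns path


class EngineInterface:
    """Stands in for the sclang-side engine manager."""

    def __init__(self):
        self.current_engine = None

    def load(self, name):
        self.current_engine = name

    def handle(self, address):
        # On a reset message, shut down whatever engine is running.
        if address == RESET_ADDRESS:
            self.current_engine = None


class Broadcaster:
    """Stands in for 'send this OSC message to every connected process'."""

    def __init__(self):
        self.listeners = []

    def connect(self, listener):
        self.listeners.append(listener)

    def broadcast(self, address):
        for listener in self.listeners:
            listener.handle(address)


def on_matron_startup(bus):
    # Always broadcast a reset on startup, "just in case" an engine
    # from a previous matron instance is still running.
    bus.broadcast(RESET_ADDRESS)


bus = Broadcaster()
engine_side = EngineInterface()
bus.connect(engine_side)
engine_side.load("ExampleEngine")  # left over from before matron restarted
on_matron_startup(bus)
print(engine_side.current_engine)
```

With this in place, a matron-only restart would no longer leave a stale engine sending polls that the new matron instance doesn't understand.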

but here's a simplifying suggestion:

make ;restart in either REPL on maiden cause both services to restart,
and maybe the whole norns- service stack.

in fact: maybe rename ;restart to ;reset, and make it do the exact same thing as SYSTEM>RESET.
wouldn't that clarify things?

my assumption is that this:

> SYSTEM > RESET should delete all system settings (levels, vport assignments, current script) which will get the software in a "clean" state (these settings are pretty minimal so it's not a hassle)

can be accomplished just by resetting all the systemd services, (causing a dirty-boot), and ensuring that handling a dirty-boot means deleting vports/psets. right?


> generally there should not be a common case where normal use of a script should cause a full lockup for someone, but of course this is possible.

it happens. the norns API is very complicated and we can't pretend to have tested every possible script interaction for errors in the system stack on every update. so we really can't assume that it's user error or even a scripting error when the UI hangs for whatever reason.

the main reason i'm bringing up a separate hardware-driven reset, is: i have some suspicion that the white button is really very bad.

i spent a little time trying to research the susceptibility of XNAND flash to permanent damage from power loss during write. i didn't do enough to say for sure and it's a complicated topic, but i think it's a risk. (basically has a chance of creating a permanently-bad sector.)

even if it "only" corrupts the filesystem, in this environment that can mean hardware damage.

anyways, of course it's completely up to you whether a key-combo to restart is acceptable. it could be a really long hold (30s) or even something like a triple-hold plus a sequence of encoder turns. the purpose of such a thing would be to really do as much as possible to give people as few excuses as possible for using the white button under any circumstance.

@catfact catfact commented Feb 19, 2020

in fact, here's an additional suggestion:

;reset can reset both processes and do journalctl and dmesg or whatever to assemble an error report. is that crazy?

@tehn tehn commented Feb 19, 2020

i actually think a long hold (10s) in a particular sequence (k3,k2,k1 or something) would be a good idea. granted this would be managed by matron, so it's less immune than maiden's ;reset... but i still think it's a good idea.

logging, sure. it would possibly be sensible to dump a log to ~/dust/data so it's viewable in maiden. i'm unsure which logs exactly we'd want.
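
A minimal sketch of what such a report step might look like, in Python for illustration. Which logs to collect is exactly the open question above, so the choice here (the current-boot journal for each norns service plus dmesg) is an assumption, as are the function names and the report file name. `journalctl -u UNIT -b --no-pager` and `dmesg` are standard commands; dmesg may need elevated privileges on some systems.

```python
import subprocess
from pathlib import Path

UNITS = ("norns-matron", "norns-crone", "norns-sclang")


def report_commands(units=UNITS):
    """Build the list of commands whose output goes into the report."""
    cmds = [["journalctl", "-u", unit, "-b", "--no-pager"] for unit in units]
    cmds.append(["dmesg"])
    return cmds


def run_command(cmd):
    """Run one command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def write_report(path, run=run_command):
    """Write all command outputs to `path`, one labelled section each."""
    sections = []
    for cmd in report_commands():
        sections.append("$ " + " ".join(cmd) + "\n" + run(cmd))
    path.write_text("\n".join(sections))
    return path


# Example (hypothetical file name under the directory suggested above):
# write_report(Path.home() / "dust" / "data" / "reset-report.txt")
```

The `run` parameter is there so the collection step can be swapped out or tested without systemd present.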

@tehn tehn commented Feb 19, 2020

here's my proposal:

  • matron: add lower level reset function (which basically issues systemd commands)
  • matron: GPIO monitor for 10s hold of all keys, in particular order, which issues reset
  • lua: on startup, if dirty detected, erase system.pset and system.state, skip script resume
  • maiden: add ;restart in either window which does systemd restart norns-* (this will cause a dirty load of matron)

note that if the norns is shut down (or matron restarted) in any way other than SLEEP it will boot up the next time dirty.
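
The dirty-boot handling in the proposal above could look roughly like this. It is a Python sketch, not the lua that would actually run on norns, and it assumes the clean_shutdown flag is stored as a marker file next to system.pset and system.state; the thread does not say how the flag is persisted, so treat that and the function names as hypothetical.

```python
from pathlib import Path

SETTINGS_FILES = ("system.pset", "system.state")
FLAG_NAME = "clean_shutdown"  # assumed marker file; real storage may differ


def mark_clean_shutdown(data_dir):
    """Called only on SLEEP, after state has been saved."""
    (data_dir / FLAG_NAME).touch()


def startup(data_dir):
    """Return True if the last script may be resumed (clean boot)."""
    flag = data_dir / FLAG_NAME
    clean = flag.exists()
    if clean:
        # Consume the flag so any exit other than SLEEP boots dirty next time.
        flag.unlink()
    else:
        # Dirty boot: erase system settings so defaults are used.
        for name in SETTINGS_FILES:
            path = data_dir / name
            if path.exists():
                path.unlink()
    return clean
```

Consuming the flag at startup is what makes "any way other than SLEEP" boot dirty: a crash, a ;restart, or the white button never gets the chance to write it again.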

@tehn tehn self-assigned this Feb 19, 2020

@simonvanderveldt simonvanderveldt commented Feb 19, 2020

> matron: add lower level reset function (which basically issues systemd commands)

Would this be different from systemd restart norns-* as listed in point 4?

> maiden: add ;restart in either window which does systemd restart norns-* (this will cause a dirty load of matron)

Could the dirty startup, which will trigger item 3 if I understand it correctly, somehow interfere with/be annoying when developing Lua (like the menu code you mentioned) or SC code from maiden?

P.S. We could probably use systemd's PartOf dependency so you only need to restart one item (probably norns.target)

> PartOf=
>
> Configures dependencies similar to Requires=, but limited to stopping and restarting of units. When systemd stops or restarts the units listed here, the action is propagated to this unit. Note that this is a one-way dependency — changes to this unit do not affect the listed units.
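
To illustrate the one-way propagation described in that excerpt, here is a small Python model of the direct PartOf= relationship. It is only a toy: the unit names with .service suffixes and the existence of a norns.target that the services are part of are assumptions based on the "probably norns.target" remark, and only direct (non-chained) propagation is modelled.

```python
# unit -> units listed in its PartOf= setting (assumed layout)
PART_OF = {
    "norns-matron.service": ["norns.target"],
    "norns-crone.service": ["norns.target"],
    "norns-sclang.service": ["norns.target"],
}


def restarted_by(unit, part_of=PART_OF):
    """Units restarted when `unit` is restarted.

    That is the unit itself plus every unit that declares PartOf=unit.
    Restarting a unit does not affect the units it is itself part of.
    """
    dependents = {u for u, targets in part_of.items() if unit in targets}
    return {unit} | dependents


print(sorted(restarted_by("norns.target")))
print(sorted(restarted_by("norns-matron.service")))
```

Restarting the target takes all three services with it, while restarting norns-matron.service alone leaves the others untouched, so a matron-only restart for development would still be possible.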

@tehn tehn commented Feb 19, 2020

> Would this be different from systemd restart norns-* as listed in point 4?

bullet 1 isn't really needed. just issuing an os command is sufficient.


i was not suggesting removing maiden's ;restart ;start and ;stop for matron. so that could still be used for development.
