Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sclang crash on system time sync #399

Open
tehn opened this issue Jun 5, 2018 · 66 comments
Open

sclang crash on system time sync #399

tehn opened this issue Jun 5, 2018 · 66 comments
Labels
bug

Comments

@tehn
Copy link
Member

@tehn tehn commented Jun 5, 2018

sclang crashes when time/date change by a substantial amount

  • this happens when connecting to a wifi network which syncs time/date
@tehn tehn added the bug label Jun 5, 2018
@samdoshi
Copy link

@samdoshi samdoshi commented Jun 5, 2018

What are you using to set the time on network connection? systemd-timesyncd or a full NTP client?

Does the hardware have an RTC?

@tehn
Copy link
Member Author

@tehn tehn commented Jun 5, 2018

the hardware does not have an RTC. it didn't occur to me that this would ever be an issue unfortunately.

@samdoshi
Copy link

@samdoshi samdoshi commented Jun 5, 2018

Okay. Apparently you can use fake-hwclock to stop the time coming back up as the UNIX epoch (1st Jan 1970), instead it will come up with the time set the same as when Norns was last powered down.

Something like chrony (an NTP client) will slew the clock rather than step it, but that's more designed to handle differences of a few seconds.

I guess the question is what counts as a "substantial amount"? If the issue is that it crashes when going from 1st Jan 1970 to the current date, but not if jump is only a few weeks/months then fake-hwclock might be good enough.

Otherwise someone will need to fire up gdb and figure out why sclang is crashing.

This might be handy for setting up a reproducible crash on a desktop machine:

https://serverfault.com/questions/138325/faking-the-date-for-a-specific-shell-session

@catfact
Copy link
Collaborator

@catfact catfact commented Jun 5, 2018

@artfwo has already done a lot of work on this including with fake-hwclock. He's out for the week at Linux audio conference. Maybe will have more to add. oddly, spoofing the clock doesn't crash sclang (necessarily) but does mess up various ugens in various ways

@samdoshi
Copy link

@samdoshi samdoshi commented Jun 5, 2018

Bother. Tried using libfaketime to crash sclang:

$  FAKETIME_STOP_AFTER_SECONDS=10 faketime "1970-01-01" sclang 

(skipped some output)

*** Welcome to SuperCollider 3.9.3. *** For help type ctrl-c ctrl-h (Emacs) or :SChelp (vim) or ctrl-U (sced/gedit).
qt5ct: palette support is disabled
sc3> thisThread.seconds
-> 7.212063747
sc3> thisThread.seconds
-> 1528231717.044
sc3> 
Exiting sclang (ctrl-D)
sc3> main: waiting for input thread to join...
main: quitting...
cleaning up OSC

The 2 calls to thisThread.seconds are a few seconds apart. No crash.

@catfact
Copy link
Collaborator

@catfact catfact commented Jun 6, 2018

did you have a server running though? the crash is in scsynth, not sclang

[now seems like i was wrong about that, sorry]

@catfact
Copy link
Collaborator

@catfact catfact commented Jun 6, 2018

and it might be related to other stuff we're doing, like grabbing bus values with the shared memory interface

@samdoshi samdoshi changed the title sclang crash on big time/date jump scsynth crash on big time/date jump Jun 6, 2018
@samdoshi
Copy link

@samdoshi samdoshi commented Jun 6, 2018

I did try booting the server, but not in depth. I've edited the bug title to say scsynth instead of sclang.

Just tried on my desktop again (no Norns yet).

$  FAKETIME_STOP_AFTER_SECONDS=30 faketime "1970-01-01" sclang

(skipped output)

sc3> thisThread.seconds;
-> 6.12997873
sc3> s.boot;
-> localhost
sc3> booting server 'localhost' on address: 127.0.0.1:57110

(skipped output)

sc3> thisThread.seconds;
-> 15.170008094
sc3> x = {SinOsc.ar}.play;
-> Synth('temp__0' : 1000)
sc3> thisThread.seconds;
-> 19.714078338
sc3> thisThread.seconds;
-> 1528310138.1781
sc3> x.free;
-> Synth('temp__0' : 1000)
sc3> x = {SinOsc.ar}.play;
-> Synth('temp__1' : 1001)
sc3> x.free;
-> Synth('temp__1' : 1001)
sc3> thisThread.seconds;
-> 1528310155.1541
sc3> 
Exiting sclang (ctrl-D)
sc3> main: waiting for input thread to join...
main: quitting...
server 'localhost' disconnected shared memory interface
'/quit' sent

cleaning up OSC

No crash with a SinOsc playing over the time jump.

I think someone that can observe a crash will need to try and minimise it. If you've got libfaketime available that should (hopefully) let you automate a crash. That would be my first step.

edit: I probably should say I can't guarantee that libfaketime is running properly inside the scsynth process that's started by sclang, I could start an external server, but I think it's easier for someone that can observe the existing crash to take over.

@artfwo
Copy link
Member

@artfwo artfwo commented Jun 6, 2018

this is reproducible only in part, even when you change the actual system time using date -s. i do not experience crashes, but the engine breaks in weird ways instead: doesn't react to commands, doesn't correctly report polls, etc. as @catfact described above.

and yeah, raspbian is already using fake-hwclock+systemd-timesyncd out of the box. we haven't touched those.

@catfact catfact changed the title scsynth crash on big time/date jump scsynth crash on wifi enable Jun 6, 2018
@catfact
Copy link
Collaborator

@catfact catfact commented Jun 6, 2018

updated title since were not actually sure what the crash mechanism is, only that it can be (very sporadically) reproduced when enabling wifi

@artfwo
Copy link
Member

@artfwo artfwo commented Jun 6, 2018

it's related to timesync, this can be verified by watching syslog when the crash happens. it's also clear that the sporadicity of the bug decreases if you hold the box switched off for significant (hours) amount of time.

@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Jun 6, 2018

Can confirm it's related to the time change, can reproduce this way:

# make sure you've synced time once, so systemd-timesyncd has a timestamp stored
sudo systemctl stop systemd-timesyncd.service
sudo date -s '2018-05-24 08:00:00'
sudo systemctl restart norns-crone.service
sudo systemctl start systemd-timesyncd.service
# now run top and see sclang take 100% CPU

This also happens why I just run sclang without the ws-wrapper, so it seems it is caused by SC and not by our code.

The odd thing is that manually changing the date doesn't seem to trigger this issue. I'm not sure why this yet. Maybe it's related to how systemd-timesync works/changes the time? There is mention of gradually adjusting the time here https://www.freedesktop.org/software/systemd/man/systemd-timesyncd.service.html

This minimalistic service will set the system clock for large offsets or slowly adjust it for smaller deltas
Maybe that doesn't work too well with sclang?

One other odd thing, I've ran into the following a couple of times:

Warning: systemd-timesyncd.service changed on disk. Run 'systemctl daemon-reload

Not sure what's causing it. I did a diff and the unit that's loaded is no different than the unit on disk 😕

@samdoshi
Copy link

@samdoshi samdoshi commented Jun 8, 2018

Apparently systemd-timesyncd will perform a similar function to fake-hwclock

It also saves the local time to disk every time the clock has been synchronized and uses this to possibly advance the system realtime clock on subsequent reboots to ensure it monotonically advances even if the system lacks a battery-buffered RTC chip.

https://www.freedesktop.org/software/systemd/man/systemd-timesyncd.service.html

I'm wondering if there is some weird interplay between them.

@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Jun 9, 2018

So, it seems this is a bug in supercollider where if the system time changes a large enough amount it'll consume 100% CPU for a longer period which will eventually result in a kernel panic.

I've been able to replicate this on my laptop as well, so it's not related to ARM/the CM3.
How to replicate on norns:

sudo systemctl stop systemd-timesyncd
sudo date -s '2018-06-01 00:00:00'
sudo systemctl restart norns-crone
sudo date -s '2018-06-01 04:00:00'

When using smaller time delta's the following happens:

  • Delta of 1 hour: brief spike (+-3 seconds) of 100% CPU and goes back to normal CPU usage after that
  • Delta of 2 hours: slightly longer spike (+-6 seconds) of 100% CPU and goes back to normal CPU usage after that
  • Delta of 3 hours: results in an even longer spike (+-12 seconds) of 100% CPU and goes back to normal CPU usage after that. Sometimes already results in a kernel panic

I'll file a bug for SuperCollider for this.

I think for now the simplest way to address this is to disable timesync/systemd-timesyncd. This does mean we won't have timesync of course, but I don't think the system time is used in many places. @artfwo mentioned system time is used for naming recordings, we should probably change that to use something else like an incrementing number for this.

@catfact
Copy link
Collaborator

@catfact catfact commented Jun 9, 2018

forgive my ignorance, but would that be the same system timed used by posix clock_gettime()?

we use that in metronomes:
https://github.com/monome/norns/blob/master/matron/src/metro.c#L218

we also use gettimeofday() for lua timing
https://github.com/monome/norns/blob/master/matron/src/weaver.c#L1127

@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Jun 9, 2018

@catfact Someone who knows C should probably answer that, but I think it sometimes is, depending on the first argument passed to it, see https://stackoverflow.com/a/12480485

@catfact
Copy link
Collaborator

@catfact catfact commented Jun 9, 2018

thanks for checking. Ok, looks like metros should be fine (CLOCK_MONOTONIC) and get_time() will need adjustment (wall clock)

@scztt
Copy link

@scztt scztt commented Jun 9, 2018

Can someone who has repro'd attach an actual crash stack to this issue?

@catfact catfact changed the title scsynth crash on wifi enable scsynth crash on system time sync Jun 9, 2018
@catfact
Copy link
Collaborator

@catfact catfact commented Jun 10, 2018

i'll keep trying but haven't been able to get a nice stack trace or even a proper crash actually.

what i get is arbitrarily long hangs where the server is just totally unresponsive. sclang output (in syslog since we're running it with systemd) is silent except when s_new, &c is failing. nothing in kern.log or /var/crash.

next will try restructuring things to work with a remote server and launching scsynth explicitly, maybe can catch more of its output.

@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Jun 10, 2018

@scztt @catfact the steps I posted above consistently cause a kernel panic for me on norns. Is that not the case for you?

@catfact
Copy link
Collaborator

@catfact catfact commented Jun 11, 2018

no, not getting kernel panics, just hangs. like last time i tried, it was running the awake script - and i got

  • a nasty audio dropout, then silence
  • then about a minute of s_new errors from sclang
  • then, sound came back and about a thousand notes tried to play at once

maybe i should take a snapshot of my filesystem since i haven't actually been using the update/sync process.

:/

@tehn
Copy link
Member Author

@tehn tehn commented Jun 11, 2018

i've seen this jam-up of events happen typically with WIFI activation or attempted activation.

this also happened with the old update method, where a long (20 seconds) os.execute was run, effectively stalling lua execution, and then all the metros would catch up and trigger at once. (os.execute should be run in a coroutine or the script should fork immediately, but the update routine has changed so this isn't an issue any longer)

@tehn
Copy link
Member Author

@tehn tehn commented Jun 28, 2018

@simonvanderveldt @artfwo is there a strong reason not to simply disable systemd-timesyncd so we can move past this in the short term?

@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Jun 28, 2018

@simonvanderveldt @artfwo is there a strong reason not to simply disable systemd-timesyncd so we can move past this in the short term?

@tehn I've been running with it disable for the last two weeks. Haven't noticed any issues because of the time being out of sync. There might be some funky issues eventually, mainly looking at security/encryption related things which can depend on some form of a somewhat sane time. But for example installing deb packages worked fine.

@artfwo I believe you were mainly concerned about stuff like log ration, right?

@ngwese
Copy link
Member

@ngwese ngwese commented Jun 28, 2018

it would be worth testing the usb sync feature - iirc the rsync command being used is trying to do deltas in order to minimize the amount of data being synced. what i’m not sure of it whether rsync is using file modification timestamps as part of its selection criteria.

@scztt
Copy link

@scztt scztt commented Jun 29, 2018

(Non-local) OSC bundles require time sync to work - if you ever have an external device sending timing-sensitive OSC messages (or, for that matter, any OSC bundles that have a timestamp at all, whether or not you really care about timing), it'll be broken if your clocks aren't synced. Maybe not an immediate need, but definitely something to be aware of.

@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Aug 13, 2018

@scztt Thanks! We can try that
I do expect we're using this one though https://github.com/supercollider/supercollider/blob/cca12ff02a774a9ea212e8883551d3565bb24a6f/server/scsynth/SC_Jack.cpp#L514

[edit] as @catfact said, if we're going to use supernova this probably won't be relevant.

@catfact
Copy link
Collaborator

@catfact catfact commented Aug 13, 2018

... not if we're using supernova? seems totally different

@tehn agree, i got one too a couple hours ago.

@scztt
Copy link

@scztt scztt commented Aug 14, 2018

@simonvanderveldt https://github.com/supercollider/supercollider/blob/d0c3e438ec9cefda19af08d01db0b536829458f5/server/supernova/sc/sc_osc_handler.cpp#L730 is the corresponding chunk for supernova.

This fix could possibly be mocked up in a very loose fashion using the /clearSched server message, which should do the same thing. If you could catch the clock change early, send a /clearSched, you might be able to keep incoherent time values out of the scheduler. This wouldn't be a reliable long-term fix, but it'll take five minutes to test and if it fixes then you'd have a pretty solid idea that the more rigorous schedule-clearing solution should work as well. I think you'd want to bundle /clearSched with a latency value of nil, which should cause it to be executed immediately.

@tehn tehn modified the milestones: 1.0.1, 1.0.2 Aug 14, 2018
@tehn
Copy link
Member Author

@tehn tehn commented Aug 14, 2018

short term solution.

  • add menu item for disable/enable audio engine. starting a wifi network session will automatically shut down audio.
sudo systemctl stop norns-crone
sudo systemctl start norns-crone
@tehn
Copy link
Member Author

@tehn tehn commented Aug 14, 2018

sudo systemctl stop norns-crone doesn't work.

we'll have to do

killall sclang
killall supernova
@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Aug 14, 2018

sudo systemctl stop norns-crone doesn't work.

@tehn Why doesn't that work?

@catfact
Copy link
Collaborator

@catfact catfact commented Aug 14, 2018

@simonvanderveldt @tehn it works for me

@catfact
Copy link
Collaborator

@catfact catfact commented Aug 14, 2018

three more points:

  • it might be cleaner to shutdown SC from within SC (s.quit, ("kill "++thisProcess.pid).unixCmd). this would also allow a slick fadeout of the audio instead of a hard stop.

  • when restarting, will have to go through the handshake procedure as in audio.restart(). beware of cairo thread safety for the audio connect animation (nothing else can be using the screen)

  • and of course the real difficulty is detecting timesync. @artfwo proposed a manual method of having a thread in matron that checks the system time delta. (this, i could help with.) other possible solutions involve adding systemd services, which is beyond me.

@tehn
Copy link
Member Author

@tehn tehn commented Aug 14, 2018

oh indeed it works, i made a mistake in my launching earlier.

slick fadeout

!!!

would be nice on shutdown as well.

@tehn tehn modified the milestones: 1.0.2, 1.0.3 Aug 27, 2018
@tehn tehn modified the milestones: 1.0.3, 1.0.4 Sep 4, 2018
@lazzarello
Copy link
Contributor

@lazzarello lazzarello commented Sep 22, 2018

I've run into reproducible screen lock ups, button latency and audio drop outs when I turn on wifi. I'm worried that all these hard resets might make the filesystem on the eMMC get corrupt. So I've been doing a reboot via SSH, which doesn't actually reboot the screen or wifi.

I wonder if disabling NTP will fix this problem? Now that I have a shell, I'm tempted to do "normal linux stuff".

@tehn
Copy link
Member Author

@tehn tehn commented Sep 22, 2018

@lazzarello when you turn on wifi and NTP sync's you MUST reboot crone. otherwise you'll experience problems.

i'm preparing an update which disables crone on wifi connection, and implemented a menu option to reboot crone.

@ngwese
Copy link
Member

@ngwese ngwese commented Oct 6, 2018

fyi: i'm running into problems with the new wifi logic.

  1. power up, load earthsea
  2. go to SYSTEM > WIFI > NETWORK
  3. select the desired network, confirm password, OK
    ...at this point i'm immediately taken to the main SYSTEM menu again (not the WIFI sub menu as previous)
    ...the loaded script say AUDIO ERROR which is somewhat expected since the engine just got stopped.
  4. navigate to SYSTEM > WIFI > NETWORK again and the status reads failed
  5. select SYSTEM > RESET AUDIO
    ...when returned to the menu and select a script they don't load, i can only get as far the the script help screen
  6. attempt to enable wifi again and i always get "password fail"
  7. second attempt at RESET AUDIO appears to work
@tehn
Copy link
Member Author

@tehn tehn commented Oct 6, 2018

thanks for the report. i need to prioritize a wifi overhaul.

the only thing that changed is that upon wifi activation crone is shut down (to be manually re-activated after connection is established.) i should add a message when activating wifi that explains this.

i'm not sure why it would take two attempts, and why wifi is still saying failed, however.

@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Dec 28, 2020

@tehn Why was this closed? I don't think we have a solution for this yet, right?

@tehn
Copy link
Member Author

@tehn tehn commented Dec 28, 2020

this was closed in the great-ticket-purge at the march meetup. likely as it wasn't reported as a problem for some time.

have there been issues with this? (AFAIK, no?)

@simonvanderveldt
Copy link
Member

@simonvanderveldt simonvanderveldt commented Dec 28, 2020

It still happens every now and then that someone runs into this. Maybe we should keep this open keep track of it? Although realistically there isn't much we'll be able to do about it ourselves.

@antonhornquist
Copy link
Collaborator

@antonhornquist antonhornquist commented Dec 28, 2020

Mentioned by @lazzarello: ” I wonder if disabling NTP will fix this problem? ”

This is probably one way to address this. Or have people manually sync upon confirming that the script/engine is restarted or something (weird ”fix” but would probably work anyhow).

@antonhornquist
Copy link
Collaborator

@antonhornquist antonhornquist commented Dec 28, 2020

And this is messy still - any engine and script using a high rate polls stall upon time change. I think there’s reason to reopen this.

@tehn
Copy link
Member Author

@tehn tehn commented Dec 28, 2020

ah, i forgot about the poll-heavy crash on startup w/ wifi attached.

@tehn tehn reopened this Dec 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
10 participants