Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

class I/O problem causing crash with DBBC2 communication error #191

Closed
wehimwich opened this issue Feb 22, 2023 · 8 comments
Closed

class I/O problem causing crash with DBBC2 communication error #191

wehimwich opened this issue Feb 22, 2023 · 8 comments

Comments

@wehimwich
Copy link
Member

@varenius at station On reported crashes with 10.1.0-beta1. The latest example:

...
11:51:47/scan_check/?,r11091_on_052-1149b,VDIF,32,2023y52d11h49m58.2375s,107.762s,16.000000,0,8032
11:51:47/fb_status/status,0x00000001
11:51:47/fb/!net_protocol? 0 : udpsnor : 32000000 : 256000000 : 4 ;
11:51:47/fb/!mtu? 0 : 9000 ;
11:51:47/fb/!rtime? 0 : 3.17564e+06s : 204054GB : 47.2425% : VDIF : 32 : 0MHz : 514.048Mbps ;
11:51:47/fb_mode/VDIF_8000-512-16-2,,,((16.)),VDIF,32,16.,8000
11:51:47/fb/!scan_set= 0 ;
11:51:47/fb/!disk2file= 1 ;
11:51:52:setupsx
cls_alc_s: out of class numbers
cls_alc_s error in cls_snd()
dbbcn terminated-Exit code 255
mcbcn killed
boss killed
matcn killed
chekr killed
ibcon killed
fivpt killed
onoff killed
...

A classic class number eating crash report. Prior to that there were error message like:

2023.052.05:30:37.40?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:30:37.40?ERROR cd -1 Error from DBBCN in TPICD, see above for error.
2023.052.05:30:40.42?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:30:40.42?ERROR ch -810 Communication error for DBBC.
2023.052.05:30:44.22?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:30:44.22?ERROR cd -1 Error from DBBCN in TPICD, see above for error.
2023.052.05:30:54.22?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:30:54.22?ERROR cd -1 Error from DBBCN in TPICD, see above for error.
2023.052.05:31:03.43?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:31:03.43?ERROR ch -810 Communication error for DBBC.
2023.052.05:31:06.46?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:31:06.46?ERROR cd -1 Error from DBBCN in TPICD, see above for error.
2023.052.05:31:12.24?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:31:12.24?ERROR cd -1 Error from DBBCN in TPICD, see above for error.

It seems that the DBBC2 is having communication issues and that is exciting a bug. The key to the crashing appears to be the ch -810 error, which comes from chekr when is trying to check the DBBC2 firmware personality and version. It makes the check every 20 seconds (most of the checks do not encounter an error). This suggests that a class number is being left allocated each time the error occurs and eventually the number of class numbers is exhausted. The ch -810 error occurs about ~35 times before each crash (there is a maximum of 40 class numbers). Examining the code reveals, that sure enough, the class number is not being cleared for that error in two places:

fs/chekr/dbbcchk.c

Lines 64 to 76 in 1ce0cb7

nsem_take( "fsctl" , 0 );
skd_run("dbbcn",'w',ip);
nsem_put( "fsctl" );
skd_par(ip);
if(ip[2]<0) {
logita(NULL,ip[2],ip+3,ip+4);
logit(NULL,-810,lwho);
return;
}

and

fs/chekr/dbbcchk.c

Lines 115 to 127 in 1ce0cb7

nsem_take( "fsctl" , 0 );
skd_run("dbbcn",'w',ip);
nsem_put( "fsctl" );
skd_par(ip);
if(ip[2]<0) {
logita(NULL,ip[2],ip+3,ip+4);
logit(NULL,-810,lwho);
return;
}

In both cases, insie the if clause there needs to be a nested if clause:

      if(ip[0]!=0) {
        cls_clr(ip[0]);
        ip[0]=ip[1]=0;
      }

That change has now run without a crash for 160 of the ch -810 errors. So this is probably the cause. There does not seem to be this kind of error in the code for cd -1 error.

The next question is why is the DBBC2 having communication issues. It seems to go through ~30 second stretches where it signals EOF on the connection when queried. Of course, calibration data is being lost each time the cd -1 occurs. At the next opportunity, @varenius plans to reboot the DBBC2 to see if that will improve the situation.

wehimwich added a commit that referenced this issue Feb 22, 2023
Delete response class number allocated so it isn't left hanging
around.

Closes #191
@haavee
Copy link
Contributor

haavee commented Feb 22, 2023

Could the DBBC2 comms #FAIL be because of a dbbc_proxy inbetween FS and actual DBBC2 that's misbehaving?

@wehimwich
Copy link
Member Author

Good catch. @varenius would have to say if dbbc_proxy is use, but I would guess it is. If so, it might make sense, before restarting the DBBC2, to restart the dbbc_proxy to see if that solves the problem. Regardless if which is tried first, it may not be an independent test. However, if restarting one did not fix the problem and then restarting the other does fix it, that may tell us something. I don't understand the details of dbbc_proxy to know what makes the most sense here.

@haavee
Copy link
Contributor

haavee commented Feb 22, 2023

One option would be to try to manually hammer the DBBC2 with commands and check if it's maybe only the DBBCN command that trips it up or if it errors out on every single command. However, dbbc_proxy's been in use for years so why it would, all of a sudden, start to misbehave I don't know. But maybe a different Python version is used, who knows. Typically, dbbc_proxy might report errors on its console or in the syslog depending on how it was started. Thinking about it, @varenius should possibly try to look at those first to see if the proxy itself sais anything.

@wehimwich
Copy link
Member Author

wehimwich commented Feb 23, 2023

That sounds like good advice. I will say that the communication works most of the time, but there are these ~30 second periods when it fails. BTW, when the communication fails (time-out, EOF, etc) dbbcn closes the current connection and opens a new one.

@BeppeMc
Copy link

BeppeMc commented Feb 23, 2023 via email

wehimwich added a commit that referenced this issue Feb 23, 2023
Delete response class number allocated so it isn't left hanging
around.

Closes #191
@wehimwich
Copy link
Member Author

wehimwich commented Feb 24, 2023

In Medicina we are running dbbc_proxy from long time ago and I didn't ever get this problem, even leaving dbbc_proxy running in between FS restart.

So perhaps the problem is more likely with the DBBC2.

I would repeat a Ed note of November 2019. May be they are still using class exchange system or logit: Please note also that it is critical that all versions of 'sterp' and 'erchk', local or not, must not use the class-I/O system, particularly the logit*() family of calls to report internal errors to the log.

It is a good point. I checked their setup (on one system anyway) and it doesn't look like they have local versions of sterp or erchk. So apparently this is not an issue for them.

This issue with sterp and erhck is actually more complex than I initially realized when I started this reply. It is going to require more time than I have right now to address it. I will try to get back to it in few days. It will probably need its own issue.

@wehimwich
Copy link
Member Author

wehimwich commented Feb 26, 2023

In 2014, a similar bug in tpicd was fixed for version 9.11.6 in commit eef9c44 and reported in fs9116up.txt. That problem was originally reported by Michael Lindqvist and Roger Hammargren at Onsala.

I looked through the current FS code and did not find any more instances of this issue for DBBC2/DBBC3. The code isn't always the same for handling this issue but it all looked okay.

There were three places in onoff where a class number was cleared redundantly if there was an error. It is possible that the code could clear a class number that is already in use again, but the cross section is pretty small and would require the scheduler interrupting execution between the two clears. It is very rare, but has been known to happen. They are now fixed with commit b38de51.

Addendum: The bugs in chekr were introduced in commits 8e59c7c (January 2016, FS 9.11.9) and 0fd3f96 (April 2019, 9.13.1), well after the bug in tpicd was fixed.

@wehimwich
Copy link
Member Author

I am going to push a commit that will close this issue. If new information comes to light, please feel free to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants