class I/O problem causing crash with DBBC2 communication error #191

wehimwich · 2023-02-22T21:24:18Z

@varenius at station On reported crashes with 10.1.0-beta1. The latest example:

...
11:51:47/scan_check/?,r11091_on_052-1149b,VDIF,32,2023y52d11h49m58.2375s,107.762s,16.000000,0,8032
11:51:47/fb_status/status,0x00000001
11:51:47/fb/!net_protocol? 0 : udpsnor : 32000000 : 256000000 : 4 ;
11:51:47/fb/!mtu? 0 : 9000 ;
11:51:47/fb/!rtime? 0 : 3.17564e+06s : 204054GB : 47.2425% : VDIF : 32 : 0MHz : 514.048Mbps ;
11:51:47/fb_mode/VDIF_8000-512-16-2,,,((16.)),VDIF,32,16.,8000
11:51:47/fb/!scan_set= 0 ;
11:51:47/fb/!disk2file= 1 ;
11:51:52:setupsx
cls_alc_s: out of class numbers
cls_alc_s error in cls_snd()
dbbcn terminated-Exit code 255
mcbcn killed
boss killed
matcn killed
chekr killed
ibcon killed
fivpt killed
onoff killed
...

A classic class number eating crash report. Prior to that there were error message like:

2023.052.05:30:37.40?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:30:37.40?ERROR cd -1 Error from DBBCN in TPICD, see above for error.
2023.052.05:30:40.42?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:30:40.42?ERROR ch -810 Communication error for DBBC.
2023.052.05:30:44.22?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:30:44.22?ERROR cd -1 Error from DBBCN in TPICD, see above for error.
2023.052.05:30:54.22?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:30:54.22?ERROR cd -1 Error from DBBCN in TPICD, see above for error.
2023.052.05:31:03.43?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:31:03.43?ERROR ch -810 Communication error for DBBC.
2023.052.05:31:06.46?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:31:06.46?ERROR cd -1 Error from DBBCN in TPICD, see above for error.
2023.052.05:31:12.24?ERROR db -107 dbbcn: EOF on read, dbbc may have crashed, connection closed
2023.052.05:31:12.24?ERROR cd -1 Error from DBBCN in TPICD, see above for error.

It seems that the DBBC2 is having communication issues and that is exciting a bug. The key to the crashing appears to be the ch -810 error, which comes from chekr when is trying to check the DBBC2 firmware personality and version. It makes the check every 20 seconds (most of the checks do not encounter an error). This suggests that a class number is being left allocated each time the error occurs and eventually the number of class numbers is exhausted. The ch -810 error occurs about ~35 times before each crash (there is a maximum of 40 class numbers). Examining the code reveals, that sure enough, the class number is not being cleared for that error in two places:

fs/chekr/dbbcchk.c

Lines 64 to 76 in 1ce0cb7

    
           nsem_take( "fsctl" , 0 ); 
        
           skd_run("dbbcn",'w',ip); 
        
           nsem_put( "fsctl" ); 
        
           skd_par(ip); 
        
           if(ip[2]<0) { 
        
             logita(NULL,ip[2],ip+3,ip+4); 
        
             logit(NULL,-810,lwho); 
        
             return; 
        
           }

and

fs/chekr/dbbcchk.c

Lines 115 to 127 in 1ce0cb7

    
           nsem_take( "fsctl" , 0 ); 
        
           skd_run("dbbcn",'w',ip); 
        
           nsem_put( "fsctl" ); 
        
           skd_par(ip); 
        
           if(ip[2]<0) { 
        
             logita(NULL,ip[2],ip+3,ip+4); 
        
             logit(NULL,-810,lwho); 
        
             return; 
        
           }

In both cases, insie the if clause there needs to be a nested if clause:

      if(ip[0]!=0) {
        cls_clr(ip[0]);
        ip[0]=ip[1]=0;
      }

That change has now run without a crash for 160 of the ch -810 errors. So this is probably the cause. There does not seem to be this kind of error in the code for cd -1 error.

The next question is why is the DBBC2 having communication issues. It seems to go through ~30 second stretches where it signals EOF on the connection when queried. Of course, calibration data is being lost each time the cd -1 occurs. At the next opportunity, @varenius plans to reboot the DBBC2 to see if that will improve the situation.

The text was updated successfully, but these errors were encountered:

Delete response class number allocated so it isn't left hanging around. Closes #191

haavee · 2023-02-22T23:03:37Z

Could the DBBC2 comms #FAIL be because of a dbbc_proxy inbetween FS and actual DBBC2 that's misbehaving?

wehimwich · 2023-02-22T23:14:27Z

Good catch. @varenius would have to say if dbbc_proxy is use, but I would guess it is. If so, it might make sense, before restarting the DBBC2, to restart the dbbc_proxy to see if that solves the problem. Regardless if which is tried first, it may not be an independent test. However, if restarting one did not fix the problem and then restarting the other does fix it, that may tell us something. I don't understand the details of dbbc_proxy to know what makes the most sense here.

haavee · 2023-02-22T23:32:02Z

One option would be to try to manually hammer the DBBC2 with commands and check if it's maybe only the DBBCN command that trips it up or if it errors out on every single command. However, dbbc_proxy's been in use for years so why it would, all of a sudden, start to misbehave I don't know. But maybe a different Python version is used, who knows. Typically, dbbc_proxy might report errors on its console or in the syslog depending on how it was started. Thinking about it, @varenius should possibly try to look at those first to see if the proxy itself sais anything.

wehimwich · 2023-02-23T00:16:41Z

That sounds like good advice. I will say that the communication works most of the time, but there are these ~30 second periods when it fails. BTW, when the communication fails (time-out, EOF, etc) dbbcn closes the current connection and opens a new one.

BeppeMc · 2023-02-23T07:04:06Z

In Medicina we are running dbbc_proxy from long time ago and I didn't ever get this problem, even leaving dbbc_proxy running in between FS restart. I would repeat a Ed note of November 2019. May be they are still using class exchange system or logit: Please note also that it is critical that all versions of 'sterp' and 'erchk', local or not, must not use the class-I/O system, particularly the logit*() family of calls to report internal errors to the log. These programs should have their own separate reporting mechanism for internal errors. If your local version of either of these programs is using the class-I/O system in any way, this must be corrected. Beppe.On Wed, 2023-02-22 at 16:16 -0800, Ed Himwich wrote:

…

That sounds like good advice. I will say that the communication work most of the time, but there are these ~30 second period when it fails. BTW, when the communication fails (time-out, EOF, etc) dbbcn closes the current connection and opens a new one. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: ***@***.*** com> [ { ***@***.***": "http://schema.org", ***@***.***": "EmailMessage", "potentialAction": { ***@***.***": "ViewAction", "target": "#191 (comment)", "url": "#191 (comment)", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { ***@***.***": "Organization", "name": "GitHub", "url": "https://github.com" } } ]

Delete response class number allocated so it isn't left hanging around. Closes #191

wehimwich · 2023-02-24T19:56:53Z

In Medicina we are running dbbc_proxy from long time ago and I didn't ever get this problem, even leaving dbbc_proxy running in between FS restart.

So perhaps the problem is more likely with the DBBC2.

I would repeat a Ed note of November 2019. May be they are still using class exchange system or logit: Please note also that it is critical that all versions of 'sterp' and 'erchk', local or not, must not use the class-I/O system, particularly the logit*() family of calls to report internal errors to the log.

It is a good point. I checked their setup (on one system anyway) and it doesn't look like they have local versions of sterp or erchk. So apparently this is not an issue for them.

This issue with sterp and erhck is actually more complex than I initially realized when I started this reply. It is going to require more time than I have right now to address it. I will try to get back to it in few days. It will probably need its own issue.

wehimwich · 2023-02-26T16:04:44Z

In 2014, a similar bug in tpicd was fixed for version 9.11.6 in commit eef9c44 and reported in fs9116up.txt. That problem was originally reported by Michael Lindqvist and Roger Hammargren at Onsala.

I looked through the current FS code and did not find any more instances of this issue for DBBC2/DBBC3. The code isn't always the same for handling this issue but it all looked okay.

There were three places in onoff where a class number was cleared redundantly if there was an error. It is possible that the code could clear a class number that is already in use again, but the cross section is pretty small and would require the scheduler interrupting execution between the two clears. It is very rare, but has been known to happen. They are now fixed with commit b38de51.

Addendum: The bugs in chekr were introduced in commits 8e59c7c (January 2016, FS 9.11.9) and 0fd3f96 (April 2019, 9.13.1), well after the bug in tpicd was fixed.

wehimwich · 2023-02-26T21:13:19Z

I am going to push a commit that will close this issue. If new information comes to light, please feel free to reopen it.

wehimwich added a commit that referenced this issue Feb 22, 2023

Fix bug for DBBC2 communicaton error with chekr

c3a15ee

Delete response class number allocated so it isn't left hanging around. Closes #191

wehimwich added a commit that referenced this issue Feb 23, 2023

Fix bug for DBBC2 communicaton error with chekr

7f8b634

Delete response class number allocated so it isn't left hanging around. Closes #191

wehimwich closed this as completed in 0094a3c Feb 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

class I/O problem causing crash with DBBC2 communication error #191

class I/O problem causing crash with DBBC2 communication error #191

wehimwich commented Feb 22, 2023

haavee commented Feb 22, 2023

wehimwich commented Feb 22, 2023

haavee commented Feb 22, 2023

wehimwich commented Feb 23, 2023 •

edited

Loading

BeppeMc commented Feb 23, 2023 via email

wehimwich commented Feb 24, 2023 •

edited

Loading

wehimwich commented Feb 26, 2023 •

edited

Loading

wehimwich commented Feb 26, 2023

class I/O problem causing crash with DBBC2 communication error #191

class I/O problem causing crash with DBBC2 communication error #191

Comments

wehimwich commented Feb 22, 2023

haavee commented Feb 22, 2023

wehimwich commented Feb 22, 2023

haavee commented Feb 22, 2023

wehimwich commented Feb 23, 2023 • edited Loading

BeppeMc commented Feb 23, 2023 via email

wehimwich commented Feb 24, 2023 • edited Loading

wehimwich commented Feb 26, 2023 • edited Loading

wehimwich commented Feb 26, 2023

wehimwich commented Feb 23, 2023 •

edited

Loading

wehimwich commented Feb 24, 2023 •

edited

Loading

wehimwich commented Feb 26, 2023 •

edited

Loading