class I/O problem causing crash with DBBC2 communication error #191
Delete response class number allocated so it isn't left hanging around. Closes #191
Could the DBBC2 comms #FAIL be because of a |
Good catch. @varenius would have to say whether dbbc_proxy is in use, but I would guess it is. If so, it might make sense to restart dbbc_proxy before restarting the DBBC2, to see if that solves the problem. Regardless of which is tried first, it may not be an independent test. However, if restarting one does not fix the problem and then restarting the other does, that may tell us something. I don't understand the details of dbbc_proxy well enough to know what makes the most sense here. |
One option would be to try to manually hammer the DBBC2 with commands and check if it's maybe only the DBBCN command that trips it up or if it errors out on every single command. However, |
That sounds like good advice. I will say that the communication works most of the time, but there are these ~30 second periods when it fails. BTW, when the communication fails (time-out, EOF, etc) dbbcn closes the current connection and opens a new one. |
At Medicina we have been running dbbc_proxy for a long time and I have never had this problem, even when leaving dbbc_proxy
running across FS restarts. I would repeat a note from Ed from November 2019. Maybe they are still using the class exchange
system or logit:
Please note also that it is critical that all versions of 'sterp' and 'erchk', local or not, must not use the
class-I/O system, particularly the logit*() family of calls to report internal errors to the log. These
programs should have their own separate reporting mechanism for internal errors. If your local version of
either of these programs is using the class-I/O system in any way, this must be corrected.
Beppe.
On Wed, 2023-02-22 at 16:16 -0800, Ed Himwich wrote:
… That sounds like good advice. I will say that the communication work most of the time, but there are these ~30 second
period when it fails. BTW, when the communication fails (time-out, EOF, etc) dbbcn closes the current connection and
opens a new one.
So perhaps the problem is more likely with the DBBC2.
It is a good point. I checked their setup (on one system anyway) and it doesn't look like they have local versions of sterp or erchk. So apparently this is not an issue for them. This issue with sterp and erchk is actually more complex than I initially realized when I started this reply. It is going to require more time than I have right now to address it. I will try to get back to it in a few days. It will probably need its own issue. |
In 2014, a similar bug in tpicd was fixed for version 9.11.6 in commit eef9c44 and reported in fs9116up.txt. That problem was originally reported by Michael Lindqvist and Roger Hammargren at Onsala. I looked through the current FS code and did not find any more instances of this issue for DBBC2/DBBC3. The code for handling this issue isn't always the same, but it all looked okay. There were three places in onoff where a class number was cleared redundantly if there was an error. It is possible that the redundant clear could release a class number that has already been allocated again and is in use, but the cross section is pretty small: it would require the scheduler to interrupt execution between the two clears. It is very rare, but has been known to happen. They are now fixed with commit b38de51. Addendum: The bugs in chekr were introduced in commits 8e59c7c (January 2016, FS 9.11.9) and 0fd3f96 (April 2019, 9.13.1), well after the bug in tpicd was fixed. |
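To illustrate why a redundant clear is hazardous, here is a minimal, self-contained sketch. It is not FS code: the pool, `demo_alloc`, `demo_clear`, and `other_class_survives` are hypothetical names, and the lowest-free-number allocation policy is an assumption made only so the reuse is deterministic. It shows a stale class number being cleared a second time after another user has been handed the same number, and one common guard (zeroing the local copy after the first clear), which is not necessarily what commit b38de51 does.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N_CLASS 40                      /* pool size, as quoted in this thread */

static bool allocated[N_CLASS + 1];     /* index 0 unused; 0 means "no class" */

/* Hypothetical allocator: hands out the lowest free number, 0 if none. */
static int demo_alloc(void) {
    for (int n = 1; n <= N_CLASS; n++) {
        if (!allocated[n]) { allocated[n] = true; return n; }
    }
    return 0;
}

/* Hypothetical clear: releasing class 0 is a harmless no-op. */
static void demo_clear(int n) {
    assert(n >= 0 && n <= N_CLASS);
    if (n > 0) allocated[n] = false;
}

/* Returns true if the other user's class number is still allocated
 * after our error path performs its redundant second clear. */
bool other_class_survives(bool zero_after_clear) {
    memset(allocated, 0, sizeof allocated);

    int mine = demo_alloc();            /* we allocate a response class */
    demo_clear(mine);                   /* first clear (error path) */
    if (zero_after_clear) mine = 0;     /* guard: forget the stale number */

    int other = demo_alloc();           /* scheduler runs another program,
                                           which is handed the same number */
    demo_clear(mine);                   /* redundant second clear */

    return allocated[other];
}
```

Without the guard, the second clear releases the number the other program now owns; with the guard, the second clear is a no-op.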
I am going to push a commit that will close this issue. If new information comes to light, please feel free to reopen it. |
@varenius at station On reported crashes with 10.1.0-beta1. The latest example:
A classic class number eating crash report. Prior to that there were error messages like:
It seems that the DBBC2 is having communication issues and that is exciting a bug. The key to the crashing appears to be the ch -810 error, which comes from chekr when it is trying to check the DBBC2 firmware personality and version. It makes the check every 20 seconds (most of the checks do not encounter an error). This suggests that a class number is being left allocated each time the error occurs and eventually the class numbers are exhausted. The ch -810 error occurs about 35 times before each crash (there is a maximum of 40 class numbers). Examining the code reveals that, sure enough, the class number is not being cleared for that error in two places:
fs/chekr/dbbcchk.c, lines 64 to 76 in 1ce0cb7
and
fs/chekr/dbbcchk.c, lines 115 to 127 in 1ce0cb7
In both cases, inside the if clause there needs to be a nested if clause:
That change has now run without a crash for 160 of the ch -810 errors. So this is probably the cause. There does not seem to be this kind of error in the code for the cd -1 error.
The next question is why the DBBC2 is having communication issues. It seems to go through ~30 second stretches where it signals EOF on the connection when queried. Of course, calibration data is being lost each time the cd -1 error occurs. At the next opportunity, @varenius plans to reboot the DBBC2 to see if that will improve the situation.
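The exhaustion arithmetic described above can be sketched as a small simulation. This is not the FS class-I/O implementation or the actual chekr fix: `class_pool`, `pool_alloc`, `pool_clear`, `failing_check`, and `errors_survived` are hypothetical names, and the figure of 5 class numbers already held by other users is an assumption chosen only to match the roughly 35 errors observed before each crash with a pool of 40.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CLASS 40                    /* pool size quoted in the report */

typedef struct {
    bool in_use[MAX_CLASS];
} class_pool;

/* Hypothetical allocator: returns 1..MAX_CLASS, or 0 when exhausted. */
static int pool_alloc(class_pool *p) {
    for (int i = 0; i < MAX_CLASS; i++) {
        if (!p->in_use[i]) { p->in_use[i] = true; return i + 1; }
    }
    return 0;
}

/* Hypothetical clear: releases the number and zeroes the caller's copy. */
static void pool_clear(class_pool *p, int *class_no) {
    assert(*class_no >= 0 && *class_no <= MAX_CLASS);
    if (*class_no > 0) p->in_use[*class_no - 1] = false;
    *class_no = 0;
}

/* One simulated check that ends in an error. Returns false if no
 * class number could be allocated (the "crash" condition). */
static bool failing_check(class_pool *p, bool clear_on_error) {
    int out_class = pool_alloc(p);
    if (out_class == 0) return false;
    /* error path: the leaky version returns without clearing */
    if (clear_on_error) pool_clear(p, &out_class);
    return true;
}

/* Count how many failing checks run before the pool is exhausted,
 * stopping at 'limit' if exhaustion never happens. */
int errors_survived(int already_in_use, bool clear_on_error, int limit) {
    class_pool p = {{false}};
    for (int i = 0; i < already_in_use; i++) pool_alloc(&p);

    int n = 0;
    while (n < limit && failing_check(&p, clear_on_error)) n++;
    return n;
}
```

With 5 numbers assumed to be held elsewhere, the leaky path survives 35 errors before allocation fails, while the path that clears on error runs to whatever limit is given.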