MDSplus dispatcher opens lots of files #2731
I think you need to add ...

I am happy to meet this afternoon to discuss...
According to Mitchell's post on another issue, Atlas has been running ...
There's likely more to this than open files, as enabling core dumps against the ... I haven't looked at this too closely, but for the sake of notes I'll just quickly post the ...

EDIT: Note that, as of the time of this edit, the following ...

A current ...
Systemd contents:

mds-8002: ...

mds-8003: ...
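The actual unit files did not survive in this thread. For illustration only, a minimal sketch of what a systemd unit for an mdsip action server of this kind might look like; the port, paths, and option values below are assumptions, not the atlas configuration:

```ini
# Hypothetical example only -- not the actual atlas unit file.
[Unit]
Description=MDSplus action server on port 8002
After=network.target

[Service]
# -p port, -m multi-connection server mode, -h access-control hosts file
ExecStart=/usr/local/mdsplus/bin/mdsip -p 8002 -m -h /etc/mdsip.hosts
Restart=always
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```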
OS (atlas.gat.com): ...
MDSplus Version: ...
To explain ...

Please note that the error mentioned occurs after ~1,000 action nodes are triggered. This happens whether it is 1k actions in a single shot or spread across multiple shots.
Thanks for all the info.
Hi @sflanagan -- Josh has succeeded in reproducing the issue with the current alpha on Ubuntu. We are now debugging the problem and will post additional updates here later this afternoon.
Took some time to configure my dev environment, but I am now also able to reproduce the bug. I will now attach the debugger and figure out how the code is leaving dangling sockets.
Progress has been made on debugging the issue, but there is still more to do to locate the root cause of the dangling sockets. I have provided my initial findings to Josh and Stephen for review and suggestions.
@mwinkel-dev Glad you were able to reproduce. Did the workaround proposed by @joshStillerman work?
Hi @smithsp and @sflanagan -- I will test Josh's workaround right now. I'm sure it will work. I have been manually killing the ... process. Current investigation shows that the dangling sockets gradually grow until the process that launches the actions (such as ...) hits its open-file limit.
Hi @smithsp and @sflanagan -- I have tested Josh's recommended workaround and it is A-OK. Here are the details of the experiment . . .
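The experiment details did not survive in this thread. One way to observe the dangling sockets during such an experiment, assuming the action server is an mdsip process (the process pattern and port below are assumptions):

```sh
# Count open TCP sockets held by the action-server process.
# Substitute the actual mdsip/dispatcher invocation on your system.
PID=$(pgrep -f 'mdsip -p 8002' | head -n1)
lsof -p "$PID" | grep -c TCP
# Repeat after dispatching a batch of actions; a steadily growing
# count indicates leaked "to" sockets.
```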
In short, Josh's workaround works fine on my Ubuntu 20 development system. Let us know if you have any problems with this workaround on GA's RHEL8 server.
Hi @smithsp and @sflanagan -- Root cause of the dangling sockets in ... has been identified. It will take us a few days to fix the issue (i.e., iterate on the design, manual testing, and peer review). While investigating this issue, I have also been using the systemd workaround (so I have a clean slate for each experiment). The workaround is working fine on my development system. Let me know if you have any problems with the systemd workaround on GA's server.

Addendum: ...
Hi @sflanagan -- How many "actions" are issued when running a single DIII-D shot? How many "action servers" are involved? And how many shots per day? I would like that information so that I can design a test case that puts many times more load on the "action server" than the DIII-D load does.
Assuming, as an upper bound, that everything is enabled and ops is having a long and/or productive run day...

Actions per shot: ~300
Action servers: 1
Shots per day: ~50
I redesigned my ... Where the ... It seems "fine" as a temporary workaround, but it does leave me open to skipping/missing a shot if the dispatcher service is restarting while the DAQ system is trying to start a new shot cycle. Trying to reduce that risk is next up on my agenda (if I can).
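A sketch of one way to implement a periodic-restart workaround of the kind described, using a systemd timer; the unit names and the restart interval are assumptions:

```ini
# dispatcher-restart.timer -- hypothetical unit name and interval.
[Unit]
Description=Periodically restart the MDSplus dispatcher to clear leaked sockets

[Timer]
OnCalendar=hourly
Unit=dispatcher-restart.service

[Install]
WantedBy=timers.target

# --- dispatcher-restart.service (separate file) ---
# Oneshot that bounces the dispatcher unit; substitute the real unit name.
# [Service]
# Type=oneshot
# ExecStart=/usr/bin/systemctl restart mds-8002.service
```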
Hi @sflanagan -- Thanks for the information about the number of actions in a DIII-D shot. And for the news regarding the ...
Now have a prototype fix. Next step is to do some manual stress testing, followed by peer review. Will take a few more days before there is a final fix that will be merged to alpha and cherry-picked to the GA branch.
Hi @smithsp and @sflanagan,

Summary
This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the ...

Details
The issue is that the "connection ID table" in the ... The two workarounds clean up the dangling "to" sockets. Adding an "exit" statement at the end of a ...

The proposed fix does not alter any logic regarding sockets and connections. All that it does is allow the "receiver" thread to access the "connection ID table" so that the "to" socket can be closed. Preliminary testing confirms that the fix works and that ...

Low Risk?
... And because the problem appears to be limited to the ...

Note also that this bug has been present in the ...

Clarification
...
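Because the details above were truncated, here is a schematic sketch of the mechanism being described: a receiver thread that looks up an action's entry in a shared "connection ID table" and closes its "to" socket under a lock. All names and structures below are assumptions for illustration, not the actual MDSplus code:

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical connection-ID table entry; the real MDSplus structures differ. */
typedef struct {
    int id;
    int to_socket;   /* socket used to send the action result back */
    int in_use;
} conn_entry_t;

#define MAX_CONN 1024
static conn_entry_t conn_table[MAX_CONN];
static pthread_mutex_t conn_table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Close and release the "to" socket for a finished action.
 * Holding the lock keeps the dispatch and receiver threads from
 * racing on the same table slot (the unprotected critical section
 * discussed later in this thread). */
void close_to_socket(int conn_id)
{
    pthread_mutex_lock(&conn_table_lock);
    for (int i = 0; i < MAX_CONN; i++) {
        if (conn_table[i].in_use && conn_table[i].id == conn_id) {
            close(conn_table[i].to_socket);
            conn_table[i].in_use = 0;
            break;
        }
    }
    pthread_mutex_unlock(&conn_table_lock);
}
```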
Hi @sflanagan -- Questions about your site's single "action server" . . .
...

Answers to the above questions will enable us to set up a cross-version testing environment, if needed.
A stress test of 1,200 actions (400 actions per shot, 3 shots) reveals that there is an unprotected critical section that can cause a segfault. The segfault also occurs in the ...

Adding a delay of 0.01 seconds to each action eliminates the segfault, but causes the proposed fix to leave ~10 dangling sockets (i.e., it closes the other 1,190 "to" sockets). Additional investigation is required to determine if the dangling sockets are caused by a second unprotected critical section, and if so, whether it was introduced with the proposed fix. Increasing the delay to 0.1 seconds allows the proposed fix to work fine -- no dangling sockets.
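For reference, a stress loop of the kind described might look like the following, where dispatch_action is a hypothetical helper standing in for whatever mechanism actually dispatches an action:

```sh
# Hypothetical stress loop: 400 actions per shot, 3 shots, with the
# per-action delay (0.01 s here) discussed above.
for shot in 1 2 3; do
    for action in $(seq 1 400); do
        ./dispatch_action "$shot" "$action" &   # hypothetical helper
        sleep 0.01
    done
done
wait
```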
I slowly remember the struggle we had sorting out the existing problem at the time while keeping backward compatibility. The issue was that the old protocol required a reconnection on an offered port to send the result of the action. The original goals that I remember were to support: ...
Hi @zack-vii -- Thanks for the history of the change. Much appreciated! I will expand my experiments to see if my proposed fix for the leaking sockets breaks any of the features you listed above.
When configured ..., the actions are failing because their associated "job callback" functions are misbehaving. Now investigating whether the "job callback" problem is a multi-threading issue.
Hi @smithsp and @sflanagan, I am now compiling and testing the proposed fix (PR #2740) that was created by @zack-vii. Results of the test will be posted in that PR and summarized in this issue. Note that @zack-vii was the author of PRs #2288 and #2289 that involved the ...
PR #2740 by @zack-vii is a server-side fix and thus more elegant than the client-side one (i.e., ...). However, there is an edge case in ... The complete fix of this issue will likely consist of two or three PRs.
@mwinkel-dev @zack-vii |
Note that the provided workaround of letting the system restart the service is not robust. We look forward to a fix as soon as possible.
Hi @smithsp -- Thank you for the update. I will put that news on the agenda for Tuesday's meeting of the MDSplus software team. |
Hi @smithsp and @sflanagan, Although the temporary workarounds are not an adequate solution for GA, it would nonetheless be helpful for us to learn what aspect of the workaround is failing for GA. (That information might help us create additional test cases for the real fix.) On 1-Apr-2024, @sflanagan posted that the temporary fix does mean that a shot could be missed (which of course is a serious problem). However, are there now additional problems with the temporary workarounds? Also, which of these workarounds has GA tried? ...

where "action_server" is whatever the name is of the ...
Hi @smithsp and @sflanagan, It is my understanding that the GA branch should be based on ...
@mwinkel-dev The version is in #2731 (comment) (repeated here): ...
Hi @smithsp, Apparently, I need to drink more coffee and wake up. Which is to say that I'm amused to learn from your post that the version information was there all along, had I just scrolled up to the top of this GitHub issue. Regardless, thanks for confirming the version again. The reason I was double-checking was to make sure we create the GA branch from the correct commit. Today, we will be cherry-picking PR #2735 and PR #2740 into the branch. Here are the tasks: ...
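The task list itself was lost, but the branch-and-cherry-pick workflow described would look roughly like this; the branch name, base commit, and merge SHAs are placeholders:

```sh
# Create the GA branch from the commit currently deployed on atlas
# (placeholder values), then cherry-pick the two fixes.
git checkout -b ga_atlas <commit-on-atlas>
git cherry-pick <sha-of-PR-2735-merge>
git cherry-pick <sha-of-PR-2740-merge>
git push origin ga_atlas
```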
If all goes according to plan, the above tasks will all be completed this afternoon or evening. When the release is ready, GA will have two options: ...
Trial run of GA branch / build is underway. Now have RPMs on a Rocky8 system for testing on Friday. Some minor build issues still to be addressed (major-minor-release labelling scheme for the branch, and so forth). |
Hi @smithsp and @sflanagan, What "tagging" scheme would you like us to use for the GA branch (i.e., values for "branch-major-minor-release")? GA release zero is identical to ... What should the next release of the GA branch be called? (This will be the release with the cherry-picks of PR #2735 and PR #2740.)
I propose the next release be labeled with ...
Hi @smithsp and @sflanagan, Thanks for the answer. We will start with ...
Just dispatched 2,000 actions on Rocky Linux 8.9 without leaking any sockets. Details of the trial branch build and test . . .
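The build and test details were lost; one simple way to verify "no leaked sockets" after such a run is to compare the process's file-descriptor count before and after (DISPATCHER_PID is a placeholder):

```sh
# Snapshot the dispatcher's fd count before and after dispatching the
# 2,000 actions; the two numbers should match if nothing leaked.
before=$(ls /proc/"$DISPATCHER_PID"/fd | wc -l)
# ... dispatch the 2,000 actions ...
after=$(ls /proc/"$DISPATCHER_PID"/fd | wc -l)
echo "fds before=$before after=$after"
```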
Many of the preceding posts by @mwinkel-dev were initial conjectures and thus incorrect. Now that the investigation has been completed, here is a summary of the root cause.

It is expensive to create connections, and thus the architecture of ... reuses the dispatch connection. In normal usage, ...

And this is how the software changed over the years . . .

Release ...

PR #2288 refactored the software and inadvertently omitted reuse of the dispatch connection. It created a new connection for each action dispatched, and thus leaked sockets.

The proposed fix created by @mwinkel-dev incorrectly assumed that the architecture was indeed supposed to create a new connection per action, and also incorrectly assumed that there was a receiver thread created for each action. (Maintenance developers rely on comments in the source code to guide them; when those comments are missing, it is easy to make bad assumptions.) This proposed fix did solve the leaking-socket issue, but had other problems (such as deadlock).

PR #2740 corrects the omission of PR #2288 and reuses the dispatch connection, thus eliminating the leaking sockets. Although some code cleanup was done on the "action service", the primary fix is actually in the client-side code.
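To illustrate the architectural point above, here is a minimal sketch contrasting the two patterns. All function and variable names are hypothetical stand-ins, not the MDSplus source:

```c
#include <stdio.h>

/* Stubs standing in for the real networking calls (assumptions). */
static int next_fd = 3;
static int connect_to_server(const char *server)
{
    printf("new connection to %s (fd %d)\n", server, next_fd);
    return next_fd++;
}
static void send_action(int conn, const char *action)
{
    printf("fd %d <- %s\n", conn, action);
}

/* Leaky pattern (what PR #2288 inadvertently introduced): a brand-new
 * connection per dispatched action, never closed or reused. */
void dispatch_action_leaky(const char *server, const char *action)
{
    int conn = connect_to_server(server);
    send_action(conn, action);
}

/* Intended pattern (restored by PR #2740): create the dispatch
 * connection once, then reuse it for every subsequent action. */
static int dispatch_conn = -1;
void dispatch_action(const char *server, const char *action)
{
    if (dispatch_conn < 0)
        dispatch_conn = connect_to_server(server);
    send_action(dispatch_conn, action);
}

int main(void)
{
    for (int i = 0; i < 3; i++)
        dispatch_action("atlas", "some_action");  /* one connection total */
    return 0;
}
```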
@mwinkel-dev ...
Hi @smithsp, Thanks for the decision. Therefore, we will include PR #2746 only if two conditions are met: ...
If PR #2746 is not included in the forthcoming tarball (aka "ga_atlas_1.0.0"), my recommendation is that GA add it to the list of things to be included in an eventual subsequent release ("ga_atlas_1.0.1"). Regarding the tarball, I will post an update on Wednesday 24-Apr-2024 with an updated schedule. (I was out of the office today, so must check with my colleagues to find out the present status of the GA branch and tarball.)
Hi @smithsp, With guidance from @WhoBrokeTheBuild, the official GA branch is now available. It is named ... and it contains the cherry-picks of PR #2735 and PR #2740. (PR #2746 was not included because it is still in the review process.) The history of the ...
The ... Stephen informed me that the Jenkins build server is not configured to build the ... I am also creating a tarball of the RHEL8 RPMs so that I can install the ...
Timeline of fix (in calendar days, not work days) and lessons @mwinkel-dev learned . . .
Step 2 was the result of undocumented source code and wrong assumptions about the software architecture. To prevent this from occurring again, PR #2746 adds comments to help future maintenance programmers. Lesson learned = seek input from others sooner.

Six days elapsed between Steps 4 and 6. The fix was merged at Step 4 and was thus available to the customer, but not in the form the customer requested. Lesson learned = better coordination saves time.
The ...
Hi @smithsp -- Thanks for the update. Am glad to read that the ...
Affiliation
GA/DIII-D
Version(s) Affected
Our production server, atlas, currently suffers this bug. @sflanagan will provide the version at some point.
Platform
RHEL8
Describe the bug
The MDSplus dispatcher leaves files open, eventually filling up the open-file limit. Somehow the increased limit is not honored by the OS or the dispatcher, even after attempting to raise it in various ways in various configuration files.
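For reference, the usual places to raise and verify the limit; whether these apply here depends on how the dispatcher is launched (e.g., /etc/security/limits.conf is ignored for systemd services):

```sh
# If the dispatcher runs as a systemd service, the limit must be set in
# the unit file itself:
#   [Service]
#   LimitNOFILE=65536

# Verify what the running process actually got (placeholder PID):
prlimit --pid "$DISPATCHER_PID" --nofile
grep 'open files' /proc/"$DISPATCHER_PID"/limits
```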
To Reproduce
Our production server, atlas, currently suffers this bug. Details to be provided at some point by @sflanagan.
Expected behavior
I expect that the dispatcher will close its own files as it goes, without accumulating open files.
Additional context
Any fix provided needs to be on a branch starting from the version that is currently on atlas.