Fix pbs.get_local_nodename() returns truncated PBS_MOM_NODE_NAME when it is in URL format #1779
@minghui-liu : code looks fine. @alexis-cousein : can you review this as well, and see if it's sufficient? Thanks.
Someone left a comment that uses the word "trucated" instead of "truncated" just to check if I was reading it thoroughly ;-). Other than that, I'm wondering about this:
PBS_MOM_NODE_NAME was taken to configure mom_host even when gethostname() fails, also before this PR. If mom_host was actually taken from PBS_MOM_NODE_NAME, then since it conforms to the RFCs for a valid hostname if we get here, I'm wondering whether we should really return(-1) here rather than continue with mom_host set to the contents of PBS_MOM_NODE_NAME (i.e. with both the short and 'long' canonicalized names set to PBS_MOM_NODE_NAME). It seems odd, when PBS_MOM_NODE_NAME is set, to tolerate gethostname() failures and substitute PBS_MOM_NODE_NAME for what that call should have yielded, but then not to tolerate canonicalization failures of that name (though we should certainly emit the error written there!).
Returning -1 prevents MoM from working if the name resolution framework is entirely broken locally when MoM starts up. But presumably, if someone sets PBS_MOM_NODE_NAME to an FQDN, a local failure doesn't mean resolution is broken for the other hosts, and it doesn't mean it will not work later on the host starting up MoM (e.g. on Cloud nodes the name resolution can often start to work quite late).
Not a biggie, though, just wondering about the consistency of which failures we tolerate when the variable is set. One plus side of failing here is that people are eventually forced to do things the really correct way, e.g. make sure that MoM startup depends on the name resolution framework already being up. The second is that typos in PBS_MOM_NODE_NAME will stick out like a sore thumb, and the site admin will see that it needs to be fixed. The downside may be erratic failures due to node startup races that we could possibly survive.
FWIW: I'm fine with any decision we may make about the desired behaviour on failed canonicalization. As long as it is, indeed, a decision and not an accident ;-).
@alexis-cousein : I believe it was a deliberate decision not to allow mom to continue if hostname resolution fails for mom_host. Even when adding nodes via qmgr, the given <node_name> is checked for validity, and PBS won't allow the node to be added if hostname resolution fails.
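To make the two behaviours under discussion concrete, here is a minimal Python sketch (the real code is in MoM's C sources and is not reproduced here). The names `resolve_canonical` and `pick_mom_host` and the `strict` flag are hypothetical, introduced only for illustration; `strict=True` roughly corresponds to the current "return(-1)" behaviour, `strict=False` to the alternative of continuing with the configured name.

```python
import socket


def resolve_canonical(name):
    """Return the canonical name for `name`; raises socket.gaierror on failure."""
    infos = socket.getaddrinfo(name, None, flags=socket.AI_CANONNAME)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return infos[0][3] or name


def pick_mom_host(node_name, strict=True, resolve=resolve_canonical):
    """Hypothetical illustration of the two policies, not the actual MoM logic.

    strict=True : give up when canonicalization fails (returns None here,
                  analogous to return(-1) in the C code).
    strict=False: log the error but keep the configured name as-is.
    """
    try:
        return resolve(node_name)
    except OSError as err:  # socket.gaierror is a subclass of OSError
        print(f"cannot canonicalize {node_name!r}: {err}")
        return None if strict else node_name
```

Passing a stub as `resolve` lets both policies be exercised without depending on the local name resolution setup.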
Looks good.
OK for me (although I would prefer "truncate" to be properly spelled in the comment).
A couple of suggestions:
class TestMomLocalNodeName(TestFunctional):

    def test_url_nodename_not_truncated(self):
Can you add docstrings for the test?
        self.mom.log_match("my local nodename is a.b.c.d")

    def test_ip_nodename_not_truncated(self):
        ipaddr = socket.gethostbyname(self.mom.hostname)
If you're using socket, can you import it? I know it's imported by something in testlib, but I think it's safer to also import it here.
LGTM
Describe Bug or Feature
pbs.get_local_nodename() returns truncated PBS_MOM_NODE_NAME when it is in "dotted" format but not an IP address. This is preventing some sites from using URL format mom node names.
Describe Your Change
Set mom_short_name to the value of PBS_MOM_NODE_NAME in pbs.conf if it is set, regardless of whether it is an IP address. If PBS_MOM_NODE_NAME is not set, then use the return value of gethostname(), truncated after the first dot.
Link to Design Doc
https://pbspro.atlassian.net/wiki/spaces/PD/pages/1824423937/Revision+to+PBS+MOM+NODE+NAME+configuration+variable
Attach Test and Valgrind Logs/Output
SmokeTest, HookSmokeTests, TestMomLocalNodeName
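For readers skimming the description, the rule stated under "Describe Your Change" can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the MoM code itself (which is C); the function name `choose_mom_short_name` and its parameters are hypothetical.

```python
def choose_mom_short_name(conf_node_name, system_hostname):
    """Hypothetical sketch of the short-name rule described above.

    conf_node_name : value of PBS_MOM_NODE_NAME from pbs.conf, or None/"" if unset
    system_hostname: what gethostname() returned
    """
    if conf_node_name:
        # Use the configured name verbatim, whether or not it is an IP address.
        return conf_node_name
    # Otherwise keep only the part of the hostname before the first dot.
    return system_hostname.split(".", 1)[0]
```

With this rule, a configured name such as "a.b.c.d" is kept whole, while an unconfigured MoM on "host1.example.com" ends up with "host1".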