Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pbs.get_local_nodename() returns truncated PBS_MOM_NODE_NAME when it is in URL format #1779

Merged
merged 6 commits into from Jun 19, 2020

Conversation

minghui-liu
Copy link
Contributor

@minghui-liu minghui-liu commented May 28, 2020

Describe Bug or Feature

pbs.get_local_nodename() returns truncated PBS_MOM_NODE_NAME when it is in "dotted" format but not an IP address. This is preventing some sites from using URL format mom node names.

Describe Your Change

Set mom_short_name to the value PBS_MOM_NODE_NAME in pbs.conf if it is set, regardless of whether it is an IP address. If PBS_MOM_NODE_NAME is not set, then use the return value of gethostname() but truncated after the first dot.

Link to Design Doc

https://pbspro.atlassian.net/wiki/spaces/PD/pages/1824423937/Revision+to+PBS+MOM+NODE+NAME+configuration+variable

Attach Test and Valgrind Logs/Output

SmokeTest, HookSmokeTests, TestMomLocalNodeName
Screen Shot 2020-06-15 at 9 31 18 PM
Screen Shot 2020-06-15 at 9 31 33 PM

@CLAassistant
Copy link

CLAassistant commented May 28, 2020

CLA assistant check
All committers have signed the CLA.

@minghui-liu minghui-liu changed the title Fix pbs.get_local_nodename() returns truncated PBS_MOM_NODE_NAME when it is in URL format [WIP] Fix pbs.get_local_nodename() returns truncated PBS_MOM_NODE_NAME when it is in URL format May 29, 2020
@minghui-liu minghui-liu changed the title [WIP] Fix pbs.get_local_nodename() returns truncated PBS_MOM_NODE_NAME when it is in URL format Fix pbs.get_local_nodename() returns truncated PBS_MOM_NODE_NAME when it is in URL format Jun 15, 2020
@minghui-liu minghui-liu marked this pull request as ready for review June 15, 2020 13:33
@minghui-liu minghui-liu requested a review from bayucan June 18, 2020 00:46
@bayucan
Copy link
Contributor

bayucan commented Jun 18, 2020

@minghui-liu : code looks fine. @alexis-cousein : can you review this as well, and see if it's sufficient? thanks.

@alexis-cousein
Copy link
Contributor

alexis-cousein commented Jun 18, 2020

Someone left a comment that uses the word "trucated" instead of "truncated" just to check if I was reading it thoroughly ;-).

Other that, I'm wondering about this:

	c = get_fullhostname(mom_host, mom_host, (sizeof(mom_host) - 1));
	if (c == -1) {
		log_err(-1, msg_daemonname, "Unable to resolve my host name");
#ifdef	WIN32
		g_dwCurrentState = SERVICE_STOPPED;
		ss.dwCurrentState = g_dwCurrentState;
		ss.dwWin32ExitCode = ERROR_INVALID_COMPUTERNAME;
		if (g_ssHandle != 0)
			SetServiceStatus(g_ssHandle, &ss);
#endif
		return (-1);
	}

PBS_MOM_NODE_NAME was taken to configure mom_host even when gethostname() fails --also before this PR.

If mom_host was actually taken from PBS_MOM_NODE_NAME, since it actually conforms to the RFCs for a valid hostname if we get here, I'm wondering if we should actually return(-1) here, and not instead continue with mom_host set to the contents of PBS_MOM_NODE_NAME (i.e. with both the short and 'long' canonicalized names set to PBS_MOM_NODE_NAME).

It seems odd, in the case that PBS_MOM_NODE_NAME is set, to tolerate gethostname() failures and substitute PBS_MOM_NODE_NAME for what that call should have yielded, but then not to tolerate canonicalization failures of that name (though we should certainly emit the error written there!)

That prevents MoM from working if the name resolution framework is entirely broken locally when MoM starts up, but presumably if someone sets PBS_MOM_NODE_NAME to a FQDN that doesn't mean it's not working for the other hosts, and it doesn't mean it will not work later on the host starting up MoM (e.g. on Cloud nodes the name resolution can often start to work quite late).

Not a biggie, though, just wondering about the consistency about what failures we tolerate when the variable is set.

One plus side of failing here is that people are eventually used to force to do things the really correct way -- e.g. make sure that MoM startup depends on the name resolution framework already being up.

The second one is that if we have typos in PBS_MOM_NODE_NAME it'll stick out like a sore thumb, and the site admin will see that it needs fixed.

The downside may be erratic failures to node start up races that we could possibly survive.

@alexis-cousein
Copy link
Contributor

alexis-cousein commented Jun 18, 2020

FWIW: I'm fine with any decision we may make about the desired behaviour on failed canonicalization. As long as it is, indeed, a decision and not an accident ;-).

@bayucan
Copy link
Contributor

bayucan commented Jun 18, 2020

@alexis-cousein : I believe it was a real decision to not allow mom to continue if mom_host fails in hostname resolution. Even when adding nodes via qmgr, the given <node_name> is also checked for validity. PBS won't allow the node to be added if it fails on hostname resolution.

Copy link
Contributor

@bayucan bayucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@alexis-cousein
Copy link
Contributor

OK for me (although I would prefer "truncate" to be properly spelled in the comment).

Copy link
Contributor

@vstumpf vstumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of suggestions


class TestMomLocalNodeName(TestFunctional):

def test_url_nodename_not_truncated(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add docstrings for the test?

self.mom.log_match("my local nodename is a.b.c.d")

def test_ip_nodename_not_truncated(self):
ipaddr = socket.gethostbyname(self.mom.hostname)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're using socket, can you import it? I know it's imported by something in testlib, but I think it's safer to also import it here.

Copy link
Contributor

@vstumpf vstumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vstumpf vstumpf merged commit 8689424 into openpbs:master Jun 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants