
POC : use a unique top_session_dir directory unless direct launch'ed #2088

Conversation

ggouaillardet
Contributor

No description provided.

@ggouaillardet
Contributor Author

@rhc54 this commit implements a truly unique top_session_dir as discussed on the ML.

@artpol84 I saw you defined PMIX_NSDIR, but I have not had much time to investigate it.
At first glance, it looks like a "to be adopted" PMIx feature, so I did not spend much time on it.

Note the first commit comes from #2084; it is only here to prevent a merge conflict.

@artpol84
Contributor

Will this work if tmpdir is on the shared FS?
I guess not - mktemp ensures that the directory is unique at the node scope only.
IIRC this case was under consideration, @rhc54 correct me if I am wrong.

@artpol84
Contributor

@ggouaillardet PMIX_NSDIR is a legal PMIx key and has been sitting in PMIx for a long time.
SLURM will soon support it once the contribution https://bugs.schedmd.com/show_bug.cgi?id=3051 is accepted.

@ggouaillardet
Contributor Author

Well, I call `mkdtemp("$TMP/ompi.<hostname>.<uid>/XXXXXX")`. So even if `$TMP` is on a shared filesystem, `$TMP/ompi.<hostname>.<uid>` is only used by one node at a time. I will run some more tests (I have not had a chance to test on several nodes yet), and for now I can only hope it is OK.
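
For illustration, a minimal sketch of that kind of call on a POSIX system (not the actual ORTE code; the parent directory is assumed to already exist):

```c
#define _XOPEN_SOURCE 700
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char host[256] = "unknown";
    gethostname(host, sizeof(host) - 1);     /* hostname makes the tree per-node */

    const char *tmp = getenv("TMPDIR");
    if (NULL == tmp) {
        tmp = "/tmp";
    }

    /* $TMP/ompi.<hostname>.<uid>/XXXXXX -- the parent dir must already exist */
    char template[PATH_MAX];
    snprintf(template, sizeof(template), "%s/ompi.%s.%u/XXXXXX",
             tmp, host, (unsigned) getuid());

    if (NULL == mkdtemp(template)) {          /* fills in a unique suffix */
        perror("mkdtemp");
        return 1;
    }
    printf("unique top_session_dir: %s\n", template);
    return 0;
}
```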

Thanks for the PMIX_NSDIR pointer; I will patch and check this PR against it.

@ibm-ompi

Build Failed with PGI compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/f60fd58c59b9485b2c564d12937fd1b3

@artpol84
Contributor

So you are making the assumption that mkdtemp will check the directory content before creating the directory. Consider two possibilities:

  • This is correct for all OSes (the following bug report shows some evidence: http://www-01.ibm.com/support/docview.wss?uid=isg1IV45350). But in this case job start will suffer from unwanted delays regardless of where the tmpdir is located.
  • Since the mkdtemp man page doesn't clearly state that mkdtemp will scan the target directory to ensure that the dirname is unique, some OSes may implement this another way. For example, they may use some internal-to-node info to generate the unique value (I'd prefer this case from a performance perspective).

According to Google, it seems that mkdtemp is supposed to work over NFS:
https://lists.freebsd.org/pipermail/freebsd-bugs/2009-May/035437.html

So I guess the disadvantage of this PR is possible job-start delays.

@artpol84
Contributor

BTW, according to http://www-01.ibm.com/support/docview.wss?uid=isg1IV45350, you need to call mkdtemp again if it fails with EEXIST.
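
For what it's worth, a hedged sketch of such a retry loop, assuming (per that APAR) that mkdtemp can fail with EEXIST on some platforms; the helper below is made up for illustration:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical helper: retry mkdtemp() on EEXIST, up to max_tries attempts */
static char *mkdtemp_retry(const char *template, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        char *buf = strdup(template);     /* mkdtemp() overwrites its argument */
        if (NULL == buf) {
            return NULL;
        }
        if (NULL != mkdtemp(buf)) {
            return buf;                   /* success: caller frees the path */
        }
        int saved = errno;
        free(buf);
        if (EEXIST != saved) {
            errno = saved;
            return NULL;                  /* any other error is fatal */
        }
        /* EEXIST: the generated name collided, try again with a fresh template */
    }
    errno = EEXIST;
    return NULL;
}
```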

@artpol84
Contributor

artpol84 commented Sep 19, 2016

@ggouaillardet sorry, I missed your point about the hostname in the tempdir name. Indeed, this is sufficient to avoid conflicts.
The question of a possible delay is still open, though. I'd suggest measuring the mkdtemp delay as a function of the number of existing directories with the same pattern, and as a function of the number of directories in the base dir.
I guess mkdtemp may have a fast path, like adding a random value, and if a conflict is noticed it may go ahead with a longer path that scans the directory.
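
A rough sketch of how that measurement could look (the path and counts are illustrative; this is not Open MPI code):

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000;    /* pre-existing directories */
    char path[64];

    /* pre-populate /tmp with n directories matching the same pattern */
    for (int i = 0; i < n; i++) {
        snprintf(path, sizeof(path), "/tmp/bench.XXXXXX");
        if (NULL == mkdtemp(path)) {
            perror("mkdtemp (prepopulate)");
            return 1;
        }
    }

    /* time one more creation against that populated base dir */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    snprintf(path, sizeof(path), "/tmp/bench.XXXXXX");
    if (NULL == mkdtemp(path)) {
        perror("mkdtemp (measured)");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("mkdtemp with %d existing entries: %.2f us\n", n, usec);
    return 0;
}
```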

@rhc54
Contributor

rhc54 commented Sep 19, 2016

I'm not entirely sure that this is doing the right thing, so let me explain how the session dir works. The top-level directory is set up solely on the basis of the node/user, and is thus shared across multiple mpiruns. People often do run multiple mpiruns in parallel, so you cannot just blow that directory away at the end of any one execution.

Underneath that top-level dir, each mpirun creates its own jobfamily-level directory based on its pid. This directory is indeed unique to a given mpirun, and it can be safely destroyed when that mpirun completes. However, if it has "output" files in it, then you cannot destroy the directory tree as the user needs those files! This is why our cleanup program carefully checks every filename before deleting it.

So how does this proposal take these matters into account?

@ggouaillardet
Contributor Author

Well, this is not exactly what master is doing.

From _setup_jobfam_session_dir in orte/util/session_dir.c, three directories can be created in top_session_dir (e.g. $TMP/ompi.<hostname>.<uid>):

  1. pid.<pid> for HNP
  2. jobfam if not HNP and ORTE_JOBID_INVALID
  3. jf.<job family> otherwise

and

  1. is OK, though it might not be empty when the job starts (pid reuse), but I do not see why we could not wipe out this directory if it already exists
  2. I do not know how that can happen, and that being said, this directory is shared by multiple mpiruns
  3. jobfam is basically a 16-bit hash of a 32-bit pid, so there is a risk this directory is shared by several mpiruns. Also, even if there is no collision, this directory might not be empty when the job starts (jobfam reuse, so to speak)

Bottom line, I see some potential issues here.
What this patch does is create a truly unique top_session_dir (e.g. $TMP/ompi.<hostname>.<uid>/XXXXXX), so these potential issues are avoided.

Currently, top_session_dir cannot be removed when a job completes since it can be shared by several jobs. With this commit, and if there is no direct launch, this top_session_dir could be removed, but that is not implemented yet. (I could probably keep top_session_dir as it is, add an extra top_session_unique_dir, create the three subdirectories inside it, and then wipe out top_session_unique_dir only if it is different from top_session_dir.)
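
As a rough sketch of that last idea (the function names are hypothetical, not the actual orte/util/session_dir.c API), the unique directory would live under the existing top_session_dir, and only the unique one would be removed at teardown:

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* hypothetical: create $top_session_dir/XXXXXX and return it in unique_dir */
static int setup_top_session_unique_dir(const char *top_session_dir,
                                        char *unique_dir, size_t len)
{
    mkdir(top_session_dir, 0700);          /* shared dir, may already exist */
    snprintf(unique_dir, len, "%s/XXXXXX", top_session_dir);
    if (NULL == mkdtemp(unique_dir)) {
        return -1;
    }
    /* pid.<pid> / jobfam / jf.<jobfam> would then be created in here */
    return 0;
}

/* hypothetical: wipe only the per-mpirun directory, never the shared one */
static void cleanup_top_session_unique_dir(const char *top_session_dir,
                                           const char *unique_dir)
{
    if (0 != strcmp(unique_dir, top_session_dir)) {
        rmdir(unique_dir);                 /* recursive cleanup of subdirs omitted */
    }
}
```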

This part has been changed recently on master, mainly in 81195ab, but I think the same issue exists in v2.x (e.g. a job family collision can occur).

So unless I am missing something:

  • this commit does fix some issues
  • this commit does not break anything
  • cleanup is missing (and as @artpol84 indirectly suggested earlier, we want to do some cleanup in order to keep mkdtemp fast)
  • I just realized that we could use the mpirun/orted pid instead of mkdtemp, as long as we wipe out this directory if it already exists on job startup

@rhc54
Contributor

rhc54 commented Sep 20, 2016

Something doesn't sound right here - it could be that something got inadvertently changed during a prior commit. OMPI has always maintained a single top-level directory, with each mpirun creating a job-family directory underneath it. Each mpirun cleaned up by working its way up the directory tree, removing its own entries and then testing to see if things were empty. Thus, the top-level directory was removed by the "last-one-out" method.

Direct-launched jobs do indeed use a variation of this, but there should not be three levels of variation. Such jobs likewise used the same "last-one-out" cleanup method.

All this has worked for nearly a decade now. It may be that the PMIx usock rendezvous files weren't positioned in the right place, and/or the cleanup code wasn't adjusted to remove them. If so, then perhaps that is what needs to be addressed.

I'd rather not commit a change to the basic session directory system until we fully detail what we are doing and ensure that all use-cases are properly handled. It's clear that things have evolved in a manner that may not have been fully considered, so let's take a moment to step back and figure out what we really want this setup to do.

@artpol84
Contributor

Please see below.

> 2. I do not know how that can happen, and that being said, this directory is shared by multiple mpiruns
> 3. jobfam is basically a 16-bit hash of a 32-bit pid, so there is a risk this directory is shared by several mpiruns [...]

Paragraphs 2 and 3 sound contradictory to me :)

> this part has been changed recently on master, mainly in 81195ab, but I think the same issue exists in v2.x (e.g. a job family collision can occur)

This is not fully correct. I haven't changed any directory naming logic in that commit; I preserved it from ae2af61.

> • I just realized that we could use the mpirun/orted pid instead of mkdtemp, as long as we wipe out this directory if it already exists on job startup

This PR tries to ensure the uniqueness of the session directory names. I'd like to suggest an alternative for the cases when an RM is detected. All RMs provide the JOBID in some way, and this is something that is truly unique for the particular job across the cluster. Why can't we use them in the corresponding ESS components to generate truly unique jobids?

@rhc54
Contributor

rhc54 commented Sep 20, 2016

We do use the RM-provided jobid - we always have. The only time we generate one ourselves is when we launch via mpirun, and that is required as not every RM provides us with a unique jobid for each mpirun invocation (in fact, if we use the ssh PLM, then we never get a "stepid"). So mpirun will always generate its own, but that has nothing to do with the session dir question.

@artpol84
Contributor

If we launch with mpirun, I don't think we use the RM's jobid.

@artpol84
Contributor

But we could, in the case of SLURM and Torque.

@rhc54
Contributor

rhc54 commented Sep 20, 2016

No, we can't - there can be multiple mpirun invocations within a given job, and there is no guarantee those are going to launch via srun. Thus, there is no guarantee they are being assigned a unique "stepid", and they all share the same jobid (what we would call a "session" id as it is related to the allocation and not the specific mpirun).

We've been thru this multiple times over the years - truly, there is a reason why we do things this way 😄

@artpol84
Contributor

If we are using the SLURM PLM, we can use the jobid, can't we?

@artpol84
Contributor

It's the PLM that is responsible for generating jobids, so why do we avoid using them when the SLURM/Torque/whatever PLM was selected?

@artpol84
Contributor

And jobid generation certainly has something to do with session directory creation, because it affects directory names, and here we are trying to solve the collision problem.

@artpol84
Contributor

What I'm suggesting is to have the jobfam generated from the RM's JOBID.
The local jobid portion can be generated from the STEPID if available, or from something else if not.

@ggouaillardet
Contributor Author

@artpol84 you are right about the commit id that introduced the change of directory names on master.

My understanding of @rhc54's previous reply is that if your SLURM script does

mpirun a.out &
mpirun b.out

then you obviously cannot use the SLURM jobid as a job family; otherwise both Open MPI jobs will end up with the same job family.
In this case, mpirun will srun orted under the hood, and orted will have a unique jobstep provided by SLURM. I did not check how things are actually implemented, but I have the feeling this SLURM jobstep comes too late.
An option is to srun -n 1 sh -c 'echo $SLURM_STEPID' twice, retrieve two unique stepids (one for mpirun/orted, one for the MPI tasks), and then srun orted with these unique stepids collected earlier (which will not match the ones actually assigned to orted or the MPI tasks).
Regardless, this looks quite ugly to me, and it only works as long as the MPI app does not MPI_Comm_spawn.
IIRC, the SLURM jobid is 32 bits but the job family is 16 bits, so two jobs running on the same node could end up with the same job family.
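
To illustrate that last point: any reduction of a 32-bit id to 16 bits must collide. The hash below is a stand-in for illustration, not the actual ORTE jobfam hash:

```c
#include <stdint.h>
#include <stdio.h>

/* hypothetical 16-bit reduction, standing in for the real jobfam hash */
static uint16_t to_jobfam(uint32_t id)
{
    return (uint16_t) (id & 0xffff);
}

int main(void)
{
    uint32_t a = 0, b = 65536;                        /* differ only above bit 15 */
    printf("id %u -> jobfam %u\n", a, to_jobfam(a));
    printf("id %u -> jobfam %u\n", b, to_jobfam(b));  /* same jobfam: 0 */
    return 0;
}
```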

@artpol84
Contributor

I'm trying to find out how the caller of the srun can get the stepid.

@artpol84 closed this Sep 20, 2016
@artpol84 reopened this Sep 20, 2016
@ggouaillardet
Contributor Author

@rhc54 I am fine with taking some time to think about that.

My understanding of the current situation is:

  • it is assumed that at a given time and on a given node, two different Open MPI jobs cannot have the same jobfam, so we can build jobfam_session_dir from only the hostname and jobfam (we also use the uid, but that is only to keep things organized and avoid permission issues). Since jobfam is (basically) a 16-bit hash of a pid (32 bits on Linux, 64 on Solaris IIRC), that assumption is incorrect
  • it is assumed to have been working for over a decade since no one remembers any report. Strictly speaking, "no bug report => no bug" is incorrect.

We recently got a bug report, and some of us, including me, were too quick to blame a jobfam collision; the root cause turned out to be something else.

IMHO, I'd rather make sure we never have to worry about collisions any more, hence this relatively non-intrusive commit that could be backported to v2.x and v1.10.

In the longer term, and as you previously suggested, the namespace (which is hopefully truly unique) could be used instead of the job family, or at least jobfam_session_dir could be replaced with namespace_session_dir.

@artpol84
Contributor

> then you obviously cannot use the SLURM jobid as a job family; otherwise both Open MPI jobs will end up with the same job family.

In my understanding, sharing the job family is OK in this case, because you will have different local jobids and it is the same job allocation = job family.

> An option is to srun -n 1 sh -c 'echo $SLURM_STEPID' twice, retrieve two unique stepids (one for mpirun/orted, one for the MPI tasks), and then srun orted with these unique stepids collected earlier [...]

As I've said, the other option would be to try to get the stepid from the orted's srun.

> IIRC, the SLURM jobid is 32 bits but the job family is 16 bits, so two jobs running on the same node could end up with the same job family.

That is correct, but if you use random numbers for the local jobids you will end up with collisions much earlier than when stepids start overlapping in 16 bits.

@ggouaillardet
Contributor Author

@artpol84 I am sure there is a way to do that (ideally it would be straightforward; worst case, we have to be (very) creative).
First, we need 2 stepids (one for mpirun/orted, one for the MPI apps) as long as the app does not MPI_Comm_spawn; currently, these stepids are 0 and 1.
Then, assuming we find a way to retrieve these stepids, would that be enough? I am afraid these stepids would arrive after we first need them.

@artpol84
Contributor

If you can retrieve the stepid from srun orted ... then you don't need 2 stepids: you can use the same stepid for mpirun and the orteds. Your way of having 2 stepids seems OK to me too.

@artpol84
Contributor

What is the problem with spawn that you are envisioning?

@ggouaillardet
Contributor Author

@artpol84

currently,

  • mpirun/orted have stepid 0
  • MPI tasks started by mpirun have stepid 1
  • tasks created by first MPI_Comm_spawn have stepid 2
  • tasks created by second MPI_Comm_spawn have stepid 3
  • ...

Also, I am pretty sure we assume here and there that the stepid of mpirun/orted is 0.

If two distinct jobs share the same job family, we have to:

  • support mpirun/orted with a non-zero stepid
  • have a way to ensure two distinct Open MPI jobs sharing the same job family use different stepids
  • take extra care when cleaning jobfam_session_dir

For spawn, we need a mechanism to dynamically retrieve a unique job step
(I just thought this could be achieved by having mpirun perform a dummy srun just to get a unique stepid).
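
A hedged sketch of that dummy-srun idea (illustrative only, not something ORTE does today):

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>

/* run a trivial step just to reserve a SLURM stepid, and read it back */
int main(void)
{
    FILE *p = popen("srun -n 1 sh -c 'echo $SLURM_STEPID'", "r");
    if (NULL == p) {
        perror("popen");
        return 1;
    }

    char line[64];
    long stepid = -1;
    if (NULL != fgets(line, sizeof(line), p)) {
        stepid = strtol(line, NULL, 10);
    }
    pclose(p);

    printf("reserved stepid: %ld\n", stepid);
    return 0;
}
```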

@ggouaillardet
Contributor Author

@artpol84 I never thought of using random numbers for the jobid or the stepid (pid-to-jobfam hashing in Open MPI is a deterministic function, and stepids are created in sequence).

If you submit SLURM jobs 0 and 65536 and they end up running at the same time on the same node
(I know that is quite unlikely!), then my understanding of your idea is that they will end up with the same jobfam, which is already an issue today.

@artpol84
Contributor

I said "random" because the current local ID assignment reminds me of a PRNG.

@artpol84
Contributor

Correction - jobid assignment.

@artpol84
Contributor

artpol84 commented Sep 20, 2016

I agree that the requirement of having orteds assigned a unique jobid conflicts with the fact that you
srun orted while orted launches the apps without srun, so there is no additional stepid generated for the app processes.

Member

@jsquyres left a comment


Please add a Signed-off-by line to this PR's commit.

@bwbarrett
Member

@ggouaillardet, what's the plan here?

currently, top_session_dir is based on hostname, uid, HNP pid and job family.
there is a risk a top_session_dir exists and contains some old data when
a job starts, leading to undefined behavior.

if the app is started via mpirun, use a unique top_session_dir
mkdtemp("$TMP/ompi.<hostname>.<uid>/XXXXXX") that is passed to fork'ed MPI tasks
via the OPAL_MCA_PREFIX"orte_top_session_dir" environment variable.

if the app is direct launched, then the current behavior is unchanged.
direct launch behavior will be enhanced when PMIx is able to pass
a per-node directory (PMIX_NSDIR ?) to a direct launched task.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
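
As a minimal sketch of the environment-variable hand-off described in the commit message above (assuming Open MPI's OPAL_MCA_PREFIX expands to "OMPI_MCA_"; the path below is illustrative):

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define MCA_PREFIX "OMPI_MCA_"   /* assumed value of Open MPI's OPAL_MCA_PREFIX */

int main(void)
{
    char dir[] = "/tmp/ompi-session.XXXXXX";   /* real code: $TMP/ompi.<hostname>.<uid>/XXXXXX */
    if (NULL == mkdtemp(dir)) {
        perror("mkdtemp");
        return 1;
    }

    /* mpirun/orted exports the unique top_session_dir ... */
    setenv(MCA_PREFIX "orte_top_session_dir", dir, 1);

    if (0 == fork()) {
        /* ... and the fork'ed MPI task simply reads it back */
        printf("child sees top_session_dir: %s\n",
               getenv(MCA_PREFIX "orte_top_session_dir"));
        _exit(0);
    }
    wait(NULL);
    return 0;
}
```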
@ggouaillardet
Contributor Author

No consensus was reached, so I am archiving this PR to https://github.com/open-mpi/ompi/wiki/Archived-PR

6 participants