
POC : use a unique top_session_dir directory unless direct launch'ed #2088

Conversation

ggouaillardet
Contributor

No description provided.

@ggouaillardet
Contributor Author

@rhc54 this commit implements a truly unique top_session_dir as discussed on the ML.

@artpol84 I saw you defined PMIX_NSDIR, but I have not had much time to investigate it.
At first glance, it looks like a "to be adopted" PMIx feature, so I did not spend much time on it.

Note the first commit comes from #2084; it is only here to prevent a merge conflict.

@artpol84
Contributor

Will this work if tmpdir is on the shared FS?
I guess not - mktemp ensures that the directory is unique at the node scope only.
IIRC this case was under consideration, @rhc54 correct me if I am wrong.

@artpol84
Contributor

@ggouaillardet PMIX_NSDIR is a legal PMIx key and has been sitting in PMIx for a long time.
SLURM will soon support it once the contribution https://bugs.schedmd.com/show_bug.cgi?id=3051 is accepted.

@ggouaillardet
Contributor Author

Well, I call `mkdtemp("$TMP/ompi.<hostname>.<uid>/XXXXXX")`. So even if `$TMP` is on a shared filesystem, `$TMP/ompi.<hostname>.<uid>` is only used by one node at a time. I will run some more tests (I have not had a chance to test on several nodes yet), and for now I can only hope it is OK.
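
For illustration, a minimal sketch of that kind of call on a POSIX system (not the actual ORTE code; the parent directory is assumed to already exist):

```c
#define _XOPEN_SOURCE 700
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char host[256] = "unknown";
    gethostname(host, sizeof(host) - 1);     /* hostname makes the tree per-node */

    const char *tmp = getenv("TMPDIR");
    if (NULL == tmp) {
        tmp = "/tmp";
    }

    /* $TMP/ompi.<hostname>.<uid>/XXXXXX -- the parent dir must already exist */
    char template[PATH_MAX];
    snprintf(template, sizeof(template), "%s/ompi.%s.%u/XXXXXX",
             tmp, host, (unsigned) getuid());

    if (NULL == mkdtemp(template)) {          /* fills in a unique suffix */
        perror("mkdtemp");
        return 1;
    }
    printf("unique top_session_dir: %s\n", template);
    return 0;
}
```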

Thanks for the PMIX_NSDIR pointer; I will patch and check this PR against it.

@ibm-ompi

Build Failed with PGI compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/f60fd58c59b9485b2c564d12937fd1b3

@artpol84
Contributor

So you are making the assumption that mkdtemp will check the directory content before creating the directory. Consider two possibilities:

  • This is correct for all OSes (the following bug report shows some evidence: http://www-01.ibm.com/support/docview.wss?uid=isg1IV45350). But in this case job start will suffer from unwanted delays regardless of where the tmpdir is located.
  • Since the mkdtemp man page doesn't clearly state that mkdtemp will scan the target directory to ensure that the dirname is unique, some OSes may implement this another way. For example, they may use some internal-to-node info to generate the unique value (I'd prefer this case from a performance perspective).

According to Google, it seems that mkdtemp is supposed to work over NFS:
https://lists.freebsd.org/pipermail/freebsd-bugs/2009-May/035437.html

So I guess the disadvantage of this PR is possible job-start delays.

@artpol84
Contributor

BTW, according to http://www-01.ibm.com/support/docview.wss?uid=isg1IV45350, you need to call mkdtemp again if it fails with EEXIST.
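
For what it's worth, a hedged sketch of such a retry loop, assuming (per that APAR) that mkdtemp can fail with EEXIST on some platforms; the helper below is made up for illustration:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical helper: retry mkdtemp() on EEXIST, up to max_tries attempts */
static char *mkdtemp_retry(const char *template, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        char *buf = strdup(template);     /* mkdtemp() overwrites its argument */
        if (NULL == buf) {
            return NULL;
        }
        if (NULL != mkdtemp(buf)) {
            return buf;                   /* success: caller frees the path */
        }
        int saved = errno;
        free(buf);
        if (EEXIST != saved) {
            errno = saved;
            return NULL;                  /* any other error is fatal */
        }
        /* EEXIST: the generated name collided, try again with a fresh template */
    }
    errno = EEXIST;
    return NULL;
}
```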

@artpol84
Contributor

artpol84 commented Sep 19, 2016

@ggouaillardet sorry, I missed your point about the hostname in the tempdir name. Indeed, this is sufficient to avoid conflicts.
The question of a possible delay is still open, though. I'd suggest measuring the mkdtemp delay as a function of the number of existing directories with the same pattern, and as a function of the number of directories in the base dir.
I guess mkdtemp may have a fast path, like adding a random value, and if a conflict is noticed it may go ahead with a longer path that scans the directory.
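
A rough sketch of how that measurement could look (the path and counts are illustrative; this is not Open MPI code):

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000;    /* pre-existing directories */
    char path[64];

    /* pre-populate /tmp with n directories matching the same pattern */
    for (int i = 0; i < n; i++) {
        snprintf(path, sizeof(path), "/tmp/bench.XXXXXX");
        if (NULL == mkdtemp(path)) {
            perror("mkdtemp (prepopulate)");
            return 1;
        }
    }

    /* time one more creation against that populated base dir */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    snprintf(path, sizeof(path), "/tmp/bench.XXXXXX");
    if (NULL == mkdtemp(path)) {
        perror("mkdtemp (measured)");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("mkdtemp with %d existing entries: %.2f us\n", n, usec);
    return 0;
}
```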

@rhc54
Contributor

rhc54 commented Sep 19, 2016

I'm not entirely sure that this is doing the right thing, so let me explain how the session dir works. The top-level directory is set up solely on the basis of the node/user, and is thus shared across multiple mpiruns. People often do run multiple mpiruns in parallel, so you cannot just blow that directory away at the end of any one execution.

Underneath that top-level dir, each mpirun creates its own jobfamily-level directory based on its pid. This directory is indeed unique to a given mpirun, and it can be safely destroyed when that mpirun completes. However, if it has "output" files in it, then you cannot destroy the directory tree as the user needs those files! This is why our cleanup program carefully checks every filename before deleting it.

So how does this proposal take these matters into account?

@ggouaillardet
Contributor Author

Well, this is not exactly what master is doing.

From _setup_jobfam_session_dir in orte/util/session_dir.c, three directories can be created in top_session_dir (e.g. $TMP/ompi.<hostname>.<uid>):

  1. pid.<pid> for HNP
  2. jobfam if not HNP and ORTE_JOBID_INVALID
  3. jf.<job family> otherwise

and

  1. is OK, though it might not be empty when the job starts (pid reuse), but I do not see why we could not wipe out this directory if it already exists
  2. I do not know how that can happen, and that being said, this directory is shared by multiple mpiruns
  3. jobfam is basically a 16-bit hash of a 32-bit pid, so there is a risk this directory is shared by several mpiruns. Also, even if there is no collision, this directory might not be empty when the job starts (jobfam reuse, so to speak)

Bottom line, I see some potential issues here.
What this patch does is create a truly unique top_session_dir (e.g. $TMP/ompi.<hostname>.<uid>/XXXXXX), so these potential issues are avoided.

Currently, top_session_dir cannot be removed when a job completes since it can be shared by several jobs. With this commit, and if there is no direct launch, this top_session_dir could be removed, but that is not implemented yet. (I could probably keep top_session_dir as it is, add an extra top_session_unique_dir, create the three subdirectories inside it, and then wipe out top_session_unique_dir only if it is different from top_session_dir.)
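
As a rough sketch of that last idea (the function names are hypothetical, not the actual orte/util/session_dir.c API), the unique directory would live under the existing top_session_dir, and only the unique one would be removed at teardown:

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* hypothetical: create $top_session_dir/XXXXXX and return it in unique_dir */
static int setup_top_session_unique_dir(const char *top_session_dir,
                                        char *unique_dir, size_t len)
{
    mkdir(top_session_dir, 0700);          /* shared dir, may already exist */
    snprintf(unique_dir, len, "%s/XXXXXX", top_session_dir);
    if (NULL == mkdtemp(unique_dir)) {
        return -1;
    }
    /* pid.<pid> / jobfam / jf.<jobfam> would then be created in here */
    return 0;
}

/* hypothetical: wipe only the per-mpirun directory, never the shared one */
static void cleanup_top_session_unique_dir(const char *top_session_dir,
                                           const char *unique_dir)
{
    if (0 != strcmp(unique_dir, top_session_dir)) {
        rmdir(unique_dir);                 /* recursive cleanup of subdirs omitted */
    }
}
```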

This part has been changed recently on master, mainly in 81195ab, but I think the same issue exists in v2.x (e.g. a job family collision can occur).

So unless I am missing something:

  • this commit does fix some issues
  • this commit does not break anything
  • cleanup is missing (and as @artpol84 indirectly suggested earlier, we want to do some cleanup in order to keep mkdtemp fast)
  • I just realized that we could use the mpirun/orted pid instead of mkdtemp, as long as we wipe out this directory if it already exists on job startup

@rhc54
Contributor

rhc54 commented Sep 20, 2016

Something doesn't sound right here - it could be that something got inadvertently changed during a prior commit. OMPI has always maintained a single top-level directory, with each mpirun creating a job-family directory underneath it. Each mpirun cleaned up by working its way up the directory tree, removing its own entries and then testing to see if things were empty. Thus, the top-level directory was removed by the "last-one-out" method.

Direct-launched jobs do indeed use a variation of this, but there should not be three levels of variation. Such jobs likewise used the same "last-one-out" cleanup method.

All this has worked for nearly a decade now. It may be that the PMIx usock rendezvous files weren't positioned in the right place, and/or the cleanup code wasn't adjusted to remove them. If so, then perhaps that is what needs to be addressed.

I'd rather not commit a change to the basic session directory system until we fully detail what we are doing and ensure that all use-cases are properly handled. It's clear that things have evolved in a manner that may not have been fully considered, so let's take a moment to step back and figure out what we really want this setup to do.

@artpol84
Contributor

Please see below.

> 2. I do not know how that can happen, and that being said, this directory is shared by multiple mpiruns
> 3. jobfam is basically a 16-bit hash of a 32-bit pid, so there is a risk this directory is shared by several mpiruns [...]

Paragraphs 2 and 3 sound contradictory to me :)

> this part has been changed recently on master, mainly in 81195ab, but I think the same issue exists in v2.x (e.g. a job family collision can occur)

This is not fully correct. I haven't changed any directory naming logic in that commit; I preserved it from ae2af61.

> • I just realized that we could use the mpirun/orted pid instead of mkdtemp, as long as we wipe out this directory if it already exists on job startup

This PR tries to ensure the uniqueness of the session directory names. I'd like to suggest an alternative for the cases when an RM is detected. All RMs provide the JOBID in some way, and this is something that is truly unique for the particular job across the cluster. Why can't we use them in the corresponding ESS components to generate truly unique jobids?

@rhc54
Contributor

rhc54 commented Sep 20, 2016

We do use the RM-provided jobid - we always have. The only time we generate one ourselves is when we launch via mpirun, and that is required as not every RM provides us with a unique jobid for each mpirun invocation (in fact, if we use the ssh PLM, then we never get a "stepid"). So mpirun will always generate its own, but that has nothing to do with the session dir question.

@artpol84
Contributor

If we launch with mpirun, I don't think we use the RM's jobid.

@artpol84
Contributor

But we could, in the case of SLURM and Torque.

@rhc54
Contributor

rhc54 commented Sep 20, 2016

No, we can't - there can be multiple mpirun invocations within a given job, and there is no guarantee those are going to launch via srun. Thus, there is no guarantee they are being assigned a unique "stepid", and they all share the same jobid (what we would call a "session" id as it is related to the allocation and not the specific mpirun).

We've been thru this multiple times over the years - truly, there is a reason why we do things this way 😄

@artpol84
Contributor

If we are using the SLURM PLM, we can use the jobid, can't we?

@artpol84
Contributor

It's the PLM that is responsible for generating jobids, so why do we avoid using them when the SLURM/Torque/whatever PLM was selected?

@artpol84
Contributor

And jobid generation certainly has something to do with session directory creation, because it affects directory names, and here we are trying to solve the collision problem.

@artpol84
Contributor

What I'm suggesting is to have the jobfam generated from the RM's JOBID.
The local jobid portion can be generated from the STEPID if available, or from something else if not.

@ggouaillardet
Contributor Author

@artpol84 you are right about the commit id that introduced the change of directory names on master.

My understanding of @rhc54's previous reply is that if your SLURM script does

mpirun a.out &
mpirun b.out

then you obviously cannot use the SLURM jobid as a job family; otherwise both Open MPI jobs will end up with the same job family.
In this case, mpirun will srun orted under the hood, and orted will have a unique jobstep provided by SLURM. I did not check how things are actually implemented, but I have the feeling this SLURM jobstep comes too late.
An option is to srun -n 1 sh -c 'echo $SLURM_STEPID' twice, retrieve two unique stepids (one for mpirun/orted, one for the MPI tasks), and then srun orted with these unique stepids collected earlier (which will not match the ones actually assigned to orted or the MPI tasks).
Regardless, this looks quite ugly to me, and it only works as long as the MPI app does not MPI_Comm_spawn.
IIRC, the SLURM jobid is 32 bits but the job family is 16 bits, so two jobs running on the same node could end up with the same job family.
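
To illustrate that last point: any reduction of a 32-bit id to 16 bits must collide. The hash below is a stand-in for illustration, not the actual ORTE jobfam hash:

```c
#include <stdint.h>
#include <stdio.h>

/* hypothetical 16-bit reduction, standing in for the real jobfam hash */
static uint16_t to_jobfam(uint32_t id)
{
    return (uint16_t) (id & 0xffff);
}

int main(void)
{
    uint32_t a = 0, b = 65536;                        /* differ only above bit 15 */
    printf("id %u -> jobfam %u\n", a, to_jobfam(a));
    printf("id %u -> jobfam %u\n", b, to_jobfam(b));  /* same jobfam: 0 */
    return 0;
}
```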

@artpol84
Contributor

I'm trying to find out how the caller of the srun can get the stepid.

@artpol84 closed this Sep 20, 2016
@artpol84 reopened this Sep 20, 2016
@ggouaillardet
Contributor Author

@rhc54 I am fine with taking some time to think about that.

My understanding of the current situation is:

  • it is assumed that at a given time and on a given node, two different Open MPI jobs cannot have the same jobfam, so we can build jobfam_session_dir from only the hostname and jobfam (we also use the uid, but that is only to keep things organized and avoid permission issues). Since jobfam is (basically) a 16-bit hash of a pid (32 bits on Linux, 64 on Solaris IIRC), that assumption is incorrect
  • it is assumed to have been working for over a decade since no one remembers any report. Strictly speaking, "no bug report => no bug" is incorrect.

We recently got a bug report, and some of us, including me, were too quick to blame a jobfam collision; the root cause turned out to be something else.

IMHO, I'd rather make sure we never have to worry about collisions any more, hence this relatively non-intrusive commit that could be backported to v2.x and v1.10.

In the longer term, and as you previously suggested, the namespace (which is hopefully truly unique) could be used instead of the job family, or at least jobfam_session_dir could be replaced with namespace_session_dir.

@artpol84
Contributor

> then you obviously cannot use the SLURM jobid as a job family; otherwise both Open MPI jobs will end up with the same job family.

In my understanding, sharing the job family is OK in this case, because you will have different local jobids and it is the same job allocation = job family.

> An option is to srun -n 1 sh -c 'echo $SLURM_STEPID' twice, retrieve two unique stepids (one for mpirun/orted, one for the MPI tasks), and then srun orted with these unique stepids collected earlier [...]

As I've said, the other option would be to try to get the stepid from the orted's srun.

> IIRC, the SLURM jobid is 32 bits but the job family is 16 bits, so two jobs running on the same node could end up with the same job family.

That is correct, but if you use random numbers for the local jobids you will end up with collisions much earlier than when stepids start overlapping in 16 bits.

@ggouaillardet
Contributor Author

@artpol84 I am sure there is a way to do that (ideally it would be straightforward; worst case, we have to be (very) creative).
First, we need 2 stepids (one for mpirun/orted, one for the MPI apps) as long as the app does not MPI_Comm_spawn; currently, these stepids are 0 and 1.
Then, assuming we find a way to retrieve these stepids, would that be enough? I am afraid these stepids would arrive after we first need them.

@artpol84
Contributor

If you can retrieve the stepid from srun orted ... then you don't need 2 stepids: you can use the same stepid for mpirun and the orteds. Your way of having 2 stepids seems OK to me too.

@artpol84
Contributor

What is the problem with spawn that you are envisioning?

@ggouaillardet
Contributor Author

@artpol84

currently,

  • mpirun/orted have stepid 0
  • MPI tasks started by mpirun have stepid 1
  • tasks created by first MPI_Comm_spawn have stepid 2
  • tasks created by second MPI_Comm_spawn have stepid 3
  • ...

Also, I am pretty sure we assume here and there that the stepid of mpirun/orted is 0.

If two distinct jobs share the same job family, we have to:

  • support mpirun/orted with a non-zero stepid
  • have a way to ensure two distinct Open MPI jobs sharing the same job family use different stepids
  • take extra care when cleaning jobfam_session_dir

For spawn, we need a mechanism to dynamically retrieve a unique job step
(I just thought this could be achieved by having mpirun perform a dummy srun just to get a unique stepid).
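
A hedged sketch of that dummy-srun idea (illustrative only, not something ORTE does today):

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>

/* run a trivial step just to reserve a SLURM stepid, and read it back */
int main(void)
{
    FILE *p = popen("srun -n 1 sh -c 'echo $SLURM_STEPID'", "r");
    if (NULL == p) {
        perror("popen");
        return 1;
    }

    char line[64];
    long stepid = -1;
    if (NULL != fgets(line, sizeof(line), p)) {
        stepid = strtol(line, NULL, 10);
    }
    pclose(p);

    printf("reserved stepid: %ld\n", stepid);
    return 0;
}
```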

@ggouaillardet
Contributor Author

@artpol84 I never thought of using random numbers for the jobid or the stepid (pid-to-jobfam hashing in Open MPI is a deterministic function, and stepids are created in sequence).

If you submit SLURM jobs 0 and 65536 and they end up running at the same time on the same node
(I know that is quite unlikely!), then my understanding of your idea is that they will end up with the same jobfam, which is already an issue today.

@artpol84
Contributor

I said "random" because the current local ID assignment reminds me of a PRNG.

@artpol84
Contributor

Correction - jobid assignment.

@artpol84
Contributor

artpol84 commented Sep 20, 2016

I agree that the requirement of having orteds assigned a unique jobid conflicts with the fact that you
srun orted while orted launches the apps without srun, so there is no additional stepid generated for the app processes.

Member

@jsquyres left a comment


Please add a Signed-off-by line to this PR's commit.

@bwbarrett
Member

@ggouaillardet, what's the plan here?

currently, top_session_dir is based on hostname, uid, HNP pid and job family.
there is a risk a top_session_dir exists and contains some old data when
a job starts, leading to undefined behavior.

if the app is started via mpirun, use a unique top_session_dir
mkdtemp("$TMP/ompi.<hostname>.<uid>/XXXXXX") that is passed to fork'ed MPI tasks
via the OPAL_MCA_PREFIX"orte_top_session_dir" environment variable.

if the app is direct launched, then the current behavior is unchanged.
direct launch behavior will be enhanced when PMIx is able to pass
a per-node directory (PMIX_NSDIR ?) to a direct launched task.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
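
As a minimal sketch of the environment-variable hand-off described in the commit message above (assuming Open MPI's OPAL_MCA_PREFIX expands to "OMPI_MCA_"; the path below is illustrative):

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define MCA_PREFIX "OMPI_MCA_"   /* assumed value of Open MPI's OPAL_MCA_PREFIX */

int main(void)
{
    char dir[] = "/tmp/ompi-session.XXXXXX";   /* real code: $TMP/ompi.<hostname>.<uid>/XXXXXX */
    if (NULL == mkdtemp(dir)) {
        perror("mkdtemp");
        return 1;
    }

    /* mpirun/orted exports the unique top_session_dir ... */
    setenv(MCA_PREFIX "orte_top_session_dir", dir, 1);

    if (0 == fork()) {
        /* ... and the fork'ed MPI task simply reads it back */
        printf("child sees top_session_dir: %s\n",
               getenv(MCA_PREFIX "orte_top_session_dir"));
        _exit(0);
    }
    wait(NULL);
    return 0;
}
```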
@ggouaillardet
Contributor Author

No consensus was reached, so I am archiving this PR to https://github.com/open-mpi/ompi/wiki/Archived-PR

6 participants