need ability to snapshot python client logfile #885

Closed
hppritcha opened this issue Dec 12, 2019 · 6 comments
hppritcha (Member) commented Dec 12, 2019
The MTT Python client, as it stands, is not well suited for use on typical ECP systems: cori, theta, etc. The problem is that interacting with the Open MPI community's IU database, fetching the OMPI tarballs from the download site, and doing a git checkout of the OMPI tests all require internet connectivity.

On most ECP systems, however, only the front-end nodes have such internet connectivity. To run tests, one needs a resource allocation from SLURM, COBALT, etc. Once such a resource is granted, the requesting process is typically placed on a compute node or mom node of the system. These nodes typically do not have internet access.

Several workarounds have been attempted in the past. The most recent, on SLURM systems, has been to request an allocation in such a way that the process issuing the salloc request remains on the login node. This has only partially worked, as this mode of using SLURM is atypical and buggy.

A potentially simpler approach is to add a checkpoint/restart-like capability to MTT, so one can break up an MTT run into a sequence of stages driven by different ini files. Stages that require internet connectivity can be run on the login nodes, while other stages, typically the ones that actually run the tests, can be run on the backend.
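Such a staged run might look like the following sketch. The flag names (`--snapshot`, `--restore`) and file names here are purely illustrative, not the actual Python client CLI:

```shell
# Hypothetical staged MTT run; flag and file names are illustrative only.

# Stage 1 (login node, has internet): contact the IU database, fetch the
# OMPI tarball and tests, build, then snapshot the client state to disk.
pyclient.py --ini fetch_and_build.ini --snapshot mtt-state.dat

# Stage 2 (inside the allocation, no internet): restore the snapshot and
# run only the test stages.
pyclient.py --ini run_tests.ini --restore mtt-state.dat

# Stage 3 (back on the login node): restore again and report the results.
pyclient.py --ini report.ini --restore mtt-state.dat
```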

@hppritcha hppritcha self-assigned this Dec 12, 2019
rhc54 (Contributor) commented Dec 12, 2019

I'm a little puzzled here. The Python client includes a resource manager plugin (e.g., Slurm), and that plugin can itself request an allocation. The intent was that the Python client would not need to be started from within an allocation; instead, it would execute its initial operations (fetching and building things) and then request an allocation when one was needed to actually run the tests.

The only issue we have encountered with that method is that the allocation request can take some time to be granted. Our internal solution was to simply create a high-priority queue for MTT operations and submit the request to it.

Is this not adequate for ECP systems? If not, note that the Python client already has a C/R capability of sorts: you can have one ini file that downloads and builds things, and another ini file that flags the download/build components with ASIS to indicate that MTT should use the existing installations if present. Does that not also solve the problem?
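The two-ini pattern described above hinges on the `ASIS` keyword in the second ini. A rough sketch of what that second-stage ini could contain, assuming the Python client's convention of prefixing a stage section with `ASIS` to reuse an existing result (stage names like `MiddlewareGet` follow the client's plugin naming, but the exact section contents depend on your setup):

```ini
# Second-stage ini (illustrative): ASIS tells MTT to reuse the
# existing download/build from the first-stage run, if present.
[ASIS MiddlewareGet:ompi]
# ...same contents as in the first-stage ini...

[ASIS MiddlewareBuild:ompi]
# ...same contents as in the first-stage ini...

[TestRun:ompi-tests]
# ...the stages you actually want executed inside the allocation...
```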

hppritcha (Member, Author) commented:
I don't see how this would work with the current IU database reporter. Anyway, I've written the code and it appears to serve my purposes.

rhc54 (Contributor) commented Dec 12, 2019

You are welcome to use your code - however, the methods I described work just fine with the current IU reporter. We use it every day precisely that way. The builds are reported correctly even if previously built.

ribab (Contributor) commented Dec 12, 2019

Hi @hppritcha , I am trying to understand your use-case and how it differs from ours. Your process allocates the cluster from a compute node? How does this work?

You said:

> the requesting process is typically put on some compute node

This means the process that requested the allocation runs on a compute node? Don't you need an allocation before running anything on a compute node?

hppritcha (Member, Author) commented Dec 18, 2019

@ribab one invokes the allocation command from a front-end node, i.e., the one you land on when you ssh to the system. For example, on the ANL theta cluster, here's how you'd get an allocation:

```shell
ssh theta
XXX@thetalogin6:qsub -n 8 --jobname ompi -q debug-flat-quad -t 60 -I
```

Upon granting the allocation, the user is placed on one of the internal mom nodes on theta. One then uses aprun or mpirun to launch the application.

A similar thing occurs on SLURM-configured systems like NERSC cori. One can try using the salloc --no-shell option to remain on one of the cori login nodes, but we've found that option to be unreliable for running the large numbers of tests we run in MTT.
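For reference, the `--no-shell` pattern looks roughly like this; the node counts and job ID are illustrative:

```shell
# Grant an allocation but stay on the login node: --no-shell returns
# after printing the job ID instead of spawning a shell.
salloc -N 2 -t 60 --no-shell
# e.g. "salloc: Granted job allocation 123456"

# Subsequent steps can target that allocation explicitly by job ID.
srun --jobid=123456 -n 2 ./a.out

# Release the allocation when done.
scancel 123456
```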

The only way I can see to use the MTT ALPS plugin on theta would be for the allocate command to include the name of a script which would somehow continue the MTT run on the backend mom node. The front-end process running the ALPS plugin up to that point would be disconnected from whatever was going on in the backend.

The way the MTT SLURM and ALPS plugins are written, I suspect they were developed on systems configured like some of Cray's internal systems. There, SLURM is configured so that when one does an salloc, one remains on the front-end nodes rather than being ssh'd into a compute or mom node on the backend. In that case, the plugins as-is, with their allocate/deallocate commands, should work fine. PBS was similarly configured on those systems.

hppritcha (Member, Author) commented:
closed via #916
