need ability to snapshot python client logfile #885
I'm a little puzzled here. The Python client includes a resource manager plugin - e.g., Slurm - and that plugin includes the ability to request an allocation. The intent was that the Python client would not need to be started from within an allocation; instead, it would execute its initial operations (fetching and building things) and then request an allocation when one was needed for actually running the tests. The only issue we have encountered with that method is that the allocation request can take some time to be granted. Our internal solution was to simply create a high-priority queue for MTT operations and submit the request to it. Is this not adequate for ECP systems? If not, note that the Python client already has a C/R capability in that you can have one ini file that downloads and builds things, and another ini file that flags the download/build components with
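The two-ini-file split described here might be sketched roughly as follows. This is a hypothetical illustration only - the section names, option names, and URL are assumptions, not verified MTT syntax:

```ini
; Hypothetical stage-1 ini - run on a login node with internet access.
; Section and option names are illustrative, not actual MTT syntax.
[MiddlewareGet:ompi]
url = https://download.open-mpi.org/nightly/open-mpi/master

[MiddlewareBuild:ompi]
parallel_make = 4

; Hypothetical stage-2 ini - run inside the allocation. The get/build
; sections are flagged so previously fetched and built results are
; reused rather than repeated.
[TestRun:trivial]
np = 4
```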
I don't see how this would work with the current IU database reporter. Anyway, I've written the code and it appears to serve my purposes.
You are welcome to use your code - however, the methods I described work just fine with the current IU reporter. We use it every day precisely that way. The builds are reported correctly even if previously built.
Hi @hppritcha, I am trying to understand your use-case and how it differs from ours. Your process allocates the cluster from a compute node? How does this work? You said:
This means the process that requested an allocation runs on a compute node? Don't you need an allocation before running anything on a compute node?
@ribab, one invokes the allocation command from a front-end node - the one you get placed on when you ssh to the system. For example, with the ANL theta cluster, here's how you'd get an allocation:
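The command that followed in the original comment was lost in extraction. On a COBALT-scheduled system like theta, an interactive allocation request typically looks something like the following; the queue and project names are placeholders, not values from the original comment:

```shell
# Request an interactive COBALT allocation (illustrative values:
# -I = interactive, -n = node count, -t = wall time in minutes,
# -q = queue name, -A = project allocation).
qsub -I -n 1 -t 30 -q debug-cache-quad -A MyProject
```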
Upon granting the allocation, the user is placed on one of the internal mom nodes. A similar thing occurs on SLURM-configured systems like NERSC cori; one can try using the salloc command there.
The only way I can see one using the MTT ALPS plugin on theta would be for the allocate command to include the name of a script which would somehow continue the MTT run on the backend mom node. The front-end process running the ALPS plugin up to that point would be disconnected from whatever was going on in the backend.
The way the MTT SLURM and ALPS plugins are written, I suspect systems configured similarly to some of Cray's internal systems were being used. There, SLURM is configured so that when one does an salloc, one remains on the front-end nodes rather than being ssh'd into a compute or mom node on the backend. In that case, the plugins as-is, with their allocate/deallocate commands, should work fine. PBS was similarly configured on those systems.
closed via #916 |
The MTT python client as-is is not well suited for use on typical ECP systems: cori, theta, etc.
The problem is that functioning with the Open MPI community's IU database, getting the OMPI tarballs from the download site, and doing a git checkout of the OMPI tests all require internet connectivity.
On most ECP systems, however, only the front-end nodes have such internet connectivity. To run tests, one needs a resource allocation from SLURM, COBALT, etc. Once such a resource is granted, however, the requesting process is typically put on some compute node or mom node on the system. These nodes typically do not have internet access.
Several workarounds have been attempted in the past. The most recent, on SLURM systems, has been to request an allocation in such a way that the process issuing the salloc request remains on the login node. This has only partially worked, as this mode of using SLURM is atypical and is buggy.
A potentially simpler approach is to add a checkpoint/restart-like capability to MTT so one can break up an MTT run into a sequence of stages with different ini files. Some stages can be run on login nodes where internet connectivity is required; other stages can be run on the backend, typically to run the tests.
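A minimal sketch of the staged, checkpoint/restart-style flow proposed above, assuming a simple JSON snapshot file that records completed stages. The function names and snapshot format are hypothetical illustrations, not MTT's actual implementation:

```python
import json
import os


def run_stage(name, ini_file):
    """Placeholder for invoking the client against one ini file."""
    return f"{name} ({ini_file}): complete"


def run_pipeline(stages, snapshot_path):
    """Run stages in order, skipping any recorded as done in the snapshot.

    stages: list of (stage_name, ini_file) pairs. A login-node invocation
    can run just the internet-facing stages; a later backend invocation
    passes the full list and resumes from the snapshot to run the tests.
    """
    done = {}
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            done = json.load(f)
    for name, ini_file in stages:
        if name in done:
            continue  # completed by an earlier invocation
        done[name] = run_stage(name, ini_file)
        with open(snapshot_path, "w") as f:
            json.dump(done, f)  # snapshot after every stage
    return done
```

A login-node run would pass only the fetch/build stages; the backend run then passes the full stage list and picks up where the snapshot left off, so no stage requiring internet access ever executes on a compute or mom node.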