
combining multiple job starts #12

Open
bernstei opened this issue Mar 28, 2022 · 5 comments
@bernstei (Contributor)

On some remote machines just the ssh connection is somewhat slow. It would be nice if multiple job start commands could be combined, perhaps by gathering all the remote commands into an array of strings, and then running all of them in a single ssh connection.
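A minimal sketch of that idea (hypothetical helper names, not ExPyRe's actual API): join the per-job commands into one script so the whole batch travels over a single ssh connection.

```python
import subprocess

def batch_commands(commands):
    """Join per-job shell commands into a single script, so one
    ssh connection carries the whole batch."""
    return " && ".join(commands)

def run_remote_batch(host, commands):
    # One ssh round-trip for the entire batch, instead of one per job.
    return subprocess.run(["ssh", host, batch_commands(commands)],
                          capture_output=True, text=True)
```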

@bernstei bernstei changed the title combining job start combining multiple job starts Mar 28, 2022
@bernstei (Contributor, Author)

Note - it's unclear, in retrospect, what makes these remote job starts slow. Need to investigate further before determining how to increase rate.

@bernstei (Contributor, Author) commented Apr 1, 2022

Looks like the staging in of files and ssh qsub each take a non-negligible time (around 1s). Both would need to be batched to fully help.

@gabor1 (Contributor) commented Apr 4, 2022

Is this really an issue? I guess you are already batching individual configs, so it won't be the case that you'd want to qsub 10,000 individual jobs (many queueing systems would choke on that as well).

@bernstei (Contributor, Author) commented Apr 4, 2022

It is when you have 1000 jobs (one per config, to re-evaluate an entire fitting database with tighter DFT parameters) and each one takes 3 seconds, because the rsync to stage in files takes 1.5 s and the ssh to qsub takes another 1.5 s. I guess I could set chunksize=1 and job_chunksize > 1 to do job_chunksize DFT evaluations per job, reducing the number of rsync/ssh+qsub calls by a factor of job_chunksize. Maybe that's the right approach.

@bernstei (Contributor, Author)

I have a solution for this: ExPyRe, system, and scheduler can all be told to store information in a buffer, and then start all the jobs in the buffer at once (one ssh to set up the directories, one rsync to stage in the run dirs, and one ssh to submit all the jobs). A PR will be available eventually - it would be useful if people tested the SGE implementation, which I do not have access to.
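The buffering scheme described above might look roughly like this (a sketch with hypothetical names, not the actual ExPyRe implementation): accumulate pending jobs, then flush the whole buffer with one rsync and one ssh.

```python
import subprocess

def build_submit_script(jobs):
    """Combine (remote_dir, submit_cmd) pairs into one shell script
    that submits every buffered job in a single ssh session.
    Subshells keep each cd from leaking into the next job."""
    return " && ".join(f"(cd {rdir} && {cmd})" for rdir, cmd in jobs)

class BufferedSubmitter:
    def __init__(self, host):
        self.host = host
        self.pending = []  # list of (local_dir, remote_dir, submit_cmd)

    def add(self, local_dir, remote_dir, submit_cmd):
        # Buffer the job instead of submitting it immediately.
        self.pending.append((local_dir, remote_dir, submit_cmd))

    def flush(self):
        if not self.pending:
            return
        # One rsync stages in all run dirs at once ...
        local_dirs = [p[0] for p in self.pending]
        subprocess.run(["rsync", "-a", *local_dirs, f"{self.host}:"],
                       check=True)
        # ... and one ssh submits every buffered job.
        script = build_submit_script([(r, c) for _, r, c in self.pending])
        subprocess.run(["ssh", self.host, script], check=True)
        self.pending = []
```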
