LSF integration issue #118

Open
hani1814 opened this issue Sep 5, 2018 · 5 comments

@hani1814

hani1814 commented Sep 5, 2018

Hi,

We have an LSF cluster that has been successfully tested with a Python bsub submission. JupyterHub / Anaconda installed on the same server works wonderfully with the out-of-the-box configuration.

I switched to batchspawner to test LSF, which comes as part of a SAS Grid installation.

To keep things simple, JupyterHub is running under my own account to avoid any spawning issues. I have replaced my username with myusername in the log below.
[I 2018-09-05 11:42:27.775 JupyterHub batchspawner:189] Spawner submitted script:
#!/bin/sh
#BSUB -R "select[type==any]" # Allow spawning on non-uniform hardware
#BSUB -R "span[hosts=1]" # Only spawn job on one server
#BSUB -q
#BSUB -J spawner-jupyterhub
#BSUB -o /home/myusername/.jupyterhub.lsf.out
#BSUB -e /home/myusername/.jupyterhub.lsf.err

    jupyterhub-singleuser --ip="0.0.0.0" --port=33546

[D 2018-09-05 11:42:27.780 JupyterHub base:427] 0/100 concurrent spawns
[D 2018-09-05 11:42:27.781 JupyterHub base:430] 0 active servers

...

[E 2018-09-05 11:27:08.847 JupyterHub user:427] Unhandled error starting myusername's server: /opt/sas94/thirdparty/platform/lsf/9.1/linux2.6-glibc2.3-x86_64/etc/eauth: read conf error!
Failed in an LSF library call: External authentication failed. Job not submitted.

@rkdarst
Contributor

rkdarst commented Sep 5, 2018

If you try submitting that batch script as the user the hub is running as, does it work? Depending on your config, this may be with a sudo wrapper or not; I'm not sure of the requirements for your cluster. The log line right above should have said what command was being used to submit the job. That would help to tell what is happening (can your user sudo to itself and lose something which is required to submit jobs?).

But... since it sounds like you are running as your username and spawning your username, the above may not apply.

More likely solution: some environment variables are missing which the bsub command needs. I see some of them are defined in LsfSpawner.get_env. Could some other variables be needed?

For example, on my cluster the job is submitted with "sudo -u {username}", and no further environment (like a kerberos ticket) is required to submit jobs.
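
Both checks above can be sketched in a few lines of Python, run as the same user as the hub. This is only a rough diagnostic under assumptions: the variable names in LSF_VARS are a guess at what a typical LSF profile exports (they are not taken from this site's setup), and the helper names missing_vars and submit are made up for illustration.

```python
import os
import subprocess

# Assumed list of variables that LSF client commands typically read;
# adjust it to whatever your site's LSF profile actually exports.
LSF_VARS = ("LSF_ENVDIR", "LSF_SERVERDIR", "LSF_BINDIR", "LSF_LIBDIR")


def missing_vars(env, names=LSF_VARS):
    """Return the names from `names` that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


def submit(script, cmd=("bsub",), env=None):
    """Pipe `script` to `cmd` on stdin and return (returncode, stdout, stderr).

    With env=None the current process environment is used, which makes it
    easy to compare a login shell against the environment the hub runs in.
    """
    result = subprocess.run(
        list(cmd),
        input=script,
        env=dict(os.environ) if env is None else dict(env),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr
```

Calling missing_vars(os.environ) from the hub's environment lists candidates to look at, and submit(script_text) sends a real job, so only run it on a submit host. Any variable that turns out to be needed could then be passed through with c.Spawner.env_keep or set explicitly with c.Spawner.environment in jupyterhub_config.py.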

@hani1814
Author

hani1814 commented Sep 6, 2018

Thank you for the feedback. Jupyterhub is running as myself. We’ve made progress by defining an extra env variable EGO_CONFDIR and added it to the env variable in py

I have limited LSF to the local server for now. I can now log in, but the notebook shuts down after 30s. Any idea how to address this?

We will then test the hub running under root and more servers.

@rkdarst
Contributor

rkdarst commented Sep 6, 2018

To clarify: EGO_CONFDIR was added where? Is this a generic LSF variable or something special to your site?

Assuming the job runs and the singleuser server starts, JupyterHub will automatically stop the server if it doesn't get a positive response that it's up. This should be visible in the logs. The singleuser server connects back using hub_connect_url, while hub_ip is the location that JH binds to. This can fail if the connect address and bind address need to be different, which might be the case if hub_ip is localhost or otherwise not accessible to spawners and the spawners run on a different node, or if there is a firewall.

If I get JH logs that show when it starts and is cancelled, I can possibly say more. Also, the singleuser server logs (stdout of the batch job) give important clues about whether it can connect back.
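
A quick way to test the connect-back path is to try a plain TCP connection from the node where the job runs to the address the hub advertises. The sketch below is only a reachability check, not how JupyterHub itself verifies the server; the helper name can_reach is made up, and the host and port you pass in are whatever your hub_connect_url resolves to.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run from the compute node, something like can_reach("hub-host", 8081) (hub-host is a placeholder; 8081 is JupyterHub's default hub port) returning False would point at the bind address or a firewall rather than at LSF.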

@rkdarst
Contributor

rkdarst commented Sep 6, 2019

Is this working enough (or no longer relevant) so that it can be closed now?

@jbeal-work

As an aside, we have batchspawner working with LSF.
