IPEX Multinode SSH Support #124
Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com>
Dependency Review: ✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.
RUN python -m pip install --no-cache-dir -r multinode-requirements.txt

ENV LD_LIBRARY_PATH="/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}:/usr/local/lib/python${PYTHON_VERSION}/dist-packages/oneccl_bindings_for_pytorch/opt/mpi/libfabric/lib:/usr/local/lib/python${PYTHON_VERSION}/dist-packages/oneccl_bindings_for_pytorch/lib"
Prepending /lib/x86_64-linux-gnu overrides conda's installation of OpenSSL, which conflicts with dpkg.
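To illustrate the ordering point: the dynamic loader searches the directories in LD_LIBRARY_PATH from left to right, so the first directory listed wins when two directories contain a library with the same name. Below is a minimal sketch of prepending a directory to a colon-separated list. The helper `prepend_path` and the `/opt/conda/lib` value are hypothetical and only serve the illustration; they are not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: put a directory at the front of a colon-separated list,
# omitting the separator when the existing list is empty.
prepend_path() {
    dir=$1
    current=$2
    if [ -n "$current" ]; then
        printf '%s:%s\n' "$dir" "$current"
    else
        printf '%s\n' "$dir"
    fi
}

# Illustrative value only; the conda path is an assumption, not taken from the PR.
existing="/opt/conda/lib"
new_path=$(prepend_path /lib/x86_64-linux-gnu "$existing")
echo "$new_path"
```

Guarding against an empty list avoids a stray leading or trailing colon, which the loader would treat as an empty entry.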
@dmsuehir @louie-tsai
Thanks, I'm testing this multinode base with a rebuilt version of our workflow container that has the SSH config removed from our dockerfile. I'll report back with the results when testing is done.
@tylertitsworth Based on my testing, this looks good. I ran on a single node and on multiple nodes with the updated container.
@HarshaRamayanam FYI
630f798 to 00d8147
LGTM
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.
LGTM. I tested this update with our k8s HF workflow.
12c2f53
Signed-off-by: Tyler Titsworth <tyler.titsworth@intel.com>
Co-authored-by: sharvil.shah <sharvils@mlp-prod-clx-5669.ra.intel.com>
Co-authored-by: sharvil10 <sharvil.shah@intel.com>
Co-authored-by: Jitendra Patil <jitendra.patil@intel.com>
Signed-off-by: ma-pineda <miguel.pineda.juarez@intel.com>
pytorch/Dockerfile
Outdated
COPY generate_ssh_keys.sh .

RUN cat /generate_ssh_keys.sh >> ~/.bashrc && \
What about the client public/private key pair? generate_ssh_keys.sh only covers the server key.
# hadolint global ignore=SC2002
RUN mkdir -p /var/run/sshd && \
    cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
We might also need to configure ssh_config as below, so that the client can reach an sshd that listens on a different port inside Docker:
Host *
    Port 2345
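As a rough sketch of how both client settings could be applied together, the helper below drops any existing line for an option and appends the new value, following the same grep/echo/mv pattern as the Dockerfile. It works on a scratch copy rather than /etc/ssh/ssh_config. The helper name `set_client_option` and the sample file contents are assumptions, and it presumes the file holds a single `Host *` block, since appended options attach to the last `Host` section.

```shell
#!/bin/sh
# Hypothetical helper: replace (or add) one option in a client ssh_config.
set_client_option() {
    # $1 = config file, $2 = option name, $3 = value
    cfg=$1; key=$2; value=$3
    # Drop any active line for the option; grep exits non-zero if nothing is left.
    grep -v "^[[:space:]]*$key[[:space:]]" "$cfg" > "$cfg.new" || true
    # Append the new setting, indented under the (assumed) final Host block.
    printf '    %s %s\n' "$key" "$value" >> "$cfg.new"
    mv "$cfg.new" "$cfg"
}

# Scratch copy with assumed contents; the real file lives in /etc/ssh.
workdir=$(mktemp -d)
printf 'Host *\n    StrictHostKeyChecking ask\n' > "$workdir/ssh_config"

set_client_option "$workdir/ssh_config" StrictHostKeyChecking no
set_client_option "$workdir/ssh_config" Port 2345   # port value taken from the comment above
cat "$workdir/ssh_config"
```

Whatever port the client uses has to match the port the server-side sshd listens on.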
    cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
We might also need to set sshd_config so that sshd uses the same non-default port on all instances, for example:
echo "Port 2345" >> /etc/ssh/sshd_config
    cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
I also set the sshd config below to allow root login without a password.
I'm not sure whether we need it here as well.
RUN sed -i'' -e's/^#PermitRootLogin prohibit-password$/PermitRootLogin yes/' /etc/ssh/sshd_config \
    && sed -i'' -e's/^#PasswordAuthentication yes$/PasswordAuthentication no/' /etc/ssh/sshd_config \
    && sed -i'' -e's/^#PermitEmptyPasswords no$/PermitEmptyPasswords yes/' /etc/ssh/sshd_config \
    && sed -i'' -e's/^UsePAM yes/UsePAM no/' /etc/ssh/sshd_conf
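To see what these substitutions change, they can be tried against a scratch file first. The sketch below uses the same sed expressions as the comment above; the sample contents are an assumption modelled on commented-out distribution defaults, not the actual sshd_config from the image. It writes to a new file and moves it into place instead of using `-i`, which keeps the example portable across sed implementations.

```shell
#!/bin/sh
# Scratch copy with assumed default lines; not the real /etc/ssh/sshd_config.
workdir=$(mktemp -d)
sample="$workdir/sshd_config"
cat > "$sample" <<'EOF'
#PermitRootLogin prohibit-password
#PasswordAuthentication yes
#PermitEmptyPasswords no
UsePAM yes
EOF

# Same expressions as in the review comment, applied in a single sed pass.
sed \
    -e 's/^#PermitRootLogin prohibit-password$/PermitRootLogin yes/' \
    -e 's/^#PasswordAuthentication yes$/PasswordAuthentication no/' \
    -e 's/^#PermitEmptyPasswords no$/PermitEmptyPasswords yes/' \
    -e 's/^UsePAM yes/UsePAM no/' \
    "$sample" > "$sample.new"
mv "$sample.new" "$sample"
cat "$sample"
```

Because the first three patterns are anchored to the exact commented-out default lines, they leave the file unchanged if the base image ships different defaults, so it is worth checking the result after the build.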
> User also need to append content of id_rsa.pub in `/etc/ssh/authorized_keys` in the SSH server container.
> Since the SSH key is not owned by default user account in docker, please also do "chmod 644 id_rsa.pub; chmod 644 id_rsa" to grant read access for default user account.
> Users could also use "/usr/bin/ssh-keygen -t rsa -b 4096 -N '' -f ~/mnt/ssh_key/id_rsa" to generate a new SSH Key inside the container.
> Users need to mount a config file to list all hostnames at location `/root/.ssh/config` on the SSH client container.
This could be handled by the Docker image if we configure /etc/ssh/sshd_config and ssh_config properly.
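For the case where a per-host client config is still wanted, a file such as the `/root/.ssh/config` mentioned in the README could be generated rather than written by hand. The sketch below emits one `Host` block per hostname. The function name `write_ssh_hosts`, the hostnames `worker-0` and `worker-1`, and the port are placeholders, not values from this PR.

```shell
#!/bin/sh
# Hypothetical helper: write one Host block per hostname into an ssh client config.
write_ssh_hosts() {
    # $1 = output file, $2 = port, remaining arguments = hostnames
    out=$1; port=$2; shift 2
    : > "$out"   # start from an empty file
    for host in "$@"; do
        printf 'Host %s\n    HostName %s\n    Port %s\n    StrictHostKeyChecking no\n\n' \
            "$host" "$host" "$port" >> "$out"
    done
    # Keep the config private to its owner.
    chmod 600 "$out"
}

# Placeholder hostnames and port, written to a scratch directory.
workdir=$(mktemp -d)
write_ssh_hosts "$workdir/config" 2345 worker-0 worker-1
cat "$workdir/config"
```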
2. Add hosts to config
This step could be simplified further if /etc/ssh is configured properly during the Docker build. We shouldn't push this effort onto users, and manual configuration is also error-prone.
Description
Mount your custom SSH config and SSH keys to configure SSH manually; otherwise, the container should generate a key and start a server for use in k8s.
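The "use mounted keys, otherwise generate" behaviour can be sketched roughly as follows. This is not the generate_ssh_keys.sh script from the PR, whose contents are not shown here. The function name `ensure_ssh_key` is hypothetical, and the ssh-keygen flags mirror the command quoted in the README diff, with `-q` added to suppress output.

```shell
#!/bin/sh
# Hypothetical helper: reuse a mounted private key if present, otherwise create one.
ensure_ssh_key() {
    # $1 = path to the private key
    key_path=$1
    if [ -f "$key_path" ]; then
        echo "using existing key: $key_path"
        return 0
    fi
    mkdir -p "$(dirname "$key_path")"
    # Generate a passphrase-less RSA key pair at the requested path.
    ssh-keygen -q -t rsa -b 4096 -N '' -f "$key_path"
    echo "generated new key: $key_path"
}
```

A call such as `ensure_ssh_key "$HOME/.ssh/id_rsa"` (the path is an assumption) would leave a mounted key untouched and only fall back to generation when no key is found.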
Related Issue
n/a
Changes Made
Validation
test_runner.py
with all existing tests passing, and I have added new tests where applicable.