JAX cannot find TPU metadata inside a container #10923
Comments
None of these env variables are present on the host - it populates them dynamically
Seems like the sandbox cannot reach the GCE metadata server for some reason. I'm surprised that
runsc.log.20240918-170153.986977.boot.txt
It looks like From runsc.log.20240918-170153.986977.boot.txt:
You'll also want to run From runsc.log.20240918-170153.986977.boot.txt:
FWIW I tested this on my own V5 GCE VM and it worked. Here was my
I'll just add as a note that these environment variables are piped into the configs automatically in GKE. In GCE you'll either have to add the environment variables to your spec yourself (maybe by fetching them from the metadata server before starting a sandbox) or give the sandbox at least some host network access.
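One way to act on the GCE suggestion above is to read the values from the metadata server on the host and append them to process.env in the OCI config.json before starting the sandbox. The following is a rough sketch under stated assumptions, not the maintainers' method: the helper names fetch_attribute and add_env_to_spec are hypothetical, and the variable-to-attribute mapping in the usage comment is an assumption that should be checked against what JAX actually reads on your VM.

```python
import json
import urllib.request

# Standard GCE metadata endpoint for instance attributes.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"
)


def fetch_attribute(key, timeout=2.0):
    """Read one instance attribute from the GCE metadata server (run on the host)."""
    req = urllib.request.Request(
        METADATA_URL + key, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def add_env_to_spec(spec, env_to_attribute, fetch=fetch_attribute):
    """Append NAME=value entries to spec["process"]["env"].

    env_to_attribute maps an environment variable name to a metadata
    attribute key. Variables already present in the spec are left alone.
    """
    process = spec.setdefault("process", {})
    env = process.setdefault("env", [])
    existing = {entry.split("=", 1)[0] for entry in env}
    for name, attribute in env_to_attribute.items():
        if name in existing:
            continue
        env.append(f"{name}={fetch(attribute)}")
    return spec


# Usage on the host, before starting the sandbox. The mapping below is an
# assumption for illustration only; substitute the variables JAX reports
# as missing.
#
#   with open("config.json") as f:
#       spec = json.load(f)
#   add_env_to_spec(spec, {"TPU_ACCELERATOR_TYPE": "accelerator-type"})
#   with open("config.json", "w") as f:
#       json.dump(spec, f, indent=2)
```

Passing the fetch function as a parameter keeps the spec-editing step separate from the network call, so it can be exercised without a metadata server.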
@pawalt were you able to get something working for your use case? It would be great to be able to close this bug if so.
@manninglucas I haven't had time to get this working, but that's not a gVisor issue - we've just been swamped with other work. You're welcome to close out this ticket since it'll probably go stale. Your env var fixes look reasonable anyway. I'm happy to reopen this or open another ticket if I get a chance to try this out again and run into the same issues.
Description
When a TPU container is initialized, it's missing some environment variables that JAX needs in order to initialize. In the absence of these variables, JAX attempts to look up their values over the network. This fails as the container may not have direct access to the network.
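To check whether the network lookup described above can succeed at all, a small probe can be run inside the container. This is only a diagnostic sketch: metadata_server_reachable is a hypothetical helper name, and it queries the standard GCE instance/id path rather than any TPU-specific attribute.

```python
import urllib.request

# A metadata path that exists on every GCE instance.
PROBE_URL = "http://metadata.google.internal/computeMetadata/v1/instance/id"


def metadata_server_reachable(opener=urllib.request.urlopen, timeout=2.0):
    """Return True if the GCE metadata server answers from this environment."""
    req = urllib.request.Request(PROBE_URL, headers={"Metadata-Flavor": "Google"})
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    print("metadata server reachable:", metadata_server_reachable())
```

If this prints False inside the sandbox but True on the host, the lookup is being blocked by the sandbox's network setup rather than by JAX itself.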
I have also tried this with network=host to no avail.
Steps to reproduce
Run a jax image with --tpuproxy:
Runsc command:
Start the container:
I've built this image by exporting the following dockerfile:
runsc version
docker version (if using docker)
I'm not using docker
uname
uname -a Linux t1v-n-1f714773-w-0 5.19.0-1022-gcp #24~22.04.1-Ubuntu SMP Sun Apr 23 09:51:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
git describe release-20240826.0-81-g4bcbb55fc
runsc debug logs (if available)
No response