I tried following the README to run the LmCloudSpmd2BTest example on TPUv4 but couldn't load the model; this is the output of saxutil ls /sax/test/lm2b on admin:
Here are the commands I used to start the admin and model server.
On admin:
bazel run saxml/bin:admin_config -- --sax_cell=/sax/test --sax_root=gs://saxml-data/sax-root --fs_root=gs://saxml-data/sax-fs-root --alsologtostderr
bazel run saxml/bin:admin_server -- --sax_cell=/sax/test --sax_root=gs://saxml-data/sax-root --port=10000 --alsologtostderr
saxutil publish /sax/test/lm2b saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd2BTest None 1
I0630 04:24:08.036908 19996 ipaddr.go:56] IPNet address 10.128.0.71
I0630 04:24:08.212039 19996 admin.go:305] Loaded config: fs_root: "gs://saxml-data/sax-fs-root"
I0630 04:24:08.248588 19996 addr.go:105] SetAddr /gcs/saxml-data/sax-root/sax/test/location.proto "10.128.0.71:10000"
I0630 04:24:08.298355 19996 admin.go:325] Updated config: fs_root: "gs://saxml-data/sax-fs-root"
I0630 04:24:08.455680 19996 mgr.go:781] Loaded manager state
I0630 04:24:08.455819 19996 mgr.go:784] Refreshing manager state every 10s
I0630 04:24:08.455895 19996 admin.go:350] Starting the server on port 10000
I0630 04:24:08.455957 19996 cloud.go:480] Starting the HTTP server on port 8080
I0630 14:22:11.800066 19996 state.go:456] Starting a queue that drains pending model server actions
I0630 14:22:11.800149 19996 state.go:473] Initializing state from model server 10.130.0.4:10001
I0630 14:22:11.810371 19996 state.go:479] Refreshing model server state every 10s
I0630 14:29:54.329640 19996 mgr.go:134] Published with overrides: map[]
On model server:
bazel run saxml/server:server -- --sax_cell=/sax/test --port=10001 --platform_chip=tpuv4 --platform_topology=2x2x1 --alsologtostderr
I0630 14:22:09.754312 139843449665280 model_service_base.py:852] Started joining SAX cell /sax/test
ERROR: logging before flag.Parse: I0630 14:22:11.754970 223228 location.go:141] Calling Join due to address update
ERROR: logging before flag.Parse: I0630 14:22:11.814963 223228 location.go:155] Joined 10.128.0.71:10000
ERROR: logging before flag.Parse: I0630 14:37:11.758835 223228 location.go:162] Calling Join at fixed interval
ERROR: logging before flag.Parse: I0630 14:37:11.814902 223228 addr.go:72] FetchAddr /gcs/saxml-data/sax-root/sax/test/location.proto "10.128.0.71:10000"
ERROR: logging before flag.Parse: I0630 14:37:11.843650 223228 location.go:172] Joined 10.128.0.71:10000
I also waited a while and ran saxutil ls /sax/test/lm2b again, but there is still nothing in the "selected replica address" column. Any idea what might have gone wrong?
One thing I also noticed is that the build time on the model server is very long. The first run of
bazel run saxml/server:server -- --sax_cell=/sax/test --port=10001 --platform_chip=tpuv4 --platform_topology=2x2x1 --alsologtostderr
took ~5 hrs to finish. Succeeding ones only took a few seconds to complete. Is this expected behavior?
For your first question:
Could you confirm whether your TPU VM can access the GCS path "gs://saxml-data/sax-root"? Could you also upload the model server log captured after running saxutil publish?
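One way to check the access question is sketched below. It assumes gsutil is installed and authenticated on the TPU VM, and it assumes the object to probe is <sax_root><sax_cell>/location.proto, which is inferred from the /gcs/saxml-data/sax-root/sax/test/location.proto path in the logs above. The function names location_path and check_gcs_access and the GSUTIL override are hypothetical helpers for illustration, not part of saxml.

```shell
# Sketch only: verify that this VM can read the SAX cell's location file on GCS.
# Assumption: gsutil is installed and authenticated; override GSUTIL for a dry run.
GSUTIL="${GSUTIL:-gsutil}"
SAX_ROOT="${SAX_ROOT:-gs://saxml-data/sax-root}"
SAX_CELL="${SAX_CELL:-/sax/test}"

# Build <sax_root><sax_cell>/location.proto, dropping one trailing slash from the root.
# The layout is an assumption inferred from the admin and model server logs.
location_path() {
  printf '%s%s/location.proto\n' "${1%/}" "$2"
}

# List the location file; report success or failure without printing gsutil output.
check_gcs_access() {
  path="$(location_path "$SAX_ROOT" "$SAX_CELL")"
  if "$GSUTIL" ls "$path" >/dev/null 2>&1; then
    echo "OK: can read $path"
  else
    echo "FAIL: cannot read $path"
    return 1
  fi
}
```

Running check_gcs_access on the model server VM should print an OK line if the VM can read the bucket. For the log, appending 2>&1 | tee server.log to the bazel run saxml/server:server command keeps a copy of the output in a file (server.log is just an example name).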
For "Succeeding ones only took a few seconds to complete. Is this expected behavior?":
Yes, this is expected: the first build populates the local Bazel cache, and subsequent builds reuse it. You can follow https://cloud.google.com/tpu/docs/v5e-inference#large_language_model_serving to use our Docker image directly, without building from source.