Loading model and build time #6

Open
deeplearningapps opened this issue Jun 30, 2023 · 1 comment

Comments

@deeplearningapps

Hi,

I tried following the README to run the LmCloudSpmd2BTest example on TPUv4 but couldn't load the model; this is the output of saxutil ls /sax/test/lm2b on admin:

INFO: Running command line: bazel-bin/saxml/bin/saxutil_/saxutil '--sax_root=gs://saxml-data/sax-root' ls /sax/test/lm2b
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
| MODEL | MODEL PATH                                            | CHECKPOINT PATH | # OF REPLICAS | (SELECTED) REPLICAADDRESS |
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
| lm2b  | saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd2BTest | None            | 0             |                           |
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
+--------+-----+
| METHOD | ACL |
+--------+-----+
+--------+-----+
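
The listing above reports 0 replicas and an empty replica address for lm2b. A minimal sketch of checking for that condition programmatically follows. It assumes the table layout printed by saxutil ls above stays as shown (this is not a documented format), and `parse_tables` and `models_without_replicas` are hypothetical helpers, not part of saxutil:

```python
# Sample copied from the saxutil output above (whitespace as it appears in the issue).
SAMPLE = """\
INFO: Running command line: bazel-bin/saxml/bin/saxutil_/saxutil '--sax_root=gs://saxml-data/sax-root' ls /sax/test/lm2b
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
| MODEL | MODEL PATH | CHECKPOINT PATH | # OF REPLICAS | (SELECTED) REPLICAADDRESS |
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
| lm2b | saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd2BTest | None | 0 | |
+-------+-------------------------------------------------------+-----------------+---------------+---------------------------+
+--------+-----+
| METHOD | ACL |
+--------+-----+
+--------+-----+
"""


def parse_tables(output):
    """Split ASCII tables (border, header, border, rows..., border) into lists of row dicts."""
    tables = []
    header, rows, borders = None, [], 0
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("+"):
            borders += 1
            if borders == 3:  # third border closes the current table
                tables.append([dict(zip(header or [], row)) for row in rows])
                header, rows, borders = None, [], 0
            continue
        if line.startswith("|") and line.endswith("|"):
            cells = [cell.strip() for cell in line[1:-1].split("|")]
            if header is None:
                header = cells  # first cell row after the opening border
            else:
                rows.append(cells)
    return tables


def models_without_replicas(output):
    """Return names of models whose '# OF REPLICAS' column is 0 in the first table."""
    tables = parse_tables(output)
    if not tables:
        return []
    return [row["MODEL"] for row in tables[0] if row.get("# OF REPLICAS") == "0"]


print(models_without_replicas(SAMPLE))
```

Lines that are not part of a table, such as the leading INFO line, are ignored, so the raw command output can be passed in directly.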

Here are the commands I used to start the admin and model server.
On admin:
bazel run saxml/bin:admin_config -- --sax_cell=/sax/test --sax_root=gs://saxml-data/sax-root --fs_root=gs://saxml-data/sax-fs-root --alsologtostderr
bazel run saxml/bin:admin_server -- --sax_cell=/sax/test --sax_root=gs://saxml-data/sax-root --port=10000 --alsologtostderr
saxutil publish /sax/test/lm2b saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd2BTest None 1

I0630 04:24:08.036908 19996 ipaddr.go:56] IPNet address 10.128.0.71
I0630 04:24:08.212039 19996 admin.go:305] Loaded config: fs_root: "gs://saxml-data/sax-fs-root"
I0630 04:24:08.248588 19996 addr.go:105] SetAddr /gcs/saxml-data/sax-root/sax/test/location.proto "10.128.0.71:10000"
I0630 04:24:08.298355 19996 admin.go:325] Updated config: fs_root: "gs://saxml-data/sax-fs-root"
I0630 04:24:08.455680 19996 mgr.go:781] Loaded manager state
I0630 04:24:08.455819 19996 mgr.go:784] Refreshing manager state every 10s
I0630 04:24:08.455895 19996 admin.go:350] Starting the server on port 10000
I0630 04:24:08.455957 19996 cloud.go:480] Starting the HTTP server on port 8080
I0630 14:22:11.800066 19996 state.go:456] Starting a queue that drains pending model server actions
I0630 14:22:11.800149 19996 state.go:473] Initializing state from model server 10.130.0.4:10001
I0630 14:22:11.810371 19996 state.go:479] Refreshing model server state every 10s
I0630 14:29:54.329640 19996 mgr.go:134] Published with overrides: map[]

On model server:
bazel run saxml/server:server -- --sax_cell=/sax/test --port=10001 --platform_chip=tpuv4 --platform_topology=2x2x1 --alsologtostderr

I0630 14:22:09.754312 139843449665280 model_service_base.py:852] Started joining SAX cell /sax/test
ERROR: logging before flag.Parse: I0630 14:22:11.754970 223228 location.go:141] Calling Join due to address update
ERROR: logging before flag.Parse: I0630 14:22:11.814963 223228 location.go:155] Joined 10.128.0.71:10000
ERROR: logging before flag.Parse: I0630 14:37:11.758835 223228 location.go:162] Calling Join at fixed interval
ERROR: logging before flag.Parse: I0630 14:37:11.814902 223228 addr.go:72] FetchAddr /gcs/saxml-data/sax-root/sax/test/location.proto "10.128.0.71:10000"
ERROR: logging before flag.Parse: I0630 14:37:11.843650 223228 location.go:172] Joined 10.128.0.71:10000

I also waited a while and ran saxutil ls /sax/test/lm2b again, but there is still nothing in the "selected replica address" column. Any idea what might have gone wrong?

I also noticed that the build on the model server takes a very long time. The first run of bazel run saxml/server:server -- --sax_cell=/sax/test --port=10001 --platform_chip=tpuv4 --platform_topology=2x2x1 --alsologtostderr took ~4.5 hrs (16268 s) to finish:

Target //saxml/server:server up-to-date:
bazel-bin/saxml/server/server.py
bazel-bin/saxml/server/server
INFO: Elapsed time: 16268.138s, Critical Path: 16222.45s
INFO: 5113 processes: 19 internal, 5091 linux-sandbox, 3 local.
INFO: Build completed successfully, 5113 total actions
INFO: Running command line: bazel-bin/saxml/server/server '--sax_cell=/sax/test' '--port=10001' '--platform_chip=tpuv4' '--platform_topology=2x2x1' --alsologtostderr

Succeeding ones only took a few seconds to complete. Is this expected behavior?
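
For reference, the timings in the Bazel summary line can be converted to hours with a few lines of Python. This is only a sketch: the regular expression assumes the "Elapsed time: ...s, Critical Path: ...s" format shown in the log above, and `build_hours` is a hypothetical helper name.

```python
import re

# Summary line copied from the build log above.
SUMMARY = "INFO: Elapsed time: 16268.138s, Critical Path: 16222.45s"


def build_hours(summary_line):
    """Return (elapsed_hours, critical_path_hours) parsed from a Bazel summary line."""
    match = re.search(
        r"Elapsed time: ([\d.]+)s, Critical Path: ([\d.]+)s", summary_line
    )
    if match is None:
        raise ValueError("not a Bazel build summary line")
    elapsed_s, critical_s = (float(group) for group in match.groups())
    return elapsed_s / 3600, critical_s / 3600


elapsed_h, critical_h = build_hours(SUMMARY)
print(f"elapsed: {elapsed_h:.2f} h, critical path: {critical_h:.2f} h")
# prints "elapsed: 4.52 h, critical path: 4.51 h"
```

The critical path is nearly equal to the total elapsed time, which suggests the first build was dominated by one long chain of dependent actions rather than by many independent ones.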

Thanks!

@zhihaoshan-google
Collaborator

Hi,

For your first question:
Could you confirm whether your TPU VM can access the GCS path "gs://saxml-data/sax-root"? Could you also upload the model server log from after you ran saxutil publish?
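
One way to check GCS access from the TPU VM is to list the path with gsutil. The sketch below wraps `gsutil ls` in Python; it assumes gsutil is installed and authenticated on the VM, and `can_list_gcs` is a hypothetical helper name, not something provided by saxml.

```python
import subprocess


def can_list_gcs(path, run=subprocess.run):
    """Run `gsutil ls <path>` and return (ok, detail).

    `run` is injectable so the function can be exercised without gsutil installed.
    """
    try:
        result = run(
            ["gsutil", "ls", path], capture_output=True, text=True, timeout=60
        )
    except FileNotFoundError:
        return False, "gsutil not found on PATH"
    except subprocess.TimeoutExpired:
        return False, "gsutil ls timed out"
    if result.returncode == 0:
        return True, result.stdout.strip()
    # On failure gsutil writes the reason (e.g. a permission error) to stderr.
    return False, result.stderr.strip()


# Example call on the TPU VM (not executed here):
#   ok, detail = can_list_gcs("gs://saxml-data/sax-root")
#   print("accessible" if ok else "NOT accessible", detail, sep="\n")
```

If the call reports a failure, the returned detail string is the first thing to look at before digging into the model server log.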

For "Succeeding ones only took a few seconds to complete. Is this expected behavior?":
Yes, this is expected: the first build populates the local Bazel cache, and subsequent builds reuse it. You can follow https://cloud.google.com/tpu/docs/v5e-inference#large_language_model_serving to use our Docker image directly, without building from source.
