
chore(tee): launcher cleanup #549

Merged: 9 commits merged into main from kd/launcher-cleanup, Jun 23, 2025

Conversation

@kevindeforth (Contributor)

Follow-up to #524, resolving the most pressing clean-up issues raised during the review (#524 (review)).

  • RPC variables (the timeout for a request and the interval between successive requests) can now be passed through environment variables.
  • The default RPC behavior has changed (before: wait indefinitely and make only one request; now: wait at most 10 seconds and attempt up to 20 requests).
  • Additional type checks for user env variables (tags, registry, and docker_image name).
  • Print the stderr of failed processes instead of just the error code.

There are still a ton of panics in this code, and splitting up some of the bigger functions would improve readability. But in the interest of simplifying review and concentrating on the most pressing issues, that is deferred to later PRs.
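For illustration, the new environment-driven RPC configuration might be read along these lines (the variable names here are assumptions based on the description above, not necessarily the launcher's exact identifiers; the defaults match the new behavior of a 10-second timeout and up to 20 attempts):

```python
import os

def rpc_config(env=None):
    """Read RPC settings from the environment, falling back to the new
    defaults described above: a 10-second timeout and up to 20 attempts.
    The variable names are illustrative, not the launcher's exact ones."""
    if env is None:
        env = os.environ
    timeout_secs = float(env.get('RPC_TIMEOUT_SECS', '10'))
    max_attempts = int(env.get('RPC_MAX_ATTEMPTS', '20'))
    return timeout_secs, max_attempts
```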

Comment thread tee-launcher/launcher.py
print(
"[Warning] Exceeded number of maximum RPC requests for any given attempt. Will continue in the hopes of finding the matching image hash among remaining tags"
)
# Q: Do we expect all requests to succeed?
Contributor Author

do we?

Contributor

don't know

Collaborator

Feels weird to keep looping if we have failed the retry loops. If we hit rate limits, wouldn't it be better to use a pretty long exponential backoff? And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.

Contributor Author

> And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.

IIUC, we don't necessarily require these requests to succeed. It seems to me we are searching for a matching hash:

if config_digest == docker_image.digest:

Before this PR, we ignored any tag for which we got a response different from 200.
This meant that if we hit a rate limit at the hash of interest (or any other error that was recoverable), the launcher would not retry and eventually fail, since we would skip the tag of interest and probably not find it anymore.

With this PR, we try to give ourselves the best chance of success. This might waste some resources, but I don't understand this endpoint well enough to judge that. (I would guess @barakeinav1 has more insight here.)

For example, if we request a tag that does not exist, will we get a 400 or a 404? In such a case, we should probably just skip the tag. But given my limited understanding of this endpoint and how it is expected to behave, I thought it would be better to err on the side of caution and just retry up to the allotted limit.

If it turns out to be bad performance wise, we can always increase the timeout and decrease the maximum number of attempts through the environment variables.
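To make the trade-off concrete, the retry behavior described here can be sketched roughly as follows (hypothetical helper names; the real launcher.py differs in detail):

```python
import time

def find_matching_tag(tags, target_digest, fetch_digest,
                      max_attempts=20, interval_secs=0.0):
    """Scan tags for one whose config digest matches target_digest.

    Each tag is retried up to max_attempts times, so a transient error
    (e.g. a rate limit) on the tag of interest does not cause it to be
    skipped permanently, as could happen before this PR.
    """
    for tag in tags:
        for _attempt in range(1, max_attempts + 1):
            time.sleep(interval_secs)  # throttle; performance is not a priority here
            try:
                digest = fetch_digest(tag)
            except Exception:
                continue  # possibly recoverable error: retry this tag
            if digest == target_digest:
                return tag
            break  # definitive non-matching answer: move on to the next tag
    return None
```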

Contributor

I don't have any meaningful input here; I haven't written or tested this part.
I know that Thomas had some rate-limit issue, so that is why he added this logic.

@kevindeforth kevindeforth marked this pull request as ready for review June 23, 2025 09:11
@kevindeforth kevindeforth requested review from a user, barakeinav1 and netrome June 23, 2025 09:11
@netrome (Collaborator) left a comment

Nice stuff. Some thoughts and questions from me.

Comment thread tee-launcher/launcher.py
Comment on lines +27 to +28
# MUST be set to 1.
OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST'
Collaborator

Why do we need it if it must be set to 1?

Contributor Author

not sure, this was done by @barakeinav1

Contributor

yes, this forces Docker to validate the image hash when launching the container

@pbeza (Contributor) Jun 23, 2025

Probably worth adding that to the code comment then.

Suggested change
# MUST be set to 1.
OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST'
# MUST be set to 1 to enforce Docker to validate image signatures when launching containers
# https://docs.docker.com/engine/security/trust/#client-enforcement-with-docker-content-trust
OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST'

Comment thread tee-launcher/launcher.py Outdated
Comment on lines +33 to +36
# Dstack user config. Read from `DSTACK_USER_CONFIG_FILE`
USER_ENV_VAR_LAUNCHER_IMAGE_TAGS = 'LAUNCHER_IMAGE_TAGS'
USER_ENV_VAR_LAUNCHER_IMAGE_NAME = 'LAUNCHER_IMAGE_NAME'
USER_ENV_VAR_LAUNCHER_IMAGE_REGISTRY = 'LAUNCHER_REGISTRY'
Collaborator

I'm confused. The comment says we should read these from a file but we still have env variables for it? Why not just take a file path?

Contributor Author

Good point. I renamed to DSTACK_USER_CONFIG_*

Comment thread tee-launcher/launcher.py
# Default values for dstack user config file.
DEFAULT_LAUNCHER_IMAGE_NAME = 'nearone/mpc-node-gcp'
DEFAULT_REGISTRY = 'registry.hub.docker.com'
DEFAULT_LAUNCHER_IMAGE_TAG = 'latest'
Collaborator

We should use a specific tag here, right?

Contributor Author

@barakeinav1 will know this better than me, I preserved his logic

@barakeinav1 (Contributor) Jun 23, 2025

no, `latest` will pull the latest image, and it should be the latest voted node.

Contributor

Not really sure if relying on the latest tag is safe. What if someone hacks our Dockerhub and pushes some arbitrary exploit as the latest MPC launcher?

Contributor

From a security perspective we don't have a problem: we always validate against the approved hash, and a user can specify a tag if he wants to change it.
Using `latest` lets the user run the launcher without needing a config change as often.

Comment thread tee-launcher/launcher.py
return val.strip() == val


@dataclass(frozen=True)
Collaborator

Nice!
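For context, the `val.strip() == val` check quoted above (presumably one of the new type checks on user env variables) rejects values with leading or trailing whitespace. It might be wrapped in a validator along these lines (a sketch, not the launcher's actual code):

```python
def is_clean_value(val: str) -> bool:
    """Accept a user-supplied value only if it has no leading or trailing
    whitespace; stray spaces in tags or image names would otherwise
    produce confusing registry lookups."""
    return val.strip() == val
```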

Comment thread tee-launcher/launcher.py
print(
"[Warning] Exceeded number of maximum RPC requests for any given attempt. Will continue in the hopes of finding the matching image hash among remaining tags"
)
# Q: Do we expect all requests to succeed?
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels weird to keep looping if we have failed the retry loops. If we hit rate limits, wouldn't it be better to use a pretty long exponential backoff? And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.

Comment thread tee-launcher/launcher.py
"""
for attempt in range(1, rpc_max_attempts + 1):
# we sleep at the beginning, to ensure that we respect the timeout. Performance is not a priority in this case.
time.sleep(rpc_request_interval_secs)
Collaborator

We could consider increasing this interval on each loop by a factor of e.g. 1.5 until we hit a max interval of 1 minute (or any value that seems sensible).
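A minimal sketch of that suggestion (the 1.5 factor and one-minute cap are the values proposed above; this is not the launcher's actual code):

```python
def backoff_intervals(base_secs, factor=1.5, max_secs=60.0, attempts=20):
    """Yield per-attempt sleep intervals that grow by `factor` each
    iteration and are capped at `max_secs`."""
    interval = base_secs
    for _ in range(attempts):
        yield interval
        interval = min(interval * factor, max_secs)
```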

@kevindeforth kevindeforth added this pull request to the merge queue Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@ghost ghost added this pull request to the merge queue Jun 23, 2025
Merged via the queue into main with commit 5393551 Jun 23, 2025
2 checks passed
@ghost ghost deleted the kd/launcher-cleanup branch June 23, 2025 11:38
This pull request was closed.
4 participants