
chore(tee): launcher cleanup #549

Merged: 9 commits merged into main from kd/launcher-cleanup, Jun 23, 2025

Conversation

@kevindeforth (Contributor)

Follow-up to #524, resolving the most pressing clean-up issues raised during the review (#524 (review)).

  • RPC variables (the timeout for a request and the interval between successive requests) can now be passed through environment variables.
  • The default RPC behavior has changed (before: wait indefinitely and make only one request; now: wait at most 10 seconds and attempt up to 20 requests).
  • Additional type checks for user env variables (tags, registry, and docker_image name).
  • Print the stderr of failed processes instead of just the error code.

There are still a ton of panics in this code, and splitting up some of the bigger functions would improve readability. But in the interest of simplifying review and concentrating on the most pressing issues, that is deferred to later PRs.
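For illustration, the new environment-driven RPC configuration might be read along these lines (the variable names here are assumptions based on the description above, not necessarily the launcher's exact identifiers; the defaults match the new behavior of a 10-second timeout and up to 20 attempts):

```python
import os

def rpc_config(env=None):
    """Read RPC settings from the environment, falling back to the new
    defaults described above: a 10-second timeout and up to 20 attempts.
    The variable names are illustrative, not the launcher's exact ones."""
    if env is None:
        env = os.environ
    timeout_secs = float(env.get('RPC_TIMEOUT_SECS', '10'))
    max_attempts = int(env.get('RPC_MAX_ATTEMPTS', '20'))
    return timeout_secs, max_attempts
```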

Comment thread tee-launcher/launcher.py
print(
"[Warning] Exceeded number of maximum RPC requests for any given attempt. Will continue in the hopes of finding the matching image hash among remaining tags"
)
# Q: Do we expect all requests to succeed?
Contributor Author

do we?

Contributor

don't know

Collaborator

Feels weird to keep looping if we have failed the retry loops. If we hit rate limits, wouldn't it be better to use a pretty long exponential backoff? And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.

Contributor Author

> And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.

IIUC, we don't necessarily require these requests to succeed. It seems to me we are searching for a matching hash:

if config_digest == docker_image.digest:

Before this PR, we ignored any tag for which we got a response different from 200.
This meant that if we hit a rate limit at the hash of interest (or any other error that was recoverable), the launcher would not retry and eventually fail, since we would skip the tag of interest and probably not find it anymore.

With this PR, we try to give ourselves the best chance of success. This might waste some resources, but I don't understand this endpoint well enough to judge that. (I would guess @barakeinav1 has more insight here.)

For example, if we request a tag that does not exist, will we get a 400 or a 404? In such a case, we should probably just skip the tag. But given my limited understanding of this endpoint and how it is expected to behave, I thought it would be better to err on the side of caution and just retry up to the allotted limit.

If it turns out to be bad performance wise, we can always increase the timeout and decrease the maximum number of attempts through the environment variables.
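To make the trade-off concrete, the retry behavior described here can be sketched roughly as follows (hypothetical helper names; the real launcher.py differs in detail):

```python
import time

def find_matching_tag(tags, target_digest, fetch_digest,
                      max_attempts=20, interval_secs=0.0):
    """Scan tags for one whose config digest matches target_digest.

    Each tag is retried up to max_attempts times, so a transient error
    (e.g. a rate limit) on the tag of interest does not cause it to be
    skipped permanently, as could happen before this PR.
    """
    for tag in tags:
        for _attempt in range(1, max_attempts + 1):
            time.sleep(interval_secs)  # throttle; performance is not a priority here
            try:
                digest = fetch_digest(tag)
            except Exception:
                continue  # possibly recoverable error: retry this tag
            if digest == target_digest:
                return tag
            break  # definitive non-matching answer: move on to the next tag
    return None
```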

Contributor

I don't have any meaningful input here; I haven't written or tested this part.
I know that Thomas had some rate-limit issue, so that is why he added this logic.

@kevindeforth kevindeforth marked this pull request as ready for review June 23, 2025 09:11
@kevindeforth kevindeforth requested review from a user, barakeinav1 and netrome June 23, 2025 09:11
@netrome (Collaborator) left a comment

Nice stuff. Some thoughts and questions from me.

Comment thread tee-launcher/launcher.py
Comment on lines +27 to +28
# MUST be set to 1.
OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST'
Collaborator

Why do we need it if it must be set to 1?

Contributor Author

not sure, this was done by @barakeinav1

Contributor

yes, this forces Docker to validate the image hash when launching the container

@pbeza (Contributor) Jun 23, 2025

Probably worth adding that to the code comment then.

Suggested change
# MUST be set to 1.
OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST'
# MUST be set to 1 to enforce Docker to validate image signatures when launching containers
# https://docs.docker.com/engine/security/trust/#client-enforcement-with-docker-content-trust
OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST'

Comment thread tee-launcher/launcher.py Outdated
Comment on lines +33 to +36
# Dstack user config. Read from `DSTACK_USER_CONFIG_FILE`
USER_ENV_VAR_LAUNCHER_IMAGE_TAGS = 'LAUNCHER_IMAGE_TAGS'
USER_ENV_VAR_LAUNCHER_IMAGE_NAME = 'LAUNCHER_IMAGE_NAME'
USER_ENV_VAR_LAUNCHER_IMAGE_REGISTRY = 'LAUNCHER_REGISTRY'
Collaborator

I'm confused. The comment says we should read these from a file but we still have env variables for it? Why not just take a file path?

Contributor Author

Good point. I renamed to DSTACK_USER_CONFIG_*

Comment thread tee-launcher/launcher.py
# Default values for dstack user config file.
DEFAULT_LAUNCHER_IMAGE_NAME = 'nearone/mpc-node-gcp'
DEFAULT_REGISTRY = 'registry.hub.docker.com'
DEFAULT_LAUNCHER_IMAGE_TAG = 'latest'
Collaborator

We should use a specific tag here, right?

Contributor Author

@barakeinav1 will know this better than me, I preserved his logic

@barakeinav1 (Contributor) Jun 23, 2025

no, `latest` will pull the latest image, and it should be the latest voted node.

Contributor

Not really sure if relying on the latest tag is safe. What if someone hacks our Dockerhub and pushes some arbitrary exploit as the latest MPC launcher?

Contributor

From a security perspective we don't have a problem: we always validate against the approved hash, and a user can specify a tag if he wants to change it.
Using `latest` lets the user run the launcher without needing a config change as often.

Comment thread tee-launcher/launcher.py
return val.strip() == val


@dataclass(frozen=True)
Collaborator

Nice!
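For context, the `val.strip() == val` check quoted above (presumably one of the new type checks on user env variables) rejects values with leading or trailing whitespace. It might be wrapped in a validator along these lines (a sketch, not the launcher's actual code):

```python
def is_clean_value(val: str) -> bool:
    """Accept a user-supplied value only if it has no leading or trailing
    whitespace; stray spaces in tags or image names would otherwise
    produce confusing registry lookups."""
    return val.strip() == val
```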

Comment thread tee-launcher/launcher.py
print(
"[Warning] Exceeded number of maximum RPC requests for any given attempt. Will continue in the hopes of finding the matching image hash among remaining tags"
)
# Q: Do we expect all requests to succeed?
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels weird to keep looping if we have failed the retry loops. If we hit rate limits, wouldn't it be better to use a pretty long exponential backoff? And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.

Comment thread tee-launcher/launcher.py
"""
for attempt in range(1, rpc_max_attempts + 1):
# we sleep at the beginning, to ensure that we respect the timeout. Performance is not a priority in this case.
time.sleep(rpc_request_interval_secs)
Collaborator

We could consider increasing this interval on each loop by a factor of e.g. 1.5 until we hit a max interval of 1 minute (or any value that seems sensible).
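A minimal sketch of that suggestion (the 1.5 factor and one-minute cap are the values proposed above; this is not the launcher's actual code):

```python
def backoff_intervals(base_secs, factor=1.5, max_secs=60.0, attempts=20):
    """Yield per-attempt sleep intervals that grow by `factor` each
    iteration and are capped at `max_secs`."""
    interval = base_secs
    for _ in range(attempts):
        yield interval
        interval = min(interval * factor, max_secs)
```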

@kevindeforth kevindeforth added this pull request to the merge queue Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@ghost ghost added this pull request to the merge queue Jun 23, 2025
Merged via the queue into main with commit 5393551 Jun 23, 2025
2 checks passed
@ghost ghost deleted the kd/launcher-cleanup branch June 23, 2025 11:38
This pull request was closed.
4 participants