Conversation
| print( | ||
| "[Warning] Exceeded number of maximum RPC requests for any given attempt. Will continue in the hopes of finding the matching image hash among remaining tags" | ||
| ) | ||
| # Q: Do we expect all requests to succeed? |
There was a problem hiding this comment.
Feels weird to keep looping if we have failed the retry loops. If we hit rate limits, wouldn't it be better to use a pretty long exponential backoff? And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.
There was a problem hiding this comment.
And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.
IIUC, then we don't necessarily require these requests to succeed. Seems to me like we are searching for a matching hash
Line 393 in 3610a24
Before this PR, we ignored any tag for which we got a response different from 200.
This meant that if we hit a rate limit at the hash of interest (or any other error that was recoverable), the launcher would not retry and eventually fail, since we would skip the tag of interest and probably not find it anymore.
With this PR, we try to give us the best chances of success. This might lead to wasting some resources, but I don't understand this endpoint enough to judge that. (I would guess @barakeinav1 has more insights here)
For example, if we request a tag that does not exist, will we get a 400 or a 404? In such a case, we should probably just skip the tag. But given my limited understanding of this endpoint and how it is expected to behave, I thought it would be better to err on the side of caution and just retry up to the allotted limit.
If it turns out to be bad performance wise, we can always increase the timeout and decrease the maximum number of attempts through the environment variables.
There was a problem hiding this comment.
I don't have any meaningful input here. I haven't written or tested this part.
I know that Thomas had some rate limit issue, so this is way he added this logic
netrome
left a comment
There was a problem hiding this comment.
Nice stuff. Some thoughts and questions from me.
| # MUST be set to 1. | ||
| OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST' |
There was a problem hiding this comment.
Why do we need it if it must be set to 1?
There was a problem hiding this comment.
yes- this enforces docker to validate the hash when launching the container
There was a problem hiding this comment.
Probably worth adding that to the code comment then.
| # MUST be set to 1. | |
| OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST' | |
| # MUST be set to 1 to enforce Docker to validate image signatures when launching containers | |
| # https://docs.docker.com/engine/security/trust/#client-enforcement-with-docker-content-trust | |
| OS_ENV_DOCKER_CONTENT_TRUST = 'DOCKER_CONTENT_TRUST' |
| # Dstack user config. Read from `DSTACK_USER_CONFIG_FILE` | ||
| USER_ENV_VAR_LAUNCHER_IMAGE_TAGS = 'LAUNCHER_IMAGE_TAGS' | ||
| USER_ENV_VAR_LAUNCHER_IMAGE_NAME = 'LAUNCHER_IMAGE_NAME' | ||
| USER_ENV_VAR_LAUNCHER_IMAGE_REGISTRY = 'LAUNCHER_REGISTRY' |
There was a problem hiding this comment.
I'm confused. The comment says we should read these from a file but we still have env variables for it? Why not just take a file path?
There was a problem hiding this comment.
Good point. I renamed to DSTACK_USER_CONFIG_*
| # Default values for dstack user config file. | ||
| DEFAULT_LAUNCHER_IMAGE_NAME = 'nearone/mpc-node-gcp' | ||
| DEFAULT_REGISTRY = 'registry.hub.docker.com' | ||
| DEFAULT_LAUNCHER_IMAGE_TAG = 'latest' |
There was a problem hiding this comment.
We should use a specific tag here, right?
There was a problem hiding this comment.
@barakeinav1 will know this better than me, I preserved his logic
There was a problem hiding this comment.
no, latest will pull the latest, and it should be the latest voted node.
There was a problem hiding this comment.
Not really sure if relying on the latest tag is safe. What if someone hacks our Dockerhub and pushes some arbitrary exploit as the latest MPC launcher?
There was a problem hiding this comment.
from security perspective - we don't have a problem we always validated against approved hash.
and user can specify a tag if he wants to change it.
using will allow the user to use the launcher without adding config change more often
| return val.strip() == val | ||
|
|
||
|
|
||
| @dataclass(frozen=True) |
| print( | ||
| "[Warning] Exceeded number of maximum RPC requests for any given attempt. Will continue in the hopes of finding the matching image hash among remaining tags" | ||
| ) | ||
| # Q: Do we expect all requests to succeed? |
There was a problem hiding this comment.
Feels weird to keep looping if we have failed the retry loops. If we hit rate limits, wouldn't it be better to use a pretty long exponential backoff? And if we still fail I think we should raise an exception here and let the launcher fail explicitly instead of hanging and continue sending failing requests.
| """ | ||
| for attempt in range(1, rpc_max_attempts + 1): | ||
| # we sleep at the beginning, to ensure that we respect the timeout. Performance is not a priority in this case. | ||
| time.sleep(rpc_request_interval_secs) |
There was a problem hiding this comment.
We could consider increasing this interval on each loop by a factor of e.g. 1.5 until we hit a max interval of 1 minute (or any value that seems sensible).
Follow-up to #524, resolving the most pressing clean-up issues raised during the review (#524 (review)).
stderrof failed processes instead of just the error code.There are still a ton of panics in this code and splitting some of these bigger functions up would improve readability. But in the interest of simplifying review and concentrating on the most pressing issues, this is delayed to later PRs.