Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SageMaker fixes for archive mode and synchronous deployment/deletion execution #717

Merged
merged 41 commits into from Nov 21, 2018

Conversation

Projects
None yet
2 participants
@dbczumar
Copy link
Collaborator

commented Nov 15, 2018

TODO:

  • Follow up with a PR that introduces tests after #718 is merged

@dbczumar dbczumar requested a review from aarondav Nov 16, 2018

:param archive: If True, any pre-existing SageMaker application resources that become inactive
(i.e. as a result of deploying in ``mlflow.sagemaker.DEPLOYMENT_MODE_REPLACE``
mode) are preserved. If False, these resources are deleted.
:param archive: If ``True``, any pre-existing SageMaker application resources that become

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 17, 2018

Contributor

What resources are preserved, btw?

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 20, 2018

Contributor

Still question

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

archive will preserve SageMaker models and endpoint configurations associated with the old version of the application endpoint. I've updated the documentation to reflect this information.

self.cleaned_up = False

def await_completion(self, timeout_seconds):
begin = datetime.now()

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 17, 2018

Contributor

I would generally expect time.time() used here, since that's time relative to Unix epoch. Does datetime.now() have potential for daylight savings to move the time backwards/forwards?

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Good call. datetime.now() doesn't seem safe in the case of daylight savings time. I've switched to using time.time().

:return: S3 path of the uploaded artifact.
"""
sess = boto3.Session()
sess = boto3.Session(region_name=region_name)

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

S3 client already embeds the region -- why was this change needed?

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

By default, the region name is resolved from AWS credentials on the caller's machine. However, the user may not always want to use the default region when creating SageMaker and S3 resources. We allow the user to specify a region name as an argument to deploy(); all boto clients and sessions should use this region name.

However, the current behavior still uses the default region name, which may lead to unexpected behavior.

region_name=region_name,
s3_client=s3_client)
if endpoint_exists:
deployment_operation = _update_sagemaker_endpoint(

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

Having two methods that take a lot of arguments, most of them the same, suggests that there may be an internal abstraction we could build that might simplify/reduce duplication. A lot of the code in these functions does look similar; it may be worth considering if we could simplify here.

I don't have a particular idea in mind, though, so it's cool not to do anything if the resultant code is likely to be more complicated or less easily extensible.

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

Two slightly odd things, btw. Both the code here and in update lookup the endpoint, and seem to wait for some notion of success.

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

Oops, just noticed that the update code is relevant for the status update stuff, so that's already minimal.

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

I think attempting to break this apart using additional abstractions may hurt readability and extensibility; the logical path is pretty distinct for each routine.

return _SageMakerOperationStatus.failed(failure_reason)

def cleanup_fn():
print("There are no resources to clean up!")

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

Won't this print unconditionally on create? Probably no reason to do that

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Removed the print statement and made the method body a pass.

def status_check_fn():
endpoint_info = sage_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_update_was_rolled_back = (
endpoint_info["EndpointStatus"] == "InService"

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

Is it guaranteed that after calling update_endpoint, the status immediately changes to InProgress? I'm worried about the case where there may be some delay, which could cause us to immediately consider the update failed.

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Added a mandatory 20 second delay before the endpoint status is used to determine whether or not the update succeeded

help=("If specified, resources associated with the application are preserved."
" Otherwise, these resources are deleted. `--archive` must be specified when"
" deleting asynchronously with `--asynchronous`."))
@click.option("--asynchronous", is_flag=True,

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

Maybe just --async?

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Shortened the dashed parameter name to --async and added asynchronous as an internal variable name for use by the handling function to avoid conflict with the async keyword in Python 3.

sage_client=sage_client)

if synchronous:
eprint("Waiting for the deployment operation to complete...")

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

Btw, we may want to add some indication that we're still waiting (e.g., periodically printing the current status).

"The deployment operation failed with the following error message:"
" \"{error_message}\"".format(error_message=operation_status.message))
elif not archive:
eprint("Cleaning up unused resources...")

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 19, 2018

Contributor

Can probably print this just in delete() and update(), otherwise the create case is kind of weird:

Cleaning up unused resources...
There are no resources to clean up!

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Good point. Added an additional clause to the conditional:

elif not archive and mode is not DEPLOYMENT_MODE_CREATE:
    # Print and clean up

dbczumar added some commits Nov 20, 2018

status = self.status_check_fn()
if status.state == _SageMakerOperationStatus.STATE_IN_PROGRESS:
if iteration % 2 == 0:
# Log the progress status roughly every 10 seconds

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 20, 2018

Contributor

Let's downgrade to every 30 seconds, this gives several screen-fulls of logs right now to do a deployment.

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Downgraded to 20 seconds after offline discussion about my impatience w.r.t longrunning operations.

raise MlflowException(
"The deployment operation failed with the following error message:"
" \"{error_message}\"".format(error_message=operation_status.message))
elif not archive and mode is not DEPLOYMENT_MODE_CREATE:

This comment has been minimized.

Copy link
@aarondav

aarondav Nov 20, 2018

Contributor

Maybe instead of this if statement (which makes assumptions about the implementation of create's cleanup), we can simply not have a logger.info here and assume the cleanup method will log as appropriate?

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Good call. Removed the additional condition and moved the "Cleaning up..." log into the cleanup_fn function defined in the _update_sagemaker_endpoint method. I also moved the log statement in delete into the delete#cleanup_fn method.

@aarondav aarondav added the LGTM label Nov 20, 2018

@aarondav

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

Btw, might be good to print a terminal log message like "Update complete". Right now it just says

2018/11/20 14:42:56 INFO mlflow.sagemaker: The update operation is still in progress. Current endpoint status: "Updating"

and then quits.

dbczumar added some commits Nov 20, 2018

@dbczumar
Copy link
Collaborator Author

left a comment

@aarondav Addressed your comments and added operation completion logs for update/create and delete

raise MlflowException(
"The deployment operation failed with the following error message:"
" \"{error_message}\"".format(error_message=operation_status.message))
elif not archive and mode is not DEPLOYMENT_MODE_CREATE:

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Good call. Removed the additional condition and moved the "Cleaning up..." log into the cleanup_fn function defined in the _update_sagemaker_endpoint method. I also moved the log statement in delete into the delete#cleanup_fn method.

def status_check_fn():
endpoint_info = sage_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_update_was_rolled_back = (
endpoint_info["EndpointStatus"] == "InService"

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Added a mandatory 20 second delay before the endpoint status is used to determine whether or not the update succeeded

status = self.status_check_fn()
if status.state == _SageMakerOperationStatus.STATE_IN_PROGRESS:
if iteration % 2 == 0:
# Log the progress status roughly every 10 seconds

This comment has been minimized.

Copy link
@dbczumar

dbczumar Nov 20, 2018

Author Collaborator

Downgraded to 20 seconds after offline discussion about my impatience w.r.t longrunning operations.

@dbczumar dbczumar changed the title [WIP] SageMaker fixes for archive mode and synchronous deployment/deletion execution SageMaker fixes for archive mode and synchronous deployment/deletion execution Nov 20, 2018

dbczumar added some commits Nov 20, 2018

@dbczumar dbczumar merged commit 80dd2a6 into mlflow:master Nov 21, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.