Implement atomic deployments for Nulecule application. Fixes #421 #456
base: master
Conversation
Is this approach better than doing the undeploy inside of the provider files? So inside of the deploy() function for each provider, if there's an error, then call undeploy? Right now you have it all the way up at the NuleculeManager level. That's OK, just wondering where the best place to do this is.
Also, if we go with this approach then it affects all providers. Do you mind testing it with all providers to make sure it works?
I had thought of that as well. The reason I did not go ahead with it was that we wanted to undeploy the entire Nulecule application, not just the component which failed to deploy. Because of that, this is intended.
👍 with this. I need to test it on the other providers as well.
Let me know when the other providers have been tested. Also, we need to fix the unit tests. We will postpone the merge of this until after tomorrow's release.
So.. the provider undeploy() functions really need to have "ignore_errors" set as well. Essentially, where we have it now is outside of undeploy(). Take Kubernetes for example: a "component" could have multiple artifacts, which means multiple calls to the Kubernetes API. If the first call fails then subsequent artifacts won't get removed. undeploy() needs to know to ignore errors and attempt to remove each artifact.
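A minimal sketch of that idea, using illustrative names rather than atomicapp's real provider classes: undeploy() keeps going through the remaining artifacts when ignore_errors is set, recording failures instead of aborting on the first bad kubectl call.

```python
import logging

logger = logging.getLogger(__name__)

class KubernetesProvider:
    """Illustrative stand-in, not atomicapp's real provider class."""

    def __init__(self, artifacts, ignore_errors=False):
        self.artifacts = artifacts
        self.ignore_errors = ignore_errors
        self.failed = []

    def _call(self, cmd):
        # stand-in for shelling out to `kubectl delete -f <path>`;
        # here we simulate a failure for paths containing "bad"
        if "bad" in cmd[-1]:
            raise RuntimeError("kubectl failed for %s" % cmd[-1])

    def undeploy(self):
        for path in self.artifacts:
            cmd = ["kubectl", "delete", "-f", path]
            try:
                self._call(cmd)
            except Exception:
                if not self.ignore_errors:
                    raise
                # keep going: later artifacts should still be removed
                logger.warning("failed to remove %s, continuing", path)
                self.failed.append(path)
```

With ignore_errors=True a failing artifact is logged and skipped; with the default False the first failure still propagates, preserving the old behaviour for a normal stop.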
Good point 👍
Force-pushed cfd0321 to 3ef9939.
@@ -203,4 +203,8 @@ def undeploy(self):
     cmd = [self.kubectl, "delete", "-f", path, "--namespace=%s" % self.namespace]
     if self.config_file:
         cmd.append("--kubeconfig=%s" % self.config_file)
-    self._call(cmd)
+    try:
+        self._call(cmd)
Wouldn't it be best to add ignore_error to _call(), and then have _call() call run_cmd() with the checkexitcode= arg set? In that case the try/except would be all the way down in run_cmd(), which already has this functionality built in.
Force-pushed 3ef9939 to 7db3cca.
@rtnpro now that stop for OpenShift is in, can you update?
ping @rtnpro ^^
@dustymabe aye!
Force-pushed 445607b to 3e85bff.
@dustymabe Pushed changes for atomic deployment on the OpenShift provider as well.
@rtnpro I believe everything LGTM. Since we do have 4 providers and there probably isn't much code, can you go ahead and do this for Marathon as well? I should be able to test this stuff out tomorrow on Docker/Kubernetes. I hopefully will get an OpenShift setup working again tomorrow to test on that as well. In the meantime, can you confirm that you have tested that this works (failure on run yields a stop)?
OK, after running through some testing on this I'm not convinced this is the best path to take. We might be able to go with this, but I think we can do better. The sticking point I am on now is the state of the system when we start to deploy; after a failed deployment, the system should be in the same state it was in when we started.

One example of how to make our code "fail" and roll back is a half-deployed application. Essentially, part of the application will successfully deploy until it gets to the point at which it tries to deploy an artifact that already exists. That will fail and then we will "roll back" by undeploying the application. The problem with this is that the undeploy will remove the service that existed before we ran our code, since it removes all artifacts. Is this OK?

I think a better approach may be to, on deploy, run through all artifacts first to see if they already exist. If all artifacts pass the "exists" test then we can run through them all and create them. Considering everything I have written, the change I am proposing (don't start deploying until we've checked that no artifacts already exist) could actually be done in a separate PR. That PR would take care of the failure case.
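The pre-flight check proposed above might look like this; artifact_exists and create are hypothetical per-provider hooks, not real atomicapp APIs.

```python
def deploy(artifacts, artifact_exists, create):
    # refuse to start if any artifact is already deployed, so a later
    # rollback can never remove services that predate this deployment
    clashing = [a for a in artifacts if artifact_exists(a)]
    if clashing:
        raise RuntimeError("artifacts already exist: %s" % ", ".join(clashing))
    for artifact in artifacts:
        create(artifact)
```

Because the check runs before any create call, a clash leaves the system untouched.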
logger.error('Application run error: %s' % e)
logger.debug('Nulecule run error: %s' % e, exc_info=True)
logger.info('Rolling back changes')
self.stop(cli_provider, ignore_errors=True, **kwargs)
So in this case we still need to error out after the "stop" has been performed. Otherwise the application returns a good exit code and the user might not realize that it failed.
Yes!
@dustymabe pushed fix!
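The agreed behaviour can be sketched like this (NuleculeManager here is a simplified stand-in for the real class): roll back by stopping the application, then re-raise so the CLI exits non-zero instead of reporting success.

```python
import logging

logger = logging.getLogger(__name__)

class NuleculeManager:
    """Simplified stand-in for the real manager class."""

    def __init__(self):
        self.stopped = False

    def stop(self, ignore_errors=False):
        self.stopped = True

    def run(self):
        try:
            # stand-in for deploying the application's artifacts
            raise RuntimeError("simulated deploy failure")
        except Exception as e:
            logger.error("Application run error: %s", e)
            logger.info("Rolling back changes")
            self.stop(ignore_errors=True)
            raise  # propagate so the process returns a failure exit code
```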
Opened #501 for this.
Force-pushed ce9d033 to 1bdd309.
I tested this with a couple of my examples on Marathon and OpenShift. Everything looks fine 👍
Force-pushed 0ed4f1c to 629a96d.
@@ -302,6 +303,7 @@ def stop(self, provider_key=None, dryrun=False):
     provider_key, provider = self.get_provider(provider_key, dryrun)
     provider.artifacts = self.rendered_artifacts.get(provider_key, [])
     provider.init()
+    provider.ignore_errors = True
Is this supposed to be provider.ignore_errors = ignore_errors?
Good catch 👍
However, there's no use case where ignore_errors is going to come in as False here.
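A small sketch of the fix being discussed, with illustrative names: stop() forwards its own ignore_errors argument instead of hard-coding True, even if today's only caller passes True.

```python
class Provider:
    """Illustrative provider stub."""

    def __init__(self):
        self.ignore_errors = False

    def init(self):
        pass

def stop(provider, ignore_errors=False):
    provider.init()
    # forward the caller's flag instead of hard-coding True
    provider.ignore_errors = ignore_errors
```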
Force-pushed 629a96d to 68c2303.
    self.oc.delete(url)
except Exception as e:
    if not self.ignore_errors:
        raise e
The self.oc.scale() above (line 459) will fail if the rc doesn't exist in the target. We may need to consider putting the try/except deeper in the code. One extreme would be to put it at the Utils.make_rest_request() level; the other would be to simply try/except around the scale call above. Thoughts?
I think this may be the only part @rtnpro should have a look at; the other code LGTM.
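One middle ground from the thread, sketched with a fake client (oc here is a stub, not atomicapp's real OpenShift wrapper): wrap each REST call individually so a failing scale, e.g. a missing replication controller, does not stop the delete from being attempted.

```python
class FakeOc:
    """Stub client: scale always fails, delete records what it removed."""

    def __init__(self):
        self.deleted = []

    def scale(self, name, replicas=0):
        raise RuntimeError("replicationcontroller %r not found" % name)

    def delete(self, name):
        self.deleted.append(name)

def undeploy_artifact(oc, name, ignore_errors=True):
    # each call gets its own try/except, so one failure cannot
    # short-circuit the remaining cleanup steps
    for step in (lambda: oc.scale(name, replicas=0),
                 lambda: oc.delete(name)):
        try:
            step()
        except Exception:
            if not ignore_errors:
                raise
```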
Let's postpone merging this PR until after tomorrow's release.
Heads up that all tests pass for this 👍 although this PR needs rebasing now due to the time that's passed.
…tomic#421: When there's an error while running a Nulecule application, roll back the changes made by stopping the application.
Force-pushed 68c2303 to 2ffb11c.
@cdrage @dustymabe rebased!
Other than my one comment, LGTM and tests have passed!
Postponing until after GA, as we are going to limit our changes to bug fixes.
Just an update: we are still planning on implementing this, although the focus at the moment has been on converting the Docker and k8s providers to their respective API implementations.
When there's an error while running a Nulecule application, roll back the changes made by stopping the application.
This pull request just takes care of invoking stop on the main Nulecule application; it does not refactor the internal implementation of stop.