This repository has been archived by the owner on Jan 19, 2018. It is now read-only.

Implement atomic deployments for Nulecule application. Fixes #421 #456

Open · wants to merge 5 commits into master

Conversation

@rtnpro (Contributor) commented Dec 15, 2015

When there's an error while running a Nulecule application, roll back the changes made by stopping the application.

This pull request just takes care of invoking stop on the main Nulecule application; it does not refactor the internal implementation of stop.
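For orientation, a minimal sketch of that pattern (the class shell, the _run_application() helper, and the argument names are illustrative rather than the project's actual API; the trailing raise reflects a point brought up later in the review):

```python
import logging

logger = logging.getLogger(__name__)


class NuleculeManager(object):
    """Sketch only: the run/rollback flow, not the project's real class."""

    def run(self, cli_provider=None, **kwargs):
        try:
            self._run_application(cli_provider, **kwargs)  # hypothetical helper
        except Exception as e:
            logger.error('Application run error: %s' % e)
            logger.info('Rolling back changes')
            # Stop the whole Nulecule application, ignoring per-artifact
            # errors so the rollback runs to completion.
            self.stop(cli_provider, ignore_errors=True, **kwargs)
            # Re-raise so the failure still surfaces to the caller
            # (this point comes up later in the review).
            raise
```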

@rtnpro added this to the CDK 2 GA milestone Dec 15, 2015
@dustymabe (Contributor)

Is this approach better than doing the undeploy inside of the provider files? So inside of the deploy() function for each provider, if error, then call undeploy?

Right now you have it all the way up at the NuleculeManager level.. that is ok, just wondering where the best place to do this is.

@dustymabe (Contributor)

Also, if we go with this approach then it affects all providers. Do you mind testing it with all providers and check to make sure it works?

@rtnpro (Contributor, Author) commented Dec 15, 2015

@dustymabe

> Is this approach better than doing the undeploy inside of the provider files? So inside of the deploy() function for each provider, if error, then call undeploy?

I had thought of that as well. The reason I did not go ahead with it is that we wanted to undeploy the entire Nulecule application, not just the component that failed to deploy.

> Right now you have it all the way up at the NuleculeManager level.. that is ok, just wondering where the best place to do this is.

Because of the above explanation, this is intended.

> Also, if we go with this approach then it affects all providers. Do you mind testing it with all providers and check to make sure it works?

👍 with this. I need to test it on the other providers as well.

@dustymabe (Contributor)

Let me know when other providers have been tested. Also we need to fix unit tests.

We will postpone the merge of this until after tomorrow's release.

@dustymabe (Contributor)

So... the provider undeploy() functions really need to have "ignore_errors" set as well. Essentially, where we have it now is outside of undeploy(). Take kubernetes for example; a "component" could have multiple artifacts, which means multiple calls to the kubernetes API. If the first call fails, then subsequent artifacts won't get removed.

undeploy() needs to know to ignore errors and attempt to remove each artifact.
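A hedged sketch of what that could look like for the kubernetes provider; the kubectl command construction follows the diff below, while the loop, the ignore_errors attribute, and the class shell are illustrative:

```python
import logging

logger = logging.getLogger(__name__)


class KubernetesProviderSketch(object):
    """Sketch only: per-artifact error handling inside undeploy()."""

    def undeploy(self):
        for path in self.artifacts:
            cmd = [self.kubectl, "delete", "-f", path,
                   "--namespace=%s" % self.namespace]
            if self.config_file:
                cmd.append("--kubeconfig=%s" % self.config_file)
            try:
                self._call(cmd)
            except Exception as e:
                if not getattr(self, 'ignore_errors', False):
                    raise
                # Keep going so the remaining artifacts still get removed.
                logger.warning('Failed to delete %s: %s', path, e)
```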

@rtnpro (Contributor, Author) commented Dec 18, 2015

@dustymabe

> undeploy() needs to know to ignore errors and attempt to remove each artifact.

Good point 👍

```diff
@@ -203,4 +203,8 @@ def undeploy(self):
     cmd = [self.kubectl, "delete", "-f", path, "--namespace=%s" % self.namespace]
     if self.config_file:
         cmd.append("--kubeconfig=%s" % self.config_file)
-    self._call(cmd)
+    try:
+        self._call(cmd)
```
Contributor:

Wouldn't it be best to add ignore_error to _call() and then have _call() call run_cmd() with the checkexitcode= arg set?

In that case the try/except would be all the way down in run_cmd(), which already has this functionality built in.
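Roughly, the suggestion amounts to something like this (run_cmd() and its checkexitcode argument are as described above; the ignore_error name and the rest of the body are illustrative, and the import of run_cmd is omitted here):

```python
# Method sketch only: push the error handling into the existing helpers
# instead of wrapping every call site in try/except. run_cmd() is the
# project's existing command helper; its import path is omitted.
def _call(self, cmd, ignore_error=False):
    # Skip exit-code checking when the caller asked to ignore errors,
    # so any try/except ends up inside run_cmd() itself.
    return run_cmd(cmd, checkexitcode=not ignore_error)
```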

@dustymabe (Contributor)

@rtnpro now that stop for openshift is in, can you update undeploy() for it as well?

@dustymabe (Contributor)

ping @rtnpro ^^

@rtnpro (Contributor, Author) commented Jan 6, 2016

@dustymabe aye!

@rtnpro (Contributor, Author) commented Jan 11, 2016

@dustymabe Pushed changes for atomic deployment on the OpenShift provider as well.

@dustymabe (Contributor)

@rtnpro I believe everything LGTM. Since we do have 4 providers and there probably isn't much code, can you go ahead and do this for Marathon as well?

I should be able to test this stuff out tomorrow on docker/kubernetes. I hopefully will get an OpenShift setup working again tomorrow to test on that as well. In the meantime, can you confirm that you have tested that this works (a failure on run yields a stop)?

@dustymabe (Contributor)

OK, after running through some testing on this I'm not convinced this is the best path to take. We might be able to go with this, but I think we can do better.

The sticking point I am on now is the state of the system when we start to deploy; after a failed deployment the system should be in the same state it was in when we started the deployment. One example of how to make our code "fail" and roll back is to have a half-deployed application. Essentially, part of the application will successfully deploy until it gets to the point at which it tries to deploy an artifact that already exists. That will fail and then we will "roll back" by undeploying the application.

The problem with this is that the undeploy will remove the service that existed before we ran our code, since it removes all artifacts. Is this OK?

I think a better approach may be to, on deploy, run through all artifacts first to see if they already exist. If all artifacts pass the "exists" test, then we can run through them all and create them.

Considering everything I have written, the change I am proposing (don't start the deploy until we've checked that no artifacts already exist) could actually be done in a separate PR. That PR would take care of the failure case.
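As a sketch of that pre-flight check (the artifact_exists() and deploy_artifact() helpers are hypothetical stand-ins for provider internals):

```python
class ProviderSketch(object):
    """Sketch only: pre-flight existence check before deploying anything."""

    def deploy(self):
        # Refuse to deploy if any artifact already exists, so a failed run
        # never has to roll back resources it did not create.
        existing = [a for a in self.artifacts if self.artifact_exists(a)]
        if existing:
            raise Exception('Artifacts already exist: %s' % existing)
        for artifact in self.artifacts:
            self.deploy_artifact(artifact)  # hypothetical per-artifact deploy
```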

```python
logger.error('Application run error: %s' % e)
logger.debug('Nulecule run error: %s' % e, exc_info=True)
logger.info('Rolling back changes')
self.stop(cli_provider, ignore_errors=True, **kwargs)
```
Contributor:

So in this case we still need to error out after the "stop" has been performed. Otherwise the application returns a good exit code and the user might not realize that it failed.

Contributor Author:

Yes!

Contributor Author:

@dustymabe pushed fix!

@dustymabe (Contributor)

> I think a better approach may be to, on deploy, run through all artifacts first to see if they already exist. If all artifacts pass the "exists" test, then we can run through them all and create them.
>
> Considering everything I have written, the change I am proposing (don't start the deploy until we've checked that no artifacts already exist) could actually be done in a separate PR. That PR would take care of the failure case.

Opened #501 for this.

@dustymabe (Contributor)

@rtnpro can you address the remaining items in this PR? Don't worry about #501 for now.

@kadel (Collaborator) commented Jan 19, 2016

I tested this with a couple of my examples on Marathon and OpenShift. Everything looks fine 👍

The only thing missing is returning an error after the rollback, as @dustymabe mentioned; otherwise LGTM.

```diff
@@ -302,6 +303,7 @@ def stop(self, provider_key=None, dryrun=False):
     provider_key, provider = self.get_provider(provider_key, dryrun)
     provider.artifacts = self.rendered_artifacts.get(provider_key, [])
     provider.init()
+    provider.ignore_errors = True
```
Contributor:

Is this supposed to be `provider.ignore_errors = ignore_errors`?

Contributor Author:

Good catch 👍

Contributor Author:

However, there's no use case where ignore_errors is going to come in as False here.
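Even so, forwarding the flag keeps the signature honest; roughly (the first three body lines follow the diff above, while the trailing undeploy() call is illustrative):

```python
def stop(self, provider_key=None, dryrun=False, ignore_errors=False):
    provider_key, provider = self.get_provider(provider_key, dryrun)
    provider.artifacts = self.rendered_artifacts.get(provider_key, [])
    provider.init()
    provider.ignore_errors = ignore_errors  # forward the flag instead of hard-coding True
    provider.undeploy()  # illustrative: the real method continues from here
```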

```python
    self.oc.delete(url)
except Exception as e:
    if not self.ignore_errors:
        raise e
```
Contributor:
The self.oc.scale() call above (line 459) will fail if the rc doesn't exist on the target. We may need to consider putting the try/except deeper in the code. One extreme would be to put it at the Utils.make_rest_request() level; the other would be to simply try/except around the scale call above. Thoughts?
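One possible shape for that middle ground, as a sketch (the helper name and the exact scale() arguments are illustrative; self.oc and self.ignore_errors follow the diff above):

```python
import logging

logger = logging.getLogger(__name__)


class OpenShiftProviderSketch(object):
    """Sketch only: tolerate missing resources while undeploying."""

    def _remove_artifact(self, url):
        try:
            # Scale down before deleting; the exact scale() arguments
            # here are illustrative, not the project's real signature.
            self.oc.scale(url, 0)
            self.oc.delete(url)
        except Exception as e:
            if not self.ignore_errors:
                raise
            logger.warning('Ignoring undeploy error for %s: %s', url, e)
```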

Member:

I think this may be the only part you should have a look at, @rtnpro; the other code LGTM.

@dustymabe (Contributor)

Let's postpone merging this PR until after tomorrow's release.

@cdrage (Member) commented Feb 8, 2016

Heads up that all tests pass for this 👍

This PR needs rebasing now, though, due to the time that's passed.

@rtnpro (Contributor, Author) commented Feb 9, 2016

@cdrage @dustymabe rebased!

@cdrage (Member) commented Feb 10, 2016

Other than my one comment, LGTM and tests have passed!

@dustymabe (Contributor)

Postponing until after GA as we are going to limit our changes to bugfixes.

@dustymabe assigned rtnpro and unassigned kadel Feb 11, 2016
@dustymabe modified the milestones: CDK 2.1, CDK 2 GA Feb 11, 2016
@dustymabe mentioned this pull request Apr 4, 2016
@cdrage (Member) commented Apr 19, 2016

Just an update: we are still planning on implementing this, although the focus at the moment has been on converting the docker and k8s providers to their respective API implementations.
