Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove spec file of failed services #6794

Merged
merged 22 commits into from Aug 12, 2019

Conversation

@davidMcneil
Copy link
Contributor

commented Aug 6, 2019

Resolves #4494

This is a third attempt at resolving this issue. I apologize to reviewers. The feedback on the previous two attempts, #6719 and #6730, have illuminated much better ways to address the problem. The new method is sufficiently different that I believe it warrants a new PR.

To recap there are two changes in behavior:

  1. Disallow loading services whose spec cannot be validated. See this commit.
  2. Remove the spec file of a service that fails to be created. See this commit. Important Questions Does this seem reasonable? Are there workflows this breaks? I believe this only happens when a service is updated. There is no other way for a service to get in this state because the change in 1 prevents it. In other words, this change will automatically unload a service if a new version of the service causes its spec file to be invalid. For example, when binds are added to a service.

I am still trying to find a way to encapsulate these changes in our test suite, but I wanted to get this PR up for initial feedback.

@chef-expeditor

This comment has been minimized.

Copy link

commented Aug 6, 2019

Hello davidMcneil! Thanks for the pull request!

Here is what will happen next:

  1. Your PR will be reviewed by the maintainers.
  2. If everything looks good, one of them will approve it, and your PR will be merged.

Thank you for contributing!

@raskchanky

This comment has been minimized.

Copy link
Member

commented Aug 6, 2019

@davidMcneil No apologies are required here. Your continued effort on this (clearly) thorny issue is appreciated.

@davidMcneil

This comment has been minimized.

Copy link
Contributor Author

commented Aug 6, 2019

@christophermaier this does not address your comment here in so far as adding an in memory representation of loaded but not running services. There are two reasons. First this new implementation did not benefit from it, and second, I tried several times, but always ran into rather ugly clones or Arcs in the code.

@davidMcneil davidMcneil self-assigned this Aug 6, 2019
@baumanj

This comment has been minimized.

Copy link
Contributor

commented Aug 6, 2019

2. Are there workflows this breaks? I believe this only happens when a service is updated. There is no other way for a service to get in this state because the change in 1 prevents it.

The situation that occurs to me is existing installs that have spec files cannot be validated. If we've changed things such that a given spec file may have validated previously, I could see us getting into that scenario.

Copy link
Contributor

left a comment

I don't feel knowledgable enough here to give a formal approval (I'll leave that to @christophermaier), but here are some thoughts and suggestions.

Overall, I really like this and the way it was split into commits made it very easy to review!

@@ -162,9 +111,9 @@ pub struct ServiceSpec {

impl ServiceSpec {
pub fn default_for(ident: PackageIdent) -> Self {

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

It's fine if you'd rather not make this change in this PR, but I think this would more appropriately be named with_ident.

components/sup/src/manager/commands.rs Outdated Show resolved Hide resolved
organization: Option<&str>,
gateway_state: Arc<RwLock<GatewayState>>)
-> Result<Service> {
fn new_impl(sys: Arc<Sys>,

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

I think renaming this with_package makes the difference with new more apparent. Also, I could see this being made a part of the Service public interface, whereas my intent with _impl functions has been to extract methods into (relatively) pure functions for the sake of tests.

components/sup/src/manager.rs Show resolved Hide resolved
format!("{}.{}", ident.name, SPEC_FILE_EXT)
}

pub fn file_name(&self) -> String { Self::ident_file_name(&self.ident) }

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

It seems like both of these functions should more appropriately return PathBufs. Then they could be named ident_file and file respectively.

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

Yeah, PathBuf seems like a good change.

ident: &PackageIdent,
shutdown_input: &ShutdownInput,
runtime: &mut Runtime) {
fn try_remove_spec_file(&self, ident: &PackageIdent) {

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

I'd either return std::io::Result<()> or name this remove_spec_file.

This comment has been minimized.

Copy link
@davidMcneil

davidMcneil Aug 7, 2019

Author Contributor

Yeah, I was debating what to do with these methods (the other being try_stop_service). I agree using try_ is probably not appropriate.

My goal was to indicate that these methods can fail, but instead of returning Result they handle the errors internally by logging appropriately. I did not want somebody to see remove_spec_file and think that it guaranteed the file was removed. I would like some way to distinguish between methods that propagate errors, and higher-level methods that can still fail but handle the failure case probably by logging. Maybe this distinction is not needed? What do people think? Other possible names would be attempt_remove_spec_file or remove_spec_file_with_log, but maybe these are even more confusing?

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

I think I'd be most in favor of keeping the name (it is a good indicator of fallibility), and returning a Result. We can still log in this function (or move that to the callers, if our desire to log is context dependent). It's easy enough for callers to ignore the result with .ok(); if they don't care about failure. I think we should get used to that idiom as an indicator that a function may not have succeeded, but that the caller doesn't behave differently.

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

To add on to @baumanj's suggestion (which I agree with 100%)... we've come across a lot of places in the past where the fact that we swallow errors internally causes problems. If we were to get used to propagating errors, that would give us a bit more flexibility in how we handle things.

This also will require some new error types, I imagine. Coupled with some of the refinements detailed in #6559, I could envision lots of rich error types... for errors that we choose to ignore, we could even have a nice helper function to chain onto calls that basically takes all that rich error information and logs it. Then it's really clear in the code which error cases we care about and which ones we don't, but we never lose that information, and if we do decide we need to act on it, we can easily do so.

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

What about a ResultExt trait or something that would allow us to "lognore" errors. Instead of try_remove_spec_file(…).ok(), it could be something like .ok_log_err()?

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

@baumanj Yup, that's exactly what I was thinking.

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

Incidentally, I ❤️ "lognore" 😂

This comment has been minimized.

Copy link
@davidMcneil

davidMcneil Aug 8, 2019

Author Contributor

While I completely agree this sounds like the correct way (ok_log_err is a great idea), I think it makes sense to hold off on implementing it in this PR. I will link #6559 to this discussion. I am happy to tackle #6559 if we feel it is a blocker for this to be merge.

components/sup/src/manager.rs Outdated Show resolved Hide resolved
self.organization.as_ref().map(String::as_str),
self.state.gateway_state.clone()).into_iter()
.filter_map(|spec| {
let ident = spec.ident.clone();

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

Why not move the clone inside the else case so we can avoid it when unnecessary?

if existing_idents.contains(&ident) {
None
} else {
let service = Service::new(self.sys.clone(),

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

I might call this something else since it's not a Service, but a Result.

warn!("Failed to create service '{}' from spec: {:?}", ident, e);
None
}
}

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

If service is renamed (say to result, I think it makes the more concise version of the match block more readable:

                        if let Err(ref e) = result {
                            warn!("Failed to create service '{}' from spec: {:?}", ident, e);
                        }
                        result.ok()

I dislike the redundancy of Ok(_) => Some(_), but maybe Result::ok is less clear. Use whichever you think is best.

@davidMcneil

This comment has been minimized.

Copy link
Contributor Author

commented Aug 7, 2019

The situation that occurs to me is existing installs that have spec files cannot be validated. If we've changed things such that a given spec file may have validated previously, I could see us getting into that scenario.

This is a great point! To make sure I am understanding. The scenario is, supervisor version 1 is able to successfully call Service::new on a spec file, but following an update, supervisor version 2 Service::new now fails.

The old behavior would be that we end up in the weird state this PR fixes.

The new behavior would be that the service is automatically unloaded. Does that seem reasonable?

@baumanj

This comment has been minimized.

Copy link
Contributor

commented Aug 7, 2019

The old behavior would be that we end up in the weird state this PR fixes.

The new behavior would be that the service is automatically unloaded. Does that seem reasonable?

💯

Copy link
Contributor

left a comment

I think this makes sense for now, in that it makes failures explicit. If a new version of a service comes in that has missing binds, the current (i.e., before this PR) situation requires users to reload the service anyway, so removing the file doesn't really change that.

I can envision a future where we could do a bit more validation before trying to upgrade a service and if we detect an incompatibility, refuse to upgrade until the user can intervene... that would at least allow the service to continue operating in the meantime. I can also envision a future where operations like updating binds doesn't require a full unload and reload of a service... in that case, we might not want to delete the spec file, but that's speculating about code that doesn't exist yet. Some tests around the behavior introduced in this PR would help guard against that, though.

All that is far beyond this PR, of course.

I'm curious for @stevendanna's thoughts on this, as this could have impact on Automate (it should at least be on their radar).

There are a couple tweaks I'd like to see before merging, but overall, I like the approach taken here. The breakup of things into commits really helped the review, as well.

Nice work!

}
spec
} else {
ServiceSpec::default()

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

I'm not so sure about this, but then again, I don't really like the idea of ServiceSpec actually implementing Default at all, since the value returned has a junk package identifier (basically, the identifier "/"). (I also have problems with PackageIdent implementing Default, too, as you might guess 🤣 )

On the other hand, if we had a way to convert opts directly into a valid ServiceSpec, that would allow us to begin to eliminate uses of "bad" Default implementations like this. This might be the time to start introducing that.

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 7, 2019

Contributor

What about a builder pattern?

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

Possibly, yeah. I'm agnostic to the specific approach; I just know I don't like the fact that these types implement Default 😂

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 8, 2019

Contributor

The issue here seems to be a conflict between incrementally constructing the object, but not having a reasonable default value for everything. If there's another good pattern for that, I'd be interested to learn it.

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 8, 2019

Contributor

I was initially thinking of a From/Into or TryFrom/TryInto to get from a SvcLoad to a ServiceSpec. Builder would definitely be an option as well.

It'd be worth checking the other existing uses of the current Default implementation to get a better idea of how it's currently being used. That could help determine which approach would be useful.

This comment has been minimized.

Copy link
@davidMcneil

davidMcneil Aug 8, 2019

Author Contributor

TryFrom seems like a good option. I think we still need merge_svc_load, right? The TryFrom implementation will basically be merge_svc_load with a default ServiceSpec? So I dont think this buys us a whole lot in terms of code reduction, but it is more explicit about what is going on and worth doing.

This comment has been minimized.

Copy link
@baumanj

baumanj Aug 8, 2019

Contributor

Maybe now's not the time if it's too involved, but I strongly agree with @christophermaier's earlier point that implementing Default for PackageIdent is nonsensical and dangerous, thus there should be no Default for ServiceSpec, which contains it.

This comment has been minimized.

Copy link
@davidMcneil

davidMcneil Aug 8, 2019

Author Contributor

I agree. I just pushed changes that should address this. I did not go so far as to remove Default for PackageIdent. I started, but it quickly grew into a much larger effort. I would be happy to tackle it, but I think it should be part of a separate PR.

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 8, 2019

Contributor

There's a lot of work for PackageIdent, so any removal of Default for it should definitely be a separate PR.

components/sup/src/manager/commands.rs Outdated Show resolved Hide resolved
@@ -248,6 +197,51 @@ impl ServiceSpec {

Ok(())
}

pub fn merge_svc_load(&mut self, svc_load: &habitat_sup_protocol::ctl::SvcLoad) {
self.ident = svc_load.ident.clone().unwrap().into();

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

This unwrap should be an expect at the very least, to ensure we get traceable error messages.

Do we know for certain that svc_load.ident is a Some at this point?

(I wish we had an internal, non-protobuf type to use here... that way we could be certain at the type has an ident.)

self.ident = svc_load.ident.clone().unwrap().into();
self.group = svc_load.group
.clone()
.unwrap_or_else(|| DEFAULT_GROUP.to_string());

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

I wonder if this should be something like

if let Some(ref group) svc_load.group {
    # ...
    self.group = blahblahblah;
}

If self already has a group set, but svc_load doesn't, then we're going to unconditionally set the group to the default here, which could be surprising and unintended.

components/sup/src/manager.rs Show resolved Hide resolved
components/sup/src/manager.rs Show resolved Hide resolved
format!("{}.{}", ident.name, SPEC_FILE_EXT)
}

pub fn file_name(&self) -> String { Self::ident_file_name(&self.ident) }

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

Yeah, PathBuf seems like a good change.

ident: &PackageIdent,
shutdown_input: &ShutdownInput,
runtime: &mut Runtime) {
fn try_remove_spec_file(&self, ident: &PackageIdent) {

This comment has been minimized.

Copy link
@christophermaier

christophermaier Aug 7, 2019

Contributor

To add on to @baumanj's suggestion (which I agree with 100%)... we've come across a lot of places in the past where the fact that we swallow errors internally causes problems. If we were to get used to propagating errors, that would give us a bit more flexibility in how we handle things.

This also will require some new error types, I imagine. Coupled with some of the refinements detailed in #6559, I could envision lots of rich error types... for errors that we choose to ignore, we could even have a nice helper function to chain onto calls that basically takes all that rich error information and logs it. Then it's really clear in the code which error cases we care about and which ones we don't, but we never lose that information, and if we do decide we need to act on it, we can easily do so.

components/sup/src/manager.rs Outdated Show resolved Hide resolved
@davidMcneil davidMcneil requested a review from christophermaier Aug 8, 2019
components/sup/src/manager/service/spec.rs Outdated Show resolved Hide resolved
components/sup/src/manager/service/spec.rs Outdated Show resolved Hide resolved
components/sup/src/manager/service/spec.rs Outdated Show resolved Hide resolved
components/sup/src/ctl_gateway/server.rs Outdated Show resolved Hide resolved
@davidMcneil davidMcneil requested a review from chefsalim as a code owner Aug 9, 2019
@davidMcneil davidMcneil requested a review from christophermaier Aug 9, 2019
davidMcneil added 13 commits Jul 31, 2019
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
davidMcneil added 8 commits Aug 6, 2019
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
See #6832

Signed-off-by: David McNeil <mcneil.david2@gmail.com>
@davidMcneil davidMcneil force-pushed the dmcneil/remove-spec-file-of-failed-service branch from 5cda609 to d6b1690 Aug 9, 2019
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
@davidMcneil davidMcneil merged commit cd26965 into master Aug 12, 2019
5 checks passed
5 checks passed
DCO This commit has a DCO Signed-off-by
Details
buildkite/habitat-sh-habitat-master-verify Build #3024 passed (39 minutes, 41 seconds)
Details
buildkite/habitat-sh-habitat-master-website Build #153 passed (2 minutes, 30 seconds)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
expeditor/config-validation Validated your Expeditor config file
Details
@chef-ci chef-ci deleted the dmcneil/remove-spec-file-of-failed-service branch Aug 12, 2019
@davidMcneil davidMcneil referenced this pull request Aug 12, 2019
0 of 4 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.