Prevent exit on ZK detection timeout #306

Merged
merged 1 commit into from Oct 7, 2015

Conversation

sepiroth887

Addresses: #284

This one-line fix will do it. There are other instances where mesos-dns could fail fatally due to dependent libraries, but that's outside the scope of this fix.

@jdef
Contributor

jdef commented Oct 6, 2015

errch reports errors for components other than ZK as well, and if those fail (DNS/HTTP) then the server should probably die. If you want to treat ZK separately, use a separate error channel.
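
A minimal sketch of the separate-channel approach being suggested here (not the PR's actual code; the zkErrch name and run helper are hypothetical, while genErrch matches the diff below):

package main

import (
	"errors"
	"log"
	"time"
)

// run drains two error channels: errors from the DNS/HTTP servers on genErrch
// stay fatal, while ZK detection errors on zkErrch are only logged so the
// process keeps serving.
func run(genErrch, zkErrch <-chan error) {
	for {
		select {
		case err := <-genErrch:
			log.Fatal(err) // DNS/HTTP failure: exit, as before
		case err := <-zkErrch:
			log.Printf("zk detection error: %v", err) // log and carry on
		}
	}
}

func main() {
	genErrch := make(chan error)
	zkErrch := make(chan error)
	go run(genErrch, zkErrch)
	zkErrch <- errors.New("zk: detection timed out") // logged, not fatal
	time.Sleep(100 * time.Millisecond)               // give the logger time to print
}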

@sepiroth887
Author

👍 updated

@@ -40,16 +40,17 @@ func main() {
 // initialize resolver
 config := records.SetConfig(*cjson)
 res := resolver.New(Version, config)
-errch := make(chan error)
+genErrch := make(chan error)
Contributor

errch

Author

Doesn't this confuse things a bit? What about generrs?

@tsenart
Contributor

tsenart commented Oct 6, 2015

Thanks! Can you squash the commits into one?

@tsenart tsenart changed the title [WIP] Prevent exit on ZK detection timeout Prevent exit on ZK detection timeout Oct 6, 2015
@tsenart tsenart added the PTAL label Oct 6, 2015
@tsenart tsenart added this to the v1.0.0 milestone Oct 6, 2015
@sepiroth887
Author

Yup. Though would it be easier to close and reopen? Not 100% sure, but squashing would require rewriting the history. Does the build system cope with a force update?

@jdef
Contributor

jdef commented Oct 6, 2015

you can force-push

@tsenart
Contributor

tsenart commented Oct 6, 2015

Wait a minute. Thinking more about this, I don't think this approach is what we want. We'd like ZK not to time out at all, I think, so that it can connect at any point in the process runtime.

@karlkfi
Contributor

karlkfi commented Oct 6, 2015

I don't understand this PR...
Am I correct in reading that it disables the zk timeout and just logs once and keeps on going? Why have the timeout at all? I don't understand.

@tsenart
Contributor

tsenart commented Oct 6, 2015

@karlkfi: Agreed, see my comment above.

@karlkfi
Contributor

karlkfi commented Oct 6, 2015

Not timing out is debatable. Perhaps the timeout can be optional?

Do we have a readiness endpoint on mesos-dns that fails until zk is available? It might be nice to distinguish between alive & ready. It's a pattern kubernetes adopted. NotAlive means it needs restarting. Alive but NotReady means still bootstrapping or that dependencies are NotReady.

@tsenart
Contributor

tsenart commented Oct 6, 2015

Mesos-DNS offers multiple services which can function without Zookeeper. It can even be configured without Zookeeper altogether. The timeout is strictly not needed as long as the code which connects keeps retrying.

@sepiroth887
Author

Would be easy to remove the timeout. It's just a Ticker anyway. I don't see a reason to keep it.

@karlkfi
Contributor

karlkfi commented Oct 6, 2015

Technically it's a Timer, not a Ticker. So it only happens once, as is. A Ticker would invoke repeatedly.
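
For reference, a small standard-library illustration of the distinction (not code from this PR): a Timer fires exactly once, a Ticker fires repeatedly until stopped.

package main

import (
	"fmt"
	"time"
)

func main() {
	timer := time.NewTimer(50 * time.Millisecond) // fires exactly once
	<-timer.C
	fmt.Println("timer fired; it will not fire again")

	ticker := time.NewTicker(50 * time.Millisecond) // fires every interval
	defer ticker.Stop()
	for i := 1; i <= 3; i++ {
		<-ticker.C
		fmt.Println("tick", i)
	}
}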

@tsenart
Contributor

tsenart commented Oct 6, 2015

@sepiroth887: Yes, but in order to satisfy #284 you should make sure that the ZK detector keeps retrying to connect.

@sepiroth887
Author

Gah, yeah, it is a Timer, sorry. Regardless, it's not really needed. The ZK detector already keeps retrying indefinitely.

Sending yet another commit shortly.

@tsenart
Contributor

tsenart commented Oct 6, 2015

Since config.ZkDetectionTimeout isn't used anymore, please remove all references to it, including documentation.

@sepiroth887
Author

rabbit hole :D on it.

 logging.VeryVerbose.Printf("new masters detected: %v", masters)
 res.SetMasters(masters)
 res.Reload()
 case err := <-errch:
-logging.Error.Fatal(err)
+log.Fatal(err)
Contributor

Keep using the Error logger please.

Author

this was unchanged. I'll update it though.

Contributor

How was it unchanged if git lists it as changed?

Author

Ugh, yes. I was keeping it in line with lines 59 and 61. Want this changed as well to use the Error logger?

Contributor

I simply want this line to not change at all.

@tsenart
Contributor

tsenart commented Oct 7, 2015

@sepiroth887: Please squash all your commits into one and then force-push to your branch.

- Dedicated error channel for ZK timeouts
- Updated error channel naming
- ZK Timeout timer removed
- Removed references to ZKTimeout
- Use logging for Fatal errors in switch case
@sepiroth887
Author

done.

tsenart added a commit that referenced this pull request Oct 7, 2015
Prevent exit on ZK detection timeout
@tsenart tsenart merged commit b963f95 into mesosphere:master Oct 7, 2015
@tsenart tsenart removed the PTAL label Oct 7, 2015
@sepiroth887 sepiroth887 deleted the still_alive branch October 7, 2015 15:11
@jdef
Contributor

jdef commented Oct 7, 2015

For posterity: we added this initial ZK timeout feature because the old mesos-go zk detector implementation was buggy and would sometimes never connect. The problem with this is that even if the masters are initially configured statically in the config, the mesos-dns records that point to masters would get out of date if the master VMs were recycled. This happened in production.

The newer mesos-go zk detector should be much more stable. However, there's no guarantee that it's completely bug-free.

I'm not thrilled that this PR was merged. I'd rather have seen a much higher timeout specified, perhaps something along the lines of O(hours). That way an admin would at least see the mesos-dns service flapping and could investigate why.

@jdef
Contributor

jdef commented Oct 7, 2015

/cc @cmaloney

@tsenart
Contributor

tsenart commented Oct 7, 2015

@jdef: Couldn't that be solved simply by logging when new masters are detected? An admin would look and see that it's getting data from ZK. When those log lines were absent, we'd know something is off.

@jdef
Contributor

jdef commented Oct 7, 2015

My suspicion is that admins are watching a million things and that a flapping service that's being monitored via some alert system is more likely to get attention than a log line.


@sepiroth887
Author

What about a healthy/unhealthy endpoint?

What scared me most about the previous version was that mesos-dns just exiting can have a major impact on a production environment.
Even if it never recovered, it would at least not cause running apps to blow to pieces.

Monitoring the health of the dns could at least provide ops with the means to alert and take action.

@sepiroth887
Author

We experienced an outage due to that issue, by the way, so naturally I'd like to mitigate it.

@karlkfi
Contributor

karlkfi commented Oct 7, 2015

I like the idea of a readiness endpoint that fails if not connected to mesos. Yes, mesos-dns functions without mesos, but I don't think that's actually a legitimate use case. If it can't talk to mesos the operator is gonna want to know.

@jdef
Contributor

jdef commented Oct 7, 2015

+1 for a readiness endpoint. It should also integrate well with DCOS. I'm not very familiar with how they currently monitor readiness of other critical components.


@sepiroth887
Author

@karlkfi sorry for blatantly selling your idea as mine :D

One gotcha: this doesn't help if someone disables HTTP on mesos-dns, right? Unless you always bind the readiness endpoint to another port by default? But that also feels weird. :(

Is there some form of DNS entry that could be hijacked to achieve the same and support both DNS- and HTTP-based health?

@jdef
Contributor

jdef commented Oct 7, 2015

Some k8s services have a dedicated endpoint/port that responds to "/healthz" HTTP requests. Our readiness check could (and probably should) be completely separate from the HTTP API.
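
As a rough sketch of what such a dedicated readiness listener could look like (the port, path, and ready flag are hypothetical, not anything mesos-dns exposes today):

package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// ready would be flipped to 1 by the ZK detector once it has connected.
var ready int32

func healthz(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt32(&ready) == 1 {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok\n"))
		return
	}
	// Alive but not ready: still bootstrapping or ZK is unreachable.
	http.Error(w, "not ready: no zk connection", http.StatusServiceUnavailable)
}

func main() {
	// Served on its own port, separate from the main mesos-dns HTTP API.
	http.HandleFunc("/healthz", healthz)
	log.Fatal(http.ListenAndServe(":8124", nil)) // example port only
}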


@cmaloney

cmaloney commented Oct 7, 2015

We do not have any infrastructure to monitor "Readiness" / "liveliness" of a base system service. We can monitor when a process is up and act on that with systemd.

Practically, it would be good if Mesos-DNS just always continued operating, serving old / cached results and raising an alert, but that requires a bunch of infrastructure we don't currently have.

@tsenart
Contributor

tsenart commented Oct 8, 2015

@cmaloney: So, in the context of DCOS, would you say it's preferable to have Mesos-DNS hard crash if it can't connect to Zookeeper at any point in time?

@cmaloney

cmaloney commented Oct 8, 2015

The two options I see:

  1. Hard Crash (very easy to detect and forward to a sysadmin)
  2. Refuse all connections (prevents us from thinking it is up / working), and report via the systemd notify API (http://www.freedesktop.org/software/systemd/man/systemd-notify.html, http://www.freedesktop.org/software/systemd/man/sd-daemon.html) that it is in a degraded state. (Reasonably easy to notice, shows up in systemd logs).
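
For illustration, a minimal sketch of the status reporting mentioned in option 2: write newline-separated KEY=VALUE assignments to the datagram socket named by $NOTIFY_SOCKET (the sdNotify helper and status strings are hypothetical; abstract sockets beginning with "@" are not handled here).

package main

import (
	"net"
	"os"
)

// sdNotify sends a state string such as "READY=1" or "STATUS=degraded: no zk
// connection" to systemd via the notify socket, if one is configured.
func sdNotify(state string) error {
	socket := os.Getenv("NOTIFY_SOCKET")
	if socket == "" {
		return nil // not running under systemd with Type=notify
	}
	addr := &net.UnixAddr{Name: socket, Net: "unixgram"}
	conn, err := net.DialUnix("unixgram", nil, addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(state))
	return err
}

func main() {
	_ = sdNotify("READY=1")
	// ...later, if the ZK connection is lost:
	_ = sdNotify("STATUS=degraded: cannot reach zookeeper")
}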

We always configure it to listen to a localhost zookeeper though, so the only time it can't connect to zookeeper is if zookeeper is overloaded or not currently up. In the case where it is not up we'd rather use another machine, since using that particular

If the "watch zk" stuff was a separate process / daemon from the "serve records" then we could split this much better / make it monitorable as two different services which would be easier to detect / notify system administrators about, while still keeping the "old" state available so the cluster is less likely to fall over. There would also need to be work around not serving leader.mesos / master.mesos when there is no connection to zk (Which will cause certain things to fail properly), and work in generating /etc/resolv.conf so that we detect when mesos-dns is in this "degraded" state and avoid it if there are other healthy nodes.

@tsenart
Contributor

tsenart commented Oct 8, 2015

OK, until we have a decent solution to properly monitor base services in DCOS, I will revert this change set and instead make the timeout optional so that we cover both use cases.
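
One way the optional timeout could be wired, sketched here with a hypothetical helper: a non-positive config value leaves the timeout channel nil, and a nil channel never fires in a select, so only the master-detection case can proceed.

package main

import (
	"log"
	"time"
)

// waitForMasters blocks until masters are detected, or until the timeout
// elapses if zkDetectionTimeout is positive; zero or negative disables it.
func waitForMasters(zkDetectionTimeout time.Duration, masters <-chan []string) {
	var timeout <-chan time.Time
	if zkDetectionTimeout > 0 {
		timeout = time.NewTimer(zkDetectionTimeout).C
	}
	select {
	case m := <-masters:
		log.Printf("new masters detected: %v", m)
	case <-timeout:
		log.Fatalf("no zk connection within %s", zkDetectionTimeout)
	}
}

func main() {
	masters := make(chan []string, 1)
	masters <- []string{"10.0.0.1:5050"}
	waitForMasters(0, masters) // 0 disables the timeout; only the masters case can fire
}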

@karlkfi
Contributor

karlkfi commented Oct 8, 2015

I'm not a big fan of either of those options, honestly.

The best multi-component systems to maintain are the ones that:

  1. can come up in any order
  2. make it obvious which piece is broken
  3. recover automatically with the least amount of intervention

You don't want to have to trace failures down a cascade of broken services and then have to restart all of them in order. You want to be able to look at a dashboard and see that N systems are Alive but NotReady and 1 system is NotAlive. You go fix that one piece or config that's borked and all the others auto-recover.

Not only that, but every component should be watched for NotAlive and auto-restarted either by process or by container. If the kube-dns crashes every time zk is unreachable it will thrash every time there's a network outage. If that happens at the container level you won't even have a running container to shell into to debug.

@tsenart
Contributor

tsenart commented Oct 8, 2015

@karlkfi: Agree.

@cmaloney

cmaloney commented Oct 8, 2015

I think it's important to note here that the reason why we're serving DNS is that we're talking to dumb clients. From what I know of internals / reliance on DNS inside of DCOS (Universe, DCOS UI, Frameworks), Mesos-DNS soft failing is going to result in a lot worse error cases than hard failing.

If Mesos-DNS hard fails, from a client's perspective it looks just like you lost network access to it. Clients which want to be resilient against network partitions from Mesos-DNS need to handle this failure scenario already.

If we add soft failure of Mesos-DNS, we break some "liveliness" guarantees of responses from it from the PoV of different "dumb" clients. These needs / requirements are often unstated, and happen both inside DCOS components, and likely in things others write on top of Mesos-DNS. Two I'm going to focus on are leader.mesos and master.mesos. There is a lot of software in DCOS (every universe service) which assumes master.mesos resolves to all the masters for using as a zk url. There is some software in DCOS (The UI, dcos history service, dependence on there being a leading mesos master by slaves) which assumes leader.mesos points to the current leading Mesos master.

If Mesos-DNS keeps serving even though it doesn't have a current / up-to-date leader.mesos + master.mesos (what it gets from zookeeper), then clients which query those are going to get incorrect responses from their point of view. For some clients this is acceptable as a transient error. For others (the UI, the dcos history service, slaves' dependence on there being a leading Mesos master), they need to change behavior when they know they are using an old, cached value. If Mesos-DNS keeps serving the old value, they are unable to make this choice. From their perspective as consumers of DNS, they are getting a new authoritative value and have no reason to suspect the value is possibly outdated. This will lead to that particular service not realizing it is in a degraded state. Services which depend on it, depending on that dependence, may also be in a degraded state but not know it.

If Mesos-DNS hard fails, the clients see that immediately, and can opt to use the old value if they are smart clients, if they are dumb clients they are guaranteed that they won't have their assumptions around the liveliness of the values reported invalidated.

By introducing "soft" failure as keeping the existing service port listening / responding in Mesos-DNS, there are more failure scenarios a client must be aware of in order to operate properly with certain guarantees around service discovery. This breaks the simple client implementation. That implementation is why we are doing DNS in the first place.

If we implement "soft" failure in Mesos-DNS as serving on a different fallback port (e.g. port 153 is "always available", port 53 serves "only when we're all good"), the guarantees aren't broken, as dumb clients will get the failure they need for correct operation. To use the fallback, clients need to opt in rather than opt out.

One failure line which would happen today with soft failure:

  1. Mesos-DNS keeps serving even though it lost its zk connection
  2. /etc/resolv.conf keeps using 127.0.0.1 as the resolver, because Mesos-DNS is still up
  3. The DCOS UI nginx keeps resolving master.mesos and leader.mesos. It uses leader.mesos to grab "state-summary" from the leading master. Mesos serves this file from every single master, always. If you get it from a non-leading master, its contents are frequently incorrect.
  4. An administrator uses the DCOS UI to check on the state of their cluster. The cluster "looks" healthy, because the state-summary being returned says everything is healthy, even though on the current leading master this is not the case.
  5. An administrator uses the DCOS UI to view the Web UI of a particular service inside their cluster. The DCOS UI looks up webui_url as reported by the framework and sent to the UI in "state-summary". The DCOS UI forwards the administrator's request to that particular server. That server is no longer supposed to be serving the WebUI for the framework. The server is still there, but the framework migrated it to a different host and updated the webui_url in Mesos. The administrator therefore gets old / incorrect information about the service which they believe to be correct / authoritative information.

@tsenart
Contributor

tsenart commented Oct 8, 2015

@cmaloney: Thanks for the detailed insight into these DCOS failure modes. I hope #320 makes everyone happy.

@tsenart tsenart mentioned this pull request Oct 9, 2015