Add a new Chaos transport that can simulate network failure and add it to the kubelet #6729

Merged
merged 3 commits into from Apr 13, 2015

5 participants
@smarterclayton
Contributor

smarterclayton commented Apr 11, 2015

A new package, pkg/client/chaosclient, provides a framework for simulating random
HTTP client failures and for returning arbitrary responses. client.Config is
extended so that the transport can be wrapped to inject those errors, and the
kubelet now takes an argument --chaos_chance=<p>, which sets the probability that
a request to the master will be rejected with a "connection reset by peer" error.

To try this locally, pass --chaos_chance=0.1 to the kubelet on start, or try this
with the local cluster:

$ CHAOS_CHANCE=0.1 hack/local-up-cluster.sh

The Chaos transport logs whenever it replaces a request. Because the default
error is a fairly generic one, most parts of the code should immediately log it
to glog at a verbosity of at least V(2).

Future enhancements will include more error scenarios, the ability to simulate
network latency, and possibly panics.
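
For readers who want to see the mechanism in isolation, here is a minimal sketch of a transport wrapper that rejects a fraction of requests. It is not the chaosclient code from this PR: the names chaosTransport, errConnReset, roundTripperFunc, and countFailures are hypothetical, and the target URL is a placeholder that is never dialed.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"sync"
)

// errConnReset stands in for the simulated failure (hypothetical name).
var errConnReset = errors.New("connection reset by peer")

// roundTripperFunc adapts a plain function to http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(req *http.Request) (*http.Response, error) {
	return f(req)
}

// chaosTransport rejects roughly a fraction p of requests before they reach
// the wrapped transport. The mutex guards rng, which is not safe for
// concurrent use on its own.
type chaosTransport struct {
	next http.RoundTripper
	p    float64

	mu  sync.Mutex
	rng *rand.Rand
}

func (c *chaosTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	c.mu.Lock()
	roll := c.rng.Float64() // uniform in [0, 1)
	c.mu.Unlock()
	if roll < c.p {
		return nil, errConnReset
	}
	return c.next.RoundTrip(req)
}

// countFailures sends n requests through a chaos transport and reports how
// many were rejected. The inner transport always answers 200 OK, so every
// failure counted here was injected.
func countFailures(p float64, n int, seed int64) int {
	ok := roundTripperFunc(func(*http.Request) (*http.Response, error) {
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody}, nil
	})
	rt := &chaosTransport{next: ok, p: p, rng: rand.New(rand.NewSource(seed))}
	req, err := http.NewRequest("GET", "http://master.invalid/healthz", nil)
	if err != nil {
		panic(err)
	}
	failures := 0
	for i := 0; i < n; i++ {
		if _, err := rt.RoundTrip(req); err != nil {
			failures++
		}
	}
	return failures
}

func main() {
	fmt.Println("rejected at p=0.1:", countFailures(0.1, 1000, 1), "of 1000")
}
```

Because the roll is drawn from [0, 1), a chance of 0 never rejects and a chance of 1 always rejects, which makes the two extremes easy to check in a test.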

@googlebot

googlebot commented Apr 11, 2015

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please let us know the company's name.
@smarterclayton

Contributor

smarterclayton commented Apr 11, 2015

@timothysc this will be of relevance for our reliability testing

@roberthbailey

Member

roberthbailey commented Apr 11, 2015

/cc @fabioy (who was also working on simulating failures, albeit in other ways).

@timothysc

Member

timothysc commented Apr 13, 2015

/cc @jayunit100 @satnam6502 fyi.

@smarterclayton

Contributor

smarterclayton commented Apr 13, 2015

Adding this to the client and controller manager will come in a separate pull. I want to get the basics sorted and agreed on here.

----- Original Message -----

/cc @jayunit100 @satnam6502 fyi.


@timothysc

Member

timothysc commented Apr 13, 2015

I looked through the PR, and in general I'm a +1, but I would still like a daemon-killing chaos-monkey re: #4548 . Primarily b/c it's pretty difficult to simulate a start-up storm, or net-split, etc.

That being said, I think both would be good to have.

@smarterclayton

Contributor

smarterclayton commented Apr 13, 2015

Yeah, different problem set. Although this could trigger random panics, I don't think it's the right place for external chaos.

@@ -95,6 +96,7 @@ type KubeletServer struct {
 	TLSPrivateKeyFile string
 	CertDirectory string
 	NodeStatusUpdateFrequency time.Duration
+	ChaosChance float64

@fabioy

fabioy Apr 13, 2015

Member

Not a fan of having oddball testing options in this struct ("ReallyCrashForTesting"?). I'd prefer if these settings were coalesced into a separate struct and referenced either by a pointer from KubeletServer or perhaps just as a floating global config (presumably there won't be more than 1 per process...).

@smarterclayton

smarterclayton Apr 13, 2015

Contributor

These are internal configs, so it's appropriate to put them under the struct. Globals are far worse and break encapsulation so I don't think we should ever add those (ReallyCrash is probably the exception).

I don't see a ton of value in separating them out, although I'd be happy to add a better description of when you would use those and separate them visually.

@fabioy

fabioy Apr 13, 2015

Member

My concern is with "sprinkling of test code" throughout the codebase. I'd prefer to have them separated out in an easily discernible way, in case we wish to rip them out or refactor them in the future. At the least, please comment.

@smarterclayton

smarterclayton Apr 13, 2015

Contributor

I think there is a small set of folks that would run Chaos in production either as canaries or continuous test, but I agree with the sentiment and will add comments to them.
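
To make the alternative discussed in this thread concrete, the sketch below groups the test-only knobs in a separate struct referenced by pointer. This is an illustration of the suggestion, not what the PR adopted (the outcome above was to keep the fields in place and comment them). TestingOptions, the Testing field, and chaosChance are hypothetical names, and KubeletServer is trimmed to one real field.

```go
package main

import (
	"fmt"
	"time"
)

// TestingOptions groups settings that exist only for fault injection and
// reliability testing (hypothetical type, not part of the PR).
type TestingOptions struct {
	// ChaosChance is the probability, in [0, 1], that a request to the
	// master is replaced with a simulated failure.
	ChaosChance float64
	// ReallyCrashForTesting mirrors the existing test-only flag mentioned
	// in the review.
	ReallyCrashForTesting bool
}

// KubeletServer is trimmed to the fields this sketch needs.
type KubeletServer struct {
	NodeStatusUpdateFrequency time.Duration

	// Testing is nil in normal operation, so test-only settings are easy
	// to find and easy to remove later.
	Testing *TestingOptions
}

// chaosChance returns the configured probability, or 0 when no testing
// options are set.
func (s *KubeletServer) chaosChance() float64 {
	if s.Testing == nil {
		return 0
	}
	return s.Testing.ChaosChance
}

func main() {
	plain := KubeletServer{NodeStatusUpdateFrequency: 10 * time.Second}
	chaotic := KubeletServer{Testing: &TestingOptions{ChaosChance: 0.1}}
	fmt.Println(plain.chaosChance(), chaotic.chaosChance())
}
```

The trade-off is an extra nil check at each use site in exchange for a single place to look when auditing or removing test hooks.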

// Intercept should return true if the normal flow should be skipped, and the
// return response and error used instead. Modifications to the request will
// be ignored, but may be used to make decisions about types of failures.
Intercept(req *http.Request) (bool, *http.Response, error)

@fabioy

fabioy Apr 13, 2015

Member

Is the "bool" return value needed? You could have the presence of the Response object be the signal to override the response.

@smarterclayton

smarterclayton Apr 13, 2015

Contributor

In general in Go I prefer not to overload the meaning of nil, but in this case I don't think it would be an issue.

----- Original Message -----

+// chaosrt provides the ability to perform simulations of HTTP client failures
+// under the Golang http.Transport interface.
+type chaosrt struct {
+	rt http.RoundTripper
+	notify ChaosNotifier
+	c []Chaos
+}
+
+// Chaos intercepts requests to a remote HTTP endpoint and can inject arbitrary
+// failures.
+type Chaos interface {
+	// Intercept should return true if the normal flow should be skipped, and the
+	// return response and error used instead. Modifications to the request will
+	// be ignored, but may be used to make decisions about types of failures.
+	Intercept(req *http.Request) (bool, *http.Response, error)

@smarterclayton

smarterclayton Apr 13, 2015

Contributor

I remember now - we also want to emulate errors, so the return code would check (response != nil || err != nil) which is sufficiently un-Golike that I felt the bool was more accurate.
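
The point about the bool can be sketched as follows. The Chaos interface matches the signature quoted in the review; everything else (failWith, respondWith, passThrough, firstIntercept, describe, errReset) is a hypothetical stand-in written for this illustration, and the URL is a placeholder that is never dialed.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Chaos matches the interface signature under review.
type Chaos interface {
	Intercept(req *http.Request) (bool, *http.Response, error)
}

var errReset = errors.New("connection reset by peer")

// failWith intercepts with an error and no response.
type failWith struct{ err error }

func (f failWith) Intercept(*http.Request) (bool, *http.Response, error) {
	return true, nil, f.err
}

// respondWith intercepts with a canned response and no error.
type respondWith struct{ status int }

func (r respondWith) Intercept(*http.Request) (bool, *http.Response, error) {
	return true, &http.Response{StatusCode: r.status, Body: http.NoBody}, nil
}

// passThrough never intercepts.
type passThrough struct{}

func (passThrough) Intercept(*http.Request) (bool, *http.Response, error) {
	return false, nil, nil
}

// firstIntercept returns the result of the first Chaos that chooses to
// intercept. The bool alone decides; without it the caller would have to
// test (resp != nil || err != nil).
func firstIntercept(req *http.Request, cs ...Chaos) (bool, *http.Response, error) {
	for _, c := range cs {
		if ok, resp, err := c.Intercept(req); ok {
			return true, resp, err
		}
	}
	return false, nil, nil
}

// describe summarizes what a chain of interceptors does to one request.
func describe(cs ...Chaos) string {
	req, err := http.NewRequest("GET", "http://master.invalid/healthz", nil)
	if err != nil {
		panic(err)
	}
	handled, resp, err := firstIntercept(req, cs...)
	switch {
	case !handled:
		return "pass-through"
	case err != nil:
		return "error: " + err.Error()
	default:
		return fmt.Sprintf("response: %d", resp.StatusCode)
	}
}

func main() {
	fmt.Println(describe(passThrough{}))
	fmt.Println(describe(passThrough{}, failWith{errReset}))
	fmt.Println(describe(respondWith{http.StatusServiceUnavailable}))
}
```

With the explicit bool, an interceptor can replace a request with an error only, a response only, or decline entirely, and the dispatch loop never has to infer intent from nil values.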

@fabioy

fabioy Apr 13, 2015

Member

Fair enough.

// TODO: make this more accurate
// TODO: add other error types
// TODO: add a helper for returning multiple errors randomly.
var ErrSimulatedConnectionResetByPeer = Error{errors.New("connection reset by peer")}

@fabioy

fabioy Apr 13, 2015

Member

May be nice to make the error message clear that it's a Chaos-induced reset.

@smarterclayton

smarterclayton Apr 13, 2015

Contributor

In cases where we detect failures based on the error type (which matters when simulating real client errors), changing the message makes that harder. ChaosNotifier is really supposed to bear that burden: you should see the message from Chaos in the log, followed by either something printed by the core component (at high debug levels) or the proper behavior.
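
A rough sketch of that division of labor: the error keeps the text a real failure would carry, and a separate notification records that chaos was responsible. The definition of Error as a struct with an embedded error is an assumption (the diff only shows the literal Error{errors.New(...)}), and looksLikeConnReset, isSimulated, notifyAndFail, and errOther are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Error tags an error as chaos-generated without changing its message.
// Assumption: the real type may be defined differently.
type Error struct {
	error
}

var ErrSimulatedConnectionResetByPeer = Error{errors.New("connection reset by peer")}

// errOther is an unrelated, non-simulated error used for comparison.
var errOther = errors.New("no route to host")

// looksLikeConnReset is how a hypothetical component might classify a
// failure by its message, exactly as it would for a real network error.
func looksLikeConnReset(err error) bool {
	return err != nil && strings.Contains(err.Error(), "connection reset by peer")
}

// isSimulated reports whether the error was injected by chaos.
func isSimulated(err error) bool {
	_, ok := err.(Error)
	return ok
}

// notifyAndFail plays the notifier's role: it records that chaos replaced
// the request, then returns an error whose message is left untouched.
func notifyAndFail(log *[]string, path string) error {
	err := ErrSimulatedConnectionResetByPeer
	*log = append(*log, fmt.Sprintf("chaos: replaced request to %s with %q", path, err.Error()))
	return err
}

func main() {
	var log []string
	err := notifyAndFail(&log, "/healthz")
	fmt.Println(log[0])
	fmt.Println("classified as reset:", looksLikeConnReset(err), "simulated:", isSimulated(err))
}
```

Code that matches on the message behaves the same as it would for a real reset, while anyone reading the log sees the chaos line first.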

----- Original Message -----

+	if c.s.Float64() < c.p {
+		return c.Chaos.Intercept(req)
+	}
+	return false, nil, nil
+}
+
+func (c pIntercept) String() string {
+	return fmt.Sprintf("P{%f %s}", c.p, c.Chaos)
+}
+
+// ErrSimulatedConnectionResetByPeer emulates the golang net error when a connection
+// is reset by a peer.
+// TODO: make this more accurate
+// TODO: add other error types
+// TODO: add a helper for returning multiple errors randomly.
+var ErrSimulatedConnectionResetByPeer = Error{errors.New("connection reset by peer")}
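
The quoted fragment starts mid-function, so here is a self-contained sketch of how a probability decorator of this shape could fit together. The Intercept and String bodies follow the quoted lines; the Seed type, NewSeed, the field layout of pIntercept, and the reset interceptor are assumptions filled in for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

type Chaos interface {
	Intercept(req *http.Request) (bool, *http.Response, error)
}

// reset is a stand-in interceptor that always fails the request.
type reset struct{}

func (reset) Intercept(*http.Request) (bool, *http.Response, error) {
	return true, nil, fmt.Errorf("connection reset by peer")
}

func (reset) String() string { return "reset" }

// Seed is the shared random source (assumed shape). It is not safe for
// concurrent use in this sketch.
type Seed struct{ *rand.Rand }

func NewSeed(seed int64) Seed { return Seed{rand.New(rand.NewSource(seed))} }

// P wraps c so that it only fires with probability p.
func (s Seed) P(p float64, c Chaos) Chaos { return pIntercept{c, s, p} }

type pIntercept struct {
	Chaos
	s Seed
	p float64
}

func (c pIntercept) Intercept(req *http.Request) (bool, *http.Response, error) {
	if c.s.Float64() < c.p {
		return c.Chaos.Intercept(req)
	}
	return false, nil, nil
}

func (c pIntercept) String() string {
	return fmt.Sprintf("P{%f %s}", c.p, c.Chaos)
}

// hits counts how often the wrapped interceptor fires over n requests.
func hits(p float64, n int) int {
	c := NewSeed(1).P(p, reset{})
	count := 0
	for i := 0; i < n; i++ {
		if ok, _, _ := c.Intercept(nil); ok {
			count++
		}
	}
	return count
}

// label returns the String form that a log line would show.
func label(p float64) string {
	return fmt.Sprint(NewSeed(1).P(p, reset{}))
}

func main() {
	fmt.Println(label(0.1), "fired", hits(0.1, 1000), "times in 1000")
}
```

Keeping the probability in a decorator means any interceptor can be made probabilistic without knowing about randomness itself, and the String method gives the notifier something readable to log.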

return chaosclient.NewChaosRoundTripper(rt, chaosclient.LogChaos, seed.P(s.ChaosChance, chaosclient.ErrSimulatedConnectionResetByPeer))
}
}

@fabioy

fabioy Apr 13, 2015

Member

Small ask, could you put this into a separate method with a "chaos" description? Again, just to help organize the stuff that is for test purposes only.
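
One possible shape for that request is sketched below. It is not the change that was pushed: Config is a stand-in for client.Config reduced to a single assumed WrapTransport hook, chaosRoundTripper is a simplified stand-in for what chaosclient builds, and addChaosToClientConfig and chaosEnabled are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
)

// Config is a stand-in for client.Config with only the hook relevant here.
type Config struct {
	WrapTransport func(http.RoundTripper) http.RoundTripper
}

// KubeletServer is trimmed to the single field this sketch needs.
type KubeletServer struct {
	ChaosChance float64
}

// chaosRoundTripper is a simplified stand-in for the chaos transport.
type chaosRoundTripper struct {
	next http.RoundTripper
	p    float64
}

func (c chaosRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	if rand.Float64() < c.p {
		return nil, errors.New("connection reset by peer")
	}
	return c.next.RoundTrip(req)
}

// addChaosToClientConfig injects random errors into client connections when
// ChaosChance is configured. It is intended for testing only, and keeping it
// in one named method makes the test-only wiring easy to locate.
func (s *KubeletServer) addChaosToClientConfig(config *Config) {
	if s.ChaosChance == 0 {
		return
	}
	config.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return chaosRoundTripper{next: rt, p: s.ChaosChance}
	}
}

// chaosEnabled reports whether a server with the given chance installs a
// transport wrapper.
func chaosEnabled(chance float64) bool {
	var config Config
	(&KubeletServer{ChaosChance: chance}).addChaosToClientConfig(&config)
	return config.WrapTransport != nil
}

func main() {
	fmt.Println(chaosEnabled(0), chaosEnabled(0.1))
}
```

With a chance of 0 the config is left untouched, so the normal code path carries no trace of the test hook.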

@fabioy

Member

fabioy commented Apr 13, 2015

LGTM. Just update with the changes and I'll merge.

Thanks.

@smarterclayton

Contributor

smarterclayton commented Apr 13, 2015

Updated

fabioy added a commit that referenced this pull request Apr 13, 2015

Merge pull request #6729 from smarterclayton/chaosclient
Add a new Chaos transport that can simulate network failure and add it to the kubelet

@fabioy fabioy merged commit e99141d into kubernetes:master Apr 13, 2015

3 checks passed

Shippable: Shippable builds completed
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage decreased (-0.01%) to 54.13%