Apiserver returns 0-length responses #71696

Closed
mborsz opened this Issue Dec 4, 2018 · 39 comments

mborsz commented Dec 4, 2018

What happened:
https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability-killer/83 failed to create a namespace with the error:

Dec  3 18:44:39.236: INFO: Unexpected error while creating namespace: 0-length response

The related kube-apiserver log entry says:

I1203 18:44:39.209545       1 wrap.go:47] POST /api/v1/namespaces: (3.864455ms) 0 [e2e.test/v1.14.0 (linux/amd64) kubernetes/dc9261b -- [sig-scalability] Load capacity [Feature:Performance] should be able to handle 30 pods per node ReplicationController with 0 secrets, 0 configmaps and 0 daemons 35.192.223.3:57508]

What I find interesting is the "0" status code.

What you expected to happen:
A nonempty response to be returned.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): master version, this attempt ran at v1.14.0-alpha.0.793+dc9261bc3b6b81

  • Cloud provider or hardware configuration: gce

  • OS (e.g. from /etc/os-release): cos

  • Kernel (e.g. uname -a):

  • Install tools:

  • Others:

/kind bug

/cc @kubernetes/sig-api-machinery-bugs

lavalamp commented Dec 6, 2018

/assign @jpbetz

mborsz commented Dec 14, 2018

/cc @wojtek-t

wojtek-t commented Dec 14, 2018

@mborsz - from what I understand, we also know that (at least in some cases) the namespace was actually created. Is that true?
[I'm asking, because it significantly reduces the number of places where something could go wrong in those cases]

mborsz commented Dec 14, 2018

Yes, at least in https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability-killer/83 the namespace was created despite the 0-length response.
All occurrences I've seen so far were for the creation of an object of some kind (a namespace or a replicationcontroller).

mborsz changed the title from "Apiserver turns 0-length responses" to "Apiserver returns 0-length responses" on Dec 14, 2018

wojtek-t commented Dec 20, 2018

It seems we've started seeing this quite regularly recently. There was at least one more yesterday (and, as in all other cases, it was again a POST operation).

wojtek-t commented Dec 20, 2018

I'm wondering if it may be related to this:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/responsewriters/status.go#L42

In theory, status.Status should only be either StatusSuccess or StatusFailure, but if the data were somehow corrupted, we would return status.Code = 0.
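
For illustration, the pattern being described is roughly the following; this is a paraphrased sketch of the defaulting logic, not the exact source, and the function name is made up.

package main

import (
	"fmt"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultStatusCode paraphrases the switch linked above: it only fills in a
// default Code for the two known Status values, so any other (corrupted)
// value keeps Code == 0.
func defaultStatusCode(status metav1.Status) metav1.Status {
	switch status.Status {
	case metav1.StatusSuccess:
		if status.Code == 0 {
			status.Code = http.StatusOK
		}
	case metav1.StatusFailure:
		if status.Code == 0 {
			status.Code = http.StatusInternalServerError
		}
	}
	return status
}

func main() {
	corrupted := metav1.Status{Status: "garbage"} // neither Success nor Failure
	fmt.Println(defaultStatusCode(corrupted).Code) // prints 0
}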

wojtek-t commented Dec 20, 2018

That said, I went through the code and didn't find anything that could have caused this, TBH.

liggitt commented Dec 21, 2018

we are seeing this cause many integration test failures, across completely different tests

https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/70862/pull-kubernetes-integration/39056/

--- FAIL: TestNewDeployment (3.54s)
    deployment_test.go:52: failed to create deployment : 0-length response

https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/72239/pull-kubernetes-integration/39059/

--- FAIL: Test202StatusCode (3.58s)
    apiserver_test.go:140: Failed to create rs: 0-length response

https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/72171/pull-kubernetes-integration/39055/

--- FAIL: TestPersistentVolumeControllerStartup (4.10s)
    persistent_volumes_test.go:775: Cannot create volume "pv-startup-74": 0-length response

https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/72057/pull-kubernetes-integration/39048/

--- FAIL: TestPodDeletionWithDswp (4.84s)
    attach_detach_test.go:172: Failed to create pod : 0-length response

https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/72239/pull-kubernetes-integration/39037/

--- FAIL: TestSecrets (3.64s)
    secrets_test.go:106: Failed to create pod: 0-length response

we also saw an odd error that appears to indicate truncated input:

https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/68283/pull-kubernetes-integration/39034/

    --- FAIL: TestEtcdStoragePath//v1,_Resource=serviceaccounts (0.01s)
        etcd_storage_path_test.go:123: failed to create stub for /v1, Resource=serviceaccounts: &json.SyntaxError{msg:"unexpected end of JSON input", Offset:0}
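
For what it's worth, that json.SyntaxError is exactly what decoding a zero-length body produces, which ties it to the same 0-length responses; a minimal check:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Unmarshalling an empty body yields the same error the test reported.
	var out map[string]interface{}
	err := json.Unmarshal([]byte(""), &out)
	fmt.Printf("%#v\n", err) // &json.SyntaxError{msg:"unexpected end of JSON input", Offset:0}
}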

dims commented Dec 21, 2018

/kind failing-test

jpbetz commented Dec 21, 2018

Attempting to isolate this with #72289

dims commented Dec 26, 2018

@liggitt - tracking down the earliest time the "0-length" error showed up points to https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2018-11-16&text=0-length%20response

Of the CI jobs that show as failed, the earliest seems to be 2018-11-16 09:24 EST, which is a ci-kubernetes-e2e-gci-gke-multizone job (logs are at https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-multizone/33114). Picking the SHA from that run (33114) and from the one before it (33113, https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-multizone/33113/) gives the following range:

9e2820e...e94a9f9

The range seems to include the apiserver panic propagation PR too. Maybe you can spot something in this list of changes when you get a chance.

liggitt commented Dec 26, 2018

prior to the panic propagation PR, if the utilruntime.HandleCrash call in finishRequest was hit, the kube-apiserver process would exit, so I'm pretty sure that didn't make anything that was working before work less well :)

edit on 1/12: @dims root-caused it to this PR. see #72856

I'm also not seeing panic output in the logs of those test failures

most of the other changes in that range were actually reverted in #71158

dims commented Dec 26, 2018

@liggitt ack. If I go back a few CI jobs and pick 33110, I see the timeout change from sttts: 973b5d2...9e2820e

liggitt commented Dec 26, 2018

https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2018-11-17&pr=1&text=(creat%7Cupdat%7Cdeploy).*0-length%20response

I see this in the kube-apiserver logs from https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/70862/pull-kubernetes-e2e-gce-100-performance/27503, with no other info/error/panic logs around the request line:

I1116 20:26:40.551376 1 wrap.go:47] POST /api/v1/namespaces/e2e-tests-load-30-nodepods-1-z4b8d/replicationcontrollers: (8.781533ms) 0 [e2e.test/v0.0.0 (linux/amd64) ...

this audit event appears to correlate, and has an interesting status:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "92cfec74-e3ec-4cce-b7e0-7a7a1ebd255b",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/e2e-tests-load-30-nodepods-1-z4b8d/replicationcontrollers",
  "verb": "create",
  "user": {
    "username": "kubecfg",
    "groups": [
      "system:masters",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "35.232.4.199"
  ],
  "userAgent": "e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "replicationcontrollers",
    "namespace": "e2e-tests-load-30-nodepods-1-z4b8d",
    "name": "load-small-78",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Success",
    "message": "Connection closed early",
    "code": 200
  },
  "requestObject": {
    "kind": "ReplicationController",
    "apiVersion": "v1",
    "metadata": {
      "name": "load-small-78",
      "creationTimestamp": null,
      "labels": {
        "name": "load-small-78",
        "svc-label": "load-small-39"
      }
    },
    "spec": {
      "replicas": 5,
      "selector": {
        "name": "load-small-78"
      },
      "template": {
        "metadata": {
          "creationTimestamp": null,
          "labels": {
            "name": "load-small-78",
            "svc-label": "load-small-39"
          }
        },
        "spec": {
          "containers": [
            {
              "name": "load-small-78",
              "image": "gcr.io/kubernetes-e2e-test-images/serve-hostname:1.1",
              "ports": [
                {
                  "containerPort": 80,
                  "protocol": "TCP"
                }
              ],
              "resources": {
                "requests": {
                  "cpu": "10m",
                  "memory": "26214400"
                }
              },
              "terminationMessagePath": "/dev/termination-log",
              "terminationMessagePolicy": "File",
              "imagePullPolicy": "IfNotPresent"
            }
          ],
          "restartPolicy": "Always",
          "terminationGracePeriodSeconds": 1,
          "dnsPolicy": "Default",
          "securityContext": {},
          "schedulerName": "default-scheduler"
        }
      }
    },
    "status": {
      "replicas": 0
    }
  },
  "requestReceivedTimestamp": "2018-11-16T20:26:40.542985Z",
  "stageTimestamp": "2018-11-16T20:26:40.551212Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": ""
  }
}

wojtek-t commented Dec 28, 2018

@liggitt so fake success? https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/filters/audit.go#L91-L97

In all cases I've seen, the request finished successfully in etcd; i.e., I've seen a couple of cases where the request was the creation of an object (POST xxx) and the object was then readable from etcd, so it succeeded. Not sure what etcd returned, though.
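
For context, the fallback behind that audit.go link works roughly like this (a paraphrased, self-contained sketch with made-up type and function names, not the actual apiserver code): if no response status was ever captured for the request, the audit event is completed with a placeholder success status, which matches the "Connection closed early" / 200 status in the audit event above.

package main

import (
	"fmt"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// auditEvent stands in for the real audit event type; only the field that
// matters here is modeled.
type auditEvent struct {
	ResponseStatus *metav1.Status
}

// finishEvent sketches the fallback: if the handler completed without any
// response status being recorded, the event gets a faked success status.
func finishEvent(ev *auditEvent) {
	if ev.ResponseStatus == nil {
		ev.ResponseStatus = &metav1.Status{
			Code:    http.StatusOK,
			Status:  metav1.StatusSuccess,
			Message: "Connection closed early",
		}
	}
}

func main() {
	ev := &auditEvent{} // nothing was ever written to the response
	finishEvent(ev)
	fmt.Println(ev.ResponseStatus.Code, ev.ResponseStatus.Message) // 200 Connection closed early
}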

sbezverk commented Dec 28, 2018

@dims @liggitt if grpc is suspected, I could add a grpc interceptor which would log grpc packets; if the issue is triggered, it would then be easier to confirm whether the truncation occurs in the grpc library or higher up the stack. Please let me know if you think it is worth trying.

liggitt commented Dec 30, 2018

seen in https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/72190/pull-kubernetes-integration/39569/:

--- FAIL: TestAuthModeAlwaysAllow (3.74s)
    auth_test.go:451: case {POST /api/v1/namespaces/auth-always-allow/pods?timeout=60s 
        {
          "kind": "Pod",
          "apiVersion": "v1",
          "metadata": {
            "name": "a",
            "creationTimestamp": null
          },
          "spec": {
            "containers": [
              {
                "name": "foo",
                "image": "bar/foo"
              }
            ]
          }
        }
         map[201:true]}
    auth_test.go:452: Expected status one of map[201:true], but got 200
    auth_test.go:453: Body: 

dims commented Dec 31, 2018

@liggitt that seems to rule out issues on the client-go side, as TestAuthModeAlwaysAllow uses http.NewRequest directly
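
For reference, the relevant part of that test boils down to a raw net/http request, so client-go is not in the request path at all; a simplified sketch (the URL and server setup are placeholders):

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"
)

func main() {
	body := `{"kind":"Pod","apiVersion":"v1","metadata":{"name":"a"},"spec":{"containers":[{"name":"foo","image":"bar/foo"}]}}`
	// The request is built directly with net/http, as the integration test does.
	req, err := http.NewRequest("POST", "http://127.0.0.1:8080/api/v1/namespaces/auth-always-allow/pods?timeout=60s", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	respBody, _ := ioutil.ReadAll(resp.Body)

	// In the failing runs this prints 200 with a 0-length body instead of 201.
	fmt.Println(resp.StatusCode, len(respBody))
}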

liggitt commented Dec 31, 2018

@liggitt that seems to rule out issues on the client-go side, as TestAuthModeAlwaysAllow uses http.NewRequest directly

agreed

dims commented Jan 2, 2019

Here's an interesting run from CI about an hour or so ago:
https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-beta/9556

Seems to be happening in release-1.13 branch too!

liggitt commented Jan 2, 2019

Seems to be happening in release-1.13 branch

that's not surprising, given the first occurrence was well before code freeze lifted

liggitt commented Jan 2, 2019

http://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/72249/pull-kubernetes-integration/39697/

--- FAIL: TestPDBInPreemption (12.84s)
    preemption_test.go:927: ================ Running test: A non-PDB violating pod is preempted despite its higher priority
    preemption_test.go:927: ================ Running test: A node without any PDB violating pods is preferred for preemption
    preemption_test.go:927: ================ Running test: A node with fewer PDB violating pods is preferred for preemption
    preemption_test.go:940: Test [A node with fewer PDB violating pods is preferred for preemption]: Error running pause pod: Error creating pause pod: 0-length response with status code: 200 and content type:

wojtek-t commented Jan 2, 2019

This seems to be yet another failure mode: most of the previous ones showed status code 0, while this one shows 200.

mborsz commented Jan 3, 2019

I believe that https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/289 is also affected by this.

It failed with:

I1231 13:12:51.851] Dec 31 13:12:51.851: INFO: Unexpected error occurred: error while scaling RC load-small-11799 to 2 replicas: Error while scaling load-small-11799 to 2 replicas: Object 'Kind' is missing in ''

and the associated kube-apiserver.log entry is:

I1231 13:12:51.612580       1 wrap.go:47] PUT /api/v1/namespaces/load-30-nodepods-50-3431/replicationcontrollers/load-small-11799/scale: (799.133205ms) 0 [Go-http-client/2.0 35.192.225.242:35028]

The kube-apiserver.log entry has a 0 status code, and the error suggests that the response was incomplete.

wojtek-t commented Jan 3, 2019

xref #72538

liggitt commented Jan 4, 2019

The logged status of 0 on the server side, and the absence of a status stack even with #72498 merged, indicate that neither Write nor WriteHeader is being called. The default Go server handles that as a 0-length response with a 200 status code, which is why the server logs a status of 0 and clients see a 0-length 200 response.
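
That behavior of the Go HTTP server is easy to confirm: a handler that returns without calling Write or WriteHeader yields a 200 status and an empty body. A minimal demonstration (plain net/http, not apiserver code):

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
)

func main() {
	// A handler that returns without ever touching the ResponseWriter.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// no Write, no WriteHeader
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)

	// net/http fills in 200 and the body is empty: the same
	// "0-length response with status code: 200" the tests report.
	fmt.Println(resp.StatusCode, len(body)) // 200 0
}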

I tried creating all sorts of errors (metav1.Status errors with and without valid status and code fields, non-status errors, etc), and could not reproduce a case where an error passed to writers.go#ErrorNegotiated would result in a zero-length response and no log.

If the switch statement referenced in #71696 (comment) were hit with a status that was neither success nor failure and a zero code, the server would attempt to write it, and would panic and log in the server log.

liggitt commented Jan 7, 2019

With the unit test output fix merged, we're getting better visibility into the occurrences of this:

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=0-length%20response&job=kubernetes-integration

liggitt commented Jan 7, 2019

opened #72650 to try to capture more details if any code paths are calling WriteObjectNegotiated and returning without calling Write or WriteHeader

liggitt commented Jan 7, 2019

strangely, that PR just caught the following, but the check for having called Write/WriteHeader passed and did not log any other details:

I0107 18:46:42.548979  118229 wrap.go:47] POST /apis/batch/v1beta1/namespaces/simple-cronjob-test/cronjobs: (19.549103ms) 0 [cronjob.test/v0.0.0 (linux/amd64) kubernetes/$Format 127.0.0.1:55116]
cronjob_test.go:169: Failed to create CronJob: 0-length response with status code: 200 and content type: 

That seems to indicate that one of the response writers (we wrap several times, for metrics, auditing, etc.) might not be propagating header/body data to the original ResponseWriter, which is super strange.
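
To make that concrete: each wrapping response writer has to delegate Write and WriteHeader to the writer underneath it, and a wrapper that records the call but fails to pass it through would produce exactly this symptom. A hypothetical sketch (these are not the actual apiserver wrappers):

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
)

// countingWriter is a stand-in for the kind of wrapper the apiserver stacks up
// for metrics and auditing. Both methods must delegate; dropping either
// delegation line would leave the client with a 0-length 200 response even
// though the handler "wrote" a real one.
type countingWriter struct {
	http.ResponseWriter
	status int
	bytes  int
}

func (w *countingWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

func (w *countingWriter) Write(b []byte) (int, error) {
	n, err := w.ResponseWriter.Write(b)
	w.bytes += n
	return n, err
}

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cw := &countingWriter{ResponseWriter: w}
		cw.WriteHeader(http.StatusCreated)
		cw.Write([]byte(`{"kind":"Status"}`))
	})
	srv := httptest.NewServer(handler)
	defer srv.Close()

	resp, _ := http.Get(srv.URL)
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, string(body)) // 201 {"kind":"Status"} when delegation is intact
}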

liggitt commented Jan 8, 2019

seeing more and more of these failures... for example, 3 out of the last 6 runs of https://gubernator.k8s.io/pr/72666 failed on this error

liggitt commented Jan 12, 2019

@dims tracked it down and deserves all praise. Fix opened in #72856
