
State of a failing revision is not carry-forwarded to the CR status and the logs #10267

Closed
Shashankft9 opened this issue Dec 4, 2020 · 11 comments
Labels
area/API API objects and controllers kind/bug Categorizes issue or PR as related to a bug. triage/needs-user-input Issues which are waiting on a response from the reporter

Comments

@Shashankft9
Member

Shashankft9 commented Dec 4, 2020

/area API

What version of Knative?

I have tested the behavior in v0.14.0 and v0.19.0

Expected Behavior

When a revision fails and its pod shows an Error state (for example, due to a 502 Bad Gateway response), this state should be captured in the logs of the controller pod (see https://github.com/knative/serving/blob/master/pkg/reconciler/configuration/configuration.go#L111) and, similarly, carried forward to the status of the Revision, Configuration, and ksvc CRs (via https://github.com/knative/serving/blob/master/pkg/reconciler/configuration/configuration.go#L113, for example).

Actual Behavior

I will distinguish the behavior in the two versions I have tested:

v0.14.0 - Here the logs and the status are captured correctly, but only once the queue-proxy container is ready (I am not certain this is the cause, but it appears so from observation). When the ksvc instance is scaled down to zero and the first request arrives, even though the response is a 502 and the pod goes into Error, neither the log nor the CR status (RevisionFailed) is reported; the pod goes from 0/2 to 1/2 and then into Error. But after some seconds, when the pod comes back to 2/2 with a new user-container (note: the ksvc hasn't scaled down to zero) and I send a request, the pod goes into Error and both the logs and the CR status are captured correctly. So essentially, the failure is not reported the first time, until the pod reaches the 2/2 stage.

v0.19.0 - Here even the above behavior is absent: I can't see the relevant log, and the status stays True for all the CRs at all times.

Steps to Reproduce the Problem

  1. A serving workload that deliberately fails when a request comes in. My workload has an unmarshalling problem because the variable `result` is given the wrong data type:
package main

import (
  "fmt"
  "log"
  "net/http"
  "os"
  "io/ioutil"
  "encoding/json"
  "github.com/golang/glog"
)

func handler(w http.ResponseWriter, r *http.Request) {
  log.Print("helloworld: received a request")
  target := os.Getenv("TARGET")
  if target == "" {
    target = "World"
  }
  fmt.Fprintf(w, "Hello %s!\n", target)

  var result string
  data, err := ioutil.ReadAll(r.Body)
  if err != nil {
     glog.Fatalf("Failed to read the request body: %v", err)
  }
  glog.Infof("Data sent with the request: %s", string(data))
  if err = json.Unmarshal(data, &result); err != nil{
     glog.Fatal(err)
  }
}

func main() {
  log.Print("helloworld: starting server...")

  http.HandleFunc("/", handler)

  port := os.Getenv("PORT")
  if port == "" {
    port = "8080"
  }

  log.Printf("helloworld: listening on port %s", port)
  log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), nil))
}
  2. Deploying it:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: failing-func
  namespace: serverless
spec:
  template:
    metadata:
      name: failing-func-v1
      namespace: serverless
      annotations:
        autoscaling.knative.dev/window: "180s"
        autoscaling.knative.dev/metric: "rps"
        autoscaling.knative.dev/target: "100"
    spec:
      containers:
        - image: xxx
          imagePullPolicy: Always
          env:
            - name: TARGET
              value: "Go Sample v1"
          ports:
            - containerPort: 8000
  3. Sending a request:
curl http://xxx -H "Host: failing-func.serverless.xxx.com" -vvv
* Rebuilt URL to: http://xxx/
*   Trying xxx...
* TCP_NODELAY set
* Connected to xxx (xxx) port 80 (#0)
> GET / HTTP/1.1
> Host: failing-func.serverless.xxx.com
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 502 Bad Gateway
< content-length: 4
< content-type: text/plain; charset=utf-8
< date: Fri, 04 Dec 2020 09:12:01 GMT
< x-content-type-options: nosniff
< x-envoy-upstream-service-time: 4769
< server: istio-envoy
<
EOF
@Shashankft9 Shashankft9 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 4, 2020
@markusthoemmes
Contributor

This is indeed a hole in our readiness state. A revision that was once considered Ready will currently never become unready.

@Shashankft9
Member Author

Shashankft9 commented Dec 7, 2020

Do you think this known hole also covers the behavior I noticed in v0.14.0, where the status never reports an unready state when the first request comes in on a ksvc that has scaled down to zero?
I can also try v0.18.0 to verify; I remember Dave wanted to test this on v0.18.0 here: https://gist.github.com/dprotaso/56c1fd920291eff29cce48f0501732a0

ref: https://knative.slack.com/archives/CA4DNJ9A4/p1606458112460300

@Shashankft9
Member Author

Yep, same with v0.18.0; I just tested the flow. The status remains True.

@mattmoor
Member

mattmoor commented Jan 6, 2021

I'm wondering how seriously we should take this, for a few reasons:

  1. log.Fatal as a failure mode seems extreme; even a panic in Go is generally handled more gracefully by the http server, IIRC. The term I have generally used for requests that kill the server is a "query of death".
  2. The default readiness check (what's used here) is our built-in TCP probe, which is clearly inadequate to detect this. If you add an HTTP GET probe that issues a malformed request, it would never become ready in the first place, IIUC.

I think that there are two scenarios worth spelling out:

  1. Occasional "query of death", which we will have a very hard time protecting against.
  2. All "real" traffic is a "query of death", which we might be able to protect against post-facto.

I think that if we can manage 2., then bad changes rolled out via latestRevision: true (with or without the new rollout work @vagababov has been doing) could theoretically be rolled back, which would be a nice protection to offer folks, but we need to work out the details very carefully.

@mattmoor
Member

mattmoor commented Jan 6, 2021

cc @dprotaso

@dprotaso
Member

dprotaso commented Jan 6, 2021

@evankanderson
Member

/area API

Is there more that needs to be done here? @dprotaso , can you put a priority/hint here?

@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label Mar 19, 2021
@evankanderson
Member

/triage needs-user-input

@knative-prow-robot knative-prow-robot added the triage/needs-user-input Issues which are waiting on a response from the reporter label Mar 19, 2021
@Shashankft9
Member Author

Hey, I just noticed the label, is there any more info I could provide?

@dprotaso
Member

I think Evan was looking for input from me. I'll surface what I mentioned in Slack.

We can't automatically distinguish when a runtime failure should gate readiness of the revision. We need hints from the user via readiness/liveness probes.
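Such a hint could look like the following sketch, extending the Service spec from the reproduction steps (the /healthz path is a hypothetical endpoint the workload would have to expose; Knative infers the probe port from the container port):

```yaml
spec:
  template:
    spec:
      containers:
        - image: xxx
          readinessProbe:
            httpGet:
              path: /healthz   # hypothetical health endpoint in the workload
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
            periodSeconds: 10
```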

Matt's scenario 2), in my opinion, should be managed by some higher-level continuous delivery tool. That would capture not only runtime errors (counting HTTP status codes) but also performance regressions (changes in request latency).

I'm going to close this issue out and recommend looking at Spinnaker and other CD tools.

@dprotaso
Member

I made a follow-up issue to confirm that, if we have the right hints (readiness/liveness probes failing), we behave properly.

#11531
