
State of a failing revision is not carry-forwarded to the CR status and the logs #10267

Closed
Shashankft9 opened this issue Dec 4, 2020 · 11 comments
Labels
area/API API objects and controllers kind/bug Categorizes issue or PR as related to a bug. triage/needs-user-input Issues which are waiting on a response from the reporter

Comments

@Shashankft9
Member

Shashankft9 commented Dec 4, 2020

/area API

What version of Knative?

I have tested the behavior in v0.14.0 and v0.19.0

Expected Behavior

When a revision fails and its pod shows an Error state (for example, due to a 502 Bad Gateway response), this state should be captured in the logs of the controller pod (see https://github.com/knative/serving/blob/master/pkg/reconciler/configuration/configuration.go#L111) and, similarly, carried forward to the status of the Revision, Configuration, and ksvc CRs (via https://github.com/knative/serving/blob/master/pkg/reconciler/configuration/configuration.go#L113, for example).

Actual Behavior

I will distinguish the behavior in the two versions I have tested:

v0.14.0 - Here the logs and the status are captured correctly, but only once the queue-proxy container is ready (I am not certain this is the cause, but it appears so from observation). When the ksvc instance is scaled down to zero and the first request arrives, even though the response is a 502 and the pod goes into Error, neither the log nor the CR status (RevisionFailed) is reported; the pod goes from 0/2 to 1/2 and then into Error. But after some seconds, when the pod comes back to 2/2 with a new user-container (note: the ksvc hasn't scaled down to zero) and I send a request, the pod goes into Error and both the logs and the CR status are captured correctly. So essentially, the failure is not reported the first time, until the pod reaches the 2/2 stage.

v0.19.0 - Here even the above behavior is absent: I can't see the relevant log, and the status stays True for all the CRs at all times.

Steps to Reproduce the Problem

  1. A serving workload that deliberately fails when a request comes in. My workload has an unmarshalling problem because the variable `result` is given the wrong data type:
package main

import (
  "fmt"
  "log"
  "net/http"
  "os"
  "io/ioutil"
  "encoding/json"
  "github.com/golang/glog"
)

func handler(w http.ResponseWriter, r *http.Request) {
  log.Print("helloworld: received a request")
  target := os.Getenv("TARGET")
  if target == "" {
    target = "World"
  }
  fmt.Fprintf(w, "Hello %s!\n", target)

  var result string
  data, err := ioutil.ReadAll(r.Body)
  if err != nil {
     glog.Fatalf("Failed to read the request body: %v", err)
  }
  glog.Infof("Data sent with the request: %s", string(data))
  if err = json.Unmarshal(data, &result); err != nil{
     glog.Fatal(err)
  }
}

func main() {
  log.Print("helloworld: starting server...")

  http.HandleFunc("/", handler)

  port := os.Getenv("PORT")
  if port == "" {
    port = "8080"
  }

  log.Printf("helloworld: listening on port %s", port)
  log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), nil))
}
  2. Deploying it:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: failing-func
  namespace: serverless
spec:
  template:
    metadata:
      name: failing-func-v1
      namespace: serverless
      annotations:
        autoscaling.knative.dev/window: "180s"
        autoscaling.knative.dev/metric: "rps"
        autoscaling.knative.dev/target: "100"
    spec:
      containers:
        - image: xxx
          imagePullPolicy: Always
          env:
            - name: TARGET
              value: "Go Sample v1"
          ports:
            - containerPort: 8000
  3. Sending a request:
curl http://xxx -H "Host: failing-func.serverless.xxx.com" -vvv
* Rebuilt URL to: http://xxx/
*   Trying xxx...
* TCP_NODELAY set
* Connected to xxx (xxx) port 80 (#0)
> GET / HTTP/1.1
> Host: failing-func.serverless.xxx.com
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 502 Bad Gateway
< content-length: 4
< content-type: text/plain; charset=utf-8
< date: Fri, 04 Dec 2020 09:12:01 GMT
< x-content-type-options: nosniff
< x-envoy-upstream-service-time: 4769
< server: istio-envoy
<
EOF
@Shashankft9 Shashankft9 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 4, 2020
@markusthoemmes
Contributor

This is indeed a hole in our readiness state. A revision that was once considered Ready will currently never become unready.

@Shashankft9
Member Author

Shashankft9 commented Dec 7, 2020

Do you think this known hole also covers the behavior I noticed in v0.14.0, where the status never reports an unready state when the first request comes in on a ksvc that has scaled down to zero?
I can also try v0.18.0 to verify; I remember Dave wanted to test this on v0.18.0 here: https://gist.github.com/dprotaso/56c1fd920291eff29cce48f0501732a0

ref: https://knative.slack.com/archives/CA4DNJ9A4/p1606458112460300

@Shashankft9
Member Author

Yep, same with v0.18.0; I just tested the flow. The status remains True.

@mattmoor
Member

mattmoor commented Jan 6, 2021

I'm wondering how seriously we should take this, for a few reasons:

  1. log.Fatal as a failure mode seems extreme; even a panic in Go is generally handled more gracefully by the http server, IIRC. The term I have generally used for requests that kill the server is a "query of death".
  2. The default readiness check (what's used here) is our built-in TCP probe, which is clearly inadequate to detect this. If you add an HTTP GET probe that issues a malformed request, it would never become ready in the first place, IIUC.

I think that there are two scenarios worth spelling out:

  1. Occasional "query of death", which we will have a very hard time protecting against.
  2. All "real" traffic is a "query of death", which we might be able to protect against post-facto.

I think that if we can manage 2., then bad changes rolled out via latestRevision: true (with or without the new rollout work @vagababov has been doing) could theoretically be rolled back, which would be a nice protection to offer folks, but we need to work out the details very carefully.

@mattmoor
Member

mattmoor commented Jan 6, 2021

cc @dprotaso

@dprotaso
Member

dprotaso commented Jan 6, 2021

@evankanderson
Member

/area API

Is there more that needs to be done here? @dprotaso , can you put a priority/hint here?

@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label Mar 19, 2021
@evankanderson
Member

/triage needs-user-input

@knative-prow-robot knative-prow-robot added the triage/needs-user-input Issues which are waiting on a response from the reporter label Mar 19, 2021
@Shashankft9
Member Author

Hey, I just noticed the label, is there any more info I could provide?

@dprotaso
Member

I think Evan was looking for input from me. I'll surface what I mentioned in Slack.

We can't automatically distinguish when a runtime failure should gate readiness of the revision. We need hints from the user via readiness/liveness probes.
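Such a hint could look like the following sketch, extending the Service spec from the reproduction steps (the /healthz path is a hypothetical endpoint the workload would have to expose; Knative infers the probe port from the container port):

```yaml
spec:
  template:
    spec:
      containers:
        - image: xxx
          readinessProbe:
            httpGet:
              path: /healthz   # hypothetical health endpoint in the workload
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
            periodSeconds: 10
```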

Matt's scenario 2), in my opinion, should be managed by some higher-level continuous delivery tool. That would capture not only runtime errors (counting HTTP status codes) but also performance regressions (changes in request latency).

I'm going to close this issue out and recommend looking at Spinnaker and other CD tools.

@dprotaso
Member

I made a follow-up issue to confirm that, if we have the right hints (readiness/liveness probes failing), we behave properly.

#11531
