Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Container exit message lost due to FallbackToLogsOnError is not compatible with ContainerCannotRun #81280

Merged
merged 1 commit into from
Oct 8, 2019

Conversation

yqwang-ms
Copy link
Contributor

@yqwang-ms yqwang-ms commented Aug 12, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
Bug: FallbackToLogsOnError is not compatiable with ContainerCannotRun.
If a container cannot even be started, such as due to docker run parameters incorrect (like below example, unknown linux capability), it certainly does not have logs, so in this case, we should not override the exit message got from dockerd (like below example, "unknown linux capability"), to be "Error on reading termination message from logs". Otherwise, the truely useful exit message will be lost.

Reproduce Steps:
Submit below Pod:

apiVersion: v1
kind: Pod
metadata:
  name: pe12
spec:
  restartPolicy: Never
  containers:
  - name: ubuntu
    image: ubuntu:trusty
    command: ["sh", "-c", "sleep 1 && exit 0"]
    securityContext:
      capabilities:
        add:
        - CAP_DAC_READ_SEARCHXXX
    terminationMessagePolicy: FallbackToLogsOnError

Pod failed without useful info:
We only have info "Error on reading termination message from logs"

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "pe12",
    "namespace": "default",
    "selfLink": "/api/v1/namespaces/default/pods/pe12",
    "uid": "8918753c-bcd6-11e9-bcf8-000d3af801a9",
    "resourceVersion": "20357386",
    "creationTimestamp": "2019-08-12T07:55:22Z"
  },
  "spec": {
    "volumes": [
      {
        "name": "default-token-kqww5",
        "secret": {
          "secretName": "default-token-kqww5",
          "defaultMode": 420
        }
      }
    ],
    "containers": [
      {
        "name": "ubuntu",
        "image": "ubuntu:trusty",
        "command": [
          "sh",
          "-c",
          "sleep 1 \u0026\u0026 exit 0"
        ],
        "resources": {
          
        },
        "volumeMounts": [
          {
            "name": "default-token-kqww5",
            "readOnly": true,
            "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
          }
        ],
        "terminationMessagePath": "/dev/termination-log",
        "terminationMessagePolicy": "FallbackToLogsOnError",
        "imagePullPolicy": "IfNotPresent",
        "securityContext": {
          "capabilities": {
            "add": [
              "CAP_DAC_READ_SEARCHXXX"
            ]
          },
          "procMount": "Default"
        }
      }
    ],
    "restartPolicy": "Never",
    "terminationGracePeriodSeconds": 30,
    "dnsPolicy": "ClusterFirst",
    "serviceAccountName": "default",
    "serviceAccount": "default",
    "nodeName": "minikube",
    "securityContext": {
      
    },
    "schedulerName": "default-scheduler",
    "tolerations": [
      {
        "key": "node.kubernetes.io/not-ready",
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": 300
      },
      {
        "key": "node.kubernetes.io/unreachable",
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": 300
      }
    ],
    "priority": 0,
    "enableServiceLinks": true
  },
  "status": {
    "phase": "Failed",
    "conditions": [
      {
        "type": "Initialized",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-08-12T07:55:22Z"
      },
      {
        "type": "Ready",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-08-12T07:55:22Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [ubuntu]"
      },
      {
        "type": "ContainersReady",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-08-12T07:55:22Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [ubuntu]"
      },
      {
        "type": "PodScheduled",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-08-12T07:55:22Z"
      }
    ],
    "hostIP": "10.169.8.232",
    "podIP": "172.17.0.6",
    "startTime": "2019-08-12T07:55:22Z",
    "containerStatuses": [
      {
        "name": "ubuntu",
        "state": {
          "terminated": {
            "exitCode": 128,
            "reason": "ContainerCannotRun",
            "message": "Error on reading termination message from logs: failed to open log file \"/var/log/pods/default_pe12_8918753c-bcd6-11e9-bcf8-000d3af801a9/ubuntu/0.log\": open /var/log/pods/default_pe12_8918753c-bcd6-11e9-bcf8-000d3af801a9/ubuntu/0.log: no such file or directory",
            "startedAt": "2019-08-12T07:55:23Z",
            "finishedAt": "2019-08-12T07:55:23Z",
            "containerID": "docker://b457ceac28cf1ea70b1699ada8c684e73b62f535bf4abf2f7ad08e5f567f51d5"
          }
        },
        "lastState": {
          
        },
        "ready": false,
        "restartCount": 0,
        "image": "ubuntu:trusty",
        "imageID": "docker-pullable://ubuntu@sha256:f961d3d101e66017fc6f0a63ecc0ff15d3e7b53b6a0ac500cd1619ded4771bd6",
        "containerID": "docker://b457ceac28cf1ea70b1699ada8c684e73b62f535bf4abf2f7ad08e5f567f51d5"
      }
    ],
    "qosClass": "BestEffort"
  }
}

If the FallbackToLogsOnError is removed, and submit the Pod again, then we can get useful info:
"Linux spec capabilities: Unknown capability to add"

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "pe4",
    "namespace": "default",
    "selfLink": "/api/v1/namespaces/default/pods/pe4",
    "uid": "3d52669f-bcc7-11e9-bcf8-000d3af801a9",
    "resourceVersion": "20348936",
    "creationTimestamp": "2019-08-12T06:05:52Z"
  },
  "spec": {
    "volumes": [
      {
        "name": "default-token-kqww5",
        "secret": {
          "secretName": "default-token-kqww5",
          "defaultMode": 420
        }
      }
    ],
    "containers": [
      {
        "name": "ubuntu",
        "image": "ubuntu:trusty",
        "command": [
          "sh",
          "-c",
          "sleep 1 \u0026\u0026 exit 0"
        ],
        "resources": {
          
        },
        "volumeMounts": [
          {
            "name": "default-token-kqww5",
            "readOnly": true,
            "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
          }
        ],
        "terminationMessagePath": "/dev/termination-log",
        "terminationMessagePolicy": "File",
        "imagePullPolicy": "IfNotPresent",
        "securityContext": {
          "capabilities": {
            "add": [
              "CAP_DAC_READ_SEARCHXXX"
            ]
          },
          "procMount": "Default"
        }
      }
    ],
    "restartPolicy": "Never",
    "terminationGracePeriodSeconds": 30,
    "dnsPolicy": "ClusterFirst",
    "serviceAccountName": "default",
    "serviceAccount": "default",
    "nodeName": "minikube",
    "securityContext": {
      
    },
    "schedulerName": "default-scheduler",
    "tolerations": [
      {
        "key": "node.kubernetes.io/not-ready",
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": 300
      },
      {
        "key": "node.kubernetes.io/unreachable",
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": 300
      }
    ],
    "priority": 0,
    "enableServiceLinks": true
  },
  "status": {
    "phase": "Failed",
    "conditions": [
      {
        "type": "Initialized",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-08-12T06:05:52Z"
      },
      {
        "type": "Ready",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-08-12T06:05:52Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [ubuntu]"
      },
      {
        "type": "ContainersReady",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-08-12T06:05:52Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [ubuntu]"
      },
      {
        "type": "PodScheduled",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-08-12T06:05:52Z"
      }
    ],
    "hostIP": "10.169.8.232",
    "podIP": "172.17.0.6",
    "startTime": "2019-08-12T06:05:52Z",
    "containerStatuses": [
      {
        "name": "ubuntu",
        "state": {
          "terminated": {
            "exitCode": 128,
            "reason": "ContainerCannotRun",
            "message": "linux spec capabilities: Unknown capability to add: \"CAP_CAP_DAC_READ_SEARCHXXX\"",
            "startedAt": "2019-08-12T06:05:53Z",
            "finishedAt": "2019-08-12T06:05:53Z",
            "containerID": "docker://a3375cbb0eb16f40f47624b394313128d92da847e528e16705c0fe0d034b6c61"
          }
        },
        "lastState": {
          
        },
        "ready": false,
        "restartCount": 0,
        "image": "ubuntu:trusty",
        "imageID": "docker-pullable://ubuntu@sha256:f961d3d101e66017fc6f0a63ecc0ff15d3e7b53b6a0ac500cd1619ded4771bd6",
        "containerID": "docker://a3375cbb0eb16f40f47624b394313128d92da847e528e16705c0fe0d034b6c61"
      }
    ],
    "qosClass": "BestEffort"
  }
}

Which issue(s) this PR fixes:

Fixes # NO See issue above.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

If container fails because ContainerCannotRun, do not utilize the FallbackToLogsOnError TerminationMessagePolicy, as it masks more useful logs.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 12, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @yqwang-ms. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 12, 2019
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 12, 2019
@yqwang-ms
Copy link
Contributor Author

/assign @dchen1107

@yqwang-ms
Copy link
Contributor Author

/assign @sjenning
/assign @derekwaynecarr

Could you please take a look at this small fix when you free? :)
Or could you please tell me what should I do next?
Thanks!

Copy link
Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your detailed pr - especially the detailed reproduction use case :)

This change makes sense to me :)
/lgtm

Could I trouble you to update your release note? I don't find ContainerCannotRun should not FallbackToLogsOnError to be particularly descriptive of the change you're making. What about something like If container fails because ContainerCannotRun, do not utilize the FallbackToLogsOnError TerminationMessagePolicy, as it masks more useful logs.

if len(tMessage) != 0 {
cStatus.Message = tMessage
if len(cStatus.Message) != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain a little more about this change? I'm not totally following how it relates to the overall change you are describing in your pr.

Copy link
Contributor Author

@yqwang-ms yqwang-ms Aug 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review.
You are right, strictly speaking, it should not belong to this PR, however, in more general sense, it can prevent future useful info being lost.

For example, kubelet failed to read the termination file which may due to a reason described in dockerd's container error message.

So the principle here is that, discard exit info ONLY if we can ensure it is useless or duplicated.

For this case, we cannot ensure that dockerd's container error message is useless given "failed to read the termination file" or even we successfully read the file (because we also donot know if the dockerd's container error message is the root cause of the just read content).

So, for simplicity, I just append the error message, instead of one replace another.

It is just my preference, and TBD, we can disscus on it. :)

(Release note also updated)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 12, 2019
@mattjmcnaughton
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 12, 2019
@yqwang-ms
Copy link
Contributor Author

/retest

@yqwang-ms
Copy link
Contributor Author

@sjenning @derekwaynecarr Could you please take a look at this small fix? :)

@yqwang-ms
Copy link
Contributor Author

Pinging @sjenning @derekwaynecarr , PTAL

@yqwang-ms
Copy link
Contributor Author

/unassign @dchen1107
/unassign @sjenning
/unassign @derekwaynecarr

/assign @Random-Liu
/assign @tallclair
/assign @yujuhong

Because no responses from bot suggested approvers, could you please kindly take a look at this small fix when you free? :)
This issues really cause the failed pod hard to debug.
Thanks!

@yqwang-ms
Copy link
Contributor Author

Pinging @Random-Liu and @tallclair , PTAL

if len(cStatus.Message) != 0 {
cStatus.Message += ": "
}
cStatus.Message += tMessage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for cStatus.Message to grow infinitely, or do we reset it each time this function is called?

Copy link
Contributor Author

@yqwang-ms yqwang-ms Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review.

I can confirm that cStatus.Message will NOT grow infinitely, given that getPodContainerStatuses is not stateful for the cStatus.Message (and it is by design should not be stateful), so that cStatus.Message will NOT accumulated.

We can prove this in details, please check:

  1. The background is that cStatus.Message will only be updated in 2 places:
    image

    Note the cStatus.Message is just string instead of *string

  2. cStatus.Message in line 421 can only come from above toKubeContainerStatus in the same getPodContainerStatuses func:

    cStatus := toKubeContainerStatus(status, m.runtimeName)

  3. In the getPodContainerStatuses, it will only init the cStatus.Message from runtime container message:

    cStatus.Message = status.Message

  4. The runtime container message can only come directly from the underlay runtime:

  5. The runtime container message will NOT from the cStatus.Message. So, the cStatus.Message will only be appended at most once by:

    if len(tMessage) != 0 {
    if len(cStatus.Message) != 0 {
    cStatus.Message += ": "
    }
    cStatus.Message += tMessage
    }

  6. So the cStatus.Message will NOT grow infinitely.

Overall, the cStatus.Message is composed from 2 parts, first part is the runtime container message, see steps 2-4, and second part is TerminationMessage from:

tMessage, checkLogs := getTerminationMessage(status, annotatedInfo.TerminationMessagePath, fallbackToLogs)

So, no more chars in the cStatus.Message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kindly ping @tallclair , PTAL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any comments for this? :)

@tallclair
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tallclair, yqwang-ms

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2019
@k8s-ci-robot k8s-ci-robot merged commit 4cff1c3 into kubernetes:master Oct 8, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 8, 2019
@yqwang-ms yqwang-ms deleted the yqwang/exitmsglost branch October 9, 2019 02:29
ohsewon pushed a commit to ohsewon/kubernetes that referenced this pull request Oct 16, 2019
Fix Container exit message lost due to FallbackToLogsOnError is not compatible with ContainerCannotRun
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants