
💚 add debug output for e2e service and job failures #860

Merged
merged 1 commit on Aug 17, 2020

Conversation

mboersma
Contributor

@mboersma mboersma commented Aug 4, 2020

What this PR does / why we need it:

Prints a JSON dump of the service or job that failed a WaitFor func in e2e tests. If this seems like a reasonable approach, I'll follow with:

  • a similar change to WaitForDeploymentsAvailable in CAPI
  • printing relevant k8s events from the DescribeFailedX funcs

Which issue(s) this PR fixes:

Refs #807

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

💚 add debug output for e2e service and job failures

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 4, 2020
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 4, 2020
@k8s-ci-robot
Contributor

Hi @mboersma. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 4, 2020
@mboersma
Contributor Author

mboersma commented Aug 4, 2020

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test

Ok, ok, I'll open a kubernetes/org issue for sig membership now.

@nader-ziada
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 4, 2020
@CecileRobertMichon CecileRobertMichon added this to the v0.4.8 milestone Aug 4, 2020
@nader-ziada
Contributor

What are you expecting the output to look like? I tried to make the test fail and got a big blob of HTML.

@mboersma
Contributor Author

mboersma commented Aug 6, 2020

@nader-ziada my thought was to dump the Service or Job to stdout so we could at least examine what its status and selectors are. The next step would be to dump events related to the object that failed, but I planned to do that as a second change after something like this is in CAPI too for deployments.

I meant to prefix the JSON dump with at least Service:, so I'll add that for context. Maybe that's clear enough?

STEP: waiting for cluster capz-e2e-q4nyjd to be deleted


• Failure [2288.132 seconds]
Workload cluster creation
/Users/matt/Projects/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:36
  Creating highly available control-plane cluster
  /Users/matt/Projects/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:95
    With 3 control-plane nodes and 2 worker nodes [It]
    /Users/matt/Projects/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:96

    Timed out after 10.004s.
    Service default/ingress-nginx-ilb failed to get an IP for LoadBalancer.Ingress
    Service:
    {
      "metadata": {
        "name": "ingress-nginx-ilb",
        "namespace": "default",
        "selfLink": "/api/v1/namespaces/default/services/ingress-nginx-ilb",
        "uid": "d7cdfa43-525a-43b7-a53a-165ac558a6bc",
        "resourceVersion": "1106",
        "creationTimestamp": "2020-08-05T21:32:20Z",
        "annotations": {
          "service.beta.kubernetes.io/azure-load-balancer-internal": "true"
        },
        "finalizers": [
          "service.kubernetes.io/load-balancer-cleanup"
        ],
        "managedFields": [
          {
            "manager": "cluster-api-e2e",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2020-08-05T21:32:20Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:metadata": {
                "f:annotations": {
                  ".": {},
                  "f:service.beta.kubernetes.io/azure-load-balancer-internal": {}
                }
              },
              "f:spec": {
                "f:externalTrafficPolicy": {},
                "f:ports": {
                  ".": {},
                  "k:{\"port\":80,\"protocol\":\"TCP\"}": {
                    ".": {},
                    "f:name": {},
                    "f:port": {},
                    "f:protocol": {},
                    "f:targetPort": {}
                  },
                  "k:{\"port\":443,\"protocol\":\"TCP\"}": {
                    ".": {},
                    "f:name": {},
                    "f:port": {},
                    "f:protocol": {},
                    "f:targetPort": {}
                  }
                },
                "f:selector": {
                  ".": {},
                  "f:app": {}
                },
                "f:sessionAffinity": {},
                "f:type": {}
              }
            }
          },
          {
            "manager": "kube-controller-manager",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2020-08-05T21:32:20Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:metadata": {
                "f:finalizers": {
                  ".": {},
                  "v:\"service.kubernetes.io/load-balancer-cleanup\"": {}
                }
              }
            }
          }
        ]
      },
      "spec": {
        "ports": [
          {
            "name": "http",
            "protocol": "TCP",
            "port": 80,
            "targetPort": 80,
            "nodePort": 31535
          },
          {
            "name": "https",
            "protocol": "TCP",
            "port": 443,
            "targetPort": 443,
            "nodePort": 30891
          }
        ],
        "selector": {
          "app": "ingress-nginx"
        },
        "clusterIP": "10.106.132.19",
        "type": "LoadBalancer",
        "sessionAffinity": "None",
        "externalTrafficPolicy": "Cluster"
      },
      "status": {
        "loadBalancer": {}
      }
    }
    
    Expected
        <bool>: false
    to be true

    /Users/matt/Projects/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:98

    Full Stack Trace
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.WaitForServiceAvailable(0x2ae21e0, 0xc0000460f0, 0x2a9bd00, 0xc000984750, 0xc000b0d680, 0xc00098aa00, 0x2, 0x2)
    	/Users/matt/Projects/src/sigs.k8s.io
...

@mboersma
Contributor Author

mboersma commented Aug 7, 2020

I opened kubernetes-sigs/cluster-api#3468 to give Deployments the same treatment.

@mboersma
Contributor Author

kubernetes-sigs/cluster-api#3468 did get merged, so we should benefit from that the next time we vendor in CAPI.

@CecileRobertMichon
Contributor

do we still want this one @mboersma ?

@mboersma
Contributor Author

do we still want this one?

Yes, I think so. While Deployments are what we have seen failing in e2e runs, there's no reason a Service or Job couldn't fail similarly, in which case we would be glad to have this debug logging. Also, our CAPZ WaitFor[Service|Job] funcs are based on CAPI's e2e framework helper funcs, so I think we should keep the implementations similar.

Contributor

@CecileRobertMichon CecileRobertMichon left a comment


/lgtm
/assign @nader-ziada

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 13, 2020
@nader-ziada
Contributor

/approve

I agree this is helpful in debugging

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nader-ziada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2020
@mboersma
Contributor Author

/retest

2 similar comments
@mboersma
Contributor Author

/retest

@mboersma
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit 01aa442 into kubernetes-sigs:master Aug 17, 2020
@mboersma mboersma deleted the fix-e2e-adaptors branch August 17, 2020 17:35