
Possible race condition while updating khstate #858

Closed
adriananeci opened this issue Feb 26, 2021 · 13 comments
Labels: bug (Something isn't working), Stale

@adriananeci (Contributor)

Describe the bug
Possible race condition while updating khstate.
Most likely, a check failed a few days ago with Reporting took too long to complete and timed out. and has never been scheduled again since then.
From kuberhealthy logs:

[kuberhealthy-9f49c5bf9-dg5vl kuberhealthy] time="2021-02-26T12:13:58Z" level=info msg="ee01135b-6584-44a0-a996-9cb131eb80d8 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
[kuberhealthy-9f49c5bf9-dg5vl kuberhealthy] time="2021-02-26T12:14:03Z" level=info msg="ee01135b-6584-44a0-a996-9cb131eb80d8 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
[kuberhealthy-9f49c5bf9-dg5vl kuberhealthy] time="2021-02-26T12:14:08Z" level=info msg="ee01135b-6584-44a0-a996-9cb131eb80d8 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
[kuberhealthy-9f49c5bf9-dg5vl kuberhealthy] time="2021-02-26T12:14:09Z" level=debug msg="state reflector store item from listing: 6 {\n\t\"metadata\": {\n\t\t\"name\": \"e2e-k8sapi\",\n\t\t\"namespace\": \"e2e-test\",\n\t\t\"selfLink\": \"/apis/comcast.github.io/v1/namespaces/e2e-test/khstates/e2e-k8sapi\",\n\t\t\"uid\": \"ee12093d-3364-467a-bf7f-f4f9cca1b798\",\n\t\t\"resourceVersion\": \"10661588036\",\n\t\t\"generation\": 43298,\n\t\t\"creationTimestamp\": \"2021-02-15T06:38:43Z\",\n\t\t\"managedFields\": [\n\t\t\t{\n\t\t\t\t\"manager\": \"kuberhealthy\",\n\t\t\t\t\"operation\": \"Update\",\n\t\t\t\t\"apiVersion\": \"comcast.github.io/v1\",\n\t\t\t\t\"time\": \"2021-02-15T06:38:43Z\",\n\t\t\t\t\"fieldsType\": \"FieldsV1\",\n\t\t\t\t\"fieldsV1\": {\n\t\t\t\t\t\"f:spec\": {\n\t\t\t\t\t\t\".\": {},\n\t\t\t\t\t\t\"f:AuthoritativePod\": {},\n\t\t\t\t\t\t\"f:Errors\": {},\n\t\t\t\t\t\t\"f:LastRun\": {},\n\t\t\t\t\t\t\"f:Namespace\": {},\n\t\t\t\t\t\t\"f:OK\": {},\n\t\t\t\t\t\t\"f:RunDuration\": {},\n\t\t\t\t\t\t\"f:uuid\": {}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t]\n\t},\n\t\"spec\": {\n\t\t\"OK\": false,\n\t\t\"Errors\": [\n\t\t\t\"Reporting took too long to complete and timed out.\"\n\t\t],\n\t\t\"RunDuration\": \"\",\n\t\t\"Namespace\": \"e2e-test\",\n\t\t\"LastRun\": \"2021-02-25T07:30:46.027218995Z\",\n\t\t\"AuthoritativePod\": \"kuberhealthy-9f49c5bf9-dg5vl\",\n\t\t\"uuid\": \"34d112aa-baf6-40b5-aded-0c0810d7f322\"\n\t}\n}"
[kuberhealthy-9f49c5bf9-dg5vl kuberhealthy] time="2021-02-26T12:14:09Z" level=debug msg="Getting status of check for web request to status page: e2e-k8sapi e2e-test"
[kuberhealthy-9f49c5bf9-dg5vl kuberhealthy] time="2021-02-26T12:14:09Z" level=debug msg="determineKHWorkload: determining workload: e2e-k8sapi"
[kuberhealthy-9f49c5bf9-dg5vl kuberhealthy] time="2021-02-26T12:14:09Z" level=debug msg="determineKHWorkload: Found khcheck: e2e-k8sapi"
[kuberhealthy-9f49c5bf9-dg5vl kuberhealthy] time="2021-02-26T12:14:13Z" level=info msg="ee01135b-6584-44a0-a996-9cb131eb80d8 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"

Steps To Reproduce

  • I don't have solid reproduction steps at the moment.

Expected behavior

If a check fails, I would like kuberhealthy to be able to schedule a new check run and update the khstate accordingly.

Versions

  • Cluster OS: Flatcar
  • Kubernetes Version: 1.18.14
  • Kuberhealthy Release or build: v2.4.0

Additional context
khstate status:

# kubectl describe khstates.comcast.github.io e2e-k8sapi
Name:         e2e-k8sapi
Namespace:    e2e-test
Labels:       <none>
Annotations:  <none>
API Version:  comcast.github.io/v1
Kind:         KuberhealthyState
Metadata:
  Creation Timestamp:  2021-02-15T06:38:43Z
  Generation:          43298
  Managed Fields:
    API Version:  comcast.github.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:AuthoritativePod:
        f:Errors:
        f:LastRun:
        f:Namespace:
        f:OK:
        f:RunDuration:
        f:uuid:
    Manager:         kuberhealthy
    Operation:       Update
    Time:            2021-02-15T06:38:43Z
  Resource Version:  10661588036
  Self Link:         /apis/comcast.github.io/v1/namespaces/e2e-test/khstates/e2e-k8sapi
  UID:               ee12093d-3364-467a-bf7f-f4f9cca1b798
Spec:
  Authoritative Pod:  kuberhealthy-9f49c5bf9-dg5vl
  Errors:
    Reporting took too long to complete and timed out.
  Last Run:      2021-02-25T07:30:46.027218995Z
  Namespace:     e2e-test
  OK:            false
  Run Duration:
  Uuid:          34d112aa-baf6-40b5-aded-0c0810d7f322
Events:          <none>

The workaround was to remove the khstate and let kuberhealthy recreate it.

adriananeci added the bug (Something isn't working) label on Feb 26, 2021
joshulyne self-assigned this on Mar 10, 2021
@joshulyne (Collaborator) commented Mar 11, 2021

@adriananeci sorry for the late response! The race condition that you see in your logs is expected, hence the retry. If your check is stuck in that loop, that's an issue. But it seems like the uuid in those retry messages (ee01135b-6584-44a0-a996-9cb131eb80d8) isn't the same as the one for the check pod that reported the error Reporting took too long to complete and timed out.

I'm pretty sure the error message Reporting took too long to complete and timed out isn't from Kuberhealthy. Usually when a checker pod fails to report in, the error message from Kuberhealthy is timed out waiting for checker pod to report in. Is this an error reported from your own check container? If that's the case, then it seems like your check container is able to report to Kuberhealthy -- contrary to that error message. In your code, how do you try to report to Kuberhealthy? Do you use our checkClient package kuberhealthy/pkg/checks/external/checkclient?

You mentioned that you can't repro this, which makes it a little more difficult for us, but I do think it's a big concern if the check isn't getting rescheduled / is stuck... The next time this happens, it'd be great to see more kuberhealthy logs to learn more about where it's failing to reschedule. Usually if you grep for the khcheck name, i.e. e2e-test/e2e-k8sapi, you should be able to see the lifecycle of that khcheck and where it's at.

@adriananeci (Contributor, Author)

Hey @joshulyne,

Reporting took too long to complete and timed out is an internal reporting message related to the check itself and not to kuberhealthy. Maybe the message is a bit confusing, but it is unrelated to kuberhealthy status reporting.
Yes, we are using the checkClient package, calling its ReportSuccess() or ReportFailure([]string{errorMsg}) methods.
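
For reference, the reporting flow in our check container looks roughly like this (a simplified sketch; runCheck stands in for our actual e2e test logic, and the checkclient import path may differ between Kuberhealthy module versions):

package main

import (
	"log"

	// import path used by our module; adjust for your Kuberhealthy version
	"github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient"
)

func main() {
	// run the actual end-to-end test
	if err := runCheck(); err != nil {
		// report the failure back to kuberhealthy together with the error message
		if rerr := checkclient.ReportFailure([]string{err.Error()}); rerr != nil {
			log.Fatalln("failed to report failure to kuberhealthy:", rerr)
		}
		return
	}
	// report success back to kuberhealthy
	if err := checkclient.ReportSuccess(); err != nil {
		log.Fatalln("failed to report success to kuberhealthy:", err)
	}
}

// runCheck is a placeholder for the actual e2e-k8sapi check logic.
func runCheck() error { return nil }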

We hit the same issue again today. Another workaround was to do a rolling restart of the kuberhealthy deployment.

k get khstates.comcast.github.io | egrep "e2e-k8sapi|NAME"
NAME                               OK      AGE LASTRUN   AGE
e2e-k8sapi                         false   8d            8d

Looking at the kuberhealthy logs, if my assumption is correct, at one point in time two kuberhealthy instances simultaneously tried to update the khstate CRD:

  • kuberhealthy-565b4b4d46-7hsdc using UUID: 41075ba1-ab29-4db9-a231-14124a475a32
    and
  • kuberhealthy-565b4b4d46-qf89s using UUID: a9d17db3-3672-4e88-bda9-1966e41cf61a
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:16Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:11Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:06Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:01Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:58Z" level=info msg="web: 29012fe9-b6a4-4a1f-9e94-6b35a3be58f3 (e2e-test/e2e-k8sapi) Request completed successfully."
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:58Z" level=info msg="web: 29012fe9-b6a4-4a1f-9e94-6b35a3be58f3 (e2e-test/e2e-k8sapi) Setting check with name e2e-k8sapi in namespace e2e-test to 'OK' state: false uuid a9d17db3-3672-4e88-bda9-1966e41cf61a KHCheck"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:56Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [aborting wait for external checker pod to report in due to context cancellation]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:56Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [waiting for external checker pod to report in...]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:56Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:56Z" level=info msg="web: 742e2f11-f207-4746-86f3-09d8db287a06 (e2e-test/e2e-k8sapi) Request completed successfully."
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:56Z" level=info msg="web: 742e2f11-f207-4746-86f3-09d8db287a06 (e2e-test/e2e-k8sapi) Setting check with name e2e-k8sapi in namespace e2e-test to 'OK' state: true uuid a9d17db3-3672-4e88-bda9-1966e41cf61a KHCheck"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:56Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Updating khstate e2e-k8sapi e2e-test to setUUID: 41075ba1-ab29-4db9-a231-14124a475a32]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Setting expected UUID to: 41075ba1-ab29-4db9-a231-14124a475a32]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=debug msg="e2e-test e2e-k8sapi writing khstate with ok: false and errors: [Check execution error: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in] at last run: 2021-03-03 10:10:55.722680012 +0000 UTC m=+58264.275673655"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:54Z" level=debug msg="Setting execution state of check e2e-k8sapi to false [Check execution error: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in] a9d17db3-3672-4e88-bda9-1966e41cf61a KHCheck"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=error msg="Error running check: e2e-k8sapi in namespace e2e-test: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [Error with running external check: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [evicting pod e2e-k8sapi-1614766193 from namespace e2e-test]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod: e2e-k8sapi-1614766193 is in status: Running]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [finding pods that are not in status.phase=Failed or status.phase=Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod: e2e-k8sapi-1614766132 is in status: Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [finding pods that are not in status.phase=Failed or status.phase=Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod: e2e-k8sapi-1614766072 is in status: Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [finding pods that are not in status.phase=Failed or status.phase=Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [wait for deleted event watcher has closed]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod shutdown monitor stopping gracefully]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [eviction: looking for pods with the label kuberhealthy-check-name = e2e-k8sapi]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [Evicting up any running pods with name e2e-k8sapi-1614766193]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [timed out waiting for checker pod to report in]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [timed out waiting for pod status to be reported]"

After this race condition, the only logs related to this check were:

kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:16Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"

@joshulyne (Collaborator) commented Mar 12, 2021

@adriananeci thanks for the quick response!

Among the kuberhealthy instances, only one of them can modify a khstate resource at a time; this is generally the kuberhealthy master. A master calculation runs constantly in the background to ensure that it is the only instance that can modify resources. If two kuberhealthy instances were modifying a khstate resource at the same time, you would usually see in the logs that one of them stopped being master, e.g. control: Lost master. Stopping checks..

Looking at the logs, it seems like the khcheck e2e-k8sapi started a new run before the previous run had finished updating the khstate.

I can see it trying to do so here at the end of the first run (check uuid: a9d17db3-3672-4e88-bda9-1966e41cf61a):

kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=debug msg="e2e-test e2e-k8sapi writing khstate with ok: false and errors: [Check execution error: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in] at last run: 2021-03-03 10:10:55.722680012 +0000 UTC m=+58264.275673655"

If that update had been successful, you would see a log line like: Successfully updated CRD for check: ...

However, at the same time a new run of the same khcheck is starting (check uuid: 41075ba1-ab29-4db9-a231-14124a475a32), as shown by this log line:

kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Setting expected UUID to: 41075ba1-ab29-4db9-a231-14124a475a32]"

Setting the new check UUID is the first thing that happens during a new check run, and it requires an update of the khstate. I THINK that's where the race condition is happening: the previous khcheck run and the new run are trying to update the same khstate resource.
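
In rough terms, the collision looks like this (an illustrative sketch only, not the actual Kuberhealthy code; the KHState type and updateKHState call are stand-ins):

package khstate

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// KHState is a stand-in for the KuberhealthyState custom resource.
type KHState struct {
	ResourceVersion string
	OK              bool
	UUID            string
}

// updateKHState is a stand-in for the real CRD client update call.
func updateKHState(s *KHState) error { return nil }

// raceSketch: both runs hold copies of the same khstate fetched a moment
// earlier, carrying the same resourceVersion.
func raceSketch(finishingRunCopy, startingRunCopy *KHState) {
	// The finishing run writes its final result first; the apiserver accepts
	// the write and bumps the object's resourceVersion.
	finishingRunCopy.OK = false
	_ = updateKHState(finishingRunCopy)

	// The starting run then writes the new run UUID, but its copy still carries
	// the old resourceVersion, so the apiserver rejects the write with a
	// 409 Conflict. That is what surfaces as "Failed to write new UUID for
	// check because object was modified by another process".
	startingRunCopy.UUID = "new-run-uuid"
	if err := updateKHState(startingRunCopy); apierrors.IsConflict(err) {
		// kuberhealthy logs the retry message and tries again in 5s
	}
}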

What are you setting the runInterval and timeout to for this check? That would be helpful to know. Let me know if this makes sense and matches what you're observing; I'll verify on my side to see whether this is a possible scenario.

@joshulyne (Collaborator) commented Mar 12, 2021

So I tried to replicate the scenario -- creating a bad khcheck that fails to report back to kuberhealthy within its timeout, and setting the runInterval low so that the check runs again immediately after. This is what I see in the logs:

time="2021-03-12T03:01:41Z" level=info msg="763e8f5e-f4e2-4312-977b-fab73f4fac8d kuberhealthy/kh-test-check: [Error with running external check: kuberhealthy/kh-test-check: timed out waiting for checker pod to report in]"
time="2021-03-12T03:01:41Z" level=error msg="Error running check: kh-test-check in namespace kuberhealthy: kuberhealthy/kh-test-check: timed out waiting for checker pod to report in"
time="2021-03-12T03:01:41Z" level=debug msg="Checking existence of custom resource: kh-test-check"
time="2021-03-12T03:01:41Z" level=debug msg="khstate custom resource found: kh-test-check"
time="2021-03-12T03:01:41Z" level=debug msg="Retrieving khstate custom resource for: kh-test-check"
time="2021-03-12T03:01:41Z" level=debug msg="Successfully retrieved khstate resource: kh-test-check"
time="2021-03-12T03:01:41Z" level=debug msg="Setting execution state of check kh-test-check to false [Check execution error: kuberhealthy/kh-test-check: timed out waiting for checker pod to report in] 763e8f5e-f4e2-4312-977b-fab73f4fac8d KHCheck"
time="2021-03-12T03:01:41Z" level=debug msg="Checking existence of custom resource: kh-test-check"
time="2021-03-12T03:01:41Z" level=debug msg="khstate custom resource found: kh-test-check"
time="2021-03-12T03:01:41Z" level=debug msg="kuberhealthy kh-test-check writing khstate with ok: false and errors: [Check execution error: kuberhealthy/kh-test-check: timed out waiting for checker pod to report in] at last run: 2021-03-12 03:01:41.2833018 +0000 UTC m=+3929.822073101"
time="2021-03-12T03:01:41Z" level=debug msg="Successfully updated CRD for check: kh-test-check in namespace kuberhealthy"
time="2021-03-12T03:01:41Z" level=info msg="Running check: kh-test-check"
time="2021-03-12T03:01:41Z" level=info msg="c56ec199-4884-4e8e-a531-366bf0e4698a kuberhealthy/kh-test-check: [Setting expected UUID to: c56ec199-4884-4e8e-a531-366bf0e4698a]"
time="2021-03-12T03:01:41Z" level=info msg="c56ec199-4884-4e8e-a531-366bf0e4698a kuberhealthy/kh-test-check: [Updating khstate kh-test-check kuberhealthy to setUUID: c56ec199-4884-4e8e-a531-366bf0e4698a]"

The first khcheck run (uuid: 763e8f5e-f4e2-4312-977b-fab73f4fac8d) times out, then kuberhealthy goes to update the khstate with this error. Only once it has successfully updated the khstate does it proceed to the next check run (uuid: c56ec199-4884-4e8e-a531-366bf0e4698a)... Wondering if you see any of this in your logs?

@adriananeci (Contributor, Author) commented Mar 12, 2021

Hey @joshulyne

Thanks for your reply. The check run interval and timeout are configured as:

runInterval: 1m
timeout: 1m

Looking deeper into the logs, I can see that both kuberhealthy replicas interacted with the same check pod (e2e-k8sapi-1614766193 with IP 172.16.138.42 and uuid a9d17db3-3672-4e88-bda9-1966e41cf61a).

Based on the logs below, the master pod (kuberhealthy-565b4b4d46-7hsdc) was watching the pod, but at some point it gave up with Error running check: e2e-k8sapi in namespace e2e-test: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in.

One second later, the same pod successfully connected to the slave kuberhealthy instance (kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=info msg="web: 29012fe9-b6a4-4a1f-9e94-6b35a3be58f3 Client connected to check report handler from 172.16.138.42:50208 Go-http-client/1.1") and reported its status.

Later on, the master succeeded in updating the khState (kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=debug msg="Successfully updated CRD for check: e2e-k8sapi in namespace e2e-test").

Another interesting thing is that both kuberhealthy replicas updated the khState for the same uuid between time="2021-03-03T10:10:57Z" and time="2021-03-03T10:11:07Z", judging by the AuthoritativePod and uuid fields shown in the debug state reflector store item from listing messages.

kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:11Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:11:08Z" level=debug msg="determineKHWorkload: Found khcheck: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:11:08Z" level=debug msg="determineKHWorkload: determining workload: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:11:08Z" level=debug msg="Getting status of check for web request to status page: e2e-k8sapi e2e-test"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:11:08Z" level=debug msg="state reflector store item from listing: 13 {
	"metadata": {
		"name": "e2e-k8sapi",
		"namespace": "e2e-test",
		"selfLink": "/apis/comcast.github.io/v1/namespaces/e2e-test/khstates/e2e-k8sapi",
		"uid": "7020bc55-6073-490d-8b9c-f7f7a35d2b93",
		"resourceVersion": "974456413",
		"generation": 3234,
		"creationTimestamp": "2021-03-02T16:05:22Z",
		"managedFields": [
			{
				"manager": "kuberhealthy",
				"operation": "Update",
				"apiVersion": "comcast.github.io/v1",
				"time": "2021-03-02T16:05:22Z",
				"fieldsType": "FieldsV1",
				"fieldsV1": {
					"f:spec": {
						".": {},
						"f:AuthoritativePod": {},
						"f:Errors": {},
						"f:LastRun": {},
						"f:Namespace": {},
						"f:OK": {},
						"f:RunDuration": {},
						"f:uuid": {}
					}
				}
			}
		]
	},
	"spec": {
		"OK": false,
		"Errors": [
			"Reporting took too long to complete and timed out."
		],
		"RunDuration": "15.340908574s",
		"Namespace": "e2e-test",
		"LastRun": "2021-03-03T10:10:58.749125658Z",
		"AuthoritativePod": "kuberhealthy-565b4b4d46-qf89s",
		"uuid": "a9d17db3-3672-4e88-bda9-1966e41cf61a"
	}
}""
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:07Z" level=debug msg="determineKHWorkload: Found khcheck: e2e-k8sapi"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:07Z" level=debug msg="determineKHWorkload: determining workload: e2e-k8sapi"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:07Z" level=debug msg="Getting status of check for web request to status page: e2e-k8sapi e2e-test"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:07Z" level=debug msg="state reflector store item from listing: 3 {
	"metadata": {
		"name": "e2e-k8sapi",
		"namespace": "e2e-test",
		"selfLink": "/apis/comcast.github.io/v1/namespaces/e2e-test/khstates/e2e-k8sapi",
		"uid": "7020bc55-6073-490d-8b9c-f7f7a35d2b93",
		"resourceVersion": "974456413",
		"generation": 3234,
		"creationTimestamp": "2021-03-02T16:05:22Z",
		"managedFields": [
			{
				"manager": "kuberhealthy",
				"operation": "Update",
				"apiVersion": "comcast.github.io/v1",
				"time": "2021-03-02T16:05:22Z",
				"fieldsType": "FieldsV1",
				"fieldsV1": {
					"f:spec": {
						".": {},
						"f:AuthoritativePod": {},
						"f:Errors": {},
						"f:LastRun": {},
						"f:Namespace": {},
						"f:OK": {},
						"f:RunDuration": {},
						"f:uuid": {}
					}
				}
			}
		]
	},
	"spec": {
		"OK": false,
		"Errors": [
			"Reporting took too long to complete and timed out."
		],
		"RunDuration": "15.340908574s",
		"Namespace": "e2e-test",
		"LastRun": "2021-03-03T10:10:58.749125658Z",
		"AuthoritativePod": "kuberhealthy-565b4b4d46-qf89s",
		"uuid": "a9d17db3-3672-4e88-bda9-1966e41cf61a"
	}
}""
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:06Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:11:01Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:58Z" level=info msg="web: 29012fe9-b6a4-4a1f-9e94-6b35a3be58f3 (e2e-test/e2e-k8sapi) Request completed successfully."
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:58Z" level=debug msg="Successfully updated CRD for check: e2e-k8sapi in namespace e2e-test"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:58Z" level=debug msg="e2e-test e2e-k8sapi writing khstate with ok: false and errors: [Reporting took too long to complete and timed out.] at last run: 2021-03-03 10:10:58.749125658 +0000 UTC m=+61444.661680946"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:58Z" level=debug msg="khstate custom resource found: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:58Z" level=debug msg="Checking existence of custom resource: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:58Z" level=info msg="web: 29012fe9-b6a4-4a1f-9e94-6b35a3be58f3 (e2e-test/e2e-k8sapi) Setting check with name e2e-k8sapi in namespace e2e-test to 'OK' state: false uuid a9d17db3-3672-4e88-bda9-1966e41cf61a KHCheck"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:57Z" level=debug msg="determineKHWorkload: Found khcheck: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:57Z" level=debug msg="determineKHWorkload: determining workload: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:57Z" level=debug msg="Getting status of check for web request to status page: e2e-k8sapi e2e-test"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:57Z" level=debug msg="state reflector store item from listing: 6 {
	"metadata": {
		"name": "e2e-k8sapi",
		"namespace": "e2e-test",
		"selfLink": "/apis/comcast.github.io/v1/namespaces/e2e-test/khstates/e2e-k8sapi",
		"uid": "7020bc55-6073-490d-8b9c-f7f7a35d2b93",
		"resourceVersion": "974453334",
		"generation": 3231,
		"creationTimestamp": "2021-03-02T16:05:22Z",
		"managedFields": [
			{
				"manager": "kuberhealthy",
				"operation": "Update",
				"apiVersion": "comcast.github.io/v1",
				"time": "2021-03-02T16:05:22Z",
				"fieldsType": "FieldsV1",
				"fieldsV1": {
					"f:spec": {
						".": {},
						"f:AuthoritativePod": {},
						"f:Errors": {},
						"f:LastRun": {},
						"f:Namespace": {},
						"f:OK": {},
						"f:RunDuration": {},
						"f:uuid": {}
					}
				}
			}
		]
	},
	"spec": {
		"OK": true,
		"Errors": [],
		"RunDuration": "15.340908574s",
		"Namespace": "e2e-test",
		"LastRun": "2021-03-03T10:09:16.877528688Z",
		"AuthoritativePod": "kuberhealthy-565b4b4d46-7hsdc",
		"uuid": "a9d17db3-3672-4e88-bda9-1966e41cf61a"
	}
}""
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:56Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [aborting wait for external checker pod to report in due to context cancellation]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:56Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [waiting for external checker pod to report in...]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:56Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Failed to write new UUID for check because object was modified by another process.  Retrying in 5s]"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:56Z" level=info msg="web: 742e2f11-f207-4746-86f3-09d8db287a06 (e2e-test/e2e-k8sapi) Request completed successfully."
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:56Z" level=debug msg="Successfully updated CRD for check: e2e-k8sapi in namespace e2e-test"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:56Z" level=debug msg="e2e-test e2e-k8sapi writing khstate with ok: true and errors: [] at last run: 2021-03-03 10:10:56.149483181 +0000 UTC m=+61442.062038369"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:56Z" level=debug msg="khstate custom resource found: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:56Z" level=debug msg="Checking existence of custom resource: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:56Z" level=info msg="web: 742e2f11-f207-4746-86f3-09d8db287a06 (e2e-test/e2e-k8sapi) Setting check with name e2e-k8sapi in namespace e2e-test to 'OK' state: true uuid a9d17db3-3672-4e88-bda9-1966e41cf61a KHCheck"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:56Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Updating khstate e2e-k8sapi e2e-test to setUUID: 41075ba1-ab29-4db9-a231-14124a475a32]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=info msg="41075ba1-ab29-4db9-a231-14124a475a32 e2e-test/e2e-k8sapi: [Setting expected UUID to: 41075ba1-ab29-4db9-a231-14124a475a32]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=info msg="Running check: e2e-k8sapi"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=debug msg="Successfully updated CRD for check: e2e-k8sapi in namespace e2e-test"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=debug msg="e2e-test e2e-k8sapi writing khstate with ok: false and errors: [Check execution error: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in] at last run: 2021-03-03 10:10:55.722680012 +0000 UTC m=+58264.275673655"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:55Z" level=debug msg="determineKHWorkload: Found khcheck: e2e-k8sapi"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:55Z" level=debug msg="khstate custom resource found: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:55Z" level=debug msg="determineKHWorkload: determining workload: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:55Z" level=info msg="web: 29012fe9-b6a4-4a1f-9e94-6b35a3be58f3 Calling pod is e2e-k8sapi in namespace e2e-test"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:55Z" level=debug msg="Found value on calling pod 172.16.138.42 value: KH_RUN_UUID a9d17db3-3672-4e88-bda9-1966e41cf61a"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:55Z" level=debug msg="Found value on calling pod 172.16.138.42 value: KH_RUN_UUID a9d17db3-3672-4e88-bda9-1966e41cf61a"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:55Z" level=debug msg="Found check named e2e-k8sapi in namespace e2e-test"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:54Z" level=debug msg="Checking existence of custom resource: e2e-k8sapi"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:54Z" level=debug msg="Setting execution state of check e2e-k8sapi to false [Check execution error: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in] a9d17db3-3672-4e88-bda9-1966e41cf61a KHCheck"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:54Z" level=debug msg="Successfully retrieved khstate resource: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="determineKHWorkload: Found khcheck: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="determineKHWorkload: determining workload: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="Getting status of check for web request to status page: e2e-k8sapi e2e-test"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="state reflector store item from listing: 3 {
	"metadata": {
		"name": "e2e-k8sapi",
		"namespace": "e2e-test",
		"selfLink": "/apis/comcast.github.io/v1/namespaces/e2e-test/khstates/e2e-k8sapi",
		"uid": "7020bc55-6073-490d-8b9c-f7f7a35d2b93",
		"resourceVersion": "974453334",
		"generation": 3231,
		"creationTimestamp": "2021-03-02T16:05:22Z",
		"managedFields": [
			{
				"manager": "kuberhealthy",
				"operation": "Update",
				"apiVersion": "comcast.github.io/v1",
				"time": "2021-03-02T16:05:22Z",
				"fieldsType": "FieldsV1",
				"fieldsV1": {
					"f:spec": {
						".": {},
						"f:AuthoritativePod": {},
						"f:Errors": {},
						"f:LastRun": {},
						"f:Namespace": {},
						"f:OK": {},
						"f:RunDuration": {},
						"f:uuid": {}
					}
				}
			}
		]
	},
	"spec": {
		"OK": true,
		"Errors": [],
		"RunDuration": "15.340908574s",
		"Namespace": "e2e-test",
		"LastRun": "2021-03-03T10:09:16.877528688Z",
		"AuthoritativePod": "kuberhealthy-565b4b4d46-7hsdc",
		"uuid": "a9d17db3-3672-4e88-bda9-1966e41cf61a"
	}
}""
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="determineKHWorkload: Found khcheck: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="determineKHWorkload: determining workload: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=info msg="web: 742e2f11-f207-4746-86f3-09d8db287a06 Calling pod is e2e-k8sapi in namespace e2e-test"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="Found value on calling pod 172.16.138.42 value: KH_RUN_UUID a9d17db3-3672-4e88-bda9-1966e41cf61a"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="Found value on calling pod 172.16.138.42 value: KH_RUN_UUID a9d17db3-3672-4e88-bda9-1966e41cf61a"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=debug msg="Found check named e2e-k8sapi in namespace e2e-test"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:54Z" level=debug msg="Retrieving khstate custom resource for: e2e-k8sapi"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:54Z" level=debug msg="khstate custom resource found: e2e-k8sapi"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=info msg="web: 29012fe9-b6a4-4a1f-9e94-6b35a3be58f3 validating external check status report from:  172.16.138.42:50208"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:54Z" level=info msg="web: 29012fe9-b6a4-4a1f-9e94-6b35a3be58f3 Client connected to check report handler from 172.16.138.42:50208 Go-http-client/1.1"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=debug msg="Checking existence of custom resource: e2e-k8sapi"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=error msg="Error running check: e2e-k8sapi in namespace e2e-test: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [Error with running external check: e2e-test/e2e-k8sapi: timed out waiting for checker pod to report in]"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:53Z" level=info msg="web: 742e2f11-f207-4746-86f3-09d8db287a06 validating external check status report from:  172.16.138.42:50150"
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:53Z" level=info msg="web: 742e2f11-f207-4746-86f3-09d8db287a06 Client connected to check report handler from 172.16.138.42:50150 Go-http-client/1.1"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [evicting pod e2e-k8sapi-1614766193 from namespace e2e-test]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod: e2e-k8sapi-1614766193 is in status: Running]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [finding pods that are not in status.phase=Failed or status.phase=Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod: e2e-k8sapi-1614766132 is in status: Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [finding pods that are not in status.phase=Failed or status.phase=Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod: e2e-k8sapi-1614766072 is in status: Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [finding pods that are not in status.phase=Failed or status.phase=Succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [wait for deleted event watcher has closed]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod shutdown monitor stopping gracefully]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [eviction: looking for pods with the label kuberhealthy-check-name = e2e-k8sapi]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [Evicting up any running pods with name e2e-k8sapi-1614766193]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [timed out waiting for checker pod to report in]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:53Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [timed out waiting for pod status to be reported]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [checker pod shutdown monitor saw a modified event. the pod changed to  Running]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [checker pod shutdown monitor saw a modified event.]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [got a result when watching for pod to remove]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [waiting for pod to report in to status page...]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [Waiting for pod status to be reported from pod e2e-k8sapi-1614766193 in namespace e2e-test]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [External check pod is running: e2e-k8sapi-1614766193]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod is now either running, failed, or succeeded]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [pod state is now: Running]"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:51Z" level=info msg="a9d17db3-3672-4e88-bda9-1966e41cf61a e2e-test/e2e-k8sapi: [got an event while waiting for pod to start running]"

During this timeframe, the only master was kuberhealthy-565b4b4d46-7hsdc based on the filtered logs:

kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:50Z" level=debug msg="Calculated master as kuberhealthy-565b4b4d46-7hsdc"	
kuberhealthy-565b4b4d46-qf89s | time="2021-03-03T10:10:25Z" level=debug msg="Calculated master as kuberhealthy-565b4b4d46-7hsdc"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:25Z" level=debug msg="Calculated master as kuberhealthy-565b4b4d46-7hsdc"
kuberhealthy-565b4b4d46-7hsdc | time="2021-03-03T10:10:20Z" level=debug msg="Calculated master as kuberhealthy-565b4b4d46-7hsdc"

Also, there are no Lost master logs in the 4 hours before or after the time="2021-03-03T10:10:00Z" timeframe.

@joshulyne (Collaborator)

Hey @adriananeci thanks for sharing more details about your check and the logs with us.

It is interesting to see the AuthoritativePod, i.e. the kuberhealthy master pod, differ from one khstate update to the next... Usually that field indicates which kuberhealthy pod updated that specific khstate resource -- but like you said, no apparent change in master is seen in the logs :/ Will dig into that further!

To give you an update, we did find a bug in the way we were updating khstate resources, specifically around the error log line you were seeing here:

Failed to write new UUID for check because object was modified by another process.  Retrying in 5s

The khcheck could easily have gotten stuck here, because we weren't re-fetching the khstate resource when we retried (every 5s) to update it -- we kept trying to modify a copy of the khstate that didn't have the latest modifications.
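
In other words, the retry needs to re-fetch the khstate before each attempt, roughly like the sketch below (not the exact code from the PR; fetchKHState and writeKHState are hypothetical placeholders, and client-go's RetryOnConflict is just one way to express the loop):

package khstatefix

import (
	"k8s.io/client-go/util/retry"
)

// KHState is a stand-in for the KuberhealthyState custom resource.
type KHState struct {
	ResourceVersion string
	UUID            string
}

// fetchKHState and writeKHState are placeholders for the real CRD client calls.
func fetchKHState(name, namespace string) (*KHState, error) { return &KHState{}, nil }
func writeKHState(name, namespace string, s *KHState) error { return nil }

// setUUID writes a new run UUID. RetryOnConflict re-runs the closure whenever
// the apiserver answers with a 409 Conflict, and because the closure re-fetches
// the khstate, every attempt carries the latest resourceVersion instead of the
// stale copy that caused the endless "Retrying in 5s" loop.
func setUUID(name, namespace, uuid string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		state, err := fetchKHState(name, namespace) // fresh copy on every attempt
		if err != nil {
			return err
		}
		state.UUID = uuid
		return writeKHState(name, namespace, state)
	})
}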

@integrii plans to create a PR for a fix for this bug. We'll let you know when that gets merged in -- thanks for reporting and digging into this with us!

integrii pushed a commit that referenced this issue Mar 29, 2021
joshulyne added a commit that referenced this issue Apr 7, 2021
@integrii (Collaborator) commented Apr 7, 2021

This may also be related to #880. Kuberhealthy was originally designed to be "okay" with two masters running at the same time, in order to avoid having some external source of truth or complex election communication between pods. We could also slow down the master election process to ensure that there is never an overlap of masters. For now, though, I want to keep trying the strategy of 'it's okay to run multiple masters for a period of time' to avoid all that extra complication in the codebase.

That said, I think this issue involves a goroutine getting infinitely stuck trying to update the object in etcd. Because of that stuck process, the prior master never fully demotes itself and essentially goes brain dead, endlessly retrying an object update that isn't allowed. The fix in PR #872 fixes that infinite loop. If we can also track down #880, then we should be back in a place where multiple masters running at the same time isn't an issue.

@joshulyne should be giving you an image that you can run to test the fix in #872 here shortly. It would be really helpful if you could switch your Kuberhealthy deployment image out to that version and see if it fixes your issue.

@joshulyne (Collaborator)

@adriananeci here's the latest image with the fix: kuberhealthy/kuberhealthy:v2.4.1

@adriananeci (Contributor, Author)

Deployed it on a test cluster and it seems to have fixed the race condition.
There are no Failed to write new UUID for check because object was modified entries in the logs, and all khstates.comcast.github.io resources look fine.

Is there any ETA for v2.4.1 official release?

@joshulyne (Collaborator)

Working on the v2.4.1 release for today!

@joshulyne (Collaborator)

@adriananeci v2.4.1 is out! thanks for testing this out and working with us on it!


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment on the issue or this will be closed in 15 days.

github-actions bot added the Stale label on Jan 30, 2024

This issue was closed because it has been stalled for 15 days with no activity. Please reopen and comment on the issue if you believe it should stay open.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 21, 2024