Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thanos: Use wget for healthcheck #594

Merged

Conversation

Reamer
Copy link
Contributor

@Reamer Reamer commented Dec 19, 2019

Use wget for healthcheck, because curl is not available in official thanos image (quay.io/thanos/thanos:v0.9.0)

@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 19, 2019
@openshift-ci-robot
Copy link
Contributor

Hi @Reamer. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 19, 2019
@lilic
Copy link
Contributor

lilic commented Dec 19, 2019

@Reamer hello, thanks for the PR! How did you encounter this problem, are you a customer of openshift or just tried it out as I don't see in the openshift dev team? :)

@Reamer
Copy link
Contributor Author

Reamer commented Dec 20, 2019

Hi @lilic,
I'm part of the okd working group and checked some version of the okd 4.3 preview release. Done some experiments with updating components of okd and hit this issue.
oc adm release new --from-image=.... --to-image=.. thanos=quay.io/thanos/thanos:v0.9.0

@lilic
Copy link
Contributor

lilic commented Dec 20, 2019

/ok-to-test

@openshift-ci-robot openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 20, 2019
@lilic
Copy link
Contributor

lilic commented Dec 20, 2019

@Reamer Thanks for the clarification, that makes more sense now, as we didn't see any problems. Letting the tests run if no problems will merge, thanks for the PR! :)

@lilic
Copy link
Contributor

lilic commented Dec 20, 2019

/retest

@Reamer
Copy link
Contributor Author

Reamer commented Jan 6, 2020

Hi @lilic,
if anything with my code blocks the merge, please notify me.

name: thanos-querier
readinessProbe:
exec:
command:
- sh
- -c
- curl http://localhost:9090/-/healthy
- wget -O /dev/null http://localhost:9090/-/healthy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't work for me when testing in the container straight away, maybe thats why the tests are failing as well:


sh-4.2$ wget -O /dev/null http://localhost:9090/-/healthy
--2019-12-20 10:05:09--  http://localhost:9090/-/healthy
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:9090... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:9090... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23 [text/plain]
Saving to: '/dev/null'

100%[==================================================================================================================================================================>] 23          --.-K/s   in 0s

2019-12-20 10:05:09 (2.95 MB/s) - '/dev/null' saved [23/23]

Maybe we can change it to address directly:

sh-4.2$ wget -O /dev/null http://127.0.0.1:9090/-/healthy
--2019-12-20 10:04:57--  http://127.0.0.1:9090/-/healthy
Connecting to 127.0.0.1:9090... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23 [text/plain]
Saving to: '/dev/null'

100%[==================================================================================================================================================================>] 23          --.-K/s   in 0s

2019-12-20 10:04:57 (2.51 MB/s) - '/dev/null' saved [23/23]

cc @s-urbaniak I believe you worked on this part.

Copy link
Contributor

@s-urbaniak s-urbaniak Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should reference ipv4 addresses. I think wget first tries the ipv6 localhost address, fails, then retries the ipv4 address, which succeeds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

localhost is an exception and in this case it shouldn't matter if we use 127.0.0.1 or localhost, both should work the same.

FYI: we are using 127.0.0.1 in many places, for example in node-exporter daemonset.

@lilic
Copy link
Contributor

lilic commented Jan 6, 2020

Sorry about that I forgot to click submit on the review right before vacation. :)

@s-urbaniak
Copy link
Contributor

looking at the e2e-aws-operator test this simply seemed to have timed out. I don't see any errors in the logs and events.

@s-urbaniak
Copy link
Contributor

/retest

@Reamer
Copy link
Contributor Author

Reamer commented Jan 6, 2020

@lilic I didn't test with IPv6 network stack. After adding IPv6 to my docker daemon. I tried starting thanos query with --http-address=localhost:9090 to bind on both interfaces and to be equal with wget or curl-healthcheck. Unfortunately thanos didn't start on both interfaces

[root@575e34403f53 /]# ifconfig 
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
...

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 8  bytes 480 (480.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 480 (480.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@575e34403f53 /]# netstat -tulpen
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name    
tcp        0      0 127.0.0.1:10901         0.0.0.0:*               LISTEN      99         203913     -                   
tcp        0      0 127.0.0.1:9090          0.0.0.0:*               LISTEN      99         203907     -   

I can reproduce your output, but the exit code of wget is still 0. I think a change from localhost to 127.0.0.1 will makes no difference, except for the dns query. localhost should support all network stacks (ipv4 only, ipv6 only and dual stack)

[root@575e34403f53 /]# wget -O /dev/null http://localhost:9090/-/healthy
--2020-01-06 10:21:56--  http://localhost:9090/-/healthy
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:9090... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:9090... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23 [text/plain]
Saving to: '/dev/null'

100%[=========================================================================================================================================================================>] 23          --.-K/s   in 0s      

2020-01-06 10:21:56 (1.86 MB/s) - '/dev/null' saved [23/23]

[root@575e34403f53 /]# echo $?
0

@s-urbaniak
Copy link
Contributor

Thanos won't listen on both ip addresses as golang itself won't bind to both ip addresses as per golang/go#9334.

@paulfantom
Copy link
Contributor

I think if we want to have a proper healthcheck with wget, we should use something like:

wget --quiet --tries=1 --spider http://localhost:9090/-/healthy

Options passed to wget will reduce noise, try only once, and not download content.

@Reamer
Copy link
Contributor Author

Reamer commented Jan 6, 2020

Confirm partial with @paulfantom. We should change to

wget --quiet --tries=1 --spider http://localhost:9090/-/healthy

I'll update my PR.
Disagree with 127.0.0.1 and localhost. We should not rely on a specific network stack. Yes I know 127.0.0.1 is used in many places.

@metalmatze
Copy link
Contributor

You might want to reuse the work we already did in the Prometheus Operator.
That should work if either wget or curl is available.

https://github.com/coreos/prometheus-operator/blob/82d4fdcdc4067869c197910cb83d3ef10307d6e2/pkg/prometheus/statefulset.go#L619

@Reamer
Copy link
Contributor Author

Reamer commented Jan 6, 2020

Hi @metalmatze
thanks for sharing this information. I can adopt this solution, but before I have a question. Why don't you use --tries=1 --spider? I think for the shell command command you need at least a bash.

@paulfantom
Copy link
Contributor

I think for the shell command command you need at least a bash

Yes and no. command is a bash built-in and it is not included in "standard" posix shell. However I haven't found yet a system with posix shell in which command wouldn't be implemented (even busybox has it). In fact command is often recommended when writing portable shell scripts.

Why don't you use --tries=1 --spider?

That's a good question.

Use curl or wget for healthcheck. wget is needed because curl is not available in official thanos image (quay.io/thanos/thanos:v0.9.0)
@Reamer
Copy link
Contributor Author

Reamer commented Jan 7, 2020

Changed PR to curl and wget. I hope that the multi line script doesn't make issues.

@lilic
Copy link
Contributor

lilic commented Jan 14, 2020

/test e2e-aws-upgrade

@lilic
Copy link
Contributor

lilic commented Jan 22, 2020

@paulfantom do you mind reviewing this, thanks!

@paulfantom
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: paulfantom, Reamer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2020
@paulfantom
Copy link
Contributor

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 2498123 into openshift:master Jan 23, 2020
@Reamer Reamer deleted the thanos_wget_healthcheck branch January 23, 2020 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants