Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nixy 0.12.1 timeout issues #29

Closed
vovinacci opened this issue Sep 5, 2017 · 5 comments
Closed

Nixy 0.12.1 timeout issues #29

vovinacci opened this issue Sep 5, 2017 · 5 comments

Comments

@vovinacci
Copy link

Hi,

I have nixy 0.12.1 + nginx on separate server and Mesos/Marathon cluster (3 masters + 3 slaves) on Digital Ocean in the same data center.
Nixy runs on Ubuntu 16.04.3 with 1 min load average <= 0.3 and total CPU load <= 25% at any time.
Monitoring is done with Monit.

Issues

There are two quite strange issues. Your help or advice on these would be very much appreciated.

  • 'endpoint is down' event in logs
-- Logs begin at Tue 2017-09-05 09:24:47 CEST, end at Tue 2017-09-05 10:57:36 CEST. --
Sep 05 09:26:22 mnixy sh[1547]: time="2017-09-05T09:26:22+02:00" level=error msg="endpoint is down" endpoint="http://master-2:8080" error="Get http://master-2:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 09:27:28 mnixy sh[1547]: time="2017-09-05T09:27:28+02:00" level=error msg="endpoint is down" endpoint="http://master-2:8080" error="Get http://master-2:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 09:28:00 mnixy sh[1547]: time="2017-09-05T09:28:00+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 09:31:42 mnixy sh[1547]: time="2017-09-05T09:31:42+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 09:42:28 mnixy sh[1547]: time="2017-09-05T09:42:28+02:00" level=error msg="endpoint is down" endpoint="http://master-3:8080" error="Get http://master-3:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 09:43:11 mnixy sh[1547]: time="2017-09-05T09:43:10+02:00" level=error msg="endpoint is down" endpoint="http://master-3:8080" error="Get http://master-3:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:05:38 mnixy sh[1547]: time="2017-09-05T10:05:38+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:09:38 mnixy sh[1547]: time="2017-09-05T10:09:38+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:12:34 mnixy sh[1547]: time="2017-09-05T10:12:34+02:00" level=error msg="endpoint is down" endpoint="http://master-3:8080" error="Get http://master-3:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:23:28 mnixy sh[1547]: time="2017-09-05T10:23:28+02:00" level=error msg="endpoint is down" endpoint="http://master-2:8080" error="Get http://master-2:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:24:44 mnixy sh[1547]: time="2017-09-05T10:24:44+02:00" level=error msg="endpoint is down" endpoint="http://master-2:8080" error="Get http://master-2:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:25:04 mnixy sh[1547]: time="2017-09-05T10:25:04+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:28:30 mnixy sh[1547]: time="2017-09-05T10:28:30+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:31:16 mnixy sh[1547]: time="2017-09-05T10:31:16+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:33:33 mnixy sh[1547]: time="2017-09-05T10:33:32+02:00" level=error msg="endpoint is down" endpoint="http://master-3:8080" error="Get http://master-3:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:45:32 mnixy sh[1547]: time="2017-09-05T10:45:32+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:45:53 mnixy sh[1547]: time="2017-09-05T10:45:53+02:00" level=info msg="marathon reload triggered" client="127.0.0.1:55782"
Sep 05 10:45:53 mnixy sh[1547]: time="2017-09-05T10:45:53+02:00" level=info msg="no config changes"
Sep 05 10:47:35 mnixy sh[1547]: time="2017-09-05T10:47:35+02:00" level=error msg="endpoint is down" endpoint="http://master-3:8080" error="Get http://master-3:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:53:02 mnixy sh[1547]: time="2017-09-05T10:53:02+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:53:10 mnixy sh[1547]: time="2017-09-05T10:53:10+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Sep 05 10:56:56 mnixy sh[1547]: time="2017-09-05T10:56:56+02:00" level=error msg="endpoint is down" endpoint="http://master-1:8080" error="Get http://master-1:8080/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"

However health check is just fine and no nodes been marked as un-healthy.
Taking a look on hosts metrics - all looks just good, no CPU spikes or network timeouts/errors.

  • 'Resource temporarily unavailable' when doing health check
    Monit polls /api/v1/health once a minute and gets this once in a while (2-3 times for an hour)
    This seems to be some kind of timeout, when nixy replies quite slow, but I wasn't able to catch it explicitly. This triggers a lot of false negatives.
    Monit config part for nixy (it's quite self explanatory):
check host nixy
  with address localhost
  if failed
    port 6600
    protocol HTTP
    request "/v1/health"
    with timeout 3 seconds
  then alert

Configurations

nixy.toml (comments stripped):

# Nixy listening port
port = "6600"
xproxy = ""

marathon = ["http://master-1:8080", "http://master-2:8080", "http://master-3:8080"]
user = ""
pass = ""
realm = ""

nginx_config = "/etc/nginx/nginx.conf"
nginx_template = "/opt/nixy/nginx.tmpl"
nginx_cmd = "/opt/nixy/generate_index_html.sh"
nginx_ignore_check = false

[statsd]
addr = "localhost:8125"
namespace = "nixy.mesos-test"
sample_rate = 100

/opt/nixy/generate_index_html.sh (wrapper to generate list of running services and reload nginx)

#!/usr/bin/env bash
#
# Wrapper script for nixy to pass all command line parameters to nginx (if any)
# and update static microservices list page on successfull nginx reload

# Keep environment clean
LC_ALL=C

# HTML index page path
readonly HTML="/opt/nixy/html/index.html"

# Fail on any non-zero status code
set -e

# Run nginx with all command line parameters passed (if any)
if [[ -n "$@" ]]; then
  nginx "$@"
fi

# Get website list from nginx as array, excluding 'localhost', 'list' and 'logentries'
SITE_LIST=($(nginx -T 2> /dev/null |grep -E -o "server_name\ (.*);$" |grep -vE "\ \_|\ localhost|\ list\.|\ logentries" |awk '{ print $2 }' |tr -d ';'))

# Overwrite index page with HTML header
cat > $HTML <<- EOM
<html>
  <head>
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-beta/css/bootstrap.min.css" integrity="sha384-/Y6pD6FV/Vv2HJnA6t+vslU6fwYXjCFtcEpHbNJ0lyAFsXTsjBbfaDjzALeQsN6M" crossorigin="anonymous">
    <script type="text/javascript" src="http://livejs.com/live.js"></script>
    <title>Available services</title>
  </head>
  <body>
    <table class="table">
      <ul>
EOM

# Append services URL
for elem in "${SITE_LIST[@]}"; do
  cat >> $HTML <<- EOM
        <tr>
          <td><a href='http://$elem' target='_blank'>$elem</a></td>
        </tr>
EOM
done

# Append HTML index page with footer
cat >> $HTML <<- EOM
      </ul>
    </table>
    <p class="small">
    Last updated: $(date)
    </p>
  </body>
</html>
EOM

# EOF

Nginx template is pretty much standard, so I won't list it here.

Thanks in advance for your help!

@martensson
Copy link
Owner

Hi @vovinacci!

Nixy is semi-aggressive when it comes to checking health on its endpoints, and it seems you are having some occasional long responses a bit randomly from the log. Sometimes with several minutes between each error.

Nixy has a timeout of 5s, so if it takes more than 5s to get a response from /ping it will see that endpoint as unhealthy and move one to the next one. In your case it seems the endpoints quite often are slow to give a response.

It would be interesting if you try to measure the avg response time from your nixy servers to your endpoints (/ping), and try to figure out where the problem is. You should normally receive a pong response pretty quick if things are normal, and definitely not above 5s 😃 .

@vovinacci
Copy link
Author

vovinacci commented Sep 6, 2017

Hi @martensson,

Thanks a lot for your reply, will try that. Indeed networking at DO is not their strongest point. :)

P.S. Won't close the issue for a while if you don't mind.

@martensson
Copy link
Owner

Not a problem, just make sure they are on the same private network as well ;)
But either case its weird, we run a pretty large marathon cluster and even over the internet I never get a response time over 100ms.

I will leave the issue open a little bit in case you figure out something else 🤔

@vovinacci
Copy link
Author

I'm planning to do connection test first for a few hours first that should (if I'm not mistaken) imitate behaviour of Nixy (ask every 2s, wait max 5s for the reply) and to be ran on Nixy server (it's one and the only at the moment)

#!/usr/bin/env sh
set -e

while true; do
  (date && curl -m 5 -w "@curl-format.txt" -o /dev/null -s "http://master-1:8080/ping") >> master-1.log
  (date && curl -m 5 -w "@curl-format.txt" -o /dev/null -s "http://master-2:8080/ping") >> master-2.log
  (date && curl -m 5 -w "@curl-format.txt" -o /dev/null -s "http://master-3:8080/ping") >> master-3.log
  sleep 2
done

curl-format.txt:

    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
                    ----------\n
         time_total:  %{time_total}\n

@vovinacci
Copy link
Author

@martensson
This smells like Digital Ocean DNS issue (original hostnames were changed to master-{1,2,3} in the output) as after test started at Wed Sep 6 13:59:55 CEST 2017, that what we had at Wed Sep 6 14:05:37 CEST 2017 for one of the endpoints

    time_namelookup:  0.000
       time_connect:  0.000
    time_appconnect:  0.000
   time_pretransfer:  0.000
      time_redirect:  0.000
 time_starttransfer:  0.000
                    ----------
         time_total:  5.515

So, I would close the issue now as it isn't related to Nixy.
Thank you very much for your support and have a splendid day!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants