Prom using up all available socket connections after a few days #4659

Closed · augmenter opened this Issue Sep 26, 2018 · 10 comments

augmenter commented Sep 26, 2018

The Windows version seems to use up all available socket connections after a long time.
It runs fine for a few days, then I am hit with log messages: dial tcp 127.0.0.1:3000: bind: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. When this happens, I am unable to open any web pages in a browser or make any new connections from my machine, and scraping stops.

Proposal

I'm running a test environment on my Windows machine as a proof of concept; later I'm planning to deploy it to a Linux environment.

Bug Report

What did you do?
Ran Prometheus for a few days.

What did you expect to see?
Expected it to keep working.

What did you see instead? Under which circumstances?
Scraping stopped.

Environment

  • System information:

    Windows 10

  • Prometheus version:
    prometheus, version 2.3.2 (branch: HEAD, revision: 71af5e2 5682fa14b7b)
    build user: root@5258e0bd9cc1
    build date: 20180712-14:13:08
    go version: go1.10.3

  • Logs:

level=debug ts=2018-09-26T05:33:55.2407613Z caller=scrape.go:703 component="scrape manager" scrape_pool=evolution_bet_exporter target=http://127.0.0.1:3000/metrics msg="Scrape failed" err="Get http://127.0.0.1:3000/metrics: dial tcp 127.0.0.1:3000: bind: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full."
level=debug ts=2018-09-26T05:34:00.034316Z caller=scrape.go:703 component="scrape manager" scrape_pool=mongo target=http://mongo.de.prod:9001/metrics msg="Scrape failed" err="Get http://mongo.de.prod:9001/metrics: dial tcp 127.0.0.1:9001: bind: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full."
augmenter commented Sep 26, 2018

It might be connected to my other issue, #4646, where some domains are not scraped even though the data is valid; some error might not be freeing up the connections. (Just throwing an idea out there...)

simonpasquier commented Sep 26, 2018

This is probably a Windows-specific issue as it hasn't been reported for other systems. How many targets do you scrape?

augmenter commented Sep 26, 2018

About 20. Some of the targets are still enabled even though they're down; about 5 are UP. Should I try running only with targets that are UP and see if there's a difference?
Edit: now running 12/12 UP targets. Will let you know how it goes…
Edit 2: a few hours later with only UP targets, I ran out of buffer space again; doing a full restart now and trying again.

augmenter commented Sep 27, 2018

Having all targets in the UP state does not help. Will try running Prometheus from Docker and will update here.

gouthamve commented Sep 28, 2018

Hmm, this is concerning. Could you check whether Prometheus is holding open files or network sockets? I'm not familiar enough with Windows to know how to do it there, but on Linux I'd use lsof.

Further, if it is indeed files, using 2.4.2 would help a lot, because we've handled some issues regarding file handling on Windows in that release.
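
A minimal sketch of that kind of check, assuming the Prometheus process ID is known (written here as <prometheus pid>, a placeholder rather than anything from this thread):

# Linux: list descriptors held by the Prometheus process
lsof -p <prometheus pid> | wc -l          # total open files and sockets
lsof -a -p <prometheus pid> -i tcp        # only TCP sockets, with their state

A count that keeps growing between scrapes would point at connections (or files) not being released.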

augmenter commented Sep 29, 2018

I have been running Prometheus inside Docker for a few days and it seems better. I can't use 2.4.2; I receive this error: "INVALID is not a valid start token". I will try to run the Windows version and, using the tips found here, check whether Prometheus is holding the connections open: https://stackoverflow.com/questions/8902136/how-to-find-a-list-of-sockets-held-by-a-process-in-windows
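
Along the lines of that link, a minimal sketch of the Windows-side check, assuming the Prometheus process ID is taken from Task Manager (written here as <prometheus pid>, a placeholder):

netstat -ano | findstr <prometheus pid>

or, in PowerShell:

Get-NetTCPConnection -OwningProcess <prometheus pid> | Group-Object State

A large and growing number of connections in TIME_WAIT or ESTABLISHED for that PID would confirm Prometheus is the process holding them.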

simonpasquier commented Oct 1, 2018

> I cant use 2.4.2 I receive this error "INVALID is not a valid start token"

Can you paste the metrics output from the target that generates this error?

augmenter commented Oct 2, 2018

@simonpasquier: I can't really share the whole output due to privacy reasons; here's a piece of it. Would it be possible to add more detail to the parsing error?

# HELP api_response_average api_response_average
# TYPE api_response_average counter
api_response_average{group="giveMoney",server="web03"} 0
api_response_average{group="giveMoney",server="web04"} 0
# HELP api_response_seconds api_response_seconds
# TYPE api_response_seconds summary
api_response_seconds{group="getGame",server="web03",quantile="0.4"} 6
api_response_seconds{group="getGame",server="web03",quantile="0.5"} 5
api_response_seconds{group="getGame",server="web03",quantile="0.6"} 1
api_response_seconds{group="getGame",server="web03",quantile="0.7"} 2
api_response_seconds{group="getGame",server="web03",quantile="0.9"} 1
api_response_seconds{group="getGame",server="web03",quantile="1.5"} 1
api_response_seconds{group="getGame",server="web03",quantile="2"} 1
api_response_seconds{group="getGame",server="web03",quantile="2.5"} 5
api_response_seconds{group="getGame",server="web03",quantile="3"} 1
api_response_seconds{group="getGame",server="web03",quantile="3.5"} 1
api_response_seconds{group="getGame",server="web04",quantile="0.4"} 4
api_response_seconds{group="getGame",server="web04",quantile="0.5"} 2
api_response_seconds{group="getGame",server="web04",quantile="0.7"} 3
api_response_seconds{group="getGame",server="web04",quantile="0.8"} 1
api_response_seconds{group="getGame",server="web04",quantile="0.9"} 1
api_response_seconds{group="getGame",server="web04",quantile="10"} 1
api_response_seconds{group="getGame",server="web04",quantile="2.5"} 1
api_response_seconds{group="getGame",server="web04",quantile="3"} 1
api_response_seconds{group="getGame",server="web04",quantile="3.5"} 2
api_response_seconds{group="getGame",server="web05",quantile="0.4"} 1
api_response_seconds{group="getGame",server="web05",quantile="0.5"} 6
api_response_seconds{group="getGame",server="web05",quantile="0.6"} 3
api_response_seconds{group="getGame",server="web05",quantile="0.7"} 1
api_response_seconds{group="getGame",server="web05",quantile="0.8"} 1
api_response_seconds{group="getGame",server="web05",quantile="1"} 1
# HELP mysql_connections_average mysql_connections_average
# TYPE mysql_connections_average counter
mysql_connections_average{group="read_server",server="global"} 0
mysql_connections_average{group="write_server",server="global"} 0
# HELP mysql_connections_fail_total mysql_connections_fail_total
# TYPE mysql_connections_fail_total counter
mysql_connections_fail_total{group="read_server",server="global"} 0
mysql_connections_fail_total{group="write_server",server="global"} 0
# HELP mysql_connections_ok_total mysql_connections_ok_total
# TYPE mysql_connections_ok_total counter
mysql_connections_ok_total{group="read_server",server="global"} 69
mysql_connections_ok_total{group="write_server",server="global"} 64
# HELP mysql_connections_total mysql_connections_total
# TYPE mysql_connections_total counter
mysql_connections_total{group="read_server",server="global"} 0
mysql_connections_total{group="write_server",server="global"} 0

simonpasquier commented Oct 3, 2018

You can check the metrics with promtool:

curl -s <your metric endpoint URL> | promtool check metrics
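
For example, against the local exporter that appears in the scrape errors above (the URL is simply the one from those log lines, not a new endpoint):

curl -s http://127.0.0.1:3000/metrics | promtool check metrics

promtool reads the exposition text from stdin and reports lines it cannot parse along with lint issues, which should narrow down where the "INVALID is not a valid start token" comes from.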

augmenter commented Oct 8, 2018

@simonpasquier Seems the issue is gone in v2.4.2!! I will report back if it reoccurs. Thanks a lot for the help!

augmenter closed this Oct 8, 2018
