Too many open File Descriptors (SYN_SENT) #1726

Closed
gouthamve opened this Issue Jun 11, 2016 · 6 comments

@gouthamve (Member) commented Jun 11, 2016

What did you do?
Ran Prometheus on production load.

What did you expect to see?
Prometheus runs fine.

What did you see instead? Under which circumstances?
Prometheus starts erroring with "too many open files".

Environment

  • System information:

    Linux 3.13.0-74-generic x86_64

  • Prometheus version:

    build user: root@dfc6307dc40d
    build date: 20160526-01:42:25
    go version: go1.6.2

  • Prometheus configuration file:

# my global config
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'CM-Main'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  # - "second.rules"


# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 10 seconds.
    scrape_interval: 10s

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    target_groups:
      - targets: ['localhost:80']

  - job_name: 'discovery-daemon'

    # Override the global default and scrape targets from this job every 10 seconds.
    scrape_interval: 10s

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    target_groups:
      - targets: ['localhost:15000']

  - job_name: 'ps'

    metrics_path: '/metrics'
    target_groups:
      - targets: []
    file_sd_configs:
      - names:
        - /etc/config/prometheus/discover/ps.json

  - job_name: 'dc'

    metrics_path: '/metrics'
    target_groups:
      - targets: []
    file_sd_configs:
      - names:
        - /etc/config/prometheus/discover/dc.json
  • Logs:
time="2016-06-11T01:37:36Z" level=error msg="Error reading file \"/etc/config/prometheus/discover/dc.json\": open /etc/config/prometheus/discover/dc.json: too many open files" source="file.go:180"
time="2016-06-11T01:37:36Z" level=error msg="Error reading file \"/etc/config/prometheus/discover/ps.json\": open /etc/config/prometheus/discover/ps.json: too many open files" source="file.go:180"
time="2016-06-11T01:37:36Z" level=error msg="Error reading file \"/etc/config/prometheus/discover/dc.json\": open /etc/config/prometheus/discover/dc.json: too many open files" source="file.go:180"
time="2016-06-11T01:37:36Z" level=error msg="Error reading file \"/etc/config/prometheus/discover/ps.json\": open /etc/config/prometheus/discover/ps.json: too many open files" source="file.go:180"
time="2016-06-11T01:37:36Z" level=error msg="Error reading file \"/etc/config/prometheus/discover/dc.json\": open /etc/config/prometheus/discover/dc.json: too many open files" source="file.go:180"
time="2016-06-11T01:37:36Z" level=error msg="Error reading file \"/etc/config/prometheus/discover/dc.json\": open /etc/config/prometheus/discover/dc.json: too many open files" source="file.go:180"
...
time="2016-06-11T01:37:38Z" level=error msg="Error looking up label pair {__name__ ps_queue_total}: open data/labelpair_to_fingerprints/009635.ldb: too many open files" source="persistence.go:1293"
time="2016-06-11T01:37:39Z" level=warning msg="Series quarantined." fingerprint=da215ff5c458b0cb metric=jvm_memory_pool_bytes_used{instance="ec2-x-y-z-g..us-west-2.compute.amazonaws.com:10006", instanceId="i-7242eeaf", job="ps", pool="CMS Old Gen"} reason="open data/da/215ff5c458b0cb.db: too many open files" source="storage.go:1442"
time="2016-06-11T01:37:39Z" level=warning msg="Series quarantined." fingerprint=c6b8fd8d3324261c metric=jvm_memory_pool_bytes_used{instance="ec2-x-y-z-g..us-west-2.compute.amazonaws.com:10004", instanceId="i-75ecc3e0", job="ps", pool="Code Cache"} reason="open data/c6/b8fd8d3324261c.db: too many open files" source="storage.go:1442"
time="2016-06-11T01:37:39Z" level=warning msg="Series quarantined." fingerprint=18803fc3905db6e2 metric=scrape_duration_seconds{instance="ec2-x-y-z-g.us-west-2.compute.amazonaws.com:10003", instanceId="i-669a76c9", job="ps"} reason="open data/18/803fc3905db6e2.db: too many open files" source="storage.go:1442"
...
time="2016-06-11T01:37:41Z" level=error msg="Error indexing label pair to fingerprints batch: open data/labelpair_to_fingerprints/009635.ldb: too many open files" source="persistence.go:1241"
time="2016-06-11T01:37:41Z" level=error msg="Error indexing label name to label values batch: open data/labelname_to_labelvalues/002688.ldb: too many open files" source="persistence.go:1244"
time="2016-06-11T01:37:42Z" level=error msg="Error indexing label pair to fingerprints batch: open data/labelpair_to_fingerprints/009635.ldb: too many open files" source="persistence.go:1241"
time="2016-06-11T01:37:42Z" level=error msg="Error indexing label name to label values batch: open data/labelname_to_labelvalues/002688.ldb: too many open files" source="persistence.go:1244"
time="2016-06-11T01:37:43Z" level=error msg="Error indexing label name to label values batch: open data/labelname_to_labelvalues/002688.ldb: too many open files" source="persistence.go:1244"
time="2016-06-11T01:37:43Z" level=error msg="Error looking up label pair {area nonheap}: open data/labelpair_to_fingerprints/009645.ldb: too many open files" source="persistence.go:1293"
time="2016-06-11T01:37:43Z" level=error msg="Error indexing label name to label values batch: open data/labelname_to_labelvalues/002688.ldb: too many open files" source="persistence.go:1244"
time="2016-06-11T01:37:44Z" level=error msg="Error indexing label name to label values batch: open data/labelname_to_labelvalues/002688.ldb: too many open files" source="persistence.go:1244"

I increased the open file limit, but Prometheus keeps eating through it. The output of lsof (<pid> is the Prometheus PID):

sudo lsof | grep <pid> | wc -l ==> 61165
sudo lsof | grep <pid> | grep SYN_SENT | wc -l ==> 60149
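
(Aside: on Linux the same counts can be gathered without scanning the full lsof output; a minimal sketch, assuming iproute2's ss is available and <pid> again stands for the Prometheus PID:)

ls /proc/<pid>/fd | wc -l            # all file descriptors held by the process
ss -t state syn-sent | wc -l         # half-open outbound TCP connections, system-wide (subtract the header line)

The second count is system-wide rather than per-process, but on a host where Prometheus is the main network client it gives roughly the same picture.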

@gouthamve (Member Author) commented Jun 11, 2016

Well, I increased the limits to a drastic amount, and Prometheus is still using a LOT of files.

sudo lsof | grep <pid> | wc -l ==> 221049
sudo lsof | grep <pid> | grep SYN_SENT | wc -l ==> 183771

My queries are now timing out :(

@juliusv (Member) commented Jun 11, 2016

What does netstat -tpen tell you about where those SYN_SENT connections are going? It suggests that either the remote server is not replying with a SYN/ACK or isn't receiving the SYN in the first place. If this is happening for scrape targets under certain conditions, Prometheus should probably still not let half-open connections pile up.
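
(For illustration, one way to run that check, assuming net-tools' netstat is installed and the process shows up as prometheus; <pid> is the Prometheus PID:)

sudo netstat -tpen | grep SYN_SENT | grep '<pid>/prometheus' | awk '{print $5}' | sort | uniq -c | sort -rn

This groups the half-open connections by their foreign address (column 5 of netstat's output), which makes it easy to see whether they all point at a handful of dead targets.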

@gouthamve (Member Author) commented Jun 11, 2016

I found the root cause: my SD mechanism had a bug that left terminated instances lying around in the discovery files. I have patched the code and am currently testing with a small load; everything is running as expected.

I just started the test under production load and will keep you posted. But yep, is there any way we can make sure that unacknowledged connections don't pile up?
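
(For context: the discovery files referenced in the config above, e.g. /etc/config/prometheus/discover/ps.json, use Prometheus' file_sd JSON format, roughly as sketched below; any stale entry left in such a file keeps being scraped until it is removed. The hostname and instance ID here are made up.)

[
  {
    "targets": ["ec2-x-y-z.us-west-2.compute.amazonaws.com:10003"],
    "labels": {"instanceId": "i-0000feed"}
  }
]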

@juliusv (Member) commented Jun 11, 2016

So your file SD actually contained hundreds of thousands of targets in the end? If so, then it's not really a piling-up issue in Prometheus; Prometheus simply has to connect to too many targets at once?

@gouthamve (Member Author) commented Jun 11, 2016

Yep, you were right. We kill and launch new instances every hour, and the killed instances are not cleaned out of the discovery files for 5-10 minutes, during which the open connections pile up.

But the weird thing is that with 7K-10K services, we are still hitting 90K open connections. While this might be expected, is there any flag/config or something I could do to reduce the number?

The good thing is that they fall from 90K to 4K very quickly once the cleanup is done.
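
(On the flag/config question: one knob that already exists in the config above is the global scrape_timeout, which caps how long a single scrape attempt, including the TCP connect, may take; lowering it from the 10s default should in principle make connections to dead targets get abandoned sooner. A minimal sketch, not a recommendation of a specific value:)

global:
  scrape_interval:     15s
  scrape_timeout:      5s   # give up on unresponsive targets sooner; the default is 10s
  evaluation_interval: 15s

Note that scrape_timeout must not exceed scrape_interval.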

@gouthamve closed this Jun 16, 2016

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
