InfluxdbWriter not closing connections Icinga2 2.10.3 CentOS 7 #6989

Closed
bengelberth-shadowsoft opened this issue Mar 1, 2019 · 18 comments · Fixed by #6990
Labels: area/influxdb (Metrics to InfluxDB), bug (Something isn't working)

Comments

@bengelberth-shadowsoft

Expected Behavior

I would expect Icinga2 to close http connections to Influxdb rather than keeping them open and opening additional ones.

Current Behavior

Icinga2 appears to keep opening new connections to influxdb without ever closing them. This results in thousands of established connections. I have seen three outcomes so far:

  • Icinga2 system runs out of memory and kills Icinga2

  • Icinga2 system runs out of File Descriptors

  • Influxdb system runs out of File Descriptors and influxdb crashes

Possible Solution

I am not sure what the solution is. However, this behavior is new in Icinga2 2.10.3.

Steps to Reproduce (for bugs)

1. Turn on the influxdb feature and monitor established TCP connections.
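
A rough sketch of one way to do the monitoring part on Linux, for anyone trying to reproduce this. It is not from the original report: the helper name `count_established_to_port` is made up, and it assumes the usual `/proc/net/tcp` layout (remote address in the third column, state `01` for ESTABLISHED) and the InfluxDB port 8086 from the config below.

```shell
# Count ESTABLISHED TCP sockets whose remote port matches $1.
# $2 is the table to read (defaults to /proc/net/tcp; pass /proc/net/tcp6 for IPv6).
# Hypothetical helper, not an Icinga 2 tool.
count_established_to_port() {
    # /proc/net/tcp stores ports as 4-digit uppercase hex, e.g. 8086 -> 1F96.
    cetp_port_hex=$(printf '%04X' "$1")
    awk -v port="$cetp_port_hex" '
        # Skip the header row; state 01 is TCP_ESTABLISHED.
        NR > 1 && $4 == "01" {
            n = split($3, addr, ":")
            if (addr[n] == port) count++
        }
        END { print count + 0 }
    ' "${2:-/proc/net/tcp}"
}
```

To sample every 10 seconds, something like `while sleep 10; do echo "$(date +%T) $(count_established_to_port 8086)"; done` should do. A count that only ever grows matches the behavior described here.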

Context

This issue causes Icinga2 to be killed by the system once all memory is used, or to run out of file descriptors so that checks fail, or it crashes the influxdb process on a remote system.

Your Environment

  • Version used (icinga2 --version):
icinga2 - The Icinga 2 network monitoring daemon (version: r2.10.3-1)

Copyright (c) 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-957.5.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: unknown

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
  • Operating System and version:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog elasticsearch gelf graphite livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-mysql influxdb mainlog notification
  • Icinga Web 2 version and modules (System - About):
Icingaweb2 2.6.2
Modules: doc 2.6.2, grafana 1.3.4, monitoring 2.6.2
  • Config validation (icinga2 daemon -C):
[2019-03-01 07:40:24 -0500] information/cli: Icinga application loader (version: r2.10.3-1)
[2019-03-01 07:40:24 -0500] information/cli: Loading configuration file(s).
[2019-03-01 07:40:24 -0500] information/ConfigItem: Committing config item(s).
[2019-03-01 07:40:24 -0500] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2019-03-01 07:40:24 -0500] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2019-03-01 07:40:24 -0500] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2019-03-01 07:40:24 -0500] warning/ApiListener: Please read the upgrading documentation for v2.8: https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2019-03-01 07:40:24 -0500] information/ApiListener: My API identity: REMOVED
[2019-03-01 07:40:25 -0500] warning/ApplyRule: Apply rule 'snmp-interface' (in /var/lib/icinga2/api/zones/global-config/_etc/services-snmp.conf: 1:0-1:29) for type 'Service' does not match anywhere!
[2019-03-01 07:40:25 -0500] warning/ApplyRule: Apply rule 'snmp-storage' (in /var/lib/icinga2/api/zones/global-config/_etc/services-snmp.conf: 20:1-20:28) for type 'Service' does not match anywhere!
[2019-03-01 07:40:25 -0500] warning/ApplyRule: Apply rule 'disk-windows' (in /var/lib/icinga2/api/zones/global-config/_etc/services-windows.conf: 1:0-1:27) for type 'Service' does not match anywhere!
[2019-03-01 07:40:25 -0500] warning/ApplyRule: Apply rule 'nscp-local-memory' (in /var/lib/icinga2/api/zones/global-config/_etc/services-windows.conf: 9:1-9:33) for type 'Service' does not match anywhere!
[2019-03-01 07:40:25 -0500] warning/ApplyRule: Apply rule 'ping6' (in /var/lib/icinga2/api/zones/global-config/_etc/services.conf: 35:1-35:21) for type 'Service' does not match anywhere!
[2019-03-01 07:40:25 -0500] warning/ApplyRule: Apply rule '' (in /var/lib/icinga2/api/zones/global-config/_etc/services.conf: 282:1-282:66) for type 'Service' does not match anywhere!
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 646 Services.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 58 Hosts.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 4 EventCommands.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 1 FileLogger.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 113 Dependencies.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 7 NotificationCommands.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 1400 Notifications.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 9 HostGroups.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 1 ApiListener.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 34 Downtimes.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 36 Zones.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 41 Endpoints.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 6 ApiUsers.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 4 Users.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 218 CheckCommands.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 2 UserGroups.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 5 ServiceGroups.
[2019-03-01 07:40:25 -0500] information/ConfigItem: Instantiated 3 TimePeriods.
[2019-03-01 07:40:25 -0500] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-03-01 07:40:25 -0500] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

  • This is my /etc/icinga2/features-available/influxdb.conf for the icinga2 feature. I have removed host, username, and password values.

cat /etc/icinga2/features-available/influxdb.conf 
/**
 * The InfluxdbWriter type writes check result metrics and
 * performance data to an InfluxDB HTTP API
 */

library "perfdata"
object InfluxdbWriter "influxdb" {
  host = "*"
  ssl_enable = true
  port = 8086
  database = "icinga2"
  flush_threshold = 1024
  flush_interval = 10s
  username = "*"
  password = "*"
  host_template = {
    measurement = "$host.check_command$"
    tags = {
      hostname = "$host.name$"
    }
  }
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
    }
  }
  enable_send_thresholds = true
  enable_send_metadata = true
}
  • influxd version InfluxDB v1.7.4 (git: 1.7 ef77e72f435b71b1ad6da7d6a6a4c4a262439379)
  • socket stats for system with influxdb feature on
cat /proc/net/sockstat
sockets: used 5663
TCP: inuse 5435 orphan 1 tw 11 alloc 5438 mem 40
UDP: inuse 4 mem 3
UDPLITE: inuse 0
RAW: inuse 3
FRAG: inuse 0 memory 0
  • socket stats for system with influxdb feature off
cat /proc/net/sockstat
sockets: used 270
TCP: inuse 44 orphan 0 tw 19 alloc 45 mem 53
UDP: inuse 4 mem 2
UDPLITE: inuse 0
RAW: inuse 2
FRAG: inuse 0 memory 0

Here is a graph showing the established TCP connections growing over time. At the first peak, Icinga2 was killed for running out of memory. At the second peak, the influxdb daemon on another server crashed and restarted.
[image: graph of established TCP connections over time]

Before the upgrade this Icinga2 system would maintain about 33 TCP connections. After the upgrade it peaked at 7,270 TCP connections.
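
To pull just the TCP `inuse` figure out of `/proc/net/sockstat` for comparisons like the one above, a small helper along these lines might be enough. This is only a sketch: `tcp_inuse` is a hypothetical name, and it assumes the `TCP: inuse N ...` line format shown in the two dumps.

```shell
# Print the "inuse" counter from the TCP line of a sockstat file.
# $1 is the file to read (defaults to /proc/net/sockstat).
# Hypothetical helper for illustration only.
tcp_inuse() {
    awk '$1 == "TCP:" {
        # After the "TCP:" label the line is a list of name/value pairs.
        for (i = 2; i < NF; i += 2)
            if ($i == "inuse") print $(i + 1)
    }' "${1:-/proc/net/sockstat}"
}
```

Fed the two dumps above, it would print 5435 and 44 respectively.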

@bengelberth-shadowsoft
Author

@Al2Klimov Thank you for taking a look at this.

As I continue to troubleshoot the situation, it appears that the issue is only showing up on systems that are using an ssl connection to influxdb. I am not seeing it on test systems that are not using an ssl connection to influxdb.

@Al2Klimov
Member

ref/IC/12219

@marcofl

marcofl commented Mar 6, 2019

Same issue here: it seems to be one more connection per influxdb flush. We downgraded to 2.10.2 for now.
[screenshot, 2019-03-06 11:24:15]
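
If the leak really is one connection per flush, a back-of-the-envelope estimate of the time until the descriptor limit is hit could look like the sketch below. Everything in it is an assumption: the helper name `hours_until_fd_limit` is made up, the limit has to come from your own system, and flushes triggered by `flush_threshold` rather than `flush_interval` would shorten the estimate.

```shell
# Estimate the whole hours until a file-descriptor limit is reached when each
# flush leaks one connection. All inputs are assumptions supplied by the caller:
#   $1 = descriptor limit
#   $2 = descriptors already in use
#   $3 = seconds between flushes
# Hypothetical helper for illustration only.
hours_until_fd_limit() {
    # Integer arithmetic: remaining descriptors * seconds per leak / 3600.
    echo $(( ($1 - $2) * $3 / 3600 ))
}
```

For example, `hours_until_fd_limit 4096 33 10` (a hypothetical limit of 4096, plus the roughly 33 baseline connections and the 10s flush interval from the original report) prints 11.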

@Al2Klimov
Member

Hello @marcofl!

Please could you test #6990?

Best,
AK

mcktr added the area/influxdb (Metrics to InfluxDB) label on Mar 8, 2019
@cultcom

cultcom commented Mar 11, 2019

I can confirm the exact same behaviour for our system. As we have sufficient RAM the process does not die but all checks fail with:
Error: Function call 'pipe2' failed with error code 24, 'Too many open files'

[image]
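
Error code 24 is EMFILE, i.e. the process hit its open-file limit. To see how close an icinga2 process is to that limit, a sketch like the following might help. The helper names are hypothetical; it assumes Linux procfs, where `/proc/<pid>/limits` has a `Max open files` row with the soft limit in the fourth column and `/proc/<pid>/fd` holds one entry per open descriptor.

```shell
# Print the soft "Max open files" limit from a /proc/<pid>/limits file ($1).
# Hypothetical helper for illustration only.
fd_soft_limit() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Count the entries in a /proc/<pid>/fd directory ($1), one per open descriptor.
# Hypothetical helper for illustration only.
fd_in_use() {
    # tr strips the padding some wc implementations add.
    ls -1 "$1" | wc -l | tr -d ' '
}
```

Run as root or as the icinga user, usage could look like `for p in $(pidof icinga2); do echo "$p: $(fd_in_use /proc/$p/fd) of $(fd_soft_limit /proc/$p/limits)"; done`.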

@prupert

prupert commented Mar 11, 2019

Same problem here. InfluxDBWriter with TLS.

All Icinga checks cannot be executed after a while due to 'Too many open files' error. This is a serious bug. We have downgraded to 2.10.2 for now.

@Al2Klimov
Member

Hello guys!

Feel free to test the PR I linked. The faster one of you writes a test protocol, the faster it will be merged.

Best,
AK

@marcofl

marcofl commented Mar 11, 2019

> Hello @marcofl!
>
> Please could you test #6990?
>
> Best,
> AK

Sure, can you point me to the correct snapshot package for xenial?

@Al2Klimov
Member

Hello @marcofl!

I'm afraid there isn't any (yet). If there were any, I'd not refer to the PR.

Best,
AK

dnsmichi added this to the 2.11.0 milestone on Mar 12, 2019
Icinga deleted a comment from Al2Klimov on Mar 12, 2019
@dnsmichi
Contributor

dnsmichi commented Mar 13, 2019

I'm wondering which changes are involved here, since git diff v2.10.2 v2.10.3 lib/perfdata doesn't show anything relevant. Likely it is related to a2ae01e: dropping the life support references made the original problem (not closing the streams at all) more visible.

@marcofl

marcofl commented Mar 13, 2019

Can you give this ticket higher priority / bug label maybe? This actually made this version unusable for everyone using the InfluxDB writer...

@sharkyzz

sharkyzz commented Mar 13, 2019

Yes, I can confirm we experience the exact same issue. Also using Icinga 2.10.3 with InfluxDB and TLS.

Al2Klimov added the bug (Something isn't working) label on Mar 14, 2019
@dnsmichi
Contributor

> Can you give this ticket higher priority / bug label maybe? This actually made this version unusable for everyone using the InfluxDB writer...

@Al2Klimov created a patch which is on my review list. I am at Icinga Camp Berlin currently, so I will merge this next week at the latest.

Cheers,
Michael

@bengelberth-shadowsoft
Author

Installed #6990 on our systems that were suffering from this issue.

It has been running since March 11.

The issue appears to have cleared up. We have been observing the TCP connection count and it is NOT increasing. Previously we could hit a crash or run out of file handles in 12 hours or less.
The Icinga2 system is now functioning normally.

Thank you for your help.

@linuxmail

> Yes, I can confirm we experience the exact same issue. Also using Icinga 2.10.3 with InfluxDB and TLS.

Me too :-)

$ netstat -patune | grep icinga | wc -l
3829

Nearly all of them are connections to InfluxDB.
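
To break a count like this down by remote endpoint, the netstat output can be grouped on the foreign-address column. A sketch, assuming the Linux net-tools column order (foreign address in column 5, state in column 6); `established_by_remote` is a made-up name.

```shell
# Read netstat output on stdin and print "count remote_address" for every
# remote endpoint with ESTABLISHED connections, busiest first.
# Hypothetical helper for illustration only.
established_by_remote() {
    awk '$6 == "ESTABLISHED" { seen[$5]++ }
         END { for (remote in seen) print seen[remote], remote }' | sort -rn
}
```

Usage would be along the lines of `netstat -patune | grep icinga | established_by_remote | head`. If the leak is present, the InfluxDB address should dominate the top of the list.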

@dnsmichi
Contributor

This may affect other (TLS) streams as well, not only the InfluxDB/Elasticsearch features.

@dnsmichi
Contributor

Tests

Use the influxdb vagrant box, and modify it a bit for TLS.

usermod -a -G icinga influxdb

vim /etc/influxdb/influxdb.conf

[http]
https-certificate = "/var/lib/icinga2/certs/icinga2-influxdb.vagrant.demo.icinga.com.crt"
https-private-key = "/var/lib/icinga2/certs/icinga2-influxdb.vagrant.demo.icinga.com.key"
https-enabled = true

systemctl restart influxdb


vim /etc/icinga2/features-enabled/influxdb.conf

  ssl_enable = true

systemctl restart icinga2

The Grafana datasource needs to be modified as well: set it to server access, switch the URL to https, and enable skip verify for TLS.

Generate some more load from Icinga:

vim /etc/icinga2/demo/many.conf

const countHosts = 100;

systemctl restart icinga2

Open files

[root@icinga2-influxdb ~]# for p in $(pidof icinga2); do lsof -p $p | grep TCP; done
icinga2 5317 icinga   15u     IPv4             807181       0t0       TCP *:5665 (LISTEN)
icinga2 5317 icinga   16u     IPv4             806410       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:42390->icinga2-influxdb.vagrant.demo.icinga.com:mysql (ESTABLISHED)
icinga2 5317 icinga   19u     IPv4             809574       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:34776->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
icinga2 5317 icinga   20u     IPv4             806696       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:34736->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
icinga2 5317 icinga   22u     IPv4             807700       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:34750->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
icinga2 5317 icinga   23u     IPv4             815710       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:34928->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
icinga2 5317 icinga   24u     IPv4             813370       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:34882->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
icinga2 5317 icinga   25u     IPv4             823461       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:35094->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
icinga2 5317 icinga   26u     IPv4             814492       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:34906->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)

....

icinga2 5317 icinga  180u     IPv4             877680       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:36486->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
icinga2 5317 icinga  181u     IPv4             873421       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:36210->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
icinga2 5317 icinga  185u     IPv4             876449       0t0       TCP icinga2-influxdb.vagrant.demo.icinga.com:36476->icinga2-influxdb.vagrant.demo.icinga.com:d-s-n (ESTABLISHED)
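
lsof prints service names from /etc/services by default, and `d-s-n` is presumably how port 8086 shows up here. To turn a listing like the one above into a single number, something like this sketch could be used. `count_lsof_established` is a hypothetical helper that reads lsof output on stdin and matches the remote port either by number or by service name.

```shell
# Count lsof lines showing an ESTABLISHED connection whose remote port is
# the number in $1 or the service name in $2. Reads lsof output on stdin.
# Hypothetical helper for illustration only.
count_lsof_established() {
    awk -v port="$1" -v name="$2" '
        $NF == "(ESTABLISHED)" {
            # The second-to-last field looks like local:port->remote:port.
            n = split($(NF - 1), part, ":")
            if (part[n] == port || part[n] == name) count++
        }
        END { print count + 0 }
    '
}
```

For example: `for p in $(pidof icinga2); do lsof -p $p | count_lsof_established 8086 d-s-n; done`. Adding `-n -P` to lsof keeps addresses and ports numeric, in which case only the port number matters.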

Fix

[screenshot, 2019-03-18 14:33:23]

dnsmichi pushed a commit that referenced this issue Mar 19, 2019
refs #6989

(cherry picked from commit 2a6b122)
@marcofl

marcofl commented Mar 29, 2019

I can confirm the issue is gone with 2.10.4 for us. Thanks a lot.
