
Telegraf fails to write data to Elasticsearch #5676

Closed
aurimasplu opened this issue Apr 4, 2019 · 4 comments
Labels
area/elasticsearch · bug

Comments

@aurimasplu

aurimasplu commented Apr 4, 2019

Relevant telegraf.conf:

[global_tags]
[agent]
  interval = "1s" 
  round_interval = true
  metric_batch_size = 1000 
  metric_buffer_limit = 1000000 
  collection_jitter = "0s"
  flush_interval = "120s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = "TELEGRAF-LOG"
  hostname = ""
  omit_hostname = true

[[outputs.elasticsearch]]
  urls = [ "https://my-elastic-node1:9200", "https://my-elastic-node2:9200" ]
  timeout = "5s"
  enable_sniffer = false
  health_check_interval = "0s"
  username = "elasticuser"
  password = "elasticpass"
  index_name = "telegraf-{{measurement_tag}}-%Y.%m.%d"
  default_tag_value = "interface"
  manage_template = true
  template_name = "telegraf"
  overwrite_template = false
  insecure_skip_verify = true
  namepass = ["interface", "LoadbalancerVserver"]

[[outputs.elasticsearch]]
  urls = [ "https://my-elastic-node1:9200", "https://my-elastic-node2:9200" ]
  timeout = "5s"
  enable_sniffer = false
  health_check_interval = "0s"
  username = "elasticuser"
  password = "elasticpass"
  index_name = "telegraf-{{measurement_tag}}-%Y.%m"
  default_tag_value = ""
  manage_template = true
  template_name = "telegraf"
  overwrite_template = false
  insecure_skip_verify = true
  namedrop = ["interface", "LoadbalancerVserver"]

System info:

OS: RHEL 7.6
CPU: 8 cores
RAM: 16G
Telegraf versions:
Issue noticed in: 1.9.4-1, 1.10.1-1
Everything was working fine in: 1.7.4-1 and lower.

Steps to reproduce:

  1. Run telegraf version 1.9.4-1 or 1.10.1-1.
  2. We have identical RHEL-7.6 virtual servers with identical telegraf configuration. The only difference is that one server runs telegraf-1.7.4-1 and the other 1.10.1-1.
    On the server with telegraf-1.10.1-1 we don't get most of the metrics, and telegraf starts generating the log output provided below. It seems that telegraf fails to communicate with Elasticsearch.

I am also collecting Telegraf self-monitoring metrics with inputs.internal (a minimal sketch of that input is shown below), and buffer_size behaves differently between the two versions; see the screenshot.
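
For reference, the self-monitoring input is just the stock inputs.internal plugin; a minimal sketch of how it is enabled (collect_memstats is optional and shown here only for illustration):

[[inputs.internal]]
  # collects internal Telegraf metrics such as buffer_size; memstats is optional
  collect_memstats = true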

Expected behavior:

We run thousands of inputs.snmp instances to collect various SNMP counters and write the data to two Elasticsearch outputs (a minimal sketch of one such input is included below for context). With telegraf-1.7.4-1 and lower everything worked fine: metrics were collected and written to the outputs successfully.
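
Each of those SNMP inputs follows roughly this shape; the agent address, community string and OID below are placeholders for illustration, not our production values:

[[inputs.snmp]]
  agents = ["192.0.2.1:161"]          # placeholder agent address
  version = 2
  community = "public"                # placeholder community string
  name = "interface"
  [[inputs.snmp.field]]
    name = "ifHCInOctets"
    oid = ".1.3.6.1.2.1.31.1.1.1.6.1" # placeholder OID (IF-MIB::ifHCInOctets.1)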

Actual behavior:

After upgrading to telegraf-1.10.1-1 we noticed that we are missing data: some random points get written, but most are missing. We also checked telegraf-1.9.4-1 and saw the same behavior.

Additional info:

[root@my-server PROD:log]# cat messages | grep telegraf | grep elasticsearch
Apr  4 08:14:51 my-server telegraf: 2019-04-04T06:14:51Z I! Loaded outputs: elasticsearch elasticsearch
Apr  4 08:16:05 my-server telegraf: 2019-04-04T06:16:05Z E! [agent] Error writing to output [elasticsearch]: Error sending bulk request to Elasticsearch: Post https://my-elastic-node1:9200/_bulk: context deadline exceeded
Apr  4 08:16:05 my-server telegraf: 2019-04-04T06:16:05Z E! [agent] Error writing to output [elasticsearch]: Error sending bulk request to Elasticsearch: Post https://my-elastic-node1:9200/_bulk: context deadline exceeded
Apr  4 08:16:10 my-server telegraf: 2019-04-04T06:16:10Z E! [agent] Error writing to output [elasticsearch]: Error sending bulk request to Elasticsearch: Post https://my-elastic-node2:9200/_bulk: context deadline exceeded
Apr  4 08:16:36 my-server telegraf: 2019-04-04T06:16:36Z E! [agent] Error writing to output [elasticsearch]: Error sending bulk request to Elasticsearch: Post https://my-elastic-node2:9200/_bulk: context deadline exceeded
Apr  4 08:18:01 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@785c460b; line: 1, column: 119], i_o_exception
Apr  4 08:18:01 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@567dd72e; line: 1, column: 119], i_o_exception
Apr  4 08:18:01 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@3745d633; line: 1, column: 119], i_o_exception
Apr  4 08:18:01 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@147d29e5; line: 1, column: 120], i_o_exception
Apr  4 08:18:01 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@4e3a341b; line: 1, column: 119], i_o_exception
Apr  4 08:18:01 my-server telegraf: 2019-04-04T06:18:01Z E! [agent] Error writing to output [elasticsearch]: W! Elasticsearch failed to index 5 metrics
Apr  4 08:18:04 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@2bb94494; line: 1, column: 119], i_o_exception
Apr  4 08:18:04 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@7a6a2327; line: 1, column: 119], i_o_exception
Apr  4 08:18:04 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@7eefb0f6; line: 1, column: 119], i_o_exception
Apr  4 08:18:04 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@51d4f645; line: 1, column: 119], i_o_exception
Apr  4 08:18:04 my-server telegraf: at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@3ded514; line: 1, column: 120], i_o_exception
Apr  4 08:18:04 my-server telegraf: 2019-04-04T06:18:04Z E! [agent] Error writing to output [elasticsearch]: W! Elasticsearch failed to index 5 metrics
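
The "context deadline exceeded" errors line up with the 5s output timeout in the config above (the bulk request is cancelled once that timeout expires). Raising it would look like the sketch below; the value is illustrative and we have not confirmed that it resolves the problem:

[[outputs.elasticsearch]]
  # ...same settings as above...
  timeout = "30s"   # illustrative value; our current config sets timeout = "5s"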

[screenshot: inputs.internal buffer_size comparison]

One more thing that we had not seen before: when trying to stop the service, telegraf-1.10.1 does not shut down correctly:

systemd[1]: telegraf.service stop-sigterm timed out. Killing.
systemd[1]: telegraf.service: main process exited, code=killed, status=9/KILL
systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
systemd[1]: Unit telegraf.service entered failed state.
systemd[1]: telegraf.service failed.
@danielnelson added the area/elasticsearch and bug labels on Apr 23, 2019
@danielnelson
Contributor

What version of Elasticsearch are you using? Did reverting to Telegraf 1.7 fix the issue?

@deepaksood619

deepaksood619 commented Apr 24, 2019

I am facing the same issue.

2019-04-24T06:10:29Z W! [agent] output "elasticsearch" did not complete within its flush interval
2019-04-24T06:10:59Z W! [agent] output "elasticsearch" did not complete within its flush interval
2019-04-24T06:11:00Z E! [agent] Error writing to output [elasticsearch]: Error sending bulk request to Elasticsearch: Post http://elasticsearch.example.com:9200/_bulk: context deadline exceeded

Elasticsearch version: 6.4.2

Telegraf versions (same error with both):

Telegraf 1.9.2 (git: HEAD dda80799)
Telegraf 1.10.2 (git: HEAD 3303f5c3)

Config:

[[outputs.elasticsearch]]
  urls = ["http://elasticsearch.example.com:9200"]
  timeout = "1m"
  flush_interval = "30s"
  enable_sniffer = false
  health_check_interval = "0s"
  index_name = "device_log-%Y.%m.%d"
  manage_template = true
  template_name = "telegraf"
  overwrite_template = false
  namepass = ["tail"]

[[inputs.tail]]
  files = ["/var/log/electric_meter.log", "/var/log/telegraf/telegraf.log", "/var/log/health-log", "/var/log/syslog"]
  from_beginning = false
  interval = "10s"
  pipe = false
  watch_method = "inotify"
  data_format = "value"
  data_type = "string"

@aurimasplu
Author

@danielnelson we are using Elasticsearch 6.1.1, and we write directly to it; we are not using Logstash or any other queuing product.
And yes, reverting back to Telegraf 1.7 fixed the issue.

@sjwang90
Contributor

sjwang90 commented Dec 4, 2020

I came across this issue and wanted to check whether it still occurs with the latest version of Telegraf, as it's been quite a while.
