container usage does not decrease after buffer is emptied #7234

Closed
dpajin opened this issue Mar 26, 2020 · 4 comments
Labels
area/influxdb, bug (unexpected problem or unintended behavior)

Comments

dpajin (Contributor) commented Mar 26, 2020

Relevant telegraf.conf:

[agent]
  interval = "1m"
  round_interval = true
  flush_buffer_when_full = true
  metric_buffer_limit = 3000000
  metric_batch_size = 100000
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = true
  omit_hostname = true

  
# Output database for telegraf internal statistics
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = 
  username = 
  password = 
  precision = "s"
  retention_policy = "default"
  timeout = "5s"
  skip_database_creation = true
  #metric_buffer_limit = 10000
  #metric_batch_size = 1000
  #flush_interval = "30s"
  namepass = ["internal_*"]

# Telegraf internal stats collection
[[inputs.internal]]
  ## If true, collect telegraf memory stats.
  interval = "1m"
  collect_memstats = false


# Output databases for each node
[[outputs.influxdb]]
  urls = ["http://db_1:8086"]
  database = 
  username = 
  password = 
  precision = "s"
  retention_policy = "default"
  timeout = "5s"
  skip_database_creation = true
  namedrop = ["internal_*"]

[[outputs.influxdb]]
  urls = ["http://db_2:8086"]
  database = 
  username = 
  password = 
  precision = "s"
  retention_policy = "default"
  timeout = "5s"
  skip_database_creation = true
  namedrop = ["internal_*"]

[[outputs.influxdb]]
  urls = ["http://db_3:8086"]
  database = 
  username = 
  password = 
  precision = "s"
  retention_policy = "default"
  timeout = "5s"
  skip_database_creation = true
  namedrop = ["internal_*"]

  
[[inputs.influxdb_listener]]
  ## Address and port to host HTTP listener on
  service_address = ":9096"

  ## maximum duration before timing out read of the request
  read_timeout = "5s"
  ## maximum duration before timing out write of the response
  write_timeout = "5s"

  ## Maximum allowed HTTP request body size in bytes.
  ## 0 means to use the default of 32MiB.
  max_body_size = 0

System info:

Telegraf 1.13.4, running in a container (image from Docker Hub)

Linux RMIMH03S 5.3.0-40-generic #32~18.04.1-Ubuntu SMP

Docker 19.03.7

Steps to reproduce:

I run Telegraf with the influxdb_listener input plugin and multiple InfluxDB outputs to write data into multiple InfluxDB databases, with metric_buffer_limit set to 3 million metrics (a trimmed-down reproduction config is sketched after the list below).

  1. One database goes down; Telegraf stores the metrics for it in the buffer.
  2. The buffer gets full.
  3. The database becomes available again and the metrics from the buffer are written to it.
  4. The memory used for buffering is not released after the buffer is emptied.
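
For a quicker reproduction, a trimmed-down config along these lines should show the same pattern. This is only a sketch: the small buffer and the db_down host are illustrative values, not taken from the production config above.

# Minimal reproduction sketch
[agent]
  interval = "10s"
  metric_buffer_limit = 10000    # small buffer so it fills quickly
  metric_batch_size = 1000
  flush_interval = "10s"

[[inputs.influxdb_listener]]
  service_address = ":9096"

# Point this at a database that is stopped, then start it once the buffer is full.
[[outputs.influxdb]]
  urls = ["http://db_down:8086"]   # hypothetical host used only for this sketch
  timeout = "5s"
  skip_database_creation = true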

Expected behavior:

I would expect the memory consumption of the Telegraf container to decrease after the buffer is emptied.

Actual behavior:

Memory consumption stays high even after the buffer is emptied.

Additional info:

The image below shows the memory usage of the Telegraf docker container. Around 19:40 buffering started. Around 23:20 the buffer was full. Around 00:00 the database became available and the metrics were written to it. At that point the container memory usage increased slightly, from 2.29 GB to 2.35 GB.

The memory usage remains the same even after 12 hours.
When the database becomes unavailable again, the new round of buffering does not increase memory further (at 40% buffer usage, the memory consumption had not increased).

Is this expected behavior?

[Graph: Telegraf container memory usage over time]

danielnelson (Contributor) commented:

Definitely not expected behavior. We have had issues in the past where not all metric references were cleared, so they couldn't be garbage collected.

There are a couple of follow-up test variations that we should perform:

  • Same test as above but without Docker
  • Compare results when using a single output vs multiple outputs.

If by chance you could run either of these tests, it would be very helpful. Can you also share how you are calculating memory usage, which input plugins are used, and the underlying queries?
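
It would also help to turn collect_memstats on in the internal plugin you already have, so Go heap usage can be compared against the container numbers. A minimal sketch (the exact field names under the internal_memstats measurement are worth double-checking against the plugin docs):

# Sketch: enable Go runtime memory stats in the existing internal input
[[inputs.internal]]
  interval = "1m"
  ## When true, the plugin also emits Go runtime memory statistics
  ## (an internal_memstats measurement), so heap-in-use can be compared
  ## with the memory the container reports.
  collect_memstats = true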

danielnelson added the bug and ready labels on Mar 26, 2020
dpajin (Contributor, Author) commented Mar 27, 2020

Memory usage is collected by another instance of Telegraf running directly on the host, using the docker input plugin. The data is again exported to InfluxDB, and the following query is used in Grafana to draw this graph:

SELECT mean("usage") FROM "docker_container_mem" WHERE "node" =~ /^$node$/ AND $timeFilter GROUP BY time($__interval), "container_name" fill(null)

Okay, I will try to run those tests as suggested and will come back with the results.

danielnelson added this to the Planned milestone on Mar 27, 2020
sjwang90 removed the ready label on Jan 29, 2021
sjwang90 removed this from the Planned milestone on Jan 29, 2021
ssoroka (Contributor) commented Mar 8, 2021

@dpajin is this still an issue?

sjwang90 (Contributor) commented:

@dpajin Closing this issue. Feel free to re-open if it still persists.
