fluentd DaemonSet degrades and stops sending logs #87

Closed
rloporp opened this issue Jun 7, 2023 · 5 comments
rloporp commented Jun 7, 2023

We found that after some time the fluentd DaemonSet degrades and stops sending logs to Logz.io. The degradation can take anywhere from a couple of hours to 2 or 3 days to appear. We haven't been able to keep a DaemonSet working for a whole week.

Research

We've confirmed in our monitoring systems that the problem starts as soon as the error (fluentd.output.status.num.errors.gauge) and retry (fluentd.output.status.retry.count.gauge) counters rise above 0. The DaemonSet remains completely healthy in terms of memory and CPU (no throttling).

Environment

We've seen this behavior with release 1.3.0. We've tried release 1.4.0 and the problem arises even faster.

Workaround

When that happens, logs stop being sent to Logz.io and the only known workaround is to restart the affected DaemonSet.

mirii1994 (Contributor) commented

Hi @rloporp, thanks for reporting this issue.
Can you please share details about your k8s environment? Where does it run (cloud provider), which k8s version, and what type are the nodes?

rloporp commented Jun 14, 2023

Sure, it's a kOps-managed cluster running on AWS EC2 instances:

  • kOps version 1.23.2
  • K8s version 1.23.9
  • m5.xlarge instances (4 vCPU, 16GB RAM)

We've investigated on our own, and disk IOPS seems to be the bottleneck. We're testing memory-based buffers instead and the results look promising.
We've also noticed that the problem tends to begin when fluentd tries to forward a big log entry (even one within limits) to Logz.io. In that scenario, Logz.io rejects the request, fluentd keeps retrying the failed entry, and the buffer fills up with all subsequent logs.
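For illustration, a minimal sketch of what a memory-based buffer with bounded retries could look like on the logzio output (the match pattern, endpoint, and all limit values here are placeholders, not our actual settings):

  <match kubernetes.**>
    @type logzio_buffered
    endpoint_url https://listener.logz.io:8071?token=****
    <buffer>
      @type memory                       # memory buffer to avoid the disk IOPS bottleneck
      chunk_limit_size 1m                # placeholder: keep chunks small
      queue_limit_length 64              # placeholder: bound total memory usage
      flush_interval 5s
      retry_max_times 10                 # give up on a repeatedly rejected chunk instead of retrying forever
      overflow_action drop_oldest_chunk  # don't block new logs when the queue is full
    </buffer>
  </match>

With a retry limit set, fluentd eventually discards a chunk that keeps being rejected, so a single bad entry no longer blocks everything queued behind it.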

mirii1994 (Contributor) commented

@rloporp thank you.
What was the size of the log entry that was rejected by logz.io's listener?
What was the response from logz.io?

rloporp commented Jun 16, 2023

About the log size, entries are currently truncated to 5000 characters (in terms of byte size, I think they're UTF-8 characters, but I'm not 100% sure) plus minimal Spring Boot headers.
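(For illustration only, a hypothetical sketch of how that kind of truncation could be enforced inside fluentd with record_transformer, in case it helps reproduce; the field name "log" is an assumption:)

  <filter kubernetes.**>
    @type record_transformer
    enable_ruby true
    <record>
      # assumption: the message lives in the "log" field; cut it to 5000 characters
      log ${record["log"].to_s[0, 5000]}
    </record>
  </filter>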

About the error and Logz.io's response:

This is an old error (after one month, I could only retrieve a couple of lines):

2023-05-11 12:11:37 +0000 [error]: #0 [out_logzio] Error while sending POST to https://listener.logz.io:8071?token=****: {"malformedLines":0,"oversizedLines":1,"successfulLines":23}
2023-05-11 12:11:37 +0000 [warn]: #0 [out_logzio] failed to flush the buffer. retry_times=9 next_retry_time=2023-05-11 12:12:09 +0000 chunk="5fb43eb75e431a2312837c16b91a1bdc" error_class=RuntimeError error="Logzio listener returned (400) for https://listener.logz.

Current errors:

2023-06-16 15:32:08 +0000 [info]: #0 [out_logzio] Got 400 code from Logz.io. This means that some of your logs are too big, or badly formatted. Response: {"malformedLines":0,"oversizedLines":1,"successfulLines":6}
...
2023-06-16 15:33:25 +0000 [info]: #0 [out_logzio] Got 400 code from Logz.io. This means that some of your logs are too big, or badly formatted. Response: {"malformedLines":0,"oversizedLines":1,"successfulLines":54}
...
2023-06-16 16:31:59 +0000 [info]: #0 [out_logzio] Got 400 code from Logz.io. This means that some of your logs are too big, or badly formatted. Response: {"malformedLines":0,"oversizedLines":1,"successfulLines":56}
...
2023-06-16 16:33:12 +0000 [info]: #0 [out_logzio] Got 400 code from Logz.io. This means that some of your logs are too big, or badly formatted. Response: {"malformedLines":0,"oversizedLines":1,"successfulLines":4}
...
2023-06-16 16:51:52 +0000 [info]: #0 [out_logzio] Got 400 code from Logz.io. This means that some of your logs are too big, or badly formatted. Response: {"malformedLines":0,"oversizedLines":1,"successfulLines":17}

Zoom in on the last (16:51:52) error:

2023-06-16 16:51:38 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:40 +0000 [info]: #0 [filter_kubernetes_metadata] stats - namespace_cache_size: 5, pod_cache_size: 16, pod_cache_host_updates: 1545, pod_watch_gone_errors: 98, pod_watch_gone_notices: 98, namespace_cache_api_updates: 1063, pod_cache_api_updates: 1063, id_cache_miss: 1063
2023-06-16 16:51:41 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:44 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:44 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:45 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:46 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:46 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:48 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:50 +0000 [info]: #0 [spring_multi_lines] Timeout flush: kubernetes.var.log.containers.****.log:default
2023-06-16 16:51:52 +0000 [info]: #0 [out_logzio] Got 400 code from Logz.io. This means that some of your logs are too big, or badly formatted. Response: {"malformedLines":0,"oversizedLines":1,"successfulLines":17}

mirii1994 (Contributor) commented

Hi @rloporp,
Please try the latest Docker image (1.5.0) and let us know if it solves your issue. Thanks!

@ralongit closed this as completed Sep 5, 2024