Description
Logs are missing intermittently when using the default memory buffer configuration. As a fix I switched to the filesystem buffer in the Fluent Bit configuration, but retry counts increased significantly and frequent upstream connection errors now appear in the logs. It looks like the agent is unable to reliably send data to the upstream output when filesystem buffering is enabled.
Can someone please review my configuration and point out what might be causing this? Also, does moving to the filesystem buffer even help here, given that it leads to so many upstream connection errors?
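Concretely, the switch to filesystem buffering amounts to the following fragment of the Logging spec (just a sketch of the relevant fields; the full resource is further down):

  fluentbit:
    bufferStorageVolume:
      hostPath:
        path: /fluentbit-buffers-zone-a   # node-local directory backing the filesystem buffer
    bufferStorage:
      storage.max_chunks_up: 1000
      storage.backlog.mem_limit: 500M
    inputTail:
      storage.type: filesystem            # previously unset, i.e. the default in-memory buffering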
Without the filesystem buffer, these are the logs I'm seeing:
paused (mem buf overlimit)
resume (mem buf overlimit)
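My understanding is that these pause/resume messages mean the tail input hit its memory limit; while the input is paused it stops reading, so files that rotate away in the meantime are lost, which would explain the missing records. The limit comes from the tail input settings, roughly this (sketch of my pre-change configuration):

  fluentbit:
    inputTail:
      Buffer_Chunk_Size: 300k
      Buffer_Max_Size: 500MB
      Mem_Buf_Limit: 500MB   # tail is paused once its in-memory chunks reach this size
      # no storage.type set, so the default memory buffering applies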
With the filesystem buffer, these are the logs I'm seeing:
[2025/11/05 13:24:02] [error] [upstream] connection #1771 to tcp://172.20.193.46:24240 timed out after 10 seconds (connection timeout)
[ warn] [engine] failed to flush chunk '1-1762346821.29418039.flb', retry in 550 seconds: task_id=1106, input=storage_backlog.1 > output=forward.0 (out_id=0)
[2025/11/05 13:21:29] [ warn] [engine] failed to flush chunk '1-1762347048.635527405.flb', retry in 394 seconds: task_id=1667, input=storage_backlog.1 > output=forward.0 (out_id=0)
[2025/11/05 13:21:29] [ warn] [engine] failed to flush chunk '1-1762348643.438153342.flb', retry in 76 seconds: task_id=934, input=tail.0 > output=forward.0 (out_id=0)
[2025/11/05 13:21:29] [ warn] [engine] failed to flush chunk '1-1762348472.804439701.flb', retry in 363 seconds: task_id=1632, input=tail.0 > output=forward.0 (out_id=0)
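The full Logging resource I'm currently running (with the filesystem buffer enabled) is below: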
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: k8s-prod-logging-isolated
  namespace: logs
spec:
  controlNamespace: logs
  enableRecreateWorkloadOnImmutableFieldChange: true
  errorOutputRef: error-file-prod
  fluentbit:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: cf-group
                  operator: In
                  values:
                    - prodint
    bufferStorageVolume:
      hostPath:
        path: /fluentbit-buffers-zone-a
    customParsers: |
      [PARSER]
          Name cri-log-key
          Format regex
          Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
          Time_Key time
          Time_Format %Y-%m-%dT%H:%M:%S.%L%z
    filterModify:
      - rules:
          - Rename:
              key: log
              value: message
    image:
      pullPolicy: Always
      repository: dummy.dkr.ecr.ap-south-1.amazonaws.com/thirdparty-tools
      tag: fluent-bit_3.2.5_multiarch
    bufferStorage:
      storage.max_chunks_up: 1000
      storage.backlog.mem_limit: 500M
    forwardOptions:
      storage.total_limit_size: "10G"
    inputTail:
      Buffer_Chunk_Size: 300k
      Buffer_Max_Size: 500MB
      Ignore_Older: 10m
      Mem_Buf_Limit: 500MB
      storage.type: filesystem
      Rotate_Wait: "30"
      Refresh_Interval: "1"
      Parser: cri-log-key
    logLevel: info
    metrics:
      prometheusAnnotations: true
    podPriorityClassName: logging-pc
    positiondb:
      hostPath:
        path: /positiondb-zone-a
    resources:
      limits:
        memory: 2500Mi
      requests:
        cpu: 351m
        memory: 1000Mi
    tolerations:
      - effect: NoSchedule
        key: cf-group
        operator: Equal
        value: prodint
    updateStrategy:
      rollingUpdate:
        maxUnavailable: 2
  fluentd:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: cf-group
                  operator: In
                  values:
                    - fluentd-1a-1b
    bufferStorageVolume:
      pvc:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 30Gi
          volumeMode: Filesystem
    extraVolumes:
      - containerName: fluentd
        path: /fluentd/kafka/tls
        volume:
          secret:
            secretName: kafka-secret
        volumeName: kafka-tls
    image:
      pullPolicy: IfNotPresent
      repository: dummy.dkr.ecr.ap-south-1.amazonaws.com/thirdparty-tools
      tag: fluentd_5.1.1-full
    logLevel: info
    metrics:
      prometheusAnnotations: true
    resources:
      limits:
        cpu: 2500m
        memory: 4200Mi
      requests:
        cpu: 2200m
        memory: 4200Mi
    scaling:
      drain:
        enabled: false
      replicas: 13
    tolerations:
      - effect: NoSchedule
        key: nodegroup
        operator: Equal
        value: fluentd-1a-1b
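For reference, these are the buffer-related fields in the spec and what I understand them to control (the comments reflect my reading of the Fluent Bit docs, so please correct me if any of this is wrong):

  fluentbit:
    bufferStorage:
      storage.max_chunks_up: 1000        # cap on chunks kept in memory ("up") while filesystem storage is active
      storage.backlog.mem_limit: 500M    # memory budget for replaying chunks left on disk from a previous run (the storage_backlog input seen in the logs)
    forwardOptions:
      storage.total_limit_size: "10G"    # per-output cap for queued filesystem chunks; oldest chunks are discarded beyond this
    inputTail:
      Mem_Buf_Limit: 500MB               # throttles ingestion only for memory buffering; with storage.type filesystem the input keeps reading and spills to disk
      storage.type: filesystem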