Logs are missing intermittently in version 5.1.1 #2131

@sujay2306

Description

Logs are intermittently missing when using the default memory buffer configuration. As a workaround I switched to the filesystem buffer in the Fluent Bit configuration, but since then retry counts have increased significantly and frequent upstream connection error messages appear in the logs. It looks like the agent is unable to reliably send data to the upstream output when filesystem buffering is enabled.

Could someone please review my configuration below and point out what might be causing this? Also, does moving to the filesystem buffer even help here, given that it leads to so many upstream connection errors?

Without the filesystem buffer, these are the logs I'm seeing:

paused (mem buf overlimit)
resume (mem buf overlimit)

With the filesystem buffer, these are the logs I'm seeing:

[2025/11/05 13:24:02] [error] [upstream] connection #1771 to tcp://172.20.193.46:24240 timed out after 10 seconds (connection timeout)
[2025/11/05 13:21:29] [ warn] [engine] failed to flush chunk '1-1762346821.29418039.flb', retry in 550 seconds: task_id=1106, input=storage_backlog.1 > output=forward.0 (out_id=0)
[2025/11/05 13:21:29] [ warn] [engine] failed to flush chunk '1-1762347048.635527405.flb', retry in 394 seconds: task_id=1667, input=storage_backlog.1 > output=forward.0 (out_id=0)
[2025/11/05 13:21:29] [ warn] [engine] failed to flush chunk '1-1762348643.438153342.flb', retry in 76 seconds: task_id=934, input=tail.0 > output=forward.0 (out_id=0)
[2025/11/05 13:21:29] [ warn] [engine] failed to flush chunk '1-1762348472.804439701.flb', retry in 363 seconds: task_id=1632, input=tail.0 > output=forward.0 (out_id=0)
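
If it helps, the change I was planning to try next is to raise the forward output's connection timeout and keep upstream connections alive, roughly as sketched below. The field names (network.connectTimeout, network.keepalive, forwardOptions.Retry_Limit) come from my reading of the logging-operator and Fluent Bit docs, so please treat them as assumptions on my part rather than something I have verified:

fluentbit:
  # assumption: the operator's network section maps to Fluent Bit's net.* upstream settings
  network:
    connectTimeout: 30        # the timeouts above hit the 10 second default
    keepalive: true           # reuse upstream connections instead of reconnecting on every flush
    keepaliveIdleTimeout: 30
  forwardOptions:
    Retry_Limit: "10"         # cap retries so chunks are not rescheduled indefinitely
    storage.total_limit_size: "10G"

Does that look like the right direction, or is the root cause somewhere else (for example fluentd not keeping up with the forward traffic)?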

apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: k8s-prod-logging-isolated
  namespace: logs
spec:
  controlNamespace: logs
  enableRecreateWorkloadOnImmutableFieldChange: true
  errorOutputRef: error-file-prod
  fluentbit:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cf-group
              operator: In
              values:
              - prodint

    bufferStorageVolume:
      hostPath:
        path: /fluentbit-buffers-zone-a
    customParsers: "[PARSER]\n    Name cri-log-key\n    Format regex\n    Regex ^(?<time>[^
      ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$\n    Time_Key    time\n
      \   Time_Format %Y-%m-%dT%H:%M:%S.%L%z \n"
    filterModify:
    - rules:
      - Rename:
          key: log
          value: message
    image:
      pullPolicy: Always
      repository: dummy.dkr.ecr.ap-south-1.amazonaws.com/thirdparty-tools
      tag: fluent-bit_3.2.5_multiarch
    bufferStorage:
      storage.max_chunks_up: 1000
      storage.backlog.mem_limit: 500M
    forwardOptions:
      storage.total_limit_size: "10G"
    inputTail:
      Buffer_Chunk_Size: 300k
      Buffer_Max_Size: 500MB
      Ignore_Older: 10m
      Mem_Buf_Limit: 500MB
      storage.type: filesystem
      Rotate_Wait: "30"
      Refresh_Interval: "1"
      Parser: cri-log-key
    logLevel: info
    metrics:
      prometheusAnnotations: true
    podPriorityClassName: logging-pc
    positiondb:
      hostPath:
        path: /positiondb-zone-a
    resources:
      limits:
        memory: 2500Mi
      requests:
        cpu: 351m
        memory: 1000Mi
    tolerations:
    - effect: NoSchedule
      key: cf-group
      operator: Equal
      value: prodint


    updateStrategy:
      rollingUpdate:
        maxUnavailable: 2
  fluentd:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cf-group
              operator: In
              values:
              - fluentd-1a-1b
    bufferStorageVolume:
      pvc:
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 30Gi
          volumeMode: Filesystem
    extraVolumes:
    - containerName: fluentd
      path: /fluentd/kafka/tls
      volume:
        secret:
          secretName: kafka-secret
      volumeName: kafka-tls
    image:
      pullPolicy: IfNotPresent
      repository: dummy.dkr.ecr.ap-south-1.amazonaws.com/thirdparty-tools
      tag: fluentd_5.1.1-full
    logLevel: info
    metrics:
      prometheusAnnotations: true
    resources:
      limits:
        cpu: 2500m
        memory: 4200Mi
      requests:
        cpu: 2200m
        memory: 4200Mi
    scaling:
      drain:
        enabled: false
      replicas: 13
    tolerations:
    - effect: NoSchedule
      key: nodegroup
      operator: Equal
      value: fluentd-1a-1b
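
To confirm whether chunks are actually piling up on disk while forward.0 times out, I also plan to turn on the Fluent Bit storage metrics. A minimal sketch of that part of the spec, assuming the bufferStorage section accepts the storage.metrics key (I have not verified this against the CRD):

fluentbit:
  bufferStorage:
    storage.metrics: "on"             # expose chunk up/down and filesystem usage counters
    storage.max_chunks_up: 1000
    storage.backlog.mem_limit: 500M

If the filesystem backlog keeps growing there, that would point to throughput towards fluentd rather than the buffering mode itself.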
