inodes 100% used #2112

Closed
sekka1 opened this Issue Oct 21, 2016 · 6 comments

@sekka1

sekka1 commented Oct 21, 2016

What did you do?
I am running Prometheus in a Kubernetes cluster with an EBS-backed disk. After a day or two, it reports that it is out of disk space. Looking at the disk, inode usage for the data partition is at 100%.

What did you expect to see?
For it not to run out of inodes. Is this because of something in my config files or how I'm setting it up?

What did you see instead? Under which circumstances?

Environment

  • System information:

    /prometheus # uname -srm
    Linux 4.6.3-coreos x86_64

  • Prometheus version:

    /prometheus # prometheus -version
    prometheus, version 1.2.1 (branch: master, revision: dd66f2e94b2b662804b9aa1b6a50587b990ba8b7)
      build user:       root@fd9b0daff6bd
      build date:       20161010-15:58:23
      go version:       go1.7.1
  • Prometheus configuration file:
    global:
      scrape_interval: 15s
    # A scrape configuration for running Prometheus on a Kubernetes cluster.
    # This uses separate scrape configs for cluster components (i.e. API server, node)
    # and services to allow each to use different authentication configs.
    #
    # Kubernetes labels will be added as Prometheus labels on metrics via the
    # `labelmap` relabeling action.

      # This file comes from the kubernetes configmap
    rule_files:
    - '/etc/prometheus-rules/alert.rules'

    # Scrape config for cluster components.
    scrape_configs:
    - job_name: 'kubernetes-cluster'

      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https

      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration (`in_cluster` below) because discovery & scraping are two
      # separate concerns in Prometheus.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # If your node certificates are self-signed or use a different CA to the
        # master CA, then disable certificate verification below. Note that
        # certificate verification is an integral part of a secure infrastructure
        # so this should only be disabled in a controlled environment. You can
        # disable certificate verification by uncommenting the line below.
        #
        # insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
      - api_servers:
        - 'https://kubernetes.default.svc'
        in_cluster: true
        role: apiserver

    - job_name: 'kubernetes-nodes'

      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https

      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration (`in_cluster` below) because discovery & scraping are two
      # separate concerns in Prometheus.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # If your node certificates are self-signed or use a different CA to the
        # master CA, then disable certificate verification below. Note that
        # certificate verification is an integral part of a secure infrastructure
        # so this should only be disabled in a controlled environment. You can
        # disable certificate verification by uncommenting the line below.
        #
        # insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
      - api_servers:
        - 'https://kubernetes.default.svc'
        in_cluster: true
        role: node

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    # Scrape config for service endpoints.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
    # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
    # to set this to `https` & most likely set the `tls_config` of the scrape config.
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: If the metrics are exposed on a different port to the
    # service then set this appropriately.
    - job_name: 'kubernetes-service-endpoints'

      kubernetes_sd_configs:
      - api_servers:
        - 'https://kubernetes.default.svc'
        in_cluster: true
        role: endpoint

      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_service_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    # Example scrape config for probing services via the Blackbox Exporter.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-services'

      metrics_path: /probe
      params:
        module: [http_2xx]

      kubernetes_sd_configs:
      - api_servers:
        - 'https://kubernetes.default.svc'
        in_cluster: true
        role: service

      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_service_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name

    # Example scrape config for pods
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
    - job_name: 'kubernetes-pods'

      kubernetes_sd_configs:
      - api_servers:
        - 'https://kubernetes.default.svc'
        in_cluster: true
        role: pod

      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: (.+):(?:\d+);(\d+)
        replacement: ${1}:${2}
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_pod_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

    # Monitor etcd
    # The "etcd-cluster" is a kube service pointing to the etcd nodes
    - job_name: 'etcd'
      scrape_interval: 5s
      static_configs:
        - targets: ['etcd-cluster:2379']

Kubernetes Deployment manifest

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-monitoring
  labels:
    branch: ${WERCKER_GIT_BRANCH}
    commit: ${WERCKER_GIT_COMMIT}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-monitoring
  template:
    metadata:
      name: prometheus-monitoring
      labels:
        app: prometheus-monitoring
        branch: ${WERCKER_GIT_BRANCH}
        commit: ${WERCKER_GIT_COMMIT}
    spec:
      nodeSelector:
        caste: patrician
      containers:

      # Prometheus server
      - name: prometheus
        image: prom/prometheus:v1.2.1
        args:
          - '-storage.local.retention=72h'
          - '-storage.local.path=/home'
          - '-storage.local.memory-chunks=500000'
          - '-config.file=/etc/prometheus/prometheus.yml'
          - '-alertmanager.url=http://localhost:9093'
          - '-web.external-url=http://production-prometheus.wercker.io'
        ports:
        - name: web
          containerPort: 9090
        volumeMounts:
        - name: config-volume-prometheus
          mountPath: /etc/prometheus
        - name: config-volume-alert-rules
          mountPath: /etc/prometheus-rules
        - name: prometheus-data
          mountPath: /home
        # resources:
        #   limits:
        #     cpu: 8000m
        #     memory: 8000Mi
        #   requests:
        #     cpu: 1000m
        #     memory: 1000Mi

      # Alertmanager
      - name: alertmanager
        image: quay.io/prometheus/alertmanager:v0.4.2
        args:
          - -config.file=/etc/prometheus/alertmanager.yml
        volumeMounts:
        - name: config-volume-alertmanager
          mountPath: /etc/prometheus

      # Volumes and config maps
      volumes:
      - name: config-volume-prometheus
        configMap:
          name: prometheus
      - name: config-volume-alertmanager
        configMap:
          name: prometheus-alertmanager
      - name: config-volume-alert-rules
        configMap:
          name: prometheus-alert-rules
      - name: prometheus-data
        awsElasticBlockStore:
          volumeID: ${AWS_EBS_VOLUME}
          fsType: ext4
  • Logs:
    The data is in the /home directory.
/prometheus # df -i
Filesystem              Inodes      Used Available Use% Mounted on
overlay              131072000    472628 130599372   0% /
tmpfs                  2054523        18   2054505   0% /dev
tmpfs                  2054523        16   2054507   0% /sys/fs/cgroup
/dev/xvdba              655360    655360         0 100% /home
/dev/xvdh            131072000    472628 130599372   0% /prometheus
/dev/xvda9             1498496     25477   1473019   2% /etc/prometheus
/dev/xvda9             1498496     25477   1473019   2% /etc/prometheus-rules
/dev/xvda9             1498496     25477   1473019   2% /dev/termination-log
tmpfs                  2054523         9   2054514   0% /var/run/secrets/kubernetes.io/serviceaccount
/dev/xvdh            131072000    472628 130599372   0% /etc/resolv.conf
/dev/xvdh            131072000    472628 130599372   0% /etc/hostname
/dev/xvda9             1498496     25477   1473019   2% /etc/hosts
shm                    2054523         1   2054522   0% /dev/shm
tmpfs                  2054523        18   2054505   0% /proc/kcore
tmpfs                  2054523        18   2054505   0% /proc/latency_stats
tmpfs                  2054523        18   2054505   0% /proc/timer_stats
tmpfs                  2054523        18   2054505   0% /proc/sched_debug
/prometheus # df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                 468.6G      5.2G    438.3G   1% /
tmpfs                     7.8G         0      7.8G   0% /dev
tmpfs                     7.8G         0      7.8G   0% /sys/fs/cgroup
/dev/xvdba                9.7G      3.7G      5.5G  40% /home
/dev/xvdh               468.6G      5.2G    438.3G   1% /prometheus
/dev/xvda9                5.4G      1.7G      3.5G  33% /etc/prometheus
/dev/xvda9                5.4G      1.7G      3.5G  33% /etc/prometheus-rules
/dev/xvda9                5.4G      1.7G      3.5G  33% /dev/termination-log
tmpfs                     7.8G     12.0K      7.8G   0% /var/run/secrets/kubernetes.io/serviceaccount
/dev/xvdh               468.6G      5.2G    438.3G   1% /etc/resolv.conf
/dev/xvdh               468.6G      5.2G    438.3G   1% /etc/hostname
/dev/xvda9                5.4G      1.7G      3.5G  33% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                     7.8G         0      7.8G   0% /proc/kcore
tmpfs                     7.8G         0      7.8G   0% /proc/latency_stats
tmpfs                     7.8G         0      7.8G   0% /proc/timer_stats
tmpfs                     7.8G         0      7.8G   0% /proc/sched_debug

Logs:

time="2016-10-21T15:56:32Z" level=warning msg="Series quarantined." fingerprint=0864cf7a957a8a06 metric=container_cpu_user_seconds_total{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_os="linux", caste="peasant", container_name="POD", id="/init.scope/system.slice/docker-79abd00098418a47bcb31207e1113602b1a7386e88bb6d37e34fb46ffcdcf375.scope", image="gcr.io/google_containers/pause-amd64:3.0", instance="i-71d74367", job="kubernetes-nodes", kubernetes_io_hostname="ip-10-0-49-178.ec2.internal", name="k8s_POD.d8dbe16c_run-580a28c6fd4b5f010090a966_default_acc605d5-979c-11e6-a98e-12717dc31e5c_7abce792", namespace="default", pod_name="run-580a28c6fd4b5f010090a966"} reason="open /home/08/64cf7a957a8a06.db: no space left on device" source="storage.go:1626" 
@redbaron
Contributor

redbaron commented Oct 22, 2016

I see that the inodes on '/home', not '/prometheus', have run out. Hardly a Prometheus problem.

BTW, if you use XFS instead of ext4 you won't have this problem: XFS allocates inodes dynamically, so as long as there is free space remaining you'll be fine.
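
A minimal sketch of that route, assuming the data volume is the /dev/xvdba device mounted at /home shown in the df output above and that it can be unmounted and reformatted (this wipes the existing data):

    umount /home
    mkfs.xfs -f /dev/xvdba        # XFS has no fixed inode table; inodes are allocated on demand
    # Alternatively, stay on ext4 but create more inodes up front,
    # e.g. one inode per 4 KiB of space instead of the default ratio:
    # mkfs.ext4 -i 4096 /dev/xvdba
    mount /dev/xvdba /home
    df -i /home                   # verify the new inode headroom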

@matthiasr
Contributor

matthiasr commented Oct 22, 2016

The configuration is such that the storage directory is on /home, so the diagnosis is correct.

Prometheus creates roughly one file for every time series (unique metric/label combination). This can exhaust inodes. Possible mitigations are using XFS instead of ext4, creating the ext4 filesystem with more inodes, or creating a larger filesystem than strictly needed. I don't know offhand whether the first two are easy with Kubernetes volumes.
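
A hedged illustration of the XFS option for the volume used in this issue: the awsElasticBlockStore fsType field can request XFS instead of ext4, assuming the node image ships mkfs.xfs so the kubelet can format the volume when it is first attached. The volume ID is the same placeholder used in the Deployment above:

      - name: prometheus-data
        awsElasticBlockStore:
          volumeID: ${AWS_EBS_VOLUME}
          fsType: xfs   # formatted as XFS on first use instead of ext4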

@matthiasr
Contributor

matthiasr commented Oct 22, 2016

This mostly happens if each time series is very short lived. Because of kubelet internals, metrics about containers from the nodes include restart counters and other details that cause unnecessary time series churn. You may be able to fix them up with relabelling, or don't scrape them for now.

I also noticed that you have both endpoint and pod targets in the Prometheus configuration.
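
A hedged sketch of the relabelling route, here dropping the cAdvisor series for the "pause" infrastructure containers (container_name="POD", as seen in the quarantine log above), which churn with every pod. metric_relabel_configs filters ingested samples after the scrape; the label value to match is taken from this report and may need adjusting for other clusters:

      # added to the 'kubernetes-nodes' scrape config
      metric_relabel_configs:
      - source_labels: [container_name]
        regex: POD          # drop samples from the pause/POD containers
        action: drop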

@sekka1
Author

sekka1 commented Oct 24, 2016

OK, cool. Getting a bigger disk fixed it. Looking into what having both the endpoint and pod targets gets me. Are these going to be the same metrics?

@sekka1
Author

sekka1 commented Oct 27, 2016

Thanks @matthiasr

@sekka1 sekka1 closed this Oct 27, 2016

leedm777 added a commit to leedm777/prometheus that referenced this issue Nov 21, 2017

Add recommendation for using XFS
We kept running out of inodes with Prometheus on our ext4 systems. The
recommendation to use XFS was found in blog posts and GitHub issues, but
not in the official documentation.

See prometheus#2112
@lock

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
