panic: runtime error: invalid memory address or nil pointer dereference #2955

Closed

iofxl opened this issue Jul 17, 2017 · 31 comments

iofxl commented Jul 17, 2017

What did you do?

What did you expect to see?

What did you see instead? Under which circumstances?

I see the panic in the title.

Environment

  • System information:

    Linux 2.6.32-358.el6.x86_64 x86_64

  • Prometheus version:

    2.0.0-beta.0, official build.

  • Alertmanager version:

    insert output of alertmanager -version here (if relevant to the issue)

  • Prometheus configuration file:

# my global config
global:
  scrape_interval:     30s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 30s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first.rules.yml"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:39090']

#  - job_name: 'node'
#    static_configs:
#      - targets: ['localhost:39100']
#        labels:
#         group: 'prometheus-server'

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:39100']
        labels:
         group: 'prometheus-server'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: 'vp'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'

      - targets: ['8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100','8.8.8.8:39100']
        labels:
         group: '4g'


  - job_name: 'snmp_4g'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116

  - job_name: 'snmp_4'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116

  - job_name: 'snmp_4g'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8']
        labels:
         group: 'sw'
    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116

  - job_name: 'snmp'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
      - targets: ['8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
      - targets: ['8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8']
        labels:
         group: 'swsnl'

    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116

  - job_name: 'snmp'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116

  - job_name: 'snmp'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116

  - job_name: 'snmp'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116

  - job_name: 'snmp'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116

  - job_name: 'snmp'
    scrape_interval:     3m
    scrape_timeout:     2m
    static_configs:
      - targets: ['8.8.8.8','8.8.8.8','8.8.8.8']
        labels:
         group: 'sw'
    metrics_path: /snmp
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 8.8.8.8:39116




  • Alertmanager configuration file:
insert configuration here (if relevant to the issue)
  • Logs:
time="2017-07-14T15:45:59+08:00" level=info msg="Starting prometheus (version=2.0.0-beta.0, branch=master, revision=2b5d9159537cbd123219296121e05244e26c0940)" source="main.go:202" 
time="2017-07-14T15:45:59+08:00" level=info msg="Build context (go=go1.8.3, user=root@fc24486243df, date=20170712-12:21:13)" source="main.go:203" 
time="2017-07-14T15:45:59+08:00" level=info msg="Host details (Linux 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 ns1 (none))" source="main.go:204" 
time="2017-07-14T15:45:59+08:00" level=info msg="Starting tsdb" source="main.go:216" 
time="2017-07-14T15:48:01+08:00" level=info msg="tsdb started" source="main.go:222" 
time="2017-07-14T15:48:01+08:00" level=info msg="Loading configuration file prometheus.yml" source="main.go:351" 
time="2017-07-14T15:48:01+08:00" level=info msg="Listening on :39090" source="web.go:286" 
time="2017-07-14T15:48:01+08:00" level=info msg="Starting target manager..." source="targetmanager.go:67" 
ts=2017-07-14T07:48:31.805777219Z caller=compact.go:235 msg="compact blocks" blocks=[01BMZM1NMKB3862CZH451K3TKW]
ts=2017-07-14T08:00:00.158718549Z caller=db.go:867 msg="created head block" ulid=01BN01S43N9GVC7SPN7DC825SP mint=1500019200000 maxt=1500026400000
ts=2017-07-14T09:00:01.808299206Z caller=compact.go:235 msg="compact blocks" blocks=[01BMZTXCWNV88DXS2AW5VSKV3H]
ts=2017-07-14T09:00:12.039838082Z caller=compact.go:235 msg="compact blocks" blocks="[01BMZTXCZ04R8WFDAB6V56QPPA 01BN0143XXEDFBJP9GMQEF6902 01BN0571CG0AW8CFY0WMG4S3HT]"
ts=2017-07-14T10:00:00.156138675Z caller=db.go:867 msg="created head block" ulid=01BN08MVBN8W69Z1T5DKHC6FKE mint=1500026400000 maxt=1500033600000
ts=2017-07-14T11:00:01.818372645Z caller=compact.go:235 msg="compact blocks" blocks=[01BN01S43N9GVC7SPN7DC825SP]
ts=2017-07-14T12:00:00.154268008Z caller=db.go:867 msg="created head block" ulid=01BN0FGJKNTCM5V3R1BHG0DNP9 mint=1500033600000 maxt=1500040800000
ts=2017-07-14T13:00:01.81024889Z caller=compact.go:235 msg="compact blocks" blocks=[01BN08MVBN8W69Z1T5DKHC6FKE]
ts=2017-07-14T14:00:00.150149774Z caller=db.go:867 msg="created head block" ulid=01BN0PC9VMS07SFV674GKAQHKM mint=1500040800000 maxt=1500048000000
ts=2017-07-14T15:00:01.808408261Z caller=compact.go:235 msg="compact blocks" blocks=[01BN0FGJKNTCM5V3R1BHG0DNP9]
ts=2017-07-14T15:00:11.004250575Z caller=compact.go:235 msg="compact blocks" blocks="[01BN0C2RMTFXD4HQ58GXPJPD70 01BN0JYFWJGYWN2CRD7QMKE5X2 01BN0ST74GJPWC1YTY20F5BXRP]"
ts=2017-07-14T16:00:00.153046619Z caller=db.go:867 msg="created head block" ulid=01BN0X813NW0V7JF13AY3XRY6H mint=1500048000000 maxt=1500055200000
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1662f11]

goroutine 232 [running]:
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).append(0xc420178620, 0xc45187e000, 0x1db8b, 0x20200, 0xed0fae9fa, 0x255cdb36, 0x29132a0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/prometheus/prometheus/retrieval/scrape.go:759 +0x1a01
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run(0xc420178620, 0x29e8d60800, 0x1bf08eb000, 0x0)
	/go/src/github.com/prometheus/prometheus/retrieval/scrape.go:624 +0x48c
created by github.com/prometheus/prometheus/retrieval.(*scrapePool).sync
	/go/src/github.com/prometheus/prometheus/retrieval/scrape.go:281 +0x534
time="2017-07-15T23:10:01+08:00" level=info msg="Starting prometheus (version=2.0.0-beta.0, branch=master, revision=2b5d9159537cbd123219296121e05244e26c0940)" source="main.go:202" 
time="2017-07-15T23:10:01+08:00" level=info msg="Build context (go=go1.8.3, user=root@fc24486243df, date=20170712-12:21:13)" source="main.go:203" 
time="2017-07-15T23:10:01+08:00" level=info msg="Host details (Linux 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 ns1 (none))" source="main.go:204" 
time="2017-07-15T23:10:01+08:00" level=info msg="Starting tsdb" source="main.go:216" 
time="2017-07-15T23:11:29+08:00" level=info msg="tsdb started" source="main.go:222" 
time="2017-07-15T23:11:29+08:00" level=info msg="Loading configuration file prometheus.yml" source="main.go:351" 
time="2017-07-15T23:11:29+08:00" level=info msg="Starting target manager..." source="targetmanager.go:67" 
time="2017-07-15T23:11:29+08:00" level=info msg="Listening on :39090" source="web.go:286" 
ts=2017-07-15T15:11:30.134779123Z caller=db.go:867 msg="created head block" ulid=01BN3CVY9HY0796F9955C3E3Y0 mint=1500120000000 maxt=1500127200000
ts=2017-07-15T15:11:30.165463703Z caller=db.go:867 msg="created head block" ulid=01BN3CVYAPZAXS77ZDERSZRV1C mint=1500127200000 maxt=1500134400000
ts=2017-07-15T15:11:59.982241209Z caller=compact.go:235 msg="compact blocks" blocks=[01BN0PC9VMS07SFV674GKAQHKM]
ts=2017-07-15T15:12:08.962022735Z caller=compact.go:235 msg="compact blocks" blocks=[01BN0X813NW0V7JF13AY3XRY6H]
ts=2017-07-15T16:00:00.108907164Z caller=db.go:867 msg="created head block" ulid=01BN3FMR2100S2N0DN6MMZBKKK mint=1500134400000 maxt=1500141600000
ts=2017-07-15T17:00:29.984555746Z caller=compact.go:235 msg="compact blocks" blocks=[01BN3CVYAPZAXS77ZDERSZRV1C]
ts=2017-07-15T17:00:37.257511167Z caller=compact.go:235 msg="compact blocks" blocks="[01BN3CWVFE8JT4EA97M61WT31E 01BN3CX481ZGS0222QW9BEQMTW 01BN3K3GX0HS0AFHQY6PANB25S]"
ts=2017-07-15T18:00:00.108444905Z caller=db.go:867 msg="created head block" ulid=01BN3PGFA1F50EDHD6JAPZ30TN mint=1500141600000 maxt=1500148800000
ts=2017-07-15T19:00:29.986600081Z caller=compact.go:235 msg="compact blocks" blocks=[01BN3FMR2100S2N0DN6MMZBKKK]
ts=2017-07-15T20:00:00.11601485Z caller=db.go:867 msg="created head block" ulid=01BN3XC6J1MW0SMG7GHKX6TBAX mint=1500148800000 maxt=1500156000000
ts=2017-07-15T21:00:29.986676429Z caller=compact.go:235 msg="compact blocks" blocks=[01BN3PGFA1F50EDHD6JAPZ30TN]
ts=2017-07-15T22:00:00.1149678Z caller=db.go:867 msg="created head block" ulid=01BN447XT37BWGT9EJ5K2R1A7B mint=1500156000000 maxt=1500163200000
ts=2017-07-15T23:00:29.982945491Z caller=compact.go:235 msg="compact blocks" blocks=[01BN3XC6J1MW0SMG7GHKX6TBAX]
ts=2017-07-15T23:00:39.481496762Z caller=compact.go:235 msg="compact blocks" blocks="[01BN3SZ852H5ZSCK8PST8Q02FG 01BN40TZD2BM0QPYFZ2P05H5MR 01BN47PPMYFR32S28XKX57EVNS]"
ts=2017-07-16T00:00:00.111144418Z caller=db.go:867 msg="created head block" ulid=01BN4B3N227VMRWJDM7XJ0VXDV mint=1500163200000 maxt=1500170400000
ts=2017-07-16T01:00:29.973013395Z caller=compact.go:235 msg="compact blocks" blocks=[01BN447XT37BWGT9EJ5K2R1A7B]
ts=2017-07-16T02:00:00.112194233Z caller=db.go:867 msg="created head block" ulid=01BN4HZCA5GP7PSNVWZGXQE981 mint=1500170400000 maxt=1500177600000
ts=2017-07-16T03:00:29.985333474Z caller=compact.go:235 msg="compact blocks" blocks=[01BN4B3N227VMRWJDM7XJ0VXDV]
ts=2017-07-16T04:00:00.109957127Z caller=db.go:867 msg="created head block" ulid=01BN4RV3J36BWH4WQMPHBCKYH3 mint=1500177600000 maxt=1500184800000
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1662f11]

goroutine 276 [running]:
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).append(0xc4201f21c0, 0xc434190000, 0x1db89, 0x20200, 0xed0fcde5a, 0x395ec127, 0x29132a0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/prometheus/prometheus/retrieval/scrape.go:759 +0x1a01
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run(0xc4201f21c0, 0x29e8d60800, 0x1bf08eb000, 0x0)
	/go/src/github.com/prometheus/prometheus/retrieval/scrape.go:624 +0x48c
created by github.com/prometheus/prometheus/retrieval.(*scrapePool).sync
	/go/src/github.com/prometheus/prometheus/retrieval/scrape.go:281 +0x534

gouthamve commented Jul 26, 2017

@brian-brazil This is coming from staleness, specifically panicking here: https://github.com/prometheus/prometheus/blob/dev-2.0/retrieval/scrape.go#L536-L538

Not able to come up with an explanation as to how that could be nil though. Do you have any pointers?
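For context, a minimal sketch of the kind of cache lookup that can trip this nil pointer dereference once the ref-to-labelset mapping loses an entry. All names here are illustrative stand-ins, not the real scrapeLoop/cache types:

package main

import "fmt"

type labels map[string]string

type cacheEntry struct {
	ref  uint64
	lset labels
}

type scrapeCache struct {
	refs  map[string]*cacheEntry // scraped metric text -> entry
	lsets map[uint64]*cacheEntry // storage ref -> entry
}

func (c *scrapeCache) lsetForRef(ref uint64) labels {
	// If ref is still reachable via c.refs but its c.lsets entry has been
	// dropped, this map lookup returns a nil *cacheEntry and the field
	// access panics: "invalid memory address or nil pointer dereference".
	return c.lsets[ref].lset
}

func main() {
	c := &scrapeCache{
		refs:  map[string]*cacheEntry{},
		lsets: map[uint64]*cacheEntry{},
	}
	fmt.Println(c.lsetForRef(42)) // panics: nil map value dereferenced
}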

brian-brazil commented Jul 26, 2017

That panic looks to be from https://github.com/prometheus/prometheus/blob/dev-2.0/retrieval/scrape.go#L759

I'd say the lsets cache isn't being properly maintained at some point, though I don't see anything wrong from a quick look.

bkupidura commented Aug 22, 2017

Similar issue (not sure if it's the same). Nothing special in the config: static targets plus remote write to a remote storage adapter.

prometheus, version 2.0.0-beta.2 (branch: HEAD, revision: a52f082)

[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x166e410]

goroutine 364 [running]:
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).append(0xc420207180, 0xc566666000, 0xeb58c, 0x101e00, 0xed12df87b, 0x341233e8, 0x294c1e0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:777 +0x19a0
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run(0xc420207180, 0x37e11d600, 0x37e11d600, 0x0)
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:646 +0x48c
created by github.com/prometheus/prometheus/retrieval.(*scrapePool).sync
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:302 +0x534
2017-08-22 10:03:41.996 - INFO - Process exited with code 2

tudor commented Aug 29, 2017

@igorcanadi and I are hitting (what looks like) the same thing in 2.0.0-beta.2:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x166e410]

goroutine 422 [running]:
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).append(0xc47b8563f0, 0xc45b0f4000, 0x162088, 0x203e00, 0xed136d7db, 0xa75f94f, 0x294c1e0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:777 +0x19a0
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run(0xc47b8563f0, 0x6fc23ac00, 0x37e11d600, 0x0)
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:646 +0x48c
created by github.com/prometheus/prometheus/retrieval.(*scrapePool).sync
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:302 +0x534

Relevant code:

                ref, ok := sl.cache.getRef(yoloString(met))
                if ok {
                        lset := sl.cache.lsets[ref].lset

tudor commented Aug 29, 2017

(also, yoloString is one of my favorite function names now :) )
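For the curious, yoloString is (roughly) a zero-copy []byte-to-string conversion using the unsafe package; it is only safe while the scraped byte buffer is neither reused nor mutated, hence the name. A minimal sketch of the pattern (not necessarily the exact upstream code):

package main

import (
	"fmt"
	"unsafe"
)

// yoloString reinterprets a byte slice as a string without copying.
func yoloString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	buf := []byte(`up{job="node"}`)
	fmt.Println(yoloString(buf)) // prints: up{job="node"}
}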

johrstrom commented Sep 1, 2017

I'm not sure if this is other users' experience, but I get this every 2 hours like clockwork, at the top of the hour. It's almost surely something to do with the Kubernetes discovery and/or just high load on the server, because it only happens in our large environment and not on the smaller clusters.

igorcanadi commented Sep 1, 2017

Our prometheus died 120 times in the last 2 days with this stack trace, so it's a bit more often than every two hours:

NAME                                      READY     STATUS    RESTARTS   AGE
prometheus-4201869658-2f1tr               1/1       Running   120        2d

aecolley commented Sep 4, 2017

I'm getting this every 2-3 minutes. This Prometheus (2.0.0-beta.2) is doing nothing but scraping a busier v1.3.0 Prometheus on localhost. Here's its config file in its entirety:

# To ease later transition, this prometheus (running 2.0.0 beta 2)
# federates all data from the production prometheus on this host, and
# writes all the data to a new state directory, without doing anything else.
# aecolley, 2017-08-29

global:
  scrape_interval: 1m
  scrape_timeout: 20s
  evaluation_interval: 1m
  external_labels:
    zone: d

scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~".+"}'
    static_configs:
      - targets:
          - '127.0.0.1:59090'
        labels:
          instance: ''
    relabel_configs:
      - source_labels: [job]
        target_label: job
        regex: 'federate'
        replacement: ''

The stack trace is practically identical:

time="2017-09-04T15:44:43-04:00" level=info msg="Listening on :49090" source="web.go:327"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x16715b0]

goroutine 68 [running]:
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).append(0xc4201562a0, 0xc42e6c0000, 0xaf340d, 0x101fe00, 0xed13fa4d2, 0x72a7dd4, 0x295b4e0, 0x0, 0x0, 0x0, ...)
        /builddir/build/BUILD/gopath/src/github.com/prometheus/prometheus/retrieval/scrape.go:777 +0x19a0
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run(0xc4201562a0, 0xdf8475800, 0x4a817c800, 0x0)
        /builddir/build/BUILD/gopath/src/github.com/prometheus/prometheus/retrieval/scrape.go:646 +0x48c
created by github.com/prometheus/prometheus/retrieval.(*scrapePool).sync
        /builddir/build/BUILD/gopath/src/github.com/prometheus/prometheus/retrieval/scrape.go:302 +0x534
time="2017-09-04T15:47:32-04:00" level=info msg="Starting prometheus (version=2.0.0-beta.2, branch=tripadvisor/releng-2.0.0, revision=tr-prometheus-2.0.0-0.6)" source="main.go:204"

The only thing out of the ordinary is that the v1.3.0 prometheus (the federation source for the crashing prometheus) regularly logs "Error on ingesting out-of-order samples". That's expected because it's federating data from two fully-redundant prometheus servers, so some merging of the timeseries is normal.

Other than the crashes, beta.2 looks great. ⛅️

igorcanadi commented Sep 5, 2017

@tudor and I traced this down to app.Add() returning duplicate refs for different metrics here: https://github.com/prometheus/prometheus/blob/dev-2.0/retrieval/scrape.go#L833. When there are duplicate refs, two elements in c.refs point to the same element in c.lsets. This means that a call to addRef leaves one element in c.refs without its corresponding c.lsets element, because it's deleted here: https://github.com/prometheus/prometheus/blob/dev-2.0/retrieval/scrape.go#L549
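A minimal, self-contained model of that failure mode; all names and values below are illustrative stand-ins, not the real scrapeCache code:

package main

import "fmt"

func main() {
	// Illustrative stand-ins for the scrape cache maps:
	refs := map[string]uint64{}  // scraped metric text -> ref
	lsets := map[uint64]string{} // ref -> label set (simplified to a string)

	// The bug: storage hands out the SAME ref for two different metrics.
	refs[`metric{id="a"}`] = 7
	refs[`metric{id="b"}`] = 7
	lsets[7] = `metric{id="b"}` // the second entry overwrites the first

	// Later, cache maintenance deletes the lsets entry while cleaning up
	// one of the two metrics (as at the second link above):
	delete(lsets, 7)

	// The other metric still resolves to ref 7 via refs, but its lsets
	// entry is gone -- in the real code the next append dereferences the
	// missing entry and panics.
	_, ok := lsets[refs[`metric{id="a"}`]]
	fmt.Println("lset entry still present?", ok) // false
}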

Here is a sample of our logging that we added:

time="2017-09-05T20:14:02Z" level=info msg="Not found in the cache, producing ref 015e53a12d914b5a37007a6fd8f30b73000000000002083a for metric container_cpu_cfs_periods_total{id="/kubepods/burstable/pod2f30f3e5-8844-11e7-94df-0a365eaab0be"}" source="scrape.go:837" target="{__address__="XXXX:443", __metrics_path__="/api/v1/nodes/ip-10-0-114-253.ec2.internal:4194/proxy/metrics", __scheme__="https", beta_kubernetes_io_arch="amd64", beta_kubernetes_io_instance_type="m4.large", beta_kubernetes_io_os="linux", failure_domain_beta_kubernetes_io_region="us-east-1", failure_domain_beta_kubernetes_io_zone="us-east-1c", instance="ip-10-0-114-253.ec2.internal", job="kubernetes-cadvisor", kubernetes_io_hostname="ip-10-0-114-253.ec2.internal", kubernetes_io_role="master", node_role_kubernetes_io_master=""}"
time="2017-09-05T20:14:02Z" level=info msg="Not found in the cache, producing ref 015e53a12d914b5a37007a6fd8f30b73000000000002083a for metric container_cpu_cfs_periods_total{id="/kubepods/pod6492c4a0-8d1d-11e7-a32b-12a0b9c354ec"}" source="scrape.go:837" target="{__address__="XXXX:443", __metrics_path__="/api/v1/nodes/ip-10-0-114-253.ec2.internal:4194/proxy/metrics", __scheme__="https", beta_kubernetes_io_arch="amd64", beta_kubernetes_io_instance_type="m4.large", beta_kubernetes_io_os="linux", failure_domain_beta_kubernetes_io_region="us-east-1", failure_domain_beta_kubernetes_io_zone="us-east-1c", instance="ip-10-0-114-253.ec2.internal", job="kubernetes-cadvisor", kubernetes_io_hostname="ip-10-0-114-253.ec2.internal", kubernetes_io_role="master", node_role_kubernetes_io_master=""}"
panic: Duplicate ref!

As you can see, both container_cpu_cfs_periods_total{id="/kubepods/burstable/pod2f30f3e5-8844-11e7-94df-0a365eaab0be"} and container_cpu_cfs_periods_total{id="/kubepods/pod6492c4a0-8d1d-11e7-a32b-12a0b9c354ec"} map to the same ref 015e53a12d914b5a37007a6fd8f30b73000000000002083a.

We also looked at where the uniqueness of refs is guaranteed; it seems to come from the length of h.series here: https://github.com/prometheus/tsdb/blob/master/head.go#L709. This sounds like a race issue, but we weren't able to uncover it by reading the code.
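To illustrate the suspected race: if the ref is taken from the current length of h.series and two appenders interleave, they can compute the same ref. A compressed, deterministic sketch (written sequentially to force the interleaving; illustrative only):

package main

import "fmt"

func main() {
	series := map[uint64]string{}

	id1 := uint64(len(series)) // appender 1 reads the length...
	id2 := uint64(len(series)) // ...appender 2 reads it before appender 1 inserts

	series[id1] = `metric{id="a"}`
	series[id2] = `metric{id="b"}` // clobbers the first entry

	fmt.Println(id1 == id2, len(series)) // true 1 -- a duplicate ref
}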

fabxc commented Sep 7, 2017

Thanks for the extensive debugging! That sounds like a very probable cause, though I can't think of a race in that particular case off the top of my head.

Regardless, current work in prometheus/tsdb reworks that section entirely. So this bug should disappear. Let's hope it doesn't get replaced with a similarly severe one :)

igorcanadi commented Sep 8, 2017

Awesome! When do you think it'll be ready? Let us know if you'd like some beta testers.

aecolley commented Sep 13, 2017

Sadly, v2.0.0-beta.3 isn't faring any better:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x167e3ce]

goroutine 76 [running]:
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).append(0xc420500480, 0xc430bc6000, 0xa9238e, 0x1b898d1, 0xed14b55fc, 0x27c95c51, 0x2974760, 0x0, 0x0, 0x0, ...)
        /builddir/build/BUILD/gopath/src/github.com/prometheus/prometheus/retrieval/scrape.go:801 +0x18fe
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run(0xc420500480, 0xdf8475800, 0x4a817c800, 0x0)
        /builddir/build/BUILD/gopath/src/github.com/prometheus/prometheus/retrieval/scrape.go:668 +0x560
created by github.com/prometheus/prometheus/retrieval.(*scrapePool).sync
        /builddir/build/BUILD/gopath/src/github.com/prometheus/prometheus/retrieval/scrape.go:305 +0x534
time="2017-09-13T12:38:22-04:00" level=info msg="Starting prometheus (version=2.0.0-beta.3, branch=tripadvisor/releng-2.0.0, revision=tr-prometheus-2.0.0-0.7)" source="main.go:210"

igorcanadi commented Sep 13, 2017

Same here :(

fabxc commented Sep 14, 2017

Okay, so after several people took a look, the cache maintenance seems to be correct. So the duplicate series IDs that you debugged, @igorcanadi, are still a likely cause.

Before, the ID was generated from the length of known series, which should've been safe but apparently wasn't. Now we have an atomic counter, which is only touched here for ID assignment. I can't think of anything generating duplicate IDs at this point. It would be great if you could verify that this is in fact what's happening under the new implementation though, @igorcanadi.
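For readers following along, roughly what the atomic-counter style of ID assignment looks like; a sketch under those assumptions, not the actual tsdb code:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// An atomically incremented counter hands out a unique ref per series
	// even under concurrent appenders, unlike an ID derived from the
	// current map length.
	var lastRef uint64
	refs := make(chan uint64, 100)

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			refs <- atomic.AddUint64(&lastRef, 1)
		}()
	}
	wg.Wait()
	close(refs)

	seen := map[uint64]bool{}
	for ref := range refs {
		seen[ref] = true
	}
	fmt.Println("unique refs handed out:", len(seen)) // 100
}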

(We could just make the panic disappear by adding another check of course. But I believe this should actually panic because it shouldn't be possible unless there's a bug.)

fabxc commented Sep 14, 2017

@igorcanadi @aecolley an update on which configurations you are experiencing this with would be neat – just to be sure it's not some path that's only hit when using certain features.

aecolley commented Sep 14, 2017

@fabxc my configuration is the same as the one in my comment from 10 days ago.

fabxc added a commit that referenced this issue Sep 15, 2017

fabxc commented Sep 15, 2017

I made some slight changes that should prevent the panic and log the various cases that make the cache go bad. It would be really neat if people encountering this could run this version, as I cannot reproduce the issue myself.

Container: quay.io/fabxc/prometheus:v2.0.0-beta.4-debug-2955

Or build it yourself from the debug-2955 branch.

This doesn't solve anything but would at least confirm that it is an issue with duplicated IDs, even with the latest changes in prometheus/tsdb.

igorcanadi commented Sep 15, 2017

@fabxc I no longer get the crash with your build, but now prometheus is not responsive on its HTTP port.

Log:

time="2017-09-15T18:43:04Z" level=info msg="Starting prometheus (version=2.0.0-beta.4, branch=dev-2.0, revision=1b80f631a88a01473c1356d50da12eb49b462050)" source="main.go:210"
time="2017-09-15T18:43:04Z" level=info msg="Build context (go=go1.9, user=fabxc@Fabians-MBP-2.ber.coreos.systems, date=20170915-09:06:15)" source="main.go:211"
time="2017-09-15T18:43:04Z" level=info msg="Host details (Linux 4.4.65-k8s #1 SMP Tue May 2 15:48:24 UTC 2017 x86_64 prometheus-2690957710-46n7s (none))" source="main.go:212"
time="2017-09-15T18:43:04Z" level=info msg="Starting tsdb" source="main.go:224"

fabxc commented Sep 16, 2017

I think you are hitting the one other known bug there, which is some form of deadlock on startup. It's not panicking because it's not even scraping :) I'll have to investigate that one separately.

Can you check whether it starts listening after another restart, or otherwise try with a clean storage directory?

bkupidura commented Sep 16, 2017

@fabxc I can also run the "fixed" binary, but instead of a Docker image, could you just share the Prometheus binary?

fabxc added a commit that referenced this issue Sep 18, 2017

aecolley commented Sep 18, 2017

The debug binary doesn't crash. I ran it for 21 minutes and 43 seconds, in which time it wrote 279320 lines to standard error.

236385 of them were of the form

msg="iterDone: deleted lset var{labels} for ref \d+; entry for var{labels shuffled} with same ref still exists" source="scrape.go:530"

42923 of them were of the form

msg="no lset entry found for ref \d+" source="scrape.go:844" timeseries="var{labels}"

The remaining 12 lines were the familiar startup/shutdown lines.

fabxc commented Sep 19, 2017

Thanks @aecolley, that's extremely helpful. Do I understand correctly that it's the exact same series, just in a different ordering (as exposed by the target)?
That's good insofar as it means the storage is not handing out duplicated IDs/refs.

It's bad because it literally means the code we are logging from doesn't fully do what its docs specify it should do – at least not correctly.

We definitely have to handle targets that expose the same series in inconsistent order. But we should encourage all client libs etc. to stay consistent. Do you know what kind of target/client lib the metric comes from?
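To make the ordering problem concrete: the scrape cache keys off the raw scraped metric text (see the getRef(yoloString(met)) snippet earlier in this thread), so the same series rendered with its labels in a different order yields a different cache key even though it resolves to the same storage ref. A small illustrative Go sketch; the canonical helper below is made up for demonstration and is not the actual fix:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// Two renderings of the same series that differ only in label order, as a
// /federate endpoint may emit them across scrapes.
const (
	a = `process_cpu_seconds_total{job="monitor",dc="mz",zone="global"}`
	b = `process_cpu_seconds_total{dc="mz",job="monitor",zone="global"}`
)

// canonical sorts the label pairs so both renderings produce the same key.
// Purely illustrative string handling -- Prometheus parses the exposition
// format properly rather than splitting strings like this.
func canonical(s string) string {
	open := strings.Index(s, "{")
	pairs := strings.Split(strings.TrimSuffix(s[open+1:], "}"), ",")
	sort.Strings(pairs)
	return s[:open] + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	fmt.Println(a == b)                       // false: the raw text differs
	fmt.Println(canonical(a) == canonical(b)) // true: same series
}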

aecolley commented Sep 19, 2017

Yes, the labels are in a different order. Here's a typical log line:

time="2017-09-18T17:41:07-04:00" level=warning msg="iterDone: deleted lset process_cpu_seconds_total{job="monitor",dc="mz",instance="prom01mz:9090",owner="devops",zone="global"} for ref 66494; entry for process_cpu_seconds_total{instance="prom01mz:9090",owner="devops",job="monitor",dc="mz",zone="global"} with same ref still exists" source="scrape.go:530" target=http://127.0.0.1:59090/federate?match%5B%5D=%7B__name__%3D~%22.%2B%22%7D

The /federate target is prometheus v1.3.0, as indeed is the original instance in this case. It's possible that this behaviour doesn't happen with newer prometheus releases, because I haven't tried any of them in this environment.

I grepped the same log file for the other lines corresponding to the same timeseries (using the instance and zone labels), and here they are, in case it helps:

time="2017-09-18T17:45:27-04:00" level=warning msg="iterDone: deleted lset process_cpu_seconds_total{instance="prom01mz:9090",owner="devops",job="monitor",dc="mz",zone="global"} for ref 66494; entry for process_cpu_seconds_total{job="monitor",dc="mz",instance="prom01mz:9090",owner="devops",zone="global"} with same ref still exists" source="scrape.go:530" target=http://127.0.0.1:59090/federate?match%5B%5D=%7B__name__%3D~%22.%2B%22%7D
time="2017-09-18T17:46:59-04:00" level=warning msg="no lset entry found for ref 66494" source="scrape.go:844" target=http://127.0.0.1:59090/federate?match%5B%5D=%7B__name__%3D~%22.%2B%22%7D timeseries="process_cpu_seconds_total{job="monitor",dc="mz",instance="prom01mz:9090",owner="devops",zone="global"}"
time="2017-09-18T17:51:35-04:00" level=warning msg="no lset entry found for ref 66494" source="scrape.go:844" target=http://127.0.0.1:59090/federate?match%5B%5D=%7B__name__%3D~%22.%2B%22%7D timeseries="process_cpu_seconds_total{job="monitor",dc="mz",instance="prom01mz:9090",owner="devops",zone="global"}"

fabxc added a commit that referenced this issue Sep 19, 2017

fabxc commented Sep 19, 2017

I think I've got this fixed – essentially just getting rid of some complexity we no longer need. The changes so far should also make us a bit more resilient against misbehaving targets like cAdvisor and – in my benchmark run so far – save another ~10% memory for a scraping-only Prometheus server.

I'll file a PR once everything looks good.
If you want, you can try it out by building again from the debug-2955 branch or via quay.io/fabxc/prometheus:v2.0.0-beta.4-debug-2955.

This also contains other important fixes we've done since beta.4.

aecolley commented Sep 19, 2017

After only 15 minutes of running the new build (prometheus_build_info{branch="debug-2955",goversion="go1.9",revision="f13dda0d1013ab45126c87163667bc49a05440d6",version="2.0.0-beta.4"}), everything seems fine. I'll update again after a day. Thanks for the very fast fix!

fabxc commented Sep 19, 2017

Awesome, thanks everyone for helping with debugging!
@aecolley once we've got 24h of feedback, I'll see if we can get beta.5 out.

igorcanadi commented Sep 19, 2017

I'm also deploying this build to our clusters, will report back in 24h!

aecolley commented Sep 20, 2017

24h later, still no crashes. Here's what the log looks like:

Caller(s) Count Example
scrape.go:905 48 level=warn ts=2017-09-19T13:28:38.103295879Z caller=scrape.go:905 component="target manager" scrape_pool=federate target="http://127.0.0.1:59090/federate?match%5B%5D=%7B__name__%3D~%22.%2B%22%7D" msg="Error on ingesting out-of-order samples" num_dropped=1
scrape.go:867 46 level=debug ts=2017-09-19T13:28:37.587221956Z caller=scrape.go:867 component="target manager" scrape_pool=federate target="http://127.0.0.1:59090/federate?match%5B%5D=%7B__name__%3D~%22.%2B%22%7D" msg="Out of order sample" series="ALERTS{(redacted)}"
compact.go:359 17 ts=2017-09-19T15:00:38.846134898Z caller=compact.go:359 component=tsdb msg="compact blocks" count=1 mint=1505822400000 maxt=1505829600000
head.go:265 12 ts=2017-09-19T15:00:40.558043907Z caller=head.go:265 component=tsdb msg="head GC completed" duration=104.258027ms
head.go:276 12 ts=2017-09-19T15:00:40.558196996Z caller=head.go:276 component=tsdb msg="WAL truncation completed" duration=4.512µs
scrape.go:820 8 level=debug ts=2017-09-19T14:20:39.489029496Z caller=scrape.go:820 component="target manager" scrape_pool=federate target="http://127.0.0.1:59090/federate?match%5B%5D=%7B__name__%3D~%22.%2B%22%7D" msg="Out of order sample" series="ALERTS{(redacted)}"
main.go:214 1 level=info ts=2017-09-19T12:47:16.400664405Z caller=main.go:214 msg="Starting prometheus" version="(version=2.0.0-beta.4, branch=debug-2955, revision=f13dda0d1013ab45126c87163667bc49a05440d6)"
main.go:215 1 level=info ts=2017-09-19T12:47:16.400810026Z caller=main.go:215 build_context="(go=go1.9, user=fabxc@Fabians-MBP-2.ber.coreos.systems, date=20170919-11:31:38)"
main.go:216 1 level=info ts=2017-09-19T12:47:16.400842701Z caller=main.go:216 host_details="(Linux 2.6.32-696.10.2.el6.x86_64 #1 SMP Tue Sep 12 14:33:29 UTC 2017 x86_64 prom01d (none))"
main.go:229 1 level=info ts=2017-09-19T12:47:16.400906984Z caller=main.go:229 msg="Starting TSDB"
main.go:242 1 level=info ts=2017-09-19T12:47:16.40888632Z caller=main.go:242 msg="TSDB succesfully started"
main.go:352 1 level=info ts=2017-09-19T12:47:16.424764905Z caller=main.go:352 msg="Server is ready to receive requests."
main.go:375 1 level=info ts=2017-09-19T12:47:16.411231102Z caller=main.go:375 msg="Loading configuration file" filename=/srv/prometheus-beta.yml
targetmanager.go:68 1 level=info ts=2017-09-19T12:47:16.424815922Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
web.go:367 1 level=info ts=2017-09-19T12:47:16.424850738Z caller=web.go:367 component=web msg="Start listening for connections" address=:49090

igorcanadi commented Sep 20, 2017

No crashes for us either, great stuff, thanks @fabxc !

fabxc commented Sep 22, 2017

Preemptively closing here. All issues listed here were addressed in separate PRs and should be fixed in beta.5.
If any issues occur, please open new issues, as this one has gotten quite busy with different ones.

fabxc closed this Sep 22, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
