infiniband: Handle iWARP* RDMA modules N/A #974

Merged
SuperQ merged 2 commits into prometheus:master from mjtrangoni:handle-inactive-iwarp
Oct 4, 2018
@mjtrangoni
Contributor

Hi,

PTAL. This should handle #966, where the Intel iWARP RDMA modules report "N/A (no PMA)" instead of an integer. I added an example to the fixtures and to the end-to-end tests.
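
For readers who have not seen the failure: the collector error quoted later in this thread comes from `strconv.ParseUint` rejecting the string "N/A (no PMA)". The following is a minimal Go sketch of tolerant parsing under that assumption. It is not the code from this PR, and `parseCounter` is a hypothetical name introduced only for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCounter is a hypothetical helper, not the collector's real function.
// It returns the counter value and true, or 0 and false when the driver
// reports the counter as unavailable (a value starting with "N/A").
func parseCounter(raw string) (uint64, bool, error) {
	s := strings.TrimSpace(raw)
	if strings.HasPrefix(s, "N/A") {
		// Unavailable counter: not an error, but no usable value either.
		return 0, false, nil
	}
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("parsing counter %q: %w", s, err)
	}
	return v, true, nil
}

func main() {
	// The second input mirrors the string from the error message in #966.
	for _, raw := range []string{"18527668\n", "N/A (no PMA)\n"} {
		v, ok, err := parseCounter(raw)
		fmt.Println(v, ok, err)
	}
}
```

Returning a separate "available" flag lets the caller decide whether to export 0 or to skip the metric; judging by the test output further down, the PR exports 0 for such counters.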

Member

@discordianfish discordianfish left a comment

Thanks for taking a stab at this.
Besides my comment above, the end-to-end tests are failing.

@discordianfish
Member

@mjtrangoni Can you look into fixing the tests? Buildkite runs the integration tests and they fail. You should be able to reproduce this locally by running the tests. This is the result:

 node_infiniband_legacy_unicast_packets_transmitted_total{device="mlx4_0",port="2"} 61239
 # HELP node_infiniband_link_downed_total Number of times the link failed to recover from an error state and went down
 # TYPE node_infiniband_link_downed_total counter
+node_infiniband_link_downed_total{device="i40iw0",port="1"} 0
 node_infiniband_link_downed_total{device="mlx4_0",port="1"} 0
 node_infiniband_link_downed_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_link_error_recovery_total Number of times the link successfully recovered from an error state
 # TYPE node_infiniband_link_error_recovery_total counter
+node_infiniband_link_error_recovery_total{device="i40iw0",port="1"} 0
 node_infiniband_link_error_recovery_total{device="mlx4_0",port="1"} 0
 node_infiniband_link_error_recovery_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_multicast_packets_received_total Number of multicast packets received (including errors)
@@ -803,10 +805,12 @@
 node_infiniband_multicast_packets_transmitted_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_port_data_received_bytes_total Number of data octets received on all links
 # TYPE node_infiniband_port_data_received_bytes_total counter
+node_infiniband_port_data_received_bytes_total{device="i40iw0",port="1"} 0
 node_infiniband_port_data_received_bytes_total{device="mlx4_0",port="1"} 1.8527668e+07
 node_infiniband_port_data_received_bytes_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_port_data_transmitted_bytes_total Number of data octets transmitted on all links
 # TYPE node_infiniband_port_data_transmitted_bytes_total counter
+node_infiniband_port_data_transmitted_bytes_total{device="i40iw0",port="1"} 0
 node_infiniband_port_data_transmitted_bytes_total{device="mlx4_0",port="1"} 1.493376e+07
 node_infiniband_port_data_transmitted_bytes_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_unicast_packets_received_total Number of unicast packets received (including errors)

@mjtrangoni
Contributor Author

Hi @discordianfish,
Sorry for the delay, I will be taking a look at this today.

@mjtrangoni mjtrangoni force-pushed the handle-inactive-iwarp branch from e0001da to 2928a13 on August 20, 2018 20:10
@mjtrangoni
Contributor Author

@discordianfish I cannot see what is actually happening in the Buildkite build. How can I access its logs?

Locally everything works, see:

>> extracting sysfs fixtures
if [ -d collector/fixtures/sys ] ; then rm -r collector/fixtures/sys ; fi
./ttar -C collector/fixtures -x -f collector/fixtures/sys.ttar
/home/mt/go/packages/src/github.com/prometheus/node_exporter/collector/fixtures
touch collector/fixtures/sys/.unpacked
>> running tests
go test -short -race ./...
ok  	github.com/prometheus/node_exporter	(cached)
ok  	github.com/prometheus/node_exporter/collector	1.202s
>> vetting code
go vet ./...
>> checking metrics for correctness
./checkmetrics.sh /home/mt/go/packages/bin/promtool collector/fixtures/e2e-output.txt
>> running tests in 32-bit mode
ok  	github.com/prometheus/node_exporter	(cached)
ok  	github.com/prometheus/node_exporter/collector	0.048s
>> running end-to-end tests
./end-to-end-test.sh

BTW, I will take a look tomorrow to see whether I can satisfy your previous change request.

@discordianfish
Member

Hrmm odd.. Here is the full build log. @SuperQ maybe you have some ideas?
I might have more time tomorrow to have a closer look.

~~~ Preparing working directory
$ cd /home/peon/.buildkite-agent/builds/debian-9-4-0-ppc64le-build-prometheus-io-1/prometheus/node-exporter
$ git remote set-url origin https://github.com/prometheus/node_exporter
$ git clean -fxdq
# Fetch and checkout pull request head from GitHub
$ git fetch -v origin refs/pull/974/head
POST git-upload-pack (960 bytes)
remote: Counting objects: 12, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 12 (delta 8), reused 11 (delta 8), pack-reused 0
Unpacking objects: 100% (12/12), done.
From https://github.com/prometheus/node_exporter
 * branch            refs/pull/974/head -> FETCH_HEAD
# FETCH_HEAD is now `2928a1330c0cbdf82eca34da3728c34e14774b9e`
$ git checkout -f 2928a1330c0cbdf82eca34da3728c34e14774b9e
Warning: you are leaving 1 commit behind, not connected to
any of your branches:

  58c51d2 filesystem: Ignore netns/nsfs mounts

If you want to keep it by creating a new branch, this may be a good time
to do so with:

 git branch <new-branch-name> 58c51d2

HEAD is now at 2928a13... infiniband: Handle issue when iWARP* RDMA modules are not available

# Cleaning again to catch any post-checkout changes
$ git clean -fxdq
# Checking to see if Git data needs to be sent to Buildkite
$ buildkite-agent meta-data exists buildkite:git:commit
~~~ Running commands
$ # Custom GOROOT
export GOROOT=$HOME/godev/go
export PATH=$GOROOT/bin:$PATH:$HOME/.buildkite-agent/bin

if [ ! -e "$HOME/.buildkite-agent/builds/${BUILDKITE_AGENT_NAME//\./-}/prometheus" ]; then
  mkdir -p  "$HOME/.buildkite-agent/builds/${BUILDKITE_AGENT_NAME//\./-}/prometheus"
fi

# Fresh GOPATH
export GOPATH="$(mktemp -d "$HOME/.buildkite-agent/builds/${BUILDKITE_AGENT_NAME//\./-}/prometheus/node-exporter-$BUILDKITE_COMMIT-$BUILDKITE_BUILD_NUMBER-XXXX")"

make promu

tar cf - --exclude ".git" . | (mkdir -p "$GOPATH/src/github.com/prometheus/node_exporter"; tar xf - -C "$GOPATH/src/github.com/prometheus/node_exporter")

(
 cd $GOPATH/src/github.com/prometheus/node_exporter
make
buildkite-agent artifact upload "node_exporter"
)

$HOME/bin/gc-builds / 75 /home/peon/var/gc_builds.prom
GOOS= GOARCH= go get -u github.com/prometheus/promu

>> checking code style

! gofmt -d $(find . -path ./vendor -prune -o -name '*.go' -print) | grep '^'

GOOS= GOARCH= go get -u honnef.co/go/tools/cmd/staticcheck

>> running staticcheck

/home/peon/.buildkite-agent/builds/debian-9-4-0-ppc64le-build-prometheus-io-1/prometheus/node-exporter-2928a1330c0cbdf82eca34da3728c34e14774b9e-991-jKp2/bin/staticcheck -ignore "" ./...

GOOS= GOARCH= go get -u github.com/kardianos/govendor

>> running check for unused packages

No unused packages

GOOS= GOARCH= go get -u github.com/prometheus/promu

>> building binaries

/home/peon/.buildkite-agent/builds/debian-9-4-0-ppc64le-build-prometheus-io-1/prometheus/node-exporter-2928a1330c0cbdf82eca34da3728c34e14774b9e-991-jKp2/bin/promu build --prefix /home/peon/.buildkite-agent/builds/debian-9-4-0-ppc64le-build-prometheus-io-1/prometheus/node-exporter-2928a1330c0cbdf82eca34da3728c34e14774b9e-991-jKp2/src/github.com/prometheus/node_exporter

 >   node_exporter

# github.com/prometheus/node_exporter

/tmp/go-link-787584353/000019.o: In function `mygetgrouplist':

/home/peon/godev/go/src/os/user/getgrouplist_unix.go:15: warning: Using 'getgrouplist' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

/tmp/go-link-787584353/000018.o: In function `mygetgrgid_r':

/home/peon/godev/go/src/os/user/cgo_lookup_unix.go:38: warning: Using 'getgrgid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

/tmp/go-link-787584353/000018.o: In function `mygetgrnam_r':

/home/peon/godev/go/src/os/user/cgo_lookup_unix.go:43: warning: Using 'getgrnam_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

/tmp/go-link-787584353/000018.o: In function `mygetpwnam_r':

/home/peon/godev/go/src/os/user/cgo_lookup_unix.go:33: warning: Using 'getpwnam_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

/tmp/go-link-787584353/000018.o: In function `mygetpwuid_r':

/home/peon/godev/go/src/os/user/cgo_lookup_unix.go:28: warning: Using 'getpwuid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

/tmp/go-link-787584353/000006.o: In function `_cgo_f7895c2c5a3a_C2func_getaddrinfo':

/tmp/go-build/cgo-gcc-prolog:46: warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

>> extracting sysfs fixtures

if [ -d collector/fixtures/sys ] ; then rm -r collector/fixtures/sys ; fi

./ttar -C collector/fixtures -x -f collector/fixtures/sys.ttar

touch collector/fixtures/sys/.unpacked

>> running tests

go test -short  ./...

ok  	github.com/prometheus/node_exporter	0.186s

ok  	github.com/prometheus/node_exporter/collector	0.307s

>> vetting code

go vet ./...

>> checking metrics for correctness

./checkmetrics.sh /home/peon/.buildkite-agent/builds/debian-9-4-0-ppc64le-build-prometheus-io-1/prometheus/node-exporter-2928a1330c0cbdf82eca34da3728c34e14774b9e-991-jKp2/bin/promtool collector/fixtures/e2e-64k-page-output.txt

>> SKIP running tests in 32-bit mode: not supported on Linux/ppc64le

>> running end-to-end tests

./end-to-end-test.sh

--- collector/fixtures/e2e-64k-page-output.txt	2018-08-19 09:46:02.000000000 -0400
+++ /tmp/node_exporter_e2e_test.qR3C7i/e2e-output.txt	2018-08-20 16:43:30.448636760 -0400
@@ -787,10 +787,12 @@
 node_infiniband_legacy_unicast_packets_transmitted_total{device="mlx4_0",port="2"} 61239
 # HELP node_infiniband_link_downed_total Number of times the link failed to recover from an error state and went down
 # TYPE node_infiniband_link_downed_total counter
+node_infiniband_link_downed_total{device="i40iw0",port="1"} 0
 node_infiniband_link_downed_total{device="mlx4_0",port="1"} 0
 node_infiniband_link_downed_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_link_error_recovery_total Number of times the link successfully recovered from an error state
 # TYPE node_infiniband_link_error_recovery_total counter
+node_infiniband_link_error_recovery_total{device="i40iw0",port="1"} 0
 node_infiniband_link_error_recovery_total{device="mlx4_0",port="1"} 0
 node_infiniband_link_error_recovery_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_multicast_packets_received_total Number of multicast packets received (including errors)
@@ -803,10 +805,12 @@
 node_infiniband_multicast_packets_transmitted_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_port_data_received_bytes_total Number of data octets received on all links
 # TYPE node_infiniband_port_data_received_bytes_total counter
+node_infiniband_port_data_received_bytes_total{device="i40iw0",port="1"} 0
 node_infiniband_port_data_received_bytes_total{device="mlx4_0",port="1"} 1.8527668e+07
 node_infiniband_port_data_received_bytes_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_port_data_transmitted_bytes_total Number of data octets transmitted on all links
 # TYPE node_infiniband_port_data_transmitted_bytes_total counter
+node_infiniband_port_data_transmitted_bytes_total{device="i40iw0",port="1"} 0
 node_infiniband_port_data_transmitted_bytes_total{device="mlx4_0",port="1"} 1.493376e+07
 node_infiniband_port_data_transmitted_bytes_total{device="mlx4_0",port="2"} 0
 # HELP node_infiniband_unicast_packets_received_total Number of unicast packets received (including errors)

LOG =====================

time="2018-08-20T16:43:28-04:00" level=info msg="Starting node_exporter (version=0.16.0, branch=non-git, revision=non-git)" source="node_exporter.go:82"

time="2018-08-20T16:43:28-04:00" level=info msg="Build context (go=go1.10.3, user=peon@debian-9-4-0-ppc64le.build.prometheus.io, date=20180820-20:19:45)" source="node_exporter.go:83"

time="2018-08-20T16:43:28-04:00" level=info msg="Enabled collectors:" source="node_exporter.go:90"

time="2018-08-20T16:43:28-04:00" level=info msg=" - arp" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - bcache" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - bonding" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - buddyinfo" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - conntrack" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - cpu" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - diskstats" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - drbd" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - edac" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - entropy" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - filefd" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - hwmon" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - infiniband" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - interrupts" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - ipvs" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - ksmd" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - loadavg" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - mdadm" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - meminfo" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - meminfo_numa" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - mountstats" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - netclass" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - netdev" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - netstat" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - nfs" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - nfsd" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - processes" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - qdisc" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - sockstat" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - stat" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - textfile" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - vmstat" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - wifi" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - xfs" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg=" - zfs" source="node_exporter.go:97"

time="2018-08-20T16:43:28-04:00" level=info msg="Listening on 127.0.0.1:12807" source="node_exporter.go:111"

time="2018-08-20T16:43:29-04:00" level=debug msg="collect query: []" source="node_exporter.go:36"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: sockstat collector succeeded after 0.002302s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: nfs collector succeeded after 0.005995s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: vmstat collector succeeded after 0.003657s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="Don't know how to process key-value pair [version: \"\"]" source="drbd_linux.go:206"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: textfile collector succeeded after 0.019876s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="Don't know how to process string \"8.4.3\"" source="drbd_linux.go:209"

time="2018-08-20T16:43:29-04:00" level=debug msg="Don't know how to process string \"(api:1/proto:86-101)\"" source="drbd_linux.go:209"

time="2018-08-20T16:43:29-04:00" level=debug msg="Don't know how to process key-value pair [srcversion: \"\"]" source="drbd_linux.go:206"

time="2018-08-20T16:43:29-04:00" level=debug msg="Don't know how to process string \"1A9F77B1CA5FF92235C2213\"" source="drbd_linux.go:209"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: interrupts collector succeeded after 0.005136s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: qdisc collector succeeded after 0.007228s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="return load 0: 0.210000" source="loadavg.go:51"

time="2018-08-20T16:43:29-04:00" level=debug msg="return load 1: 0.370000" source="loadavg.go:51"

time="2018-08-20T16:43:29-04:00" level=debug msg="return load 2: 0.390000" source="loadavg.go:51"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: loadavg collector succeeded after 0.003279s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: nfsd collector succeeded after 0.008169s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="Set node_mem: map[string]float64{\"VmallocTotal_bytes\":3.5184372087808e+13, \"AnonHugePages_bytes\":0, \"MemTotal_bytes\":3.831959552e+09, \"SwapCached_bytes\":1.97124096e+08, \"Mapped_bytes\":2.4496128e+08, \"SReclaimable_bytes\":4.5846528e+07, \"NFS_Unstable_bytes\":0, \"CommitLimit_bytes\":6.210940928e+09, \"HugePages_Total\":0, \"Inactive_anon_bytes\":9.04245248e+08, \"Hugepagesize_bytes\":2.097152e+06, \"HugePages_Surp\":0, \"Active_anon_bytes\":2.068484096e+09, \"Mlocked_bytes\":32768, \"AnonPages_bytes\":2.298032128e+09, \"VmallocChunk_bytes\":3.5183963009024e+13, \"HugePages_Free\":0, \"HugePages_Rsvd\":0, \"DirectMap4k_bytes\":1.9011584e+08, \"Inactive_file_bytes\":1.49172224e+08, \"Dirty_bytes\":1.077248e+06, \"KernelStack_bytes\":5.9392e+06, \"Bounce_bytes\":0, \"Committed_AS_bytes\":8.023486464e+09, \"VmallocUsed_bytes\":3.6130816e+08, \"Shmem_bytes\":6.0809216e+08, \"HardwareCorrupted_bytes\":0, \"Active_file_bytes\":2.18533888e+08, \"Unevictable_bytes\":32768, \"Writeback_bytes\":0, \"PageTables_bytes\":7.7017088e+07, \"WritebackTmp_bytes\":0, \"Cached_bytes\":9.53229312e+08, \"Active_bytes\":2.287017984e+09, \"Inactive_bytes\":1.053417472e+09, \"SwapFree_bytes\":3.23108864e+09, \"SUnreclaim_bytes\":5.545984e+07, \"MemFree_bytes\":2.30883328e+08, \"Buffers_bytes\":2.256896e+07, \"SwapTotal_bytes\":4.2949632e+09, \"Slab_bytes\":1.01306368e+08, \"DirectMap2M_bytes\":3.787456512e+09}" source="meminfo.go:48"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: meminfo collector succeeded after 0.096349s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: edac collector succeeded after 0.131201s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: entropy collector succeeded after 0.001151s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: loop3" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="\"port_rcv_data\" value is N/A" source="infiniband_linux.go:150"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: loop6" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: sda1" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram0" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram1" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="\"port_xmit_data\" value is N/A" source="infiniband_linux.go:150"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram11" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram12" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="\"link_downed\" value is N/A" source="infiniband_linux.go:150"

time="2018-08-20T16:43:29-04:00" level=debug msg="\"link_error_recovery\" value is N/A" source="infiniband_linux.go:150"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: loop7" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Don't know how to process string \"C\"" source="drbd_linux.go:209"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: netdev collector succeeded after 0.007508s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: conntrack collector succeeded after 0.064022s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="probing wifi device \"wlan0\" with type \"station\"" source="wifi_linux.go:172"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: filefd collector succeeded after 0.123671s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: ksmd collector succeeded after 0.073170s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: ipvs collector succeeded after 0.046448s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="Personality unknown: [md219 : inactive sdb[2](S) sdc[1](S) sda[0](S)]" source="mdadm_linux.go:185"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: xfs collector succeeded after 0.064362s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: sda4" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram3" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram4" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram13" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: loop4" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: bonding collector succeeded after 0.174741s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: stat collector succeeded after 0.221985s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="Set node_buddy: []procfs.BuddyInfo{procfs.BuddyInfo{Node:\"0\", Zone:\"DMA\", Sizes:[]float64{1, 0, 1, 0, 2, 1, 1, 0, 1, 1, 3}}, procfs.BuddyInfo{Node:\"0\", Zone:\"DMA32\", Sizes:[]float64{759, 572, 791, 475, 194, 45, 12, 0, 0, 0, 0}}, procfs.BuddyInfo{Node:\"0\", Zone:\"Normal\", Sizes:[]float64{4381, 1093, 185, 1530, 567, 102, 4, 0, 0, 0, 0}}}" source="buddyinfo.go:63"

time="2018-08-20T16:43:29-04:00" level=debug msg="Skipping duplicate device entry {\"192.168.1.1:/srv/test\" \"tcp\"}" source="mountstats_linux.go:518"

time="2018-08-20T16:43:29-04:00" level=debug msg="file not found when retrieving stats: \"open collector/fixtures/proc/11/stat: no such file or directory\"" source="processes_linux.go:108"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram10" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: loop0" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: loop1" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram8" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram9" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: sda2" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: vda2" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: nvme0n1p2" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram14" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram15" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: loop2" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: sda3" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram5" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram6" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: vda1" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: ram7" source="diskstats_linux.go:178"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: buddyinfo collector succeeded after 0.166204s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: mountstats collector succeeded after 0.171070s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: arp collector succeeded after 0.059075s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="OK: meminfo_numa collector succeeded after 0.267895s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: processes collector succeeded after 0.254776s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: zfs collector succeeded after 0.267997s." source="collector.go:135"

time="2018-08-20T16:43:29-04:00" level=debug msg="Ignoring device: nvme0n1p1" source="diskstats_linux.go:178"

time="2018-08-20T16:43:30-04:00" level=debug msg="Ignoring device: ram2" source="diskstats_linux.go:178"

time="2018-08-20T16:43:30-04:00" level=debug msg="Ignoring device: loop5" source="diskstats_linux.go:178"

time="2018-08-20T16:43:30-04:00" level=debug msg="Ignoring device: fd0" source="diskstats_linux.go:178"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: diskstats collector succeeded after 0.259788s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="Don't know how to process string \"r-----\"" source="drbd_linux.go:209"

time="2018-08-20T16:43:30-04:00" level=debug msg="Don't know how to process key-value pair [wo: \"d\"]" source="drbd_linux.go:206"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: drbd collector succeeded after 0.330100s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md3" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md127" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md0" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md4" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md6" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md8" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md7" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md9" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md10" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md11" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md12" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md126" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md219" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md00" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="collecting metrics for device md120" source="mdadm_linux.go:275"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: mdadm collector succeeded after 0.349385s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: netstat collector succeeded after 0.286771s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="probing wifi device \"wlan1\" with type \"access point\"" source="wifi_linux.go:172"

time="2018-08-20T16:43:30-04:00" level=debug msg="BSS information not found for wifi device \"wlan1\"" source="wifi_linux.go:190"

time="2018-08-20T16:43:30-04:00" level=debug msg="station information not found for wifi device \"wlan1\"" source="wifi_linux.go:203"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: wifi collector succeeded after 0.380765s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: infiniband collector succeeded after 0.403846s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: bcache collector succeeded after 0.369847s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: cpu collector succeeded after 0.319106s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: netclass collector succeeded after 0.486988s." source="collector.go:135"

time="2018-08-20T16:43:30-04:00" level=debug msg="OK: hwmon collector succeeded after 0.529723s." source="collector.go:135"

=========================

Makefile:95: recipe for target 'test-e2e' failed
make: *** [test-e2e] Error 1
🚨 Error: The command exited with status 2
^^^ +++

@hhoffstaette
Contributor

hhoffstaette commented Sep 8, 2018

Sorry to comment so late, but I figured the following information might be of interest.
While I agree that making the collector more error-tolerant would be nice, the root cause will soon go away because it should not have happened in the first place! I ran into the same problem with rxe (software RoCE) and reported it to linux-rdma, with an eventual solution preventing the empty/invalid counters directory from appearing in the first place. This patch will soon go into mainline.
I realize that doesn't help people with old kernels, but nevertheless it's good to know that this problem is fixed for good.
With the mentioned patch applied to my kernel, an rxe device (for example) is properly detected and has no metrics, but also causes no errors.

Signed-off-by: Mario Trangoni <mjtrangoni@gmail.com>
@mjtrangoni mjtrangoni force-pushed the handle-inactive-iwarp branch 2 times, most recently from dd987d2 to d4221a7 on September 11, 2018 11:32
@mjtrangoni
Contributor Author

@discordianfish I added the node_infiniband_*{device="i40iw0",port="1"} 0 metrics to collector/fixtures/e2e-64k-page-output.txt. What is failing now on Buildkite?
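
The earlier logs suggest why the local run passed while Buildkite failed: the local run checked collector/fixtures/e2e-output.txt, whereas the ppc64le agent compared against collector/fixtures/e2e-64k-page-output.txt, which at that point lacked the new i40iw0 lines. The Go sketch below illustrates that kind of comparison in a simplified form. It is not the logic of end-to-end-test.sh, and `unexpectedLines` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// unexpectedLines returns the lines of actual that do not occur anywhere in
// expected, in their original order. It is a simplified stand-in for the
// diff printed by the end-to-end test, not the script's real logic.
func unexpectedLines(expected, actual string) []string {
	seen := make(map[string]bool)
	for _, line := range strings.Split(expected, "\n") {
		seen[line] = true
	}
	var extra []string
	for _, line := range strings.Split(actual, "\n") {
		if !seen[line] {
			extra = append(extra, line)
		}
	}
	return extra
}

func main() {
	// Fixture without the i40iw0 line versus exporter output that has it.
	expected := `node_infiniband_link_downed_total{device="mlx4_0",port="1"} 0`
	actual := `node_infiniband_link_downed_total{device="i40iw0",port="1"} 0
node_infiniband_link_downed_total{device="mlx4_0",port="1"} 0`

	for _, line := range unexpectedLines(expected, actual) {
		fmt.Println("+" + line)
	}
}
```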

@discordianfish
Member

@mjtrangoni The failure was unrelated, kicked off the build again. If that passes, we're good to merge this (except that code comment)

@SuperQ Can you have a look? Also: Can we somehow make the buildkite output public?

@SuperQ
Member

SuperQ commented Sep 12, 2018

@discordianfish There's a bunch of long-standing issues with buildkite not having any way to make build output public. 😢

This is related to prometheus#966, and handle this error,

Jun 07 13:33:24 hostname node_exporter[81888]: time="2018-06-07T13:33:24+02:00" level=error msg="ERROR: infiniband
collector failed after 0.000929s: strconv.ParseUint: parsing \"N/A (no PMA)\": invalid syntax" source="collector.go:132"

Signed-off-by: Mario Trangoni <mjtrangoni@gmail.com>
@mjtrangoni mjtrangoni force-pushed the handle-inactive-iwarp branch from d4221a7 to 4ae03a4 on September 13, 2018 11:09
@mjtrangoni
Contributor Author

@discordianfish PTAL at my comment. buildkite failed again, but it should still be unrelated.

@mjtrangoni
Contributor Author

@discordianfish ping? I think this could be merged.

@discordianfish
Member

LGTM, @SuperQ wdyt?

Member

@SuperQ SuperQ left a comment

LGTM

@SuperQ SuperQ merged commit 3659260 into prometheus:master Oct 4, 2018
@mjtrangoni mjtrangoni deleted the handle-inactive-iwarp branch October 4, 2018 13:23
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
* infiniband: Add not connected i40iw0/ports/1 fixtures
* infiniband: Handle issue when iWARP* RDMA modules are not available

This is related to prometheus#966, and handle this error,

Jun 07 13:33:24 hostname node_exporter[81888]: time="2018-06-07T13:33:24+02:00" level=error msg="ERROR: infiniband
collector failed after 0.000929s: strconv.ParseUint: parsing \"N/A (no PMA)\": invalid syntax" source="collector.go:132"

Signed-off-by: Mario Trangoni <mjtrangoni@gmail.com>