win_perf_counter plugin does not work on 386 #2468

Closed
PierreF opened this issue Feb 23, 2017 · 19 comments
Labels
area/windows Related to windows plugins (win_eventlog, win_perf_counters, win_services) bug unexpected problem or unintended behavior help wanted Request for community participation, code, contribution platform/windows

@PierreF
Contributor

PierreF commented Feb 23, 2017

Bug report

The i386 build of Telegraf crashes on Windows:

2017-02-23T15:41:04Z I! Starting Telegraf (version 1.2.1)
2017-02-23T15:41:04Z I! Loaded outputs: file
2017-02-23T15:41:04Z I! Loaded inputs: inputs.win_perf_counters
2017-02-23T15:41:04Z I! Tags enabled: host=MSEDGEWIN10
2017-02-23T15:41:04Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"MSEDGEWIN10", Flush Interval:10s
unexpected fault address 0x5566b687
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0x5566b687 pc=0x48a345]

goroutine 17 [running]:
runtime.throw(0xf13bde, 0x5)
        /usr/local/go/src/runtime/panic.go:566 +0x7f fp=0x12302e60 sp=0x12302e54
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_windows.go:164 +0x116 fp=0x12302e78 sp=0x12302e60
syscall.UTF16ToString(0x5566b687, 0x20000000, 0x20000000, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall_windows.go:51 +0x35 fp=0x12302ea0 sp=0x12302e78
github.com/influxdata/telegraf/plugins/inputs/win_perf_counters.UTF16PtrToString(0x5566b687, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/win_perf_counters/pdh.go:418 +0x62 fp=0x12302ec4 sp=0x12302ea0
github.com/influxdata/telegraf/plugins/inputs/win_perf_counters.(*Win_PerfCounters).Gather(0x1265a8e0, 0x14046a0, 0x1265af60, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/win_perf_counters/win_perf_counters.go:277 +0x32d fp=0x12302f98 sp=0x12302ec4
github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0x122a8a00, 0x1265aaa0, 0x1265af60)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:153 +0x4c fp=0x12302fc8 sp=0x12302f98
runtime.goexit()
        /usr/local/go/src/runtime/asm_386.s:1612 +0x1 fp=0x12302fcc sp=0x12302fc8
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:154 +0xe5

Full telegraf.conf:

[[outputs.file]]
    files = ["stdout"]

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
    ]

System info:

Windows 10 amd64 and Windows 8 i386
Telegraf 1.2.1 and nightly

Steps to reproduce:

  1. Run telegraf and wait a few seconds (about 20-30s)

Expected behavior:

No crash

Actual behavior:

Crash :)

Additional info:

On the same machine, the amd64 version works well.
I've dug a bit into the probable root cause, and I think the issue is a difference in structure size between Go and the Windows API.
PDH_FMT_COUNTERVALUE_ITEM_DOUBLE has a size (according to unsafe.Sizeof, so according to Go) of 24 bytes on amd64 and 16 bytes on i386.

Both seem logical if the structure aligns its fields on the machine word size (8 bytes on amd64; 4 bytes on i386).
The expanded structure is

struct {
    SzName *uint16       // machine word size: 4 or 8 bytes
                         // no padding needed to align on word size
    CStatus uint32       // 4 bytes
                         // padding to align on word size: none on i386, 4 bytes on amd64
    DoubleValue float64  // 8 bytes
}
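The sizes quoted above can be checked with a small program. This is a minimal sketch: the type name `counterValueItem` and the helper `layout` are made up for illustration and are not taken from pdh.go. It prints the size and field offsets that Go computes for the expanded struct.

```go
package main

import (
	"fmt"
	"unsafe"
)

// ptrSize is the machine word size Go is compiling for (4 or 8 bytes).
const ptrSize = unsafe.Sizeof(uintptr(0))

// counterValueItem mirrors the expanded struct above (hypothetical name).
type counterValueItem struct {
	SzName      *uint16
	CStatus     uint32
	DoubleValue float64
}

// layout returns the total size and the offsets of CStatus and DoubleValue
// as the Go compiler lays them out for the current GOARCH.
func layout() (size, offStatus, offValue uintptr) {
	var v counterValueItem
	return unsafe.Sizeof(v), unsafe.Offsetof(v.CStatus), unsafe.Offsetof(v.DoubleValue)
}

func main() {
	size, offStatus, offValue := layout()
	fmt.Printf("word=%d size=%d CStatus@%d DoubleValue@%d\n", ptrSize, size, offStatus, offValue)
}
```

Built with GOARCH=amd64 this should report a size of 24 bytes, and with GOARCH=386 a size of 16 bytes, matching the numbers reported above.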

But I think Windows and C++ align on an 8-byte boundary on both i386 and amd64. I don't have a C++ compiler on Windows to confirm this hypothesis, but adding a few fmt.Printf calls led me to this idea:

Just before this for loop I've added:

fmt.Printf("ret=%#v, bufSize=%#v, bufCount=%#v\n", ret, bufSize, bufCount)
fmt.Printf("%#v\n", (*[1 << 29]byte)(unsafe.Pointer(&(filledBuf[0])))[:bufSize])

This will dump the number of items and the binary data in the buffer.

Result just before crash (on i386 version of telegraf):

ret=0x0, bufSize=0x42, bufCount=0x2
[]byte{
    0x30, 0x10, 0x3c, 0x12,        // this is szName
    0x0, 0x0, 0x0, 0x0,            // this looks like padding to align CStatus on an 8-byte boundary
    0x0, 0x0, 0x0, 0x0,            // this is CStatus (a 4-byte DWORD)
    0x0, 0x0, 0x0, 0x0,            // this looks like padding to align DoubleValue on an 8-byte boundary
    0x11, 0xc0, 0x11, 0x20, 0x2e, 0x83, 0x56, 0x40,  // this is a double value equal to 90.05, which looks good for a CPU Idle %

    0x3e, 0x10, 0x3c, 0x12,        // this looks like another szName; the addresses are rather close (0x123c103e vs 0x123c1030)
    0x0, 0x0, 0x0, 0x0,            // padding
    0x0, 0x0, 0x0, 0x0,            // CStatus
    0x0, 0x0, 0x0, 0x0,            // padding
    0x11, 0xc0, 0x11, 0x20, 0x2e, 0x83, 0x56, 0x40,  // double, equal to 90.05 like the first one.
                                                     // This is expected since the machine is single core. One of the values is
                                                     // the single core, the other is the total (which on a single core is the same value)

    // I don't know why there is always some additional data... it's the case on both i386 and amd64
    0x5f, 0x0, 0x54, 0x0, 0x6f, 0x0, 0x74, 0x0, 0x61, 0x0, 0x6c, 0x0, 0x0, 0x0, 0x30, 0x0, 0x0, 0x0
}

But since Go assumes alignment is done on the machine word size, it will interpret the values as:

[]byte{
    0x30, 0x10, 0x3c, 0x12,        // this is szName, good
    0x0, 0x0, 0x0, 0x0,            // Go uses this as CStatus... okay since CStatus seems to always be 0, like the padding
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,  // then Go assumes this is DoubleValue... equal to 0.0. Unlikely for a CPU Idle % (the machine is doing nothing)

    0x11, 0xc0, 0x11, 0x20,  // this should then be the next szName... but it causes the unexpected fault address 0x2011c011
    0x2e, 0x83, 0x56, 0x40,  // this should be CStatus... with a non-zero value
    0x3e, 0x10, 0x3c, 0x12, 0x0, 0x0, 0x0, 0x0,  // this should be DoubleValue, equal to 1.511476285e-315

    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x11, 0xc0, 0x11, 0x20, 0x2e, 0x83, 0x56, 0x40,
    0x5f, 0x0, 0x54, 0x0, 0x6f, 0x0, 0x74, 0x0, 0x61, 0x0, 0x6c, 0x0, 0x0, 0x0, 0x30, 0x0, 0x0, 0x0
}
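The misreading can be reproduced without a Windows machine. This sketch walks the first 48 bytes of the dump with the layout Go uses on i386 (pointer at 0, status at 4, double at 8, 16 bytes per item). The helper name `readAs386` is made up for illustration. The second record's name pointer comes out as 0x2011c011, which is really the low half of the first double.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// head is the first 48 bytes of the dump: two 24-byte records as Windows wrote them.
var head = []byte{
	0x30, 0x10, 0x3c, 0x12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	0x11, 0xc0, 0x11, 0x20, 0x2e, 0x83, 0x56, 0x40,
	0x3e, 0x10, 0x3c, 0x12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	0x11, 0xc0, 0x11, 0x20, 0x2e, 0x83, 0x56, 0x40,
}

// readAs386 decodes record i with Go's i386 layout: 16-byte stride, no padding.
func readAs386(buf []byte, i int) (namePtr, status uint32, value float64) {
	rec := buf[i*16:]
	namePtr = binary.LittleEndian.Uint32(rec[0:])
	status = binary.LittleEndian.Uint32(rec[4:])
	value = math.Float64frombits(binary.LittleEndian.Uint64(rec[8:]))
	return namePtr, status, value
}

func main() {
	for i := 0; i < 2; i++ {
		namePtr, status, value := readAs386(head, i)
		fmt.Printf("item %d: szName=%#x CStatus=%#x value=%g\n", i, namePtr, status, value)
	}
}
```

Dereferencing the second szName is what UTF16PtrToString attempts in the stack trace, which would explain the access violation.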

Proposal:

We should verify that Windows C++ does align the structure on an 8-byte boundary (anyone with a C++ compiler on Windows? Just checking sizeof(PDH_FMT_COUNTERVALUE_ITEM) should be enough).
If confirmed, we should find out how to tell Go to align on an 8-byte boundary for both i386 and amd64.
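As far as I know Go has no directive to change struct alignment, so one possible approach is explicit padding fields. The sketch below assumes the Windows layout is what the dump suggests (24 bytes, CStatus at offset 8, the double at offset 16). The names `paddedItem`, `ptrPad` and `paddedLayout` are hypothetical, and this is not taken from any fix in Telegraf.

```go
package main

import (
	"fmt"
	"unsafe"
)

// ptrPad is 4 on 32-bit targets and 0 on 64-bit targets, so that the
// pointer plus its padding always occupies 8 bytes.
const ptrPad = 8 - unsafe.Sizeof(uintptr(0))

// paddedItem spells out the padding instead of relying on Go's alignment rules.
type paddedItem struct {
	SzName      *uint16
	_           [ptrPad]byte // pads the pointer to 8 bytes on i386
	CStatus     uint32
	_           [4]byte // pads CStatus so DoubleValue starts at offset 16
	DoubleValue float64
}

// paddedLayout returns the size and the offsets of CStatus and DoubleValue.
func paddedLayout() (size, offStatus, offValue uintptr) {
	var v paddedItem
	return unsafe.Sizeof(v), unsafe.Offsetof(v.CStatus), unsafe.Offsetof(v.DoubleValue)
}

func main() {
	size, offStatus, offValue := paddedLayout()
	fmt.Printf("size=%d CStatus@%d DoubleValue@%d\n", size, offStatus, offValue)
}
```

An alternative with the same effect would be two struct definitions in separate per-architecture files selected by build constraints. The single definition here just keeps the sketch short.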

@PierreF
Contributor Author

PierreF commented Feb 23, 2017

It's possible that lxn/win#12 was caused by the same issue

@sparrc sparrc added bug unexpected problem or unintended behavior platform/windows help wanted Request for community participation, code, contribution labels Feb 23, 2017
@danielnelson danielnelson added area/windows Related to windows plugins (win_eventlog, win_perf_counters, win_services) help wanted Request for community participation, code, contribution and removed help wanted Request for community participation, code, contribution labels May 11, 2017
@kmonsoor

I can confirm that Telegraf is still crashing as of telegraf-1.4.0-rc2_windows_i386, tested on Windows 7 Enterprise 32-bit. Is someone working on it?

C:\Users\IEUser\Downloads\telegraf-1.4.0-rc2_windows_i386\telegraf>telegraf.exe
2017/08/29 05:23:22 I! Using config file: C:\Users\IEUser\Downloads\telegraf-1.4.0-rc2_windows_i386\telegraf\telegraf.conf
E! Unable to create /Program Files/Telegraf/telegraf.log (open /Program Files/Telegraf/telegraf.log: The system cannot find the path specified.), using stderr
2017-08-29T12:23:24Z I! Database creation failed: Post http://localhost:8086/query?q=CREATE+DATABASE+%22telegraf%22: dial tcp [::1]:8086: connectex: No connection could be made because the target machine actively refused it.
2017-08-29T12:23:24Z I! Starting Telegraf v1.4.0~rc2
2017-08-29T12:23:24Z I! Loaded outputs: influxdb
2017-08-29T12:23:24Z I! Loaded inputs: inputs.win_perf_counters
2017-08-29T12:23:24Z I! Tags enabled: host=IE11Win7
2017-08-29T12:23:24Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"IE11Win7", Flush Interval:10s
unexpected fault address 0x2f2bec32
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0x2f2bec32 pc=0x46a559]

goroutine 12 [running]:
runtime.throw(0x1111aca, 0x5)
        /usr/local/go/src/runtime/panic.go:596 +0x7c fp=0x12963eec sp=0x12963ee0
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_windows.go:164 +0xe2 fp=0x12963f00 sp=0x12963eec
syscall.UTF16ToString(0x2f2bec32, 0x20000000, 0x20000000, 0x17f0c20, 0x12988a50)
        /usr/local/go/src/syscall/syscall_windows.go:49 +0x29 fp=0x12963f1c sp=0x12963f00
github.com/influxdata/telegraf/plugins/inputs/win_perf_counters.UTF16PtrToString(0x2f2bec32, 0x12a8da27, 0x7)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/win_perf_counters/pdh.go:421 +0x35 fp=0x12963f34 sp=0x12963f1c
github.com/influxdata/telegraf/plugins/inputs/win_perf_counters.(*Win_PerfCounters).Gather(0x12a89180, 0x17ead80, 0x12a98220, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/win_perf_counters/win_perf_counters.go:211 +0x1db fp=0x12963fc0 sp=0x12963f34
github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0x12a02f00, 0x12a89c00, 0x12a98220)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:153 +0x38 fp=0x12963fe0 sp=0x12963fc0
runtime.goexit()
        /usr/local/go/src/runtime/asm_386.s:1629 +0x1 fp=0x12963fe4 sp=0x12963fe0
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:154 +0xba

goroutine 1 [chan receive]:
github.com/kardianos/service.(*windowsService).Run(0x12bc3d60, 0x12ce6090, 0x12bbcf60)
        /home/ubuntu/telegraf-build/src/github.com/kardianos/service/service_windows.go:273 +0x127
main.main()
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:398 +0x9a2

goroutine 27 [select]:
main.reloadLoop.func1(0x12970340, 0x12970300, 0x12cdaa40, 0x12cdaa00)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:195 +0x1d5
created by main.reloadLoop
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:209 +0x5eb

goroutine 19 [syscall]:
os/signal.signal_recv(0x0)
        /usr/local/go/src/runtime/sigqueue.go:116 +0x14f
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:22 +0x1a
created by os/signal.init.1
        /usr/local/go/src/os/signal/signal_unix.go:28 +0x37

goroutine 23 [semacquire]:
sync.runtime_Semacquire(0x12bb888c)
        /usr/local/go/src/runtime/sema.go:47 +0x29
sync.(*WaitGroup).Wait(0x12bb8880)
        /usr/local/go/src/sync/waitgroup.go:131 +0x91
github.com/influxdata/telegraf/agent.(*Agent).Run(0x12972ea8, 0x12970300, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:426 +0x497
main.reloadLoop(0x12cdaa00, 0x18daf00, 0x0, 0x0, 0x18daf00, 0x0, 0x0, 0x18daf00, 0x0, 0x0, ...)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:234 +0x8c2
main.(*program).run(0x12ce6090)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:262 +0xdc
created by main.(*program).Start
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:251 +0x33

goroutine 10 [select]:
github.com/influxdata/telegraf/agent.(*Agent).flusher(0x12972ea8, 0x12970300, 0x129705c0, 0x12970640, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:320 +0x360
github.com/influxdata/telegraf/agent.(*Agent).Run.func1(0x12bb8880, 0x12972ea8, 0x12970300, 0x129705c0, 0x12970640)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:396 +0x63
created by github.com/influxdata/telegraf/agent.(*Agent).Run
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:400 +0x2d7

goroutine 28 [select]:
github.com/influxdata/telegraf/agent.(*Agent).flusher.func1(0x12bb88b0, 0x12970300, 0x12970680, 0x12972ea8)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:263 +0x25c
created by github.com/influxdata/telegraf/agent.(*Agent).flusher
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:292 +0xc6

goroutine 11 [select]:
github.com/influxdata/telegraf/agent.gatherWithTimeout(0x12970300, 0x12a89c00, 0x12a98220, 0x540be400, 0x2)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:157 +0x234
github.com/influxdata/telegraf/agent.(*Agent).gatherer(0x12972ea8, 0x12970300, 0x12a89c00, 0x540be400, 0x2, 0x129705c0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:124 +0x2a4
github.com/influxdata/telegraf/agent.(*Agent).Run.func3(0x12bb8880, 0x12972ea8, 0x12970300, 0x129705c0, 0x12a89c00, 0x540be400, 0x2)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:422 +0x6b
created by github.com/influxdata/telegraf/agent.(*Agent).Run
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:423 +0x45f

goroutine 29 [select]:
github.com/influxdata/telegraf/agent.(*Agent).flusher.func2(0x12bb88b0, 0x12970300, 0x12970640, 0x12972ea8, 0x12970680)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:298 +0x210
created by github.com/influxdata/telegraf/agent.(*Agent).flusher
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:315 +0x121

@danielnelson
Contributor

@kmonsoor No one is working on this as far as I know, would you be able to take a look?

@kmonsoor

@danielnelson I could, but I have no idea how Go works, nor how to develop for Windows 😞

@ronnix

ronnix commented Sep 4, 2017

It would be great to fix this issue, but in the meantime the default configuration shipped with the 32-bit Windows build could be changed so that all the [[inputs.win_perf_counters.*]] sections are commented out and the following sections are uncommented:

[[inputs.cpu]]
[[inputs.disk]]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.swap]]

This would at least prevent the bad out-of-the-box experience with this 32-bit build.

@sim-brar

@ronnix Your workaround worked perfectly! Thanks :)

@MarkBreedveld

MarkBreedveld commented Jan 9, 2018

@PierreF
I'm not very good with C++ or Go, but I compiled the following code with the Windows C++ compiler as a simple console application.
Both the x64 and the x86 builds returned 24.

That confirms the cause.

#include "stdafx.h"
#include <windows.h>
#include <stdio.h>
#include <pdh.h>
#include <pdhmsg.h>
#include <string>
#include <iostream>

int main()
{
	// Size of the struct as the Windows C++ compiler lays it out.
	auto i = sizeof(PDH_FMT_COUNTERVALUE_ITEM);

	std::string str = std::to_string(i);
	std::cout << str;
	// Wait for Enter so the console window stays open.
	std::getline(std::cin, str);
	return 0;
}

@MarkBreedveld

I also looked for existing solutions, and some do exist.

Elastic might have working performance counters.
https://github.com/elastic/beats/tree/master/metricbeat/module/windows/perfmon

alexbrainman's GitHub repository says many things are broken, but his approach might work.
https://github.com/alexbrainman/pc

@srclosson

This issue is causing some major problems for me. I would like to deploy Telegraf on the Windows PCs that collect data from the onboard systems of several pieces of mining equipment, but Telegraf won't even start. I would be willing to help out with testing and development in any way I can for a quick turnaround.

@srclosson

I looked over the project at https://github.com/alexbrainman/pc, saw how he pads the structs to get 64-bit alignment, and used a similar tactic in pdh.go. It looks like it worked: I'm collecting data on a 32-bit platform without crashing.

I'm also new to Go and would need some help productizing this, but for the time being the software appears to be collecting data reliably.

@russorat
Contributor

Thanks everyone for the report. We will see where we can fit this in. In the meantime, @srclosson, if you'd like to submit a PR with the changes you made to get it working, we (and the community) will review it.

@srclosson

@russorat: Sorry, I could look for the PR process, but I'm so busy. Could you point me in the right direction?

@danielnelson
Contributor

@srclosson We don't have directions for this specifically, but here is a brief overview that I hope will help you get started. This is from memory so it might not be 100% accurate:

@srclosson

Okay, I've done an initial check-in. Some work to separate the 64-bit and 32-bit builds is still required. It also includes 2 enhancements:

  1. Use the timestamp from the device source
  2. Allow connecting to a remote perfmon source to get data

Comments are welcome
https://github.com/srclosson/telegraf/tree/386fix

@russorat
Contributor

@srclosson Thanks for sharing. Could you open an official pull request for your fix? https://github.com/influxdata/telegraf/compare/master...srclosson:386fix?expand=1

@srclosson

Yes, I can. I'm actually waiting for approval on my end. Sorry guys, this is taking time. I don't expect to have permission until Monday or Tuesday.

@srclosson

I have created a pull request. @russorat, would you mind reviewing?

@russorat
Contributor

connect #4076

@danielnelson
Contributor

This bug should be fixed by #4189; it would be great if everyone could try it out using the nightly builds.
