
PCP webapi not returning data/metrics ? #14

Closed
hc000 opened this Issue Apr 17, 2015 · 8 comments


hc000 commented Apr 17, 2015

I followed these instructions and compiled PCP from source:

$ git clone git://git.pcp.io/pcp

$ apt-get build-dep pcp

$ cd pcp
$ ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --with-webapi
$ make
$ groupadd -r pcp
$ useradd -c "Performance Co-Pilot" -g pcp -d /var/lib/pcp -M -r -s /usr/sbin/nologin pcp
$ make install

but I am getting this error when calling the API:

PMWEBD error, code -12443: Insufficient elements in list

When I request http://host:44323/pmapi/context?hostspec=localhost&polltimeout=600

I get { "context": 367806010 }

When I request http://host:44323/pmapi/367806010/_metric?prefix=hinv

I get { "metrics": [] }

How can I make it populate data in the metrics array?

Thank you!

Contributor

fche commented Apr 17, 2015

If the phenomenon is the same there as it is here, this is probably a bug in libpcp. What happens is that if pmwebd and pmcd are both running, and pmwebd issues a successful context for the running pmcd, it remembers this too long. If you now stop the pmcd process, pmwebd (via libpcp's pmNewContext) will still issue new PCP context#s, perhaps because it decides to reuse the prior not-quite-dead TCP connection. Those new context#s are useless, however; pmFetch() might return errors (or might not?).
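A minimal C sketch of that failure mode (the hostname and the timing of pmcd's death are illustrative, not taken from this report; link with -lpcp):

    #include <pcp/pmapi.h>

    int main(void)
    {
        /* the first context opens a real TCP connection to the local pmcd */
        int c1 = pmNewContext(PM_CONTEXT_HOST, "localhost");

        /* ... suppose pmcd is stopped here, with no PDU traffic in between ... */

        /* the second call takes the fast path and re-uses the now-dead socket */
        int c2 = pmNewContext(PM_CONTEXT_HOST, "localhost");

        /* c2 can come back >= 0: a plausible-looking context# that can never fetch */
        return (c1 < 0 || c2 < 0) ? 1 : 0;
    }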

Contributor

kmcdonell commented Apr 18, 2015

I'm not sure if Frank's explanation is 100% correct ... if pmwebd calls pmDestroyContext() when it sees a fatal error, then I can't immediately see how the stale pmcd connection info can leak out to new contexts.

To get some more information, do the following:

edit /etc/pcp/pmwebd/pmwebd.options and add this line at the end

OPTIONS="$OPTIONS -Dcontext,pdu"

then restart pmwebd (sudo /etc/init.d/pmweb restart, or equivalent)

Try your web queries again.

Then post the contents of /var/log/pcp/pmweb/pmwebd.log ... this will contain the detailed diagnostics for pmwebd communicating with pmcd.


hc000 commented Apr 18, 2015

I actually talked to him on IRC, and he suggested changing PMCD_REQUEST_TIMEOUT from 1 to 10, which did start showing the metrics. Unfortunately I am not able to grab the log right now, but I can provide it on Monday if you still want it.
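For reference, the equivalent tweak from a client program (a sketch only; it relies on libpcp reading PMCD_REQUEST_TIMEOUT from the environment at connection time, so it must be set before the first pmNewContext()):

    #include <stdlib.h>
    #include <pcp/pmapi.h>

    int main(void)
    {
        /* allow up to 10 seconds per pmcd request instead of the configured 1 */
        setenv("PMCD_REQUEST_TIMEOUT", "10", 1);

        int ctx = pmNewContext(PM_CONTEXT_HOST, "localhost");
        return ctx < 0 ? 1 : 0;
    }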

Contributor

fche commented Apr 19, 2015

@kmcdonell The issue is fully reproducible here with 3.10.4 code, as follows:

context,pdu logs at http://web.elastic.org/~fche/issue14.txt

(Note pmwebd doesn't call pmDestroyContext() at all in this case - it's deferred until a few minutes after the last use of a pmwebapi context; and there were no errors evident anyway.)

Contributor

kmcdonell commented Apr 19, 2015

From Frank's email: "context,pdu logs at http://web.elastic.org/~fche/issue41.txt"

Once I un-transposed 41 -> 14, all good.

Thanks.

So I think the problem here is that ...

  1. pmwebd is idle most of the time, so after pmcd is terminated there is no pending PDU transfer that could deliver an error, which pmwebd could use to trigger pmDestroyContext().
  2. it appears that the PMNS traversal does not return 0; it does not return at all ... is that possible? If it's not possible (there should be a timeout and we should go down the timeout path), do you know where __pmNotifyErr() output for pmwebd goes ... are we missing that somehow, since there are lots of diagnostics in the pduread() timeout path?

@natoscott added the bug label Jun 3, 2015

Contributor

kmcdonell commented Dec 22, 2015

This is now fixed in my tree ... commits will flow to the github tree in due course.

From commit 3d4d2c0 ...
There is a fast path (and fd saving) optimization in pmNewContext()
where we don't establish a new connection to pmcd when we're asked
to create a subsequent PMAPI context for the same pmcd host and port
used to create an existing PMAPI context. Multiple PMAPI contexts
are then multiplexed over the same socket connection. This all
works as designed.

But as shown in https://github.com/performancecopilot/pcp/issues/14
this produces slightly odd behaviour when pmcd has died between the
initial pmNewContext() call and the later one: the client has not
sent a PDU, so libpcp does not update the state of the connection
to pmcd and will try to re-use a socket that is not going to work.

Since we cannot rely on any socket-level service to let us know
the remote end of a socket has been closed, we need a "ping" ... we
don't have a ping PDU per se (and adding one would introduce version
compatibility issues) so we use a (bad) pmDesc request PDU with an
expected PM_ERR_PMID error PDU response ... if this happens, pmcd is
well; otherwise we go down the slow path and try to build a new socket
connection to (presumably a new) pmcd.

And qa/1090 reproduces Frank's failure recipe to verify (a) it used to fail, and (b) it now works as expected.
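The same liveness test can be approximated above libpcp, for illustration only (pmcd_alive() is a hypothetical helper; the real fix lives inside pmNewContext() itself and sends the raw pmDesc request PDU):

    #include <pcp/pmapi.h>

    /* crude liveness probe in the spirit of the fix: ask pmcd to describe a
     * deliberately bogus metric and see whether any answer comes back */
    static int pmcd_alive(int ctx)
    {
        pmDesc desc;

        if (pmUseContext(ctx) < 0)
            return 0;
        /* a live pmcd answers PM_ID_NULL with a PM_ERR_PMID error PDU;
         * a dead connection yields an IPC or timeout error instead */
        return pmLookupDesc(PM_ID_NULL, &desc) == PM_ERR_PMID;
    }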

@kmcdonell closed this Dec 22, 2015

Contributor

fche commented Dec 22, 2015

Thanks, Ken!

"Since we cannot rely on any socket-level service to let us know the remote end of a socket has been closed"

... this part might not actually be true. According to Stack Exchange, a __pmRecv(fd, buf, 1, MSG_PEEK) should signal failure if the socket was closed by the other side, without having to send anything.
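A sketch of that check using plain POSIX recv(2) in place of libpcp's __pmRecv wrapper (MSG_DONTWAIT keeps the probe from blocking when the peer is alive but quiet, and is itself not universally portable):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <errno.h>

    /* returns 1 if the remote end has closed (or reset) the connection */
    static int peer_closed(int fd)
    {
        char c;
        ssize_t n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);

        if (n == 0)
            return 1;   /* orderly shutdown: the peer sent a FIN */
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return 1;   /* hard error, e.g. ECONNRESET */
        return 0;       /* unread data pending, or connection still up */
    }

Note this only spots a peer that actually closed the socket; a peer that died without sending a FIN still looks alive.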

Contributor

kmcdonell commented Dec 22, 2015

You're right, Frank.

The MSG_PEEK approach "might" work on some platforms (I saw this in my research also) ... but there is also plenty of Google-based intelligence suggesting that an application-level ping is safer, and I guarantee the ping approach will work across all platforms.

kmcdonell added a commit that referenced this issue Dec 27, 2015

src/libpcp/context.c: tweak pmNewContext() when re-using an existing pmcd connection


kmcdonell added a commit that referenced this issue Dec 27, 2015

qa/1090: (new) tests for #14
Exercise new "ping" protocol between libpcp's pmNewContext and pmcd
when re-using a socket connection for a new PMAPI context.

@kevinjpickard referenced this issue in Netflix/vector Feb 11, 2016

Closed

Unspecified error, No data plotting #111
