pmchart crash on OSX with certain archive #35

Closed
tallpsmith opened this Issue Aug 10, 2015 · 18 comments

Contributor

tallpsmith commented Aug 10, 2015

I've tried this with the latest pmchart (3.10.6), but I get this crash pretty reliably as I scroll around time ranges with a specific archive.

```
[ app2 ]$ pmchart -z -a 20150807.0
pmchart(13806,0x7fff7e625300) malloc: *** error for object 0x1028caf80: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6
```

I'll attach the actual OSX Debug window data as a separate file. The archive I think I'll have to provide externally rather than here.

Contributor

tallpsmith commented Aug 10, 2015

Leaving a reference for myself, this was app2 in AUH 20150807

Contributor

kmcdonell commented Aug 10, 2015

Thanks for that Paul ... especially as it lands in my lap via the much maligned interp.c.

I would expect this to have nothing to do with pmchart, other than guilt by association.

But I will need some additional information to prosecute this:

  1. the archive (maybe under NDA if the information is sensitive)
  2. the metrics you were monitoring ... if these came from a pmchart config, then that file would help and it would help to know if you added or deleted metrics and/or charts between the start of pmchart and the failure
  3. update interval (default, or -t from command line, or changed via time control widget)
  4. travelling forwards or backwards in the archive
  5. position in the archive ... just playing forwards from the start, or just playing backwards from the end, or -S, -O, -A or -T command line options, or drag-n-drop the time line via the time control widget

Unfortunately there are many levers that influence the interpolation code, and it helps to have the fullest possible picture of all of them leading up to the kaboom.

Contributor

kmcdonell commented Aug 10, 2015

Oh, and
6. which version of PCP is this?

Contributor

tallpsmith commented Aug 11, 2015

pmlogger was running on 3.6.5-1, pmchart running on the latest build (downloaded yesterday), 3.10.6.

I'll set up a step-by-step guide if I can, and email you the archive privately.

Contributor

tallpsmith commented Aug 11, 2015

Here's the gist with the command to run, plus the chart file I created. When I run the command with this chart against the archive (emailed to you), it crashes immediately (I don't see anything, just a crash).

https://gist.github.com/tallpsmith/06d6638abc4e1d170d96

natoscott added the bug label Aug 27, 2015

Contributor

tallpsmith commented Sep 7, 2015

I'll just say that this is currently affecting me a lot. I wouldn't normally nag, but I'm struggling to work around this on OSX. It doesn't seem to affect others on Linux here for some reason.

Any progress would be met with a happy dance.

Contributor

kmcdonell commented Sep 9, 2015

Paul, I'm in Europe at the moment ... no PCP bandwidth for another couple of weeks I'm afraid.

Contributor

natoscott commented Sep 9, 2015

@kmcdonell Paul has sent me the archive too now, hoping to diagnose & fix within a week.

Contributor

natoscott commented Sep 11, 2015

> Any progress would be met with a happy dance.

I've reproduced the problem, debugged a bit (ohhhh man, interp.c) and have an initial workaround which I'm testing at the moment to see whether it might serve for next week's release. I'm not 100% sure of the intention of some parts of the affected code (in libpcp/src/interp.c cache_read) so the long-term fix will very likely need to wait for Ken's input.

But, I can definitely replay that archive now on Mac OS X (it's all in libpcp memory management code, and nothing to do with archive format changing, or anything like that, as was wondered earlier).

cheers.

Contributor

natoscott commented Sep 14, 2015

Paul, could you try the dmg file I've put at ftp://ftp.pcp.io/projects/pcp/download/pcp-3.10.7-rc1.dmg and see how that pmchart fares for you? Seems to be working again for me for your test case.

I git-bisect'ed back to commit 7881fd5 ("libpcp pdubuf: rewrite using <search.h> binary trees") which is a libpcp pdubuf performance optimisation only used by pmwebd currently. I believe it may have something to do with buffer unpinning of cached pmResult valuesets in the interp.c cache_read() code, but I'm not sure beyond that.

Like you, I'm also only seeing it fail on Mac OS X - perhaps the memory allocation strategies there are circumventing the logic at the start of __pmFreeResultValueSets() - again, not clear. I'll add a pointer to some trace logs here shortly so others can have a peek too.
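
As an illustration of the kind of bookkeeping being discussed, here is a minimal sketch in C of a pinned-buffer registry kept in a `<search.h>` binary tree. It is not the libpcp code: `buf_t`, `pool_alloc`, `pool_pin`, `pool_unpin` and `release` are hypothetical names invented for this example, and the root cause had not been established at this point in the thread. The sketch only shows the general pattern where a release routine first asks "does this address live inside a tracked buffer?" and falls back to `free()` when the answer is no.

```c
#include <assert.h>
#include <search.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record for one pooled buffer; not libpcp's struct. */
typedef struct {
    uintptr_t base;   /* address of the first byte */
    size_t    size;   /* length in bytes, always > 0 */
    int       pins;   /* outstanding references */
} buf_t;

static void *buf_tree;   /* tree root used by tsearch/tfind/tdelete */

/* Order records by address range; overlapping ranges compare equal, so a
 * one-byte probe matches the buffer that contains it. */
static int
buf_cmp(const void *a, const void *b)
{
    const buf_t *x = a, *y = b;

    if (x->base + x->size <= y->base) return -1;
    if (y->base + y->size <= x->base) return 1;
    return 0;
}

/* Find the tracked buffer containing address p, or NULL if there is none. */
static buf_t *
pool_find(const void *p)
{
    buf_t probe = { (uintptr_t)p, 1, 0 };
    void *node = tfind(&probe, &buf_tree, buf_cmp);

    return node ? *(buf_t **)node : NULL;
}

/* Allocate a tracked buffer that starts life with one pin. */
void *
pool_alloc(size_t size)
{
    buf_t *b;
    void *mem;

    if (size == 0 || (b = malloc(sizeof(*b))) == NULL)
        return NULL;
    if ((mem = malloc(size)) == NULL) {
        free(b);
        return NULL;
    }
    b->base = (uintptr_t)mem;
    b->size = size;
    b->pins = 1;
    if (tsearch(b, &buf_tree, buf_cmp) == NULL) {
        free(mem);
        free(b);
        return NULL;
    }
    return mem;
}

/* Add a pin via any address inside a tracked buffer; returns 1 if found. */
int
pool_pin(const void *p)
{
    buf_t *b = pool_find(p);

    if (b == NULL)
        return 0;
    b->pins++;
    return 1;
}

/* Drop a pin; the buffer is freed when the last pin goes. Returns 1 if p
 * was inside a tracked buffer, 0 if the pool knows nothing about it. */
int
pool_unpin(const void *p)
{
    buf_t *b = pool_find(p);

    if (b == NULL)
        return 0;
    assert(b->pins > 0);
    if (--b->pins == 0) {
        tdelete(b, &buf_tree, buf_cmp);
        free((void *)b->base);
        free(b);
    }
    return 1;
}

/* Release either a pooled address or a plain malloc'd pointer. */
void
release(void *p)
{
    if (!pool_unpin(p))
        free(p);
}
```

In this sketch the correctness of `release()` depends entirely on the range lookup: if it ever reports "not tracked" for an address that sits inside a pooled buffer, the fallback hands an interior pointer to `free()`, which is the class of error that produces a "pointer being freed was not allocated" abort. Whether that is what happens in libpcp here is an assumption, not something the thread confirms.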

cheers.

Contributor

natoscott commented Sep 14, 2015

To clarify a little, the dmg above is a build with the libpcp commit above reverted - there is no known fix beyond that at this stage.

Contributor

tallpsmith commented Sep 14, 2015

I'm on leave until next Monday up in sunny Queensland. I'll try when I get back.

Thanks heaps for looking into it.

natoscott added a commit that referenced this issue Sep 15, 2015

qa: add test 816 to exercise a libpcp pdu-buffer memory issue
Exercise the libpcp pdubuf/interp issue from github issue #35.
This adds a small, anonymised archive based on the production
data sent privately in relation to the above github issue.

The test reproduces the same bug using pmdumptext, which is a
handy development as it's more easily automated and it's also
a lot easier to follow than the pmchart GUI code.

natoscott added a commit that referenced this issue Sep 15, 2015

libpcp: temporarily revert pdubuf tsearch-based optimisation
As discussed in github issue #35, this commit has inadvertently
caused users of production PCP data much grief on Mac OS X, and
we are none the wiser as to root cause yet.  So, revert to the
original pdubuf pinning scheme (circa 1995) while the newer code
is being further diagnosed, and so that we can get a known-good
release out for Mac users once more.

Once figured out, commits 7881fd5, 1401c20 + f0f64be
all need to be reinstated (7881fd5 has been git-bisected as the
cause, the others were unrelated followups to that initial one).

See also newly added QA test 816.

Contributor

natoscott commented Sep 15, 2015

I've found a simpler reproducer using pmdumptext, and a smaller version of Paul's production data. I've anonymised that archive and committed it as qa/archives/small and added qa/816 which reproduces the problem using pmdumptext.

For reference, I'll shortly add a couple of gists of the full before/after debug logs, showing failure vs success, to help triangulate where things go astray.

I've temporarily backed out the libpcp optimization behind the regression for pcp-3.10.7 (which is about to release in a day or so).

Contributor

natoscott commented Sep 15, 2015

Failing case - https://gist.github.com/natoscott/f8288c218c1e06ede622
Passing case - https://gist.github.com/natoscott/00bd07a41db20e2c6efd

See qa/816 for additional details (and/or the commands at the head of each gist).

Contributor

tallpsmith commented Sep 21, 2015

Thanks Nato. Just downloaded -7 today and it seems to be working a treat! Sanity restored, many thanks.

Contributor

natoscott commented Sep 21, 2015

No problem, thanks Paul.

natoscott closed this Sep 21, 2015
