Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Fix two instances of memcpy with overlapping source and destination. #324

Merged
merged 4 commits into from Jul 11, 2012

Conversation

Projects
None yet
4 participants
Contributor

walshb commented Jun 27, 2012

Found using valgrind -- but looks like Nathaniel Smith already worked that one out by intuition:

http://mail.scipy.org/pipermail/numpy-discussion/2012-June/063010.html

There are a few more memcpy calls in that file, but I think they're innocent until proven guilty.

This pull request passes (merged 2014f27 into e15d0bd).

Contributor

walshb commented Jun 27, 2012

Seems to work ok (or maybe I just got lucky with the 2.4 build :-) @njsmith

Owner

njsmith commented Jun 28, 2012

Correctness is more important than speed, so technically speaking, memcpy should be considered guilty until proven innocent... generally by code inspection to be certain that the memory ranges cannot possibly overlap. That's pretty hard to do in this case, since these are low-level generic routines, and it's easy to imagine that we just haven't run a test case with the right wacky array layout.

@mwiebe, do you know reasons to trust that the remaining memcpy's in lowlevel_strided_loops.c.src are safe?

Contributor

walshb commented Jun 28, 2012

It looks like the other memcpy's are each copying a single element, so it's unlikely they would cause a problem (two atomic elements of arrays won't overlap).

Contributor

walshb commented Jun 28, 2012

Just noticed that this memcpy problem occurs less often than you'd expect, due to a check in array_asign_array.c:

68 /*
69 * Overlap check for the 1D case. Higher dimensional arrays and
70 * opposite strides cause a temporary copy before getting here.
71 */
72 if (ndim == 1 && src_data < dst_data &&
73 src_data + shape_it[0] * src_strides_it[0] > dst_data) {
74 src_data += (shape_it[0] - 1) * src_strides_it[0];
75 dst_data += (shape_it[0] - 1) * dst_strides_it[0];
76 src_strides_it[0] = -src_strides_it[0];
77 dst_strides_it[0] = -dst_strides_it[0];
78 }

This assumes that memcpy will work when we're copying with overlapping but to a lower address in memory, which is a bad assumption to make (although in practice we get away with it). We could probably remove this logic and just let memmove do the business.

Of course this doesn't answer the question of how we get the problems occasionally at the moment. Will keep looking...

Owner

njsmith commented Jun 28, 2012

Do we ever byteswap an array in-place? That looks like it would produce a memcpy with src==dst. I guess most memcpys will treat this is a no-op, but it's hard to know for sure -- technically it invokes undefined behaviour, so if there's some way that compiler optimizers can screw this up then they probably will.

The ones on lines 805 and 841 (in @prefix@_cast_@name1@_to_@name2@) are clearly safe, though.

Owner

njsmith commented Jun 28, 2012

Ah, yes, we do have a primitive for byte-swapping in-place: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.byteswap.html

Owner

njsmith commented Jun 28, 2012

The non-deterministic failures are probably because gcc/glibc on the Travis-CI machines is deciding whether memcpy should go forward or backwards based on some arcane internal logic that includes things like what memory alignment malloc happened to give us.

This assumes that the logic in array_assign_array.c is buggy, though. It must be. However, we can't just remove it without careful examination -- if an array assignment requires multiple calls to one of the lowlevel loops, then it still needs to make sure to perform those multiple calls in the correct order. So that code needs to be audited/debugged if we want to fix this properly.

Member

mwiebe commented Jun 28, 2012

I think this test failure can be pinned just on _contig_to_contig memcpy, the other one is a memcpy of a single element just as the other cases in the file are. The symptom I would expect if a single element memcpy was screwing up is some kind of value corruption.

The general logic being used for overlapping assignments is to allow the one-dimensional case with strides of matching sign to pass to the inner loop functions, as they can handle it well (except for bugs like using memcpy instead of memmove), and making a temporary copy of the input in more complicated cases.

Member

mwiebe commented Jun 28, 2012

Incidentally, this looks like the same kind of thing as the flash bug on linux 64 that caused bad audio: https://bugzilla.redhat.com/show_bug.cgi?id=638477

The assumption is that memcpy copies in a forward direction, which was the case in all implementations prior to the one that triggered the flash bug.

Owner

njsmith commented Jun 28, 2012

Right, this is exactly the same issue that Adobe flash ran into. I asked about this in #travis, and the current hypothesis is that for whatever reason, when glibc is running on their "ppp1" machine it does memcpy backwards, and when it runs on their "ppp2" or "ppp3" machines then memcpy goes forwards. They gave me a copy of /proc/cpuinfo for each of those machines in case it gives a clue: https://gist.github.com/481db073dacdd0036ad2
(Weirdly, ppp1 and ppp3 appear to have almost identical CPUs, the only difference is the presence of a "microcode" line, and I don't know what that means.)

Unfortunately travisbot did not test this PR on the ppp1 machine, so if this hypothesis is correct then the PR actually been tested properly. They didn't have any suggestions for how to test this better except to keep pushing and waiting until a build gets assigned to 'ppp1', though.

@mwiebe: Right, I doubt that we're actually seeing memcpy screwing up on "no-op" copies, but now that we've bumped into this once (and seen what a hassle it is to test and debug!), I'd rather fix the underlying problem for good. So if we think memcpy might possibly in the future on some machine somewhere screw up on "no-op" copies (and doing so would be totally compliant with the C spec), then IMHO we should just switch them all to memmove now, rather than risk mangling numpy users' data.

Contributor

walshb commented Jun 28, 2012

@njsmith : Re: my "occasional" comment: I thought there was some extra logic in numpy itself causing problems, but this wasn't the case.

Re: replacing memcpy with memmove: agreed, we should be conservative and replace memcpy with memmove if we think that it may ever be called with src = dst.

I'm happy to make this change. Also I'd really like to remove the 1D code from array_assign_array.c. I'm not sure what other checks you'd like to perform on the code; it seems to be copying correctly in the 2D or more case. I've added some more extensive tests at least.

I wonder if valgrind is available on these "travis" machines...?

Contributor

walshb commented Jun 28, 2012

I've just been checking and in my test cases for the 2D (or more) case, PyArray_AssignArray is taking a copy before doing the assignment, as intended. This means that in this case the assignment will be ok.

This pull request passes (merged 4f03448 into e15d0bd).

This pull request fails (merged d2c8843 into e15d0bd).

Contributor

walshb commented Jul 5, 2012

@njsmith @mwiebe : The last Travis build failed because somebody powered down the machine (!) but apart from that I think it's fine.

Owner

njsmith commented Jul 6, 2012

Looks that way. Were we going to remove the array_assign_array.c 1-D overlap check too?

Contributor

walshb commented Jul 9, 2012

I'd like to remove the 1D overlap check when memmove is used, but we still have to do the "reverse direction of copy" in the case that memmove is not being used (ie. we're doing the striding ourselves). This would require moving some code around, and it seems a bit like overkill for this PR. I'd rather have a poke around and submit it as a new pull request.

Owner

njsmith commented Jul 11, 2012

Oh, I see, I thought we only did that on 2d arrays. In that case, I agree.

@njsmith njsmith added a commit that referenced this pull request Jul 11, 2012

@njsmith njsmith Merge pull request #324 from walshb/fix_memcpy_overlap
Use memmove instead of memcpy for strided copies, since src and dst can overlap.
0920bed

@njsmith njsmith merged commit 0920bed into numpy:master Jul 11, 2012

@takluyver takluyver pushed a commit to takluyver/pandas that referenced this pull request Nov 19, 2013

@y-p y-p BUG: workaround numpy issue in 1.6.2 on certain distros
pandas issue #2490
numpy/numpy#324

manifests only on certain distors, probably libc dependent.
debian wheezy, libc 2.13-37 is affected
ubuntu precise 2.15-0ubuntu10.2 is not

These are not necessarily the earliest/latest package
revisions addected/not affected respectively.
6cadd6c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment