Permalink
Browse files

2x speedup from __builtin_prefetch

  • Loading branch information...
naftaliharris committed Nov 8, 2013
1 parent 41b2b38 commit 318c3aa54339f09c24d35d269d6beaa43b6b6e98
Showing with 6 additions and 0 deletions.
  1. +6 −0 lazysorted.c
@@ -42,6 +42,11 @@
#define Py_TYPE(ob) (((PyObject*)(ob))->ob_type)
#endif
/* Macros to support different compilers */
#if !(defined(__GNUC__) || defined(__clang__))

This comment has been minimized.

@cemeyer

cemeyer Nov 14, 2013

Why not just #ifndef __builtin_prefetch to make room for future compilers with GCC intrinsic support?

This comment has been minimized.

@naftaliharris

naftaliharris Nov 15, 2013

Owner

Will that work? I wasn't sure... if so, then what you propose is definitely better.

This comment has been minimized.

@cemeyer

cemeyer Nov 15, 2013

Worth a shot ;-).

This comment has been minimized.

@douglasbagnall

douglasbagnall Nov 15, 2013

No, it won't work. __builtin_prefetch is a built-in function, not a preprocessor macro. The preprocessor, which handles #ifndef, knows nothing about it.

#define __builtin_prefetch(x)
#endif
/* Definitions and functions for the binary search tree of pivot points.
* The BST implementation is a Treap, selected because of its general speed,
* especially when inserting and removing elements, which happens a lot in this
@@ -709,6 +714,7 @@ partition(LSObject *ls, Py_ssize_t left, Py_ssize_t right)
Py_ssize_t i;
for (i = left + 1; i < right; i++) {
__builtin_prefetch(ob_item[i+3]);

This comment has been minimized.

@jzwinck

jzwinck Nov 15, 2013

Is it really safe to assume that ob_item[i+3] is a valid item to reference? It seems like this could result in undefined behavior when "right" (the argument to this function) is near the end of the array--going three past it will surely be invalid. GCC's documentation on __builtin_prefetch explicitly disallows dereferencing invalid addresses as an argument to the intrinsic.

This comment has been minimized.

@mattspatola

mattspatola Nov 15, 2013

It's gonna depend on exactly what ob_item actually is, of course, but, in general, this doesn't look safe judging by the docs. Adding a check should be pretty cheap, especially with hinting if needed, but I haven't profiled, so who knows. The other obvious option is only going to right - 4 and handling the rest outside the loop, which likely won't have much detriment, but of course that depends on what the compiler does. I do wonder what the best general solution would be.

This comment has been minimized.

@cemeyer

cemeyer Nov 15, 2013

Just allocate 3 extra elements in your array. Problem solved.

This comment has been minimized.

@douglasbagnall

douglasbagnall Nov 15, 2013

You lot are misreading the documentation:

Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid.
For example, a prefetch of p->next does not fault if p->next is not a valid address, but evaluation faults
if p is not a valid address.

This is the highlighted case. It does not fault.

This comment has been minimized.

@mattgodbolt

mattgodbolt Nov 15, 2013

I do still think this is an issue. As the processor accesses memory at ob_item[i+3] it's reading the address of the pointer to prefetch. Then it goes and prefetches it. The read of the pointer address itself can fault. Consider the case where ob_item points to a 4K page (with no valid page directly behind it). If pointers are 8 bytes then if i == 509, then the prefetch(obj_item[i+3]) will segfault. (The prefetch instrinsic will likely compile to a read of the pointer, followed by the actual prefetcht0 instruction. The pointer read is where the fault will happen. See, e.g. http://url.godbolt.org/weyep )

This comment has been minimized.

@douglasbagnall

douglasbagnall Nov 15, 2013

@mattgodbold et. al.: oh yes. sorry.

This comment has been minimized.

@naftaliharris

naftaliharris Nov 15, 2013

Owner

Ok, we should fix this--preferably by just iterating up to right - 3 with prefetch, and then the last 3 elements without. Possibly you don't even need to check if the list has fewer than three elements, since partition might only be called for lists with more than 8 or 16 elements or something, (but I don't recall).

If one of you wants to submit a PR to fix this, that would be much appreciated, otherwise I can just fix this tomorrow morning. :-)

IFLT(ob_item[i], pivot) {
last_less++;
SWAP(i, last_less);

0 comments on commit 318c3aa

Please sign in to comment.