Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify and optimize the eg_value() extractor. #211

Closed
wants to merge 6 commits into from

Conversation

mstembera
Copy link
Contributor

Speedups vary depending on gcc version but never slower.

i7 mobile, Windows7-64

gcc 4.7.4
Results for 20 tests for each version:

        Base      Test      Diff
Mean    1145602   1145932   -330
StDev   26686     24296     4889

p-value: 0.527

gcc 4.8.2
Results for 20 tests for each version:

        Base      Test      Diff
Mean    1118372   1136727   -18355
StDev   11992     14562     3831

p-value: 1

gcc 4.9.2
Results for 20 tests for each version:

        Base      Test      Diff
Mean    1089072   1091734   -2662
StDev   11955     12205     2827

p-value: 0.827

No functional change.

mstembera added 2 commits January 13, 2015 18:01
Speedups vary depending on gcc version but never slower.

i7 mobile, Windows7-64

gcc 4.7.4
Results for 20 tests for each version:

            Base      Test      Diff
    Mean    1145602   1145932   -330
    StDev   26686     24296     4889

p-value: 0.527

gcc 4.8.2
Results for 20 tests for each version:

            Base      Test      Diff
    Mean    1118372   1136727   -18355
    StDev   11992     14562     3831

p-value: 1

gcc 4.9.2
Results for 20 tests for each version:

            Base      Test      Diff
    Mean    1089072   1091734   -2662
    StDev   11955     12205     2827

p-value: 0.827

No functional change.
@mstembera
Copy link
Contributor Author

I just realized the & 0xFFFFU is not needed either. I will post new timings a bit later.

@lucasart
Copy link

maybe it works with gcc, but what makes you think it is portable?

@mcostalba
Copy link

Thx for the patch, but this really opens a cans of worms. If you look at
past history of that code (using blame with GitHub) you will see a lot of
compatibility issues with MSVC and iirc Intel compiler.

On Wednesday, January 14, 2015, mstembera notifications@github.com wrote:

I just realized the & 0xFFFFU is not needed either. I will post new
timings a bit later.


Reply to this email directly or view it on GitHub
#211 (comment)
.

@zamar
Copy link

zamar commented Jan 14, 2015

At one point we had different implementation for GCC and other compilers, but I wouldn't go back to that UNLESS the speed up is big, let's say >= 2%.

@zamar
Copy link

zamar commented Jan 14, 2015

The patch is not committable as it stands, so I'm closing the pull request.

@zamar zamar closed this Jan 14, 2015
@mstembera
Copy link
Contributor Author

Joona, I'm sorry I wasn't able to respond right away but please at least read what I have to say below before dismissing this patch.
BTW, if anyone finds even one platform where this doesn't work I will immediately concede.


I agree that this needs to be 100% portable if it is to be committed and NOT be a gcc specific implementation either.
I wanted to submit an even simpler patch a long time ago

return Value((int16_t)s);

which also "works" and is deceptively similar but didn't precisely because of portability.
The problem with the above patch is that casting a signed value to a smaller signed value is implementation defined.

However this patch
return Value(int16_t(uint16_t(uint32_t(s))));
never does this. It can be equivalently rewritten as

uint32_t a = uint32_t(s); // signed to unsigned (also used in current implementation)
uint16_t b = uint16_t(a); // keep only the low 16 unsigned bits
int16_t c = int16_t(b); // reinterpret 16 unsigned as 16 signed
Value v = Value(c); // sign extend from 16 to 32 bits
return v;

I believe each of the above steps is portable. Also there is no dependence on big or little endian as one of the past patches had.
More to the point I tested:
AMD PhenomII w/ Ubuntu(14.04) w/ both gcc(4.8.2) and clang(3.4) both 64 and 32 bit compiles.
Intel i7 Windows7-64 gcc(4.7.4, 4.8.2, 4.9.2) and MSVC(2013) both 64 and 32 bit compiles.

Could someone w/ the Intel compiler please give it a go? Also, if you can think of some other platform to test I'm all for it. In light of past portability issues extra caution and scrutiny IS warranted but I also hope that IF everything checks out we can commit because this code is on the hot path and the patch is both a simplification and a speedup.

@zamar zamar reopened this Jan 14, 2015
@zamar
Copy link

zamar commented Jan 14, 2015

Thanks for the response, I've reopened the rpull equest for further discussion.

@mstembera
Copy link
Contributor Author

Thanks. In the mean time I've asked on the forum if anyone could try the Intel compiler.
https://groups.google.com/forum/#!topic/fishcooking/VCC1TFca78s

@glinscott
Copy link
Contributor

We should definitely add a comment here, as people would be tempted to simplify the sequence of 3 casts and break things in the process :).

If it compiles on Intel, looks good! Thanks for testing out so many platforms.

@zamar
Copy link

zamar commented Jan 14, 2015

If we decide to go that way, then for consistency we should consider:

inline Value mg_value(Score s) {
return Value(int16_t(uint16_t(uint32_t(s + 0x8000) >> 16)));
}

@DU-jdto
Copy link

DU-jdto commented Jan 14, 2015

This seems to be functionally equivalent to the original code when compiled in MSVC (number of nodes searched in the bench is unchanged).

EDIT: In fact, the bench remains unchanged even if I remove all of the casts in the new code except the int16_t one.

@mstembera
Copy link
Contributor Author

@joona
Wow that was fast :) I did test
return Value(int16_t(uint16_t(uint32_t(s + (uint32_t(s) & 0x8000U)) >> 16)));
(almost identical to yours) yesterday but didn't include it in the submission because it wasn't faster
(maybe a tiny bit slower but not statistically significant) but from a consistency point of view I completely agree.

@mstembera
Copy link
Contributor Author

@joona
Sorry, your version is simpler and faster. I will post new timings w/ it shortly.

@zamar
Copy link

zamar commented Jan 14, 2015

Hmm... and stylistically it's better use "unsigned" than "uint32_t" as 32-bits means nothing here...
So perhaps my current favourite is:

inline Value mg_value(Score s) {
return Value(int16_t(uint16_t(unsigned(s + 0x8000) >> 16)));
}

inline Value eg_value(Score s) {
return Value(int16_t(uint16_t(unsigned(s))));
}

The only question is that if this is standard compliant?

@mstembera
Copy link
Contributor Author

Yes that makes sense. The original implementation also casts signed and unsigned of the same size back and forth so the two steps that are new is the 32bit unsigned to 16bit unsigned cast and the 16bit signed to 32bit signed sign extend. I can retest the platforms I already tested w/ the above tonight.

@zamar
Copy link

zamar commented Jan 15, 2015

I think that probably strictly according to the standard uint16_t -> int16_t conversion is implementation specific for "negative values".

But then again, it's very likely that all compilers would behave sensibly here, so maybe we can just go ahead with this...

@zamar
Copy link

zamar commented Jan 15, 2015

See: http://en.cppreference.com/w/cpp/language/implicit_cast

If the destination type is signed, the value does not change if the source integer can be represented in the destination type. Otherwise the result is implementation-defined. (Note that this is different from signed integer arithmetic overflow, which is undefined)

@mstembera
Copy link
Contributor Author

To make sure I'm drawing the correct conclusion from your comment. We should be fine correct?
If it turns out to be a show stopper anyway we can still avoid it like this

inline Value mg_value(Score s) {
const union us16 { uint16_t u; int16_t s; } mg = { uint16_t(unsigned(s + 0x8000) >> 16) };
return Value(mg.s);
}

inline Value eg_value(Score s) {
const union us16 { uint16_t u; int16_t s; } eg = { uint16_t(unsigned(s)) };
return Value(eg.s);
}

I think this is uglier though. Speed looks similar. I will post timings.

@mstembera
Copy link
Contributor Author

Timings of Joonas patch against master:

gcc 474
Results for 20 tests for each version:
Base Test Diff
Mean 1114071 1116287 -2216
StDev 25946 26214 3818
p-value: 0.719

gcc 482
Results for 20 tests for each version:
Base Test Diff
Mean 1103317 1123009 -19692
StDev 15890 15481 4077
p-value: 1

gcc 492
Results for 20 tests for each version:
Base Test Diff
Mean 1038425 1042409 -3984
StDev 32922 33162 4728
p-value: 0.8

Union patch against master:
gcc 474
Results for 20 tests for each version:
Base Test Diff
Mean 1179923 1181505 -1582
StDev 15138 15041 2196
p-value: 0.764

gcc 482
Results for 20 tests for each version:
Base Test Diff
Mean 1104881 1124275 -19394
StDev 20022 21807 3472
p-value: 1

gcc 492
Results for 20 tests for each version:
Base Test Diff
Mean 1128140 1130533 -2393
StDev 23758 24802 2220
p-value: 0.859

@mcostalba
Copy link

I strongly suggest
to avoid any known undefined behavior by standard. This is the
only sane approach.

o
On Thursday, January 15, 2015, Joona Kiiski notifications@github.com
wrote:

I think that probably strictly according to the standard uint16_t ->
int16_t conversion is implementation specific for "negative values".

But then again, it's very likely that all compilers would behave sensibly
here, so maybe we can just go ahead with this...


Reply to this email directly or view it on GitHub
#211 (comment)
.

@zamar
Copy link

zamar commented Jan 15, 2015

@mcostalba: Just for clarity: "undefined" != "implementation defined".

@zamar
Copy link

zamar commented Jan 15, 2015

@mstembera: union solution has endianess issues.

@mstembera
Copy link
Contributor Author

@joona
This union should not have endianess issues because the union only has 16 bit members and not a mix of 16 and 32 bit like the past solution.

@mstembera
Copy link
Contributor Author

More info...
The previous union looked like

typedef union {
uint32_t full;
struct { int16_t eg, mg; } half;
} ScoreView;

which was used to split the 4 byte word into two 2 byte chunks which ended up being order dependent based on endianess. The new union is only used to reinterpret the same 2 bytes safely from unsigned to signed w/o using a cast. I don't have any big-endian hardware to test.

@zamar
Copy link

zamar commented Jan 15, 2015

@mstembera: OK. You are right. My bad, I misread the code.

@zamar
Copy link

zamar commented Jan 15, 2015

I think that we will have to go ahead with the uglier version. It's the only fully standard compliant version... Or can someone find a flaw?

inline Value mg_value(Score s) {
const union us16 { uint16_t u; int16_t s; } mg = { uint16_t(unsigned(s + 0x8000) >> 16) };
return Value(mg.s);
}

inline Value eg_value(Score s) {
const union us16 { uint16_t u; int16_t s; } eg = { uint16_t(unsigned(s)) };
return Value(eg.s);
}

@zamar
Copy link

zamar commented Jan 15, 2015

I'll do some benchmarks with different gcc versions as well...

@zamar
Copy link

zamar commented Jan 15, 2015

My speed up results:

gcc-4.7 (1.5%)
gcc-4.8 (0.5%)
gcc-4.9 (1.0%)

Results are statistically meaningful.

I will commit the version below which I believe to be fully standard compliant tomorrow unless someone objects with valid arguments:

inline Value mg_value(Score s) {
const union us16 { uint16_t u; int16_t s; } mg = { uint16_t(unsigned(s + 0x8000) >> 16) };
return Value(mg.s);
}

inline Value eg_value(Score s) {
const union us16 { uint16_t u; int16_t s; } eg = { uint16_t(unsigned(s)) };
return Value(eg.s);
}

@glinscott
Copy link
Contributor

Nice! Looks good.

@mstembera
Copy link
Contributor Author

One possibility to mitigate the ugliness is to define the union once outside since we use it in both extractors like this.

union us16 { uint16_t u; int16_t s; };

inline Value mg_value(Score s) {
const us16 mg = { uint16_t(unsigned(s + 0x8000) >> 16) };
return Value(mg.s);
}

inline Value eg_value(Score s) {
const us16 eg = { uint16_t(unsigned(s)) };
return Value(eg.s);
}

It's just a matter of taste though and I have no actual preference. Thanks for all your time Joona!

@mcostalba
Copy link

I think us16 pollutes the global namespace, I'd suggest:

union ValueExtractor { uint16_t u; int16_t s; };

Moreover, once we move to c++11 we could write (not tested):

inline Value mg_value(Score s) {
return ValueExtractor{ uint16_t(unsigned(s + 0x8000) >> 16) }.s;
}

@mcostalba
Copy link

Alternatively also unnamed unions can work:

inline Value mg_value(Score s) {
union { uint16_t u; int16_t s; } mg = { uint16_t(unsigned(s + 0x8000) >> 16) };
return Value(mg.s);
}

inline Value eg_value(Score s) {
union { uint16_t u; int16_t s; } eg = { uint16_t(unsigned(s)) };
return Value(eg.s);
}

And perhaps is the best solution in C++03

@zamar
Copy link

zamar commented Jan 16, 2015

OK. Let's use unnamed unions.

@mstembera
Copy link
Contributor Author

I like this also. Thanks Marco.

@zamar zamar closed this in 58fdb84 Jan 16, 2015
@mstembera
Copy link
Contributor Author

mstembera commented May 1, 2016

Here is an interesting way to write the mg_value() extractor using an Endian dependent union but avoiding the issue by using another Endian dependent union to do the indexing.

const union { uint32_t u32; struct { uint16_t high, low; }; } endianIdx = { 1 };

inline Value mg_value(Score s) {
    union { int32_t s32; int16_t s16[2]; } mg = { s + 0x8000 };
    return Value(mg.s16[endianIdx.high]);
}

It's not any faster but may be useful in dealing with other Endian issues in the future.

niklasf pushed a commit to niklasf/Stockfish that referenced this pull request Jan 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants