Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime SSE4.2 detection: #20

Closed
wants to merge 2 commits into from
Closed

Conversation

isage
Copy link

@isage isage commented Apr 26, 2018

As per discussion in #39

Detect presence of SSE4.2 as early as possible
and choose between SSE and non-SSE string hashing functions at runtime.

This allows to build luajit without -msse4.2, but still get a performace gain
on supported hardware.

(benchmark.cxx is currently broken, and travis.yml needs changing too)

Detect presence of SSE4.2 as early as possible
and choose between SSE and non-SSE string hashing functions at runtime.

This allows to build luajit without -msse4.2, but still get a performace gain
on supported hardware.
@yangshuxin
Copy link

yangshuxin commented Apr 27, 2018

What motivate this change? SSE 4.1 is 10+ years old. I thought most intel-CPU based (including Atom) servers are supporting SSE4.1

I personally deeply hate inline asm for couple of reasons:

  1. inline asm has negative impact to performance.

    A static compiler will consider inline-asm as a black box, and make a very conservative assumption about its side-effect. And such assumption has negative impact to compiler's middle-end (like scalar opt) optimization.

    A built-in function is a different story. it has well-defined side-effect...

  2. error-prone
    this is quite obvious

  3. poor compatibility
    An instruction may shine in old generation, and could be deprecated in newer one. CPU vendors likely to ensure compatiability at architecture level (i.e ISA level); it does not make big sense to be at micro-architecture level.

  4. poor portability
    built-in functions are supported by multiple architectures, while inline-asm only support one architecture. On top of that, different compiler has diff way to compile inline asm (but many existing compilers claim to be gcc compatible with this regard).

Being portable is not a matter of black-and-white, it's matter of degree. Although the original code isn't very portable, I would say it's more portable than your change...

While I used to make a living by playing with asm instruction, it does not prevent me to view inline-asm as an eyesore ....

I guess maybe we can avoid using inline-asm by multi-versioning the hash function. The concept is sketched bellow:

if is_sse41_supported()
origin_hash_func()
else
hash_sse41_func()

The hash_sse41_func() is compiled with sse4.1 support, while the rest src code don't. I think they have no problem to be lined together (as long as you are not using O4/LTO optimization).

In short, I personally hate the idea of inline-asm. I would leave the trade-off
between inline-asm, usefulness of this change (and other aspects) to the community:-)

@isage
Copy link
Author

isage commented Apr 27, 2018

What motivate this change? SSE 4.1 is 10+ years old. I thought most intel-CPU based (including Atom) servers are supporting SSE4.1

Keywords being: most, intel and servers. Also, we need SSE4.2, not 4.1. My dev computer is AMD-based without SSE4.2 support, and, looking at original issue - i'm not alone.

I personally deeply hate inline asm for couple of reasons
Well, i do hate it too. Keep in mind, that's it's not a ready-to-merge pull-request, just a quickly thrown together proof-of-concept (see original issue).

And yeah, you right: proper way would be to separate sse4.2-optimized hashing into stand-alone source (not header like now) file. And build it and only it with -msse4.2 and only on x86.

@isage
Copy link
Author

isage commented Apr 27, 2018

I have some improvements for this (e.g. using gcc intrinsics instead of asm), but sadly my test box with sse4.2 is down atm, so it'll have to wait.

@isage
Copy link
Author

isage commented May 8, 2018

Okay, sorry for the delay. I've revorked runtime detection to be closer to the original code and to use gcc intrinsics.
Now, because of that, luajit needs to be explicitly built with TARGET_CFLAGS=-msse4.2. Or, we should enable it by default on x86_64.

@isage
Copy link
Author

isage commented May 8, 2018

BTW, i've noticed pull-request for arm crc support. We can use the same technique there (just set an additional flag for arm >= 8.1)

@isage
Copy link
Author

isage commented Jan 14, 2019

It would be nice to at least hear some feedback on this.

@agentzh
Copy link
Member

agentzh commented Feb 4, 2019

It's on our TODO list. Sorry for the delay. @thibaultcha will you follow up this PR please? Thanks!

@agentzh
Copy link
Member

agentzh commented Feb 4, 2019

@isage I really appreciate your work on this and this is indeed very important. I really like to get this merged as soon as we can manage. Hopefully the next OpenResty release can have this.

@satchamo
Copy link

If you look at the review pages on NewEgg for the old AMD Phenom chips, you'll see we aren't the only people who are still using these after nearly a decade. Phenoms are the microprocessor equivalent of a 1995 Toyota Corolla with 300K miles on the odometer. This change will be valuable to many developers who use old equipment, and I hope it makes it into the mainline.

@agentzh
Copy link
Member

agentzh commented Feb 23, 2019

Yeah, sure. It will be part of the next next OpenResty release for sure. The next OpenResty release's merge window is now already closed. We'll get OpenResty 1.15.8.1 out first, in the next couple of weeks.

@agentzh
Copy link
Member

agentzh commented Aug 2, 2019

@siddhesh It'll be great if you can have a look at this one too :) Thanks!

@siddhesh
Copy link
Collaborator

siddhesh commented Aug 6, 2019

I think the right way to do this would be to always build lj_str_sse_hash with -msse4.2 on x86_64 and call it only when JIT_F_SSE4_2 is set in the flags. To make this happen, lj_str_hash_x64.h will need to built as a proper source file and with -msse4.2. So here's what will look perfect IMO:

  1. Move src/x64/src/lj_str_hash.h to src/x64/src/lj_str_hash.c. This should ideally move back into the toplevel src to make things ready for Arm64 specific support for lj_new_str random hash optimization #21, but that can be done in a separate change.
  2. Add makefile flags to build only this file with -msse4.2
  3. Make the rest of the changes and call lj_str_sse_hash only if JIT_F_SSE4_2 is set in flags.

@agentzh
Copy link
Member

agentzh commented Sep 2, 2019

@siddhesh Do you mind creating a separate PR to incorporate all the changes you proposed? Many thanks!

@agentzh
Copy link
Member

agentzh commented Oct 27, 2020

Obsoleted by PR #75.

@agentzh agentzh closed this Oct 27, 2020
@isage
Copy link
Author

isage commented Oct 27, 2020

Hooray!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants