Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elliptic Curve code without ASM is broken with Clang 14 #18225

Closed
jpboivin opened this issue May 2, 2022 · 9 comments
Closed

Elliptic Curve code without ASM is broken with Clang 14 #18225

jpboivin opened this issue May 2, 2022 · 9 comments
Labels
branch: master branch: 1.1.1 branch: 3.0 triaged: bug

Comments

@jpboivin
Copy link
Contributor

jpboivin commented May 2, 2022

Hello,

It looks like on Linux (x86_64), starting with Clang 14, the code for Elliptic Curves (when assembler is not enabled) is broken when compiled with optimizations (O2/O3). It is totally fine when using assembler, but when using regular C code, some curves will fail in OpenSSL internals. It doesn't seem to be new (I tested a few older versions) and it seems to be also affecting the 3.0.x branch although I have personally experienced it on the 1.1.1 branch.

From running the tests, it seems to be only affecting a few curves (like secp384r1, prime192v1, prime256v1) and it seems to be that some validation does not pass, potentially due to bad codegen. I don't know if OpenSSL has some undefined behaviors in that section of the code as I haven't investigated yet. But, it was working well up to Clang 13, and GCC 11 is good too.

I've tested and generated logs for the following configs:

  • 1.1.1n | Clang 14 | No ASM (broken)
  • 1.1.1n | Clang 14 | ASM (working)
  • 1.1.1n | GCC 11 | No ASM (working)

All these tests were done in a Docker container based on Ubuntu 22.04. I've also tested 1.1.1k | Clang 14 | No ASM and 3.0.2 | Clang 14 | No ASM which reproduce the exact same issue, so didn't generated logs for them.

config_1_1_1n_noasm_clang14.txt
tests_1_1_1n_noasm_clang14.txt
tests_verbose_1_1_1n_noasm_clang14.txt

config_1_1_1n_asm_clang14.txt
tests_1_1_1n_asm_clang14.txt

config_1_1_1n_noasm_gcc.txt
tests_1_1_1n_noasm_gcc.txt

@jpboivin jpboivin added the issue: bug report label May 2, 2022
@paulidale
Copy link
Contributor

paulidale commented May 3, 2022

This looks to be a problem with Clang optimising better in version 14. Compiling with -fno-strict-aliasing seems to address the issue. Without deeper investigation, it isn't clear if this is our bug or Clang's.

@paulidale paulidale added triaged: bug and removed issue: bug report labels May 3, 2022
@t8m t8m added branch: master branch: 1.1.1 branch: 3.0 labels May 3, 2022
@jpboivin
Copy link
Contributor Author

jpboivin commented May 3, 2022

Thanks @paulidale for pointing out that it was not an issue when disabling strict aliasing. I've identified crypto/bn/bn_nist.c as being the file causing the issues. I'll see if I can identity the actual problematic code and update this ticket with my findings.

@jpboivin
Copy link
Contributor Author

jpboivin commented May 3, 2022

The issue seems to be around the function nist_cp_bn_0, which is called by all BN_nist_mod_XYZ functions except for BN_nist_mod_224 (likely why the 224 curve isn't affected). I don't see anything wrong with the nist_cp_bn_0 function, which is basically a manually written loop to memcpy + memset. It doesn't seem to violate strict aliasing rules from what I've seen as both the source and destination are BN_ULONG arrays.

I'm more a C++ programmer than a C programmer, so I might have missed something that is specific to C, but it can points you toward which section of the code is problematic. The rest of the BN_nist_mod_XYZ functions seem fine and the codegen doesn't seem to change that much except around the nist_cp_bn_0 call.

I've compared the generated code against when strict aliasing is disabled and it seems that the whole copy is discarded (and additional operations from the same else block?). So it might be a compiler bug. I've been able to make it generate seemingly proper assembly by rewriting the copy operation with an explicit call to memcpy and memset instead of the loop.

bn_nist_strict_aliasing.txt
bn_nist_no_strict_aliasing.txt
bn_nist_strict_aliasing_with_workaround.txt

I've dumped the preprocessed source file for easier debugging if it can help you:
bn_nist_preprocessed.txt

Just need to pass these additional arguments to the compiler to match the OpenSSL buildsystem: -fPIC -m64 -O3

@paulidale
Copy link
Contributor

paulidale commented May 4, 2022

Great work. I agree that function should have no aliasing problems.

I've raised an issue with the LLVM folks.

@davidben
Copy link
Contributor

davidben commented May 5, 2022

I'm not sure that's right. I think there is a strict aliasing violation, though you should check with language experts. C's strict aliasing and union rules, as I understand them, are quite harsh. While there is a carve-out in C (but not C++, which is even stricter) for using unions to do type-punning, that only applies when you read or write directly to a union. Writing through a pointer to an inactive arm is UB.

See example 3 of 6.5.2.3 in n1256. (Whether these are the right rules is another matter. I doubt there is much C code out there that uses unions which is well-defined w.r.t. all this.)

Likewise, the strict aliasing rules (6.5, clause 7) doesn't include an allowance for unions in the effective type of the object, only the type of the lvalue expression (which doesn't involve a union).

@paulidale
Copy link
Contributor

paulidale commented May 5, 2022

Agreed, example 3 is damning.
It's also different....

paulidale added a commit to paulidale/openssl that referenced this issue May 6, 2022
As of clang-14 the strict aliasing is causing code to magically disappear.
By explicitly inlining the code, the aliasing problem evaporates.

Fixes openssl#18225
@paulidale
Copy link
Contributor

paulidale commented May 6, 2022

This looks like the fundamental issue is if the array in the union is converted to a pointer or accessed directly as an array. If it is converted, the behaviour is undefined. The patch accesses the union as an array which solves this.

Interesting is that using memcpy/memset which both convert the array to a pointer, doesn't break the code.

At this point I'm satisfied that the problem is in the OpenSSL code. I'm a little miffed that there were no warnings about undefined behaviour ever.

openssl-machine pushed a commit that referenced this issue May 10, 2022
As of clang-14 the strict aliasing is causing code to magically disappear.
By explicitly inlining the code, the aliasing problem evaporates.

Fixes #18225

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Tim Hudson <tjh@openssl.org>
Reviewed-by: Matt Caswell <matt@openssl.org>
(Merged from #18258)

(cherry picked from commit 8712db5)
agl pushed a commit to google/boringssl that referenced this issue May 10, 2022
When introducing EC_SCALAR and EC_FELEM, I used unions as convenience
for converting to and from the byte representation. However,
type-punning with unions is not allowed in C++ and hard to use correctly
in C. As I understand the rules, they are:

- The abstract machine knows what member of union was last written to.

- In C, reading from an inactive member is defined to type-pun. In C++,
  it is UB though some compilers promise the C behavior anyway.

- However, if you read or write from a *pointer* to a union member, the
  strict aliasing rule applies. (A function passed two pointers of
  different types otherwise needs to pessimally assume they came from
  the same union.)

That last rule means the type-punning allowance doesn't apply if you
take a pointer to an inactive member, and it's common to abstract
otherwise direct accesses of members via pointers.

openssl/openssl#18225 is an example where
similar union tricks have caused problems for OpenSSL. While we don't
have that code, EC_SCALAR and EC_FELEM play similar tricks.

We do get a second lifeline because our alternate view is a uint8_t,
which we require to be unsigned char. Strict aliasing always allows the
pointer type to be a character type, so pointer-indirected accesses of
EC_SCALAR.bytes aren't necessarily UB. But if we ever write to
EC_SCALAR.bytes directly (and we do), we'll switch the active arm and
then pointers to EC_SCALAR.words become strict aliasing violations!

This is all far too complicated to deal with. Ideally everyone would
build with -fno-strict-aliasing because no real C code actually follows
these rules. But we don't always control our downstream consumers'
CFLAGS, so let's just avoid the union. This also avoids a pitfall if we
ever move libcrypto to C++.

For p224-64.c, I just converted the representations directly, which
avoids worrying about the top 32 bits in p224_felem_to_generic. Most of
the rest was words vs. bytes conversions and boils down to a cast (we're
still dealing with a character type, at the end of the day). But I took
the opportunity to extract some more "words"-based helper functions out
of BIGNUM, so the casts would only be in one place. That too saves us
from the top bits problem in the bytes-to-words direction.

Bug: 301
Change-Id: I3285a86441daaf824a4f6862e825d463a669efdb
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52505
Commit-Queue: Bob Beck <bbe@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
thesamesam added a commit to thesamesam/openssl that referenced this issue Jun 24, 2022
Notably, this might have caught openssl#18225, as Clang 14 wasn't - and is not yet
until this commit - in OpenSSL's CI.

It makes sense to ensure CI tests compilers used in newer Linux distributions:
* Fedora 36 ships with GCC 12
* Ubuntu 22.04 ships with Clang 14

We switch from 'ubuntu-latest' (which can change meaning but currently points
to ubuntu-20.04) to ubuntu-20.04 for the older existing compilers, and
ubuntu-22.04 for the newer ones added by this commit.

CLA: trivial
openssl-machine pushed a commit that referenced this issue Jun 27, 2022
Notably, this might have caught #18225, as Clang 14 wasn't - and is not yet
until this commit - in OpenSSL's CI.

It makes sense to ensure CI tests compilers used in newer Linux distributions:
* Fedora 36 ships with GCC 12
* Ubuntu 22.04 ships with Clang 14

We switch from 'ubuntu-latest' (which can change meaning but currently points
to ubuntu-20.04) to ubuntu-20.04 for the older existing compilers, and
ubuntu-22.04 for the newer ones added by this commit.

CLA: trivial

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #18639)
openssl-machine pushed a commit that referenced this issue Jun 27, 2022
Notably, this might have caught #18225, as Clang 14 wasn't - and is not yet
until this commit - in OpenSSL's CI.

It makes sense to ensure CI tests compilers used in newer Linux distributions:
* Fedora 36 ships with GCC 12
* Ubuntu 22.04 ships with Clang 14

We switch from 'ubuntu-latest' (which can change meaning but currently points
to ubuntu-20.04) to ubuntu-20.04 for the older existing compilers, and
ubuntu-22.04 for the newer ones added by this commit.

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #18639)
@boris-kolpackov
Copy link

boris-kolpackov commented Aug 3, 2022

Are there plans to fix this for the 1.1.1 release?

mattcaswell pushed a commit to mattcaswell/openssl that referenced this issue Aug 3, 2022
As of clang-14 the strict aliasing is causing code to magically disappear.
By explicitly inlining the code, the aliasing problem evaporates.

Fixes openssl#18225

Backport of openssl#18528 to 1.1.1.
mattcaswell pushed a commit to mattcaswell/openssl that referenced this issue Aug 3, 2022
As of clang-14 the strict aliasing is causing code to magically disappear.
By explicitly inlining the code, the aliasing problem evaporates.

Fixes openssl#18225

Backport of openssl#18258 to 1.1.1.
@mattcaswell
Copy link
Member

mattcaswell commented Aug 3, 2022

Are there plans to fix this for the 1.1.1 release?

I created a backport or the fix in #18948

openssl-machine pushed a commit that referenced this issue Aug 17, 2022
As of clang-14 the strict aliasing is causing code to magically disappear.
By explicitly inlining the code, the aliasing problem evaporates.

Fixes #18225

Backport of #18258 to 1.1.1.

Reviewed-by: Hugo Landau <hlandau@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #18948)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch: master branch: 1.1.1 branch: 3.0 triaged: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants