
Fast isSquare / Legendre symbol / Jacobi symbol - 16.8x acceleration #95

Merged
merged 5 commits into main from pse-pr-legendre on Oct 31, 2023

Conversation

mratsim
Contributor

@mratsim mratsim commented Oct 13, 2023

Upstreaming on behalf of @AlekseiVambol - original: taikoxyz#3

This implements an optimized isSquare / Legendre symbol as suggested in #73 (comment).

Performance

The performance improvement is an incredible 16.8x acceleration over exponentiation by (p-1)/2.

[benchmark screenshot]

API direction

At the moment the procedure is named `jacobi`.

PR #77 introduced the Legendre trait.

  1. AFAIK it was introduced because exponentiation by (p-1)/2 needed to abstract away the use of curve-specific modulus constants.
    In that case, the trait can be deleted; it is recent enough that breakage should be minimal, if any (cc @huitseeker).
  2. Alternatively, the legendre procedure can call jacobi instead and the Legendre trait can be preserved, as sketched below. Though I'm curious what the use case for a trait rather than a function would be in that case.
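A rough sketch of option 2, with hypothetical names and signatures (illustrative only, not the crate's actual trait):

```rust
/// Illustrative only: a `Legendre` trait whose default methods forward to a
/// fast Jacobi-symbol routine supplied by the implementing field type.
trait Legendre {
    /// Fast Jacobi symbol of `self` against the field modulus: -1, 0 or 1.
    fn jacobi(&self) -> i64;

    /// For a prime modulus the Jacobi symbol coincides with the Legendre
    /// symbol, so the trait can simply delegate to `jacobi`.
    fn legendre(&self) -> i64 {
        self.jacobi()
    }

    /// Convenience predicate, e.g. for hash-to-curve.
    fn is_quadratic_non_residue(&self) -> bool {
        self.legendre() == -1
    }
}
```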

Implementation

After extensive testing of both Bernstein-Yang's and Pornin's algorithms, it turns out that:

References:

mratsim added a commit to taikoxyz/halo2curves that referenced this pull request Oct 13, 2023
@mratsim
Contributor Author

mratsim commented Oct 13, 2023

Actually, the speedup may not be 16.8x but only 10x; quoting from the original PR:

The Legendre symbol is computed by the added method 10.1 times faster than by the previously existing one based on Euler's criterion.

Note: According to the test bench in /benches/bn256_field.rs, the aforesaid performance advantage is more than 20 times, which is incorrect. The reason for this behavior of the test bench has not been found yet.
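For context, the baseline here is Euler's criterion: compute a^((p-1)/2) mod p and map 1 to +1, p-1 to -1, and 0 to 0. A minimal word-sized sketch (illustrative only; the library itself works on 256-bit Montgomery-form field elements):

```rust
/// Legendre symbol of `a` modulo an odd prime `p` via Euler's criterion:
/// a^((p-1)/2) is 1 for quadratic residues, p-1 for non-residues, 0 if p | a.
fn legendre_by_euler(a: u64, p: u64) -> i64 {
    // Modular exponentiation by squaring, using u128 to avoid overflow.
    fn pow_mod(base: u64, mut exp: u64, m: u64) -> u64 {
        let m = m as u128;
        let mut b = base as u128 % m;
        let mut acc: u128 = 1;
        while exp > 0 {
            if exp & 1 == 1 {
                acc = acc * b % m;
            }
            b = b * b % m;
            exp >>= 1;
        }
        acc as u64
    }
    match pow_mod(a, (p - 1) / 2, p) {
        0 => 0,
        1 => 1,
        _ => -1, // the only other possible value is p - 1, i.e. -1 mod p
    }
}
```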

@han0110 han0110 self-requested a review October 16, 2023 08:18
@CPerezz CPerezz requested review from CPerezz and kilic October 17, 2023 09:39
@huitseeker
Contributor

huitseeker commented Oct 18, 2023

PR #77 introduced the Legendre trait.

AFAIK it was introduced because exponentiation by (p-1)/2 needed to abstract away the use of curve-specific modulus constants.
In that case, the trait can be deleted; it is recent enough that breakage should be minimal, if any (cc @huitseeker).

To answer this part specifically: there should be no breakage, since PR #77 was not part of any released version of halo2curves (the latest being 0.4.0).

Though that should not be taken as a sign of support for a slow release schedule on my part!

github-merge-queue bot pushed a commit that referenced this pull request Oct 19, 2023
* Bernstein yang modular multiplicative inverter (#2)

* rename similar to #95

---------

Co-authored-by: Aleksei Vambol <77882392+AlekseiVambol@users.noreply.github.com>

@han0110 han0110 left a comment


LGTM! Thanks again for improving the library.

Although I still don't fully understand the reason for i being 30, I don't want to block the PR. The other parts can be cross-referenced against Computing the Jacobi symbol using Bernstein-Yang and Optimized Binary GCD for Modular Inversion.

Some additional small notes, perhaps for other reviewers:

  1. `(b ^ b >> 1) & 2 == 2` is equivalent to `b % 8 == 3 || b % 8 == 5` when `b` is odd (a quick check is sketched after this list).
  2. This Python version might be the most unoptimized (and therefore easiest to follow) rendition of the algorithm: https://gist.github.com/robot-dreams/ceb00162b80384f2ae1913aaa2b35e75 (also quoted in Perf: isSquare - constant-time Jacobi/Kronecker/Legendre symbol using fast GCD mratsim/constantine#199)
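A quick exhaustive check of note 1 (not part of the PR, just a sanity test over all odd values below 256):

```rust
fn main() {
    for b in (1u64..256).step_by(2) {
        let bit_trick = (b ^ (b >> 1)) & 2 == 2;
        let residue = b % 8 == 3 || b % 8 == 5;
        assert_eq!(bit_trick, residue, "mismatch at b = {b}");
    }
    println!("the identity holds for every odd b below 256");
}
```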

@AlekseiVambol
Contributor

AlekseiVambol commented Oct 19, 2023

LGTM! Thanks again for improving the library.

Although I still don't fully understand the reason for i being 30, I don't want to block the PR. The other parts can be cross-referenced against Computing the Jacobi symbol using Bernstein-Yang and Optimized Binary GCD for Modular Inversion.

Some additional small notes, perhaps for other reviewers:

1. `(b ^ b >> 1) & 2 == 2` is equivalent to `b % 8 == 3 || b % 8 == 5` when `b` is odd.

2. This Python version might be the most unoptimized (and therefore easiest to follow) rendition of the algorithm: https://gist.github.com/robot-dreams/ceb00162b80384f2ae1913aaa2b35e75 (also quoted in [Perf: `isSquare` - constant-time Jacobi/Kronecker/Legendre symbol using fast GCD mratsim/constantine#199](https://github.com/mratsim/constantine/issues/199))

Thank you for the review!

The Python code you mentioned is not related to my modification of Pornin's method. It seems to be Wuille's Bernstein-Yang-based method of Jacobi symbol computation, which is somewhat slower than the proposed method and lacks a proof of convergence. More information about this method can be found here: https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md (section 8).

Regarding the number of inner-loop iterations being 30 instead of 31:

Pornin's original method for modular inversion performs 31 iterations of the inner loop and uses "approximations" of the variables that hold the arguments at the beginning of the algorithm's execution. These "approximations" are defined in the following way. Let n be the bit length of the largest argument without leading zeros. For n > 64, the "approximation" of an argument with value v is (v div 2^(n - 33)) * 2^31 + (v mod 2^31), i.e. it retains the 33 high and 31 low bits of the n-bit representation of v, packed into a single 64-bit word. If n does not exceed 64, an argument and its "approximation" are equal.
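As a small illustration (not the PR's code), the same packing on a toy u128 value; the real implementation works on multi-limb field elements:

```rust
/// "Approximation" used by Pornin's original method: for an n-bit value v with
/// n > 64, keep the 33 high bits and the 31 low bits, packed into one u64.
fn approximate(v: u128, n: u32) -> u64 {
    if n <= 64 {
        // Small arguments are represented exactly.
        return v as u64;
    }
    let high33 = (v >> (n - 33)) as u64;          // top 33 bits of the n-bit value
    let low31 = (v as u64) & ((1u64 << 31) - 1);  // bottom 31 bits (kept exactly)
    (high33 << 31) | low31
}
```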

It is not hard to see that before each of the 31 iterations of the inner loop the least significant bit of each "approximation" is the same as the least significant bit of the value that the corresponding variable would have had if it had been updated in the inner loop along with the "approximations". Let us call this value the "potential precise value". Before the first iteration of the inner loop an "approximation" and the corresponding "potential precise value" are equal modulo 2^31, before the second iteration they are equal at least modulo 2^30, and so on. This allows Pornin's original method to make the correct decisions, which are based on parity checks.

However, to use the properties of the modified Jacobi symbol (x / |y|), we need an "approximation" to be equal to the corresponding "potential precise value" at least modulo 8 before each iteration. The naive approach is to perform just 29 iterations in the inner loop. Instead, I decided to change the formula defining the "approximation": here it is (v div 2^(n - 32)) * 2^32 + (v mod 2^32), i.e. it retains the 32 high and 32 low bits of the n-bit representation of v. This modification allows performing 30 iterations instead of 29. I have proven the admissibility of this modification using appropriately modified versions of Pornin's theorems.
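The modified packing, again as a toy u128 illustration (not the PR's code):

```rust
/// Modified "approximation": for an n-bit value v with n > 64, keep the
/// 32 high bits and the 32 low bits of v, packed into one u64.
fn approximate_modified(v: u128, n: u32) -> u64 {
    if n <= 64 {
        return v as u64;
    }
    let high32 = (v >> (n - 32)) as u64;  // top 32 bits of the n-bit value
    let low32 = v as u64 & 0xFFFF_FFFF;   // bottom 32 bits (kept exactly)
    (high32 << 32) | low32
}
```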

PS: I think I should make part of the comments more accurate: replace

"corresponding "precise" variables' values, which would be computed, if the "precise" variables were used in the inner loop instead of the "approximation" ones."

with

"corresponding "precise" variables' values, which would have been computed, if the "precise" variables had been updated in the inner loop along with the "approximations"."

Should I do this after merging or now?

PPS: Among the several improvements, the most practically valuable one is using the short-arithmetic binary Euclidean algorithm to finish the computation once the modified Pornin method's "approximations" become absolutely precise.
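For illustration, a word-sized binary-Euclidean Jacobi symbol of the kind such a finishing step can use (a generic sketch, not the PR's exact code):

```rust
/// Jacobi symbol (a / n) for odd n > 0, via the binary Euclidean algorithm.
fn jacobi_u64(mut a: u64, mut n: u64) -> i64 {
    debug_assert!(n % 2 == 1);
    a %= n;
    let mut t: i64 = 1;
    while a != 0 {
        // Strip factors of two; each one contributes (2 / n), which is -1
        // exactly when n % 8 is 3 or 5 (the same test as note 1 above).
        while a % 2 == 0 {
            a /= 2;
            if n % 8 == 3 || n % 8 == 5 {
                t = -t;
            }
        }
        // Quadratic reciprocity: swap the operands, flipping the sign
        // when both are congruent to 3 modulo 4.
        core::mem::swap(&mut a, &mut n);
        if a % 4 == 3 && n % 4 == 3 {
            t = -t;
        }
        a %= n;
    }
    if n == 1 { t } else { 0 }
}
```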

@han0110
Contributor

han0110 commented Oct 19, 2023

The Python code you mentioned is not related to my modification of Pornin's method. It seems to be Wuille's Bernstein-Yang-based method of Jacobi symbol computation, which is somewhat slower than the proposed method and lacks a proof of convergence. More information about this method can be found here: https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md (section 8).

Yeah, I meant that the script might help with understanding Computing the Jacobi symbol using Bernstein-Yang, and going from it to your optimized implementation is easier for me, since I couldn't find another implementation taking the same strategy.

Before the first iteration of the inner loop an "approximation" and the corresponding "potential precise value" are equal modulo 2^31, before the second iteration they are equal at least modulo 2^30, and so on. This allows Pornin's original method to make the correct decisions, which are based on parity checks. However, to use the properties of the modified Jacobi symbol (x / |y|), we need an "approximation" to be equal to the corresponding "potential precise value" at least modulo 8 before each iteration.

I see it now, thanks for clarifying, I think this is what I was missing.

Should I do this after merging or now?

Feel free to do so; besides, the PR needs a rebase.

@huitseeker huitseeker mentioned this pull request Oct 23, 2023
mratsim added a commit to taikoxyz/halo2curves that referenced this pull request Oct 25, 2023
Fast modular inverse - 9.4x acceleration (privacy-scaling-explorations#83)

* Bernstein yang modular multiplicative inverter (#2)

* rename similar to privacy-scaling-explorations#95

---------

Co-authored-by: Aleksei Vambol <77882392+AlekseiVambol@users.noreply.github.com>

@CPerezz CPerezz left a comment


LGTM!

Thanks so much for this work!

@CPerezz CPerezz added this pull request to the merge queue Oct 31, 2023
Merged via the queue into privacy-scaling-explorations:main with commit 81a0782 Oct 31, 2023
7 checks passed
@mratsim mratsim deleted the pse-pr-legendre branch November 2, 2023 16:24
jonathanpwang added a commit to axiom-crypto/halo2curves that referenced this pull request Nov 13, 2023
* Add field conversion to/from `[u64;4]` (privacy-scaling-explorations#80)

* feat: add field conversion to/from `[u64;4]`

* Added conversion tests
* Added `montgomery_reduce_short` for no-asm
* For bn256, uses assembly conversion when asm feature is on

* fix: remove conflict for asm

* chore: bump rust-toolchain to 1.67.0

* Compute Legendre symbol for `hash_to_curve` (privacy-scaling-explorations#77)

* Add `Legendre` trait and macro

 - Add Legendre macro with norm and legendre symbol computation
 - Add macro for automatic implementation in prime fields

* Add legendre macro call for prime fields

* Remove unused imports

* Remove leftover

* Add `is_quadratic_non_residue` for hash_to_curve

* Add `legendre` function

* Compute modulus separately

* Substitute division for shift

* Update modulus computation

* Add quadratic residue check func

* Add quadratic residue tests

* Add hash_to_curve bench

* Implement Legendre trait for all curves

* Move misplaced comment

* Add all curves to hash bench

* fix: add suggestion for legendre_exp

* fix: imports after rebase

* Add simplified SWU method (privacy-scaling-explorations#81)

* Fix broken link

* Add simple SWU algorithm

* Add simplified SWU hash_to_curve for secp256r1

* add: sswu z reference

* update MAP_ID identifier

Co-authored-by: Han <tinghan0110@gmail.com>

---------

Co-authored-by: Han <tinghan0110@gmail.com>

* Bring back curve algorithms for `a = 0` (privacy-scaling-explorations#82)

* refactor: bring back curve algorithms for `a = 0`

* fix: clippy warning

* fix: Improve serialization for prime fields (privacy-scaling-explorations#85)

* fix: Improve serialization for prime fields

Summary: 256-bit field serialization is currently 4x u64, i.e. the native format. This implements standard byte serialization (corresponding to PrimeField::{to,from}_repr), and a hex-encoded variant of that for (de)serializers that are human-readable (concretely, JSON).

- Added a new macro `serialize_deserialize_32_byte_primefield!` for custom serialization and deserialization of 32-byte prime fields in different structs (Fq, Fp, Fr) across the secp256r1, bn256, and derive libraries.
- Implemented the new macro for serialization and deserialization in various structs, replacing the previous `serde::{Deserialize, Serialize}` direct use.
- Enhanced error checking in the custom serialization methods to ensure valid field elements.
- Updated the test function in the tests/field.rs file to include JSON serialization and deserialization tests for object integrity checking.

* fixup! fix: Improve serialization for prime fields

---------

Co-authored-by: Carlos Pérez <37264926+CPerezz@users.noreply.github.com>

* refactor: (De)Serialization of points using `GroupEncoding` (privacy-scaling-explorations#88)

* refactor: implement (De)Serialization of points using the `GroupEncoding` trait

- Updated curve point (de)serialization logic from the internal representation to the
  representation offered by the implementation of the `GroupEncoding` trait.

* fix: add explicit json serde tests

* Insert MSM and FFT code and their benchmarks. (privacy-scaling-explorations#86)

* Insert MSM and FFT code and their benchmarks.

Resolves taikoxyz/zkevm-circuits#150.

* feedback

* Add instructions

* feedback

* Implement feedback:  Actually supply the correct arguments to `best_multiexp`.

Split into `singlecore` and `multicore` benchmarks so Criterion's result
caching and comparison over multiple runs makes sense.

Rewrite point and scalar generation.

* Use slicing and parallelism to decrease running time.

Laptop measurements:
k=22: 109 sec
k=16:   1 sec

* Refactor msm

* Refactor fft

* Update module comments

* Fix formatting

* Implement suggestion for fixing CI

* Re-export also mod `pairing` and remove flag `reexport` to always re-export (privacy-scaling-explorations#93)

fix: re-export also mod `pairing` and remove flag `reexport` to always re-export

* fix regression in privacy-scaling-explorations#93 reexport: field benches aren't run (privacy-scaling-explorations#94)

fix regression in privacy-scaling-explorations#93, field benches aren't run

* Fast modular inverse - 9.4x acceleration (privacy-scaling-explorations#83)

* Bernstein yang modular multiplicative inverter (#2)

* rename similar to privacy-scaling-explorations#95

---------

Co-authored-by: Aleksei Vambol <77882392+AlekseiVambol@users.noreply.github.com>

* Fast isSquare / Legendre symbol / Jacobi symbol - 16.8x acceleration (privacy-scaling-explorations#95)

* Derivatives of the Pornin's method (taikoxyz#3)

* renaming file

* make cargo fmt happy

* clarifications from privacy-scaling-explorations#95 (comment) [skip ci]

* Formatting and slightly changing a comment

---------

Co-authored-by: Aleksei Vambol <77882392+AlekseiVambol@users.noreply.github.com>

* chore: delete bernsteinyang module (replaced by ff_inverse)

* Bump version to 0.4.1

---------

Co-authored-by: David Nevado <davidnevadoc@users.noreply.github.com>
Co-authored-by: Han <tinghan0110@gmail.com>
Co-authored-by: François Garillot <4142+huitseeker@users.noreply.github.com>
Co-authored-by: Carlos Pérez <37264926+CPerezz@users.noreply.github.com>
Co-authored-by: einar-taiko <126954546+einar-taiko@users.noreply.github.com>
Co-authored-by: Mamy Ratsimbazafy <mamy_github@numforge.co>
Co-authored-by: Aleksei Vambol <77882392+AlekseiVambol@users.noreply.github.com>