Optimizations

Contains a list of optimizations that have been made to the scroll/halo2 codebase. This is a living document and will be updated as new optimizations are made.

1. PrimeField in Assembly

We use a faster prime field which is generated by iden3/ffiasm.

Relevant commit: here.

2. Lazy IFFT

Some parts of the Halo2 code hold polynomials in both coefficient and evaluation form. Considering that polynomials in coefficient form are only needed after squeezing y, we can delay IFFT until y is squeezed. This helps reduce memory pressure.

Relevant commit: here

2.1. Lazy IFFT for Permutation Argument

Relevant code optimization: here.

Relevant halo2 code: permutation product poly.

2.2. Lazy IFFT for Lookup Argument

Relevant code optimization: here.

Relevant halo2 code: permuted input poly, permuted table poly and lookup product poly.

3. Batch Normalize

Originally, when committing to multiple polynomials sequentially, the conversion from xyzz point type to affine point is done individually for each commitment, which leads to a large number of expensive inverse operations. We optimize this inefficiency with Batch Normalize.

To set up Batch Normalize, conversions from xyzz point type to affine point are delayed until all commitments are calculated. Once all calculations are finished, xyzz point types are converted to affine points at once with BatchNormalize(). This reduces the number of required inverse operations to once per sequence.

Relevant commit: here

3.1. Batch Normalize for Permuted Polys

Relevant code optimization: here.

Relevant halo2 code: permuted input poly and permuted table poly.

3.2. Batch Normalize for Grand Product Polys

Relevant code optimization: here.

Relevant halo2 code: permutation product poly and lookup product poly.

3.3. Batch Inverse in Batch Normalize

Inverse operations are costly for elliptic curves. We can reduce the number of inverse operations on multiple points by using BatchInverse(), which only requires one inverse operation for all the points included.

We fully utilize Batch Inverse to avoid multiple inverse operations. For example, the original batch_normalize() inefficiently computed an inverse operation for every point, so we optimized our BatchNormalize() to use the batch inverse operation to normalize all the points with a single inverse operation.

4. Optimizing parallelization code

We merge multiple parallelized parts into a single parallelized part for better performance by reducing the number of thread spawns and joins. Here are the list of examples:

Description	Tachyon	Halo2
Merge parallelization	merging 1-t	merging 1-h
Use collapse clause	merging 2-t	merging 2-h

5. Optimizing MSM

Several optimizations have been made to the Multi-Scalar Multiplication (MSM) operation. For scalar multiplication, we have implemented windowed non-adjacent form (WNAF), which is a more efficient way of representing the scalar. For group operations, we have changed the bucket type to the PointXYZZ type, which is more efficient than the Jacobian type. View the benchmark result regarding the change to PointXYZZ type here.

6. Saving heap allocation

Considering the memory used by polynomials is typically high, reducing the number of memory allocations and reusing the existing allocated memory is very important.

Description	Tachyon	Halo2
Reuse compressed evals buffer	saving 1-t	saving 1-h
Reuse z buffer	saving 2-t	saving 2-h
Compute l polys	saving 3-t	saving 3-h
Avoid intermediate memory allocation when transposing	saving 4-t	saving 4-h

7. Benchmarks for each circuit

We provide benchmarks of Tachyon and Halo2 for three main circuits by Privacy-scaling-explorations and Scroll.

Tachyon commit hash: 24ebe1f
halo2 commit hash: 0918c29e
zkevm-circuits commit hash: 4f431bf

The following are the results of the real_prover test for the Transfer 0 block in the integration tests of zkevm-circuits. The times, in seconds, were recorded on an AWS EC2 instance (name: r6i.8xlarge, vCPU: 32, RAM: 256GB).

Steps	Tx:Tachyon	Tx:Halo2	EVM:Tachyon	EVM:Halo2	Super:Tachyon	Super:Halo2
degree	20	20	18	18	20	20
before theta	189.44	177.91	33.54	27.02	400.52	405.87
theta	12.77	20.65	6.44	13.50	103.24	171.76
- compress	1.12	2.73	0.23	0.46	6.59	14.35
- permute	9.67	11.04	4.93	8.37	72.87	88.26
- commit	1.98	6.89	1.28	4.66	23.78	69.15
beta-gamma	15.75	32.18	7.79	14.62	91.04	174.76
y	130.50	261.97	161.96	218.37	2130.99	3426.80
- transform poly	4.27	6.57	1.86	2.79	23.83	34.37
- build h poly	126.23	255.40	160.10	215.58	2107.16	3392.43
x	1.49	1.58	2.18	2.47	15.50	14.42
shplonk	8.05	5.87	1.35	3.31	37.52	25.30
backend total	168.56	322.25	179.72	252.27	2378.29	3813.04
end-to-end total	358.58	500.16	213.26	279.29	2778.81	4218.91

Greek characters such as theta and beta are adopted from the halo2-book
before theta refers to the total time taken by the Halo2 including the circuit frontend.
backend total refers to the total time taken from theta to shplonk.
end-to-end total refers to the total time taken by the program (before theta + backend total).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Optimizations

1. PrimeField in Assembly

2. Lazy IFFT

2.1. Lazy IFFT for Permutation Argument

2.2. Lazy IFFT for Lookup Argument

3. Batch Normalize

3.1. Batch Normalize for Permuted Polys

3.2. Batch Normalize for Grand Product Polys

3.3. Batch Inverse in Batch Normalize

4. Optimizing parallelization code

5. Optimizing MSM

6. Saving heap allocation

7. Benchmarks for each circuit

Files

README.md

Latest commit

History

README.md

File metadata and controls

Optimizations

1. PrimeField in Assembly

2. Lazy IFFT

2.1. Lazy IFFT for Permutation Argument

2.2. Lazy IFFT for Lookup Argument

3. Batch Normalize

3.1. Batch Normalize for Permuted Polys

3.2. Batch Normalize for Grand Product Polys

3.3. Batch Inverse in Batch Normalize

4. Optimizing parallelization code

5. Optimizing MSM

6. Saving heap allocation

7. Benchmarks for each circuit