TurboQuant_practice
TurboQuant paper: https://arxiv.org/abs/2504.19874
- 1.codebook.py: creates a simple low-color image and shows codebook-size quantization results.
- 2.lemma1.py: plots empirical coordinate distribution on the sphere vs Lemma 1 density and Gaussian approximation.
- 3.QJL.py: minimal QJL and inverse-QJL implementation with reconstruction difference printout.
- 3-1.QJL_simulation.py: QJL simulation across dimensions (error distributions, variance markers, cosine similarity).
- 4.QJL-lemma4_unbiased.py: Monte Carlo check of Lemma 4 unbiasedness by averaging repeated QJL reconstructions.
- 5.QJL-lemma4_variance.py: Monte Carlo check of the Lemma 4 variance bound.
- 6.central_limit_theorem.py: CLT simulation and sample-mean variance decay visualization.
- 7.concentration_of_measure.py: concentration-of-measure simulation on high-dimensional spheres.
- 8-1.TurboQuant_mse_simulation.py: current TurboQuant_mse simulation script (randomized quantizer per trial).
- 8-1.TurboQuant_mse_simulation_fix.py: fixed-quantizer version of 8-1 (rotation/codebook fixed before trials).
- 9-1.TurboQuant_prod_simulation.py: current TurboQuant_prod simulation script (randomized quantizer per trial).
- 9-1.TurboQuant_prod_simulation_fix.py: fixed-quantizer (rotation matrix) version of 9-1 (rotation/QJL fixed before trials).
- 10.TurboQuant_final_simulation.py: final comparison across bit widths with per-trial randomized quantizers.
- 10.TurboQuant_final_simulation_fix.py: fixed-quantizer (rotation matrix) version of the final comparison.
- 11.TurboQuant_quantizaiton_simulation.py: quantized-only similarity experiment and Q(float)-K(quantized) softmax comparison.
- TurboQuant_mse.py: baseline TurboQuant_mse implementation (exact-density style).
- TurboQuant_mse_montecarlo.py: Monte Carlo codebook-learning version of TurboQuant_mse.
- TurboQuant_mse_lgamma.py: numerically stable TurboQuant_mse using log-gamma (lgamma) formulas.
- TurboQuant_prod.py: baseline TurboQuant_prod built on TurboQuant_mse.py.
- TurboQuant_prod_montecarlo.py: Monte Carlo TurboQuant_prod built on TurboQuant_mse_montecarlo.py.
- TurboQuant_prod_lgamma.py: numerically stable TurboQuant_prod built on TurboQuant_mse_lgamma.py.
Lemma 1 in TurboQuant/main.tex states the following:
- if $x \in S^{d-1}$ is uniformly distributed on the unit sphere in $R^d$,
- then any coordinate $x_j$ follows a Beta-type distribution on $[-1, 1]$.
The formula in the paper is:
$f_X(x) = \frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d-1)/2)} (1-x^2)^{(d-3)/2}$
The intuition is that if we fix one coordinate of a random point on the sphere to a value $x$, the remaining coordinates are confined to a lower-dimensional sphere of radius $\sqrt{1-x^2}$, and the size of that slice determines the density at $x$.
The Gamma function extends the factorial function to real and complex arguments:
Useful identities are:
- $\Gamma(z+1) = z \Gamma(z)$
- for any positive integer $n$, $\Gamma(n) = (n-1)!$
- $\Gamma(1/2) = \sqrt{\pi}$
This is why the Gamma function naturally appears in formulas for high-dimensional balls and spheres.
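These identities are easy to check numerically with Python's standard library (a quick sanity check, not one of the numbered scripts):

```python
import math

# Gamma(n) = (n-1)! for positive integers
for n in range(1, 10):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

# Gamma(1/2) = sqrt(pi)
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))

# Gamma(z+1) = z * Gamma(z) also holds for non-integer arguments
z = 2.7
assert math.isclose(math.gamma(z + 1), z * math.gamma(z))

print("all Gamma identities verified")
```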
The volume of a $d$-dimensional ball of radius $r$ is
$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$
and the surface area of the sphere $S^{d-1}$ of radius $r$ is
$A_{d-1}(r) = \frac{2\pi^{d/2}}{\Gamma(d/2)} r^{d-1}$
In particular, for radius $r = 1$:
- unit ball volume: $V_d(1) = \pi^{d/2} / \Gamma(d/2 + 1)$
- unit sphere surface area: $A_{d-1}(1) = 2\pi^{d/2} / \Gamma(d/2)$
The Gamma function appears in Lemma 1 because the proof computes the size of spherical cross-sections, and those cross-sections are lower-dimensional spheres whose surface area is given by the formula above.
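These two formulas can be evaluated directly with `math.gamma`; the helper names below are mine, and the checks compare against the familiar low-dimensional values:

```python
import math

def ball_volume(d, r=1.0):
    # V_d(r) = pi^{d/2} r^d / Gamma(d/2 + 1)
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

def sphere_area(d_minus_1, r=1.0):
    # surface area of S^{d-1} with radius r: A_{d-1}(r) = 2 pi^{d/2} r^{d-1} / Gamma(d/2)
    d = d_minus_1 + 1
    return 2 * math.pi ** (d / 2) * r ** (d - 1) / math.gamma(d / 2)

assert math.isclose(ball_volume(2), math.pi)             # area of the unit disk
assert math.isclose(ball_volume(3), 4 * math.pi / 3)     # volume of the unit ball
assert math.isclose(sphere_area(1), 2 * math.pi)         # circumference of the unit circle
assert math.isclose(sphere_area(2), 4 * math.pi)         # surface of the unit 2-sphere
```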
The proof in the paper uses a cross-sectional area argument.
- Fix $x_j = x$. Then the remaining $d-1$ coordinates must satisfy $x_1^2 + ... + x_{j-1}^2 + x_{j+1}^2 + ... + x_d^2 = 1 - x^2$.
- Therefore, the feasible set is a sphere in $R^{d-1}$ with radius $\sqrt{1-x^2}$ and dimension $d-2$.
- Its surface area is $A_{d-2}(\sqrt{1-x^2}) = \frac{2\pi^{(d-1)/2}}{\Gamma((d-1)/2)} (1-x^2)^{(d-2)/2}$.
- The total sample space is the unit sphere $S^{d-1}$, whose surface area is $A_{d-1}(1) = \frac{2\pi^{d/2}}{\Gamma(d/2)}$.
- The density is proportional to the ratio between the cross-sectional area and the total surface area. When expressing the density with respect to the coordinate $x$, an additional Jacobian factor $\frac{1}{\sqrt{1-x^2}}$ appears. In the paper this is described as coming from the Pythagorean-theorem geometry.
- Therefore, $f_X(x) = \frac{\frac{2\pi^{(d-1)/2}}{\Gamma((d-1)/2)} (1-x^2)^{(d-2)/2}}{\frac{2\pi^{d/2}}{\Gamma(d/2)}} \cdot \frac{1}{\sqrt{1-x^2}}$, and simplifying gives $f_X(x) = \frac{\Gamma(d/2)}{\sqrt{\pi}\Gamma((d-1)/2)} (1-x^2)^{(d-3)/2}$.
So the key idea behind Lemma 1 is that the distribution of one coordinate of a random point on the sphere is determined by the size of the spherical slice at that coordinate value. As the dimension grows, this distribution becomes increasingly concentrated near zero, which is why it converges to the Gaussian approximation $N(0, 1/d)$.
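A small NumPy check of the Lemma 1 density: sample uniform points on the sphere, and compare the empirical mass of an interval around zero with the integral of the exact density (midpoint rule):

```python
import numpy as np
from math import gamma, sqrt, pi

def lemma1_density(x, d):
    # f_X(x) = Gamma(d/2) / (sqrt(pi) Gamma((d-1)/2)) * (1 - x^2)^((d-3)/2)
    return gamma(d / 2) / (sqrt(pi) * gamma((d - 1) / 2)) * (1 - x ** 2) ** ((d - 3) / 2)

d = 50
rng = np.random.default_rng(0)
# uniform points on S^{d-1}: normalize i.i.d. Gaussian vectors
pts = rng.standard_normal((200_000, d))
coord = (pts / np.linalg.norm(pts, axis=1, keepdims=True))[:, 0]

# empirical mass of [-0.1, 0.1] vs the integral of the exact density
lo, hi = -0.1, 0.1
empirical = np.mean((coord > lo) & (coord < hi))
dx = (hi - lo) / 10_000
grid = np.arange(lo + dx / 2, hi, dx)   # midpoint rule
exact = np.sum(lemma1_density(grid, d)) * dx
print(empirical, exact)   # the two numbers agree closely
```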
Definition 1 in TurboQuant/main.tex introduces QJL, short for Quantized Johnson-Lindenstrauss. It is a one-bit quantization scheme built on a random Gaussian projection.
The forward QJL map is
$Q_{\tt qjl}(\mathbf{x}) = \mathrm{sign}(\mathbf{S}\mathbf{x})$
where $\mathbf{S} \in \mathbb{R}^{d \times d}$ is a matrix with i.i.d. standard normal entries.
The dequantization map is
$Q_{\tt qjl}^{-1}(\boldsymbol{\sigma}) = \frac{\sqrt{\pi/2}}{d}\, \mathbf{S}^\top \boldsymbol{\sigma}$
The role of QJL is the following:
- first apply a random Gaussian projection through $\mathbf{S}$
- then keep only the signs of the projected coordinates
- reconstruct a vector by multiplying the sign vector with $\mathbf{S}^\top$ and the scale factor $\sqrt{\pi/2}/d$
This is useful because it gives an extremely compact representation, only one bit per coordinate, while still preserving inner products in expectation. In the TurboQuant paper, QJL is used on the residual vector after MSE-oriented quantization so that the final estimator becomes unbiased for inner product queries.
Intuitively, QJL trades exact coordinate values for random signed sketches. The Gaussian projection spreads the information of the input vector across coordinates, and the sign operation keeps only coarse directional information. The specific dequantization scaling is chosen so that the reconstructed vector has the correct inner-product behavior on average.
The factor $\sqrt{2/\pi}$ appears because, for a standard normal variable $g$, $\mathbb{E}[|g|] = \sqrt{2/\pi}$.
So when we form the dequantized vector using $\mathbf{S}^\top \mathrm{sign}(\mathbf{S}\mathbf{x})$, each row of $\mathbf{S}$ contributes $\sqrt{2/\pi}\,\mathbf{x}$ in expectation, for a total of $d\sqrt{2/\pi}\,\mathbf{x}$.
Multiplying by $\frac{\sqrt{\pi/2}}{d}$
exactly cancels this factor, so the reconstruction is centered at the original vector in the sense needed for unbiased inner-product estimation. That is the reason the dequantization map uses the coefficient $\sqrt{\pi/2}/d$.
A more formal derivation is as follows. Let the rows of $\mathbf{S}$ be $\mathbf{s}_1, \dots, \mathbf{s}_d$, each drawn i.i.d. from $N(0, \mathbf{I}_d)$,
so $Q_{\tt qjl}(\mathbf{x})_i = \mathrm{sign}(\mathbf{s}_i^\top \mathbf{x})$.
By rotational invariance of the Gaussian distribution, the vector $\mathbb{E}[\mathbf{s}_i\, \mathrm{sign}(\mathbf{s}_i^\top \mathbf{x})]$
must be parallel to $\mathbf{x}$.
To find the proportionality constant, take the inner product with the unit vector $\mathbf{x}$: $\mathbb{E}[(\mathbf{s}_i^\top \mathbf{x})\, \mathrm{sign}(\mathbf{s}_i^\top \mathbf{x})] = \mathbb{E}[|\mathbf{s}_i^\top \mathbf{x}|]$.
Since $\mathbf{s}_i^\top \mathbf{x} \sim N(0,1)$ for a unit vector $\mathbf{x}$, this equals $\sqrt{2/\pi}$.
Therefore, $\mathbb{E}[\mathbf{s}_i\, \mathrm{sign}(\mathbf{s}_i^\top \mathbf{x})] = \sqrt{2/\pi}\,\mathbf{x}$,
and hence $\mathbb{E}[\mathbf{S}^\top \mathrm{sign}(\mathbf{S}\mathbf{x})] = d\sqrt{2/\pi}\,\mathbf{x}$.
Now applying the dequantization factor gives $\mathbb{E}[Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{x}))] = \frac{\sqrt{\pi/2}}{d} \cdot d\sqrt{2/\pi}\,\mathbf{x} = \mathbf{x}$.
This is the precise reason the coefficient $\sqrt{\pi/2}/d$ appears in the dequantization map.
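A minimal sketch of the two maps, assuming a square $d \times d$ i.i.d. Gaussian sketch matrix $\mathbf{S}$ and the $\sqrt{\pi/2}/d$ scale, plus a Monte Carlo check that averaging independent reconstructions recovers $\mathbf{x}$:

```python
import numpy as np

def qjl(S, x):
    # forward map: one sign bit per projected coordinate
    return np.sign(S @ x)

def qjl_inv(S, signs):
    # dequantization with the sqrt(pi/2)/d scale derived above
    d = S.shape[0]
    return np.sqrt(np.pi / 2) / d * (S.T @ signs)

rng = np.random.default_rng(1)
d = 64
x = rng.standard_normal(d)
x /= np.linalg.norm(x)            # QJL is stated for unit vectors

trials = 5_000
acc = np.zeros(d)
for _ in range(trials):
    S = rng.standard_normal((d, d))   # fresh i.i.d. N(0,1) sketch each trial
    acc += qjl_inv(S, qjl(S, x))
mean_rec = acc / trials
print(np.linalg.norm(mean_rec - x))   # small: the reconstruction is unbiased
```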
Lemma 4 in TurboQuant/main.tex gives the main statistical guarantee for QJL. For any unit vectors $\mathbf{x}$ and $\mathbf{y}$, consider the inner-product estimator $\langle Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{x})), \mathbf{y} \rangle$
obtained in three steps:
- apply the random Gaussian map $\mathbf{S}$
- keep only the signs through $Q_{\tt qjl}(\mathbf{x}) = \mathrm{sign}(\mathbf{S}\mathbf{x})$
- dequantize using $Q_{\tt qjl}^{-1}$
Lemma 4 says that this estimator has two key properties:
- it is unbiased
- its variance is small and decreases as the dimension $d$ increases
More precisely, the lemma states
$\mathbb{E}\left[\langle Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{x})), \mathbf{y} \rangle\right] = \langle \mathbf{x}, \mathbf{y} \rangle$
and
$\mathrm{Var}\left[\langle Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{x})), \mathbf{y} \rangle\right] \le \frac{\pi}{2d}$
The first identity is the unbiasedness statement. It means that QJL does not systematically overestimate or underestimate the true inner product. If we repeated the random quantization many times and averaged the results, that average would converge to the true value $\langle \mathbf{x}, \mathbf{y} \rangle$.
The second inequality is the variance bound. It controls how much the estimator fluctuates around the correct mean. The important feature is the factor $1/d$: as the dimension grows, a single QJL estimate concentrates more tightly around the true inner product.
This lemma is the reason QJL is useful. One-bit quantization is extremely aggressive, but Lemma 4 shows that after the correct dequantization scaling, the resulting inner-product estimator is still mathematically well behaved.
Write the rows of $\mathbf{S}$ as $\mathbf{s}_1, \dots, \mathbf{s}_d$,
and substituting the definitions of $Q_{\tt qjl}$ and $Q_{\tt qjl}^{-1}$ gives
$\langle Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{x})), \mathbf{y} \rangle = \frac{\sqrt{\pi/2}}{d} \sum_{i=1}^{d} \mathrm{sign}(\mathbf{s}_i^\top \mathbf{x})\,(\mathbf{s}_i^\top \mathbf{y})$
This formula comes from a direct substitution. First, $Q_{\tt qjl}^{-1}(\boldsymbol{\sigma}) = \frac{\sqrt{\pi/2}}{d} \mathbf{S}^\top \boldsymbol{\sigma}$ with $\boldsymbol{\sigma} = \mathrm{sign}(\mathbf{S}\mathbf{x})$.
Taking the inner product with $\mathbf{y}$ gives $\frac{\sqrt{\pi/2}}{d}\, \boldsymbol{\sigma}^\top \mathbf{S} \mathbf{y}$.
Now use the identity $\boldsymbol{\sigma}^\top \mathbf{S} \mathbf{y} = \sum_{i=1}^{d} \sigma_i\, (\mathbf{s}_i^\top \mathbf{y})$.
If the rows of $\mathbf{S}$ are $\mathbf{s}_1, \dots, \mathbf{s}_d$, then $\mathbf{S}\mathbf{y}$ has coordinates $\mathbf{s}_i^\top \mathbf{y}$, and $\sigma_i = \mathrm{sign}(\mathbf{s}_i^\top \mathbf{x})$.
Substituting this back yields the sum above.
So the full estimator is an average of $d$ independent terms $z_i = \sqrt{\pi/2}\,\mathrm{sign}(\mathbf{s}_i^\top \mathbf{x})\,(\mathbf{s}_i^\top \mathbf{y})$.
This is the central simplification in the proof.
From the normalization argument in the previous QJL section, we already know that for unit vectors $\mathbf{x}$, $\mathbb{E}[\mathbf{s}_i\, \mathrm{sign}(\mathbf{s}_i^\top \mathbf{x})] = \sqrt{2/\pi}\,\mathbf{x}$.
Taking inner products with $\mathbf{y}$ gives $\mathbb{E}[z_i] = \sqrt{\pi/2} \cdot \sqrt{2/\pi}\, \langle \mathbf{x}, \mathbf{y} \rangle = \langle \mathbf{x}, \mathbf{y} \rangle$.
So the QJL inner-product estimator is exactly unbiased.
Since the estimator is the average of independent samples $z_1, \dots, z_d$,
$\mathrm{Var}\left[\frac{1}{d}\sum_{i=1}^{d} z_i\right] = \frac{1}{d^2}\sum_{i=1}^{d} \mathrm{Var}[z_i] = \frac{\mathrm{Var}[z_1]}{d}$
because the $z_i$ are i.i.d.
So it remains to bound the variance of one term. The paper uses $\mathrm{Var}[z_1] \le \mathbb{E}[z_1^2]$.
This step uses the basic fact that variance is at most the second moment, together with the fact that multiplying by a sign changes only the sign, not the magnitude.
Now $\mathbb{E}[z_1^2] = \frac{\pi}{2}\, \mathbb{E}[(\mathbf{s}_1^\top \mathbf{y})^2] = \frac{\pi}{2}$, since $\mathrm{sign}(\cdot)^2 = 1$ and $\mathbf{s}_1^\top \mathbf{y} \sim N(0,1)$ for a unit vector $\mathbf{y}$.
Hence $\mathrm{Var}[z_1] \le \frac{\pi}{2}$.
Substituting this into the variance-of-the-average formula gives $\mathrm{Var}\left[\langle Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{x})), \mathbf{y} \rangle\right] \le \frac{\pi}{2d}$.
Lemma 4 provides both correctness and concentration:
- correctness: the expected estimate equals the true inner product
- concentration: the fluctuations around that mean shrink like
$1/d$
This is exactly the property TurboQuant needs. The MSE-oriented part of the algorithm is good at reducing reconstruction error, but it does not by itself guarantee unbiased inner-product estimates. QJL is applied to the residual precisely because Lemma 4 ensures that the residual correction term preserves inner products in expectation and introduces only controlled variance. That combination is what allows TurboQuant to achieve unbiased inner-product estimation with small distortion.
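Both properties can be checked by simulation (this mirrors what scripts 4 and 5 do; the $d \times d$ Gaussian sketch is the same assumption as before):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)
true_ip = x @ y

ests = np.empty(10_000)
for t in range(ests.size):
    S = rng.standard_normal((d, d))
    # dequantized reconstruction, then inner product with y
    x_hat = np.sqrt(np.pi / 2) / d * (S.T @ np.sign(S @ x))
    ests[t] = x_hat @ y

print("bias    :", ests.mean() - true_ip)          # ~0: unbiased
print("variance:", ests.var(), "vs bound:", np.pi / (2 * d))
```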
Section 3.1 of TurboQuant/main.tex explains how TurboQuant is designed when the goal is to minimize reconstruction mean-squared error rather than preserve inner products. The input is a worst-case vector $\mathbf{x}$ on the unit sphere, with no distributional assumptions.
The first idea is to randomize the geometry of the input. Instead of quantizing $\mathbf{x}$ in its original basis, TurboQuant first applies a random rotation $\mathbf{\Pi}$ and quantizes the rotated vector $\mathbf{y} = \mathbf{\Pi}\mathbf{x}$.
Why do they use a QR decomposition? If a Gaussian random matrix $A$ is decomposed as $A = QR$, then $Q$ is an orthogonal matrix whose columns are orthonormal. Therefore $Q^\top Q = I$, so
for any $\mathbf{x}$, $\|Q\mathbf{x}\|_2 = \|\mathbf{x}\|_2$: the rotation preserves norms while randomizing the direction.
This random rotation is what makes the quantization problem tractable. By Lemma 1, each coordinate of the rotated vector follows the known spherical coordinate density $f_X$.
In high dimension this distribution is close to a normal distribution, and different coordinates are nearly independent. That means the original high-dimensional quantization problem can be approximated by a scalar quantization problem applied independently to each coordinate.
The next step is to design the best scalar quantizer for that coordinate distribution. The paper formulates this as a continuous one-dimensional k-means problem on the interval $[-1, 1]$: choose centroids minimizing
$\mathcal{C}(f_X, b) = \min_{c_1, \dots, c_{2^b}} \int_{-1}^{1} \min_{k \in [2^b]} (x - c_k)^2\, f_X(x)\, dx$
This objective deserves a more explicit interpretation. The values $c_1, \dots, c_{2^b}$ are the codebook centroids, and each point $x$ is charged the squared distance to its nearest centroid.
So the $k$-th centroid is responsible for its Voronoi cell, the set of points closer to $c_k$ than to any other centroid.
The term $\int_{\mathrm{cell}_k} (x - c_k)^2 f_X(x)\, dx$
is therefore the expected squared distortion contributed by the $k$-th cell.
This is exactly the continuous analogue of ordinary k-means. In finite-data k-means, one partitions data points into clusters and chooses one centroid per cluster to minimize squared distance. Here the "dataset" is not a finite set of points but the continuous probability distribution $f_X$, and the optimal solution has the same structure:
- each point is assigned to its nearest centroid
- each optimal centroid is the conditional mean of the points assigned to its cell
In the continuous setting, if the cells are held fixed, the optimal centroid of each cell is the conditional mean $c_k = \mathbb{E}[X \mid X \in \mathrm{cell}_k]$.
So alternating nearest-centroid assignment and conditional-mean updates (Lloyd's algorithm) drives the objective down, exactly as in finite k-means.
Another useful way to read the objective is that it is distribution-aware. If $f_X$ places more probability mass near zero, the optimal codebook places more centroids near zero, spending resolution where the data actually lives.
The paper notes that this optimization can be solved numerically to any desired precision, and in practice it is solved once for the bit-widths of interest and then stored for reuse. Then the online quantizer only needs to look up the nearest centroid for each rotated coordinate.
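The paper solves this optimization numerically; a Monte Carlo stand-in (what my `_montecarlo` variants do in spirit) is to sample the coordinate distribution and run plain one-dimensional Lloyd iterations. The helper name is mine:

```python
import numpy as np

def learn_codebook(samples, b, iters=50):
    """1-D Lloyd (k-means): alternate nearest-centroid assignment
    and conditional-mean centroid updates."""
    k = 2 ** b
    cents = np.quantile(samples, (np.arange(k) + 0.5) / k)  # quantile init
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - cents[None, :]), axis=1)
        for j in range(k):
            if np.any(idx == j):
                cents[j] = samples[idx == j].mean()  # centroid = conditional mean
    return np.sort(cents)

rng = np.random.default_rng(3)
d = 128
pts = rng.standard_normal((100_000, d))
coord = (pts / np.linalg.norm(pts, axis=1, keepdims=True))[:, 0]  # samples of f_X

codebook = learn_codebook(coord, b=2)
print(codebook)   # 4 centroids, roughly symmetric around zero
```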
So the overall MSE-optimal TurboQuant pipeline is:
- sample a random rotation matrix $\mathbf{\Pi}$
- solve the scalar quantization problem for the coordinate distribution $f_X$ and build the codebook $\{c_1,\dots,c_{2^b}\}$
- rotate the input: $\mathbf{y} = \mathbf{\Pi}\mathbf{x}$
- quantize each coordinate $y_j$ to the index of its nearest centroid
- reconstruct $\tilde{\mathbf{y}}$ from those centroid indices
- rotate back using $\mathbf{\Pi}^\top$ to obtain $\tilde{\mathbf{x}}$
The quantization map therefore stores only the centroid index of each rotated coordinate. The dequantization map replaces each stored index by its corresponding centroid and then applies the inverse rotation. In symbols, the paper describes:
- quantization: compute $\mathbf{y} = \mathbf{\Pi}\mathbf{x}$ and store $\mathrm{idx}_j = \arg\min_{k \in [2^b]} |y_j - c_k|$ for each coordinate $j$
- dequantization: reconstruct $\tilde{y}_j = c_{\mathrm{idx}_j}$ and $\tilde{\mathbf{x}} = \mathbf{\Pi}^\top \tilde{\mathbf{y}}$
The important conceptual point is that TurboQuant converts a difficult worst-case vector quantization problem into a distribution-aware scalar quantization problem by first randomizing the basis. The random rotation makes every coordinate statistically predictable, and once that happens, the optimal codebook can be designed using the known spherical coordinate distribution.
The paper also gives intuition for the resulting centroids. In moderately high dimension, where the coordinate distribution is close to $N(0, 1/d)$, the optimal codebook is close to the optimal scalar quantizer for that Gaussian,
and for $b = 1$ the two centroids sit symmetrically at approximately $\pm\mathbb{E}[|X|] \approx \pm\sqrt{2/(\pi d)}$.
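For a one-bit symmetric quantizer the cells are the two half-intervals, so each optimal centroid is the conditional mean $\pm\mathbb{E}[|X|]$; under the $N(0,1/d)$ approximation this is about $\pm\sqrt{2/(\pi d)}$. A quick empirical check:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 400
pts = rng.standard_normal((200_000, d))
coord = (pts / np.linalg.norm(pts, axis=1, keepdims=True))[:, 0]

# one-bit centroid magnitude: conditional mean of each half = E[|X|]
c = np.abs(coord).mean()
gauss_pred = np.sqrt(2 / (np.pi * d))   # E[|Z|] for Z ~ N(0, 1/d)
print(c, gauss_pred)   # nearly identical in high dimension
```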
This section is the design blueprint of $\mathrm{TurboQuant}_{\tt mse}$.
The paper presents Algorithm 1 as a short procedural description of $\mathrm{TurboQuant}_{\tt mse}$.
Input:
- dimension $d$
- bit-width $b$
Global setup:
- generate a random rotation matrix $\mathbf{\Pi} \in \mathbb{R}^{d \times d}$
- construct the codebook by finding centroids $c_1, c_2, \ldots, c_{2^b} \in [-1,1]$ that minimize the continuous k-means MSE objective $\mathcal{C}(f_X,b)$
Quantization procedure
- Compute the rotated vector $\mathbf{y} \gets \mathbf{\Pi}\mathbf{x}$.
- For every coordinate $j \in [d]$, find the nearest centroid index $\mathrm{idx}_j \gets \arg\min_{k \in [2^b]} |y_j - c_k|$.
- Output the full index vector $\mathrm{idx}$.
Dequantization procedure
- For every coordinate $j \in [d]$, reconstruct the rotated coordinate $\tilde{y}_j \gets c_{\mathrm{idx}_j}$.
- Rotate back to the original basis using $\tilde{\mathbf{x}} \gets \mathbf{\Pi}^\top \tilde{\mathbf{y}}$.
- Output the reconstructed vector $\tilde{\mathbf{x}}$.
So Algorithm 1 is structurally very simple:
- rotate the input
- quantize each rotated coordinate independently using a precomputed scalar codebook
- reconstruct the rotated coordinates from their centroid indices
- undo the rotation
This algorithm is exactly the practical realization of the design described in Section 3.1.
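The whole Algorithm 1 loop fits in a few lines. The codebook below is a quantile-based stand-in for the true k-means-optimal one, and the helper names are mine, not the paper's:

```python
import numpy as np

def random_rotation(d, rng):
    # QR of a Gaussian matrix yields a random orthogonal matrix;
    # the diagonal-sign fix makes the distribution uniform (Haar)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

def quantize(x, Pi, codebook):
    y = Pi @ x
    return np.argmin(np.abs(y[:, None] - codebook[None, :]), axis=1)

def dequantize(idx, Pi, codebook):
    return Pi.T @ codebook[idx]

rng = np.random.default_rng(5)
d, b = 256, 3
# stand-in codebook: quantiles of sampled rotated coordinates
pts = rng.standard_normal((50_000, d))
coord = (pts / np.linalg.norm(pts, axis=1, keepdims=True))[:, 0]
codebook = np.quantile(coord, (np.arange(2 ** b) + 0.5) / 2 ** b)

x = rng.standard_normal(d); x /= np.linalg.norm(x)
Pi = random_rotation(d, rng)
x_tilde = dequantize(quantize(x, Pi, codebook), Pi, codebook)
print("squared error:", np.sum((x - x_tilde) ** 2))   # well below 1
```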
After presenting the algorithm, the paper states a performance guarantee for $\mathrm{TurboQuant}_{\tt mse}$.
The paper defines the reconstruction error as
$D_{\tt mse} = \mathbb{E}\left[\|\mathbf{x} - \tilde{\mathbf{x}}\|_2^2\right]$
where the expectation is over the randomness of the rotation used by TurboQuant.
The theorem states that the MSE satisfies a uniform upper bound that decays on the order of $4^{-b}$. This is the main asymptotic guarantee: every additional bit per coordinate improves the distortion by roughly a factor of four.
The paper also reports sharper values for small bit-widths. In particular,
- for $b=1$, $D_{\tt mse} \approx 0.36$
- for $b=2$, $D_{\tt mse} \approx 0.117$
- for $b=3$, $D_{\tt mse} \approx 0.03$
- for $b=4$, $D_{\tt mse} \approx 0.009$
These numbers are more informative in practice than the general upper bound because they show the actual distortion level at the low-bit regimes that matter most for compression.
The interpretation of the theorem is straightforward: even at very low bit-widths, a unit-norm input is reconstructed with only a small fraction of its energy lost, and the loss shrinks geometrically as $b$ grows.
The paper also defines an inner-product optimized version of TurboQuant, denoted $\mathrm{TurboQuant}_{\tt prod}$. The motivation is that $\mathrm{TurboQuant}_{\tt mse}$ is good for vector reconstruction, but minimizing reconstruction MSE does not automatically make inner-product estimates unbiased. In particular, if one wants to answer queries of the form $\langle \mathbf{x}, \mathbf{q} \rangle$ for a query vector $\mathbf{q}$, any systematic bias in the reconstruction shows up directly as a systematic error in the estimate.
To fix this, the paper combines two ingredients:
- a $(b-1)$-bit instance of $\mathrm{TurboQuant}_{\tt mse}$
- a $1$-bit QJL sketch applied to the residual left over by the MSE quantizer
More concretely, the algorithm first computes an MSE-oriented reconstruction $\tilde{\mathbf{x}}_{\tt mse}$ of $\mathbf{x}$ using $b-1$ bits per coordinate.
It then forms the residual $\mathbf{r} = \mathbf{x} - \tilde{\mathbf{x}}_{\tt mse}$.
This residual is small in norm because the MSE quantizer already captures most of the signal. The remaining information is then encoded with QJL. The paper writes the resulting inner-product estimator as the sum of the MSE part and the QJL correction, $\langle \tilde{\mathbf{x}}_{\tt mse}, \mathbf{q} \rangle + \gamma \langle Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{r}/\gamma)), \mathbf{q} \rangle$ with $\gamma = \|\mathbf{r}\|_2$.
So the MSE part gives a good low-distortion approximation, and the QJL part corrects the residual in a way that preserves inner products in expectation.
The corresponding quantization map stores three objects:
- the MSE codebook indices $\mathrm{idx}$
- the sign vector produced by QJL on the residual, denoted $\mathrm{qjl}$
- the residual norm $\gamma = \|\mathbf{r}\|_2$
In the paper, this is summarized as the stored triple $(\mathrm{idx}, \mathrm{qjl}, \gamma)$.
During dequantization, the algorithm reconstructs the MSE component, reconstructs the QJL correction term from the sign sketch and the residual norm, and then adds the two pieces together.
The practical logic of $\mathrm{TurboQuant}_{\tt prod}$ is:
- quantize $\mathbf{x}$ using $\mathrm{TurboQuant}_{\tt mse}$ with bit-width $b-1$
- reconstruct the MSE approximation and compute the residual $\mathbf{r}$
- apply QJL to the residual
- store the MSE indices, the QJL sign vector, and the residual norm
- during dequantization, reconstruct the MSE part and add the QJL residual correction
This is why the algorithm is called inner-product optimal: it uses the MSE quantizer for coarse reconstruction, but then spends the last bit on a QJL-style correction specifically chosen to remove inner-product bias.
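The steps above can be sketched end to end. The codebook is again a quantile stand-in, the helper names are mine, and the quantizer is freshly randomized each trial so the unbiasedness of the combined estimator is visible in the average:

```python
import numpy as np

def tq_prod_quantize(x, Pi, codebook, S):
    # (b-1)-bit MSE stage
    y = Pi @ x
    idx = np.argmin(np.abs(y[:, None] - codebook[None, :]), axis=1)
    x_mse = Pi.T @ codebook[idx]
    # 1-bit QJL stage on the normalized residual
    r = x - x_mse
    gamma = np.linalg.norm(r)
    qjl_signs = np.sign(S @ (r / gamma))
    return idx, qjl_signs, gamma

def tq_prod_dequantize(idx, qjl_signs, gamma, Pi, codebook, S):
    d = S.shape[0]
    x_mse = Pi.T @ codebook[idx]
    r_hat = gamma * np.sqrt(np.pi / 2) / d * (S.T @ qjl_signs)
    return x_mse + r_hat

rng = np.random.default_rng(6)
d, b = 64, 3
x = rng.standard_normal(d); x /= np.linalg.norm(x)
q = rng.standard_normal(d); q /= np.linalg.norm(q)

# stand-in (b-1)-bit codebook from sampled rotated coordinates
pts = rng.standard_normal((50_000, d))
coord = (pts / np.linalg.norm(pts, axis=1, keepdims=True))[:, 0]
k = 2 ** (b - 1)
codebook = np.quantile(coord, (np.arange(k) + 0.5) / k)

ests = np.empty(3_000)
for t in range(ests.size):
    Qm, Rm = np.linalg.qr(rng.standard_normal((d, d)))
    Pi = Qm * np.sign(np.diag(Rm))      # fresh random rotation
    S = rng.standard_normal((d, d))     # fresh QJL sketch
    parts = tq_prod_quantize(x, Pi, codebook, S)
    ests[t] = tq_prod_dequantize(*parts, Pi, codebook, S) @ q
print(np.mean(ests), x @ q)   # close: the combined estimator is unbiased
```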
The paper proves that the resulting estimator is unbiased for inner products: the QJL correction cancels, in expectation, the bias left by the MSE part.
The paper then gives an upper bound for the inner-product distortion $D_{\tt prod}$ that decays on the order of $4^{-b}/d$.
For small bit-widths, the paper also reports refined values:
- for $b=1$, $D_{\tt prod} \approx \frac{1.57}{d}$
- for $b=2$, $D_{\tt prod} \approx \frac{0.56}{d}$
- for $b=3$, $D_{\tt prod} \approx \frac{0.18}{d}$
- for $b=4$, $D_{\tt prod} \approx \frac{0.047}{d}$
The lower bound stated in the paper has the same order in $b$ and $d$.
So, up to a constant factor, $\mathrm{TurboQuant}_{\tt prod}$ achieves the best possible inner-product distortion.
The paper also notes that TurboQuant can be made slightly more storage-efficient by entropy-coding the codebook indices after quantization. In $\mathrm{TurboQuant}_{\tt mse}$, the centroid indices are not uniformly distributed, so their empirical entropy is below $b$ bits.
If the centroid indices are denoted by $\mathrm{idx}_j$, the probability of emitting index $k$ is $p_k = \int_{\mathrm{cell}_k} f_X(x)\, dx$.
This is simply the probability mass of the Voronoi region associated with centroid $c_k$, and a lossless entropy coder can exploit the non-uniformity of the $p_k$.
The important point is that this changes only the representation of the indices, not the quantizer itself. So:
- the reconstruction distortion stays exactly the same
- the average storage cost can decrease
- the gain comes entirely from lossless coding of non-uniformly distributed centroid pointers
The paper reports that the largest gain appears around
So entropy encoding of codebook pointers is an optional compression layer on top of TurboQuant. It does not improve MSE, but it can slightly reduce the final storage footprint by exploiting the fact that some centroid indices are more common than others. The paper ultimately does not include this optimization in the main algorithm because the gain is modest relative to the added implementation complexity.
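The entropy argument can be checked empirically: learn an MSE-style codebook (quantile init plus Lloyd updates, as before), measure the Voronoi-cell masses $p_k$, and compare the Shannon entropy of the index distribution with the fixed $b$-bit cost. Names are mine:

```python
import numpy as np

rng = np.random.default_rng(7)
d, b = 256, 3
k = 2 ** b
pts = rng.standard_normal((2_000, d))
# every coordinate of a normalized point marginally follows f_X
samples = (pts / np.linalg.norm(pts, axis=1, keepdims=True)).ravel()

# Lloyd iterations give an MSE-oriented codebook with unequal cell masses
cents = np.quantile(samples, (np.arange(k) + 0.5) / k)
for _ in range(50):
    idx = np.argmin(np.abs(samples[:, None] - cents[None, :]), axis=1)
    for j in range(k):
        if np.any(idx == j):
            cents[j] = samples[idx == j].mean()

p = np.bincount(idx, minlength=k) / idx.size      # empirical Voronoi-cell masses
entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
print(f"fixed-length cost: {b} bits, index entropy: {entropy:.3f} bits")
```

The gap between $b$ and the entropy is exactly what a lossless coder could recover without touching the distortion.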
Below are the figures I generated while implementing and testing the main ideas from the paper.
This figure visualizes a simple low-color image, the quantized reconstruction for different codebook sizes, and the palette learned for each codebook.
This experiment compares three objects across dimensions:
- the empirical distribution of one coordinate of a random point on the unit sphere
- the exact density from Lemma 1
- the Gaussian approximation
$N(0, 1/d)$
It shows that the spherical coordinate distribution becomes increasingly close to a normal distribution in high dimension.
This figure studies QJL across several dimensions. It includes:
- the inner-product error distribution
- the squared inner-product error distribution
- the cosine similarity between
$x$ and the QJL reconstruction
The plots show that QJL is designed for unbiased inner-product estimation rather than for maximizing cosine similarity with the original vector.
This simulation uses an exponential base distribution and shows two effects:
- the standardized sample mean converges toward a standard normal distribution
- the variance of the sample mean decreases as the sample size grows
This figure illustrates concentration of measure on high-dimensional spheres. As the dimension increases, one coordinate of a random unit vector becomes much more concentrated near zero.
This experiment evaluates the MSE-oriented quantizer across dimensions. The figure includes:
- the distribution of $D_{\tt mse} = \|\mathbf{x}-\tilde{\mathbf{x}}\|_2^2$
- the per-coordinate MSE distribution
- the inner-product squared error distribution
- the cosine similarity between the original and reconstructed vectors
It helps compare total distortion and per-coordinate distortion as dimension grows.
This experiment evaluates the inner-product optimized quantizer across dimensions. The figure includes:
- the distribution of $D_{\tt mse}$
- the per-coordinate MSE distribution
- the distribution of $D_{\tt prod}$
- the lower and upper bounds for $D_{\tt prod}$
- the cosine similarity between the original and reconstructed vectors
This makes it possible to compare geometric reconstruction quality and inner-product distortion at the same time.
This final figure compares $\mathrm{TurboQuant}_{\tt mse}$ and $\mathrm{TurboQuant}_{\tt prod}$ as the bit width changes. It includes:
- raw inner-product error histograms
- squared inner-product error histograms
- the average inner-product distortion together with lower and upper bounds
- the average MSE together with lower and upper bounds
This is the main summary figure for comparing the two quantizers as compression becomes more or less aggressive.
This experiment validates two practical points:
- similarity can be estimated directly from quantized vectors (index/codebook domain) without reconstructing to the original space
- when
$Q$ is not quantized but$K$ is quantized, the resulting softmax distribution is still close to full precision
The figure includes:
- true similarity vs quantized-only similarity scatter
- quantized-only similarity error distribution across bit widths
- one-query softmax distribution comparison ($Q$ full precision, $K$ quantized)
- JS-divergence of softmax distributions versus bit width
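The first point has a simple algebraic reason: when two quantized vectors share the same rotation $\mathbf{\Pi}$ (as keys in a KV cache would), $\langle \mathbf{\Pi}^\top c_x, \mathbf{\Pi}^\top c_y \rangle = \langle c_x, c_y \rangle$, so similarity can be computed entirely in the codebook domain. A sketch with my own helper names and a quantile stand-in codebook:

```python
import numpy as np

rng = np.random.default_rng(8)
d, k = 128, 16
Qm, Rm = np.linalg.qr(rng.standard_normal((d, d)))
Pi = Qm * np.sign(np.diag(Rm))          # one shared random rotation

pts = rng.standard_normal((20_000, d))
coord = (pts / np.linalg.norm(pts, axis=1, keepdims=True))[:, 0]
codebook = np.quantile(coord, (np.arange(k) + 0.5) / k)

def q_idx(v):
    y = Pi @ v
    return np.argmin(np.abs(y[:, None] - codebook[None, :]), axis=1)

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)
ix, iy = q_idx(x), q_idx(y)

# reconstructed-space inner product ...
full = (Pi.T @ codebook[ix]) @ (Pi.T @ codebook[iy])
# ... equals the codebook-domain inner product, because Pi is orthogonal
direct = codebook[ix] @ codebook[iy]
print(full, direct, x @ y)
```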
After implementing and testing the main components of the paper, my overall conclusion is that the random-rotation viewpoint is the key idea that makes the whole framework work. Once the vector is rotated, the coordinate distribution becomes predictable, and that allows a scalar codebook to be designed in a principled way rather than heuristically.
From the experiments, I found that
However, $\mathrm{TurboQuant}_{\tt prod}$ does not seem better than $\mathrm{TurboQuant}_{\tt mse}$ in my experiments. I suspect this is due to an implementation error.








