Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
443 lines (390 sloc) 17.4 KB
% formatting
\setlength{\parskip}{6pt plus 2pt minus 1pt}
\setlength{\emergencystretch}{3em} % prevent overfull lines
% compact main title. see
% math macros
\newcommand{\innp}[1]{\langle #1 \rangle}
\title{Fast Johnson-Lindenstrauss}
\author{Preetum Nakkiran}
\date{May 19, 2016}
The Johnson-Lindenstrauss (JL) Transform says that, informally,
we can embed high-dimensional points into a much lower dimension, while still
preserving their pairwise distances.
In this post we'll start with the classical JL transform,
then focus on the Fast JL Transform (FJLT) by Ailon and Chazelle \cite{fjlt},
which achieves the JL embedding more
efficiently (w.r.t. runtime and randomness).
We'll look at the FJLT from the perspective of ``preconditioning'' a sparse
estimator, which comes with nice intuition from Fourier duality.
We conclude by mentioning more recent developments in this area
{\bf Motivation.}
It's an interesting structural result, and the JL transform also has
algorithmic and machine-learning applications. Eg, any application that only
depends on approximate pairwise distances (such as nearest-neighbors), or even on pairwise
inner-products\footnote{Because, if the norms $||x_i-x_j||$, $||x_i||$, and $||x_j||$ are
approximately preserved, then so is the inner-product $\innp{x_i,x_j}$.
(such as kernel methods) may equivalently work with the
dimensionality-reduced version of their input.
This also has applications in sketching and scarification.
\section{Classical JL}
The JL embedding theorem is:
Given points $x_1, \dots, x_n \in \R^d$, and $\epsilon > 0$, there exists an embedding $f: \R^d \to
\R^k$ such that
$$\forall i, j: \quad
(1-\epsilon) ||x_i - x_j||_2
||f(x_i) - f(x_j)||_2
(1+\epsilon) ||x_i - x_j||_2
and $k = O(\epsilon^{-2}\log n)$.
That is, the embedding roughly preserves pairwise distances in $\ell_2$.
Think of the regime $n \gg d \gg k$.
Note that the target dimension $k = O(\epsilon^{-2}\log n)$
is (perhaps surprisingly) independent of the source dimension $d$, and only
logarithmic in the number of points $n$.
In fact, a random linear map works as an embedding (w.h.p.).
This is established by the following lemma.
For any $\delta > 0$, set $k = O(\epsilon^{-2}\log(1/\delta))$,
and let $A \in \R^{k \x d}$ be a random matrix with iid normal $\N(0, 1/k)$ entries.
$$\forall x \in \R^d: \quad
\Pr_{A}\left[~ ||Ax||_2^2 \in (1 \pm \epsilon)||x||_2^2 ~\right] \geq 1- \delta$$
That is, a random matrix preserves the norm of vectors with good probability.
To see that this implies the JL Theorem, consider applying the matrix $A$ on
the $O(n^2)$ vectors of pairwise differences $(x_i - x_j)$.
For fixed $i,j$, the lemma implies that
$||A(x_i - x_j)|| \approx ||x_i - x_j||$ except w.p. $\delta$.
Thus, setting $\delta = 1/n^3$ and union bounding, we have that $A$ preserves the norm of
\emph{all} differences $(x_i - x_j)$ with high probability.
Letting the embedding map $f = A$, we have
$$\forall i,j: \quad ||f(x_i) - f(x_j)|| = ||Ax_i - Ax_j|| =
||A(x_i - x_j)|| \approx
||x_i - x_j||$$
as desired.
\footnote{Note that it was important that $f$ be linear for us to reduce
preserving pairwise-distances to preserving norms.}
{\bf Runtime.} The runtime of embedding a single vector with this construction (setting
$\delta=1/n^3$ and $\epsilon=O(1)$) is
$O(dk) = O(d \log n)$. In the Fast JL below, we will show how to do this in
time almost $O(d \log d)$.
We will now prove Lemma~\ref{lem:jl}.
First, note that our setting is scale-invariant, so it suffices to prove the
lemma for all unit vectors $x$.
%, which will give a better sense of what is
%going on, and how to generalize/improve it.
As a warm-up, let $g \in \R^d$ be a random vector with entires iid
$\N(0, 1)$, and consider the inner-product
$$Y := \innp{g, x}$$
(for a fixed unit vector $x \in \R^d$).
Notice the random variable $Y$
has expectation $\E[Y] = 0$,
and variance $$\E[Y^2] = \E[\innp{g,x}^2] = ||x||^2$$
{\bf Thus, $Y^2$ is an unbiased estimator for $||x||^2$.}
%This is the key observation.
Further, it concentrates well: assuming (wlog) that $||x||=1$,
we have $Y \sim \N(0, 1)$, so $Y^2$ has constant variance (and more generally,
is subguassian\footnote{
Subguassian with parameter $\sigma$ basically means
tail probabilities behave as a guassian with variance $\sigma^2$ would behave.
Formally, a zero-mean random variable $X$ is ``subgaussian with parameter
$\sigma$'' if: $\E[e^{\lambda X}] \leq e^{\sigma^2\lambda^2/2}$ for all
$\lambda \in \R$.
with a constant parameter).
This would correspond to a random linear projection into 1 dimension, where the
matrix $A$ in Lemma~\ref{lem:jl} is just $A = g^T$. Then
$||Ax||^2 = \innp{g, x}^2 = Y^2$. However, this estimator does
not concentrate well enough (we want tail probability $\delta$ to eventually be
inverse-poly, not a constant).
We can get a better estimator for $||x||^2$ by averaging many iid copies.
In particular, for any iid subgaussian random variables $Z_i$, with expectation
$1$ and subguassian parameter $\sigma$, the Hoeffding bound gives
$$\Pr\left[ \left|\left(\frac{1}{k}\sum_{i=1}^k Z_i \right) - 1 \right| > \epsilon \right]
\leq e^{-\Omega(k\epsilon^2/\sigma^2)}$$
Applying this for $Z_i = Y_i^2$ and $\sigma = O(1)$,
we can set $k = O(\epsilon^{-2}\log(1/\delta))$
so the above tail probability is bounded by $\delta$.
This is exactly the construction of Lemma~\ref{lem:jl}.
Each row of $A$ is $\frac{1}{\sqrt{k}} g_i^T$, where $g_i \in \R^d$ is iid $\N(0, 1)$.
$$||A x||^2
= \sum_{i=1}^k \innp{\frac{1}{\sqrt{k}} g_i, x}^2
= \frac{1}{k} \sum_{i=1}^k \innp{g_i, x}^2
= \frac{1}{k} \sum_{i=1}^k Y_i^2$$
And thus, for $||x||=1$,
$$\Pr\left[ \left| ||A x||^2 - 1 \right| > \epsilon\right] \leq \delta$$
as desired in Lemma~\ref{lem:jl}. \qed
To recap, the key observation was that if we draw $g \sim N(0, I_d)$, then $\innp{g, x}^2$
is a ``good'' estimator of $||x||^2$, so we can average
$O(\epsilon^{-2}\log(1/\delta))$ iid copies and get an estimator within a
multiplicative factor $(1 \pm \epsilon)$ with probability $\geq 1-\delta$.
Filling the transform matrix $A$ with guassians is clearly not necessary;
any distribution with iid entries that are subguassian would work, with the same
proof as above.
For example, picking each entry in $A$ as iid $\pm \frac{1}{\sqrt k}$ would work.
We can now think of different JL transforms as constructing different estimators
of $||x||^2$.
For example, can we draw from a distribution such that $g$ is sparse?
(Not quite, but with some ``preconditioning'' this will work, as we see below).
\section{Fast JL}
\subsection{First Try: Coordinate Sampling}
As a (bad) first try, consider the estimator that randomly samples a coordinate
of the given vector $x$, scaled appropriately. That is,
$$Y := \sqrt{d} ~x_j \quad\text{for uniformly random coordinate $j \in [d]$}$$
Equivalently, draw a random standard basis vector $e_j$, and let
$Y := \sqrt{d} \innp{e_j, x}$.
Notice that $Y^2$ has the right expectation:
$$\E[Y^2] = \E_j[(\sqrt{d} x_j)^2] = d \E_j[x_j^2] = ||x||^2$$
However, it does not concentrate well. The variance is
$Var[Y^2] = Var[d x_j^2] = d^2 Var[x_j]$.
If $x$ is a standard basis vector (say $x = e_1$),
%$x_j$ is Bernoulli($1/d$), so $Var[x_j] \approx 1/d$ and
%$Var[Y^2] \approx d$.
then this could be as bad as $Var[Y^2] \approx d$.
This is bad, because it means we would need to average $\Omega(d)$ iid samples to get a
sufficiently good estimator, which does not help us in reducing the dimension of
$x \in \R^d$.
The bad case in the above analysis is when $x$ is very
concentrated/sparse, so sampling a random coordinate of $x$ is a poor
estimator of its magnitude.
However, if $x$ is very ``spread out'', then sampling a
random coordinate would work well.
For example, if all entries of $x$ are bounded by $\pm O(\sqrt{\frac{1}{d}})$,
then $Var[Y^2] = O(1)$, and taking iid copies of this estimator would work.
This would be nice for runtime, since randomly sampling a coordinate can be done quickly (it
is not a dense inner-product).
Thus, if we can (quickly) ``precondition'' our vector $x$ to have
$||x||_{\infty} \leq O(\sqrt{\frac{1}{d}})$, we could then use coordinate
sampling to achieve a fast JL embedding.
We won't quite achieve this, but we will be able to precondition such that
$||x||_\infty \leq O(\sqrt{\frac{\log(d/\delta)}{d}})$, as described in the next
With this in mind, we will need the following easy claim (that with the weaker
bound on $\ell_\infty$, coordinate sampling works to reduce the dimension
to almost our target dimension).
Let $t = \Theta(\epsilon^{-2}\log(1/\delta)\log(d/\delta))$.
Let $S \in \R^{t \x d}$ be a matrix
with rows $s_i := \sqrt{\frac{d}{t}} e^{(i)}_{j_i}$,
where each $j_i \in [d]$ is an iid uniform index.
(That is, each row of $S$ randomly samples a coordinate, scaled
Then, for all $x$ s.t $||x||_2=1$ and $||x||_\infty \leq O(\sqrt{\frac{\log(d/\delta)}{d}})$,
we have
$$\Pr_{S}[ ||Sx||^2 \in 1 \pm \epsilon] \geq 1-\delta$$
= \sum_{i=1}^t \innp{\sqrt{\frac{d}{t}} e_{j_i}, x}^2
= \frac{1}{t}\sum_{i=1}^t d (x_{j_i})^2$$
Then, the r.vs $\{d (x_{j_i})\}$ are iid and absolutely bounded by
$O(\sqrt{\log(d/\delta)})$, so by Chernoff-Hoeffding and our choice of $t$,
$$\Pr[ |\frac{1}{t}\sum_{i=1}^t d (x_{j_i})^2 - 1| > \epsilon]
\leq e^{-\Omega(\epsilon^2 t / \log(d/\delta))} \leq \delta$$
%With the appropriate preconditioning, this lemma states that coordinate sampling
%can reduce the dimension to $t = O(\epsilon^{-2}\log(1/\delta)\log(d/\delta))$,
%which is just a factor $\log(d / \delta)$ more than our ultimate goal.
\subsection{FJLT: Preconditioning with random Hadamard}
The main idea of FJLT is that we can quickly precondition vectors to be
``smooth'', by using the Fast Hadamard Transform.
Recall the $d \x d$ Hadamard transform $H_d$ (for $d$ a power of 2) is defined
recursively as
$$H_1 := 1
H_{2d} := \frac{1}{\sqrt{2}} \bmqty{H_d & H_d\\H_d & -H_d}$$
More explicitly, $H_d[i,j] = \frac{1}{\sqrt{d}} (-1)^{\innp{i, j}}$
where indices $i,j \in \{0, 1\}^{\log d}$, and the inner-product is mod 2.
The Hadamard transform is just like the discrete Fourier transform\footnote{Indeed, it is exactly the Fourier transform over the group
$(\Z_2)^n$. For more on Fourier transforms over abelian groups, see for
example \href{}{Luca's
: it can be computed in
time $O(d\log d)$ by recursion, and it is an orthonormal transform.
Intuitively, the Hadamard transform may be useful to ``spread out'' vectors,
since Fourier transforms take things that are sparse/concentrated in time-domain
to things that are spread out in frequency domain (by time-frequency duality/Uncertainty principle).
Unfortunately this won't quite work, since duality goes both ways: It will also
take vectors that are already spread out and make them sparse.
To fix this, it turns out we can first randomize the signs, then apply the
Hadamard transform.
Let $H_d$ be the $d \x d$ Hadamard transform, and let $D$ be a random
diagonal matrix with iid $\pm 1$ entries on the diagonal.
$$\forall x \in \R^d, ||x||=1: \quad
||H_d D x||_\infty >
\leq \delta$$
We may expect something like this to hold: randomizing the signs of $x$
corresponds to
pointwise-multiplying by random white noise. White noise is
spectrally flat, and multiplying by it in time-domain corresponds to
convolving by its (flat) spectrum in frequency domain.
Thus, multiplying by $D$ should ``spread out'' the spectrum of $x$.
Applying $H_d$ computes this spectrum, so should yield a spread-out vector.
The above intuition seems messy to formalize
%\footnote{I expected power-spectral-density to make an appearance...},
but the proof is surprisingly simple.
\footnote{This proof presented slightly differently from the one in
Alon-Chazelle, but the idea is the same.}
\begin{proof}[Proof of Lemma~\ref{lem:Hadamard}]
Consider the first entry of $H_d D x$.
Let $D = diag(a_1, a_2, \dots, a_d)$ where $a_i$ are iid $\pm 1$.
The first row of $H_d$ is
$\frac{1}{\sqrt{d}} \bmqty{1 & 1 & \dots & 1}$, so
$$(H_d D x)[1] =
\sum_i a_ix_i$$
Here, the $x_i$ are fixed s.t. $||x||_2=1$, and the $a_i = \pm 1$ iid.
Thus we can again bound this by Hoeffding
The following form of Hoeffding bound is useful (it follows directly from
Hoeffding for subgaussian variables, but is also a corollary of
For iid zero-mean random variables $Z_i$, absolutely bounded by $1$,
$\Pr[|\sum c_i Z_i| > \epsilon]
\leq 2exp(-\frac{\epsilon^2}{2 \sum_i c_i^2})$.
\sum_{i=1}^d a_i \frac{x_i}{\sqrt{d}}|
> \eta]
e^{-\Omega(\eta^2 d / ||x||_2^2)}
For $\eta = \Omega(\sqrt{\frac{\log(d/\delta)}{d}})$,
this probability is bounded by $(\frac{\delta}{d})$.
Moreover, the same bound applies for all coordinates of
$H_d Dx$, since all rows of $H_d$ have the form
$\frac{1}{\sqrt{d}} \bmqty{\pm 1 & \pm1 & \dots & \pm1}$.
Thus, union bound over $d$ coordinates establishes the lemma.
\subsection{The Full Fast JL Transform}
{\it This presentation of FJLT is due to Jelani Nelson; see the notes \cite{jlnotes}.}
Putting all the pieces together,
the FJLT is defined as:
$$A = J S H_d D$$
A: \quad
\item $S$: the sparse coordinate-sampling matrix of Lemma~\ref{lem:S}
\item $H_d$: the $d \x d$ Hadamard transform.
\item $D$: diagonal iid $\pm 1$.
\item $J$: a dense ``normal'' JL matrix (iid Gaussian entries).
For parameters
\item $t = \Theta(\epsilon^{-2}\log(1/\delta)\log(d / \delta))$
\item $k = \Theta(\epsilon^{-2}\log(1/\delta))$
That is, we first precondition with the randomized Hadamard transform,
then sample random coordinates (which does most of the dimensionality reduction),
then finally apply a normal JL transform to get rid of the last $\log(d/\delta)$
factor in the dimension.
{\bf Correctness.} Since the matrix $D$ and the Hadamard transform are
isometric, they do not affect the norms of vectors.
Then, after the preconditioning,
Lemma~\ref{lem:S} guarantees that $S$ only affects norms by $(1 \pm
\epsilon)$, and Lemma~\ref{lem:jl} guarantees that the
final step is also roughly isometric. These steps fail w.p. $\delta$, so the
final transform affects norms by at most say $(1\pm 3\epsilon)$ except w.p.
%That is,
%$$\Pr[ ||Ax||^2_2 = (1 \pm 3\epsilon)||x||^2] \geq 1 - 3 \delta$$
This is sufficient to establish the JL embedding.
{\bf Runtime.}
For computing a JL embedding (ie, setting $\delta = 1/n^3, \epsilon=O(1)$),
the time to embed a single vector is
$O(d \log d + \log^3 n)$.
\section{Closing Remarks}
{\bf Optimality.}
The target dimension given by the JL construction is known to be optimal.
That is, one cannot embed $n$ points into dimension less than $k=\Omega(\epsilon^{-2}\log n)$
with distortion $\epsilon$.
The first near-optimal lower-bound, in \cite[Section 9]{alon}
works by showing upper-bounds on the number of nearly-orthogonal vectors in a
given dimension (so a too-good embedding of orthogonal vectors would violate this bound).
A more recent, optimal bound, is in \cite[Section 6]{kane}.
They actually show optimality of the JL Lemma (that is, restricting to linear
embeddings), which works (roughly) by arguing that if the target dimension is
too small, then the kernel is too big, so a random vector is likely to be very
{\bf Recent Advances.}
Note that the FJLT is fast, but is not \emph{sparse}.
We may hope that embedding a sparse vector $x$ will take time proportional to
the sparsity of $x$. A major result in this area was
the sparse JL construction of \cite{sparseJL}; see also the notes \cite{jlnotes}.
There is also work in derandomized JL, see for example \cite{kane}.
{\it I'll stop here, since I haven't read these works yet, but
perhaps we will revisit this another time.\\
This post was derived from my talk at Berkeley theory retreat,
on the theme of ``theoretical guarantees for machine learning.''