dppy.finite_dpps

Exact sampling

Consider a finite DPP defined by its correlation kernel K eq:inclusion_proba_DPP_K or likelihood kernel L eq:likelihood_DPP_L. There exist three main types of exact sampling procedures:

  1. The spectral method (used by default) requires the eigendecomposition of the correlation kernel K or the likelihood kernel L. It is based on the fact that generic DPPs are mixtures of projection DPPs <finite_dpps_mixture> together with the application of the chain rule to sample projection DPPs. It is presented in Section finite_dpps_exact_sampling_spectral_method.
  2. A Cholesky-based procedure which requires the correlation kernel K (even non-Hermitian!). It boils down to applying the chain rule on sets; where each item in turn is decided to be excluded or included in the sample. It is presented in Section finite_dpps_exact_sampling_cholesky_method.
  3. More recently, DeCaVa19 proposed an alternative method to get exact samples: first sample an intermediate distribution and correct the bias by thinning the intermediate sample using a carefully designed DPP. This approach does not require a Cholesky or eigendecomposition of the DPP kernel; instead, its runtime scales with the expected sample size of the DPP (see finite_dpps_number_of_points). It is presented in Section finite_dpps_exact_sampling_intermediate_sampling_method. A more refined procedure based on this approach was introduced in CaDeVa20 for k-DPP sampling.

In general, for small N (i.e., less than 1000) the spectral or Cholesky samplers are recommended for numerical stability. For larger N (i.e., up to millions) and moderate k (i.e., in the hundreds), intermediate sampling is recommended for scalability.

The following table summarizes the complexity of all exact samplers currently available, where the expected sample size 𝔼[|X|] is equal to k for k-DPPs and deff for random-sized DPPs.

+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+
| Sampler              | mode=   | Time to first sample | Time per subsequent sample | Notes                                                   |
+======================+=========+======================+============================+=========================================================+
| Spectral sampler     | "GS"    | O(N^3)               | O(N deff^2)                | The three variants differ slightly and, depending on    |
|                      |         |                      |                            | the DPP, can have different numerical stability.        |
+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+
| Cholesky sampler     | "chol"  | O(N^3)               | O(N^3)                     | Also works for non-Hermitian correlation kernels.       |
+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+
| Intermediate sampler | "vfx"   | O(N deff^6)          | O(deff^6)                  | For particularly ill-posed DPPs "vfx" can be more       |
|                      |         |                      |                            | numerically stable than "alpha".                        |
+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+
| Intermediate sampler | "alpha" | O(N deff^5)          | O(deff^6)                  | Worst-case runtime; depending on the DPP structure the  |
|                      |         |                      |                            | best case can be much faster than "vfx".                |
+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+

Note

  • There exist specific samplers for special DPPs, like the ones presented in Section exotic_dpps.

Important

In the next section, we describe Algorithm 18 of HKPV06, based on the chain rule, which was originally designed to sample continuous projection DPPs <continuous_dpps_exact_sampling_projection_dpp_chain_rule>. It has found a natural application in the finite setting for sampling projection DPP (K). However, we insist on the fact that this chain rule mechanism is specific to orthogonal projection kernels. In particular, it cannot be applied blindly to sample general k​-DPP (L); it is only valid when L is an orthogonal projection kernel.

This crucial point is developed in the following finite_kdpps_exact_sampling_chain_rule_projection_kernel_caution section.

Projection DPPs: the chain rule

In this section, we describe the generic projection DPP sampler, originally derived by HKPV06 Algorithm 18.

Recall that the number of points of a projection DPP (K) <finite_dpps_properties_number_of_points_dpp_K_projection> is, almost surely, equal to r = rank (K), so that the likelihood eq:likelihood_projection_K of S = {s1, …, sr} reads


ℙ[𝒳 = S] = det KS.

Using the invariance by permutation of the determinant and the fact that K is an orthogonal projection matrix, it is sufficient to apply the chain rule to sample (s1, …, sr) with joint distribution

$$\mathbb{P}[(s_1, \dots, s_r)] = \frac{1}{r!} \mathbb{P}[\mathcal{X}=\{s_1, \dots, s_r\}] = \frac{1}{r!} \det \mathbf{K}_S,$$

and forget about the sequential feature of the chain rule to get a valid sample {s1, …, sr} ∼ DPP (K).

Considering S = {s1, …, sr} such that ℙ[𝒳 = S] = det KS > 0, the following generic formulation of the chain rule

$$\mathbb{P}[(s_1, \dots, s_r)] = \mathbb{P}[s_1] \prod_{i=2}^{r} \mathbb{P}[s_{i} | s_{1:i-1}],$$

can be expressed as a telescopic ratio of determinants

$$\mathbb{P}[(s_1, \dots, s_r)] = \dfrac{\mathbf{K}_{s_1,s_1}}{r} \prod_{i=2}^{r} \dfrac{1}{r-(i-1)} \frac{\det \mathbf{K}_{S_i}} {\det \mathbf{K}_{S_{i-1}}},$$

where Si − 1 = {s1, …, si − 1}.

Using the determinant identity for block matrices (i.e., Schur complements), the ratios of determinants in eq:chain_rule_K can be expanded into

$$\mathbb{P}[(s_1, \dots, s_r)] = \dfrac{\mathbf{K}_{s_1,s_1}}{r} \prod_{i=2}^{r} \dfrac{ \mathbf{K}_{s_i, s_i} - \mathbf{K}_{s_i, S_{i-1}} {\mathbf{K}_{S_{i-1}}}^{-1} \mathbf{K}_{S_{i-1}, s_i} }{r-(i-1)}.$$
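
For illustration, here is a minimal standalone numpy sketch of this chain rule, written directly from eq:chain_rule_schur for a real symmetric orthogonal projection kernel K. It is not part of DPPy; in practice FiniteDPP.sample_exact should be preferred.

import numpy as np
from scipy.linalg import qr

def projection_dpp_chain_rule(K, rng):
    # Sequential sampling via eq:chain_rule_schur, assuming K^2 = K = K.T (real case)
    N = K.shape[0]
    r = int(round(np.trace(K)))  # rank(K) = trace(K) for a projection kernel
    sample = []
    for i in range(r):
        if not sample:
            cond = np.diag(K).copy()  # P[s_1 = j] = K_jj / r
        else:
            S = np.array(sample)
            # Schur complements K_jj - K_{j,S} K_S^{-1} K_{S,j}, for all j at once
            tmp = np.linalg.solve(K[np.ix_(S, S)], K[S, :])
            cond = np.diag(K) - np.einsum('sj,sj->j', K[S, :], tmp)
        cond[sample] = 0.0            # items already selected carry zero mass
        cond = np.clip(cond, 0.0, None)
        # in exact arithmetic cond.sum() == r - i; renormalize to absorb round-off
        sample.append(int(rng.choice(N, p=cond / cond.sum())))
    return sample

rng = np.random.RandomState(0)
r, N = 4, 10
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
K = eig_vecs.dot(eig_vecs.T)
print(projection_dpp_chain_rule(K, rng))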

Hint

MLers will recognize in eq:chain_rule_schur the incremental posterior variance of the Gaussian Process (GP) associated to K, see RaWi06 Equation 2.26.

Caution

The connection between the chain rule eq:chain_rule_schur and Gaussian Processes is valid in the case where the GP kernel is an orthogonal projection kernel, see also finite_kdpps_exact_sampling_chain_rule_projection_kernel_caution.

  • Algorithm 18 HKPV06
  • continuous_dpps_exact_sampling_projection_dpp_chain_rule in the continuous case

Geometrical interpretation

Hint

Since K is an orthogonal projection matrix, the following Gram factorizations provide an insightful geometrical interpretation of the chain rule mechanism eq:chain_rule_K:

1. Using K² = K and K† = K, we can write K = K K†, so that the chain rule eq:chain_rule_K becomes

$$\begin{aligned} \mathbb{P}[(s_1, \dots, s_r)] &= \frac{1}{r!} \operatorname{Volume}^2( \mathbf{K}_{s_{1},:}, \dots, \mathbf{K}_{s_{r},:} )\\ &= \dfrac{\left\| \mathbf{K}_{s_1,:} \right\|^2}{r} \prod_{i=2}^{r} \dfrac{ \operatorname{distance}^2 \left(\mathbf{K}_{s_{i},:}, \operatorname{Span} \left\{ \mathbf{K}_{s_{1},:}, \dots, \mathbf{K}_{s_{i-1},:} \right\} \right) }{r-(i-1)}. \end{aligned}$$

  2. Using the eigendecomposition, we can write K = U U† where U† U = I_r, so that the chain rule eq:chain_rule_K becomes

$$\begin{aligned} \mathbb{P}[(s_1, \dots, s_r)] &= \frac{1}{r!} \operatorname{Volume}^2( U_{s_{1},:}, \dots, U_{s_{r},:} )\\ &= \dfrac{\left\| U_{s_1,:} \right\|^2}{r} \prod_{i=2}^{r} \dfrac{ \operatorname{distance}^2 \left(U_{s_{i},:}, \operatorname{Span} \left\{ U_{s_{1},:}, \dots, U_{s_{i-1},:} \right\} \right) }{r-(i-1)}. \end{aligned}$$

In other words, the chain rule formulated as eq:chain_rule_dist2_K and eq:chain_rule_dist2_U is akin to performing Gram-Schmidt orthogonalization of the feature vectors K_{i,:} or U_{i,:}, respectively. These formulations can also be understood as an application of the base × height formula. In the end, projection DPPs favor sets of r = rank (K) items whose associated feature vectors span large volumes. This is another way of understanding diversity <finite_dpps_diversity>.
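
As a quick numerical check of this interpretation (a toy sketch, independent of DPPy), det K_S coincides with the squared volume of the corresponding rows of K, and equivalently of U, computed as a Gram determinant:

import numpy as np
from scipy.linalg import qr

rng = np.random.RandomState(0)

r, N = 4, 10
U, _ = qr(rng.randn(N, r), mode='economic')
K = U.dot(U.T)  # orthogonal projection kernel

S = [1, 4, 6, 8]  # any subset of size r = rank(K)
vol2_K = np.linalg.det(K[S, :].dot(K[S, :].T))  # Volume^2 of the rows K_{s,:}
vol2_U = np.linalg.det(U[S, :].dot(U[S, :].T))  # Volume^2 of the rows U_{s,:}
print(np.linalg.det(K[np.ix_(S, S)]), vol2_K, vol2_U)  # the three values coincide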

MCMC samplers

  • zonotope sampling <finite_dpps_mcmc_sampling_zonotope>
  • basis exchange <finite_dpps_mcmc_sampling_E>

In practice

The cost of getting one sample from a projection DPP (K) is of order 𝒪(N rank(K)²), whenever DPP (K) is defined through

  • K itself; sampling relies on formulations eq:chain_rule_dist2_K or eq:chain_rule_schur

    import numpy as np
    from scipy.linalg import qr
    from dppy.finite_dpps import FiniteDPP

    seed = 0
    rng = np.random.RandomState(seed)

    r, N = 4, 10
    eig_vals = np.ones(r)  # For projection DPP
    eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

    DPP = FiniteDPP(kernel_type='correlation',
                    projection=True,
                    **{'K': (eig_vecs * eig_vals).dot(eig_vecs.T)})

    for mode in ('GS', 'Schur', 'Chol'):  # default: GS

        rng = np.random.RandomState(seed)
        DPP.flush_samples()

        for _ in range(10):
            DPP.sample_exact(mode=mode, random_state=rng)

        print(DPP.sampling_mode)
        print(DPP.list_of_samples)

    GS
    [[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]]
    Schur
    [[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]]
    Chol
    [[5, 7, 6, 0], [4, 6, 5, 7], [9, 5, 0, 1], [5, 9, 2, 4], [0, 8, 1, 7], [9, 0, 5, 1], [0, 6, 5, 9], [5, 0, 1, 9], [5, 0, 2, 8], [5, 6, 9, 1]]

    • FiniteDPP.sample_exact
    • HKPV06 Theorem 7, Algorithm 18 and Proposition 19, for the original idea
    • Pou19 Algorithm 3, for the equivalent Cholesky-based perspective with cost of order 𝒪(N rank(K)²)
  • its eigenvectors U, i.e., $\mathbf{K} = U U^{\dagger}$ with $U^{\dagger} U = I_{\operatorname{rank}(\mathbf{K})}$; sampling relies on eq:chain_rule_dist2_U

    import numpy as np
    from scipy.linalg import qr
    from dppy.finite_dpps import FiniteDPP

    seed = 0
    rng = np.random.RandomState(seed)

    r, N = 4, 10
    eig_vals = np.ones(r)  # For projection DPP
    eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

    DPP = FiniteDPP(kernel_type='correlation',
                    projection=True,
                    **{'K_eig_dec': (eig_vals, eig_vecs)})

    rng = np.random.RandomState(seed)

    for _ in range(10):
        # mode='GS': Gram-Schmidt (default)
        DPP.sample_exact(mode='GS', random_state=rng)

    print(DPP.list_of_samples)

    [[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]]

    • FiniteDPP.sample_exact
    • HKPV06 Theorem 7, Algorithm 18 and Proposition 19, for the original idea
    • KuTa12 Algorithm 1, for a first interpretation of the spectral counterpart of HKPV06 Algorithm 18 running in 𝒪(N rank(K)³)
    • Gil14 Algorithm 2, for the 𝒪(N rank(K)²) implementation
    • TrBaAm18 Algorithm 3, for a technical report on DPP sampling

Spectral method

Main idea

The procedure stems from Theorem 7 of HKPV06, i.e., the fact that generic DPPs are mixtures of projection DPPs <finite_dpps_mixture>, suggesting the following two-step algorithm. Given the spectral decomposition of the correlation kernel K

$$\mathbf{K} = U \Lambda U^{\dagger} = \sum_{n=1}^{N} \lambda_n u_n u_n^{\dagger}.$$

Step 1. Draw independent Bernoulli random variables Bn ∼ ℬer (λn) for n = 1, …, N and collect ℬ = {n ; Bn=1},

Step 2. Sample from the projection DPP with correlation kernel $U_{:,\mathcal{B}} {U_{:,\mathcal{B}}}^{\dagger} = \sum_{n \in \mathcal{B}} u_n u_n^{\dagger}$, see the section above <finite_dpps_exact_sampling_projection_dpp_chain_rule_in_practice>.

Note

Step 1. selects a component of the mixture while Step 2. requires sampling from the corresponding projection DPP, cf. finite_dpps_exact_sampling_projection_dpp_chain_rule
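
As an illustration, the two steps can be sketched with DPPy itself, reusing the projection sampler of the previous section for Step 2. This is only for exposition: FiniteDPP.sample_exact already performs both steps when given a generic correlation kernel.

import numpy as np
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(0)

# Toy correlation kernel 0 <= K <= I, given by its eigendecomposition
r, N = 4, 10
eig_vals = rng.rand(r)
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

# Step 1. select a component of the mixture: B_n ~ Ber(lambda_n)
B = rng.rand(r) < eig_vals

# Step 2. sample from the projection DPP with eigenvectors U_{:,B}
if B.any():
    proj_DPP = FiniteDPP(kernel_type='correlation',
                         projection=True,
                         **{'K_eig_dec': (np.ones(B.sum()), eig_vecs[:, B])})
    proj_DPP.sample_exact(mode='GS', random_state=rng)
    print(proj_DPP.list_of_samples)
else:
    print([[]])  # empty mixture component: the sample is the empty set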

In practice

  • Sampling a projection DPP (K) from the eigendecomposition of $\mathbf{K} = U U^{\dagger}$ with $U^{\dagger} U = I_{\operatorname{rank}(\mathbf{K})}$ was presented in the section above <finite_dpps_exact_sampling_projection_dpp_chain_rule_in_practice>
  • Sampling DPP (K) from 0N ≼ K ≼ IN can be done by following

    Step 0. compute the eigendecomposition of $\mathbf{K} = U \Lambda U^{\dagger}$ in 𝒪(N³).

    Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> draw independent Bernoulli random variables Bn ∼ ℬer (λn) for n = 1, …, N and collect ℬ = {n ; Bn=1}

    Step 2. <finite_dpps_exact_sampling_spectral_method_step_2> sample from the projection DPP with correlation kernel defined by its eigenvectors U:,ℬ

    Important

    Step 0. must be performed once and for all in 𝒪(N3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where $\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{K})=\sum_{n=1}^{N} \lambda_n$.

    from numpy.random import RandomState
    from scipy.linalg import qr
    from dppy.finite_dpps import FiniteDPP

    rng = RandomState(0)

    r, N = 4, 10
    eig_vals = rng.rand(r)  # non-projection DPP: eigenvalues in (0, 1)
    eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

    DPP = FiniteDPP(kernel_type='correlation',
                    projection=False,
                    **{'K': (eig_vecs * eig_vals).dot(eig_vecs.T)})

    for _ in range(10):
        # mode='GS': Gram-Schmidt (default)
        DPP.sample_exact(mode='GS', random_state=rng)

    print(DPP.list_of_samples)

    [[7, 0, 1, 4], [6], [0, 9], [0, 9], [8, 5], [9], [6, 5, 9], [9], [3, 0], [5, 1, 6]]

  • Sampling DPP (L) from L ≽ 0N can be done by following

    Step 0. compute the eigendecomposition of $\mathbf{L} = V \Gamma V^{\dagger}$ in 𝒪(N³).

    Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> is adapted to: draw independent Bernoulli random variables $B_n \sim \operatorname{\mathcal{B}er}(\frac{\gamma_n}{1+\gamma_n})$ for n = 1, …, N and collect ℬ = {n ; Bn=1}

    Step 2. <finite_dpps_exact_sampling_spectral_method_step_2> is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors V:,ℬ.

    Important

    Step 0. must be performed once and for all in 𝒪(N3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where $\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{L(I+L)^{-1}})=\sum_{n=1}^{N} \frac{\gamma_n}{1+\gamma_n}$

    from numpy.random import RandomState
    from dppy.finite_dpps import FiniteDPP

    rng = RandomState(0)

    r, N = 4, 10
    phi = rng.randn(r, N)

    DPP = FiniteDPP(kernel_type='likelihood',
                    projection=False,
                    **{'L': phi.T.dot(phi)})

    for _ in range(10):
        # mode='GS': Gram-Schmidt (default)
        DPP.sample_exact(mode='GS', random_state=rng)

    print(DPP.list_of_samples)

    [[3, 1, 0, 4], [9, 6], [4, 1, 3, 0], [7, 0, 6, 4], [5, 0, 7], [4, 0, 2], [5, 3, 8, 4], [0, 5, 2], [7, 0, 2], [6, 0, 3]]

  • Sampling a DPP (L) for which each item is represented by a d ≤ N dimensional feature vector, all stored in a feature matrix Φ ∈ ℝ^{d×N}, so that $\mathbf{L} = \Phi^{\top} \Phi \succeq 0_N$, can be done by following

    Step 0. compute the so-called dual kernel $\tilde{\mathbf{L}} = \Phi \Phi^{\top} \in \mathbb{R}^{d \times d}$, eigendecompose it $\tilde{\mathbf{L}} = W \Delta W^{\top}$ and recover the eigenvectors of L as $V=\Phi^{\top}W \Delta^{-\frac{1}{2}}$ (a numerical check of this recovery is given after the example below). This corresponds to a cost of order 𝒪(Nd² + d³ + d² + Nd²).

    Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> is adapted to: draw independent Bernoulli random variables $B_i \sim \operatorname{\mathcal{B}er}(\frac{\delta_i}{1+\delta_i})$ for i = 1, …, d and collect ℬ = {i ; Bi=1}

    Step 2. <finite_dpps_exact_sampling_spectral_method_step_2> is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors $V_{:,\mathcal{B}} = [\Phi^{\top} W \Delta^{-1/2}]_{:,\mathcal{B}}$.

    Important

    Step 0. must be performed once and for all in 𝒪(Nd2 + d3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where $\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{\tilde{L}(I+\tilde{L})^{-1}})=\sum_{i=1}^{d} \frac{\delta_i}{1+\delta_i}\leq d$

    For a different perspective see

    • Gil14 Section 2.4.4 and Algorithm 3
    • KuTa12 Section 3.3.3 and Algorithm 3

    from numpy.random import RandomState
    from dppy.finite_dpps import FiniteDPP

    rng = RandomState(0)

    r, N = 4, 10
    phi = rng.randn(r, N)  # L = phi.T phi, L_dual = phi phi.T

    DPP = FiniteDPP(kernel_type='likelihood',
                    projection=False,
                    **{'L_gram_factor': phi})

    for _ in range(10):
        # mode='GS': Gram-Schmidt (default)
        DPP.sample_exact(mode='GS', random_state=rng)

    print(DPP.list_of_samples)

    L_dual = Phi Phi.T was computed: Phi (dxN) with d<N
    [[9, 0, 2, 3], [0, 1, 5, 2], [7, 0, 9, 4], [2, 0, 3], [6, 4, 0, 3], [5, 0, 6, 3], [0, 6, 3, 9], [4, 0, 9], [7, 3, 9, 4], [9, 4, 3]]
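
The eigenvector recovery of Step 0. can be checked directly with a few lines of numpy (a standalone verification, independent of DPPy):

import numpy as np

rng = np.random.RandomState(0)
d, N = 4, 10
Phi = rng.randn(d, N)

L = Phi.T.dot(Phi)                 # N x N likelihood kernel
L_dual = Phi.dot(Phi.T)            # d x d dual kernel

delta, W = np.linalg.eigh(L_dual)  # eigendecomposition of the dual kernel
V = Phi.T.dot(W / np.sqrt(delta))  # V = Phi^T W Delta^{-1/2}

# V holds orthonormal eigenvectors of L associated with the nonzero eigenvalues delta
print(np.allclose(L.dot(V), V * delta))    # L V = V Delta
print(np.allclose(V.T.dot(V), np.eye(d)))  # V^T V = I_d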

Cholesky-based method

Main idea

This method requires access to the correlation kernel K to perform a bottom-up chain rule on sets: starting from the empty set, each item in turn is decided to be added or excluded from the sample. This can be summarized as the exploration of the binary probability tree displayed in fig:cholesky_chain_rule_sets.

Figure: probability tree corresponding to the chain rule on sets.

Example: for N = 5, if {1,4} was sampled, the path in the probability tree would correspond to

$$\begin{aligned} \mathbb{P}\!\left[\mathcal{X} = \left\{ 1, 4 \right\}\right] = &\mathbb{P}\!\left[ 1\in \mathcal{X} \right]\\\ &\times\mathbb{P}\!\left[ 2\notin \mathcal{X} \mid 1\in \mathcal{X} \right]\\\ &\times\mathbb{P}\!\left[ 3\notin \mathcal{X} \mid 1\in \mathcal{X}, 2\notin \mathcal{X} \right]\\\ &\times\mathbb{P}\!\left[ 4\in \mathcal{X} \mid 1\in \mathcal{X}, \left\{ 2, 3 \right\} \cap \mathcal{X} = \emptyset \right]\\\ &\times\mathbb{P}\!\left[ 5\notin \mathcal{X} \mid \left\{ 1, 4 \right\} \subset \mathcal{X}, \left\{ 2, 3 \right\} \cap \mathcal{X} = \emptyset \right], \end{aligned}$$

where each conditional probability can be written in closed form using eq:conditioned_on_S_in_X and eq:conditioned_on_S_notin_X, namely

$$\begin{aligned} \mathbb{P}[T \subset \mathcal{X} \mid S \subset \mathcal{X}] &= \det\left[\mathbf{K}_T - \mathbf{K}_{TS} \mathbf{K}_S^{-1} \mathbf{K}_{ST}\right]\\\ \mathbb{P}[T \subset \mathcal{X} \mid S \cap \mathcal{X} = \emptyset] &= \det\left[\mathbf{K}_T - \mathbf{K}_{TS} (\mathbf{K}_S - I)^{-1} \mathbf{K}_{ST}\right]. \end{aligned}$$
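
As a sanity check, these two conditionals can be evaluated directly for a small toy kernel; the kernel and the sets S, T below are arbitrary choices for illustration.

import numpy as np
from scipy.linalg import qr

rng = np.random.RandomState(0)

# Toy Hermitian correlation kernel 0 <= K <= I
r, N = 3, 5
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
K = (eig_vecs * rng.rand(r)).dot(eig_vecs.T)

S, T = [0, 3], [1, 4]
K_S, K_T = K[np.ix_(S, S)], K[np.ix_(T, T)]
K_TS, K_ST = K[np.ix_(T, S)], K[np.ix_(S, T)]

# P[T subset X | S subset X] and P[T subset X | S inter X = empty]
p_in = np.linalg.det(K_T - K_TS.dot(np.linalg.solve(K_S, K_ST)))
p_out = np.linalg.det(K_T - K_TS.dot(np.linalg.solve(K_S - np.eye(len(S)), K_ST)))
print(p_in, p_out)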

Important

These quantities can be computed efficiently, as they appear in the computation of the Cholesky-type LDL or LU factorization of the correlation kernel K, in the Hermitian or non-Hermitian case, respectively. See Pou19 for the details.

Note

The sparsity of K can be leveraged to derive faster samplers using the correspondence between the chain rule on sets and Cholesky-type factorizations, see e.g., Pou19 Section 4.

In practice

Important

  • The method is fully generic since it applies to any (valid), even non-Hermitian, correlation kernel K.
  • Each sample costs 𝒪(N3).
  • Nevertheless, the link between the chain rule on sets and Cholesky-type factorizations is nicely reflected in the surprisingly simple implementation of the corresponding generic sampler, sketched below.
# Poulson (2019, Algorithm 1), Hermitian case
import numpy as np

def cholesky_sampler(K, rng=np.random):
    # Chain rule on sets, interleaved with an LDL-type factorization of K
    N = K.shape[0]
    A = K.copy().astype(float)
    sample = []

    for j in range(N):

        if rng.rand() < A[j, j]:  # include item j w.p. the conditional marginal A_jj
            sample.append(j)
        else:
            A[j, j] -= 1

        # update the conditionals of the remaining items
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])

    return sample, A

from numpy.random import RandomState
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

rng = RandomState(1)

r, N = 4, 10
eig_vals = rng.rand(r)
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

DPP = FiniteDPP(kernel_type='correlation',
                projection=False,
                **{'K': (eig_vecs * eig_vals).dot(eig_vecs.T)})

for _ in range(10):
    DPP.sample_exact(mode='Chol', random_state=rng)

print(DPP.list_of_samples)

[[2, 9], [0], [2], [6], [4, 9], [2, 7, 9], [0], [1, 9], [0, 1, 2], [2]]

  • Pou19
  • LaGaDe18

Intermediate sampling method

Main idea

This method is based on the concept of a distortion-free intermediate sample: we first draw a larger sample of points in such a way that we can then downsample to the correct DPP distribution. We assume access to the likelihood kernel L (although a variant of this method also exists for projection DPPs). Crucially, the sampling relies on an important connection between DPPs and so-called ridge leverage scores (RLS, see AlMa15), which are commonly used for sampling in randomized linear algebra. Namely, the marginal probability of the i-th item appearing in 𝒳 ∼ DPP (L) is also the i-th ridge leverage score of L (with ridge parameter equal to 1):


$\mathbb{P}[i \in \mathcal{X}] = [\mathbf{L}(I + \mathbf{L})^{-1}]_{ii} = \tau_i$, the i-th 1-ridge leverage score.

Suppose that we draw a sample σ of t points i.i.d. proportionally to the ridge leverage scores, i.e., σ = (σ1, σ2, ..., σt) such that ℙ[σj = i] ∝ τi. Intuitively, this sample is similar to 𝒳 ∼ DPP (L) because the marginals are the same, but it "ignores" all the dependencies between the points. However, if we sample sufficiently many points i.i.d. according to the RLS, then a proper sample 𝒳 will likely be contained within σ. This can be shown formally for t = O(𝔼[|𝒳|]²). When 𝔼[|𝒳|]² ≪ N, this allows us to reduce the size of the DPP kernel L from N × N to a much smaller t × t. Making this sampling exact requires considerably more care, because even with a large t there is always a small probability that the i.i.d. sample σ is not sufficiently diverse. We guard against this possibility by rejection sampling.
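
A minimal numpy sketch of this oversampling step, assuming the likelihood kernel fits in memory and computing the leverage scores exactly (the actual "vfx" and "alpha" samplers rely on fast approximate leverage scores):

import numpy as np

rng = np.random.RandomState(0)

# Toy likelihood kernel L = Phi.T Phi
d, N = 4, 100
Phi = rng.randn(d, N)
L = Phi.T.dot(Phi)

# 1-ridge leverage scores tau_i = [L (I + L)^{-1}]_ii = P[i in X]
tau = np.diag(L.dot(np.linalg.inv(np.eye(N) + L)))
k = tau.sum()  # = E[|X|] = trace(L (I + L)^{-1})

# i.i.d. intermediate sample of t = O(k^2) indices drawn proportionally to the RLS
t = int(np.ceil(k ** 2 * np.exp(1.0 / k)))
sigma = rng.choice(N, size=t, p=tau / tau.sum())
print(round(k, 2), t, sigma[:10])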

Important

Use this method for sampling 𝒳 ∼ DPP (L) when $\mathbb{E}\left[|\mathcal{X}|\right]\ll\sqrt{N}$.

  • Preprocessing costs 𝒪(N ⋅ poly(𝔼[|𝒳|]) polylog(N)).
  • Each sample costs 𝒪(𝔼[|𝒳|]6).

There are two implementations of intermediate sampling available in dppy: the mode='vfx' sampler and the mode='alpha' sampler.

In practice

from numpy.random import RandomState
from dppy.finite_dpps import FiniteDPP
from dppy.utils import example_eval_L_linear

rng = RandomState(1)

r, N = 4, 10

DPP = FiniteDPP('likelihood',
                **{'L_eval_X_data': (example_eval_L_linear, rng.randn(N, r))})

for _ in range(10):
    DPP.sample_exact(mode='vfx', random_state=rng, verbose=False)

print(DPP.list_of_samples)

[[5, 1, 0, 3], [9, 0, 8, 3], [6, 4, 1], [5, 1, 2], [2, 1, 3], [3, 8, 4, 0], [0, 8, 1], [7, 8], [1, 8, 2, 0], [5, 8, 3]]

The verbose=False flag is used to suppress the default progress bars when running in batch mode (e.g. when generating these docs).

Given the RLS τ1, …, τN, the normalization constant $\det(I+\mathbf{L})$ and access to the likelihood kernel L, the intermediate sampling method proceeds as follows:

$$\begin{aligned} &\textbf{repeat}&\\\ &&\text{sample }t \sim\mathrm{Poisson}(k^2\,\mathrm{e}^{1/k}),\hspace{3.5cm}\text{where }k=\mathbb{E}[|\mathcal{X}|]\\\ &&\text{sample }\sigma_1,\dots,\sigma_t\sim (\tau_1,\dots\tau_N),\\\ &&\text{sample } \textit{Acc}\sim\!\text{Bernoulli}\Big(\frac{\mathrm{e}^{k}\det(I+\tilde{\mathbf{L}}_{\sigma})}{\mathrm{e}^{t/k}\det(I+\mathbf{L})}\Big),\quad\text{where }\tilde{L}_{ij} = \frac1{k\sqrt{\tau_i\tau_j}}L_{ij},\\\ &\textbf{until }&\textit{Acc}=\text{true}\\\ &\textbf{return }&\mathcal{X}=\{\sigma_i:i\in \tilde{\mathcal{X}}\}\quad\text{where }\tilde{\mathcal{X}}\sim \operatorname{DPP}(\tilde{\mathbf{L}}_{\sigma}) \end{aligned}$$

It can be shown that 𝒳 is distributed exactly according to DPP (L) and the expected number of rejections is a small constant. The intermediate likelihood kernel $\tilde{\mathbf L}_\sigma$ forms a t × t DPP subproblem that can be solved using any other DPP sampler.
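
Here is a toy numpy sketch of this rejection scheme, again with exact precomputations for simplicity, and delegating the final t × t subproblem to FiniteDPP through its Gram factor. It only illustrates the logic above; it is not the actual "vfx" implementation.

import numpy as np
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(0)

# Toy likelihood kernel L = Phi.T Phi
d, N = 3, 50
Phi = rng.randn(d, N)
L = Phi.T.dot(Phi)

# Precomputations (exact here; accurate approximations suffice, see DeCaVa19)
tau = np.diag(L.dot(np.linalg.inv(np.eye(N) + L)))   # 1-ridge leverage scores
k = tau.sum()                                        # E[|X|]
_, logdet_I_L = np.linalg.slogdet(np.eye(N) + L)     # log det(I + L)

while True:
    t = rng.poisson(k ** 2 * np.exp(1.0 / k))
    if t == 0:
        continue
    sigma = rng.choice(N, size=t, p=tau / tau.sum())
    # L_tilde_ij = L_{sigma_i, sigma_j} / (k sqrt(tau_sigma_i tau_sigma_j))
    L_tilde = L[np.ix_(sigma, sigma)] / (k * np.sqrt(np.outer(tau[sigma], tau[sigma])))
    _, logdet_I_Lt = np.linalg.slogdet(np.eye(t) + L_tilde)
    # Acc ~ Bernoulli(e^k det(I + L_tilde) / (e^{t/k} det(I + L)))
    if np.log(rng.rand()) < k + logdet_I_Lt - t / k - logdet_I_L:
        break

# Downsample the t x t subproblem with any exact sampler for DPP(L_tilde);
# here, via its Gram factor Psi, so that L_tilde = Psi.T Psi
Psi = Phi[:, sigma] / np.sqrt(k * tau[sigma])
sub_DPP = FiniteDPP('likelihood', **{'L_gram_factor': Psi})
sub_DPP.sample_exact(random_state=rng)
X = sorted({int(sigma[i]) for i in sub_DPP.list_of_samples[-1]})
print(X)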

  • Since the size of the intermediate sample is t = 𝒪(𝔼[|𝒳|]²), the primary cost of the sampling is computing $\det(I+\tilde{\mathbf L}_\sigma)$, which takes 𝒪(t³) = 𝒪(𝔼[|𝒳|]⁶) time. This is also the expected cost of sampling from $\operatorname{DPP}(\tilde{\mathbf{L}}_{\sigma})$ if we use, for example, the spectral method.
  • The algorithm requires precomputing the RLS τ1, …, τN and det (I + L). Computing them exactly takes 𝒪(N³); however, surprisingly, if we use sufficiently accurate approximations then the exactness of the sampling can be retained (details in DeCaVa19). Efficient methods for approximating leverage scores (see RuCaCaRo18) bring the precomputing cost down to 𝒪(N poly(𝔼[|𝒳|]) polylog(N)).
  • When 𝔼[|𝒳|] is sufficiently small, the entire sampling procedure only looks at a small fraction of the entries of L. This makes the method useful when we want to avoid constructing the entire likelihood kernel.
  • When the likelihood kernel is given implicitly via a matrix X such that L = X Xᵀ (dual formulation), a version of this method is given by Dere19.
  • A variant of this method also exists for projection DPPs DeWaHs18
  • DeCaVa19 (Likelihood kernel)
  • Dere19 (Dual formulation)
  • DeWaHs18 (Projection DPP)

k-DPPs

Main idea

Recall from <finite_dpps_definition_k_dpps> eq:likelihood_kDPP_L that k​-DPP (L) can be viewed as a DPP (L) constrained to have a fixed cardinality k ≤ rank (L).

To generate a sample of k​-DPP (L), one natural solution would be to use a rejection mechanism: draw S ∼ DPP (L) and keep it only if |S| = k. However, the acceptance probability may be very low, depending on how k sits relative to the distribution of the number of points eq:number_of_points of S ∼ DPP (L).

An alternative solution was found by KuTa12 Section 5.2.2. The procedure relies on a slight modification of Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> of the finite_dpps_exact_sampling_spectral_method which requires the computation of the elementary symmetric polynomials.

In practice

Sampling k​-DPP (L) from L ≽ 0N can be done by following

Step 0.

  1. compute the eigendecomposition of L = VΓV in 𝒪(N3)
  2. evaluate the elementary symmetric polynomials in the eigenvalues of L: E[l, n] := e_l(γ1, …, γn) for 0 ≤ l ≤ k and 0 ≤ n ≤ N. These computations can be done recursively in 𝒪(Nk) using Algorithm 7 of KuTa12; a sketch of this recursion is given below.
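
A minimal sketch of this recursion, based on the identity e_l(γ_{1:n}) = e_l(γ_{1:n−1}) + γ_n e_{l−1}(γ_{1:n−1}):

import numpy as np

def elementary_symmetric_polynomials(gamma, k):
    # E[l, n] = e_l(gamma_1, ..., gamma_n) for 0 <= l <= k and 0 <= n <= len(gamma)
    N = len(gamma)
    E = np.zeros((k + 1, N + 1))
    E[0, :] = 1.0
    for l in range(1, k + 1):
        for n in range(1, N + 1):
            E[l, n] = E[l, n - 1] + gamma[n - 1] * E[l - 1, n - 1]
    return E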

Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> is replaced by Algorithm 8 of KuTa12 which we illustrate with the following pseudo-code

# Algorithm 8 of Kulesza Taskar (2012).
# This is pseudo-code: in particular, Python indexing is not respected everywhere

B = set()
l = k

for n in range(N, 0, -1):

  if Unif(0,1) < gamma[n] * E[l-1, n-1] / E[l, n]:
    l -= 1
    B.add(n)

    if l == 0:
      break
Step 2. <finite_dpps_exact_sampling_spectral_method_step_2> is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors V:,ℬ, with a cost of order 𝒪(Nk2).

Important

Step 0. must be performed once and for all in 𝒪(N3 + Nk). Then the cost of getting one sample by applying Steps 1. and 2. is 𝒪(Nk2).

import numpy as np
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(1)

r, N = 5, 10
# Random feature vectors
Phi = rng.randn(r, N)
DPP = FiniteDPP('likelihood', **{'L': Phi.T.dot(Phi)})

k = 4
for _ in range(10):
    DPP.sample_exact_k_dpp(size=k, random_state=rng)

print(DPP.list_of_samples)

[[1, 8, 5, 7], [3, 8, 5, 9], [5, 3, 1, 8], [5, 8, 2, 9], [1, 2, 9, 6], [1, 0, 2, 3], [7, 0, 3, 5], [8, 3, 7, 6], [0, 2, 3, 7], [1, 3, 7, 5]]

  • FiniteDPP.sample_exact_k_dpp
  • Step 0. requires KuTa12 Algorithm 7 for the recursive evaluation of the elementary symmetric polynomials $[e_l(\gamma_1, \dots, \gamma_n)]_{l=1, n=1}^{k, N}$ in the eigenvalues of L
  • Step 1. calls KuTa12 Algorithm 8 for selecting the eigenvectors for Step 2.

Caution

Attention

Since the number of points of k​-DPP (L) is fixed, like for projection DPPs <finite_dpps_properties_number_of_points_dpp_K_projection>, it might be tempting to sample k​-DPP (L) using a chain rule in the way it was applied in eq:chain_rule_schur to sample projection DPPs. However, it is incorrect: sampling sequentially


$$s_1 \propto \mathbf{L}_{s,s}, \quad \text{then} \quad s_i \mid s_1, \dots, s_{i-1} \propto \mathbf{L}_{s,s} - \mathbf{L}_{s, S_{i-1}} \mathbf{L}_{S_{i-1}}^{-1} \mathbf{L}_{S_{i-1}, s}, \quad \text{for } 2 \leq i \leq k,$$

where Si − 1 = {s1, …, si − 1}, and forgetting about the order s1, …, sk were selected does not provide a subset {s1, …, sk} ∼ k​-DPP (L), in the general case. Nevertheless, it is valid when L is an orthogonal projection kernel!

Here are the reasons why

  1. First keep in mind that the ultimate goal is to draw a subset S = {s1, …, sk} ∼ k​-DPP (L) with probability eq:likelihood_kDPP_L

$$\mathbb{P}[\mathcal{X}=S] = \frac{1}{e_k(\mathbf{L})} \det \mathbf{L}_S 1_{|S|=k}.$$

  2. Now, if we were to use the chain rule eq:chain_rule_caution this would correspond to sampling sequentially the items s1, …, sk, so that the resulting vector (s1, …, sk) has probability

$$\begin{aligned} \mathbb{Q}[(s_{1}, \dots, s_{k})] &= \dfrac{\mathbf{L}_{s_1,s_1}}{Z_1} \prod_{i=2}^{k} \dfrac{ \mathbf{L}_{s_i, s_i} - \mathbf{L}_{s_i, S_{i-1}} {\mathbf{L}_{S_{i-1}}}^{-1} \mathbf{L}_{S_{i-1}, s_i} }{Z_i(s_{1}, \dots, s_{i-1})}\\\ &= \frac{1}{Z(s_{1}, \dots, s_{k})} \det \mathbf{L}_S. \end{aligned}$$

Contrary to Z1 = trace (L), the normalizations Zi(s1, …, si − 1) of the successive conditionals depend, a priori, on the order in which s1, …, sk were selected. For this reason we denote by Z(s1, …, sk) the global normalization constant.

Warning

Equation eq:chain_rule_caution_vector suggests that the sequential nature of the chain rule matters, a priori: the distribution of (s1, …, sk) is a priori not exchangeable, i.e., not invariant to permutations of its coordinates. This can only come from the normalization Z(s1, …, sk), since det L_S is invariant by permutation.

Note

To see this, let's compute the normalization constant Zi(s1, …, si − 1) in eq:chain_rule_caution_vector for a generic L ≽ 0_N factored as $\mathbf{L} = V V^{\dagger}$, with no specific assumption on V.

$$\begin{aligned} Z_i(s_1, \dots, s_{i-1}) &= \sum_{j=1}^N \mathbf{L}_{jj} - \mathbf{L}_{j,S_{i-1}} \mathbf{L}_{S_{i-1}}^{-1} \mathbf{L}_{S_{i-1},j}\\ &= \operatorname{trace}\left( \mathbf{L} - \mathbf{L}_{:, S_{i-1}} \left[\mathbf{L}_{S_{i-1}}\right]^{-1} \mathbf{L}_{S_{i-1}, :} \right)\\ &= \operatorname{trace}\left( \mathbf{L} - V {V^{\dagger}}_{:,S_{i-1}} \left[V_{S_{i-1},:} {V^{\dagger}}_{:,S_{i-1}}\right]^{-1} V_{S_{i-1},:} V^{\dagger} \right)\\ &= \operatorname{trace}\left( V^{\dagger}V \right) - \operatorname{trace}\big( \underbrace{{V_{S_{i-1}, :}}^{\dagger} \left[V_{S_{i-1}, :} {V_{S_{i-1}, :}}^{\dagger}\right]^{-1} V_{S_{i-1}, :}}_{\Pi_{V_{S_{i-1}, :}}} V^{\dagger}V \big)\\ &= \operatorname{trace}(\mathbf{L}) - \operatorname{trace}(\Pi_{V_{S_{i-1}, :}}V^{\dagger}V), \end{aligned}$$

where $\Pi_{V_{S_{i-1},:}}$ denotes the orthogonal projection onto $\operatorname{Span}\{V_{s_1,:}, \dots, V_{s_{i-1},:}\}$, the subspace spanned by the feature vectors associated with s1, …, si − 1.

Then, summing eq:chain_rule_caution_vector over the k! permutations of 1, …, k, yields the probability of drawing the subset S = {s1,…,sk}, namely

$$\mathbb{Q}[\{ s_{1}, \dots, s_{k} \}] = \sum_{\sigma \in \mathfrak{S}_k} \mathbb{Q}[(s_{\sigma(1)}, \dots, s_{\sigma(k)})] = \det\mathbf{L}_S \underbrace{ \sum_{\sigma \in \mathfrak{S}_k} \frac{1}{Z(s_{\sigma(1)}, \dots, s_{\sigma(k)})} }_{ 1/Z_S }.$$

  3. For the chain rule eq:chain_rule_caution_vector to be a valid procedure for sampling k​-DPP (L), we must be able to identify eq:caution_likelihood_kDPP_L and eq:chain_rule_caution_set, i.e., ℚ[S] = ℙ[S] for all |S| = k, or equivalently ZS = ek(L) for all |S| = k.

Important

A sufficient condition (very likely to be necessary as well) is that the joint distribution of (s1, …, sk) generated by the chain rule mechanism eq:chain_rule_caution_vector is exchangeable, i.e., invariant to permutations of the coordinates. In that case, the normalization in eq:chain_rule_caution_vector would be a constant Z(s1, …, sk) = Z, so that ZS would in fact play the role of the normalization constant of eq:chain_rule_caution_set: it would be constant as well, equal to $Z_S = \frac{Z}{k!}$. Finally, ZS = ek(L) by identification of eq:caution_likelihood_kDPP_L and eq:chain_rule_caution_set.

This is what we can prove in the particular case where L is an orthogonal projection matrix.

To do this, denote r = rank (L) and recall that, in this case, L satisfies L² = L and L† = L, so that it can be factored as L = Π_L = L L† = L† L.

Finally, we can plug V = L in eq:chain_rule_caution_normalization_constant_conditional to obtain

$$\begin{aligned} Z_i(s_1, \dots, s_{i-1}) &= \operatorname{trace}(\mathbf{L}) - \operatorname{trace}(\Pi_{\mathbf{L}_{S_{i-1}, :}}\mathbf{L}^{\dagger}\mathbf{L})\\\ &= \operatorname{trace}(\Pi_{\mathbf{L}}) - \operatorname{trace}(\Pi_{\mathbf{L}_{S_{i-1}, :}}\Pi_{\mathbf{L}})\\\ &= \operatorname{trace}(\Pi_{\mathbf{L}}) - \operatorname{trace}(\Pi_{\mathbf{L}_{S_{i-1}, :}})\\\ &= \operatorname{rank}(\Pi_{\mathbf{L}}) - \operatorname{rank}(\Pi_{\mathbf{L}_{S_{i-1}, :}})\\\ &= r - (i - 1) := Z_i. \end{aligned}$$

Thus, the normalization Z(s1, …, sk) in eq:chain_rule_caution_normalization_constant_conditional is constant as well equal to

$$Z(s_1, \dots, s_k) = \prod_{i=1}^{k} Z_i = \prod_{i=1}^{k} \left(r - (i - 1)\right) = \frac{r!}{(r-k)!} = k! {r \choose k} = k! \, e_k(\mathbf{L}) := Z,$$

where the last equality is a simple computation of the elementary symmetric polynomial

$$\begin{aligned} e_k(\mathbf{L}) = e_k(\gamma_{1:r}=1, \gamma_{r+1:N}=0) = \sum_{\substack{S \subset [N]\\|S|=k}} \prod_{s\in S} \gamma_{s} = {r \choose k} \end{aligned}$$

Important

This shows that, when L is an orthogonal projection matrix, the order in which the items s1, …, sk were selected by the chain rule eq:chain_rule_caution_vector can be forgotten, so that {s1, …, sk} can be considered as a valid sample of k​-DPP (L).

# For our toy example, this sub-optimized implementation is enough
# to illustrate that the chain rule applied to sample k-DPP(L)
# draws s_1, ..., s_k sequentially, with joint probability
# P[(s_1, ..., s_k)] = det L_S / Z(s_1, ..., s_k)
#
# 1. is exchangeable when L is an orthogonal projection matrix
#    P[(s1, s2)] = P[(s_2, s_1)]
# 2. is a priori NOT exchangeable for a generic L >= 0
#    P[(s1, s2)] /= P[(s_2, s_1)]

import numpy as np
import scipy.linalg as LA
from itertools import combinations, permutations

k, N = 2, 4
potential_samples = list(combinations(range(N), k))

rank_L = 3

rng = np.random.RandomState(1)

eig_vecs, _ = LA.qr(rng.randn(N, rank_L), mode='economic')

for projection in [True, False]:

    eig_vals = 1.0 + (0.0 if projection else 2 * rng.rand(rank_L))
    L = (eig_vecs * eig_vals).dot(eig_vecs.T)

    proba = np.zeros((N, N))
    Z_1 = np.trace(L)

    for S in potential_samples:

        for s in permutations(S):

            proba[s] = LA.det(L[np.ix_(s, s)])

            Z_2_s0 = np.trace(L - L[:, s[:1]].dot(LA.inv(L[np.ix_(s[:1], s[:1])])).dot(L[s[:1], :]))

            proba[s] /= Z_1 * Z_2_s0

    print('L is {}projection'.format('' if projection else 'NOT '))

    print('P[s0, s1]', proba, sep='\n')
    print('P[s0]', proba.sum(axis=1), sep='\n')
    print('P[s1]', proba.sum(axis=0), sep='\n')

    print(proba.sum(), '\n' if projection else '')
L is projection
P[s0, s1]
[[0.         0.09085976 0.01298634 0.10338529]
 [0.09085976 0.         0.06328138 0.15368033]
 [0.01298634 0.06328138 0.         0.07580691]
 [0.10338529 0.15368033 0.07580691 0.        ]]
P[s0]
[0.20723139 0.30782147 0.15207463 0.33287252]
P[s1]
[0.20723139 0.30782147 0.15207463 0.33287252]
1.0000000000000002

L is NOT projection
P[s0, s1]
[[0.         0.09986722 0.01463696 0.08942385]
 [0.11660371 0.         0.08062998 0.20535251]
 [0.01222959 0.05769901 0.         0.04170435]
 [0.07995922 0.15726273 0.04463087 0.        ]]
P[s0]
[0.20392803 0.4025862  0.11163295 0.28185282]
P[s1]
[0.20879253 0.31482896 0.13989781 0.33648071]
1.0