dppy.finite_dpps

Exact sampling

Consider a finite DPP defined by its correlation kernel K eq:inclusion_proba_DPP_K or likelihood kernel L eq:likelihood_DPP_L. There exist three main types of exact sampling procedures:

  1. The spectral method (used by default) requires the eigendecomposition of the correlation kernel K or the likelihood kernel L. It is based on the fact that generic DPPs are mixtures of projection DPPs <finite_dpps_mixture> together with the application of the chain rule to sample projection DPPs. It is presented in Section finite_dpps_exact_sampling_spectral_method.
  2. A Cholesky-based procedure which requires the correlation kernel K (even non-Hermitian!). It boils down to applying the chain rule on sets; where each item in turn is decided to be excluded or included in the sample. It is presented in Section finite_dpps_exact_sampling_cholesky_method.
  3. More recently, DeCaVa19 proposed an alternative method to get exact samples: first sample an intermediate distribution and correct the bias by thinning the intermediate sample using a carefully designed DPP. This approach does not require a Cholesky or eigendecomposition of the DPP kernel; instead, its runtime scales with the expected sample size of the DPP (see finite_dpps_number_of_points). It is presented in Section finite_dpps_exact_sampling_intermediate_sampling_method. A more refined procedure based on this approach was introduced in CaDeVa20 for k-DPP sampling.

In general, for small N (i.e., less than 1000) the spectral or Cholesky samplers are recommended for numerical stability. For larger N (i.e., up to millions) and moderate k (i.e., in the hundreds), intermediate sampling is recommended for scalability.

The following table summarizes the complexity of all exact samplers currently available, where the expected sample size 𝔼[|X|] is equal to k for k-DPPs and deff for random-sized DPPs.

+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+
| Sampler              | mode=   | Time to first sample | Time per subsequent sample | Notes                                                   |
+======================+=========+======================+============================+=========================================================+
| Spectral sampler     | "GS"    | O(N^3)               | O(N deff^2)                | The three variants differ slightly and, depending on    |
|                      |         |                      |                            | the DPP, can have different numerical stability.        |
+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+
| Cholesky sampler     | "chol"  | O(N^3)               | O(N^3)                     | Also works for non-Hermitian correlation kernels.       |
+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+
| Intermediate sampler | "vfx"   | O(N deff^6)          | O(deff^6)                  | For particularly ill-posed DPPs "vfx" can be more       |
|                      |         |                      |                            | numerically stable than "alpha".                        |
+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+
| Intermediate sampler | "alpha" | O(N deff^5)          | O(deff^6)                  | Worst-case runtime; depending on the DPP structure the  |
|                      |         |                      |                            | best case can be much faster than "vfx".                |
+----------------------+---------+----------------------+----------------------------+---------------------------------------------------------+

Note

  • There exist specific samplers for special DPPs, like the ones presented in Section exotic_dpps.

Important

In the next section, we describe Algorithm 18 of HKPV06, based on the chain rule, which was originally designed to sample continuous projection DPPs <continuous_dpps_exact_sampling_projection_dpp_chain_rule>. It has found a natural application in the finite setting for sampling projection DPP (K). However, we insist on the fact that this chain rule mechanism is specific to orthogonal projection kernels. In particular, it cannot be applied blindly to sample general k​-DPP (L); it is only valid when L is an orthogonal projection kernel.

This crucial point is developed in the following finite_kdpps_exact_sampling_chain_rule_projection_kernel_caution section.

Projection DPPs: the chain rule

In this section, we describe the generic projection DPP sampler, originally derived by HKPV06 Algorithm 18.

Recall that the number of points of a projection DPP (K) <finite_dpps_properties_number_of_points_dpp_K_projection> is, almost surely, equal to r = rank (K), so that the likelihood eq:likelihood_projection_K of S = {s1, …, sr} reads


ℙ[𝒳 = S] = det KS.

Using the invariance by permutation of the determinant and the fact that K is an orthogonal projection matrix, it is sufficient to apply the chain rule to sample (s1, …, sr) with joint distribution

$$\mathbb{P}[(s_1, \dots, s_r)] = \frac{1}{r!} \mathbb{P}[\mathcal{X}=\{s_1, \dots, s_r\}] = \frac{1}{r!} \det \mathbf{K}_S,$$

and forget about the sequential feature of the chain rule to get a valid sample {s1, …, sr} ∼ DPP (K).

Considering S = {s1, …, sr} such that ℙ[𝒳 = S] = det KS > 0, the following generic formulation of the chain rule

$$\mathbb{P}[(s_1, \dots, s_r)] = \mathbb{P}[s_1] \prod_{i=2}^{r} \mathbb{P}[s_{i} | s_{1:i-1}],$$

can be expressed as a telescopic ratio of determinants

$$\mathbb{P}[(s_1, \dots, s_r)] = \dfrac{\mathbf{K}_{s_1,s_1}}{r} \prod_{i=2}^{r} \dfrac{1}{r-(i-1)} \frac{\det \mathbf{K}_{S_i}} {\det \mathbf{K}_{S_{i-1}}},$$

where Si − 1 = {s1, …, si − 1}.

Using the determinant identity for block matrices (i.e., Schur complements), the ratios of determinants in eq:chain_rule_K can be expanded into

$$\mathbb{P}[(s_1, \dots, s_r)] = \dfrac{\mathbf{K}_{s_1,s_1}}{r} \prod_{i=2}^{r} \dfrac{ \mathbf{K}_{s_i, s_i} - \mathbf{K}_{s_i, S_{i-1}} {\mathbf{K}_{S_{i-1}}}^{-1} \mathbf{K}_{S_{i-1}, s_i} }{r-(i-1)}.$$
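
For illustration, here is a minimal standalone numpy sketch of this chain rule, written directly from eq:chain_rule_schur for a real symmetric orthogonal projection kernel K. It is not part of DPPy; in practice FiniteDPP.sample_exact should be preferred.

import numpy as np
from scipy.linalg import qr

def projection_dpp_chain_rule(K, rng):
    # Sequential sampling via eq:chain_rule_schur, assuming K^2 = K = K.T (real case)
    N = K.shape[0]
    r = int(round(np.trace(K)))  # rank(K) = trace(K) for a projection kernel
    sample = []
    for i in range(r):
        if not sample:
            cond = np.diag(K).copy()  # P[s_1 = j] = K_jj / r
        else:
            S = np.array(sample)
            # Schur complements K_jj - K_{j,S} K_S^{-1} K_{S,j}, for all j at once
            tmp = np.linalg.solve(K[np.ix_(S, S)], K[S, :])
            cond = np.diag(K) - np.einsum('sj,sj->j', K[S, :], tmp)
        cond[sample] = 0.0            # items already selected carry zero mass
        cond = np.clip(cond, 0.0, None)
        # in exact arithmetic cond.sum() == r - i; renormalize to absorb round-off
        sample.append(int(rng.choice(N, p=cond / cond.sum())))
    return sample

rng = np.random.RandomState(0)
r, N = 4, 10
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
K = eig_vecs.dot(eig_vecs.T)
print(projection_dpp_chain_rule(K, rng))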

Hint

MLers will recognize in eq:chain_rule_schur the incremental posterior variance of the Gaussian Process (GP) associated to K, see RaWi06 Equation 2.26.

Caution

The connection between the chain rule eq:chain_rule_schur and Gaussian Processes is valid in the case where the GP kernel is an orthogonal projection kernel, see also finite_kdpps_exact_sampling_chain_rule_projection_kernel_caution.

  • Algorithm 18 HKPV06
  • continuous_dpps_exact_sampling_projection_dpp_chain_rule in the continuous case

Geometrical interpretation

Hint

Since K is an orthogonal projection matrix, the following Gram factorizations provide an insightful geometrical interpretation of the chain rule mechanism eq:chain_rule_K:

1. Using K² = K and K† = K, we can write K = K K†, so that the chain rule eq:chain_rule_K becomes

$$\begin{aligned} \mathbb{P}[(s_1, \dots, s_r)] &= \frac{1}{r!} \operatorname{Volume}^2( \mathbf{K}_{s_{1},:}, \dots, \mathbf{K}_{s_{r},:} )\\ &= \dfrac{\left\| \mathbf{K}_{s_1,:} \right\|^2}{r} \prod_{i=2}^{r} \dfrac{ \operatorname{distance}^2 \left(\mathbf{K}_{s_{i},:}, \operatorname{Span} \left\{ \mathbf{K}_{s_{1},:}, \dots, \mathbf{K}_{s_{i-1},:} \right\} \right) }{r-(i-1)}. \end{aligned}$$

  2. Using the eigendecomposition, we can write K = U U† where U† U = I_r, so that the chain rule eq:chain_rule_K becomes

$$\begin{aligned} \mathbb{P}[(s_1, \dots, s_r)] &= \frac{1}{r!} \operatorname{Volume}^2( U_{s_{1},:}, \dots, U_{s_{r},:} )\\ &= \dfrac{\left\| U_{s_1,:} \right\|^2}{r} \prod_{i=2}^{r} \dfrac{ \operatorname{distance}^2 \left(U_{s_{i},:}, \operatorname{Span} \left\{ U_{s_{1},:}, \dots, U_{s_{i-1},:} \right\} \right) }{r-(i-1)}. \end{aligned}$$

In other words, the chain rule formulated as eq:chain_rule_dist2_K and eq:chain_rule_dist2_U is akin to performing Gram-Schmidt orthogonalization of the feature vectors K_{i,:} or U_{i,:}, respectively. These formulations can also be understood as an application of the base × height formula. In the end, projection DPPs favor sets of r = rank (K) items whose associated feature vectors span large volumes. This is another way of understanding diversity <finite_dpps_diversity>.
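
As a quick numerical check of this interpretation (a toy sketch, independent of DPPy), det K_S coincides with the squared volume of the corresponding rows of K, and equivalently of U, computed as a Gram determinant:

import numpy as np
from scipy.linalg import qr

rng = np.random.RandomState(0)

r, N = 4, 10
U, _ = qr(rng.randn(N, r), mode='economic')
K = U.dot(U.T)  # orthogonal projection kernel

S = [1, 4, 6, 8]  # any subset of size r = rank(K)
vol2_K = np.linalg.det(K[S, :].dot(K[S, :].T))  # Volume^2 of the rows K_{s,:}
vol2_U = np.linalg.det(U[S, :].dot(U[S, :].T))  # Volume^2 of the rows U_{s,:}
print(np.linalg.det(K[np.ix_(S, S)]), vol2_K, vol2_U)  # the three values coincide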

MCMC samplers

  • zonotope sampling <finite_dpps_mcmc_sampling_zonotope>
  • basis exchange <finite_dpps_mcmc_sampling_E>

In practice

The cost of getting one sample from a projection DPP (K) is of order 𝒪(N rank(K)²), whenever DPP (K) is defined through

  • K itself; sampling relies on formulations eq:chain_rule_dist2_K or eq:chain_rule_schur

    import numpy as np
    from scipy.linalg import qr
    from dppy.finite_dpps import FiniteDPP

    seed = 0
    rng = np.random.RandomState(seed)

    r, N = 4, 10
    eig_vals = np.ones(r)  # For projection DPP
    eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

    DPP = FiniteDPP(kernel_type='correlation',
                    projection=True,
                    **{'K': (eig_vecs * eig_vals).dot(eig_vecs.T)})

    for mode in ('GS', 'Schur', 'Chol'):  # default: GS

        rng = np.random.RandomState(seed)
        DPP.flush_samples()

        for _ in range(10):
            DPP.sample_exact(mode=mode, random_state=rng)

        print(DPP.sampling_mode)
        print(DPP.list_of_samples)

    GS
    [[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]]
    Schur
    [[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]]
    Chol
    [[5, 7, 6, 0], [4, 6, 5, 7], [9, 5, 0, 1], [5, 9, 2, 4], [0, 8, 1, 7], [9, 0, 5, 1], [0, 6, 5, 9], [5, 0, 1, 9], [5, 0, 2, 8], [5, 6, 9, 1]]

    • FiniteDPP.sample_exact
    • HKPV06 Theorem 7, Algorithm 18 and Proposition 19, for the original idea
    • Pou19 Algorithm 3, for the equivalent Cholesky-based perspective with cost of order 𝒪(N rank(K)²)
  • its eigenvectors U, i.e., $\mathbf{K} = U U^{\dagger}$ with $U^{\dagger} U = I_{\operatorname{rank}(\mathbf{K})}$; sampling relies on eq:chain_rule_dist2_U

    import numpy as np
    from scipy.linalg import qr
    from dppy.finite_dpps import FiniteDPP

    seed = 0
    rng = np.random.RandomState(seed)

    r, N = 4, 10
    eig_vals = np.ones(r)  # For projection DPP
    eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

    DPP = FiniteDPP(kernel_type='correlation',
                    projection=True,
                    **{'K_eig_dec': (eig_vals, eig_vecs)})

    rng = np.random.RandomState(seed)

    for _ in range(10):
        # mode='GS': Gram-Schmidt (default)
        DPP.sample_exact(mode='GS', random_state=rng)

    print(DPP.list_of_samples)

    [[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]]

    • FiniteDPP.sample_exact
    • HKPV06 Theorem 7, Algorithm 18 and Proposition 19, for the original idea
    • KuTa12 Algorithm 1, for a first interpretation of the spectral counterpart of HKPV06 Algorithm 18 running in 𝒪(N rank(K)³)
    • Gil14 Algorithm 2, for the 𝒪(N rank(K)²) implementation
    • TrBaAm18 Algorithm 3, for a technical report on DPP sampling

Spectral method

Main idea

The procedure stems from Theorem 7 of HKPV06, i.e., the fact that generic DPPs are mixtures of projection DPPs <finite_dpps_mixture>, suggesting the following two-step algorithm. Given the spectral decomposition of the correlation kernel K

$$\mathbf{K} = U \Lambda U^{\dagger} = \sum_{n=1}^{N} \lambda_n u_n u_n^{\dagger}.$$

Step 1. Draw independent Bernoulli random variables Bn ∼ ℬer (λn) for n = 1, …, N and collect ℬ = {n ; Bn=1},

Step 2. Sample from the projection DPP with correlation kernel $U_{:,\mathcal{B}} {U_{:,\mathcal{B}}}^{\dagger} = \sum_{n \in \mathcal{B}} u_n u_n^{\dagger}$, see the section above <finite_dpps_exact_sampling_projection_dpp_chain_rule_in_practice>.

Note

Step 1. selects a component of the mixture while Step 2. requires sampling from the corresponding projection DPP, cf. finite_dpps_exact_sampling_projection_dpp_chain_rule
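
As an illustration, the two steps can be sketched with DPPy itself, reusing the projection sampler of the previous section for Step 2. This is only for exposition: FiniteDPP.sample_exact already performs both steps when given a generic correlation kernel.

import numpy as np
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(0)

# Toy correlation kernel 0 <= K <= I, given by its eigendecomposition
r, N = 4, 10
eig_vals = rng.rand(r)
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

# Step 1. select a component of the mixture: B_n ~ Ber(lambda_n)
B = rng.rand(r) < eig_vals

# Step 2. sample from the projection DPP with eigenvectors U_{:,B}
if B.any():
    proj_DPP = FiniteDPP(kernel_type='correlation',
                         projection=True,
                         **{'K_eig_dec': (np.ones(B.sum()), eig_vecs[:, B])})
    proj_DPP.sample_exact(mode='GS', random_state=rng)
    print(proj_DPP.list_of_samples)
else:
    print([[]])  # empty mixture component: the sample is the empty set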

In practice

  • Sampling a projection DPP (K) from the eigendecomposition of $\mathbf{K} = U U^{\dagger}$ with $U^{\dagger} U = I_{\operatorname{rank}(\mathbf{K})}$ was presented in the section above <finite_dpps_exact_sampling_projection_dpp_chain_rule_in_practice>
  • Sampling DPP (K) from 0N ≼ K ≼ IN can be done by following

    Step 0. compute the eigendecomposition of $\mathbf{K} = U \Lambda U^{\dagger}$ in 𝒪(N³).

    Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> draw independent Bernoulli random variables Bn ∼ ℬer (λn) for n = 1, …, N and collect ℬ = {n ; Bn=1}

    Step 2. <finite_dpps_exact_sampling_spectral_method_step_2> sample from the projection DPP with correlation kernel defined by its eigenvectors U:,ℬ

    Important

    Step 0. must be performed once and for all in 𝒪(N3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where $\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{K})=\sum_{n=1}^{N} \lambda_n$.

    from numpy.random import RandomState
    from scipy.linalg import qr
    from dppy.finite_dpps import FiniteDPP

    rng = RandomState(0)

    r, N = 4, 10
    eig_vals = rng.rand(r)  # non-projection DPP: eigenvalues in (0, 1)
    eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

    DPP = FiniteDPP(kernel_type='correlation',
                    projection=False,
                    **{'K': (eig_vecs * eig_vals).dot(eig_vecs.T)})

    for _ in range(10):
        # mode='GS': Gram-Schmidt (default)
        DPP.sample_exact(mode='GS', random_state=rng)

    print(DPP.list_of_samples)

    [[7, 0, 1, 4], [6], [0, 9], [0, 9], [8, 5], [9], [6, 5, 9], [9], [3, 0], [5, 1, 6]]

  • Sampling DPP (L) from L ≽ 0N can be done by following

    Step 0. compute the eigendecomposition of $\mathbf{L} = V \Gamma V^{\dagger}$ in 𝒪(N³).

    Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> is adapted to: draw independent Bernoulli random variables $B_n \sim \operatorname{\mathcal{B}er}(\frac{\gamma_n}{1+\gamma_n})$ for n = 1, …, N and collect ℬ = {n ; Bn=1}

    Step 2. <finite_dpps_exact_sampling_spectral_method_step_2> is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors V:,ℬ.

    Important

    Step 0. must be performed once and for all in 𝒪(N3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where $\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{L(I+L)^{-1}})=\sum_{n=1}^{N} \frac{\gamma_n}{1+\gamma_n}$

    from numpy.random import RandomState
    from dppy.finite_dpps import FiniteDPP

    rng = RandomState(0)

    r, N = 4, 10
    phi = rng.randn(r, N)

    DPP = FiniteDPP(kernel_type='likelihood',
                    projection=False,
                    **{'L': phi.T.dot(phi)})

    for _ in range(10):
        # mode='GS': Gram-Schmidt (default)
        DPP.sample_exact(mode='GS', random_state=rng)

    print(DPP.list_of_samples)

    [[3, 1, 0, 4], [9, 6], [4, 1, 3, 0], [7, 0, 6, 4], [5, 0, 7], [4, 0, 2], [5, 3, 8, 4], [0, 5, 2], [7, 0, 2], [6, 0, 3]]

  • Sampling a DPP (L) for which each item is represented by a d ≤ N dimensional feature vector, all stored in a feature matrix Φ ∈ ℝ^{d×N}, so that $\mathbf{L} = \Phi^{\top} \Phi \succeq 0_N$, can be done by following

    Step 0. compute the so-called dual kernel $\tilde{\mathbf{L}} = \Phi \Phi^{\top} \in \mathbb{R}^{d \times d}$, eigendecompose it $\tilde{\mathbf{L}} = W \Delta W^{\top}$ and recover the eigenvectors of L as $V=\Phi^{\top}W \Delta^{-\frac{1}{2}}$ (a numerical check of this recovery is given after the example below). This corresponds to a cost of order 𝒪(Nd² + d³ + d² + Nd²).

    Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> is adapted to: draw independent Bernoulli random variables $B_i \sim \operatorname{\mathcal{B}er}(\frac{\delta_i}{1+\delta_i})$ for i = 1, …, d and collect ℬ = {i ; Bi=1}

    Step 2. <finite_dpps_exact_sampling_spectral_method_step_2> is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors $V_{:,\mathcal{B}} = [\Phi^{\top} W \Delta^{-1/2}]_{:,\mathcal{B}}$.

    Important

    Step 0. must be performed once and for all in 𝒪(Nd2 + d3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where $\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{\tilde{L}(I+\tilde{L})^{-1}})=\sum_{i=1}^{d} \frac{\delta_i}{1+\delta_i}\leq d$

    For a different perspective see

    • Gil14 Section 2.4.4 and Algorithm 3
    • KuTa12 Section 3.3.3 and Algorithm 3

    from numpy.random import RandomState
    from dppy.finite_dpps import FiniteDPP

    rng = RandomState(0)

    r, N = 4, 10
    phi = rng.randn(r, N)  # L = phi.T phi, L_dual = phi phi.T

    DPP = FiniteDPP(kernel_type='likelihood',
                    projection=False,
                    **{'L_gram_factor': phi})

    for _ in range(10):
        # mode='GS': Gram-Schmidt (default)
        DPP.sample_exact(mode='GS', random_state=rng)

    print(DPP.list_of_samples)

    L_dual = Phi Phi.T was computed: Phi (dxN) with d<N
    [[9, 0, 2, 3], [0, 1, 5, 2], [7, 0, 9, 4], [2, 0, 3], [6, 4, 0, 3], [5, 0, 6, 3], [0, 6, 3, 9], [4, 0, 9], [7, 3, 9, 4], [9, 4, 3]]
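
The eigenvector recovery of Step 0. can be checked directly with a few lines of numpy (a standalone verification, independent of DPPy):

import numpy as np

rng = np.random.RandomState(0)
d, N = 4, 10
Phi = rng.randn(d, N)

L = Phi.T.dot(Phi)                 # N x N likelihood kernel
L_dual = Phi.dot(Phi.T)            # d x d dual kernel

delta, W = np.linalg.eigh(L_dual)  # eigendecomposition of the dual kernel
V = Phi.T.dot(W / np.sqrt(delta))  # V = Phi^T W Delta^{-1/2}

# V holds orthonormal eigenvectors of L associated with the nonzero eigenvalues delta
print(np.allclose(L.dot(V), V * delta))    # L V = V Delta
print(np.allclose(V.T.dot(V), np.eye(d)))  # V^T V = I_d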

Cholesky-based method

Main idea

This method requires access to the correlation kernel K to perform a bottom-up chain rule on sets: starting from the empty set, each item in turn is decided to be added or excluded from the sample. This can be summarized as the exploration of the binary probability tree displayed in fig:cholesky_chain_rule_sets.

Figure: probability tree corresponding to the chain rule on sets.

Example: for N = 5, if {1,4} was sampled, the path in the probability tree would correspond to

$$\begin{aligned} \mathbb{P}\!\left[\mathcal{X} = \left\{ 1, 4 \right\}\right] = &\mathbb{P}\!\left[ 1\in \mathcal{X} \right]\\\ &\times\mathbb{P}\!\left[ 2\notin \mathcal{X} \mid 1\in \mathcal{X} \right]\\\ &\times\mathbb{P}\!\left[ 3\notin \mathcal{X} \mid 1\in \mathcal{X}, 2\notin \mathcal{X} \right]\\\ &\times\mathbb{P}\!\left[ 4\in \mathcal{X} \mid 1\in \mathcal{X}, \left\{ 2, 3 \right\} \cap \mathcal{X} = \emptyset \right]\\\ &\times\mathbb{P}\!\left[ 5\notin \mathcal{X} \mid \left\{ 1, 4 \right\} \subset \mathcal{X}, \left\{ 2, 3 \right\} \cap \mathcal{X} = \emptyset \right], \end{aligned}$$

where each conditional probability can be written in closed form using eq:conditioned_on_S_in_X and eq:conditioned_on_S_notin_X, namely

$$\begin{aligned} \mathbb{P}[T \subset \mathcal{X} \mid S \subset \mathcal{X}] &= \det\left[\mathbf{K}_T - \mathbf{K}_{TS} \mathbf{K}_S^{-1} \mathbf{K}_{ST}\right]\\\ \mathbb{P}[T \subset \mathcal{X} \mid S \cap \mathcal{X} = \emptyset] &= \det\left[\mathbf{K}_T - \mathbf{K}_{TS} (\mathbf{K}_S - I)^{-1} \mathbf{K}_{ST}\right]. \end{aligned}$$
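
As a sanity check, these two conditionals can be evaluated directly for a small toy kernel; the kernel and the sets S, T below are arbitrary choices for illustration.

import numpy as np
from scipy.linalg import qr

rng = np.random.RandomState(0)

# Toy Hermitian correlation kernel 0 <= K <= I
r, N = 3, 5
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
K = (eig_vecs * rng.rand(r)).dot(eig_vecs.T)

S, T = [0, 3], [1, 4]
K_S, K_T = K[np.ix_(S, S)], K[np.ix_(T, T)]
K_TS, K_ST = K[np.ix_(T, S)], K[np.ix_(S, T)]

# P[T subset X | S subset X] and P[T subset X | S inter X = empty]
p_in = np.linalg.det(K_T - K_TS.dot(np.linalg.solve(K_S, K_ST)))
p_out = np.linalg.det(K_T - K_TS.dot(np.linalg.solve(K_S - np.eye(len(S)), K_ST)))
print(p_in, p_out)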

Important

These quantities can be computed efficiently, as they appear in the computation of the Cholesky-type LDL or LU factorization of the correlation kernel K, in the Hermitian or non-Hermitian case, respectively. See Pou19 for the details.

Note

The sparsity of K can be leveraged to derive faster samplers using the correspondence between the chain rule on sets and Cholesky-type factorizations, see e.g., Pou19 Section 4.

In practice

Important

  • The method is fully generic since it applies to any (valid), even non-Hermitian, correlation kernel K.
  • Each sample costs 𝒪(N3).
  • Nevertheless, the link between the chain rule on sets and Cholesky-type factorizations is nicely reflected in the surprisingly simple implementation of the corresponding generic sampler, sketched below.
# Poulson (2019, Algorithm 1), Hermitian case
import numpy as np

def cholesky_sampler(K, rng=np.random):
    # Chain rule on sets, interleaved with an LDL-type factorization of K
    N = K.shape[0]
    A = K.copy().astype(float)
    sample = []

    for j in range(N):

        if rng.rand() < A[j, j]:  # include item j w.p. the conditional marginal A_jj
            sample.append(j)
        else:
            A[j, j] -= 1

        # update the conditionals of the remaining items
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])

    return sample, A

from numpy.random import RandomState
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

rng = RandomState(1)

r, N = 4, 10
eig_vals = rng.rand(r)
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

DPP = FiniteDPP(kernel_type='correlation',
                projection=False,
                **{'K': (eig_vecs * eig_vals).dot(eig_vecs.T)})

for _ in range(10):
    DPP.sample_exact(mode='Chol', random_state=rng)

print(DPP.list_of_samples)

[[2, 9], [0], [2], [6], [4, 9], [2, 7, 9], [0], [1, 9], [0, 1, 2], [2]]

  • Pou19
  • LaGaDe18

Intermediate sampling method

Main idea

This method is based on the concept of a distortion-free intermediate sample: we first draw a larger sample of points in such a way that we can then downsample to the correct DPP distribution. We assume access to the likelihood kernel L (although a variant of this method also exists for projection DPPs). Crucially, the sampling relies on an important connection between DPPs and so-called ridge leverage scores (RLS, see AlMa15), which are commonly used for sampling in randomized linear algebra. Namely, the marginal probability of the i-th item appearing in 𝒳 ∼ DPP (L) is also the i-th ridge leverage score of L (with ridge parameter equal to 1):


$\mathbb{P}[i \in \mathcal{X}] = [\mathbf{L}(I + \mathbf{L})^{-1}]_{ii} = \tau_i$, the i-th 1-ridge leverage score.

Suppose that we draw a sample σ of t points i.i.d. proportionally to the ridge leverage scores, i.e., σ = (σ1, σ2, ..., σt) such that ℙ[σj = i] ∝ τi. Intuitively, this sample is similar to 𝒳 ∼ DPP (L) because the marginals are the same, but it "ignores" all the dependencies between the points. However, if we sample sufficiently many points i.i.d. according to the RLS, then a proper sample 𝒳 will likely be contained within σ. This can be shown formally for t = O(𝔼[|𝒳|]²). When 𝔼[|𝒳|]² ≪ N, this allows us to reduce the size of the DPP kernel L from N × N to a much smaller t × t. Making this sampling exact requires considerably more care, because even with a large t there is always a small probability that the i.i.d. sample σ is not sufficiently diverse. We guard against this possibility by rejection sampling.
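
A minimal numpy sketch of this oversampling step, assuming the likelihood kernel fits in memory and computing the leverage scores exactly (the actual "vfx" and "alpha" samplers rely on fast approximate leverage scores):

import numpy as np

rng = np.random.RandomState(0)

# Toy likelihood kernel L = Phi.T Phi
d, N = 4, 100
Phi = rng.randn(d, N)
L = Phi.T.dot(Phi)

# 1-ridge leverage scores tau_i = [L (I + L)^{-1}]_ii = P[i in X]
tau = np.diag(L.dot(np.linalg.inv(np.eye(N) + L)))
k = tau.sum()  # = E[|X|] = trace(L (I + L)^{-1})

# i.i.d. intermediate sample of t = O(k^2) indices drawn proportionally to the RLS
t = int(np.ceil(k ** 2 * np.exp(1.0 / k)))
sigma = rng.choice(N, size=t, p=tau / tau.sum())
print(round(k, 2), t, sigma[:10])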

Important

Use this method for sampling 𝒳 ∼ DPP (L) when $\mathbb{E}\left[|\mathcal{X}|\right]\ll\sqrt{N}$.

  • Preprocessing costs 𝒪(N ⋅ poly(𝔼[|𝒳|]) polylog(N)).
  • Each sample costs 𝒪(𝔼[|𝒳|]6).

There are two implementations of intermediate sampling available in dppy: the mode='vfx' sampler and the mode='alpha' sampler.

In practice

from numpy.random import RandomState
from dppy.finite_dpps import FiniteDPP
from dppy.utils import example_eval_L_linear

rng = RandomState(1)

r, N = 4, 10

DPP = FiniteDPP('likelihood',
                **{'L_eval_X_data': (example_eval_L_linear, rng.randn(N, r))})

for _ in range(10):
    DPP.sample_exact(mode='vfx', random_state=rng, verbose=False)

print(DPP.list_of_samples)

[[5, 1, 0, 3], [9, 0, 8, 3], [6, 4, 1], [5, 1, 2], [2, 1, 3], [3, 8, 4, 0], [0, 8, 1], [7, 8], [1, 8, 2, 0], [5, 8, 3]]

The verbose=False flag is used to suppress the default progress bars when running in batch mode (e.g. when generating these docs).

Given the RLS τ1, …, τN, the normalization constant $\det(I+\mathbf{L})$ and access to the likelihood kernel L, the intermediate sampling method proceeds as follows:

$$\begin{aligned} &\textbf{repeat}&\\\ &&\text{sample }t \sim\mathrm{Poisson}(k^2\,\mathrm{e}^{1/k}),\hspace{3.5cm}\text{where }k=\mathbb{E}[|\mathcal{X}|]\\\ &&\text{sample }\sigma_1,\dots,\sigma_t\sim (\tau_1,\dots\tau_N),\\\ &&\text{sample } \textit{Acc}\sim\!\text{Bernoulli}\Big(\frac{\mathrm{e}^{k}\det(I+\tilde{\mathbf{L}}_{\sigma})}{\mathrm{e}^{t/k}\det(I+\mathbf{L})}\Big),\quad\text{where }\tilde{L}_{ij} = \frac1{k\sqrt{\tau_i\tau_j}}L_{ij},\\\ &\textbf{until }&\textit{Acc}=\text{true}\\\ &\textbf{return }&\mathcal{X}=\{\sigma_i:i\in \tilde{\mathcal{X}}\}\quad\text{where }\tilde{\mathcal{X}}\sim \operatorname{DPP}(\tilde{\mathbf{L}}_{\sigma}) \end{aligned}$$

It can be shown that 𝒳 is distributed exactly according to DPP (L) and the expected number of rejections is a small constant. The intermediate likelihood kernel $\tilde{\mathbf L}_\sigma$ forms a t × t DPP subproblem that can be solved using any other DPP sampler.
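
Here is a toy numpy sketch of this rejection scheme, again with exact precomputations for simplicity, and delegating the final t × t subproblem to FiniteDPP through its Gram factor. It only illustrates the logic above; it is not the actual "vfx" implementation.

import numpy as np
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(0)

# Toy likelihood kernel L = Phi.T Phi
d, N = 3, 50
Phi = rng.randn(d, N)
L = Phi.T.dot(Phi)

# Precomputations (exact here; accurate approximations suffice, see DeCaVa19)
tau = np.diag(L.dot(np.linalg.inv(np.eye(N) + L)))   # 1-ridge leverage scores
k = tau.sum()                                        # E[|X|]
_, logdet_I_L = np.linalg.slogdet(np.eye(N) + L)     # log det(I + L)

while True:
    t = rng.poisson(k ** 2 * np.exp(1.0 / k))
    if t == 0:
        continue
    sigma = rng.choice(N, size=t, p=tau / tau.sum())
    # L_tilde_ij = L_{sigma_i, sigma_j} / (k sqrt(tau_sigma_i tau_sigma_j))
    L_tilde = L[np.ix_(sigma, sigma)] / (k * np.sqrt(np.outer(tau[sigma], tau[sigma])))
    _, logdet_I_Lt = np.linalg.slogdet(np.eye(t) + L_tilde)
    # Acc ~ Bernoulli(e^k det(I + L_tilde) / (e^{t/k} det(I + L)))
    if np.log(rng.rand()) < k + logdet_I_Lt - t / k - logdet_I_L:
        break

# Downsample the t x t subproblem with any exact sampler for DPP(L_tilde);
# here, via its Gram factor Psi, so that L_tilde = Psi.T Psi
Psi = Phi[:, sigma] / np.sqrt(k * tau[sigma])
sub_DPP = FiniteDPP('likelihood', **{'L_gram_factor': Psi})
sub_DPP.sample_exact(random_state=rng)
X = sorted({int(sigma[i]) for i in sub_DPP.list_of_samples[-1]})
print(X)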

  • Since the size of the intermediate sample is t = 𝒪(𝔼[|𝒳|]²), the primary cost of the sampling is computing $\det(I+\tilde{\mathbf L}_\sigma)$, which takes 𝒪(t³) = 𝒪(𝔼[|𝒳|]⁶) time. This is also the expected cost of sampling from $\operatorname{DPP}(\tilde{\mathbf{L}}_{\sigma})$ if we use, for example, the spectral method.
  • The algorithm requires precomputing the RLS τ1, …, τN and det (I + L). Computing them exactly takes 𝒪(N³); however, surprisingly, if we use sufficiently accurate approximations then the exactness of the sampling can be retained (details in DeCaVa19). Efficient methods for approximating leverage scores (see RuCaCaRo18) bring the precomputing cost down to 𝒪(N poly(𝔼[|𝒳|]) polylog(N)).
  • When 𝔼[|𝒳|] is sufficiently small, the entire sampling procedure only looks at a small fraction of the entries of L. This makes the method useful when we want to avoid constructing the entire likelihood kernel.
  • When the likelihood kernel is given implicitly via a matrix X such that L = X Xᵀ (dual formulation), a version of this method is given by Dere19.
  • A variant of this method also exists for projection DPPs DeWaHs18
  • DeCaVa19 (Likelihood kernel)
  • Dere19 (Dual formulation)
  • DeWaHs18 (Projection DPP)

k-DPPs

Main idea

Recall from <finite_dpps_definition_k_dpps> eq:likelihood_kDPP_L that k​-DPP (L) can be viewed as a DPP (L) constrained to have a fixed cardinality k ≤ rank (L).

To generate a sample of k​-DPP (L), one natural solution would be to use a rejection mechanism: draw S ∼ DPP (L) and keep it only if |S| = k. However, the acceptance probability may be very low, depending on how k sits relative to the distribution of the number of points eq:number_of_points of S ∼ DPP (L).

An alternative solution was found by KuTa12 Section 5.2.2. The procedure relies on a slight modification of Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> of the finite_dpps_exact_sampling_spectral_method which requires the computation of the elementary symmetric polynomials.

In practice

Sampling k​-DPP (L) from L ≽ 0N can be done by following

Step 0.

  1. compute the eigendecomposition of L = VΓV in 𝒪(N3)
  2. evaluate the elementary symmetric polynomials in the eigenvalues of L: E[l, n] := e_l(γ1, …, γn) for 0 ≤ l ≤ k and 0 ≤ n ≤ N. These computations can be done recursively in 𝒪(Nk) using Algorithm 7 of KuTa12; a sketch of this recursion is given below.
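
A minimal sketch of this recursion, based on the identity e_l(γ_{1:n}) = e_l(γ_{1:n−1}) + γ_n e_{l−1}(γ_{1:n−1}):

import numpy as np

def elementary_symmetric_polynomials(gamma, k):
    # E[l, n] = e_l(gamma_1, ..., gamma_n) for 0 <= l <= k and 0 <= n <= len(gamma)
    N = len(gamma)
    E = np.zeros((k + 1, N + 1))
    E[0, :] = 1.0
    for l in range(1, k + 1):
        for n in range(1, N + 1):
            E[l, n] = E[l, n - 1] + gamma[n - 1] * E[l - 1, n - 1]
    return E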

Step 1. <finite_dpps_exact_sampling_spectral_method_step_1> is replaced by Algorithm 8 of KuTa12 which we illustrate with the following pseudo-code

# Algorithm 8 of Kulesza Taskar (2012).
# This is pseudo-code: in particular, Python indexing is not respected everywhere

B = set()
l = k

for n in range(N, 0, -1):

  if Unif(0,1) < gamma[n] * E[l-1, n-1] / E[l, n]:
    l -= 1
    B.add(n)

    if l == 0:
      break
Step 2. <finite_dpps_exact_sampling_spectral_method_step_2> is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors V:,ℬ, with a cost of order 𝒪(Nk2).

Important

Step 0. must be performed once and for all in 𝒪(N3 + Nk). Then the cost of getting one sample by applying Steps 1. and 2. is 𝒪(Nk2).

import numpy as np
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(1)

r, N = 5, 10
# Random feature vectors
Phi = rng.randn(r, N)
DPP = FiniteDPP('likelihood', **{'L': Phi.T.dot(Phi)})

k = 4
for _ in range(10):
    DPP.sample_exact_k_dpp(size=k, random_state=rng)

print(DPP.list_of_samples)

[[1, 8, 5, 7], [3, 8, 5, 9], [5, 3, 1, 8], [5, 8, 2, 9], [1, 2, 9, 6], [1, 0, 2, 3], [7, 0, 3, 5], [8, 3, 7, 6], [0, 2, 3, 7], [1, 3, 7, 5]]

  • FiniteDPP.sample_exact_k_dpp
  • Step 0. requires KuTa12 Algorithm 7 for the recursive evaluation of the elementary symmetric polynomials $[e_l(\gamma_1, \dots, \gamma_n)]_{l=1, n=1}^{k, N}$ in the eigenvalues of L
  • Step 1. calls KuTa12 Algorithm 8 for selecting the eigenvectors for Step 2.

Caution

Attention

Since the number of points of k​-DPP (L) is fixed, like for projection DPPs <finite_dpps_properties_number_of_points_dpp_K_projection>, it might be tempting to sample k​-DPP (L) using a chain rule in the way it was applied in eq:chain_rule_schur to sample projection DPPs. However, it is incorrect: sampling sequentially


$$s_1 \propto \mathbf{L}_{s,s}, \quad \text{then} \quad s_i \mid s_1, \dots, s_{i-1} \propto \mathbf{L}_{s,s} - \mathbf{L}_{s, S_{i-1}} \mathbf{L}_{S_{i-1}}^{-1} \mathbf{L}_{S_{i-1}, s}, \quad \text{for } 2 \leq i \leq k,$$

where Si − 1 = {s1, …, si − 1}, and forgetting about the order s1, …, sk were selected does not provide a subset {s1, …, sk} ∼ k​-DPP (L), in the general case. Nevertheless, it is valid when L is an orthogonal projection kernel!

Here are the reasons why

  1. First keep in mind that the ultimate goal is to draw a subset S = {s1, …, sk} ∼ k​-DPP (L) with probability eq:likelihood_kDPP_L

$$\mathbb{P}[\mathcal{X}=S] = \frac{1}{e_k(\mathbf{L})} \det \mathbf{L}_S 1_{|S|=k}.$$

  2. Now, if we were to use the chain rule eq:chain_rule_caution this would correspond to sampling sequentially the items s1, …, sk, so that the resulting vector (s1, …, sk) has probability

$$\begin{aligned} \mathbb{Q}[(s_{1}, \dots, s_{k})] &= \dfrac{\mathbf{L}_{s_1,s_1}}{Z_1} \prod_{i=2}^{k} \dfrac{ \mathbf{L}_{s_i, s_i} - \mathbf{L}_{s_i, S_{i-1}} {\mathbf{L}_{S_{i-1}}}^{-1} \mathbf{L}_{S_{i-1}, s_i} }{Z_i(s_{1}, \dots, s_{i-1})}\\\ &= \frac{1}{Z(s_{1}, \dots, s_{k})} \det \mathbf{L}_S. \end{aligned}$$

Contrary to Z1 = trace (L), the normalizations Zi(s1, …, si − 1) of the successive conditionals depend, a priori, on the order in which s1, …, sk were selected. For this reason we denote by Z(s1, …, sk) the global normalization constant.

Warning

Equation eq:chain_rule_caution_vector suggests that the sequential nature of the chain rule matters, a priori: the distribution of (s1, …, sk) is a priori not exchangeable, i.e., not invariant to permutations of its coordinates. This can only come from the normalization Z(s1, …, sk), since det L_S is invariant by permutation.

Note

To see this, let's compute the normalization constant Zi(s1, …, si − 1) in eq:chain_rule_caution_vector for a generic L ≽ 0_N factored as $\mathbf{L} = V V^{\dagger}$, with no specific assumption on V.

$$\begin{aligned} Z_i(s_1, \dots, s_{i-1}) &= \sum_{j=1}^N \mathbf{L}_{jj} - \mathbf{L}_{j,S_{i-1}} \mathbf{L}_{S_{i-1}}^{-1} \mathbf{L}_{S_{i-1},j}\\ &= \operatorname{trace}\left( \mathbf{L} - \mathbf{L}_{:, S_{i-1}} \left[\mathbf{L}_{S_{i-1}}\right]^{-1} \mathbf{L}_{S_{i-1}, :} \right)\\ &= \operatorname{trace}\left( \mathbf{L} - V {V^{\dagger}}_{:,S_{i-1}} \left[V_{S_{i-1},:} {V^{\dagger}}_{:,S_{i-1}}\right]^{-1} V_{S_{i-1},:} V^{\dagger} \right)\\ &= \operatorname{trace}\left( V^{\dagger}V \right) - \operatorname{trace}\big( \underbrace{{V_{S_{i-1}, :}}^{\dagger} \left[V_{S_{i-1}, :} {V_{S_{i-1}, :}}^{\dagger}\right]^{-1} V_{S_{i-1}, :}}_{\Pi_{V_{S_{i-1}, :}}} V^{\dagger}V \big)\\ &= \operatorname{trace}(\mathbf{L}) - \operatorname{trace}(\Pi_{V_{S_{i-1}, :}}V^{\dagger}V), \end{aligned}$$

where $\Pi_{V_{S_{i-1},:}}$ denotes the orthogonal projection onto $\operatorname{Span}\{V_{s_1,:}, \dots, V_{s_{i-1},:}\}$, the subspace spanned by the feature vectors associated with s1, …, si − 1.

Then, summing eq:chain_rule_caution_vector over the k! permutations of 1, …, k, yields the probability of drawing the subset S = {s1,…,sk}, namely

$$\mathbb{Q}[\{ s_{1}, \dots, s_{k} \}] = \sum_{\sigma \in \mathfrak{S}_k} \mathbb{Q}[(s_{\sigma(1)}, \dots, s_{\sigma(k)})] = \det\mathbf{L}_S \underbrace{ \sum_{\sigma \in \mathfrak{S}_k} \frac{1}{Z(s_{\sigma(1)}, \dots, s_{\sigma(k)})} }_{ 1/Z_S }.$$

  3. For the chain rule eq:chain_rule_caution_vector to be a valid procedure for sampling k​-DPP (L), we must be able to identify eq:caution_likelihood_kDPP_L and eq:chain_rule_caution_set, i.e., ℚ[S] = ℙ[S] for all |S| = k, or equivalently ZS = ek(L) for all |S| = k.

Important

A sufficient condition (very likely to be necessary as well) is that the joint distribution of (s1, …, sk) generated by the chain rule mechanism eq:chain_rule_caution_vector is exchangeable, i.e., invariant to permutations of the coordinates. In that case, the normalization in eq:chain_rule_caution_vector would be a constant Z(s1, …, sk) = Z, so that ZS would in fact play the role of the normalization constant of eq:chain_rule_caution_set: it would be constant as well, equal to $Z_S = \frac{Z}{k!}$. Finally, ZS = ek(L) by identification of eq:caution_likelihood_kDPP_L and eq:chain_rule_caution_set.

This is what we can prove in the particular case where L is an orthogonal projection matrix.

To do this, denote r = rank (L) and recall that, in this case, L satisfies L² = L and L† = L, so that it can be factored as L = Π_L = L L† = L† L.

Finally, we can plug V = L in eq:chain_rule_caution_normalization_constant_conditional to obtain

$$\begin{aligned} Z_i(s_1, \dots, s_{i-1}) &= \operatorname{trace}(\mathbf{L}) - \operatorname{trace}(\Pi_{\mathbf{L}_{S_{i-1}, :}}\mathbf{L}^{\dagger}\mathbf{L})\\\ &= \operatorname{trace}(\Pi_{\mathbf{L}}) - \operatorname{trace}(\Pi_{\mathbf{L}_{S_{i-1}, :}}\Pi_{\mathbf{L}})\\\ &= \operatorname{trace}(\Pi_{\mathbf{L}}) - \operatorname{trace}(\Pi_{\mathbf{L}_{S_{i-1}, :}})\\\ &= \operatorname{rank}(\Pi_{\mathbf{L}}) - \operatorname{rank}(\Pi_{\mathbf{L}_{S_{i-1}, :}})\\\ &= r - (i - 1) := Z_i. \end{aligned}$$

Thus, the normalization Z(s1, …, sk) in eq:chain_rule_caution_normalization_constant_conditional is constant as well equal to

$$Z(s_1, \dots, s_k) = \prod_{i=1}^{k} Z_i = \prod_{i=1}^{k} \left(r - (i - 1)\right) = \frac{r!}{(r-k)!} = k! {r \choose k} = k! \, e_k(\mathbf{L}) := Z,$$

where the last equality is a simple computation of the elementary symmetric polynomial

$$\begin{aligned} e_k(\mathbf{L}) = e_k(\gamma_{1:r}=1, \gamma_{r+1:N}=0) = \sum_{\substack{S \subset [N]\\|S|=k}} \prod_{s\in S} \gamma_{s} = {r \choose k} \end{aligned}$$

Important

This shows that, when L is an orthogonal projection matrix, the order in which the items s1, …, sk were selected by the chain rule eq:chain_rule_caution_vector can be forgotten, so that {s1, …, sk} can be considered as a valid sample of k​-DPP (L).

# For our toy example, this sub-optimized implementation is enough
# to illustrate that the chain rule applied to sample k-DPP(L)
# draws s_1, ..., s_k sequentially, with joint probability
# P[(s_1, ..., s_k)] = det L_S / Z(s_1, ..., s_k)
#
# 1. is exchangeable when L is an orthogonal projection matrix
#    P[(s1, s2)] = P[(s_2, s_1)]
# 2. is a priori NOT exchangeable for a generic L >= 0
#    P[(s1, s2)] /= P[(s_2, s_1)]

import numpy as np
import scipy.linalg as LA
from itertools import combinations, permutations

k, N = 2, 4
potential_samples = list(combinations(range(N), k))

rank_L = 3

rng = np.random.RandomState(1)

eig_vecs, _ = LA.qr(rng.randn(N, rank_L), mode='economic')

for projection in [True, False]:

    eig_vals = 1.0 + (0.0 if projection else 2 * rng.rand(rank_L))
    L = (eig_vecs * eig_vals).dot(eig_vecs.T)

    proba = np.zeros((N, N))
    Z_1 = np.trace(L)

    for S in potential_samples:

        for s in permutations(S):

            proba[s] = LA.det(L[np.ix_(s, s)])

            Z_2_s0 = np.trace(L - L[:, s[:1]].dot(LA.inv(L[np.ix_(s[:1], s[:1])])).dot(L[s[:1], :]))

            proba[s] /= Z_1 * Z_2_s0

    print('L is {}projection'.format('' if projection else 'NOT '))

    print('P[s0, s1]', proba, sep='\n')
    print('P[s0]', proba.sum(axis=1), sep='\n')
    print('P[s1]', proba.sum(axis=0), sep='\n')

    print(proba.sum(), '\n' if projection else '')
L is projection
P[s0, s1]
[[0.         0.09085976 0.01298634 0.10338529]
 [0.09085976 0.         0.06328138 0.15368033]
 [0.01298634 0.06328138 0.         0.07580691]
 [0.10338529 0.15368033 0.07580691 0.        ]]
P[s0]
[0.20723139 0.30782147 0.15207463 0.33287252]
P[s1]
[0.20723139 0.30782147 0.15207463 0.33287252]
1.0000000000000002

L is NOT projection
P[s0, s1]
[[0.         0.09986722 0.01463696 0.08942385]
 [0.11660371 0.         0.08062998 0.20535251]
 [0.01222959 0.05769901 0.         0.04170435]
 [0.07995922 0.15726273 0.04463087 0.        ]]
P[s0]
[0.20392803 0.4025862  0.11163295 0.28185282]
P[s1]
[0.20879253 0.31482896 0.13989781 0.33648071]
1.0