dppy.finite_dpps
Consider a finite DPP defined by its correlation kernel K eq:inclusion_proba_DPP_K
or likelihood kernel L eq:likelihood_DPP_L
. There exist three main types of exact sampling procedures:
- The spectral method (used by default) requires the eigendecomposition of the correlation kernel K or the likelihood kernel L. It is based on the fact that
generic DPPs are mixtures of projection DPPs <finite_dpps_mixture>
together with the application of the chain rule to sample projection DPPs. It is presented in Sectionfinite_dpps_exact_sampling_spectral_method
. - A Cholesky-based procedure which requires the correlation kernel K (even non-Hermitian!). It boils down to applying the chain rule on sets; where each item in turn is decided to be excluded or included in the sample. It is presented in Section
finite_dpps_exact_sampling_cholesky_method
. - Recently,
DeCaVa19
have also proposed an alternative method to get exact samples: first sample an intermediate distribution and correct the bias by thinning the intermediate sample using a carefully designed DPP. This approach does not require a Cholesky/Eigen-decomposition of the DPP, but the runtime instead scale with the expected sample size of the DPP (seefinite_dpps_number_of_points
). It is presented in Sectionfinite_dpps_exact_sampling_intermediate_sampling_method
. A more refined procedure based on this approach was introduced inCaDeVa20
for k-DPP sampling.
In general, for small N (i.e. less than 1000) spectral or cholesky samplers are recommended for numerical stability. For larger N (i.e. up to millions) and moderate k (i.e. in the hundreds) intermediate sampling is recommended for scalability.
The following table summarizes the complexity of all exact samplers currently available, where the expected sample size 𝔼[|X|] is equal to k for k-DPPs and deff for random-sized DPPs.
+-----------------------+--------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+------------------------------------------------------------------------------+ | | mode=
| Time to first sample Notes | + + +--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+ + | | | DPP | k-DPP | DPP | k-DPP | | +=======================+==============+======================================+======================================+======================================+======================================+==============================================================================+ | Spectral sampler O(N3) O(Ndeff2) The three variants differ slightly, | | | | | | and depending on the DPP they can | | | | | | have different numerical stability. | +-----------------------+--------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+------------------------------------------------------------------------------+ | Cholesky sampler | "chol"
| O(N3) O(N3) Also works for non-Hermitian DPPs. | +-----------------------+--------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+------------------------------------------------------------------------------+ | Intermediate sampler | "vfx"
| O(Ndeff6) O(deff6) For "alpha"
we report worst case runtime, but depending on the DPP | + +--------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+ structure best case runtime can be much faster than "vfx"
. For | | | "alpha"
| O(Ndeff5) O(deff6) particularly ill-posed DPPs "vfx"
can be more numerically stable. | +-----------------------+--------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+------------------------------------------------------------------------------+
Note
- There exist specific samplers for special DPPs, like the ones presented in Section
exotic_dpps
.
Important
In the next section, we describe the Algorithm 18 of HKPV06
, based on the chain rule, which was originally designed to sample continuous projection DPPs <continuous_dpps_exact_sampling_projection_dpp_chain_rule>
. Obviously, it has found natural a application in the finite setting for sampling projection DPP (K). However, we insist on the fact that this chain rule mechanism is specific to orthogonal projection kernels. In particular, it cannot be applied blindly to sample general k-DPP (L) but it is valid when L is an orthogonal projection kernel.
This crucial point is developed in the following finite_kdpps_exact_sampling_chain_rule_projection_kernel_caution
section.
In this section, we describe the generic projection DPP sampler, originally derived by HKPV06
Algorithm 18.
Recall that the number of points of a projection <finite_dpps_properties_number_of_points_dpp_K_projection>
r = DPP (K) is, almost surely, equal to rank (K), so that the likelihood eq:likelihood_projection_K
of S = {s1, …, sr} reads
ℙ[𝒳 = S] = det KS.
Using the invariance by permutation of the determinant and the fact that K is an orthogonal projection matrix, it is sufficient to apply the chain rule to sample (s1, …, sr) with joint distribution
and forget about the sequential feature of the chain rule to get a valid sample {s1, …, sr} ∼ DPP (K).
Considering S = {s1, …, sr} such that ℙ[𝒳 = S] = det KS > 0, the following generic formulation of the chain rule
can be expressed as a telescopic ratio of determinants
where Si − 1 = {s1, …, si − 1}.
Using Woodbury's formula the ratios of determinants in eq:chain_rule_K
can be expanded into
Hint
MLers will recognize in eq:chain_rule_schur
the incremental posterior variance of the Gaussian Process (GP) associated to K, see RaWi06
Equation 2.26.
Caution
The connexion between the chain rule eq:chain_rule_schur
and Gaussian Processes is valid in the case where the GP kernel is an orthogonal projection kernel, see also finite_kdpps_exact_sampling_chain_rule_projection_kernel_caution
.
- Algorithm 18
HKPV06
continuous_dpps_exact_sampling_projection_dpp_chain_rule
in the continuous case
Hint
Since K is an orthogonal projection matrix, the following Gram factorizations provide an insightful geometrical interpretation of the chain rule mechanism eq:chain_rule_K
:
1. Using K = K2 and K† = K, we have K = KK†, so that the chain rule eq:chain_rule_K
becomes
- Using the eigendecomposition, we can write K = UU† where U†U = Ir, so that the chain rule
eq:chain_rule_K
becomes
In other words, the chain rule formulated as eq:chain_rule_dist2_K
and eq:chain_rule_dist2_U
is akin to do Gram-Schmidt orthogonalization of the feature vectors Ki, : or Ui, :. These formulations can also be understood as an application of the base × height formula. In the end, projection DPPs favors sets of r = rank (K) of items are associated to feature vectors that span large volumes. This is another way of understanding diversity <finite_dpps_diversity>
.
MCMC samplers
zonotope sampling <finite_dpps_mcmc_sampling_zonotope>
basis exchange <finite_dpps_mcmc_sampling_E>
The cost of getting one sample from a projection DPP is of order 𝒪(Nrank (K)2), whenever DPP (K) is defined through
K itself; sampling relies on formulations
eq:chain_rule_dist2_K
oreq:chain_rule_schur
import numpy as np from scipy.linalg import qr from dppy.finite_dpps import FiniteDPP
seed = 0 rng = np.random.RandomState(seed)
r, N = 4, 10 eig_vals = np.ones(r) # For projection DPP eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
- DPP = FiniteDPP(kernel_type='correlation',
projection=True, *{'K': (eig_vecs eig_vals).dot(eig_vecs.T)})
for mode in ('GS', 'Schur', 'Chol'): # default: GS
rng = np.random.RandomState(seed) DPP.flush_samples()
- for _ in range(10):
DPP.sample_exact(mode=mode, random_state=rng)
print(DPP.sampling_mode) print(DPP.list_of_samples)
GS [[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]] Schur [[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]] Chol [[5, 7, 6, 0], [4, 6, 5, 7], [9, 5, 0, 1], [5, 9, 2, 4], [0, 8, 1, 7], [9, 0, 5, 1], [0, 6, 5, 9], [5, 0, 1, 9], [5, 0, 2, 8], [5, 6, 9, 1]]
- :py
~FiniteDPP.sample_exact
HKPV06
Theorem 7, Algorithm 18 and Proposition 19, for the original ideaPou19
Algorithm 3, for the equivalent Cholesky-based perspective with cost of order 𝒪(Nrank (K)2)
its eigenvectors U, i.e., K = UU† with U†U = Irank (K); sampling relies on
eq:chain_rule_dist2_U
import numpy as np from scipy.linalg import qr from dppy.finite_dpps import FiniteDPP
seed = 0 rng = np.random.RandomState(seed)
r, N = 4, 10 eig_vals = np.ones(r) # For projection DPP eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
- DPP = FiniteDPP(kernel_type='correlation',
projection=True, **{'K_eig_dec': (eig_vals, eig_vecs)})
rng = np.random.RandomState(seed)
- for _ in range(10):
# mode='GS': Gram-Schmidt (default) DPP.sample_exact(mode='GS', random_state=rng)
print(DPP.list_of_samples)
[[5, 7, 2, 1], [4, 6, 2, 9], [9, 2, 6, 4], [5, 9, 0, 1], [0, 8, 6, 7], [9, 6, 2, 7], [0, 6, 2, 9], [5, 2, 1, 8], [5, 4, 0, 8], [5, 6, 9, 1]]
- :py
~FiniteDPP.sample_exact
HKPV06
Theorem 7, Algorithm 18 and Proposition 19, for the original ideaKuTa12
Algorithm 1, for a first interpretation of the spectral counterpart ofHKPV06
Algorithm 18 running in 𝒪(Nrank (K)3)Gil14
Algorithm 2, for the 𝒪(Nrank (K)2) implementationTrBaAm18
Algorithm 3, for a technical report on DPP sampling
The procedure stems from Theorem 7 of HKPV06
, i.e., the fact that generic DPPs are mixtures of projection DPPs <finite_dpps_mixture>
, suggesting the following two steps algorithm. Given the spectral decomposition of the correlation kernel K
Step 1. Draw independent Bernoulli random variables Bn ∼ ℬer (λn) for n = 1, …, N and collect ℬ = {n ; Bn=1},
Step 2. Sample from the projection DPP with correlation kernel U : ℬU : ℬ† = ∑n ∈ ℬunun†, see the section above <finite_dpps_exact_sampling_projection_dpp_chain_rule_in_practice>
.
Note
Step 1. selects a component of the mixture while Step 2. requires sampling from the corresponding projection DPP, cf. finite_dpps_exact_sampling_projection_dpp_chain_rule
- Sampling projection DPP (K) from the eigendecomposition of K = UU† with U†U = Irank (K)) was presented in
the section above <finite_dpps_exact_sampling_projection_dpp_chain_rule_in_practice>
-
Sampling DPP (K) from 0N ≼ K ≼ IN can be done by following
Step 0. compute the eigendecomposition of K = UΛU† in 𝒪(N3).
Step
1. <finite_dpps_exact_sampling_spectral_method_step_1>
draw independent Bernoulli random variables Bn ∼ ℬer (λn) for n = 1, …, N and collect ℬ = {n ; Bn=1}Step
2. <finite_dpps_exact_sampling_spectral_method_step_2>
sample from the projection DPP with correlation kernel defined by its eigenvectors U:,ℬImportant
Step 0. must be performed once and for all in 𝒪(N3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where
$\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{K})=\sum_{n=1}^{N} \lambda_n$ .from numpy.random import RandomState from scipy.linalg import qr from dppy.finite_dpps import FiniteDPP
rng = RandomState(0)
r, N = 4, 10 eig_vals = rng.rand(r) # For projection DPP eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
- DPP = FiniteDPP(kernel_type='correlation',
-
projection=False, *{'K': (eig_vecseig_vals).dot(eig_vecs.T)})
- for _ in range(10):
-
# mode='GS': Gram-Schmidt (default) DPP.sample_exact(mode='GS', random_state=rng)
print(DPP.list_of_samples)
[[7, 0, 1, 4], [6], [0, 9], [0, 9], [8, 5], [9], [6, 5, 9], [9], [3, 0], [5, 1, 6]]
-
Sampling DPP (L) from L ≽ 0N can be done by following
Step 0. compute the eigendecomposition of L = VΓV† in 𝒪(N3).
Step
1. <finite_dpps_exact_sampling_spectral_method_step_1>
is adapted to: draw independent Bernoulli random variables$B_n \sim \operatorname{\mathcal{B}er}(\frac{\gamma_n}{1+\gamma_n})$ for n = 1, …, N and collect ℬ = {n ; Bn=1}Step
2. <finite_dpps_exact_sampling_spectral_method_step_2>
is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors V:,ℬ.Important
Step 0. must be performed once and for all in 𝒪(N3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where
$\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{L(I+L)^{-1}})=\sum_{n=1}^{N} \frac{\gamma_n}{1+\gamma_n}$ from numpy.random import RandomState from scipy.linalg import qr from dppy.finite_dpps import FiniteDPP
rng = RandomState(0)
r, N = 4, 10 phi = rng.randn(r, N)
- DPP = FiniteDPP(kernel_type='likelihood',
-
projection=False, **{'L': phi.T.dot(phi)})
- for _ in range(10):
-
# mode='GS': Gram-Schmidt (default) DPP.sample_exact(mode='GS', random_state=rng)
print(DPP.list_of_samples)
[[3, 1, 0, 4], [9, 6], [4, 1, 3, 0], [7, 0, 6, 4], [5, 0, 7], [4, 0, 2], [5, 3, 8, 4], [0, 5, 2], [7, 0, 2], [6, 0, 3]]
-
Sampling a DPP (L) for which each item is represented by a d ≤ N dimensional feature vector, all stored in a feature matrix Φ ∈ ℝd × N, so that L = Φ⊤Φ ≽ 0N, can be done by following
Step 0. compute the so-called dual kernel L̃ = ΦΦ† ∈ ℝd×, eigendecompose it
$\tilde{\mathbf{L}} = W \Delta W^{\top}$ and recover the eigenvectors of L as$V=\Phi^{\top}W \Delta^{-\frac{1}{2}}$ This corresponds to a cost of order 𝒪(Nd2 + d3 + d2 + Nd2).Step
1. <finite_dpps_exact_sampling_spectral_method_step_1>
is adapted to: draw independent Bernoulli random variables$B_i \sim \operatorname{\mathcal{B}er}(\frac{\delta_i}{1+\delta_i})$ for i = 1, …, d and collect ℬ = {i ; Bi=1}Step
2. <finite_dpps_exact_sampling_spectral_method_step_2>
is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors V:,ℬ = [Φ⊤WΔ − 1/2]:,ℬ.Important
Step 0. must be performed once and for all in 𝒪(Nd2 + d3). Then the average cost of getting one sample by applying Steps 1. and 2. is 𝒪(N𝔼[|𝒳|]2), where
$\mathbb{E}\left[|\mathcal{X}|\right]=\operatorname{trace}(\mathbf{\tilde{L}(I+\tilde{L})^{-1}})=\sum_{i=1}^{d} \frac{\delta_i}{1+\delta_i}\leq d$ For a different perspective see
-
Gil14
Section 2.4.4 and Algorithm 3 -
KuTa12
Section 3.3.3 and Algorithm 3
from numpy.random import RandomState from scipy.linalg import qr from dppy.finite_dpps import FiniteDPP
rng = RandomState(0)
r, N = 4, 10 phi = rng.randn(r, N) # L = phi.T phi, L_dual = phi phi.T
- DPP = FiniteDPP(kernel_type='likelihood',
-
projection=False, **{'L_gram_factor': phi})
- for _ in range(10):
-
# mode='GS': Gram-Schmidt (default) DPP.sample_exact(mode='GS', random_state=rng)
print(DPP.list_of_samples)
L_dual = Phi Phi.T was computed: Phi (dxN) with d<N [[9, 0, 2, 3], [0, 1, 5, 2], [7, 0, 9, 4], [2, 0, 3], [6, 4, 0, 3], [5, 0, 6, 3], [0, 6, 3, 9], [4, 0, 9], [7, 3, 9, 4], [9, 4, 3]]
-
This method requires access to the correlation kernel K to perform a bottom-up chain rule on sets: starting from the empty set, each item in turn is decided to be added or excluded from the sample. This can be summarized as the exploration of the binary probability tree displayed in fig:cholesky_chain_rule_sets
.
Example: for N = 5, if {1,4} was sampled, the path in the probability tree would correspond to
where each conditional probability can be written in closed formed using eq:conditioned_on_S_in_X
and eq:conditioned_on_S_notin_X
, namely
Important
This quantities can be computed efficiently as they appear in the computation of the Cholesky-type LDL† or LU factorization of the correlation K kernel, in the Hermitian or non-Hermitian case, respectively. See Pou19
for the details.
Note
The sparsity of K can be leveraged to derive faster samplers using the correspondence between the chain rule on sets and Cholesky-type factorizations, see e.g., Pou19
Section 4.
Important
- The method is fully generic since it applies to any (valid), even non-Hermitian, correlation kernel K.
- Each sample costs 𝒪(N3).
- Nevertheless, the link between the chain rule on sets and Cholesky-type factorization is nicely supported by the surprisingly simple implementation of the corresponding generic sampler.
# Poulson (2019, Algorithm 1) pseudo-code
sample = []
A = K.copy()
for j in range(N):
if np.random.rand() < A[j, j]: # Bernoulli(A_jj)
sample.append(j)
else:
A[j, j] -= 1
A[j+1:, j] /= A[j, j]
A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
return sample, A
from numpy.random import RandomState from scipy.linalg import qr from dppy.finite_dpps import FiniteDPP
rng = RandomState(1)
r, N = 4, 10 eig_vals = rng.rand(r) eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
- DPP = FiniteDPP(kernel_type='correlation',
projection=False, *{'K': (eig_vecseig_vals).dot(eig_vecs.T)})
- for _ in range(10):
DPP.sample_exact(mode='Chol', random_state=rng)
print(DPP.list_of_samples)
[[2, 9], [0], [2], [6], [4, 9], [2, 7, 9], [0], [1, 9], [0, 1, 2], [2]]
Pou19
LaGaDe18
This method is based on the concept of a distortion-free intermediate sample, where we draw a larger sample of points in such a way that we can then downsample to the correct DPP distribution. We assume access to the likelihood kernel L (although a variant of this method also exists for projection DPPs). Crucially the sampling relies on an important connection between DPPs and so-called ridge leverage scores (RLS, see AlMa15
), which are commonly used for sampling in randomized linear algebra. Namely, the marginal probability of the i-th point in 𝒳 ∼ DPP (L) is also the i-th ridge leverage score of L (with ridge parameter equal 1):
ℙ[i ∈ 𝒳] = [L(I + L) − 1]ii = τi, ith 1-ridge leverage score.
Suppose that we draw a sample σ of t points i.i.d. proportional to ridge leverage scores, i.e., σ = (σ1, σ2, ..., σt) such that ℙ[σj = i] ∝ τi. Intuitively, this sample is similar fo 𝒳 ∼ DPP (L) because the marginals are the same, but it "ignores" all the dependencies between the points. However, if we sample sufficiently many points i.i.d. according to RLS, then a proper sample 𝒳 will likely be contained within σ. This can be formally shown for t = O(𝔼[|𝒳|]2). When 𝔼[|𝒳|]2 ≪ N, then this allows us to reduce the size of the DPP kernel L from N × N to a much smaller size L̃ t × t. Making this sampling exact requires considerably more care, because even with a large t there is always a small probability that the i.i.d. sample σ is not sufficiently diverse. We guard against this possibility by rejection sampling.
Important
Use this method for sampling 𝒳 ∼ DPP (L) when
- Preprocessing costs 𝒪(N ⋅ poly(𝔼[|𝒳|]) polylog(N)).
- Each sample costs 𝒪(𝔼[|𝒳|]6).
There are two implementations of intermediate sampling available in dppy
: the mode='vfx'
sampler and the mode='alpha'
sampler.
from numpy.random import RandomState from dppy.finite_dpps import FiniteDPP from dppy.utils import example_eval_L_linear
rng = RandomState(1)
r, N = 4, 10
- DPP = FiniteDPP('likelihood',
**{'L_eval_X_data': (example_eval_L_linear, rng.randn(N, r))})
- for _ in range(10):
DPP.sample_exact(mode='vfx', random_state=rng, verbose=False)
print(DPP.list_of_samples)
[[5, 1, 0, 3], [9, 0, 8, 3], [6, 4, 1], [5, 1, 2], [2, 1, 3], [3, 8, 4, 0], [0, 8, 1], [7, 8], [1, 8, 2, 0], [5, 8, 3]]
The verbose=False
flag is used to suppress the default progress bars when running in batch mode (e.g. when generating these docs).
Given, the RLS τ1, …, τN, the normalization constant
It can be shown that 𝒳 is distributed exactly according to DPP (L) and the expected number of rejections is a small constant. The intermediate likelihood kernel
- Since the size of the intermediate sample is t = 𝒪(𝔼[𝒳]2), the primary cost of the sampling is computing
$\det(I+\tilde{\mathbf L}_\sigma)$ which takes 𝒪(t3) = 𝒪(𝔼[𝒳]6) time. This is also the expected cost of sampling from$\operatorname{DPP}(\tilde{\mathbf{L}}_{\sigma})$ if we use, for example, the spectral method.- The algorithm requires precomputing the RLS τ1, …, τn and det (I + L). Computing them exactly takes 𝒪(N3), however, surprisingly, if we use sufficiently accurate approximations then the exactness of the sampling can be retained (details in
DeCaVa19
). Efficient methods for approximating leverage scores (seeRuCaCaRo18
) bring the precomputing cost down to 𝒪(Npoly(𝔼[|𝒳|])polylog(N)).- When 𝔼[|𝒳|] is sufficiently small, the entire sampling procedure only looks at a small fraction of the entries of L. This makes the method useful when we want to avoid constructing the entire likelihood kernel.
- When the likelihood kernel is given implicitly via a matrix X such that L = XX⊤ (dual formulation) then a version of this method is given by
Dere19
- A variant of this method also exists for projection DPPs
DeWaHs18
DeCaVa19
(Likelihood kernel)Dere19
(Dual formulation)DeWaHs18
(Projection DPP)
Recall from <finite_dpps_definition_k_dpps>
eq:likelihood_kDPP_L
that k-DPP (L) can be viewed as a DPP (L) constrained to a have fixed cardinality k ≤ rank (L).
To generate a sample of k-DPP (L), one natural solution would be to use a rejection mechanism: draw S ∼ DPP (L) and keep it only if |X| = k. However, the rejection constant may be pretty bad depending on the choice of k regarding the distribution of the number of points eq:number_of_points
of S ∼ DPP (L).
An alternative solution was found by KuTa12
Section 5.2.2. The procedure relies on a slight modification of Step 1. <finite_dpps_exact_sampling_spectral_method_step_1>
of the finite_dpps_exact_sampling_spectral_method
which requires the computation of the elementary symmetric polynomials.
Sampling k-DPP (L) from L ≽ 0N can be done by following
Step 0.
- compute the eigendecomposition of L = VΓV† in 𝒪(N3)
- evaluate the elementary symmetric polynomials in the eigenvalues of L: E[l, n] := el(γ1, …, γn) for 0 ≤ l ≤ k and 0 ≤ n ≤ N. These computations can done recursively in 𝒪(Nk) using Algorithm 7 of
KuTa12
.Step
1. <finite_dpps_exact_sampling_spectral_method_step_1>
is replaced by Algorithm 8 ofKuTa12
which we illustrate with the following pseudo-code# Algorithm 8 of Kulesza Taskar (2012). # This is a pseudo-code of in particular Python indexing is not respected everywhere B = set({}) l = k for n in range(N, 0, -1): if Unif(0,1) < gamma[n] * E[l-1, n-1] / E[l, n]: l -= 1 B.union({n}) if l == 0: breakStep
2. <finite_dpps_exact_sampling_spectral_method_step_2>
is adapted to: sample from the projection DPP with correlation kernel defined by its eigenvectors V:,ℬ, with a cost of order 𝒪(Nk2).
Important
Step 0. must be performed once and for all in 𝒪(N3 + Nk). Then the cost of getting one sample by applying Steps 1. and 2. is 𝒪(Nk2).
import numpy as np from dppy.finite_dpps import FiniteDPP
rng = np.random.RandomState(1)
r, N = 5, 10 # Random feature vectors Phi = rng.randn(r, N) DPP = FiniteDPP('likelihood', **{'L': Phi.T.dot(Phi)})
k = 4 for _ in range(10): DPP.sample_exact_k_dpp(size=k, random_state=rng)
print(DPP.list_of_samples)
[[1, 8, 5, 7], [3, 8, 5, 9], [5, 3, 1, 8], [5, 8, 2, 9], [1, 2, 9, 6], [1, 0, 2, 3], [7, 0, 3, 5], [8, 3, 7, 6], [0, 2, 3, 7], [1, 3, 7, 5]]
- :py
~FiniteDPP.sample_exact_k_dpp
- Step 0. requires
KuTa12
Algorithm 7 for the recursive evaluation of the elementary symmetric polynomials [el(γ1, …, γn)]l = 1, n = 1k, N in the eigenvalues of L - Step 1. calls
KuTa12
Algorithm 8 for selecting the eigenvectors for Step 2.
Attention
Since the number of points of k-DPP (L) is fixed, like for projection DPPs <finite_dpps_properties_number_of_points_dpp_K_projection>
, it might be tempting to sample k-DPP (L) using a chain rule in the way it was applied in eq:chain_rule_schur
to sample projection DPPs. However, it is incorrect: sampling sequentially
s1 ∝ Ls, s, then si ∣ s1, …, si − 1 ∝ Ls, s − Ls, Si − 1LSi − 1 − 1LSi − 1, s, for 2 ≤ i ≤ k,
where Si − 1 = {s1, …, si − 1}, and forgetting about the order s1, …, sk were selected does not provide a subset {s1, …, sk} ∼ k-DPP (L), in the general case. Nevertheless, it is valid when L is an orthogonal projection kernel!
Here are the reasons why
- First keep in mind that, the ultimate goal is to draw a subset S = {s1, …, sk} ∼ k-DPP (L) with probability
eq:likelihood_kDPP_L
- Now, if we were to use the chain rule
eq:chain_rule_caution
this would correspond to sampling sequentially the items s1, …, sk, so that the resulting vector (s1, …, sk) has probability
Contrary to Z1 = trace (L), the normalizations Zi(s1, …, si − 1) of the successive conditionals depend, a priori, on the order s1, …, sk were selected. For this reason we denote the global normalization constant Z(s1, …, sk).
Warning
Equation eq:chain_rule_caution_vector
suggests that, the sequential feature of the chain rule matters, a priori; the distribution of (s1,…,sk) is not exchangeable a priori, i.e., it is not invariant to permutations of its coordinates. This fact, would only come from the normalization Z(s1, …, sk), since LS is invariant by permutation.
Note
To see this, let's compute the normalization constant Zi(s1, …, si − 1) in eq:chain_rule_caution_vector
for a generic L ≽ 0N factored as L = VV†, with no specific assumption on V.
where ΠVSi − 1, : denotes the orthogonal projection onto Span {Vs1, :, …, Vsi − 1, :}, the supspace spanned the feature vectors associated to s1, …, si − 1.
Then, summing eq:chain_rule_caution_vector
over the k! permutations of 1, …, k, yields the probability of drawing the subset S = {s1,…,sk}, namely
- For the chain rule
eq:chain_rule_caution_vector
to be a valid procedure for sampling k-DPP (L), we must be able to identifyeq:caution_likelihood_kDPP_L
andeq:chain_rule_caution_set
, i.e., ℚ[S] = ℙ[S] for all |S| = k, or equivalently ZS = ek(L) for all |S| = k.
Important
A sufficient condition (very likely to be necessary) is that the joint distribution of (s1, …, sk), generated by the chain rule mechanism eq:chain_rule_caution_vector
is exchangeable (invariant to permutations of the coordinates). In that case, the normalization in eq:chain_rule_caution_vector
would then be constant Z(s1, …, sk) = Z . So that ZS would in fact play the role of the normalization constant of eq:chain_rule_caution_set
, since it would be constant as well and equal to eq:caution_likelihood_kDPP_L
and eq:chain_rule_caution_set
.
This is what we can prove in the particular case where L is an orthogonal projection matrix.
To do this, denote r = rank (L) and recall that in this case L satisfies L2 = L and L† = L, so that it can be factored as L = ΠL = L†L = LL†
Finally, we can plug V = L in eq:chain_rule_caution_normalization_constant_conditional
to obtain
Thus, the normalization Z(s1, …, sk) in eq:chain_rule_caution_normalization_constant_conditional
is constant as well equal to
where the last equality is a simple computation of the elementary symmetric polynomial
Important
This shows that, when L is an orthogonal projection matrix, the order the items s1, …, sr were selected by the chain rule eq:chain_rule_caution_vector
can be forgotten, so that {s1, …, sr} can be considered as valid sample of k-DPP (L).
# For our toy example, this sub-optimized implementation is enough
# to illustrate that the chain rule applied to sample k-DPP(L)
# draws s_1, ..., s_k sequentially, with joint probability
# P[(s_1, ..., s_k)] = det L_S / Z(s_1, ..., s_k)
#
# 1. is exchangeable when L is an orthogonal projection matrix
# P[(s1, s2)] = P[(s_2, s_1)]
# 2. is a priori NOT exchangeable for a generic L >= 0
# P[(s1, s2)] /= P[(s_2, s_1)]
import numpy as np
import scipy.linalg as LA
from itertools import combinations, permutations
k, N = 2, 4
potential_samples = list(combinations(range(N), k))
rank_L = 3
rng = np.random.RandomState(1)
eig_vecs, _ = LA.qr(rng.randn(N, rank_L), mode='economic')
for projection in [True, False]:
eig_vals = 1.0 + (0.0 if projection else 2 * rng.rand(rank_L))
L = (eig_vecs * eig_vals).dot(eig_vecs.T)
proba = np.zeros((N, N))
Z_1 = np.trace(L)
for S in potential_samples:
for s in permutations(S):
proba[s] = LA.det(L[np.ix_(s, s)])
Z_2_s0 = np.trace(L - L[:, s[:1]].dot(LA.inv(L[np.ix_(s[:1], s[:1])])).dot(L[s[:1], :]))
proba[s] /= Z_1 * Z_2_s0
print('L is {}projection'.format('' if projection else 'NOT '))
print('P[s0, s1]', proba, sep='\n')
print('P[s0]', proba.sum(axis=0), sep='\n')
print('P[s1]', proba.sum(axis=1), sep='\n')
print(proba.sum(), '\n' if projection else '')
L is projection
P[s0, s1]
[[0. 0.09085976 0.01298634 0.10338529]
[0.09085976 0. 0.06328138 0.15368033]
[0.01298634 0.06328138 0. 0.07580691]
[0.10338529 0.15368033 0.07580691 0. ]]
P[s0]
[0.20723139 0.30782147 0.15207463 0.33287252]
P[s1]
[0.20723139 0.30782147 0.15207463 0.33287252]
1.0000000000000002
L is NOT projection
P[s0, s1]
[[0. 0.09986722 0.01463696 0.08942385]
[0.11660371 0. 0.08062998 0.20535251]
[0.01222959 0.05769901 0. 0.04170435]
[0.07995922 0.15726273 0.04463087 0. ]]
P[s0]
[0.20879253 0.31482896 0.13989781 0.33648071]
P[s1]
[0.20392803 0.4025862 0.11163295 0.28185282]
1.0