We have pre-generated a dataset of simulated NMF matrices that can be loaded as a list using the simulatedNMF package, which is available through a drat repository on GitHub. Note that the size of the 'simulatedNMF' package is approximately 8 MB. The package may be installed as follows:
install.packages('simulatedNMF', repos='https://kharchenkolab.github.io/drat/', type='source')
(Please see the drat documentation for more comprehensive explanations and vignettes regarding drat
repositories.)
The following command loads this dataset for the tutorial below:
simnmf <- simulatedNMF::simnmf
Generally, the structured matrix factorization problem is formulated as a decomposition of an original matrix X into a product of an unknown non-negative matrix C and a matrix D of lower rank:

X = C*D
In practice, if X has significantly more rows than columns, or if observations are noisy and represent statistical sampling, then reformulation of the problem in the "covariance domain" simplifies inference:

Y = X*t(X) = C*E*t(C), where E = D*t(D)
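This covariance-domain identity can be checked numerically; below is a minimal sketch with random non-negative matrices (the variable names are illustrative, not from the package):

```r
set.seed(1)
C <- matrix(runif(192 * 10), nrow = 192)   # non-negative mixing matrix
D <- matrix(runif(10 * 3000), nrow = 10)   # non-negative signal matrix
X <- C %*% D                               # observed matrix of rank 10

Y <- X %*% t(X)                            # co-occurrence ("covariance domain") matrix
E <- D %*% t(D)                            # lower-dimensional 10 x 10 core matrix
all.equal(Y, C %*% E %*% t(C))             # TRUE up to numerical tolerance
```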
Under a relatively mild assumption on the spread of the column (row) vectors of the matrix C that belong to the column (row) unit simplex, the matrix E has minimum volume across all possible factorizations. vrnmf, as compared to AnchorFree, considers a noisy version of the problem: it seeks a factorization pair of matrices (C, E) that balances the goodness of fit of Y and the volume of the matrix E. The method uses alternating optimization of the following objective function:

||Y - C*E*t(C)||^2 + wvol * vol(E)
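To make the trade-off concrete, the objective can be sketched as a simple R function. Here the log-determinant of E is used as one common proxy for its volume; this is an assumption for illustration, not necessarily the volume definition used internally by vrnmf:

```r
# Illustrative objective: Frobenius reconstruction error plus a weighted
# volume term. vol(E) is approximated by log(det(E)) here, which is an
# assumption, not the package's internal definition.
volnmf_objective_sketch <- function(Y, C, E, wvol) {
  fit    <- norm(Y - C %*% E %*% t(C), type = "F")^2  # goodness of fit
  volume <- log(det(E))                               # volume proxy for E
  fit + wvol * volume
}
```

Small wvol emphasizes the fit term; large wvol shrinks the volume of E, which is the balance the alternating optimization navigates.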
An application of the method is exemplified below on a simulated dataset. For that, we will use a dataset containing a matrix X, its noisy version X.noise, and decomposition matrices C and D such that X = C*D (to simulate a dataset with parameters of interest, one can use the function sim_factors). First, we load the library and the dataset:
library(vrnmf)
simnmf <- simulatedNMF::simnmf
names(simnmf)
#> [1] "X.noise" "X" "C" "D"
The summary below shows that the original matrix X of size 192 x 3000 is a product of two matrices of lower rank 10:
str(simnmf$X)
#> num [1:192, 1:3000] 0.000493 0.000695 0.000347 0.000248 0.000699 ...
str(simnmf$C)
#> num [1:192, 1:10] 0.00387 0.01099 0.00302 0 0.008 ...
str(simnmf$D)
#> num [1:10, 1:3000] 0.03793 0.001 0.00138 0.00229 0.001 ...
As a first step, the input matrix X
is preprocessed using the function vol_preprocess()
to estimate and rescale the co-occurrence matrix Y = X*t(X)
and its singular value decomposition:
vol <- vol_preprocess(t(simnmf$X))
Next, the vrnmf method decomposes the co-occurrence matrix into matrices C and E of rank n.comp = 10, using a volume weight wvol = 2e-2 (with the additional options extrapolate and accelerate, which enable intra-block Nesterov extrapolation and inter-block Nesterov-style acceleration):
volres <- volnmf_main(vol, n.comp = 10, wvol = 2e-2,
n.iter = 3e+3, vol.iter = 1e+1, c.iter = 1e+1,
extrapolate = TRUE, accelerate = TRUE, verbose = FALSE)
#> run standard nmf..
#> done
#>
#> run volume-regularized nmf..
#> done
#>
The output volres
contains the predicted matrix C
and other auxiliary information:
names(volres)
#> [1] "C" "R" "Q" "C.init" "R.init" "Q.init" "C.rand" "R.rand"
#> [9] "Q.rand" "rec"
str(volres$C)
#> num [1:192, 1:10] 0 0.00752 0.00329 0.00629 0.00776 ...
Comparison of the original matrix simnmf$C and the inferred matrix volres$C, rescaled back by vol$col.factors, shows that they have highly similar pairs of column vectors:
C <- volres$C*vol$col.factors
apply(cor(simnmf$C, C), 1, max)
#> [1] 0.9163232 0.9032780 0.9729961 0.9935809 0.9873883 0.9933038 0.9944044
#> [8] 0.2875271 0.9791225 0.9859103
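Beyond the maximal correlations above, one can also recover which inferred column best matches each original column (an illustrative follow-up, not part of the vignette output):

```r
# For each column of the original matrix, find the index of the
# best-matching inferred column and the corresponding correlation.
cc       <- cor(simnmf$C, C)
matches  <- apply(cc, 1, which.max)  # best-matching inferred column
strength <- apply(cc, 1, max)        # correlation of that match
data.frame(original = seq_len(ncol(simnmf$C)),
           inferred = matches,
           correlation = round(strength, 3))
```

Note that one component above is recovered poorly (correlation ~0.29), which such a matching table makes easy to spot.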
Having inferred C, the second matrix D can be obtained either through linear regression with constraints or by using the function infer_intensities():
D <- infer_intensities(C, simnmf$X)
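The quality of the recovered D can be assessed the same way as for C, by correlating its rows against the rows of the original matrix (an illustrative check):

```r
# Each row of the inferred D should correlate strongly with some row
# of the ground-truth matrix simnmf$D.
apply(cor(t(simnmf$D), t(D)), 1, max)
```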
The method has a number of variations in the type of structural constraints on C or R:

- The C.constraint parameter constrains the row vectors (row) or column vectors (col) of C to the unit simplex (the sum of the vector elements equals 1). The former option is able to fit a general matrix factorization with a non-negative matrix C, while the latter imposes a specific structural constraint usually met in hyperspectral unmixing.
- The R.constraint parameter constrains the matrix R to be non-negative (pos; otherwise no).
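For example, the constraints could be set explicitly as follows (an illustrative call; passing these parameters directly to volnmf_main is an assumption, and the values shown may simply match the defaults):

```r
# Illustrative parameterization: column vectors of C on the unit simplex
# (the hyperspectral-unmixing setting) and a non-negative R.
volres <- volnmf_main(vol, n.comp = 10, wvol = 2e-2,
                      C.constraint = "col",  # "row" for general NMF
                      R.constraint = "pos",  # "no" to drop non-negativity of R
                      verbose = FALSE)
```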
The objective function is non-convex, and thus alternating optimization is not guaranteed to find a globally optimal solution. Additional parameters that affect convergence are:

- The wvol parameter, which regulates the weight of the volume term. Small values make the problem similar to standard NMF, while large values start collapsing the eigenvalues of the matrix E to zero. A carefully chosen wvol is critical for good performance.
- Initialization matrices. The initial matrices C, R and Q can be obtained using NMF optimization without the volume term (the default option, parameters do.nmf and iter.nmf) or initialized manually using C.init, R.init and Q.init.
- Number of iterations. Beyond the overall number n.iter of iterations that alternate optimization of C, R and Q, the parameters vol.iter and c.iter regulate the number of iterations used to update R and C at each alternating step, respectively. These parameters control the convergence rate. Additionally, R.majorate controls the precision of the approximation of R at each iteration at the expense of time.
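Since a well-chosen wvol is critical, a simple practical strategy is to scan several candidate values and compare how well the known ground-truth columns are recovered (a sketch; the grid of values and the use of the simulated ground truth are illustrative):

```r
# Scan candidate volume weights and report the mean best-match
# correlation against the known simulated matrix simnmf$C.
for (w in c(2e-3, 2e-2, 2e-1)) {
  res <- volnmf_main(vol, n.comp = 10, wvol = w,
                     n.iter = 3e+3, vol.iter = 1e+1, c.iter = 1e+1,
                     verbose = FALSE)
  Cw <- res$C * vol$col.factors  # rescale back to the original units
  message("wvol = ", w, ": mean max correlation = ",
          round(mean(apply(cor(simnmf$C, Cw), 1, max)), 3))
}
```

On real data, where no ground truth is available, the same scan can instead compare reconstruction error against the collapse of eigenvalues of E described above.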