documentation 0.2.1

michalovadek · Aug 28, 2023 · 3ca8b3c
1 parent 66216f8
commit 3ca8b3c

Showing 8 changed files with 47 additions and 48 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: nmfbin
 Title: Non-negative Matrix Factorization for Binary Data
-Version: 0.1.0
+Version: 0.2.1
 Authors@R: c(person(given = "Michal",
              family = "Ovadek",
              role = c("aut", "cre", "cph"),

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,7 @@
+# nmfbin 0.2.1
+
+* documentation improvements
+
 # nmfbin 0.2.0
 
 * Full rewrite, simplification, improved terminology

diff --git a/R/nmfbin.R b/R/nmfbin.R
@@ -3,22 +3,24 @@
 #' This function performs Logistic Non-negative Matrix Factorization (NMF) on a binary matrix.
 #'
 #' @param X A binary matrix (m x n) to be factorized.
-#' @param k The number of factors or components.
-#' @param optimizer Type of updating algorithm. `update` for NMF multiplicative update rules or `gradient` for gradient descent.
-#' @param init Method for initializing the factorization.
-#' @param max_iter Maximum number of iterations for the gradient descent optimization.
+#' @param k The number of factors (components, topics).
+#' @param optimizer Type of updating algorithm. `mur` for NMF multiplicative update rules, `gradient` for gradient descent, `sgd` for stochastic gradient descent.
+#' @param init Method for initializing the factorization. By default Nonnegative Double Singular Value Decomposition with average densification.
+#' @param max_iter Maximum number of iterations for optimization.
 #' @param tol Convergence tolerance. The optimization stops when the change in loss is less than this value.
 #' @param learning_rate Learning rate (step size) for the gradient descent optimization.
 #' @param verbose Print convergence if `TRUE`.
-#' @param loss_fun Choice of loss function.
-#' @param loss_normalize Normalize loss if `TRUE`.
+#' @param loss_fun Choice of loss function: `logloss` (negative log-likelihood, also known as binary cross-entropy) or `mse` (mean squared error).
+#' @param loss_normalize Normalize loss by matrix dimensions if `TRUE`.
 #' @param epsilon Constant to avoid log(0).
 #'
 #' @return A list containing:
 #' \itemize{
-#'   \item \code{W}: The basis matrix (m x k).
-#'   \item \code{H}: The coefficient matrix (k x n).
+#'   \item \code{W}: The basis matrix (m x k). The document-topic matrix in topic modelling.
+#'   \item \code{H}: The coefficient matrix (k x n). Contribution of features to factors (topics).
 #'   \item \code{c}: The global threshold.
+#'   \item \code{convergence}: Divergence (loss) from `X` at every `iter` until `tol` or `max_iter` is reached.
+#'   \item \code{final_loss}: The final loss before `tol` or `max_iter` was reached.
 #' }
 #'
 #' @examples
@@ -34,8 +36,6 @@
 #'
 #' # Apply the function
 #' result <- nmfbin(X, k)
-#' }
-#'
 #' @export
 
 nmfbin <- function(X, k, optimizer = "mur", init = "nndsvd", max_iter = 1000, tol = 1e-6, learning_rate = 0.001,
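
Note on the `loss_fun`, `loss_normalize` and `epsilon` parameters documented above: the following is a minimal base-R sketch of a binary cross-entropy loss, not the package's actual implementation. The helper name `logloss_sketch`, the clamping strategy, and the default `epsilon` value are assumptions made for illustration only.

```r
# Hypothetical helper, not part of nmfbin: binary cross-entropy between a
# binary matrix X and a matrix P of predicted probabilities in [0, 1].
logloss_sketch <- function(X, P, epsilon = 1e-10, normalize = TRUE) {
  stopifnot(all(dim(X) == dim(P)))
  # Assumption: clamp probabilities away from 0 and 1 so log() stays finite
  P <- pmin(pmax(P, epsilon), 1 - epsilon)
  loss <- -sum(X * log(P) + (1 - X) * log(1 - P))
  # Assumption: "normalize by matrix dimensions" means dividing by m * n
  if (normalize) loss <- loss / (nrow(X) * ncol(X))
  loss
}

set.seed(1)
X <- matrix(sample(c(0, 1), 100, replace = TRUE), ncol = 10)
P_good <- 0.9 * X + 0.05                     # probabilities close to X
P_flat <- matrix(0.5, nrow = 10, ncol = 10)  # uninformative prediction

# A prediction close to X should score a lower loss than a flat one
logloss_sketch(X, P_good) < logloss_sketch(X, P_flat)
```

With normalization, the flat prediction scores `log(2)` per entry regardless of `X`, which gives a convenient baseline when reading a loss curve.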

diff --git a/README.md b/README.md
@@ -8,9 +8,13 @@
 
 The `nmfbin` R package provides a simple Non-Negative Matrix Factorization (NMF) implementation tailored for binary data matrices. It offers a choice of initialization methods, loss functions and updating algorithms.
 
-Unlike most other NMF packages, this one is focused on (1) binary (Boolean) data and (2) minimizing dependencies.
+NMF is typically used for reducing high-dimensional matrices into lower (k-) rank ones where _k_ is chosen by the user. Given a non-negative matrix _X_ of size $m \times n$, NMF looks for two non-negative matrices _W_ ($m \times k$) and _H_ ($k \times n$), such that:
 
-Note the package is in early stages of development.
+$$X \approx W \times H$$
+
+In topic modelling, _W_ is interpreted as the document-topic matrix and _H_ as the topic-feature matrix.
+
+Unlike most other NMF packages, `nmfbin` is focused on binary (Boolean) data, while keeping the number of dependencies to a minimum.
 
 ## Installation
 
@@ -30,10 +34,10 @@ The input matrix can only contain 0s and 1s.
 library(nmfbin)
 
 # Create a binary matrix for demonstration
-X <- matrix(sample(c(0, 1), 100, replace=TRUE), ncol=10)
+X <- matrix(sample(c(0, 1), 100, replace = TRUE), ncol = 10)
 
-# Perform NMF
-results <- nmfbin(X, k=3, optimizer = "mur", init = "nndsvd")
+# Perform Logistic NMF
+results <- nmfbin(X, k = 3, optimizer = "mur", init = "nndsvd", max_iter = 1000)
 ```
 
 ## Citation
@@ -43,7 +47,7 @@ results <- nmfbin(X, k=3, optimizer = "mur", init = "nndsvd")
   title = {nmfbin: Non-negative Matrix Factorization for Binary Data},
   author = {Michal Ovadek},
   year = {2023},
-  note = {R package version 0.2.0},
+  note = {R package version 0.2.1},
   url = {https://michalovadek.github.io/nmfbin/},
 }
 ```
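
Note on the factorization described in the README text added in this diff: the shapes involved can be checked with a short base-R sketch. The factors below are random non-negative matrices chosen purely for illustration. They are not fitted by `nmfbin`, and the values of `m`, `n` and `k` are arbitrary.

```r
set.seed(42)
m <- 10; n <- 8; k <- 3

# Illustrative non-negative factors (random, not the output of nmfbin)
W <- matrix(runif(m * k), nrow = m, ncol = k)  # m x k
H <- matrix(runif(k * n), nrow = k, ncol = n)  # k x n

# The product has the shape of the original matrix X (m x n)
X_hat <- W %*% H
dim(X_hat)

# Non-negative factors give a non-negative product of rank at most k
all(X_hat >= 0)
qr(X_hat)$rank <= k
```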

diff --git a/_pkgdown.yml b/_pkgdown.yml
@@ -1,4 +1,3 @@
 url: https://michalovadek.github.io/nmfbin/
 template:
   bootstrap: 5
-
diff --git a/inst/CITATION b/inst/CITATION
@@ -5,7 +5,7 @@ bibentry(
                   family = "Ovadek",
                   email = "michal.ovadek@gmail.com"),
   year = "2023",
-  note = "R package version 0.2.0",
+  note = "R package version 0.2.1",
   url = "https://michalovadek.github.io/nmfbin/",
   header = "To cite nmfbin in publications use:"
 )
diff --git a/man/nmfbin.Rd b/man/nmfbin.Rd
diff --git a/vignettes/introduction.Rmd b/vignettes/introduction.Rmd
@@ -14,22 +14,28 @@ knitr::opts_chunk$set(
 )
 ```
 
-The main function `nmfbin()` operates on matrices like so:
+The main function `nmfbin()` operates on binary matrices like so:
 
 ```{r setup}
 library(nmfbin)
 
 # Create a binary matrix for demonstration
-X <- matrix(sample(c(0, 1), 100, replace=TRUE), ncol=10)
+X <- matrix(sample(c(0, 1), 100, replace = TRUE), ncol = 10)
 
 # Perform NMF
-results <- nmfbin(X, k=3, optimizer = "mur", init = "nndsvd")
+results <- nmfbin(X, k = 3, optimizer = "mur", init = "nndsvd", loss_fun = "logloss", max_iter = 500)
 ```
 
 The results include the final loss:
 
-```{r measures}
-
+```{r finalloss}
 print(results$final_loss)
+```
+
+We can also easily plot the optimization process.
 
+```{r convergence}
+plot(results$convergence,
+     xlab = "Iteration",
+     ylab = "Negative log-likelihood loss")
 ```