Add minimal Sphinx documentation
mwydmuch committed Oct 26, 2020
1 parent b65fe05 commit fc7893d
Showing 10 changed files with 399 additions and 18 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -17,6 +17,9 @@
# Cpp
nxc

# Docs
docs/*/*

# Experiments
/data
/models*
42 changes: 28 additions & 14 deletions README.md
@@ -1,10 +1,14 @@
# napkinXC [![Build Status](https://travis-ci.org/mwydmuch/napkinXC.svg?branch=master)](https://travis-ci.org/mwydmuch/napkinXC)
# napkinXC
[![PyPI version](https://badge.fury.io/py/napkinxc.svg)](https://badge.fury.io/py/napkinxc)
[![Build Status](https://travis-ci.org/mwydmuch/napkinXC.svg?branch=master)](https://travis-ci.org/mwydmuch/napkinXC)
[![Documentation Status](https://readthedocs.org/projects/napkinxc/badge/?version=latest)](https://napkinxc.readthedocs.io/en/latest/?badge=latest)
[![License](https://img.shields.io/github/license/mwydmuch/napkinXC.svg)](https://github.com/mwydmuch/napkinXC/blob/master/LICENSE)

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification.
It allows training a classifier for very large datasets in a few lines of code with minimal resources.

Right now, napkinXC implements the following features both in Python and C++:
- Probabilistic Label Trees (PLT) and Online Probabilistic Label Trees (OPLT),
- Probabilistic Label Trees (PLTs) and Online Probabilistic Label Trees (OPLTs),
- Hierarchical softmax (HSM),
- Binary Relevance (BR),
- One Versus Rest (OVR),
@@ -16,6 +20,7 @@ Right now, napkinXC implements the following features both in Python and C++:
- helpers to download and load data from [XML Repository](http://manikvarma.org/downloads/XC/XMLRepository.html),
- helpers to measure performance.

Documentation is available at [napkinxc.readthedocs.io](https://napkinxc.readthedocs.io).
Please note that this library is still under development and also serves as a base for experiments.
Some of the experimental features may not be documented.

@@ -26,20 +31,22 @@ All contributions to the project are welcome!
## Roadmap

Coming soon:
- OPLT available in Python
- Possibility to use any type of binary classifier from Python
- Improved dataset loading from Python
- More datasets from XML Repository


## Python quick start
## Python Quick Start and Documentation

The Python version of napkinXC can be easily installed from the PyPI repository:
napkinXC's documentation is available at [https://napkinxc.readthedocs.io](https://napkinxc.readthedocs.io) and is generated from this repository.

The Python version of napkinXC can be easily installed from the PyPI repository on Linux and macOS;
it requires a modern C++ compiler, CMake, and Git to be installed:
```
pip install napkinxc
```

or directly from the GitHub repository:
or the latest master version directly from the GitHub repository:
```
pip install git+https://github.com/mwydmuch/napkinXC.git
```
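For illustration, a minimal end-to-end sketch along the lines of the scripts in the `python/examples` directory (the dataset name, model directory, and top-k value below are assumptions, and exact signatures may differ between versions):

```python
from napkinxc.datasets import load_dataset
from napkinxc.models import PLT
from napkinxc.measures import precision_at_k

# Download (if needed) and load a benchmark dataset from the XML Repository.
X_train, Y_train = load_dataset("eurlex-4k", "train")
X_test, Y_test = load_dataset("eurlex-4k", "test")

# Train a Probabilistic Label Tree and predict top-1 labels for the test set.
plt = PLT("eurlex-model")
plt.fit(X_train, Y_train)
Y_pred = plt.predict(X_test, top_k=1)

print(precision_at_k(Y_test, Y_pred, k=1))
```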
@@ -63,11 +70,11 @@ More examples can be found under `python/examples` directory.

## Building executable

napkinXC can be also build as executable using:
napkinXC can also be built using:

```
cmake .
make -j
make
```


@@ -80,7 +87,7 @@ Commands:
train Train model on given input data
test Test model on given input data
predict Predict for given data
ofo Use online f-measure optimalization
ofo Use online f-measure optimization
version Print napkinXC version
help Print help
@@ -91,12 +98,8 @@ Args:
-m, --model Model type (default = plt):
Models: ovr, br, hsm, plt, oplt, svbopFull, svbopHf, brMips, svbopMips
--ensemble Number of models in ensemble (default = 1)
-d, --dataFormat Type of data format (default = libsvm),
Supported data formats: libsvm
-t, --threads Number of threads to use (default = 0)
Note: -1 to use #cpus - 1, 0 to use #cpus
--header Input contains header (default = 1)
Header format for libsvm: #lines #features #labels
--hash Size of features space (default = 0)
Note: 0 to disable hashing
--featuresThreshold Prune features below given threshold (default = 0.0)
@@ -157,12 +160,23 @@ Args:
```


## Data Format

napkinXC supports the multi-label svmlight/libsvm format
and the format of datasets from [The Extreme Classification Repository](https://manikvarma.github.io/downloads/XC/XMLRepository.html),
which has an additional header line with the numbers of data points, features, and labels.

```
label,label,... feature(:value) feature(:value) ...
```
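For instance, a toy dataset with 3 data points, 5 features, and 2 labels could look as follows (values are arbitrary; the header line lists the numbers of data points, features, and labels and is only required for the XC Repository format):

```
3 5 2
0,1 1:0.5 3:1.0
1 0:0.3 2:0.7 4:0.2
0 1:1.0 4:0.4
```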


## References and acknowledgments

This library implements methods from the following papers:

- [Probabilistic Label Trees for Extreme Multi-label Classification](https://arxiv.org/pdf/2009.11218.pdf)
- [Online Probabilistic Label Trees](https://arxiv.org/abs/1906.08129)

- [Efficient Algorithms for Set-Valued Prediction in Multi-Class Classification](https://arxiv.org/abs/1906.08129)

Another implementation of the PLT model is available in the [extremeText](https://github.com/mwydmuch/extremeText) library,
13 changes: 13 additions & 0 deletions docs/README.md
@@ -0,0 +1,13 @@
# Documentation

Documentation for napkinXC is generated using [Sphinx](https://www.sphinx-doc.org/).
After each commit on `master`, documentation is updated and published to [Read the Docs](https://napkinxc.readthedocs.io).

You can build the documentation locally. Just install the requirements and run the following in the ``docs`` directory:

```
pip install -r requirements.txt
make html
```

Documentation will be created in the `docs/_build` directory.
92 changes: 92 additions & 0 deletions docs/conf.py
@@ -0,0 +1,92 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.

import os
import sys
sys.path.insert(0, os.path.abspath('../python'))


# -- Project information -----------------------------------------------------

project = 'napkinXC'
copyright = '2020, Marek Wydmuch'
author = 'Marek Wydmuch'

# The full version, including alpha/beta/rc tags
release = '0.4.1'


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.todo',
'sphinx.ext.viewcode',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.mathjax'
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']


# -- Autodoc configuration ---------------------------------------------------
autodoc_mock_imports = [
"napkinxc._napkinxc",
"numpy",
"scipy",
"scipy.sparse",
"sklearn"
]
#autoclass_content = 'both'
autodoc_default_flags = ['members', 'inherited-members', 'show-inheritance']
autodoc_default_options = {
"members": True,
"inherited-members": True,
"show-inheritance": True,
}

# Generate autosummary pages. Output should be set with: `:toctree: pythonapi/`
autosummary_generate = ['python_api.rst']

# Only the class' docstring is inserted.
autoclass_content = 'class'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False

# The master toctree document.
master_doc = 'index'


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.

import sphinx_rtd_theme
html_theme = 'sphinx_rtd_theme'

html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
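A side note on the `autosummary_generate = ['python_api.rst']` setting above: it assumes that page contains `autosummary` directives whose output directory is set with `:toctree: pythonapi/`. A hypothetical fragment of such a page (the listed class path is an assumption) could look like:

```
.. autosummary::
    :toctree: pythonapi/

    napkinxc.models.PLT
```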
114 changes: 114 additions & 0 deletions docs/exe_usage.rst
@@ -0,0 +1,114 @@
Executable
==========

napkinXC can also be built as a command-line executable that can be used to train and evaluate models and make predictions.


Building
--------

To build napkinXC, first clone the project repository and run the following commands in the root directory of the project:

.. code:: sh

    cmake .
    make

The ``-B`` option can be passed to the CMake command to specify a different build directory.
After successful compilation, the ``nxc`` executable should appear in the root or the specified build directory.
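For example, a minimal out-of-source build (the ``build`` directory name is arbitrary, and a reasonably recent CMake is assumed for the ``-B`` option) could look like:

.. code:: sh

    # configure into a separate build directory, then build it
    cmake -B build .
    cmake --build build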


Data Format
-----------

napkinXC supports the multi-label svmlight/libsvm format
and the format of datasets from `The Extreme Classification Repository <https://manikvarma.github.io/downloads/XC/XMLRepository.html>`_,
which has an additional header line with the numbers of data points, features, and labels.

.. code:: sh

    label,label,... feature(:value) feature(:value) ...


Command line options
--------------------

.. code::

    Usage: nxc <command> <args>
    Commands:
    train Train model on given input data
    test Test model on given input data
    predict Predict for given data
    ofo Use online f-measure optimization
    version Print napkinXC version
    help Print help
    Args:
    General:
    -i, --input Input dataset
    -o, --output Output (model) dir
    -m, --model Model type (default = plt):
    Models: ovr, br, hsm, plt, oplt, svbopFull, svbopHf, brMips, svbopMips
    --ensemble Number of models in ensemble (default = 1)
    -t, --threads Number of threads to use (default = 0)
    Note: -1 to use #cpus - 1, 0 to use #cpus
    --hash Size of features space (default = 0)
    Note: 0 to disable hashing
    --featuresThreshold Prune features below given threshold (default = 0.0)
    --seed Seed (default = system time)
    --verbose Verbose level (default = 2)
    Base classifiers:
    --optimizer Optimizer used for training binary classifiers (default = liblinear)
    Optimizers: liblinear, sgd, adagrad, fobos
    --bias Value of the bias features (default = 1)
    --inbalanceLabelsWeighting Increase the weight of minority labels in base classifiers (default = 1)
    --weightsThreshold Threshold value for pruning models weights (default = 0.1)
    LIBLINEAR: (more about LIBLINEAR: https://github.com/cjlin1/liblinear)
    -s, --liblinearSolver LIBLINEAR solver (default for log loss = L2R_LR_DUAL, for l2 loss = L2R_L2LOSS_SVC_DUAL)
    Supported solvers: L2R_LR_DUAL, L2R_LR, L1R_LR,
    L2R_L2LOSS_SVC_DUAL, L2R_L2LOSS_SVC, L2R_L1LOSS_SVC_DUAL, L1R_L2LOSS_SVC
    -c, --liblinearC LIBLINEAR cost coefficient, inverse of regularization strength, must be a positive float,
    smaller values specify stronger regularization (default = 10.0)
    --eps, --liblinearEps LIBLINEAR tolerance of termination criterion (default = 0.1)
    SGD/AdaGrad:
    -l, --lr, --eta Step size (learning rate) for online optimizers (default = 1.0)
    --epochs Number of training epochs for online optimizers (default = 1)
    --adagradEps Defines starting step size for AdaGrad (default = 0.001)
    Tree:
    -a, --arity Arity of tree nodes (default = 2)
    --maxLeaves Maximum degree of pre-leaf nodes (default = 100)
    --tree File with tree structure
    --treeType Type of a tree to build if file with structure is not provided
    tree types: hierarchicalKmeans, huffman, completeKaryInOrder, completeKaryRandom,
    balancedInOrder, balancedRandom, onlineComplete
    K-Means tree:
    --kmeansEps Tolerance of termination criterion of the k-means clustering
    used in hierarchical k-means tree building procedure (default = 0.001)
    --kmeansBalanced Use balanced K-Means clustering (default = 1)
    Prediction:
    --topK Predict top-k labels (default = 5)
    --threshold Predict labels with probability above the threshold (default = 0)
    --thresholds Path to a file with threshold for each label
    --setUtility Type of set-utility function for prediction using svbopFull, svbopHf, svbopMips models.
    Set-utility functions: uP, uF1, uAlfa, uAlfaBeta, uDeltaGamma
    See: https://arxiv.org/abs/1906.08129
    Set-Utility:
    --alpha
    --beta
    --delta
    --gamma
    Test:
    --measures Evaluate test using set of measures (default = "p@1,r@1,c@1,p@3,r@3,c@3,p@5,r@5,c@5")
    Measures: acc (accuracy), p (precision), r (recall), c (coverage), hl (hamming loss)
    p@k (precision at k), r@k (recall at k), c@k (coverage at k), s (prediction size)
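As an illustration, a typical train/test sequence with the executable could be (file paths are hypothetical and only the options documented above are used):

.. code:: sh

    # train a PLT model with a hierarchical k-means tree, then evaluate precision at k
    ./nxc train -i train.txt -o eurlex-model -m plt --treeType hierarchicalKmeans
    ./nxc test -i test.txt -o eurlex-model --measures p@1,p@3,p@5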
37 changes: 37 additions & 0 deletions docs/index.rst
@@ -0,0 +1,37 @@
.. napkinXC documentation master file

Welcome to napkinXC's documentation!
====================================

.. note:: Documentation is currently a work in progress!

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification
that implements the following methods both in Python and C++:

* Probabilistic Label Trees (PLTs) - for multi-label log-time training and prediction,
* Hierarchical softmax (HSM) - for multi-class log-time training and prediction,
* Binary Relevance (BR) - multi-label baseline,
* One Versus Rest (OVR) - multi-class baseline.

All these methods decompose the multi-class or multi-label problem into a set of binary learning problems.


Right now, the detailed description of the methods and their parameters can be found in this paper:
`Probabilistic Label Trees for Extreme Multi-label Classification <https://arxiv.org/pdf/2009.11218.pdf>`_



.. toctree::
:maxdepth: 1
:caption: Contents:

quick_start
exe_usage
python_api


Indices and tables
------------------

* :ref:`genindex`
* :ref:`search`
