Add minimal Sphinx documentation
mwydmuch committed Oct 26, 2020
1 parent b65fe05 commit fc7893d
Showing 10 changed files with 399 additions and 18 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -17,6 +17,9 @@
# Cpp
nxc

# Docs
docs/*/*

# Experiments
/data
/models*
42 changes: 28 additions & 14 deletions README.md
@@ -1,10 +1,14 @@
# napkinXC [![Build Status](https://travis-ci.org/mwydmuch/napkinXC.svg?branch=master)](https://travis-ci.org/mwydmuch/napkinXC)
# napkinXC
[![PyPI version](https://badge.fury.io/py/napkinxc.svg)](https://badge.fury.io/py/napkinxc)
[![Build Status](https://travis-ci.org/mwydmuch/napkinXC.svg?branch=master)](https://travis-ci.org/mwydmuch/napkinXC)
[![Documentation Status](https://readthedocs.org/projects/napkinxc/badge/?version=latest)](https://napkinxc.readthedocs.io/en/latest/?badge=latest)
[![License](https://img.shields.io/github/license/mwydmuch/napkinXC.svg)](https://github.com/mwydmuch/napkinXC/blob/master/LICENSE)

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification.
It allows training a classifier for very large datasets in a few lines of code with minimal resources.

Right now, napkinXC implements the following features both in Python and C++:
- Probabilistic Label Trees (PLT) and Online Probabilistic Label Trees (OPLT),
- Probabilistic Label Trees (PLTs) and Online Probabilistic Label Trees (OPLTs),
- Hierarchical softmax (HSM),
- Binary Relevance (BR),
- One Versus Rest (OVR),
@@ -16,6 +20,7 @@ Right now, napkinXC implements the following features both in Python and C++:
- helpers to download and load data from [XML Repository](http://manikvarma.org/downloads/XC/XMLRepository.html),
- helpers to measure performance.

Documentation is available at [napkinxc.readthedocs.io](https://napkinxc.readthedocs.io).
Please note that this library is still under development and also serves as a base for experiments.
Some of the experimental features may not be documented.

@@ -26,20 +31,22 @@ All contributions to the project are welcome!
## Roadmap

Coming soon:
- OPLT available in Python
- Possibility to use any type of binary classifier from Python
- Improved dataset loading from Python
- More datasets from XML Repository


## Python quick start
## Python Quick Start and Documentation

The Python version of napkinXC can be easily installed from the PyPI repository:
napkinXC's documentation is available at [https://napkinxc.readthedocs.io](https://napkinxc.readthedocs.io) and is generated from this repository.

The Python version of napkinXC can be easily installed from the PyPI repository on Linux and macOS;
it requires a modern C++ compiler, CMake, and Git to be installed:
```
pip install napkinxc
```

or directly from the GitHub repository:
or the latest master version directly from the GitHub repository:
```
pip install git+https://github.com/mwydmuch/napkinXC.git
```
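For illustration, a minimal end-to-end sketch along the lines of the scripts in the `python/examples` directory (the dataset name, model directory, and top-k value below are assumptions, and exact signatures may differ between versions):

```python
from napkinxc.datasets import load_dataset
from napkinxc.models import PLT
from napkinxc.measures import precision_at_k

# Download (if needed) and load a benchmark dataset from the XML Repository.
X_train, Y_train = load_dataset("eurlex-4k", "train")
X_test, Y_test = load_dataset("eurlex-4k", "test")

# Train a Probabilistic Label Tree and predict top-1 labels for the test set.
plt = PLT("eurlex-model")
plt.fit(X_train, Y_train)
Y_pred = plt.predict(X_test, top_k=1)

print(precision_at_k(Y_test, Y_pred, k=1))
```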
@@ -63,11 +70,11 @@ More examples can be found under `python/examples` directory.

## Building executable

napkinXC can be also build as executable using:
napkinXC can also be built using:

```
cmake .
make -j
make
```


@@ -80,7 +87,7 @@ Commands:
train Train model on given input data
test Test model on given input data
predict Predict for given data
ofo Use online f-measure optimalization
ofo Use online f-measure optimization
version Print napkinXC version
help Print help
@@ -91,12 +98,8 @@ Args:
-m, --model Model type (default = plt):
Models: ovr, br, hsm, plt, oplt, svbopFull, svbopHf, brMips, svbopMips
--ensemble Number of models in ensemble (default = 1)
-d, --dataFormat Type of data format (default = libsvm),
Supported data formats: libsvm
-t, --threads Number of threads to use (default = 0)
Note: -1 to use #cpus - 1, 0 to use #cpus
--header Input contains header (default = 1)
Header format for libsvm: #lines #features #labels
--hash Size of features space (default = 0)
Note: 0 to disable hashing
--featuresThreshold Prune features below given threshold (default = 0.0)
@@ -157,12 +160,23 @@ Args:
```


## Data Format

napkinXC supports the multi-label svmlight/libsvm format
and the format of datasets from [The Extreme Classification Repository](https://manikvarma.github.io/downloads/XC/XMLRepository.html),
which has an additional header line with the numbers of data points, features, and labels.

```
label,label,... feature(:value) feature(:value) ...
```
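For instance, a toy dataset with 3 data points, 5 features, and 2 labels could look as follows (values are arbitrary; the header line lists the numbers of data points, features, and labels and is only required for the XC Repository format):

```
3 5 2
0,1 1:0.5 3:1.0
1 0:0.3 2:0.7 4:0.2
0 1:1.0 4:0.4
```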


## References and acknowledgments

This library implements methods from the following papers:

- [Probabilistic Label Trees for Extreme Multi-label Classification](https://arxiv.org/pdf/2009.11218.pdf)
- [Online Probabilistic Label Trees](https://arxiv.org/abs/1906.08129)

- [Efficient Algorithms for Set-Valued Prediction in Multi-Class Classification](https://arxiv.org/abs/1906.08129)

Another implementation of the PLT model is available in the [extremeText](https://github.com/mwydmuch/extremeText) library,
13 changes: 13 additions & 0 deletions docs/README.md
@@ -0,0 +1,13 @@
# Documentation

Documentation for napkinXC is generated using [Sphinx](https://www.sphinx-doc.org/).
After each commit on `master`, documentation is updated and published to [Read the Docs](https://napkinxc.readthedocs.io).

You can build the documentation locally. Just install the requirements and run the following in the ``docs`` directory:

```
pip install -r requirements.txt
make html
```

Documentation will be created in the `docs/_build` directory.
92 changes: 92 additions & 0 deletions docs/conf.py
@@ -0,0 +1,92 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.

import os
import sys
sys.path.insert(0, os.path.abspath('../python'))


# -- Project information -----------------------------------------------------

project = 'napkinXC'
copyright = '2020, Marek Wydmuch'
author = 'Marek Wydmuch'

# The full version, including alpha/beta/rc tags
release = '0.4.1'


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.todo',
'sphinx.ext.viewcode',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.mathjax'
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']


# -- Autodoc configuration ---------------------------------------------------
autodoc_mock_imports = [
"napkinxc._napkinxc",
"numpy",
"scipy",
"scipy.sparse",
"sklearn"
]
#autoclass_content = 'both'
autodoc_default_flags = ['members', 'inherited-members', 'show-inheritance']
autodoc_default_options = {
"members": True,
"inherited-members": True,
"show-inheritance": True,
}

# Generate autosummary pages. Output should be set with: `:toctree: pythonapi/`
autosummary_generate = ['python_api.rst']

# Only the class' docstring is inserted.
autoclass_content = 'class'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False

# The master toctree document.
master_doc = 'index'


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.

import sphinx_rtd_theme
html_theme = 'sphinx_rtd_theme'

html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
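A side note on the `autosummary_generate = ['python_api.rst']` setting above: it assumes that page contains `autosummary` directives whose output directory is set with `:toctree: pythonapi/`. A hypothetical fragment of such a page (the listed class path is an assumption) could look like:

```
.. autosummary::
    :toctree: pythonapi/

    napkinxc.models.PLT
```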
114 changes: 114 additions & 0 deletions docs/exe_usage.rst
@@ -0,0 +1,114 @@
Executable
==========

napkinXC can also be built as a command-line executable that can be used to train and evaluate models and make predictions.


Building
--------

To build napkinXC, first clone the project repository and run the following commands in the root directory of the project:

.. code:: sh

    cmake .
    make

The ``-B`` option can be passed to the CMake command to specify a different build directory.
After successful compilation, the ``nxc`` executable should appear in the root or the specified build directory.
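For example, a minimal out-of-source build (the ``build`` directory name is arbitrary, and a reasonably recent CMake is assumed for the ``-B`` option) could look like:

.. code:: sh

    # configure into a separate build directory, then build it
    cmake -B build .
    cmake --build build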


Data Format
-----------

napkinXC supports the multi-label svmlight/libsvm format
and the format of datasets from `The Extreme Classification Repository <https://manikvarma.github.io/downloads/XC/XMLRepository.html>`_,
which has an additional header line with the numbers of data points, features, and labels.

.. code:: sh

    label,label,... feature(:value) feature(:value) ...


Command line options
--------------------

.. code::

    Usage: nxc <command> <args>
    Commands:
    train Train model on given input data
    test Test model on given input data
    predict Predict for given data
    ofo Use online f-measure optimization
    version Print napkinXC version
    help Print help
    Args:
    General:
    -i, --input Input dataset
    -o, --output Output (model) dir
    -m, --model Model type (default = plt):
    Models: ovr, br, hsm, plt, oplt, svbopFull, svbopHf, brMips, svbopMips
    --ensemble Number of models in ensemble (default = 1)
    -t, --threads Number of threads to use (default = 0)
    Note: -1 to use #cpus - 1, 0 to use #cpus
    --hash Size of features space (default = 0)
    Note: 0 to disable hashing
    --featuresThreshold Prune features below given threshold (default = 0.0)
    --seed Seed (default = system time)
    --verbose Verbose level (default = 2)
    Base classifiers:
    --optimizer Optimizer used for training binary classifiers (default = liblinear)
    Optimizers: liblinear, sgd, adagrad, fobos
    --bias Value of the bias features (default = 1)
    --inbalanceLabelsWeighting Increase the weight of minority labels in base classifiers (default = 1)
    --weightsThreshold Threshold value for pruning models weights (default = 0.1)
    LIBLINEAR: (more about LIBLINEAR: https://github.com/cjlin1/liblinear)
    -s, --liblinearSolver LIBLINEAR solver (default for log loss = L2R_LR_DUAL, for l2 loss = L2R_L2LOSS_SVC_DUAL)
    Supported solvers: L2R_LR_DUAL, L2R_LR, L1R_LR,
    L2R_L2LOSS_SVC_DUAL, L2R_L2LOSS_SVC, L2R_L1LOSS_SVC_DUAL, L1R_L2LOSS_SVC
    -c, --liblinearC LIBLINEAR cost coefficient, inverse of regularization strength, must be a positive float,
    smaller values specify stronger regularization (default = 10.0)
    --eps, --liblinearEps LIBLINEAR tolerance of termination criterion (default = 0.1)
    SGD/AdaGrad:
    -l, --lr, --eta Step size (learning rate) for online optimizers (default = 1.0)
    --epochs Number of training epochs for online optimizers (default = 1)
    --adagradEps Defines starting step size for AdaGrad (default = 0.001)
    Tree:
    -a, --arity Arity of tree nodes (default = 2)
    --maxLeaves Maximum degree of pre-leaf nodes (default = 100)
    --tree File with tree structure
    --treeType Type of a tree to build if file with structure is not provided
    tree types: hierarchicalKmeans, huffman, completeKaryInOrder, completeKaryRandom,
    balancedInOrder, balancedRandom, onlineComplete
    K-Means tree:
    --kmeansEps Tolerance of termination criterion of the k-means clustering
    used in hierarchical k-means tree building procedure (default = 0.001)
    --kmeansBalanced Use balanced K-Means clustering (default = 1)
    Prediction:
    --topK Predict top-k labels (default = 5)
    --threshold Predict labels with probability above the threshold (default = 0)
    --thresholds Path to a file with threshold for each label
    --setUtility Type of set-utility function for prediction using svbopFull, svbopHf, svbopMips models.
    Set-utility functions: uP, uF1, uAlfa, uAlfaBeta, uDeltaGamma
    See: https://arxiv.org/abs/1906.08129
    Set-Utility:
    --alpha
    --beta
    --delta
    --gamma
    Test:
    --measures Evaluate test using set of measures (default = "p@1,r@1,c@1,p@3,r@3,c@3,p@5,r@5,c@5")
    Measures: acc (accuracy), p (precision), r (recall), c (coverage), hl (hamming loss)
    p@k (precision at k), r@k (recall at k), c@k (coverage at k), s (prediction size)
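As an illustration, a typical train/test sequence with the executable could be (file paths are hypothetical and only the options documented above are used):

.. code:: sh

    # train a PLT model with a hierarchical k-means tree, then evaluate precision at k
    ./nxc train -i train.txt -o eurlex-model -m plt --treeType hierarchicalKmeans
    ./nxc test -i test.txt -o eurlex-model --measures p@1,p@3,p@5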
37 changes: 37 additions & 0 deletions docs/index.rst
@@ -0,0 +1,37 @@
.. napkinXC documentation master file

Welcome to napkinXC's documentation!
====================================

.. note:: Documentation is currently a work in progress!

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification
that implements the following methods both in Python and C++:

* Probabilistic Label Trees (PLTs) - for multi-label log-time training and prediction,
* Hierarchical softmax (HSM) - for multi-class log-time training and prediction,
* Binary Relevance (BR) - multi-label baseline,
* One Versus Rest (OVR) - multi-class baseline.

All these methods decompose the multi-class or multi-label problem into a set of binary learning problems.


Right now, the detailed description of the methods and their parameters can be found in this paper:
`Probabilistic Label Trees for Extreme Multi-label Classification <https://arxiv.org/pdf/2009.11218.pdf>`_



.. toctree::
:maxdepth: 1
:caption: Contents:

quick_start
exe_usage
python_api


Indices and tables
------------------

* :ref:`genindex`
* :ref:`search`
