fastText embedding trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. py:module:: pimlico.modules.embeddings.fasttext

+------------+-------------------------------------+
| Path       | pimlico.modules.embeddings.fasttext |
+------------+-------------------------------------+
| Executable | yes                                 |
+------------+-------------------------------------+

Train fastText embeddings on a tokenized corpus.

Uses the `fastText Python package <https://fasttext.cc/docs/en/python-module.html>`_.

fastText embeddings store more than just a vector for each word, since they
also include sub-word representations. We therefore produce a standard embeddings
output, containing the word vectors, and also a special fastText embeddings output.

*This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.*
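To make the sub-word idea concrete, here is a minimal illustrative sketch (not Pimlico's or the fastText package's actual code) of how a word is broken into character n-grams between ``minn`` and ``maxn`` characters long, with ``<`` and ``>`` marking word boundaries, and how each n-gram can be hashed into one of a fixed number of buckets. The 32-bit FNV-1a hash is the style of hash fastText uses internally, but treat the details here as an approximation.

```python
def char_ngrams(word, minn=3, maxn=6):
    """All character n-grams of length minn..maxn, with boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

def fnv1a(s):
    """32-bit FNV-1a hash of a string's UTF-8 bytes."""
    h = 2166136261
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

def ngram_buckets(word, bucket=2000000, minn=3, maxn=6):
    """Map each of a word's n-grams to a row index of the sub-word matrix."""
    return [fnv1a(g) % bucket for g in char_ngrams(word, minn, maxn)]

print(char_ngrams("where", minn=3, maxn=4))
# → ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```

Because unseen words still yield n-grams, a vector can be assembled for out-of-vocabulary words from the sub-word rows; this is why the module exposes the special fastText output alongside the plain word vectors.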

Inputs
======

+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Name | Type(s)                                                                                                                                                                |
+======+========================================================================================================================================================================+
| text | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` <:class:`TokenizedDocumentType <pimlico.datatypes.corpora.tokenized.TokenizedDocumentType>`> |
+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Outputs
=======

+------------+--------------------------------------------------------------------------------+
| Name       | Type(s)                                                                        |
+============+================================================================================+
| embeddings | :class:`embeddings <pimlico.datatypes.embeddings.Embeddings>`                  |
+------------+--------------------------------------------------------------------------------+
| model      | :class:`fasttext_embeddings <pimlico.datatypes.embeddings.FastTextEmbeddings>` |
+------------+--------------------------------------------------------------------------------+

Options
=======

+----------------+----------------------------------------------------------------+--------------------------------+
| Name           | Description                                                    | Type                           |
+================+================================================================+================================+
| bucket         | number of buckets. Default: 2,000,000                          | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| dim            | size of word vectors. Default: 100                             | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| epoch          | number of epochs. Default: 5                                   | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| loss           | loss function: ns, hs, softmax, ova. Default: ns               | 'ns', 'hs', 'softmax' or 'ova' |
+----------------+----------------------------------------------------------------+--------------------------------+
| lr             | learning rate. Default: 0.05                                   | float                          |
+----------------+----------------------------------------------------------------+--------------------------------+
| lr_update_rate | change the rate of updates for the learning rate. Default: 100 | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| maxn           | max length of char ngram. Default: 6                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| min_count      | minimal number of word occurrences. Default: 5                 | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| minn           | min length of char ngram. Default: 3                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| model          | unsupervised fasttext model: cbow, skipgram. Default: skipgram | 'skipgram' or 'cbow'           |
+----------------+----------------------------------------------------------------+--------------------------------+
| neg            | number of negatives sampled. Default: 5                        | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| t              | sampling threshold. Default: 0.0001                            | float                          |
+----------------+----------------------------------------------------------------+--------------------------------+
| verbose        | verbose. Default: 2                                            | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| word_ngrams    | max length of word ngram. Default: 1                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| ws             | size of the context window. Default: 5                         | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
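Most of these options are passed through to the fastText package's ``train_unsupervised()`` function, whose keyword arguments are camelCase rather than snake_case. The helper below is a hypothetical sketch of that renaming (``to_fasttext_kwargs`` is not part of Pimlico); the camelCase names are taken from the fastText Python API.

```python
# Hypothetical helper: rename this module's snake_case options to the
# camelCase keyword arguments expected by fasttext.train_unsupervised().
RENAMES = {
    "min_count": "minCount",
    "word_ngrams": "wordNgrams",
    "lr_update_rate": "lrUpdateRate",
}

def to_fasttext_kwargs(options):
    # Options not listed in RENAMES (dim, epoch, loss, ...) keep their names
    return {RENAMES.get(name, name): value for name, value in options.items()}

opts = {"model": "skipgram", "dim": 100, "min_count": 5, "lr_update_rate": 100}
print(to_fasttext_kwargs(opts))
# → {'model': 'skipgram', 'dim': 100, 'minCount': 5, 'lrUpdateRate': 100}
```

With the fastText package installed, training would then look something like ``fasttext.train_unsupervised("corpus.txt", **to_fasttext_kwargs(opts))``.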

Example config
==============

This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini

   [my_fasttext_module]
   type=pimlico.modules.embeddings.fasttext
   input_text=module_a.some_output

This example usage includes more options.

.. code-block:: ini

   [my_fasttext_module]
   type=pimlico.modules.embeddings.fasttext
   input_text=module_a.some_output
   bucket=2000000
   dim=100
   epoch=5
   loss=ns
   lr=0.05
   lr_update_rate=100
   maxn=6
   min_count=5
   minn=3
   model=skipgram
   neg=5
   t=0.0001
   verbose=2
   word_ngrams=1
   ws=5

Test pipelines
==============

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-embeddings-fasttext.conf`

.. _test-config-embeddings-fasttext.conf:

fasttext\_train
~~~~~~~~~~~~~~~

This is one of the test pipelines included in Pimlico's repository.
See :ref:`test-pipelines` for more details.

Config file
===========

The complete config file for this test pipeline:

.. code-block:: ini

   # Train fastText embeddings on a tiny corpus
   [pipeline]
   name=fasttext_train
   release=latest

   # Take tokenized text input from a prepared Pimlico dataset
   [europarl]
   type=pimlico.datatypes.corpora.GroupedCorpus
   data_point_type=TokenizedDocumentType
   dir=%(test_data_dir)s/datasets/corpora/tokenized

   [fasttext]
   type=pimlico.modules.embeddings.fasttext
   # Set low, since we're training on a tiny corpus
   min_count=1
   # Very small vectors: usually this will be more like 100 or 200
   dim=10
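The ``%(test_data_dir)s`` in the ``dir`` value is a substitution variable filled in when the pipeline is loaded. As an illustration only (Pimlico has its own config parser, and the directory below is made up), Python's ``configparser`` performs the same ``%(...)s`` style of interpolation:

```python
import configparser

# Illustration of %(...)s substitution using the stdlib parser; the
# value of test_data_dir here is a made-up path, not Pimlico's real one.
raw = """
[DEFAULT]
test_data_dir = /home/user/pimlico/test/data

[europarl]
dir = %(test_data_dir)s/datasets/corpora/tokenized
"""

cp = configparser.ConfigParser()
cp.read_string(raw)
print(cp["europarl"]["dir"])
# → /home/user/pimlico/test/data/datasets/corpora/tokenized
```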

Modules
=======

The following Pimlico module types are used in this pipeline:

* :mod:`pimlico.modules.embeddings.fasttext`