Rebuilt documentation
Includes new fasttext module
markgw committed Sep 25, 2020
1 parent 3158a39 commit 14c48fa
Showing 5 changed files with 171 additions and 0 deletions.
122 changes: 122 additions & 0 deletions docs/modules/pimlico.modules.embeddings.fasttext.rst
@@ -0,0 +1,122 @@
fastText embedding trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. py:module:: pimlico.modules.embeddings.fasttext

+------------+-------------------------------------+
| Path       | pimlico.modules.embeddings.fasttext |
+------------+-------------------------------------+
| Executable | yes                                 |
+------------+-------------------------------------+

Train fastText embeddings on a tokenized corpus.

Uses the `fastText Python package <https://fasttext.cc/docs/en/python-module.html>`_.

FastText embeddings store more than just a vector for each word, since they
also include sub-word representations. The module therefore produces two
outputs: a standard embeddings output containing the word vectors, and a
fastText-specific embeddings output.
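
As a rough illustration of what "sub-word representations" refers to, the
sketch below lists the character n-grams of a word, following the scheme
described in the fastText paper: the word is wrapped in ``<`` and ``>``
boundary markers and every n-gram with a length between ``minn`` and ``maxn``
is collected. The function name ``char_ngrams`` is made up for this example
and is not part of Pimlico or the fastText package. The real implementation
also hashes each n-gram into one of ``bucket`` slots, which is not shown here.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of ``word`` with lengths minn..maxn.

    Illustrative only: the word is wrapped in '<' and '>' boundary markers,
    then every substring of each allowed length is collected in order.
    """
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, to keep the output short
trigrams = char_ngrams("where", minn=3, maxn=3)
print(trigrams)
```

With the default ``minn=3`` and ``maxn=6``, the word ``where`` yields 14
n-grams under this scheme.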


*This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.*

Inputs
======

+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Name | Type(s)                                                                                                                                                                |
+======+========================================================================================================================================================================+
| text | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` <:class:`TokenizedDocumentType <pimlico.datatypes.corpora.tokenized.TokenizedDocumentType>`> |
+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
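
The fastText package trains from a plain text file in which tokens are
separated by whitespace. The sketch below shows one way a tokenized corpus
could be written out in that form, one sentence per line. This is an
assumption about the general approach, not a description of what the module
does internally: ``write_training_file`` and the nested-list document layout
are hypothetical.

```python
import os
import tempfile


def write_training_file(documents, path):
    """Write tokenized documents to ``path``, one sentence per line.

    ``documents`` is assumed to be an iterable of documents, each a list of
    sentences, each sentence a list of token strings (hypothetical layout).
    Returns the number of lines written.
    """
    lines = 0
    with open(path, "w", encoding="utf-8") as out:
        for sentences in documents:
            for tokens in sentences:
                out.write(" ".join(tokens) + "\n")
                lines += 1
    return lines


# Tiny made-up corpus: one document containing two sentences
corpus = [[["the", "cat", "sat"], ["it", "purred"]]]
train_path = os.path.join(tempfile.mkdtemp(), "train.txt")
num_lines = write_training_file(corpus, train_path)
```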

Outputs
=======

+------------+--------------------------------------------------------------------------------+
| Name       | Type(s)                                                                        |
+============+================================================================================+
| embeddings | :class:`embeddings <pimlico.datatypes.embeddings.Embeddings>`                  |
+------------+--------------------------------------------------------------------------------+
| model      | :class:`fasttext_embeddings <pimlico.datatypes.embeddings.FastTextEmbeddings>` |
+------------+--------------------------------------------------------------------------------+


Options
=======

+----------------+----------------------------------------------------------------+--------------------------------+
| Name           | Description                                                    | Type                           |
+================+================================================================+================================+
| bucket         | number of buckets. Default: 2,000,000                          | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| dim            | size of word vectors. Default: 100                             | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| epoch          | number of epochs. Default: 5                                   | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| loss           | loss function: ns, hs, softmax, ova. Default: ns               | 'ns', 'hs', 'softmax' or 'ova' |
+----------------+----------------------------------------------------------------+--------------------------------+
| lr             | learning rate. Default: 0.05                                   | float                          |
+----------------+----------------------------------------------------------------+--------------------------------+
| lr_update_rate | change the rate of updates for the learning rate. Default: 100 | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| maxn           | max length of char ngram. Default: 6                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| min_count      | minimal number of word occurrences. Default: 5                 | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| minn           | min length of char ngram. Default: 3                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| model          | unsupervised fasttext model: cbow, skipgram. Default: skipgram | 'skipgram' or 'cbow'           |
+----------------+----------------------------------------------------------------+--------------------------------+
| neg            | number of negatives sampled. Default: 5                        | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| t              | sampling threshold. Default: 0.0001                            | float                          |
+----------------+----------------------------------------------------------------+--------------------------------+
| verbose        | verbose. Default: 2                                            | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| word_ngrams    | max length of word ngram. Default: 1                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| ws             | size of the context window. Default: 5                         | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
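
The option names in this table are written in snake_case, whereas the
``train_unsupervised`` function of the fastText Python package takes a few of
the same parameters in camelCase (``minCount``, ``wordNgrams``,
``lrUpdateRate``). The sketch below shows how a dictionary of options could be
translated and passed on. How the module really forwards its options is an
assumption here, and ``to_fasttext_kwargs`` and ``train`` are hypothetical
helpers, not part of Pimlico.

```python
# Option names that differ from the fastText Python keyword arguments.
# All other options in the table keep the same name.
RENAMED_OPTIONS = {
    "lr_update_rate": "lrUpdateRate",
    "min_count": "minCount",
    "word_ngrams": "wordNgrams",
}


def to_fasttext_kwargs(options):
    """Translate snake_case option names to fastText keyword arguments."""
    return {RENAMED_OPTIONS.get(name, name): value
            for name, value in options.items()}


def train(corpus_path, options):
    """Train unsupervised fastText vectors on a whitespace-tokenized file.

    Requires the third-party ``fasttext`` package; imported lazily so the
    option translation above can be used without it.
    """
    import fasttext
    return fasttext.train_unsupervised(input=corpus_path,
                                       **to_fasttext_kwargs(options))


# Settings similar to a small test run: tiny vectors, keep every word
kwargs = to_fasttext_kwargs({"min_count": 1, "dim": 10, "lr_update_rate": 100})
print(kwargs)
```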

Example config
==============

This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini

   [my_fasttext_module]
   type=pimlico.modules.embeddings.fasttext
   input_text=module_a.some_output

This example usage includes more options.

.. code-block:: ini

   [my_fasttext_module]
   type=pimlico.modules.embeddings.fasttext
   input_text=module_a.some_output
   bucket=2000000
   dim=100
   epoch=5
   loss=ns
   lr=0.05
   lr_update_rate=100
   maxn=6
   min_count=5
   minn=3
   model=skipgram
   neg=5
   t=0.0001
   verbose=2
   word_ngrams=1
   ws=5

Test pipelines
==============

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-embeddings-fasttext.conf`

1 change: 1 addition & 0 deletions docs/modules/pimlico.modules.embeddings.rst
@@ -18,6 +18,7 @@ provided by sklearn.
:maxdepth: 2
:titlesonly:

pimlico.modules.embeddings.fasttext
pimlico.modules.embeddings.normalize
pimlico.modules.embeddings.store_embeddings
pimlico.modules.embeddings.store_tsv
46 changes: 46 additions & 0 deletions docs/test_config/embeddings.fasttext.conf.rst
@@ -0,0 +1,46 @@
.. _test-config-embeddings-fasttext.conf:

fasttext\_train
~~~~~~~~~~~~~~~



This is one of the test pipelines included in Pimlico's repository.
See :ref:`test-pipelines` for more details.

Config file
===========

The complete config file for this test pipeline:


.. code-block:: ini

   # Train fastText embeddings on a tiny corpus
   [pipeline]
   name=fasttext_train
   release=latest

   # Take tokenized text input from a prepared Pimlico dataset
   [europarl]
   type=pimlico.datatypes.corpora.GroupedCorpus
   data_point_type=TokenizedDocumentType
   dir=%(test_data_dir)s/datasets/corpora/tokenized

   [fasttext]
   type=pimlico.modules.embeddings.fasttext
   # Set low, since we're training on a tiny corpus
   min_count=1
   # Very small vectors: usually this will be more like 100 or 200
   dim=10

Modules
=======


The following Pimlico module types are used in this pipeline:

* :mod:`pimlico.modules.embeddings.fasttext`


1 change: 1 addition & 0 deletions docs/test_config/index.rst
@@ -26,6 +26,7 @@ Available pipelines
embeddings.word2vec.conf.rst
embeddings.store_word2vec.conf.rst
embeddings.store_tsv.conf.rst
embeddings.fasttext.conf.rst
corpora.interleave.conf.rst
corpora.list_filter.conf.rst
corpora.vocab_unmapper.conf.rst
1 change: 1 addition & 0 deletions docs/test_config/module_list.tsv
@@ -7,6 +7,7 @@ test-config-embeddings-normalize.conf pimlico.modules.embeddings.normalize
test-config-embeddings-word2vec.conf pimlico.modules.embeddings.word2vec
test-config-embeddings-store_word2vec.conf pimlico.modules.embeddings.store_word2vec
test-config-embeddings-store_tsv.conf pimlico.modules.embeddings.store_tsv
test-config-embeddings-fasttext.conf pimlico.modules.embeddings.fasttext
test-config-corpora-interleave.conf pimlico.modules.corpora.interleave, pimlico.modules.corpora.format
test-config-corpora-list_filter.conf pimlico.modules.corpora.list_filter
test-config-corpora-vocab_unmapper.conf pimlico.modules.corpora.vocab_unmapper