fastText embedding trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. py:module:: pimlico.modules.embeddings.fasttext

+------------+-------------------------------------+
| Path       | pimlico.modules.embeddings.fasttext |
+------------+-------------------------------------+
| Executable | yes                                 |
+------------+-------------------------------------+

Train fastText embeddings on a tokenized corpus.

Uses the `fastText Python package <https://fasttext.cc/docs/en/python-module.html>`_.

fastText embeddings store more than just a vector for each word, since they
also include sub-word representations. We therefore produce a standard embeddings
output, containing the word vectors, and also a special fastText embeddings output.

*This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.*
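To make the sub-word idea concrete, here is a minimal illustrative sketch (not Pimlico's or the fastText package's actual code) of how a word is broken into character n-grams between ``minn`` and ``maxn`` characters long, with ``<`` and ``>`` marking word boundaries, and how each n-gram can be hashed into one of a fixed number of buckets. The 32-bit FNV-1a hash is the style of hash fastText uses internally, but treat the details here as an approximation.

```python
def char_ngrams(word, minn=3, maxn=6):
    """All character n-grams of length minn..maxn, with boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

def fnv1a(s):
    """32-bit FNV-1a hash of a string's UTF-8 bytes."""
    h = 2166136261
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

def ngram_buckets(word, bucket=2000000, minn=3, maxn=6):
    """Map each of a word's n-grams to a row index of the sub-word matrix."""
    return [fnv1a(g) % bucket for g in char_ngrams(word, minn, maxn)]

print(char_ngrams("where", minn=3, maxn=4))
# → ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```

Because unseen words still yield n-grams, a vector can be assembled for out-of-vocabulary words from the sub-word rows; this is why the module exposes the special fastText output alongside the plain word vectors.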

Inputs
======

+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Name | Type(s)                                                                                                                                                                |
+======+========================================================================================================================================================================+
| text | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` <:class:`TokenizedDocumentType <pimlico.datatypes.corpora.tokenized.TokenizedDocumentType>`> |
+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Outputs
=======

+------------+--------------------------------------------------------------------------------+
| Name       | Type(s)                                                                        |
+============+================================================================================+
| embeddings | :class:`embeddings <pimlico.datatypes.embeddings.Embeddings>`                  |
+------------+--------------------------------------------------------------------------------+
| model      | :class:`fasttext_embeddings <pimlico.datatypes.embeddings.FastTextEmbeddings>` |
+------------+--------------------------------------------------------------------------------+

Options
=======

+----------------+----------------------------------------------------------------+--------------------------------+
| Name           | Description                                                    | Type                           |
+================+================================================================+================================+
| bucket         | number of buckets. Default: 2,000,000                          | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| dim            | size of word vectors. Default: 100                             | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| epoch          | number of epochs. Default: 5                                   | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| loss           | loss function: ns, hs, softmax, ova. Default: ns               | 'ns', 'hs', 'softmax' or 'ova' |
+----------------+----------------------------------------------------------------+--------------------------------+
| lr             | learning rate. Default: 0.05                                   | float                          |
+----------------+----------------------------------------------------------------+--------------------------------+
| lr_update_rate | change the rate of updates for the learning rate. Default: 100 | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| maxn           | max length of char ngram. Default: 6                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| min_count      | minimal number of word occurrences. Default: 5                 | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| minn           | min length of char ngram. Default: 3                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| model          | unsupervised fasttext model: cbow, skipgram. Default: skipgram | 'skipgram' or 'cbow'           |
+----------------+----------------------------------------------------------------+--------------------------------+
| neg            | number of negatives sampled. Default: 5                        | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| t              | sampling threshold. Default: 0.0001                            | float                          |
+----------------+----------------------------------------------------------------+--------------------------------+
| verbose        | verbose. Default: 2                                            | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| word_ngrams    | max length of word ngram. Default: 1                           | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
| ws             | size of the context window. Default: 5                         | int                            |
+----------------+----------------------------------------------------------------+--------------------------------+
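Most of these options are passed through to the fastText package's ``train_unsupervised()`` function, whose keyword arguments are camelCase rather than snake_case. The helper below is a hypothetical sketch of that renaming (``to_fasttext_kwargs`` is not part of Pimlico); the camelCase names are taken from the fastText Python API.

```python
# Hypothetical helper: rename this module's snake_case options to the
# camelCase keyword arguments expected by fasttext.train_unsupervised().
RENAMES = {
    "min_count": "minCount",
    "word_ngrams": "wordNgrams",
    "lr_update_rate": "lrUpdateRate",
}

def to_fasttext_kwargs(options):
    # Options not listed in RENAMES (dim, epoch, loss, ...) keep their names
    return {RENAMES.get(name, name): value for name, value in options.items()}

opts = {"model": "skipgram", "dim": 100, "min_count": 5, "lr_update_rate": 100}
print(to_fasttext_kwargs(opts))
# → {'model': 'skipgram', 'dim': 100, 'minCount': 5, 'lrUpdateRate': 100}
```

With the fastText package installed, training would then look something like ``fasttext.train_unsupervised("corpus.txt", **to_fasttext_kwargs(opts))``.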

Example config
==============

This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini

   [my_fasttext_module]
   type=pimlico.modules.embeddings.fasttext
   input_text=module_a.some_output

This example usage includes more options.

.. code-block:: ini

   [my_fasttext_module]
   type=pimlico.modules.embeddings.fasttext
   input_text=module_a.some_output
   bucket=2000000
   dim=100
   epoch=5
   loss=ns
   lr=0.05
   lr_update_rate=100
   maxn=6
   min_count=5
   minn=3
   model=skipgram
   neg=5
   t=0.0001
   verbose=2
   word_ngrams=1
   ws=5

Test pipelines
==============

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-embeddings-fasttext.conf`

.. _test-config-embeddings-fasttext.conf:

fasttext\_train
~~~~~~~~~~~~~~~

This is one of the test pipelines included in Pimlico's repository.
See :ref:`test-pipelines` for more details.

Config file
===========

The complete config file for this test pipeline:

.. code-block:: ini

   # Train fastText embeddings on a tiny corpus
   [pipeline]
   name=fasttext_train
   release=latest

   # Take tokenized text input from a prepared Pimlico dataset
   [europarl]
   type=pimlico.datatypes.corpora.GroupedCorpus
   data_point_type=TokenizedDocumentType
   dir=%(test_data_dir)s/datasets/corpora/tokenized

   [fasttext]
   type=pimlico.modules.embeddings.fasttext
   # Set low, since we're training on a tiny corpus
   min_count=1
   # Very small vectors: usually this will be more like 100 or 200
   dim=10
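The ``%(test_data_dir)s`` in the ``dir`` value is a substitution variable filled in when the pipeline is loaded. As an illustration only (Pimlico has its own config parser, and the directory below is made up), Python's ``configparser`` performs the same ``%(...)s`` style of interpolation:

```python
import configparser

# Illustration of %(...)s substitution using the stdlib parser; the
# value of test_data_dir here is a made-up path, not Pimlico's real one.
raw = """
[DEFAULT]
test_data_dir = /home/user/pimlico/test/data

[europarl]
dir = %(test_data_dir)s/datasets/corpora/tokenized
"""

cp = configparser.ConfigParser()
cp.read_string(raw)
print(cp["europarl"]["dir"])
# → /home/user/pimlico/test/data/datasets/corpora/tokenized
```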

Modules
=======

The following Pimlico module types are used in this pipeline:

* :mod:`pimlico.modules.embeddings.fasttext`