Updated docs

markgw · Jul 7, 2020 · 7dd7a14
1 parent 61e1dcb

Showing 30 changed files with 816 additions and 105 deletions.
diff --git a/docs/Makefile b/docs/Makefile
@@ -31,6 +31,7 @@ I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
 
 help:
 	@echo "Please use \`make <target>' where <target> is one of"
+	@echo "  all        build all Pimlico docs"
 	@echo "  api        automatically generate API docs in api/"
 	@echo "  modules    automatically generate special API docs for Pimlico modules (in $(MODULEDIR)/)"
 	@echo "  commands   automatically generate special API docs for Pimlico command-line commands (in $(COMMANDDIR)/)"

diff --git a/docs/modules/pimlico.modules.corpora.group.rst b/docs/modules/pimlico.modules.corpora.group.rst
@@ -89,5 +89,5 @@ Test pipelines
 
 This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.
 
- * :ref:`test-config-store.conf`
- * :ref:`test-config-group.conf`
+ * :ref:`test-config-group.conf`
+ * :ref:`test-config-store.conf`
diff --git a/docs/modules/pimlico.modules.corpora.rst b/docs/modules/pimlico.modules.corpora.rst
@@ -20,10 +20,12 @@ Core modules for generic manipulation of mainly iterable corpora.
    pimlico.modules.corpora.interleave
    pimlico.modules.corpora.list_filter
    pimlico.modules.corpora.shuffle
+   pimlico.modules.corpora.shuffle_linear
    pimlico.modules.corpora.split
    pimlico.modules.corpora.store
    pimlico.modules.corpora.subsample
    pimlico.modules.corpora.subset
    pimlico.modules.corpora.vocab_builder
    pimlico.modules.corpora.vocab_counter
    pimlico.modules.corpora.vocab_mapper
+   pimlico.modules.corpora.vocab_unmapper
diff --git a/docs/modules/pimlico.modules.corpora.shuffle.rst b/docs/modules/pimlico.modules.corpora.shuffle.rst
@@ -12,20 +12,28 @@ Random shuffle
 Randomly shuffles all the documents in a grouped corpus, outputting
 them to a new set of archives with the same sizes as the input archives.
 
-It is difficult to do this efficiently for a large corpus.
-We use a strategy where the input documents are read in linear order
-and placed into a temporary set of small archives ("bins"). Then these are
+This was difficult to do efficiently for a large corpus using the
+old tar storage format. A strategy was therefore previously implemented
+here in which the input documents were read in linear order
+and placed into a temporary set of small archives ("bins"), which were then
 concatenated into the larger archives, shuffling the documents in memory
 in each during the process.
 
-The expected average size of the temporary bins can be set using the
-``bin_size`` parameter. Alternatively, the exact total number of
-bins to use can be set using the ``num_bins`` parameter.
+It is no longer necessary to do this, since the standard pipeline-internal
+storage format permits efficient random access. However, it may sometimes
+be necessary to use the linear-reading strategy: for example, if the input
+comes from a filter module, its documents cannot be randomly accessed.
 
-It may be necessary to lower the bin size if, for example, your
-individual documents are very large files. You might also find the
-process is noticeably faster with a higher bin size if your files
-are small.
+.. todo::
+
+   Currently, this accepts any GroupedCorpus as input, but checks at runtime
+   that the input is stored using the pipeline-internal format. It would be
+   much better if this check could be enforced at the level of datatypes, so
+   that the input datatype requirement explicitly rules out grouped corpora
+   coming from input readers, filters or other dynamic sources.
+
+   Since this requires some tricky changes to the datatype system, I'm not
+   implementing it now, but it should be done in future.
 
 
 Inputs
@@ -50,17 +58,13 @@ Outputs
 Options
 =======
 
-+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
-| Name               | Description                                                                                                                                                                                                                                                                                                                | Type   |
-+====================+============================================================================================================================================================================================================================================================================================================================+========+
-| archive_basename   | Basename to use for archives in the output corpus. Default: 'archive'                                                                                                                                                                                                                                                      | string |
-+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
-| bin_size           | Target expected size of temporary bins into which documents are shuffled. The actual size may vary, but they will on average have this size. Default: 100                                                                                                                                                                  | int    |
-+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
-| keep_archive_names | By default, it is assumed that all doc names are unique to the whole corpus, so the same doc names are used once the documents are put into their new archives. If doc names are only unique within the input archives, use this and the input archive names will be included in the output document names. Default: False | bool   |
-+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
-| num_bins           | Directly set the number of temporary bins to put document into. If set, bin_size is ignored                                                                                                                                                                                                                                | int    |
-+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
++------------------+---------------------------------------------------------------------------------------------------+--------+
+| Name             | Description                                                                                       | Type   |
++==================+===================================================================================================+========+
+| archive_basename | Basename to use for archives in the output corpus. Default: 'archive'                             | string |
++------------------+---------------------------------------------------------------------------------------------------+--------+
+| seed             | Seed for the random number generator. The RNG is always seeded, for reproducibility. Default: 999 | int    |
++------------------+---------------------------------------------------------------------------------------------------+--------+
 
 Example config
 ==============
@@ -82,9 +86,7 @@ This example usage includes more options.
    type=pimlico.modules.corpora.shuffle
    input_corpus=module_a.some_output
    archive_basename=archive
-   bin_size=100
-   keep_archive_names=F
-   num_bins=0
+   seed=999
 
 Test pipelines
 ==============

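Note on the updated ``shuffle`` docs above: the random-access approach with a fixed seed can be pictured with the following minimal sketch. It is not Pimlico's implementation. The function name ``shuffle_grouped`` and the use of plain Python lists in place of archives are assumptions made for illustration; only the ``seed`` default of 999 and the "same sizes as the input archives" behaviour come from the docs.

```python
import random


def shuffle_grouped(archives, seed=999):
    """Shuffle documents across archives, keeping each archive's size.

    `archives` is a list of (archive_name, [doc, ...]) pairs. Plain lists
    stand in for stored archives that allow random access (hypothetical).
    """
    # Always seed the RNG so that repeated runs give the same ordering
    rng = random.Random(seed)

    # With random access available, all documents can be reordered at once
    docs = [doc for _, group in archives for doc in group]
    rng.shuffle(docs)

    # Refill archives so that each output has the same size as its input
    shuffled = []
    start = 0
    for name, group in archives:
        shuffled.append((name, docs[start:start + len(group)]))
        start += len(group)
    return shuffled
```

Calling ``shuffle_grouped`` twice with the same seed returns the same ordering, which is the reproducibility property the ``seed`` option describes.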
diff --git a/docs/modules/pimlico.modules.corpora.shuffle_linear.rst b/docs/modules/pimlico.modules.corpora.shuffle_linear.rst
@@ -0,0 +1,106 @@
+Random shuffle
+~~~~~~~~~~~~~~
+
+.. py:module:: pimlico.modules.corpora.shuffle_linear
+
++------------+----------------------------------------+
+| Path       | pimlico.modules.corpora.shuffle_linear |
++------------+----------------------------------------+
+| Executable | yes                                    |
++------------+----------------------------------------+
+
+Randomly shuffles all the documents in a grouped corpus, outputting
+them to a new set of archives with the same sizes as the input archives.
+
+It is difficult to do this efficiently for a large corpus when we cannot
+randomly access the input documents. Under the old, now deprecated,
+tar-based storage format, random access was costly. If a corpus is
+produced on the fly, e.g. from a filter or input reader, random access
+is impossible.
+
+We use a strategy where the input documents are read in linear order
+and placed into a temporary set of small archives ("bins"). Then these are
+concatenated into the larger archives, shuffling the documents in memory
+in each during the process.
+
+The expected average size of the temporary bins can be set using the
+``bin_size`` parameter. Alternatively, the exact total number of
+bins to use can be set using the ``num_bins`` parameter.
+
+It may be necessary to lower the bin size if, for example, your
+individual documents are very large files. You might also find the
+process is noticeably faster with a higher bin size if your files
+are small.
+
+.. seealso::
+
+   Module type :mod:`pimlico.modules.corpora.shuffle`
+      If the input corpus is not dynamically produced and is therefore
+      randomly accessible, it is more efficient to use the ``shuffle``
+      module type.
+
+
+Inputs
+======
+
++--------+---------------------------------------------------------------------------+
+| Name   | Type(s)                                                                   |
++========+===========================================================================+
+| corpus | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` |
++--------+---------------------------------------------------------------------------+
+
+Outputs
+=======
+
++--------+-----------------------------------------------------------------------------------------------+
+| Name   | Type(s)                                                                                       |
++========+===============================================================================================+
+| corpus | :class:`grouped corpus with input doc type <pimlico.datatypes.corpora.grouped.GroupedCorpus>` |
++--------+-----------------------------------------------------------------------------------------------+
+
+
+Options
+=======
+
++--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| Name               | Description                                                                                                                                                                                                                                                                                                                | Type   |
++====================+============================================================================================================================================================================================================================================================================================================================+========+
+| archive_basename   | Basename to use for archives in the output corpus. Default: 'archive'                                                                                                                                                                                                                                                      | string |
++--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| bin_size           | Target expected size of temporary bins into which documents are shuffled. The actual size may vary, but they will on average have this size. Default: 100                                                                                                                                                                  | int    |
++--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| keep_archive_names | By default, it is assumed that all doc names are unique to the whole corpus, so the same doc names are used once the documents are put into their new archives. If doc names are only unique within the input archives, use this and the input archive names will be included in the output document names. Default: False | bool   |
++--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| num_bins           | Directly set the number of temporary bins to put document into. If set, bin_size is ignored                                                                                                                                                                                                                                | int    |
++--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+
+Example config
+==============
+
+This is an example of how this module can be used in a pipeline config file.
+
+.. code-block:: ini
+   
+   [my_shuffle_module]
+   type=pimlico.modules.corpora.shuffle_linear
+   input_corpus=module_a.some_output
+   
+
+This example usage includes more options.
+
+.. code-block:: ini
+   
+   [my_shuffle_module]
+   type=pimlico.modules.corpora.shuffle_linear
+   input_corpus=module_a.some_output
+   archive_basename=archive
+   bin_size=100
+   keep_archive_names=F
+   num_bins=0
+
+Test pipelines
+==============
+
+This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.
+
+ * :ref:`test-config-shuffle_linear.conf`
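
Note on the new ``shuffle_linear`` docs above: the bin-based strategy can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual code. The function name, the ``archive_sizes`` argument, the ``seed`` parameter and the in-memory lists standing in for temporary on-disk bins are all hypothetical; only the idea of reading linearly into ``num_bins`` bins and shuffling each bin in memory while concatenating comes from the docs.

```python
import random
from itertools import islice


def shuffle_linear(doc_stream, archive_sizes, num_bins=4, seed=999):
    """Shuffle a stream of documents that can only be read once, in order.

    `doc_stream` is any iterable of documents; `archive_sizes` lists the
    sizes of the output archives (hypothetical stand-ins for real archives).
    """
    rng = random.Random(seed)

    # Pass 1: read linearly, dropping each document into a random bin.
    # The real module would write to temporary archives; lists stand in here.
    bins = [[] for _ in range(num_bins)]
    for doc in doc_stream:
        bins[rng.randrange(num_bins)].append(doc)

    # Pass 2: shuffle one bin at a time in memory and emit its documents,
    # so only a single bin needs to be held in memory while shuffling.
    def shuffled_docs():
        for bin_docs in bins:
            rng.shuffle(bin_docs)
            yield from bin_docs

    # Cut the concatenated stream into archives of the requested sizes
    docs = shuffled_docs()
    return [list(islice(docs, size)) for size in archive_sizes]
```

In this sketch, more bins means smaller bins and less memory used per in-memory shuffle, which matches the advice to lower the bin size when individual documents are very large.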
diff --git a/docs/modules/pimlico.modules.corpora.store.rst b/docs/modules/pimlico.modules.corpora.store.rst
@@ -54,5 +54,7 @@ Test pipelines
 
 This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.
 
- * :ref:`test-config-filter_map.conf`
- * :ref:`test-config-filter_tokenize.conf`
+ * :ref:`test-config-filter_tokenize.conf`
+ * :ref:`test-config-europarl.conf`
+ * :ref:`test-config-raw_text_files.conf`
+ * :ref:`test-config-filter_map.conf`
diff --git a/docs/modules/pimlico.modules.corpora.vocab_mapper.rst b/docs/modules/pimlico.modules.corpora.vocab_mapper.rst
@@ -85,5 +85,5 @@ Test pipelines
 
 This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.
 
- * :ref:`test-config-vocab_mapper_longer.conf`
- * :ref:`test-config-vocab_mapper.conf`
+ * :ref:`test-config-vocab_mapper.conf`
+ * :ref:`test-config-vocab_mapper_longer.conf`