Updated docs
markgw committed Jul 7, 2020
1 parent 61e1dcb commit 7dd7a14
Showing 30 changed files with 816 additions and 105 deletions.
1 change: 1 addition & 0 deletions docs/Makefile
@@ -31,6 +31,7 @@ I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .

help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " all build all Pimlico docs"
@echo " api automatically generate API docs in api/"
@echo " modules automatically generate special API docs for Pimlico modules (in $(MODULEDIR)/)"
@echo " commands automatically generate special API docs for Pimlico command-line commands (in $(COMMANDDIR)/)"
4 changes: 2 additions & 2 deletions docs/modules/pimlico.modules.corpora.group.rst
@@ -89,5 +89,5 @@ Test pipelines

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-store.conf`
* :ref:`test-config-group.conf`
* :ref:`test-config-group.conf`
* :ref:`test-config-store.conf`
2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.rst
@@ -20,10 +20,12 @@ Core modules for generic manipulation of mainly iterable corpora.
pimlico.modules.corpora.interleave
pimlico.modules.corpora.list_filter
pimlico.modules.corpora.shuffle
pimlico.modules.corpora.shuffle_linear
pimlico.modules.corpora.split
pimlico.modules.corpora.store
pimlico.modules.corpora.subsample
pimlico.modules.corpora.subset
pimlico.modules.corpora.vocab_builder
pimlico.modules.corpora.vocab_counter
pimlico.modules.corpora.vocab_mapper
pimlico.modules.corpora.vocab_unmapper
50 changes: 26 additions & 24 deletions docs/modules/pimlico.modules.corpora.shuffle.rst
@@ -12,20 +12,28 @@ Random shuffle
Randomly shuffles all the documents in a grouped corpus, outputting
them to a new set of archives with the same sizes as the input archives.

It is difficult to do this efficiently for a large corpus.
We use a strategy where the input documents are read in linear order
and placed into a temporary set of small archives ("bins"). Then these are
This was difficult to do efficiently for a large corpus using the
old tar storage format. A strategy was therefore implemented
here in which the input documents were read in linear order
and placed into a temporary set of small archives ("bins"), which were then
concatenated into the larger archives, shuffling the documents in memory
in each during the process.

The expected average size of the temporary bins can be set using the
``bin_size`` parameter. Alternatively, the exact total number of
bins to use can be set using the ``num_bins`` parameter.
It is no longer necessary to do this, since the standard pipeline-internal
storage format permits efficient random access. However, it may sometimes
be necessary to use the linear-reading strategy: for example, if the input
comes from a filter module, its documents cannot be randomly accessed.
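
As a rough illustration of why random access makes the task simple, the sketch below draws a permutation of document indices with a seeded RNG and cuts it into output archives of the same sizes as the input archives. This is not Pimlico's implementation: the function name and the index-based interface are hypothetical, and only the default seed of 999 is taken from the ``seed`` option documented for this module.

```python
import random


def shuffle_random_access(archive_sizes, seed=999):
    """Return, for each output archive, the input document indices it receives.

    Hypothetical helper: documents are identified by their position in the
    input corpus, which is only meaningful if the input is randomly accessible.
    """
    rng = random.Random(seed)  # always seeded, so the result is reproducible
    order = list(range(sum(archive_sizes)))
    rng.shuffle(order)

    # Cut the permutation into chunks matching the input archive sizes
    archives, start = [], 0
    for size in archive_sizes:
        archives.append(order[start:start + size])
        start += size
    return archives
```

Each output archive can then be written by fetching the listed documents directly, with no temporary bins. Calling the function twice with the same seed yields the same assignment.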

It may be necessary to lower the bin size if, for example, your
individual documents are very large files. You might also find the
process is noticeably faster with a higher bin size if your files
are small.
.. todo::

   Currently, this accepts any GroupedCorpus as input, but checks at runtime
   that the input is stored using the pipeline-internal format. It would be
   much better if this check could be enforced at the level of datatypes, so
   that the input datatype requirement explicitly rules out grouped corpora
   coming from input readers, filters or other dynamic sources.

   Since this requires some tricky changes to the datatype system, I'm not
   implementing it now, but it should be done in future.


Inputs
@@ -50,17 +58,13 @@ Outputs
Options
=======

+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| Name | Description | Type |
+====================+============================================================================================================================================================================================================================================================================================================================+========+
| archive_basename | Basename to use for archives in the output corpus. Default: 'archive' | string |
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| bin_size | Target expected size of temporary bins into which documents are shuffled. The actual size may vary, but they will on average have this size. Default: 100 | int |
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| keep_archive_names | By default, it is assumed that all doc names are unique to the whole corpus, so the same doc names are used once the documents are put into their new archives. If doc names are only unique within the input archives, use this and the input archive names will be included in the output document names. Default: False | bool |
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| num_bins | Directly set the number of temporary bins to put documents into. If set, bin_size is ignored | int |
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+------------------+---------------------------------------------------------------------------------------------------+--------+
| Name | Description | Type |
+==================+===================================================================================================+========+
| archive_basename | Basename to use for archives in the output corpus. Default: 'archive' | string |
+------------------+---------------------------------------------------------------------------------------------------+--------+
| seed | Seed for the random number generator. The RNG is always seeded, for reproducibility. Default: 999 | int |
+------------------+---------------------------------------------------------------------------------------------------+--------+

Example config
==============
@@ -82,9 +86,7 @@ This example usage includes more options.
type=pimlico.modules.corpora.shuffle
input_corpus=module_a.some_output
archive_basename=archive
bin_size=100
keep_archive_names=F
num_bins=0
seed=999
Test pipelines
==============
106 changes: 106 additions & 0 deletions docs/modules/pimlico.modules.corpora.shuffle_linear.rst
@@ -0,0 +1,106 @@
Random shuffle
~~~~~~~~~~~~~~

.. py:module:: pimlico.modules.corpora.shuffle_linear

+------------+----------------------------------------+
| Path | pimlico.modules.corpora.shuffle_linear |
+------------+----------------------------------------+
| Executable | yes |
+------------+----------------------------------------+

Randomly shuffles all the documents in a grouped corpus, outputting
them to a new set of archives with the same sizes as the input archives.

It is difficult to do this efficiently for a large corpus when we cannot
randomly access the input documents. Under the old, now deprecated,
tar-based storage format, random access was costly. If a corpus is
produced on the fly, e.g. from a filter or input reader, random access
is impossible.

We use a strategy where the input documents are read in linear order
and placed into a temporary set of small archives ("bins"). Then these are
concatenated into the larger archives, shuffling the documents in memory
in each during the process.

The expected average size of the temporary bins can be set using the
``bin_size`` parameter. Alternatively, the exact total number of
bins to use can be set using the ``num_bins`` parameter.

It may be necessary to lower the bin size if, for example, your
individual documents are very large files. You might also find the
process is noticeably faster with a higher bin size if your files
are small.
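
The strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: in-memory lists stand in for the temporary bin archives on disk, the function name and the ``seed`` argument are hypothetical, and deriving the number of bins as the corpus size divided by ``bin_size`` is an assumption about how the two options relate.

```python
import itertools
import math
import random


def shuffle_linear(docs, archive_sizes, bin_size=100, num_bins=None, seed=0):
    """Shuffle documents that can only be read once, in linear order.

    docs: iterable of documents (e.g. (name, text) pairs)
    archive_sizes: sizes of the output archives, summing to the corpus size
    """
    rng = random.Random(seed)
    total = sum(archive_sizes)
    if not num_bins:
        # Assumed relation: enough bins that each holds bin_size docs on average
        num_bins = max(1, math.ceil(total / bin_size))

    # Pass 1: read linearly, dropping each document into a random bin
    bins = [[] for _ in range(num_bins)]
    for doc in docs:
        bins[rng.randrange(num_bins)].append(doc)

    # Pass 2: shuffle each bin in memory and concatenate the bins
    def shuffled_stream():
        for current_bin in bins:
            rng.shuffle(current_bin)
            yield from current_bin

    # Cut the concatenated stream into archives of the original sizes
    stream = shuffled_stream()
    return [list(itertools.islice(stream, size)) for size in archive_sizes]
```

Only one bin is shuffled in memory at a time in the second pass, which is why a smaller bin size helps when individual documents are very large.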

.. seealso::

   Module type :mod:`pimlico.modules.corpora.shuffle`
      If the input corpus is not dynamically produced and is therefore
      randomly accessible, it is more efficient to use the ``shuffle``
      module type.


Inputs
======

+--------+---------------------------------------------------------------------------+
| Name | Type(s) |
+========+===========================================================================+
| corpus | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` |
+--------+---------------------------------------------------------------------------+

Outputs
=======

+--------+-----------------------------------------------------------------------------------------------+
| Name | Type(s) |
+========+===============================================================================================+
| corpus | :class:`grouped corpus with input doc type <pimlico.datatypes.corpora.grouped.GroupedCorpus>` |
+--------+-----------------------------------------------------------------------------------------------+


Options
=======

+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| Name | Description | Type |
+====================+============================================================================================================================================================================================================================================================================================================================+========+
| archive_basename | Basename to use for archives in the output corpus. Default: 'archive' | string |
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| bin_size | Target expected size of temporary bins into which documents are shuffled. The actual size may vary, but they will on average have this size. Default: 100 | int |
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| keep_archive_names | By default, it is assumed that all doc names are unique to the whole corpus, so the same doc names are used once the documents are put into their new archives. If doc names are only unique within the input archives, use this and the input archive names will be included in the output document names. Default: False | bool |
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| num_bins | Directly set the number of temporary bins to put documents into. If set, bin_size is ignored | int |
+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
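
To see why ``keep_archive_names`` matters, consider two input archives that each contain a document with the same name. The sketch below is purely illustrative: the helper and the ``/`` separator are assumptions, since the exact format of the output document names is not given here.

```python
def output_doc_name(archive_name, doc_name, keep_archive_names=False):
    """Hypothetical naming helper; the "/" separator is an assumption."""
    if keep_archive_names:
        return "%s/%s" % (archive_name, doc_name)
    return doc_name


# Two documents whose names are unique only within their own archive
docs = [("archive_0", "doc_1"), ("archive_1", "doc_1")]

# Without the option, the names collide once the archives are mixed
plain = [output_doc_name(a, d) for a, d in docs]

# With the option, the archive name keeps them distinct
qualified = [output_doc_name(a, d, keep_archive_names=True) for a, d in docs]
```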

Example config
==============

This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini

   [my_shuffle_module]
   type=pimlico.modules.corpora.shuffle_linear
   input_corpus=module_a.some_output

This example usage includes more options.

.. code-block:: ini

   [my_shuffle_module]
   type=pimlico.modules.corpora.shuffle_linear
   input_corpus=module_a.some_output
   archive_basename=archive
   bin_size=100
   keep_archive_names=F
   num_bins=0

Test pipelines
==============

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-shuffle_linear.conf`
6 changes: 4 additions & 2 deletions docs/modules/pimlico.modules.corpora.store.rst
@@ -54,5 +54,7 @@ Test pipelines

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-filter_map.conf`
* :ref:`test-config-filter_tokenize.conf`
* :ref:`test-config-filter_tokenize.conf`
* :ref:`test-config-europarl.conf`
* :ref:`test-config-raw_text_files.conf`
* :ref:`test-config-filter_map.conf`
4 changes: 2 additions & 2 deletions docs/modules/pimlico.modules.corpora.vocab_mapper.rst
@@ -85,5 +85,5 @@ Test pipelines

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-vocab_mapper_longer.conf`
* :ref:`test-config-vocab_mapper.conf`
* :ref:`test-config-vocab_mapper.conf`
* :ref:`test-config-vocab_mapper_longer.conf`
