Rebuilt docs
markgw committed Apr 2, 2020
1 parent ed0ed4e commit 000bd9d
Showing 22 changed files with 301 additions and 5 deletions.
7 changes: 7 additions & 0 deletions docs/api/pimlico.cli.pimarc.rst
@@ -0,0 +1,7 @@
pimarc
======

.. automodule:: pimlico.cli.pimarc
    :members:
    :undoc-members:
    :show-inheritance:
1 change: 1 addition & 0 deletions docs/api/pimlico.cli.rst
@@ -23,6 +23,7 @@ Submodules
pimlico.cli.locations
pimlico.cli.main
pimlico.cli.newmodule
pimlico.cli.pimarc
pimlico.cli.pyshell
pimlico.cli.recover
pimlico.cli.reset
7 changes: 7 additions & 0 deletions docs/api/pimlico.utils.pimarc.index.rst
@@ -0,0 +1,7 @@
index
=====

.. automodule:: pimlico.utils.pimarc.index
    :members:
    :undoc-members:
    :show-inheritance:
7 changes: 7 additions & 0 deletions docs/api/pimlico.utils.pimarc.reader.rst
@@ -0,0 +1,7 @@
reader
======

.. automodule:: pimlico.utils.pimarc.reader
    :members:
    :undoc-members:
    :show-inheritance:
22 changes: 22 additions & 0 deletions docs/api/pimlico.utils.pimarc.rst
@@ -0,0 +1,22 @@
pimarc
======

Submodules
----------

.. toctree::

   pimlico.utils.pimarc.index
   pimlico.utils.pimarc.reader
   pimlico.utils.pimarc.tar
   pimlico.utils.pimarc.tools
   pimlico.utils.pimarc.utils
   pimlico.utils.pimarc.writer

Module contents
---------------

.. automodule:: pimlico.utils.pimarc
    :members:
    :undoc-members:
    :show-inheritance:
7 changes: 7 additions & 0 deletions docs/api/pimlico.utils.pimarc.tar.rst
@@ -0,0 +1,7 @@
tar
===

.. automodule:: pimlico.utils.pimarc.tar
    :members:
    :undoc-members:
    :show-inheritance:
7 changes: 7 additions & 0 deletions docs/api/pimlico.utils.pimarc.tools.rst
@@ -0,0 +1,7 @@
tools
=====

.. automodule:: pimlico.utils.pimarc.tools
    :members:
    :undoc-members:
    :show-inheritance:
7 changes: 7 additions & 0 deletions docs/api/pimlico.utils.pimarc.utils.rst
@@ -0,0 +1,7 @@
utils
=====

.. automodule:: pimlico.utils.pimarc.utils
    :members:
    :undoc-members:
    :show-inheritance:
7 changes: 7 additions & 0 deletions docs/api/pimlico.utils.pimarc.writer.rst
@@ -0,0 +1,7 @@
writer
======

.. automodule:: pimlico.utils.pimarc.writer
    :members:
    :undoc-members:
    :show-inheritance:
2 changes: 2 additions & 0 deletions docs/api/pimlico.utils.rst
@@ -7,6 +7,7 @@ Subpackages
.. toctree::

   pimlico.utils.docs
   pimlico.utils.pimarc

Submodules
----------
@@ -30,6 +31,7 @@ Submodules
pimlico.utils.system
pimlico.utils.timeout
pimlico.utils.urwid
pimlico.utils.varint
pimlico.utils.web

Module contents
7 changes: 7 additions & 0 deletions docs/api/pimlico.utils.varint.rst
@@ -0,0 +1,7 @@
varint
======

.. automodule:: pimlico.utils.varint
    :members:
    :undoc-members:
    :show-inheritance:
3 changes: 3 additions & 0 deletions docs/commands/index.rst
@@ -62,6 +62,8 @@ command line.
+-------------------+----------------------------------------------------------------------------------------------+
| :doc:`stores` | List named Pimlico stores |
+-------------------+----------------------------------------------------------------------------------------------+
| :doc:`tar2pimarc` | Convert grouped corpora from the old tar-based storage format to pimarc |
+-------------------+----------------------------------------------------------------------------------------------+
| :doc:`unlock` | Forcibly remove an execution lock from a module |
+-------------------+----------------------------------------------------------------------------------------------+
| :doc:`variants` | List the available variants of a pipeline config |
@@ -98,3 +100,4 @@ command line.
visualize
email
jupyter
tar2pimarc
38 changes: 38 additions & 0 deletions docs/commands/tar2pimarc.rst
@@ -0,0 +1,38 @@
.. _command_tar2pimarc:

tar2pimarc
~~~~~~~~~~


*Command-line tool subcommand*


Convert grouped corpora from the old tar-based storage format to Pimarc
archives.


Usage:

::

    pimlico.sh [...] tar2pimarc [outputs [outputs ...]] [-h] [--run]


Positional arguments
====================

+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Arg | Description |
+=============================+=================================================================================================================================================================================================================================================================+
| ``[outputs [outputs ...]]`` | Specification of module outputs to convert. Specific datasets can be given as 'module_name.output_name'. All grouped corpus outputs of a module can be converted by just giving 'module_name'. Or, if nothing's given, all outputs of all modules are converted |
+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Options
=======

+-----------+------------------------------------------------------------------------------+
| Option | Description |
+===========+==============================================================================+
| ``--run`` | Run conversion. Without this option, just checks what format the corpora use |
+-----------+------------------------------------------------------------------------------+
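
The way output specifications are expanded can be read as a small rule with three cases. The following is a minimal Python sketch of that rule, not Pimlico's actual implementation: the function ``expand_output_specs`` and the ``pipeline_outputs`` dictionary (mapping each module name to its grouped corpus output names) are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Tuple


def expand_output_specs(
    specs: List[str], pipeline_outputs: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Expand output specs into (module_name, output_name) pairs.

    `pipeline_outputs` is a hypothetical stand-in for a loaded pipeline:
    it maps module names to the names of their grouped corpus outputs.
    """
    if not specs:
        # Nothing given: every output of every module
        return [
            (module, output)
            for module, outputs in pipeline_outputs.items()
            for output in outputs
        ]

    resolved = []
    for spec in specs:
        # 'module_name.output_name' or just 'module_name'
        module, _, output = spec.partition(".")
        if module not in pipeline_outputs:
            raise ValueError("unknown module: %s" % module)
        if output:
            if output not in pipeline_outputs[module]:
                raise ValueError("unknown output: %s" % spec)
            resolved.append((module, output))
        else:
            # Module name only: all of that module's outputs
            resolved.extend((module, out) for out in pipeline_outputs[module])
    return resolved


# Illustrative module and output names (assumptions, not from a real pipeline)
outputs = {"tokenize": ["documents"], "split": ["set1", "set2"]}
print(expand_output_specs(["split.set2", "tokenize"], outputs))
```

Since the command only checks the storage format unless ``--run`` is given, a dry run over the same specifications is a reasonable first step before converting.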

1 change: 1 addition & 0 deletions docs/modules/pimlico.modules.corpora.rst
@@ -22,6 +22,7 @@ Core modules for generic manipulation of mainly iterable corpora.
pimlico.modules.corpora.shuffle
pimlico.modules.corpora.split
pimlico.modules.corpora.store
pimlico.modules.corpora.subsample
pimlico.modules.corpora.subset
pimlico.modules.corpora.vocab_builder
pimlico.modules.corpora.vocab_counter
75 changes: 75 additions & 0 deletions docs/modules/pimlico.modules.corpora.subsample.rst
@@ -0,0 +1,75 @@
Random subsample
~~~~~~~~~~~~~~~~

.. py:module:: pimlico.modules.corpora.subsample

+------------+-----------------------------------+
| Path | pimlico.modules.corpora.subsample |
+------------+-----------------------------------+
| Executable | yes |
+------------+-----------------------------------+

Randomly subsample documents of a corpus at a given rate to create a smaller corpus.


Inputs
======

+--------+---------------------------------------------------------------------------+
| Name | Type(s) |
+========+===========================================================================+
| corpus | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` |
+--------+---------------------------------------------------------------------------+

Outputs
=======

+--------+--------------------------------------------------------------------------------------------------------+
| Name | Type(s) |
+========+========================================================================================================+
| corpus | :class:`corpus with data-point from input <pimlico.datatypes.corpora.grouped.CorpusWithTypeFromInput>` |
+--------+--------------------------------------------------------------------------------------------------------+


Options
=======

+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| Name | Description | Type |
+==============+==============================================================================================================================================================+=======+
| p | (required) Probability of including any given document. The resulting corpus will be roughly this proportion of the size of the input | float |
+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| seed | Random seed. We always set a random seed before starting to ensure some level of reproducibility | int |
+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| skip_invalid | Skip over any invalid documents so that the output subset contains only valid documents and no invalid ones. By default, invalid documents are passed through | bool |
+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
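
How the options interact can be illustrated with a minimal Python sketch. It assumes independent per-document sampling with a seeded random number generator. It is not the module's actual code: the function name ``subsample``, the ``(name, document)`` pairs, the use of ``None`` to stand for an invalid document, and the default seed are all assumptions made for this illustration.

```python
import random
from typing import Iterable, Iterator, Optional, Tuple

Doc = Tuple[str, Optional[str]]  # (name, text); None marks an invalid document (assumption)


def subsample(
    docs: Iterable[Doc], p: float, seed: int = 1234, skip_invalid: bool = False
) -> Iterator[Doc]:
    """Yield each document independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # A fixed seed makes repeated runs select the same documents
    rng = random.Random(seed)
    for name, doc in docs:
        if skip_invalid and doc is None:
            # Drop invalid documents instead of passing them through
            continue
        if rng.random() < p:
            yield name, doc


corpus = [("doc%d" % i, "text %d" % i) for i in range(1000)]
sample = list(subsample(corpus, p=0.1, seed=1234))
# The sample size is only roughly p times the input size
print(len(sample), "of", len(corpus), "documents kept")
```

Because each document is kept or dropped independently, the output size varies around ``p`` times the input size rather than matching it exactly.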

Example config
==============

This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini

    [my_subsample_module]
    type=pimlico.modules.corpora.subsample
    input_corpus=module_a.some_output
    p=0.1

This example usage includes more options.

.. code-block:: ini

    [my_subsample_module]
    type=pimlico.modules.corpora.subsample
    input_corpus=module_a.some_output
    p=0.1
    seed=1234
    skip_invalid=T

Test pipelines
==============

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-subsample.conf`
2 changes: 1 addition & 1 deletion docs/modules/pimlico.modules.corpora.vocab_builder.rst
@@ -75,7 +75,7 @@ This example usage includes more options.
input_text=module_a.some_output
include=word1,word2,...
limit=10k
-oov=value
+oov=text
prune_at=2000000
threshold=100
1 change: 1 addition & 0 deletions docs/modules/pimlico.modules.corpora.vocab_mapper.rst
@@ -85,4 +85,5 @@ Test pipelines

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-vocab_mapper_longer.conf`
* :ref:`test-config-vocab_mapper.conf`
8 changes: 4 additions & 4 deletions docs/modules/pimlico.modules.input.xml.rst
@@ -1,5 +1,5 @@
-Raw text files
-~~~~~~~~~~~~~~
+XML files
+~~~~~~~~~

.. py:module:: pimlico.modules.input.xml
@@ -66,15 +66,15 @@ This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini
-[my_raw_text_files_reader_module]
+[my_xml_files_reader_module]
type=pimlico.modules.input.xml
files=path1,path2,...
This example usage includes more options.

.. code-block:: ini
-[my_raw_text_files_reader_module]
+[my_xml_files_reader_module]
type=pimlico.modules.input.xml
archive_basename=archive
archive_size=1000
43 changes: 43 additions & 0 deletions docs/test_config/corpora.subsample.conf.rst
@@ -0,0 +1,43 @@
.. _test-config-subsample.conf:

subsample
~~~~~~~~~



This is one of the test pipelines included in Pimlico's repository.
See :ref:`test-pipelines` for more details.

Config file
===========

The complete config file for this test pipeline:


.. code-block:: ini

    [pipeline]
    name=subsample
    release=latest

    # Take input from a prepared Pimlico dataset
    [europarl]
    type=pimlico.datatypes.corpora.GroupedCorpus
    data_point_type=RawTextDocumentType
    dir=%(test_data_dir)s/datasets/text_corpora/europarl

    [subsample]
    type=pimlico.modules.corpora.subsample
    p=0.8
    seed=1

Modules
=======


The following Pimlico module types are used in this pipeline:

* :mod:`~pimlico.modules.corpora.subsample`


50 changes: 50 additions & 0 deletions docs/test_config/corpora.vocab_mapper_longer.conf.rst
@@ -0,0 +1,50 @@
.. _test-config-vocab_mapper_longer.conf:

vocab\_mapper
~~~~~~~~~~~~~



This is one of the test pipelines included in Pimlico's repository.
See :ref:`test-pipelines` for more details.

Config file
===========

The complete config file for this test pipeline:


.. code-block:: ini

    [pipeline]
    name=vocab_mapper
    release=latest

    # Take input from a prepared Pimlico dataset
    [europarl]
    type=pimlico.datatypes.corpora.GroupedCorpus
    data_point_type=TokenizedDocumentType
    dir=%(test_data_dir)s/datasets/corpora/tokenized_longer

    # Load the prepared vocabulary
    # (created by the vocab_builder test pipeline)
    [vocab]
    type=pimlico.datatypes.dictionary.Dictionary
    dir=%(test_data_dir)s/datasets/vocab

    # Perform the mapping from words to IDs
    [ids]
    type=pimlico.modules.corpora.vocab_mapper
    input_vocab=vocab
    input_text=europarl

Modules
=======


The following Pimlico module types are used in this pipeline:

* :mod:`~pimlico.modules.corpora.vocab_mapper`
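
As a rough illustration of what mapping words to IDs means, the sketch below looks up each token of a tokenized document in a vocabulary. It is not the ``vocab_mapper`` module's code: the function ``map_to_ids``, the plain dictionary used as a vocabulary, and the choice to map unknown words to a reserved out-of-vocabulary ID are all assumptions made for this example.

```python
from typing import Dict, List


def map_to_ids(
    sentences: List[List[str]], vocab: Dict[str, int], oov_id: int
) -> List[List[int]]:
    """Replace every token with its vocabulary ID.

    Tokens missing from the vocabulary get `oov_id` (an assumption of this sketch).
    """
    return [[vocab.get(token, oov_id) for token in sentence] for sentence in sentences]


# Hypothetical vocabulary and document, for illustration only
vocab = {"the": 0, "parliament": 1, "resumed": 2}
oov_id = len(vocab)  # reserve the next free ID for unknown words
document = [["the", "parliament", "resumed"], ["the", "session", "resumed"]]
print(map_to_ids(document, vocab, oov_id))
```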

