Commit
Added test pipeline. Updated list of test pipelines in "all" test suite.
Showing 9 changed files with 144 additions and 19 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,77 @@ | ||
\!\! text\_normalize | ||
~~~~~~~~~~~~~~~~~~~~ | ||
Normalize raw text | ||
~~~~~~~~~~~~~~~~~~ | ||
|
||
.. py:module:: pimlico.modules.text.text_normalize | ||
.. note:: | ||
+------------+-------------------------------------+ | ||
| Path | pimlico.modules.text.text_normalize | | ||
+------------+-------------------------------------+ | ||
| Executable | yes | | ||
+------------+-------------------------------------+ | ||
|
||
This module has not yet been updated to the new datatype system, so cannot be used yet. Soon it will be updated. | ||
Text normalization for raw text documents. | ||
|
||
Similar to the :mod:`~pimlico.modules.text.normalize` module, but operates on raw text, | ||
not pre-tokenized text, so it provides a slightly different set of tools. | ||
|
||
|
||
Inputs | ||
====== | ||
|
||
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
| Name | Type(s) | | ||
+========+================================================================================================================================================================+ | ||
| corpus | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` <:class:`TextDocumentType <pimlico.datatypes.corpora.data_points.TextDocumentType>`> | | ||
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
|
||
Outputs | ||
======= | ||
|
||
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
| Name | Type(s) | | ||
+========+======================================================================================================================================================================+ | ||
| corpus | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` <:class:`RawTextDocumentType <pimlico.datatypes.corpora.data_points.RawTextDocumentType>`> | | ||
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
|
||
Options | ||
======= | ||
|
||
+-------------+-------------------------------------------------------------------------------------------------------------------------+------------------------+ | ||
| Name | Description | Type | | ||
+=============+=========================================================================================================================+========================+ | ||
| blank_lines | Remove all blank lines (after whitespace stripping, if requested) | bool | | ||
+-------------+-------------------------------------------------------------------------------------------------------------------------+------------------------+ | ||
| case | Transform all text to upper or lower case. Choose from 'upper' or 'lower', or leave blank to not perform transformation | 'upper', 'lower' or '' | | ||
+-------------+-------------------------------------------------------------------------------------------------------------------------+------------------------+ | ||
| strip | Strip whitespace from the start and end of lines | bool | | ||
+-------------+-------------------------------------------------------------------------------------------------------------------------+------------------------+ | ||
|
||
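The three options above can be read as a small line-based transformation. The following is a minimal Python sketch of that behaviour, not Pimlico's actual implementation: the function name ``normalize_text`` is hypothetical, and the order of operations (strip, then drop blank lines, then change case) is an assumption based only on the option descriptions. It also assumes that, without ``strip``, only completely empty lines count as blank.

```python
def normalize_text(text, case="", strip=False, blank_lines=False):
    """Hypothetical sketch of the three documented options.

    case:        'upper', 'lower' or '' (no case change)
    strip:       strip whitespace from the start and end of each line
    blank_lines: remove blank lines (after stripping, if requested)
    """
    if case not in ("upper", "lower", ""):
        raise ValueError("case must be 'upper', 'lower' or ''")

    lines = text.split("\n")

    if strip:
        # Remove leading and trailing whitespace on every line
        lines = [line.strip() for line in lines]

    if blank_lines:
        # Assumption: a line is blank only if it is empty at this point
        lines = [line for line in lines if line != ""]

    result = "\n".join(lines)

    if case == "upper":
        result = result.upper()
    elif case == "lower":
        result = result.lower()
    return result


sample = "  Hello World  \n\n   \nSecond LINE\n"
print(normalize_text(sample, case="lower", strip=True, blank_lines=True))
```

Stripping before the blank-line check means that lines containing only whitespace are removed as well, which matches the "after whitespace stripping, if requested" wording of the ``blank_lines`` option.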
Example config | ||
============== | ||
|
||
This is an example of how this module can be used in a pipeline config file. | ||
|
||
.. code-block:: ini | ||
    [my_text_normalize_module] | ||
    type=pimlico.modules.text.text_normalize | ||
    input_corpus=module_a.some_output | ||
This example usage includes more options. | ||
|
||
.. code-block:: ini | ||
    [my_text_normalize_module] | ||
    type=pimlico.modules.text.text_normalize | ||
    input_corpus=module_a.some_output | ||
    blank_lines=T | ||
    case= | ||
    strip=T | ||
Test pipelines | ||
============== | ||
|
||
This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage. | ||
|
||
* :ref:`test-config-text_normalize.conf` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
.. _test-config-text_normalize.conf: | ||
|
||
normalize | ||
~~~~~~~~~ | ||
|
||
|
||
|
||
This is one of the test pipelines included in Pimlico's repository. | ||
See :ref:`test-pipelines` for more details. | ||
|
||
Config file | ||
=========== | ||
|
||
The complete config file for this test pipeline: | ||
|
||
|
||
.. code-block:: ini | ||
    [pipeline] | ||
    name=normalize | ||
    release=latest | ||
    # Take input from a prepared Pimlico dataset | ||
    [europarl] | ||
    type=pimlico.datatypes.corpora.GroupedCorpus | ||
    data_point_type=RawTextDocumentType | ||
    dir=%(test_data_dir)s/datasets/text_corpora/europarl | ||
    [norm] | ||
    type=pimlico.modules.text.text_normalize | ||
    case=lower | ||
    strip=T | ||
    blank_lines=T | ||
Modules | ||
======= | ||
|
||
|
||
The following Pimlico module types are used in this pipeline: | ||
|
||
* :mod:`~pimlico.modules.text.text_normalize` | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +0,0 @@ | ||
AWAITING_UPDATE = True | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
[pipeline] | ||
name=normalize | ||
release=latest | ||
|
||
# Take input from a prepared Pimlico dataset | ||
[europarl] | ||
type=pimlico.datatypes.corpora.GroupedCorpus | ||
data_point_type=RawTextDocumentType | ||
dir=%(test_data_dir)s/datasets/text_corpora/europarl | ||
|
||
[norm] | ||
type=pimlico.modules.text.text_normalize | ||
case=lower | ||
strip=T | ||
blank_lines=T |
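The config above uses INI-style sections and ``%(name)s`` substitution, which Python's standard ``configparser`` module also understands. The sketch below reads the same text that way to show how the sections and the ``test_data_dir`` variable fit together. It is an illustration only: Pimlico has its own config loader, and the helper names ``load_pipeline_config`` and ``parse_bool``, the example path, and the set of strings treated as true are all assumptions.

```python
import configparser

# The test pipeline config, reproduced as a string for the example
CONFIG_TEXT = """\
[pipeline]
name=normalize
release=latest

# Take input from a prepared Pimlico dataset
[europarl]
type=pimlico.datatypes.corpora.GroupedCorpus
data_point_type=RawTextDocumentType
dir=%(test_data_dir)s/datasets/text_corpora/europarl

[norm]
type=pimlico.modules.text.text_normalize
case=lower
strip=T
blank_lines=T
"""


def load_pipeline_config(text, test_data_dir):
    """Parse the config, supplying test_data_dir for %(test_data_dir)s."""
    parser = configparser.ConfigParser(defaults={"test_data_dir": test_data_dir})
    parser.read_string(text)
    return parser


def parse_bool(value):
    """Hypothetical helper: treat 'T' and similar strings as true."""
    return value.strip().lower() in ("t", "true", "1", "yes")


# "/tmp/pimlico_test_data" is a made-up location used only for this example
config = load_pipeline_config(CONFIG_TEXT, "/tmp/pimlico_test_data")
print(config.sections())
print(config["europarl"]["dir"])
print(config["norm"]["case"], parse_bool(config["norm"]["strip"]))
```

A custom ``parse_bool`` is used because ``configparser``'s own ``getboolean`` does not accept ``T`` as a boolean value.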