Executes full spacy pipeline up to parsing, so includes tokenization, sentence splitting and POS tagging
Showing 16 changed files with 335 additions and 47 deletions.
@@ -0,0 +1,7 @@
licenses
========

.. automodule:: pimlico.core.dependencies.licenses
    :members:
    :undoc-members:
    :show-inheritance:
@@ -0,0 +1,28 @@
.. _command_licenses:

licenses
~~~~~~~~


*Command-line tool subcommand*


Output a list of the licenses for all software depended on.


Usage:

::

    pimlico.sh [...] licenses [modules [modules ...]] [-h]


Positional arguments
====================

+-----------------------------+----------------------------------------------------------------------------------------------------------------+
| Arg                         | Description                                                                                                    |
+=============================+================================================================================================================+
| ``[modules [modules ...]]`` | Check dependencies of modules and their datatypes. Use 'all' to list licenses for dependencies for all modules |
+-----------------------------+----------------------------------------------------------------------------------------------------------------+

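The usage line above accepts zero or more module names after the `licenses` subcommand. The following is a minimal `argparse` sketch of that argument shape only. It is not Pimlico's actual command-line code: `build_parser` is a hypothetical helper, and the pipeline config and other arguments covered by `[...]` are left out.

```python
import argparse


def build_parser():
    # Hypothetical parser that mirrors only the "licenses [modules ...]" shape
    parser = argparse.ArgumentParser(prog="pimlico.sh")
    subparsers = parser.add_subparsers(dest="command")
    licenses = subparsers.add_parser(
        "licenses",
        help="Output a list of the licenses for all software depended on",
    )
    # nargs="*" allows no module names, one name, or several
    licenses.add_argument(
        "modules",
        nargs="*",
        help="Modules to check; 'all' lists licenses for all modules",
    )
    return parser


parser = build_parser()
all_args = parser.parse_args(["licenses", "all"])
no_module_args = parser.parse_args(["licenses"])
```

With `nargs="*"`, omitting the module names yields an empty list rather than an error, which matches the optional brackets in the usage line.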
@@ -0,0 +1,80 @@
Text parser
~~~~~~~~~~~

.. py:module:: pimlico.modules.spacy.parse_text

+------------+----------------------------------+
| Path       | pimlico.modules.spacy.parse_text |
+------------+----------------------------------+
| Executable | yes                              |
+------------+----------------------------------+

Parsing using spaCy

Runs the entire parsing pipeline from raw text using a single spaCy model.

The word annotations in the output contain the information from the spaCy parser
and the documents are split into sentences following spaCy's sentence segmentation.

The annotation fields follow those produced by the Malt parser: pos, head and deprel.


Inputs
======

+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Name | Type(s)                                                                                                                                                              |
+======+======================================================================================================================================================================+
| text | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` <:class:`RawTextDocumentType <pimlico.datatypes.corpora.data_points.RawTextDocumentType>`> |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Outputs
=======

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Name   | Type(s)                                                                                                                                                                                   |
+========+===========================================================================================================================================================================================+
| parsed | :class:`grouped_corpus <pimlico.datatypes.corpora.grouped.GroupedCorpus>` <:class:`WordAnnotationsDocumentType <pimlico.datatypes.corpora.word_annotations.WordAnnotationsDocumentType>`> |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


Options
=======

+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| Name    | Description                                                                                                                                                                                                      | Type   |
+=========+==================================================================================================================================================================================================================+========+
| model   | spaCy model to use. This may be a name of a standard spaCy model or a path to the location of a trained model on disk, if on_disk=T. If it's not a path, the spaCy download command will be run before execution | string |
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| on_disk | Load the specified model from a location on disk (the model parameter gives the path)                                                                                                                            | bool   |
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+

Example config
==============

This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini

    [my_spacy_text_parser_module]
    type=pimlico.modules.spacy.parse_text
    input_text=module_a.some_output

This example usage includes more options.

.. code-block:: ini

    [my_spacy_text_parser_module]
    type=pimlico.modules.spacy.parse_text
    input_text=module_a.some_output
    model=en_core_web_sm
    on_disk=T


Test pipelines
==============

This module is used by the following :ref:`test pipelines <test-pipelines>`. They are a further source of examples of the module's usage.

* :ref:`test-config-spacy-parse_text.conf`

@@ -0,0 +1,42 @@
.. _test-config-spacy-parse_text.conf:

spacy\_parse\_text
~~~~~~~~~~~~~~~~~~



This is one of the test pipelines included in Pimlico's repository.
See :ref:`test-pipelines` for more details.

Config file
===========

The complete config file for this test pipeline:


.. code-block:: ini

    [pipeline]
    name=spacy_parse_text
    release=latest

    # Prepared tarred corpus
    [europarl]
    type=pimlico.datatypes.corpora.GroupedCorpus
    data_point_type=RawTextDocumentType
    dir=%(test_data_dir)s/datasets/text_corpora/europarl

    [tokenize]
    type=pimlico.modules.spacy.parse_text
    model=en_core_web_sm


Modules
=======


The following Pimlico module types are used in this pipeline:

* :mod:`pimlico.modules.spacy.parse_text`


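To see how the sections and the `%(test_data_dir)s` placeholder in this config fit together, the sketch below reads the same text with Python's standard `configparser`. Pimlico has its own config loader, so this only illustrates the INI structure: `read_pipeline_config` is a hypothetical helper and `/tmp/test_data` is a made-up value for `test_data_dir`.

```python
import configparser

CONFIG_TEXT = """
[pipeline]
name=spacy_parse_text
release=latest

# Prepared tarred corpus
[europarl]
type=pimlico.datatypes.corpora.GroupedCorpus
data_point_type=RawTextDocumentType
dir=%(test_data_dir)s/datasets/text_corpora/europarl

[tokenize]
type=pimlico.modules.spacy.parse_text
model=en_core_web_sm
"""


def read_pipeline_config(text, test_data_dir):
    # Hypothetical helper: passing test_data_dir as a default lets
    # configparser's basic interpolation expand %(test_data_dir)s
    parser = configparser.ConfigParser(defaults={"test_data_dir": test_data_dir})
    parser.read_string(text)
    return parser


config = read_pipeline_config(CONFIG_TEXT, "/tmp/test_data")
# Every section other than [pipeline] describes one module of the pipeline
module_sections = [name for name in config.sections() if name != "pipeline"]
corpus_dir = config["europarl"]["dir"]
```

Here `module_sections` lists the `europarl` input followed by the `tokenize` module, in the order in which they appear in the file.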
Empty file.
@@ -0,0 +1,46 @@
# This file is part of Pimlico
# Copyright (C) 2020 Mark Granroth-Wilding
# Licensed under the GNU LGPL v3.0 - https://www.gnu.org/licenses/lgpl-3.0.en.html
from pimlico.core.modules.map import skip_invalid
from pimlico.core.modules.map.singleproc import single_process_executor_factory
from ..utils import load_spacy_model


def preprocess(worker):
    model = worker.info.options["model"]
    nlp = load_spacy_model(model, worker.executor.log, local=worker.info.options["on_disk"])

    pipeline = ["tagger", "parser"]
    for pipe_name in nlp.pipe_names:
        if pipe_name not in pipeline:
            # Remove any components other than the tagger and parser that might be in the model
            nlp.remove_pipe(pipe_name)
    worker.nlp = nlp

    # Check the order of the fields in the output
    output_dt = worker.info.get_output_datatype("parsed")[1]
    fields_list = output_dt.data_point_type.fields
    # This little function will put the annotations in the right order
    def output(token, pos, head, deprel):
        fields = {"word": token, "pos": pos, "head": head, "deprel": deprel}
        return [fields[field] for field in fields_list]
    worker.output_fields = output


@skip_invalid
def process_document(worker, archive, filename, doc):
    # Apply tagger and parser to the raw text
    doc = worker.nlp(doc.text)
    # Now doc.sents contains the separated sentences
    # and each word should have a POS tag and head+dep type
    return {
        "word_annotations": [
            [
                worker.output_fields(token.text, token.pos_, str(token.head.i - sentence.start), token.dep_)
                for token in sentence
            ] for sentence in doc.sents
        ]
    }


ModuleExecutor = single_process_executor_factory(process_document, worker_set_up_fn=preprocess)
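Two details of the executor are easy to miss: the head index is made relative to the start of the sentence, and the closure reorders the annotations to match the output datatype's field list. The sketch below reproduces both without spaCy. The `Token` namedtuple, `make_output` and `annotate_sentence` are hypothetical stand-ins for illustration, not part of Pimlico or spaCy.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: i is the token's index in the
# whole document and head_i is the document-level index of its head
Token = namedtuple("Token", ["text", "pos", "i", "head_i", "dep"])


def make_output(fields_list):
    # Same idea as the closure in preprocess(): emit values in field order
    def output(token, pos, head, deprel):
        fields = {"word": token, "pos": pos, "head": head, "deprel": deprel}
        return [fields[field] for field in fields_list]
    return output


def annotate_sentence(tokens, sentence_start, output):
    # Subtracting the sentence start turns document-level head indices
    # into positions within the sentence
    return [
        output(t.text, t.pos, str(t.head_i - sentence_start), t.dep)
        for t in tokens
    ]


# A sentence that starts at document token index 3
sentence = [
    Token("Dogs", "NOUN", 3, 4, "nsubj"),
    Token("bark", "VERB", 4, 4, "ROOT"),
]
rows = annotate_sentence(sentence, 3, make_output(["word", "pos", "head", "deprel"]))
reordered = annotate_sentence(sentence, 3, make_output(["deprel", "word"]))
```

In this toy sentence the root token is its own head, so its head value is its own position within the sentence.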
@@ -0,0 +1,43 @@
# This file is part of Pimlico
# Copyright (C) 2020 Mark Granroth-Wilding
# Licensed under the GNU LGPL v3.0 - https://www.gnu.org/licenses/lgpl-3.0.en.html

"""Parsing using spaCy

Runs the entire parsing pipeline from raw text using a single spaCy model.

The word annotations in the output contain the information from the spaCy parser
and the documents are split into sentences following spaCy's sentence segmentation.

The annotation fields follow those produced by the Malt parser: pos, head and deprel.

"""
from pimlico.core.dependencies.python import spacy_dependency
from pimlico.core.modules.map import DocumentMapModuleInfo
from pimlico.core.modules.options import str_to_bool
from pimlico.datatypes import GroupedCorpus
from pimlico.datatypes.corpora.data_points import RawTextDocumentType
from pimlico.datatypes.corpora.word_annotations import WordAnnotationsDocumentType


class ModuleInfo(DocumentMapModuleInfo):
    module_type_name = "spacy_text_parser"
    module_readable_name = "Text parser"
    module_inputs = [("text", GroupedCorpus(RawTextDocumentType()))]
    module_outputs = [("parsed", GroupedCorpus(WordAnnotationsDocumentType(["word", "pos", "head", "deprel"])))]
    module_options = {
        "model": {
            "help": "spaCy model to use. This may be a name of a standard spaCy model or a path to the "
                    "location of a trained model on disk, if on_disk=T. "
                    "If it's not a path, the spaCy download command will be run before execution",
            "default": "en_core_web_sm",
        },
        "on_disk": {
            "help": "Load the specified model from a location on disk (the model parameter gives the path)",
            "type": str_to_bool,
        }
    }
    module_supports_python2 = True

    def get_software_dependencies(self):
        return super(ModuleInfo, self).get_software_dependencies() + [spacy_dependency]
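The `module_options` dictionary pairs each option with an optional `default` and an optional `type` converter. The sketch below shows one way such a specification could be applied to raw config strings. It is an assumption about the general pattern, not Pimlico's option-processing code: `resolve_options` is hypothetical, and `parse_bool` is a stand-in for `str_to_bool` whose accepted spellings are guessed.

```python
def parse_bool(value):
    # Hypothetical stand-in for str_to_bool; the accepted spellings are an assumption
    return value.strip().lower() in ("t", "true", "1", "yes")


# Trimmed copy of the option spec, with the stand-in converter swapped in
OPTION_SPEC = {
    "model": {"default": "en_core_web_sm"},
    "on_disk": {"type": parse_bool},
}


def resolve_options(spec, raw):
    # Hypothetical resolver: convert supplied values, fall back to defaults
    resolved = {}
    for name, definition in spec.items():
        if name in raw:
            convert = definition.get("type", str)
            resolved[name] = convert(raw[name])
        else:
            # Options without a declared default resolve to None
            resolved[name] = definition.get("default")
    return resolved


with_flag = resolve_options(OPTION_SPEC, {"on_disk": "T"})
without_flag = resolve_options(OPTION_SPEC, {})
```

Under these assumptions, leaving `on_disk` out yields a falsy value, so a model name such as `en_core_web_sm` would be treated as a standard spaCy model rather than a path.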