Commit
New datatypes and modules to produce them. These provide functions to map words to embeddings, so that embedding types that aren't restricted to a fixed vocabulary can be used.
Showing 13 changed files with 400 additions and 6 deletions.
50 changes: 50 additions & 0 deletions
docs/modules/pimlico.modules.embeddings.mappers.fasttext.rst
@@ -0,0 +1,50 @@
fastText to doc\-embedding mapper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. py:module:: pimlico.modules.embeddings.mappers.fasttext

+------------+---------------------------------------------+
| Path       | pimlico.modules.embeddings.mappers.fasttext |
+------------+---------------------------------------------+
| Executable | yes                                         |
+------------+---------------------------------------------+

Use trained fastText embeddings to map words to their embeddings,
including OOVs, using sub-word information.

First train a fastText model using the fastText training module. Then
use this module to produce a doc-embeddings mapper.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

+------------+--------------------------------------------------------------------------------+
| Name       | Type(s)                                                                        |
+============+================================================================================+
| embeddings | :class:`fasttext_embeddings <pimlico.datatypes.embeddings.FastTextEmbeddings>` |
+------------+--------------------------------------------------------------------------------+

Outputs
=======

+--------+------------------------------------------------------------------------------------------+
| Name   | Type(s)                                                                                  |
+========+==========================================================================================+
| mapper | :class:`fasttext_doc_embeddings_mapper <pimlico.datatypes.embeddings.FastTextDocMapper>` |
+--------+------------------------------------------------------------------------------------------+

Example config
==============

This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini

   [my_fasttext_doc_mapper_module]
   type=pimlico.modules.embeddings.mappers.fasttext
   input_embeddings=module_a.some_output
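
The sub-word behaviour described in the fastText mapper docs can be illustrated with a small sketch. This is a simplified stand-in, not Pimlico's or fastText's actual implementation: `char_ngrams`, `ToySubwordMapper` and `word_to_embedding` are hypothetical names, and real fastText also hashes n-grams into buckets and includes a whole-word vector. The point is only that a word outside the vocabulary can still receive a vector built from the character n-grams it shares with seen words.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


class ToySubwordMapper:
    """Hypothetical sub-word mapper (an assumption, not Pimlico's API)."""

    def __init__(self, ngram_vectors, dim):
        # ngram_vectors: dict from n-gram string to a list of floats
        self.ngram_vectors = ngram_vectors
        self.dim = dim

    def word_to_embedding(self, word):
        # Collect vectors for every n-gram of the word that we know about
        known = [
            self.ngram_vectors[gram]
            for gram in char_ngrams(word)
            if gram in self.ngram_vectors
        ]
        if not known:
            # No overlapping n-grams at all: fall back to a zero vector
            return [0.0] * self.dim
        # Average the known n-gram vectors component by component
        return [sum(column) / len(known) for column in zip(*known)]


mapper = ToySubwordMapper({"<ca": [1.0, 0.0], "cat": [0.0, 1.0]}, dim=2)
# "cats" was never seen as a whole word, but shares n-grams with "cat"
oov_vector = mapper.word_to_embedding("cats")
```

Averaging is one of several reasonable ways to combine n-gram vectors; it is used here only because it keeps the example short.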
@@ -0,0 +1,50 @@
Fixed embeddings to doc\-embedding mapper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. py:module:: pimlico.modules.embeddings.mappers.fixed

+------------+------------------------------------------+
| Path       | pimlico.modules.embeddings.mappers.fixed |
+------------+------------------------------------------+
| Executable | yes                                      |
+------------+------------------------------------------+

Use trained fixed word embeddings to map words to their embeddings.
Does nothing with OOVs, which we don't have any way to map.

First train or load embeddings using another module.
Then use this module to produce a doc-embeddings mapper.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

+------------+---------------------------------------------------------------+
| Name       | Type(s)                                                       |
+============+===============================================================+
| embeddings | :class:`embeddings <pimlico.datatypes.embeddings.Embeddings>` |
+------------+---------------------------------------------------------------+

Outputs
=======

+--------+---------------------------------------------------------------------------------------------------------+
| Name   | Type(s)                                                                                                 |
+========+=========================================================================================================+
| mapper | :class:`fixed_embeddings_doc_embeddings_mapper <pimlico.datatypes.embeddings.FixedEmbeddingsDocMapper>` |
+--------+---------------------------------------------------------------------------------------------------------+

Example config
==============

This is an example of how this module can be used in a pipeline config file.

.. code-block:: ini

   [my_fixed_embeddings_doc_mapper_module]
   type=pimlico.modules.embeddings.mappers.fixed
   input_embeddings=module_a.some_output
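
For contrast with the sub-word case, a fixed-vocabulary mapper reduces to a table lookup. The sketch below is illustrative only: `ToyFixedMapper`, `word_to_embedding` and `map_tokens` are hypothetical names, and returning `None` for an out-of-vocabulary word is an assumption about what "does nothing with OOVs" could look like, not a statement of how `FixedEmbeddingsDocMapper` behaves.

```python
class ToyFixedMapper:
    """Hypothetical fixed-vocabulary mapper (an assumption, not Pimlico's API)."""

    def __init__(self, vocab_vectors):
        # vocab_vectors: dict from word to a list of floats
        self.vocab_vectors = dict(vocab_vectors)

    def word_to_embedding(self, word):
        # Plain lookup: words outside the vocabulary have no vector
        return self.vocab_vectors.get(word)


def map_tokens(mapper, tokens):
    """Pair each in-vocabulary token with its vector, skipping OOVs."""
    pairs = []
    for token in tokens:
        vector = mapper.word_to_embedding(token)
        if vector is not None:
            pairs.append((token, vector))
    return pairs


fixed_mapper = ToyFixedMapper({"cat": [1.0, 2.0]})
mapped = map_tokens(fixed_mapper, ["cat", "dog", "cat"])
```

Skipping unknown tokens keeps the caller free to decide how to handle them, for example by substituting a zero vector.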
@@ -0,0 +1,17 @@
Doc embedding mappers
~~~~~~~~~~~~~~~~~~~~~


.. py:module:: pimlico.modules.embeddings.mappers

Produce datatypes that can map tokens in documents to their embeddings.



.. toctree::
   :maxdepth: 2
   :titlesonly:

   pimlico.modules.embeddings.mappers.fasttext
   pimlico.modules.embeddings.mappers.fixed
@@ -0,0 +1,5 @@
"""Doc embedding mappers

Produce datatypes that can map tokens in documents to their embeddings.

"""
Empty file.
13 changes: 13 additions & 0 deletions
src/python/pimlico/modules/embeddings/mappers/fasttext/execute.py
@@ -0,0 +1,13 @@
# This file is part of Pimlico
# Copyright (C) 2016 Mark Granroth-Wilding
# Licensed under the GNU GPL v3.0 - http://www.gnu.org/licenses/gpl-3.0.en.html

from pimlico.core.modules.base import BaseModuleExecutor


class ModuleExecutor(BaseModuleExecutor):
    def execute(self):
        input_embeddings = self.info.get_input("embeddings")

        with self.info.get_output_writer("mapper") as writer:
            writer.save_model(input_embeddings.load_model())
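
The executor above follows a load-then-save pattern: it reads the model from its input and hands it to an output writer inside a `with` block. The sketch below imitates that pattern with toy classes so it can be run without Pimlico. `ToyEmbeddingsInput`, `ToyModelWriter` and `copy_model` are hypothetical stand-ins, and the "complete" flag is an assumption about what a writer might record on exit, not something taken from Pimlico's writer classes.

```python
class ToyEmbeddingsInput:
    """Hypothetical input reader that holds an already-trained model."""

    def __init__(self, model):
        self._model = model

    def load_model(self):
        return self._model


class ToyModelWriter:
    """Hypothetical output writer used as a context manager."""

    def __init__(self, store):
        self.store = store

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Mark the output complete only if the block finished without error
        self.store["complete"] = exc_type is None
        return False  # never swallow exceptions

    def save_model(self, model):
        self.store["model"] = model


def copy_model(reader, store):
    """Load the model from the reader and save it through the writer."""
    with ToyModelWriter(store) as writer:
        writer.save_model(reader.load_model())
    return store


result = copy_model(ToyEmbeddingsInput({"dim": 2}), {})
```

Using a context manager means any finalisation step runs even if saving fails partway, which is why the write is wrapped in `with` rather than called directly.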