Make explicit Py2 support for modules & datatypes
Previously, everything in the core modules was assumed to be Py 2+3
compatible, using future. However, Gensim no longer supports Py2, so the
modules that depend on Gensim cannot be Py2 compatible. The same will
surely happen for other dependencies.

Now all modules and datatypes declare whether they support Python 2.
Most of the core modules do. The documentation for modules states
explicitly where a module does not.

When running test pipelines under Python 2, tests are skipped for any
modules that don't support Py 2.

All test pipelines now succeed under both Py 2 (excluding the ones
that are skipped) and Py 3.
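
The test-skipping rule described in this message could look roughly like the sketch below. It is not Pimlico's actual test runner: `ExampleModuleInfo`, `Py3OnlyModuleInfo` and `should_skip_test` are hypothetical names used only for illustration. Only the `module_supports_python2` attribute and the `supports_python2()` method come from this commit.

```python
import sys


class ExampleModuleInfo(object):
    # Hypothetical module that declares Python 2 support
    module_supports_python2 = True

    @classmethod
    def supports_python2(cls):
        return cls.module_supports_python2


class Py3OnlyModuleInfo(ExampleModuleInfo):
    # Hypothetical module that depends on a Python 3-only library
    module_supports_python2 = False


def should_skip_test(module_info_cls, python_major=None):
    """Skip a module's test only under Python 2 when the module lacks Py2 support."""
    if python_major is None:
        # Default to the interpreter that is currently running
        python_major = sys.version_info[0]
    return python_major == 2 and not module_info_cls.supports_python2()
```

Under Python 3 nothing is skipped, so `should_skip_test` only ever returns True when the major version is 2.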
markgw committed Aug 6, 2020
1 parent 95eb720 commit 4b808f4
Showing 105 changed files with 313 additions and 22 deletions.
2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.concat.rst
@@ -16,6 +16,8 @@ They must have the same data point type, or one must be a subtype of the other.

This is a filter module. It is not executable, so won't appear in a pipeline's list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.

*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.format.rst
@@ -20,6 +20,8 @@ formatting operations are designed for display, this is generally only useful to
consumption.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.group.rst
@@ -32,6 +32,8 @@ and the grouping will be preserved as the corpus passes through the pipeline.

This is a filter module. It is not executable, so won't appear in a pipeline's list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.

*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.interleave.rst
@@ -25,6 +25,8 @@ not currently implemented and may not be worth the trouble. Perhaps we will add

This is a filter module. It is not executable, so won't appear in a pipeline's list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.

*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.list_filter.rst
@@ -13,6 +13,8 @@ Similar to :mod:`~pimlico.modules.corpora.split`, but instead of taking a random
according to a given list of documents, putting those in the list in one set and the rest in another.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

5 changes: 5 additions & 0 deletions docs/modules/pimlico.modules.corpora.shuffle.rst
@@ -35,6 +35,11 @@ comes from a filter module, its documents cannot be randomly accessed.
Since this requires some tricky changes to the datatype system, I'm not
implementing it now, but it should be done in future.

It will be implemented as part of the replacement of ``GroupedCorpus``
by ``StoredIterableCorpus``: `https://github.com/markgw/pimlico/issues/24`_


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======
2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.shuffle_linear.rst
@@ -40,6 +40,8 @@ are small.
module type.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.split.rst
@@ -23,6 +23,8 @@ e.g. in a training-test split, store only the test document list, as the trainin
a case, just put the smaller set first and don't request the optional output `doc_list2`.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.store.rst
@@ -19,6 +19,8 @@ produced corpus for further use, rather than always running the filters/readers
each time the corpus' documents are needed.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.subsample.rst
@@ -12,6 +12,8 @@ Random subsample
Randomly subsample documents of a corpus at a given rate to create a smaller corpus.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.corpora.subset.rst
@@ -23,6 +23,8 @@ over the data to count them up.

This is a filter module. It is not executable, so won't appear in a pipeline's list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.

*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

4 changes: 4 additions & 0 deletions docs/modules/pimlico.modules.embeddings.word2vec.rst
@@ -16,6 +16,10 @@ Find out more about `word2vec <https://code.google.com/archive/p/word2vec/>`_.
This module is simply a wrapper to call `Gensim Python (+C) <https://radimrehurek.com/gensim/models/word2vec.html>`_'s
implementation of word2vec on a Pimlico corpus.

Does not support Python 2 since Gensim has dropped Python 2 support.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======
4 changes: 4 additions & 0 deletions docs/modules/pimlico.modules.gensim.lda.rst
@@ -12,11 +12,15 @@ LDA trainer
Trains LDA using Gensim's `basic LDA implementation <https://radimrehurek.com/gensim/models/ldamodel.html>`_,
or `the multicore version <https://radimrehurek.com/gensim/models/ldamulticore.html>`_.

Does not support Python 2, since Gensim has dropped Python 2 support.

.. todo::

Add test pipeline and test


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

4 changes: 4 additions & 0 deletions docs/modules/pimlico.modules.gensim.lda_doc_topics.rst
@@ -16,11 +16,15 @@ in each sentence of each document. It is assumed that the corpus uses the same v
to map to integer IDs as the LDA model's training corpus, so no further mapping needs to
be done.

Does not support Python 2 since Gensim has dropped Python 2 support.

.. todo::

Add test pipeline and test


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

@@ -19,6 +19,8 @@ offered by Facebook AI.

Reads only the binary format (``.bin``), not the text format (``.vec``).

Does not support Python 2, since Gensim has dropped Python 2 support.

.. seealso::

:mod:`pimlico.modules.input.embeddings.fasttext`:
@@ -30,6 +32,8 @@ Reads only the binary format (``.bin``), not the text format (``.vec``).
file, which is harder to produce, since you can't easily just truncate a big file.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.input.embeddings.glove.rst
@@ -23,6 +23,8 @@ data structure. This is not enforced by the dependency check, since we're not ab
to require a specific version yet.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

2 changes: 2 additions & 0 deletions docs/modules/pimlico.modules.input.embeddings.word2vec.rst
@@ -17,6 +17,8 @@ Can be used, for example, to read the pre-trained embeddings
`offered by Google <https://code.google.com/archive/p/word2vec/>`_.


*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*

Inputs
======

@@ -26,3 +26,4 @@ class ModuleInfo(DocumentMapModuleInfo):
module_inputs = [("corpus", GroupedCorpus(TokenizedDocumentType()))]
module_outputs = [("corpus", GroupedCorpus(TokenizedDocumentType()))]
module_options = {}
module_supports_python2 = True
50 changes: 40 additions & 10 deletions src/python/pimlico/core/modules/base.py
@@ -53,38 +53,50 @@ class BaseModuleInfo(object):
     module_type_name = None
     module_readable_name = None
     module_options = {}
-    """ Specifies a list of (name, datatype class) pairs for inputs that are always required """
     module_inputs = []
+    """ Specifies a list of (name, datatype instance) pairs for inputs that are always required """
+    module_optional_inputs = []
     """
-    Specifies a list of (name, datatype class) pairs for optional inputs. The module's execution may
+    Specifies a list of (name, datatype instance) pairs for optional inputs. The module's execution may
     vary depending on what is provided. If these are not given, None is returned from get_input()
     """
-    module_optional_inputs = []
-    """ Specifies a list of (name, datatype class) pairs for outputs that are always written """
     module_outputs = []
+    """ Specifies a list of (name, datatype instance) pairs for outputs that are always written """
+    module_optional_outputs = []
     """
-    Specifies a list of (name, datatype class) pairs for outputs that are written only if they're specified
+    Specifies a list of (name, datatype instance) pairs for outputs that are written only if they're specified
     in the "output" option or used by another module
     """
-    module_optional_outputs = []
+    module_output_groups = []
     """
     List of output groups: (group_name, [output_name1, ...]).
     Further groups may be added by build_output_groups().
     """
-    module_output_groups = []
+    module_executable = True
     """
     Whether the module should be executed
     Typically True for almost all modules, except input modules (though some of them may also require execution) and
     filters
     """
-    module_executable = True
-    """ If specified, this ModuleExecutor class will be used instead of looking one up in the exec Python module """
     module_executor_override = None
+    """ If specified, this ModuleExecutor class will be used instead of looking one up in the exec Python module """
+    main_module = None
     """
     Usually None. In the case of stages of a multi-stage module, stores a pointer to the main module.
     """
-    main_module = None
+    module_supports_python2 = False
+    """
+    Most core Pimlico modules support use in Python 2 and 3. Modules that do should set
+    this to True. If it is False, the module is assumed to work only in Python 3.
+    Since Python 2 compatibility requires extra work from the programmer, this is
+    False by default.
+    To check whether a module can be used in Python 2, call ``supports_python2()``,
+    which will check this and also input and output datatypes.
+    """

     def __init__(self, module_name, pipeline, inputs={}, options={}, optional_outputs=[],
                  docstring="", include_outputs=[], alt_expanded_from=None, alt_param_settings=[], module_variables={}):
@@ -131,6 +143,24 @@ def __init__(self, module_name, pipeline, inputs={}, options={}, optional_output
     def __repr__(self):
         return "%s(%s)" % (self.module_type_name, self.module_name)

+    @classmethod
+    def supports_python2(cls):
+        """
+        :return: True if the module can be run in Python 2 and 3, False if it
+            only supports Python 3.
+        """
+        if not cls.module_supports_python2:
+            # The module itself does not support Python 2
+            return False
+        # Also check all the input and output datatypes
+        for inout_list in [cls.module_inputs, cls.module_optional_inputs, cls.module_outputs, cls.module_optional_outputs]:
+            for inout_name, datatype in inout_list:
+                if not datatype.supports_python2():
+                    return False
+        # Everything supports Python 2 and 3
+        return True
+
     def load_executor(self):
         """
         Loads a ModuleExecutor for this Pimlico module. Usually, this just involves calling
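
To illustrate how the module-level flag and the datatype checks combine in `supports_python2()`, here is a self-contained sketch. It mirrors the logic in this hunk but uses hypothetical stand-ins (`FakeDatatype`, `SketchModuleInfo`, `Py2Module`, `Py3OnlyOutput`) rather than Pimlico's real classes, and it assumes every input and output entry is a `(name, datatype instance)` pair.

```python
class FakeDatatype(object):
    """Hypothetical stand-in for a Pimlico datatype instance."""

    def __init__(self, py2):
        self.py2 = py2

    def supports_python2(self):
        return self.py2


class SketchModuleInfo(object):
    # Defaults follow the commit: modules are Python 3-only unless they say otherwise
    module_supports_python2 = False
    module_inputs = []
    module_optional_inputs = []
    module_outputs = []
    module_optional_outputs = []

    @classmethod
    def supports_python2(cls):
        if not cls.module_supports_python2:
            # The module itself does not support Python 2
            return False
        # Every input and output datatype must also support Python 2
        all_io = (cls.module_inputs + cls.module_optional_inputs +
                  cls.module_outputs + cls.module_optional_outputs)
        return all(datatype.supports_python2() for _name, datatype in all_io)


class Py2Module(SketchModuleInfo):
    module_supports_python2 = True
    module_inputs = [("corpus", FakeDatatype(True))]
    module_outputs = [("corpus", FakeDatatype(True))]


class Py3OnlyOutput(Py2Module):
    # Declares Py2 support itself, but one output datatype is Python 3-only
    module_outputs = [("model", FakeDatatype(False))]
```

In this sketch `Py3OnlyOutput` is reported as Python 3-only even though its own flag is True, because a single unsupported datatype is enough to rule the module out.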
6 changes: 5 additions & 1 deletion src/python/pimlico/core/modules/inputs.py
@@ -269,6 +269,9 @@ class DatatypeInputModuleInfo(InputModuleInfo):
"required": True,
},
}
# Set module to support Python 2, since it doesn't do anything
# If the datatype doesn't support Python 2, this will get checked anyway
module_supports_python2 = True

def instantiate_output_reader_setup(self, output_name, datatype):
# Create a reader setup that just has the given directory as a possible location for the data
@@ -280,7 +283,7 @@ def instantiate_output_reader_setup(self, output_name, datatype):
 def iterable_input_reader(input_module_options, data_point_type,
                           data_ready_fn, len_fn=None, iter_fn=None,
                           module_type_name=None, module_readable_name=None,
-                          software_dependencies=None, execute_count=False, no_group=False):
+                          software_dependencies=None, execute_count=False, no_group=False, python2=False):
     """
     Factory for creating an input reader module info.
     This is a (typically) non-executable module that has no
@@ -402,6 +405,7 @@ class IterableInputReaderModuleInfo(InputModuleInfo):
module_readable_name = mr_name
module_outputs = [("corpus", output_datatype)]
module_options = input_module_options
module_supports_python2 = python2

# Special behaviour if we're making this an executable module in order to count the data
module_executable = execute_count
1 change: 1 addition & 0 deletions src/python/pimlico/core/modules/map/filter.py
@@ -172,6 +172,7 @@ class ModuleInfo(BaseModuleInfo):
module_outputs = module_info_instance.module_outputs
module_optional_outputs = []
module_executable = False
module_supports_python2 = module_info_instance.module_supports_python2

def instantiate_output_reader_setup(self, output_name, datatype):
return FilterModuleOutputReader.Setup(datatype, module_info_instance, output_name)
2 changes: 2 additions & 0 deletions src/python/pimlico/datatypes/arrays.py
@@ -25,6 +25,7 @@ class NumpyArray(NamedFileCollection):
"""
datatype_name = "numpy_array"
datatype_supports_python2 = True

def __init__(self, *args, **kwargs):
super(NumpyArray, self).__init__(["array.npy"], *args, **kwargs)
@@ -57,6 +58,7 @@ class ScipySparseMatrix(NamedFileCollection):
"""
datatype_name = "scipy_sparse_array"
datatype_supports_python2 = True

def __init__(self, *args, **kwargs):
super(ScipySparseMatrix, self).__init__(["array.mtx"], *args, **kwargs)
39 changes: 39 additions & 0 deletions src/python/pimlico/datatypes/base.py
@@ -256,6 +256,19 @@ class PimlicoDatatype(with_metaclass(PimlicoDatatypeMeta, object)):
shell_commands = []
"""
Override to provide shell commands specific to this datatype. Should include the superclass' list.
"""
datatype_supports_python2 = True
"""
Most core Pimlico datatypes support use in Python 2 and 3. Datatypes that do should set
this to True. If it is False, the datatype is assumed to work only in Python 3.
Python 2 compatibility requires extra work from the programmer. Datatypes should
generally declare whether or not they provide this support by overriding this
explicitly.
Use ``supports_python2()`` to check whether a datatype instance supports Python 2.
(There may be reasons for a datatype's instance to override this class-level setting.)
"""

def __init__(self, *args, **kwargs):
@@ -287,6 +300,13 @@ def __init__(self, *args, **kwargs):
# Build a better name out of the class name
self.datatype_name = _class_name_word_boundary.sub(r"\1_\2", type(self).__name__).lower()

def supports_python2(self):
"""
By default, just returns cls.datatype_supports_python2. Subclasses might override this.
"""
return self.datatype_supports_python2

def get_software_dependencies(self):
"""
Get a list of all software required to **read** this datatype. This is
@@ -883,6 +903,14 @@ class DynamicOutputDatatype(object):
The dynamic type must provide certain pieces of information needed for typechecking.
If a base datatype is available (i.e. indication of the datatype before the module is
instantiated), we take the information regarding whether the datatype supports
Python 2 from there. If not, we assume it does. This may seem the opposite of other
places: for example, the base datatype says it does **not** support Python 2 and subclasses
must declare if they do. However, dynamic output datatypes are often used with modules
that work with a broad range of input datatypes. It is therefore wrong to say that they
do not support Python 2, since they will, provided the input module does.
"""
"""
Must be provided by subclasses: can be a noncommittal string giving some idea of what types may be provided.
@@ -904,6 +932,14 @@ def get_base_datatype(self):
"""
return None

def supports_python2(self):
base_dt = self.get_base_datatype()
if base_dt is None:
# Can't say whether this supports Py2 or not, so we say it does
return True
else:
return base_dt.supports_python2()


class DynamicInputDatatypeRequirement(object):
"""
@@ -1029,6 +1065,9 @@ class MultipleInputs(object):
def __init__(self, datatype_requirements):
self.datatype_requirements = datatype_requirements

def supports_python2(self):
return self.datatype_requirements.supports_python2()


class TypeFromInput(DynamicOutputDatatype):
"""
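
The fallback added to `DynamicOutputDatatype.supports_python2()` in `datatypes/base.py` can be sketched in isolation as follows. `FixedDatatype` and `SketchDynamicOutput` are hypothetical stand-ins, and passing the base datatype through the constructor is an assumption made for the example; in the hunks above the only method shown is `get_base_datatype()`, which returns None by default.

```python
class FixedDatatype(object):
    """Hypothetical datatype whose Python 2 support is known up front."""

    def __init__(self, py2):
        self.py2 = py2

    def supports_python2(self):
        return self.py2


class SketchDynamicOutput(object):
    """Hypothetical dynamic output type with an optional base datatype."""

    def __init__(self, base_datatype=None):
        self._base_datatype = base_datatype

    def get_base_datatype(self):
        return self._base_datatype

    def supports_python2(self):
        base_dt = self.get_base_datatype()
        if base_dt is None:
            # Nothing is known before instantiation, so assume Python 2 is supported
            return True
        # Otherwise defer to the base datatype's own declaration
        return base_dt.supports_python2()
```

The optimistic default means a dynamic output never rules out Python 2 by itself; any restriction has to come from the module flag or from a concrete datatype.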
