Make explicit Py2 support for modules & datatypes

Previously, everything in the core modules was assumed to be Py 2+3 compatible, using the future library. However, Gensim no longer supports Py2, so the modules that depend on Gensim cannot be Py2 compatible. The same will surely happen for other dependencies. Now all modules and datatypes declare whether they support Python 2. Most of the core modules do. The documentation for each module states explicitly when it does not. When running test pipelines under Python 2, tests are skipped for any modules that don't support Py 2. All test pipelines now pass under both Py 2 (excluding the ones that are skipped) and Py 3.
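
The declaration-and-check scheme described above can be sketched roughly as follows. This is a minimal, standalone illustration, not the Pimlico code itself: `Datatype` and `ModuleInfo` are simplified stand-ins for `PimlicoDatatype` and `BaseModuleInfo`, the example subclasses are made up, and `should_skip_test` is a hypothetical helper that only approximates how the test runner might decide to skip a pipeline under Python 2.

```python
class Datatype(object):
    # Stand-in for PimlicoDatatype: the class-level flag defaults to True
    datatype_supports_python2 = True

    def supports_python2(self):
        # Instance-level check, so subclasses may override the class setting
        return self.datatype_supports_python2


class Py3OnlyDatatype(Datatype):
    # Hypothetical datatype that relies on a Python 3-only dependency
    datatype_supports_python2 = False


class ModuleInfo(object):
    # Stand-in for BaseModuleInfo: modules are Python 3-only unless they say otherwise
    module_inputs = []
    module_optional_inputs = []
    module_outputs = []
    module_optional_outputs = []
    module_supports_python2 = False

    @classmethod
    def supports_python2(cls):
        if not cls.module_supports_python2:
            # The module itself does not declare Python 2 support
            return False
        # The module's declaration is not enough: every datatype must agree
        for inout_list in (cls.module_inputs, cls.module_optional_inputs,
                           cls.module_outputs, cls.module_optional_outputs):
            for _name, datatype in inout_list:
                if not datatype.supports_python2():
                    return False
        return True


class Py2CompatibleModule(ModuleInfo):
    # Hypothetical module: declares support and uses only Py2-compatible datatypes
    module_inputs = [("corpus", Datatype())]
    module_outputs = [("corpus", Datatype())]
    module_supports_python2 = True


class Py3OnlyOutputModule(ModuleInfo):
    # Hypothetical module: declares support, but one output datatype does not
    module_inputs = [("corpus", Datatype())]
    module_outputs = [("model", Py3OnlyDatatype())]
    module_supports_python2 = True


def should_skip_test(module_classes, running_python2):
    """Hypothetical helper: skip under Python 2 if any module lacks Python 2 support."""
    return running_python2 and not all(m.supports_python2() for m in module_classes)
```

With these stand-ins, `Py3OnlyOutputModule.supports_python2()` is false even though the module sets `module_supports_python2 = True`, because its output datatype opts out. A pipeline containing it would be skipped under Python 2 and run as normal under Python 3.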
markgw · Aug 6, 2020 · 4b808f4
1 parent 95eb720
commit 4b808f4
Showing 105 changed files with 313 additions and 22 deletions.
diff --git a/docs/modules/pimlico.modules.corpora.concat.rst b/docs/modules/pimlico.modules.corpora.concat.rst
@@ -16,6 +16,8 @@ They must have the same data point type, or one must be a subtype of the other.
 
 This is a filter module. It is not executable, so won't appear in a pipeline's list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.format.rst b/docs/modules/pimlico.modules.corpora.format.rst
@@ -20,6 +20,8 @@ formatting operations are designed for display, this is generally only useful to
 consumption.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.group.rst b/docs/modules/pimlico.modules.corpora.group.rst
@@ -32,6 +32,8 @@ and the grouping will be preserved as the corpus passes through the pipeline.
 
 This is a filter module. It is not executable, so won't appear in a pipeline's list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.interleave.rst b/docs/modules/pimlico.modules.corpora.interleave.rst
@@ -25,6 +25,8 @@ not currently implemented and may not be worth the trouble. Perhaps we will add
 
 This is a filter module. It is not executable, so won't appear in a pipeline's list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.list_filter.rst b/docs/modules/pimlico.modules.corpora.list_filter.rst
@@ -13,6 +13,8 @@ Similar to :mod:`~pimlico.modules.corpora.split`, but instead of taking a random
 according to a given list of documents, putting those in the list in one set and the rest in another.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.shuffle.rst b/docs/modules/pimlico.modules.corpora.shuffle.rst
@@ -35,6 +35,11 @@ comes from a filter module, its documents cannot be randomly accessed.
    Since this requires some tricky changes to the datatype system, I'm not
    implementing it now, but it should be done in future.
 
+   It will be implemented as part of the replacement of ``GroupedCorpus``
+   by ``StoredIterableCorpus``: `https://github.com/markgw/pimlico/issues/24`_
+
+
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
 
 Inputs
 ======

diff --git a/docs/modules/pimlico.modules.corpora.shuffle_linear.rst b/docs/modules/pimlico.modules.corpora.shuffle_linear.rst
@@ -40,6 +40,8 @@ are small.
       module type.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.split.rst b/docs/modules/pimlico.modules.corpora.split.rst
@@ -23,6 +23,8 @@ e.g. in a training-test split, store only the test document list, as the trainin
 a case, just put the smaller set first and don't request the optional output `doc_list2`.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.store.rst b/docs/modules/pimlico.modules.corpora.store.rst
@@ -19,6 +19,8 @@ produced corpus for further use, rather than always running the filters/readers
 each time the corpus' documents are needed.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.subsample.rst b/docs/modules/pimlico.modules.corpora.subsample.rst
@@ -12,6 +12,8 @@ Random subsample
 Randomly subsample documents of a corpus at a given rate to create a smaller corpus.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.corpora.subset.rst b/docs/modules/pimlico.modules.corpora.subset.rst
@@ -23,6 +23,8 @@ over the data to count them up.
 
 This is a filter module. It is not executable, so won't appear in a pipeline's list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.embeddings.word2vec.rst b/docs/modules/pimlico.modules.embeddings.word2vec.rst
@@ -16,6 +16,10 @@ Find out more about `word2vec <https://code.google.com/archive/p/word2vec/>`_.
 This module is simply a wrapper to call `Gensim Python (+C) <https://radimrehurek.com/gensim/models/word2vec.html>`_'s
 implementation of word2vec on a Pimlico corpus.
 
+Does not support Python 2 since Gensim has dropped Python 2 support.
+
+
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
 
 Inputs
 ======

diff --git a/docs/modules/pimlico.modules.gensim.lda.rst b/docs/modules/pimlico.modules.gensim.lda.rst
@@ -12,11 +12,15 @@ LDA trainer
 Trains LDA using Gensim's `basic LDA implementation <https://radimrehurek.com/gensim/models/ldamodel.html>`_,
 or `the multicore version <https://radimrehurek.com/gensim/models/ldamulticore.html>`_.
 
+Does not support Python 2, since Gensim has dropped Python 2 support.
+
 .. todo::
 
    Add test pipeline and test
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.gensim.lda_doc_topics.rst b/docs/modules/pimlico.modules.gensim.lda_doc_topics.rst
@@ -16,11 +16,15 @@ in each sentence of each document. It is assumed that the corpus uses the same v
 to map to integer IDs as the LDA model's training corpus, so no further mapping needs to
 be done.
 
+Does not support Python 2 since Gensim has dropped Python 2 support.
+
 .. todo::
 
    Add test pipeline and test
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.input.embeddings.fasttext_gensim.rst b/docs/modules/pimlico.modules.input.embeddings.fasttext_gensim.rst
@@ -19,6 +19,8 @@ offered by Facebook AI.
 
 Reads only the binary format (``.bin``), not the text format (``.vec``).
 
+Does not support Python 2, since Gensim has dropped Python 2 support.
+
 .. seealso::
 
    :mod:`pimlico.modules.input.embeddings.fasttext`:
@@ -30,6 +32,8 @@ Reads only the binary format (``.bin``), not the text format (``.vec``).
    file, which is harder to produce, since you can't easily just truncate a big file.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.input.embeddings.glove.rst b/docs/modules/pimlico.modules.input.embeddings.glove.rst
@@ -23,6 +23,8 @@ data structure. This is not enforced by the dependency check, since we're not ab
 to require a specific version yet.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/docs/modules/pimlico.modules.input.embeddings.word2vec.rst b/docs/modules/pimlico.modules.input.embeddings.word2vec.rst
@@ -17,6 +17,8 @@ Can be used, for example, to read the pre-trained embeddings
 `offered by Google <https://code.google.com/archive/p/word2vec/>`_.
 
 
+*This module does not support Python 2, so can only be used when Pimlico is being run under Python 3*
+
 Inputs
 ======
 

diff --git a/examples/simple/src/pim_example/modules/filter_prop_nns/info.py b/examples/simple/src/pim_example/modules/filter_prop_nns/info.py
@@ -26,3 +26,4 @@ class ModuleInfo(DocumentMapModuleInfo):
     module_inputs = [("corpus", GroupedCorpus(TokenizedDocumentType()))]
     module_outputs = [("corpus", GroupedCorpus(TokenizedDocumentType()))]
     module_options = {}
+    module_supports_python2 = True
diff --git a/src/python/pimlico/core/modules/base.py b/src/python/pimlico/core/modules/base.py
@@ -53,38 +53,50 @@ class BaseModuleInfo(object):
     module_type_name = None
     module_readable_name = None
     module_options = {}
-    """ Specifies a list of (name, datatype class) pairs for inputs that are always required """
     module_inputs = []
+    """ Specifies a list of (name, datatype instance) pairs for inputs that are always required """
+    module_optional_inputs = []
     """ 
-    Specifies a list of (name, datatype class) pairs for optional inputs. The module's execution may 
+    Specifies a list of (name, datatype instance) pairs for optional inputs. The module's execution may 
     vary depending on what is provided. If these are not given, None is returned from get_input() 
     """
-    module_optional_inputs = []
-    """ Specifies a list of (name, datatype class) pairs for outputs that are always written """
     module_outputs = []
+    """ Specifies a list of (name, datatype instance) pairs for outputs that are always written """
+    module_optional_outputs = []
     """
-    Specifies a list of (name, datatype class) pairs for outputs that are written only if they're specified
+    Specifies a list of (name, datatype instance) pairs for outputs that are written only if they're specified
     in the "output" option or used by another module
     """
-    module_optional_outputs = []
+    module_output_groups = []
     """
     List of output groups: (group_name, [output_name1, ...]).
     Further groups may be added by build_output_groups().
     """
-    module_output_groups = []
+    module_executable = True
     """
     Whether the module should be executed
     Typically True for almost all modules, except input modules (though some of them may also require execution) and
     filters
     """
-    module_executable = True
-    """ If specified, this ModuleExecutor class will be used instead of looking one up in the exec Python module """
     module_executor_override = None
+    """ If specified, this ModuleExecutor class will be used instead of looking one up in the exec Python module """
+    main_module = None
     """
     Usually None. In the case of stages of a multi-stage module, stores a pointer to the main module.
 
     """
-    main_module = None
+    module_supports_python2 = False
+    """
+    Most core Pimlico modules support use in Python 2 and 3. Modules that do should set 
+    this to True. If it is False, the module is assumed to work only in Python 3.
+    
+    Since Python 2 compatibility requires extra work from the programmer, this is 
+    False by default.
+    
+    To check whether a module can be used in Python 2, call ``supports_python2()``, 
+    which will check this and also input and output datatypes.
+    
+    """
 
     def __init__(self, module_name, pipeline, inputs={}, options={}, optional_outputs=[],
                  docstring="", include_outputs=[], alt_expanded_from=None, alt_param_settings=[], module_variables={}):
@@ -131,6 +143,24 @@ def __init__(self, module_name, pipeline, inputs={}, options={}, optional_output
     def __repr__(self):
         return "%s(%s)" % (self.module_type_name, self.module_name)
 
+    @classmethod
+    def supports_python2(cls):
+        """
+        :return: True if the module can be run in Python 2 and 3, False if it
+           only supports Python 3.
+
+        """
+        if not cls.module_supports_python2:
+            # The module itself does not support Python 2
+            return False
+        # Also check all the input and output datatypes
+        for inout_list in [cls.module_inputs, cls.module_optional_inputs, cls.module_outputs, cls.module_optional_outputs]:
+            for inout_name, datatype in inout_list:
+                if not datatype.supports_python2():
+                    return False
+        # Everything supports Python 2 and 3
+        return True
+
     def load_executor(self):
         """
         Loads a ModuleExecutor for this Pimlico module. Usually, this just involves calling

diff --git a/src/python/pimlico/core/modules/inputs.py b/src/python/pimlico/core/modules/inputs.py
@@ -269,6 +269,9 @@ class DatatypeInputModuleInfo(InputModuleInfo):
                 "required": True,
             },
         }
+        # Set module to support Python 2, since it doesn't do anything
+        # If the datatype doesn't support Python 2, this will get checked anyway
+        module_supports_python2 = True
 
         def instantiate_output_reader_setup(self, output_name, datatype):
             # Create a reader setup that just has the given directory as a possible location for the data
@@ -280,7 +283,7 @@ def instantiate_output_reader_setup(self, output_name, datatype):
 def iterable_input_reader(input_module_options, data_point_type,
                           data_ready_fn, len_fn=None, iter_fn=None,
                           module_type_name=None, module_readable_name=None,
-                          software_dependencies=None, execute_count=False, no_group=False):
+                          software_dependencies=None, execute_count=False, no_group=False, python2=False):
     """
     Factory for creating an input reader module info.
     This is a (typically) non-executable module that has no
@@ -402,6 +405,7 @@ class IterableInputReaderModuleInfo(InputModuleInfo):
         module_readable_name = mr_name
         module_outputs = [("corpus", output_datatype)]
         module_options = input_module_options
+        module_supports_python2 = python2
 
         # Special behaviour if we're making this an executable module in order to count the data
         module_executable = execute_count

diff --git a/src/python/pimlico/core/modules/map/filter.py b/src/python/pimlico/core/modules/map/filter.py
@@ -172,6 +172,7 @@ class ModuleInfo(BaseModuleInfo):
         module_outputs = module_info_instance.module_outputs
         module_optional_outputs = []
         module_executable = False
+        module_supports_python2 = module_info_instance.module_supports_python2
 
         def instantiate_output_reader_setup(self, output_name, datatype):
             return FilterModuleOutputReader.Setup(datatype, module_info_instance, output_name)

diff --git a/src/python/pimlico/datatypes/arrays.py b/src/python/pimlico/datatypes/arrays.py
@@ -25,6 +25,7 @@ class NumpyArray(NamedFileCollection):
 
     """
     datatype_name = "numpy_array"
+    datatype_supports_python2 = True
 
     def __init__(self, *args, **kwargs):
         super(NumpyArray, self).__init__(["array.npy"], *args, **kwargs)
@@ -57,6 +58,7 @@ class ScipySparseMatrix(NamedFileCollection):
 
     """
     datatype_name = "scipy_sparse_array"
+    datatype_supports_python2 = True
 
     def __init__(self, *args, **kwargs):
         super(ScipySparseMatrix, self).__init__(["array.mtx"], *args, **kwargs)

diff --git a/src/python/pimlico/datatypes/base.py b/src/python/pimlico/datatypes/base.py
@@ -256,6 +256,19 @@ class PimlicoDatatype(with_metaclass(PimlicoDatatypeMeta, object)):
     shell_commands = []
     """
     Override to provide shell commands specific to this datatype. Should include the superclass' list.
+    """
+    datatype_supports_python2 = True
+    """
+    Most core Pimlico datatypes support use in Python 2 and 3. Datatypes that do should set 
+    this to True. If it is False, the datatype is assumed to work only in Python 3.
+    
+    Python 2 compatibility requires extra work from the programmer. Datatypes should 
+    generally declare whether or not they provide this support by overriding this
+    explicitly.
+    
+    Use ``supports_python2()`` to check whether a datatype instance supports Python 2. 
+    (There may be reasons for a datatype's instance to override this class-level setting.)
+    
     """
 
     def __init__(self, *args, **kwargs):
@@ -287,6 +300,13 @@ def __init__(self, *args, **kwargs):
             # Build a better name out of the class name
             self.datatype_name = _class_name_word_boundary.sub(r"\1_\2", type(self).__name__).lower()
 
+    def supports_python2(self):
+        """
+        By default, just returns cls.datatype_supports_python2. Subclasses might override this.
+
+        """
+        return self.datatype_supports_python2
+
     def get_software_dependencies(self):
         """
         Get a list of all software required to **read** this datatype. This is
@@ -883,6 +903,14 @@ class DynamicOutputDatatype(object):
 
     The dynamic type must provide certain pieces of information needed for typechecking.
 
+    If a base datatype is available (i.e. indication of the datatype before the module is
+    instantiated), we take the information regarding whether the datatype supports
+    Python 2 from there. If not, we assume it does. This may seem the opposite to other
+    places: for example, the base datatype says it does **not** support Python 2 and subclasses
+    must declare if they do. However, dynamic output datatypes are often used with modules
+    that work with a broad range of input datatypes. It is therefore wrong to say that they
+    do not support Python 2, since they will, provided the input module does.
+
     """
     """
     Must be provided by subclasses: can be a noncommittal string giving some idea of what types may be provided.
@@ -904,6 +932,14 @@ def get_base_datatype(self):
         """
         return None
 
+    def supports_python2(self):
+        base_dt = self.get_base_datatype()
+        if base_dt is None:
+            # Can't say whether this supports Py2 or not, so we say it does
+            return True
+        else:
+            return base_dt.supports_python2()
+
 
 class DynamicInputDatatypeRequirement(object):
     """
@@ -1029,6 +1065,9 @@ class MultipleInputs(object):
     def __init__(self, datatype_requirements):
         self.datatype_requirements = datatype_requirements
 
+    def supports_python2(self):
+        return self.datatype_requirements.supports_python2()
+
 
 class TypeFromInput(DynamicOutputDatatype):
     """