Added new intro to docs

Added an introductory document, describing the key concepts of Pimlico, based on the system description paper currently under (non-anonymous) review. Updated some bits of docs here and there.
markgw · Aug 10, 2020 · bc98505
1 parent 125b762
commit bc98505

Showing 8 changed files with 288 additions and 9 deletions.
diff --git a/docs/guides/bootstrap.rst b/docs/guides/bootstrap.rst
@@ -1,5 +1,5 @@
-Running a pipeline
-==================
+Running someone else's pipeline
+===============================
 
 This guide takes you through what to do if you have received someone else's
 code for a Pimlico project and would like to run it.

diff --git a/docs/guides/index.rst b/docs/guides/index.rst
@@ -7,6 +7,7 @@ Step-by-step guides through common tasks while using Pimlico.
 .. toctree::
    :maxdepth: 2
 
+   intro
    fast_setup
    setup
    bootstrap

diff --git a/docs/guides/intro.rst b/docs/guides/intro.rst
@@ -0,0 +1,272 @@
+=======================
+Introduction to Pimlico
+=======================
+
+Motivation
+==========
+It is becoming more and more common for conferences and journals in NLP and other
+computational areas to encourage, or even require, authors to make publicly
+available the code and data required to reproduce their reported results. It is
+now widely acknowledged that such practices lie at the center of open science
+and are essential to ensuring that
+research contributions are verifiable, extensible and usable in applications.
+
+However, this requires extensive additional work. Moreover,
+even when researchers do this, it is all too common for others
+to have to spend large amounts of time and effort preparing data, downloading
+and installing tools, configuring execution environments and picking through
+instructions and scripts before they can reproduce the original results, never mind
+apply the code to new datasets or build upon it in novel research.
+
+Introducing Pimlico
+===================
+Pimlico (**Pi**\ pelined **M**\ odular **Li**\ nguistic **Co**\ rpus processing) addresses these problems.
+It allows users to write and run potentially complex processing pipelines, with
+the key goals of making it easy to:
+
+  - clearly document what was done;
+  - incorporate standard NLP and data-processing tasks with minimal effort;
+  - integrate non-standard code, specific to the task at hand, into the same pipeline; and
+  - distribute code for later reproduction or application to other datasets or experiments.
+
+It comes with pre-defined **module types** to wrap a number of existing **NLP toolkits**
+(including non-Python code) and carry out many other common pre-processing or
+data manipulation tasks.
+
+Building pipelines
+==================
+Pimlico addresses the task of **building pipelines to process large datasets**.
+It allows you to run one or several steps of
+processing at a time, with high-level control over how each step is run,
+manages the data produced by each step,
+and lets you observe these intermediate outputs.
+Pimlico provides simple, powerful
+tools to give this kind of control, without requiring you to write any code.
+
+Developing a pipeline with Pimlico involves defining the **structure of the pipeline**
+itself in terms of **modules** to be executed, and
+**connections between their inputs and outputs** describing the flow of data.
+
+Each module corresponds to a piece of data-processing code, together with its parameters.
+They may be of a standard type, so-called **core module types**, for which
+code is provided as part of Pimlico.
+
+A pipeline may also incorporate
+**custom module types**, for which metadata and data-processing code must be
+provided by the author.
+
+Pipeline configuration
+======================
+
+   See :doc:`/core/config` for more on pipeline configuration.
+
+
+At the heart of Pimlico is the concept of a **pipeline configuration**, defined
+by a configuration (or *conf*) file,
+which can be loaded and executed.
+
+This specifies
+some general parameters and metadata regarding the pipeline and then a sequence of
+modules to be executed.
+
+Each **pipeline module** is defined by a named section in
+the file, which specifies the module type, inputs to be read from the outputs
+of other, previous modules, and parameters.
+
+For example, the following configuration section defines
+a module called ``split``. Its type is the core Pimlico module
+type :mod:`corpus split <pimlico.modules.corpora.split>`,
+which splits a corpus by documents
+into two randomly sampled subsets (as is typically done to produce training
+and test sets).
+
+.. code-block:: ini
+
+   [split]
+   type=pimlico.modules.corpora.split
+   input=tokenized_corpus
+   set1_size=0.8
+
+The option ``input`` specifies where the
+module's only input comes from and refers by name to a module defined
+earlier in the pipeline whose output provides the data.
+The option ``set1_size`` tells the module
+to put 80% of documents into the first set and 20% in the second. Two
+outputs are produced, which can be referred to later in the pipeline as
+``split.set1`` and ``split.set2``.
+
+Input modules
+-------------
+The first module(s) of a pipeline have no inputs, but load datasets,
+with parameters to specify where the input data can be found on the filesystem.
+
+A number of :mod:`standard input readers <pimlico.modules.input>`
+are among Pimlico's core module types to
+support reading of simple datasets, such as text files in a directory, and
+some standard input formats for data such as word embeddings. The toolkit
+also provides a factory to make it easy to define custom
+routines for reading other types of input data.
+
+Module type
+-----------
+The **type** of a module is given as a fully qualified Python path to a
+Python package.
+
+The package provides separately the module type's metadata,
+referred to as its *module info* –
+input datatypes, options, etc. – and the code that is executed when
+it is run, the *module executor*.
+The example above uses one of
+Pimlico's core module types.
+
+A pipeline will
+usually also include non-standard module types, distributed
+together with the conf file. These are defined and used in exactly
+the same way as the core module types.
+Where custom module types are used, the pipeline conf file specifies a directory
+where the source code can be found.
+
+.. seealso:: :doc:`Full worked example </example_config/simple.custom_module>`
+
+   An example of a complete pipeline conf, using both core and
+   custom module types
+
+
+
+Datatypes
+=========
+
+When a module is run, its output is
+stored ready for use by subsequent modules. Pimlico takes care
+of storing each module's output in a separate location and providing the
+correct data as input.
+
+The module info for a module type defines a **datatype**
+for each input and each output.
+Pimlico includes a system of datatypes for the datasets
+that are passed between modules.
+
+When a pipeline is loaded, type-checking is performed on
+the connections between modules' outputs and subsequent modules' inputs to
+ensure that appropriate datatypes are provided.
+
+For example, a module may
+require as one of its inputs a vocabulary, for which Pimlico provides a
+:class:`standard datatype <pimlico.datatypes.dictionary.Dictionary>`.
+The pipeline will only be loaded if this input is
+connected to an output that supplies a compatible type. The supplying
+module does not need to define how to store a vocabulary, since the datatype
+defines the necessary routines for **writing a vocabulary to disk**. The
+subsequent module does not need to define how to **read the data** either, since
+the datatype takes care of that too, providing the module executor
+with suitable Python data structures.
+
+Corpora
+-------
+Often modules read and write **corpora**, consisting of a large number
+of documents. Pimlico provides a datatype for representing such corpora and
+a further type system for the **types of the documents**
+stored within a corpus
+(rather like Java's *generic* types).
+
+For example, a module may
+specify that it requires as input a corpus whose documents contain
+tokenized text. All tokenizer modules (of which there are several)
+provide output corpora with this document type. The corpus
+datatype takes care of reading and writing large
+corpora, preserving the order of documents, storing corpus metadata, and
+much more.
+
+The datatype system is also extensible in custom code. As well
+as defining custom module types, a pipeline author may wish to define new
+datatypes to represent the data required as
+input to the modules or provided as output.
+
+.. seealso::
+
+   :class:`~pimlico.datatypes.corpora.base.IterableCorpus`: datatype for corpora.
+
+
+Running a pipeline
+==================
+
+Pimlico provides a command-line interface for parsing and executing
+pipelines. The interface provides sub-commands to
+perform different operations relating to a given pipeline.
+The conf file defining the pipeline is always given as
+an argument and the first operation is therefore to parse the
+pipeline and check it for validity.
+We describe here a few of the most important sub-commands.
+
+.. seealso:: :doc:`/commands/index`
+
+   A complete list of the available commands
+
+status
+------
+   :doc:`The status subcommand </commands/status>`
+
+Outputs a list of all of the modules in the pipeline,
+reporting the execution status of each. This indicates whether the
+module has been run; if so, whether it completed successfully or
+failed; if not, whether it is ready to be run (i.e. all of its
+input data is available).
+
+Each of the modules is numbered in the list, and this number can
+be used instead of the module's full name in arguments to all
+sub-commands.
+
+Given the name of a module, the command outputs a detailed
+report on the status of that module and its input and output datasets.
+
+run
+---
+   :doc:`The run subcommand </commands/run>`
+
+Executes a module.
+
+An option ``--dry`` runs all pre-execution
+checks for the module, without running it. These include checking
+that required software is installed
+and performing automatic installation if not.
+
+If all requirements are satisfied, the module will be executed, outputting
+its progress to the terminal and to module-specific log files. Output datasets
+are written to module-specific directories, ready to be used by subsequent
+modules.
+
+Multiple modules can be run in sequence, or even the entire
+pipeline. A switch ``--all-deps`` also runs any unexecuted modules on
+whose output the specified module(s) depend.
+
+browse
+------
+   :doc:`The browse subcommand </commands/browse>`
+
+Inspects the data output by a module,
+stored in its pipeline-internal storage. Inspecting output data by
+loading the files output by the module would require
+knowledge of both the Pimlico data storage system and the specific storage
+formats used by the output datatypes. Instead,
+this command lets the user inspect the data from
+a given module (and a given output, if there are multiple).
+
+As part of their definition, alongside the routines for reading and writing
+their storage format, datatypes define how the data can be
+formatted for display.
+Multiple formatters may be defined, giving alternative ways to inspect the same data.
+
+For some datatypes, browsing is as simple as outputting some statistics
+about the data, or a string representing its contents. For corpora, a
+document-by-document browser is provided, using the `Urwid <http://urwid.org/>`_
+library. Furthermore, the definition of corpus document types
+determines how an individual document should be
+displayed in the corpus browser. For example, the tokenized text type shows
+each sentence on a separate line, with spaces between tokens.
+
+Where next?
+===========
+
+For a practical quick-start guide to building pipelines, see :doc:`fast_setup`.
+
+Or for a bit more detail, see :doc:`setup`.
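
The ``[split]`` section quoted in intro.rst above is plain INI syntax, so its structure can be illustrated with Python's standard ``configparser``. This is only a minimal sketch: it is not Pimlico's own config loader, which this commit does not show. The helper names ``read_module_sections`` and ``split_sizes`` are hypothetical, and the rounding used to turn ``set1_size`` into document counts is an assumption.

```python
import configparser

# The [split] section from intro.rst. The standard-library parser stands in
# for Pimlico's own config loader, which is not part of this commit.
CONF_TEXT = """
[split]
type=pimlico.modules.corpora.split
input=tokenized_corpus
set1_size=0.8
"""


def read_module_sections(conf_text):
    """Return {module_name: {option: value}} for every section in the text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return {name: dict(parser[name]) for name in parser.sections()}


def split_sizes(num_docs, set1_size):
    """Hypothetical helper: document counts for the two outputs of a split.

    Rounding to the nearest document is an assumption made for illustration.
    """
    set1 = int(round(num_docs * set1_size))
    return set1, num_docs - set1


modules = read_module_sections(CONF_TEXT)
split = modules["split"]
print(split["type"], "reads from", split["input"])
print(split_sizes(1000, float(split["set1_size"])))
```

Each section name becomes the module name, and every option arrives as a string, so numeric parameters such as ``set1_size`` have to be converted before use.
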
diff --git a/docs/guides/module.rst b/docs/guides/module.rst
@@ -1,6 +1,6 @@
-===========================
-  Writing Pimlico modules
-===========================
+=================================
+  Writing Pimlico module types
+=================================
 
 Pimlico comes with a fairly large number of :mod:`module types <pimlico.modules>`
 that you can use to run many standard NLP, data processing

diff --git a/docs/guides/multiple_servers.rst b/docs/guides/multiple_servers.rst
@@ -1,5 +1,5 @@
 ==========================================
-Running one pipeline on multiple computers
+      Running on multiple computers
 ==========================================
 
 Multiple servers

diff --git a/docs/index.rst b/docs/index.rst
@@ -26,7 +26,9 @@ pipeline and checking that everything's executed in the right order.
 
 The core toolkit is written in Python. Pimlico is open source, released under the GPLv3 license. It is
 available from `its Github repository <https://github.com/markgw/pimlico>`_.
-To get started with a Pimlico project, follow the :doc:`getting-started guide <guides/setup>`.
+
+* For a broad introduction to Pimlico's key concepts, read :doc:`guides/intro`.
+* To get started with a Pimlico project, follow the :doc:`getting-started guide <guides/setup>`.
 
 Pimlico is written in Python and can be run using Python >=2.7 or >=3.6. This means
 you can write your own processing modules using either Python 2 or 3.

diff --git a/docs/plans/index.rst b/docs/plans/index.rst
@@ -43,7 +43,6 @@ features). These do not take long to update and include in the main library.
 .. toctree::
    :maxdepth: 2
 
-   wishlist
    berkeley
    cherry_picker
    drawing

diff --git a/src/python/pimlico/datatypes/dictionary.py b/src/python/pimlico/datatypes/dictionary.py
@@ -34,7 +34,7 @@
 from pimlico.datatypes.base import PimlicoDatatype
 
 
-__all__ = ["Dictionary"]
+__all__ = ["Dictionary", "DictionaryData"]
 
 
 class Dictionary(PimlicoDatatype):
@@ -51,11 +51,16 @@ class Dictionary(PimlicoDatatype):
 
     class Reader(object):
         def get_data(self):
+            """
+            Load the dictionary and return a :class:`DictionaryData` object.
+
+            """
             with open(os.path.join(self.data_dir, "dictionary"), "rb") as f:
                 return pickle.load(f)
 
         class Setup(object):
             def get_required_paths(self):
+                """Require the dictionary file to be written"""
                 return ["dictionary"]
 
         def get_detailed_status(self):
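
The reader pattern documented in this hunk (a required ``dictionary`` file that ``get_data()`` unpickles) can be tried out in isolation. The following is a rough sketch under stated assumptions: ``SketchDictionaryReader`` and ``is_ready`` are hypothetical names, not Pimlico classes or methods, and a plain ``dict`` stands in for the ``DictionaryData`` object mentioned in the new docstring.

```python
import os
import pickle
import tempfile


class SketchDictionaryReader:
    """Hypothetical stand-in for the Reader in the diff above, not Pimlico's class."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def get_required_paths(self):
        # Mirrors Setup.get_required_paths(): files that must exist before reading
        return ["dictionary"]

    def is_ready(self):
        # Hypothetical check: every required file is present in the data directory
        return all(
            os.path.exists(os.path.join(self.data_dir, name))
            for name in self.get_required_paths()
        )

    def get_data(self):
        # Same pattern as the diff: unpickle the file called "dictionary"
        with open(os.path.join(self.data_dir, "dictionary"), "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as data_dir:
    reader = SketchDictionaryReader(data_dir)
    ready_before = reader.is_ready()
    # A plain dict stands in for the DictionaryData object named in the docstring
    with open(os.path.join(data_dir, "dictionary"), "wb") as f:
        pickle.dump({"the": 0, "cat": 1}, f)
    ready_after = reader.is_ready()
    loaded = reader.get_data()

print(ready_before, ready_after, loaded)
```

Keeping the list of required paths separate from the loading code lets a caller check whether the data is ready before trying to read it.
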