Merge pull request #168 from pagreene/schema-doc

Overhaul Documentation
gyorilab · May 11, 2021 · 5bc59fc · 5bc59fc
2 parents 5717000 + a65e194
commit 5bc59fc

Showing 22 changed files with 2,504 additions and 1,308 deletions.
diff --git a/doc/conf.py b/doc/conf.py
@@ -41,7 +41,7 @@
     'IPython.sphinxext.ipython_directive',
     'IPython.sphinxext.ipython_console_highlighting',
     'citations',
-    'm2r'
+    'm2r2'
 ]
 
 # Add any paths that contain templates here, relative to this directory.
@@ -314,7 +314,7 @@
     'functools32', 'ndex2', 'ndex2.client', 'ndex2.niceCXNetwork',
     'nltk', 'reportlab', 'reportlab.lib', 'reportlab.lib.enums',
     'reportlab.lib.pagesizes', 'reportlab.platypus', 'reportlab.lib.styles',
-    'reportlab.lib.units'
+    'reportlab.lib.units', 'indra.tools.assemble_corpus', 'indra.ontology.bio',
     ]
 for mod_name in MOCK_MODULES:
     sys.modules[mod_name] = mock.MagicMock()

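Note on the MOCK_MODULES hunk above: the loop registers a stand-in object under each listed module name so that Sphinx autodoc can import the package even when those modules are unavailable at build time. A minimal sketch of the same pattern, using made-up module names (`hypothetical_heavy_dep` and its `sub` submodule are placeholders, not real dependencies of this project):

```python
import sys
from unittest import mock

# Hypothetical module names, used only for illustration.
MOCK_MODULES = ["hypothetical_heavy_dep", "hypothetical_heavy_dep.sub"]

# Same loop as in doc/conf.py: put a MagicMock into sys.modules under each
# name, so later imports find the mock instead of searching the filesystem.
for mod_name in MOCK_MODULES:
    sys.modules[mod_name] = mock.MagicMock()

# These imports now succeed although no such package is installed.
import hypothetical_heavy_dep
from hypothetical_heavy_dep.sub import helper

# Attribute access and calls on a mock return further mocks, so module-level
# code that touches the dependency does not fail during the doc build.
result = hypothetical_heavy_dep.load("anything")
```

One consequence of this pattern is that a mocked module cannot be documented by autodoc itself, since only the mock is visible at build time.
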
diff --git a/doc/modules/client/readonly/index.rst b/doc/modules/client/readonly/index.rst
@@ -12,24 +12,13 @@ our database to access it even as we perform daily updates on the principal
 database, without worrying about queries interfering.
 
 
-Get Pre-Assembled Statements (:py:mod:`indra_db.client.readonly.pa_statements`)
+Construct composable queries (:py:mod:`indra_db.client.readonly.query`)
 -------------------------------------------------------------------------------
 
-Here are the tools used to get PA Statements from the readonly database, with
-the goal of retrieving at least 1,000 Statements with 10 evidence each in under
-30 seconds.
+This is a sophisticated system of classes that can be used to form queries
+for preassembled statements from the readonly database.
 
-.. automodule:: indra_db.client.readonly.pa_statements
+.. automodule:: indra_db.client.readonly.query
    :members:
+   :member-order: bysource
 
-
-Get Simple Interactions from Metadata (:py:mod:`indra_db.client.readonly.interactions`)
----------------------------------------------------------------------------------------
-
-This provides an API to get somewhat less detailed data than above, using just
-the metadata of the database (not looking into the Statement JSONs), but is
-much faster. These tools can be sufficient if, for example, all that is needed
-is an interactome.
-
-.. automodule::indra_db.client.readonly.interactions
-   :memebrs:
diff --git a/doc/modules/index.rst b/doc/modules/index.rst
@@ -8,6 +8,7 @@ INDRA Database modules
    util/index.rst
    managers/index.rst
    reading/index.rst
+   preassembly/index.rst
    schemas/index.rst
    misc.rst
 
diff --git a/doc/modules/managers/index.rst b/doc/modules/managers/index.rst
@@ -52,11 +52,12 @@ handling.
    :members:
 
 
-Readonly Manager (:py:mod:`indra_db.managers.readonly_manager`)
----------------------------------------------------------------
+Static Dump Manager (:py:mod:`indra_db.managers.dump_manager`)
+--------------------------------------------------------------
 
-This handles the generation of the content for the readonly database from the
-principal database.
+This handles the generation of static dumps, including the readonly database 
+from the principal database.
 
-.. automodule:: indra_db.managers.readonly_manager
+.. automodule:: indra_db.managers.dump_manager
    :members:
+   :member-order: bysource
diff --git a/doc/modules/misc.rst b/doc/modules/misc.rst
@@ -14,6 +14,7 @@ access to SQLAlchemy's API.
 
 .. automodule:: indra_db.databases
    :members:
+   :member-order: bysource
 
 
 Belief Calculator (:py:mod:`indra_db.belief`)

diff --git a/doc/modules/preassembly/index.rst b/doc/modules/preassembly/index.rst
@@ -0,0 +1,30 @@
+Database Integrated Preassembly Tools
+=====================================
+
+The database runs incremental preassembly on the raw statements to generate
+the preassembled (PA) Statements. The code to accomplish this task is defined
+here, principally in :class:`DbPreassembler
+<indra_db.preassembly.preassemble_db.DbPreassembler>`. This module also
+defines procedures for running these jobs on AWS.
+
+Database Preassembly (:py:mod:`indra_db.preassembly.preassemble_db`)
+--------------------------------------------------------------------
+
+This module defines a class that manages preassembly for a given list of
+statement types on the local machine.
+
+.. automodule:: indra_db.preassembly.preassemble_db
+   :members:
+   :member-order: bysource
+
+
+A Class to Manage and Monitor AWS Batch Jobs (:py:mod:`indra_db.preassembly.submitter`)
+---------------------------------------------------------------------------------------
+
+Allow a manager to monitor the Batch jobs to prevent runaway jobs, and smooth
+out job runs and submissions.
+
+.. automodule:: indra_db.preassembly.submitter
+   :members:
+   :member-order: bysource
+
diff --git a/doc/modules/reading/index.rst b/doc/modules/reading/index.rst
@@ -15,7 +15,8 @@ to a standard interface, which then allows readers to be run in a plug-and-play
 manner.
 
 .. automodule:: indra_db.reading.read_db
-    :members:
+   :members:
+   :member-order: bysource
 
 
 The Database Script for Running on AWS (:py:mod:`indra_db.reading.read_db_aws`)
@@ -25,23 +26,17 @@ This is the script used to run reading on AWS Batch, generally run from an
 AWS Lambda function.
 
 .. automodule:: indra_db.reading.read_db_aws
-    :members:
+   :members:
+   :member-order: bysource
 
-The Database Reporter (:py:mod:`indra_db.reading.report_db_aws`)
-----------------------------------------------------------------
 
-Create an object that is used to aggregate and report on the reading process,
-allowing for effective monitoring.
-
-.. automodule:: indra_db.reading.report_db_aws
-    :members:
-
-A Class to Manage and Monitor AWS Batch Jobs (:py:mod:`indra_db.reading.submit_reading_pipeline`)
--------------------------------------------------------------------------------------------------
+A Class to Manage and Monitor AWS Batch Jobs (:py:mod:`indra_db.reading.submitter`)
+-----------------------------------------------------------------------------------
 
 Allow a manager to monitor the Batch jobs to prevent runaway jobs, and smooth
 out job runs and submissions.
 
-.. automodule:: indra_db.reading.submit_reading_pipeline
-    :members:
+.. automodule:: indra_db.reading.submitter
+   :members:
+   :member-order: bysource
 
diff --git a/doc/modules/schemas/index.rst b/doc/modules/schemas/index.rst
@@ -7,11 +7,9 @@ as some useful mixin classes.
 Principal Database Schema (:py:mod:`indra_db.schemas.principal_schema`)
 -----------------------------------------------------------------------
 
-Defines the `get_schema` function for the principal database, which represents
-the "ground truth" of the knowledge we aggregate.
-
 .. automodule:: indra_db.schemas.principal_schema
    :members:
+   :member-order: bysource
 
 Readonly Database Schema (:py:mod:`indra_db.schemas.readonly_schema`)
 ---------------------------------------------------------------------
@@ -21,6 +19,7 @@ external services to access the Statement knowledge we acquire.
 
 .. automodule:: indra_db.schemas.readonly_schema
    :members:
+   :member-order: bysource
 
 Class Mix-ins (:py:mod:`indra_db.schemas.mixins`)
 -------------------------------------------------
@@ -30,6 +29,7 @@ table objects via multiple inheritance.
 
 .. automodule:: indra_db.schemas.mixins
    :members:
+   :member-order: bysource
 
 Indexes (:py:mod:`indra_db.schemas.indexes`)
 --------------------------------------------
@@ -40,3 +40,4 @@ class mixin definition.
 
 .. automodule:: indra_db.schemas.indexes
    :members:
+   :member-order: bysource
diff --git a/indra_db/client/datasets.py b/indra_db/client/datasets.py
@@ -26,9 +26,6 @@ def get_statement_essentials(clauses, count=1000, db=None, preassembled=True):
         list of sqlalchemy WHERE clauses to pass to the filter query.
     count : int
         Number of statements to retrieve and process in each batch.
-    do_stmt_count : bool
-        Whether or not to perform an initial statement counting step to give
-        more meaningful progress messages.
     db : :py:class:`DatabaseManager`
         Optionally specify a database manager that attaches to something
         besides the primary database, for example a local database instance.
@@ -148,18 +145,19 @@ def export_relation_dict_to_tsv(relation_dict, out_base, out_types=None):
     """Export a relation dict (from get_relation_dict) to a tsv.
 
     Available output types are:
+
     - "full_tsv" : get a tsv with directed pairs of entities (e.g. HGNC
-        symbols), the type of relation (e.g. Phosphorylation) and the hash
-        of the preassembled statement. Columns are agent_1, agent_2 (where
-        agent_1 affects agent_2), type, hash.
+      symbols), the type of relation (e.g. Phosphorylation) and the hash
+      of the preassembled statement. Columns are agent_1, agent_2 (where
+      agent_1 affects agent_2), type, hash.
     - "short_tsv" : like the above, but without the hashes, so only one
-        instance of each pair and type trio occurs. However, the information
-        cannot be traced. Columns are agent_1, agent_2, type, where agent_1
-        affects agent_2.
+      instance of each pair and type trio occurs. However, the information
+      cannot be traced. Columns are agent_1, agent_2, type, where agent_1
+      affects agent_2.
     - "pairs_tsv" : like the above, but without the relation type. Similarly,
-        each row is unique. In addition, the agents are undirected. Thus this
-        is purely a list of pairs of related entities. The columns are just
-        agent_1 and agent_2, where nothing is implied by the ordering.
+      each row is unique. In addition, the agents are undirected. Thus this
+      is purely a list of pairs of related entities. The columns are just
+      agent_1 and agent_2, where nothing is implied by the ordering.
 
     Parameters
     ----------

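Note on the output types described in the docstring above: the three formats differ only in how much each row keeps. The sketch below illustrates that difference on toy data. It assumes relations are plain `(agent_1, agent_2, type, hash)` tuples, which is a simplification for illustration; the actual structure returned by `get_relation_dict` may differ, and the helper names here are hypothetical.

```python
# Toy input: (agent_1, agent_2, type, hash). This shape is an assumption
# made for the example, not the real relation dict format.
relations = [
    ("MAP2K1", "MAPK1", "Phosphorylation", 101),
    ("MAP2K1", "MAPK1", "Phosphorylation", 102),
    ("MAPK1", "MAP2K1", "Complex", 103),
]


def full_rows(rels):
    """Directed pair, type, and hash: one row per preassembled statement."""
    return sorted(set(rels))


def short_rows(rels):
    """Drop the hash, so each (agent_1, agent_2, type) trio occurs once."""
    return sorted({(a1, a2, rel_type) for a1, a2, rel_type, _ in rels})


def pairs_rows(rels):
    """Drop type and direction, leaving unique unordered pairs."""
    return sorted({tuple(sorted((a1, a2))) for a1, a2, _, _ in rels})


def to_tsv(rows):
    """Join rows into tab-separated text, one row per line."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)


print(to_tsv(full_rows(relations)))
print(to_tsv(short_rows(relations)))
print(to_tsv(pairs_rows(relations)))
```

With this input, the full form keeps all three rows, the short form collapses the two phosphorylation rows into one, and the pairs form collapses everything into a single undirected pair.
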
diff --git a/indra_db/client/readonly/query.py b/indra_db/client/readonly/query.py
@@ -691,8 +691,8 @@ def get_interactions(self, ro=None, limit=None, offset=None,
                          sort_by='ev_count') -> Optional[QueryResult]:
         """Get the simple interaction information from the Statements metadata.
 
-       Each entry in the result corresponds to a single preassembled Statement,
-       distinguished by its hash.
+        Each entry in the result corresponds to a single preassembled Statement,
+        distinguished by its hash.
 
         Parameters
         ----------
@@ -1859,6 +1859,10 @@ def get_clause(ro):
 class FromMeshIds(_TextRefCore):
     """Find Statements whose text sources were given one of a list of MeSH IDs.
 
+    This object can be constructed from a list of mixed "D" and "C" type MeSH
+    IDs, but for querying purposes those IDs are split into two separate
+    objects, and a :class:`Union <Union>` of the two is returned.
+
     Parameters
     ----------
     mesh_ids : list
@@ -1867,9 +1871,11 @@ class FromMeshIds(_TextRefCore):
     Attributes
     ----------
     mesh_ids : tuple
-        The mesh IDs.
+        The immutable tuple of mesh IDs, in their original string form.
     _mesh_type : str
         "C" or "D" indicating which types of IDs are held in this object.
+    _mesh_nums : list[int]
+        The mesh IDs converted to integers, stripped of their prefix.
     """
     list_name = 'mesh_ids'
 
@@ -1901,7 +1907,6 @@ def __new__(cls, mesh_ids: list):
     def __init__(self, mesh_ids):
         self.mesh_ids = tuple(set(mesh_ids))
         self._mesh_nums = []
-        self._mesh_concept_nums = []
         self._mesh_type = None
         for mesh_id in self.mesh_ids:
             if self._mesh_type is None:
@@ -2965,29 +2970,24 @@ class EvidenceFilter:
     We need to be able to perform logical operations between evidence to handle
     important cases:
 
-    HasSource(['reach']) & FromMeshIds(['D0001'])
-    -> we might reasonably want to filter evidence for the second subquery but
-       not the first.
-
-    HasOnlySource(['reach']) & FromMeshIds(['D00001'])
-    -> Here we would likely want to filter the evidence for both sub queries.
-
-    HasOnlySource(['reach']) | FromMeshIds(['D000001'])
-    -> Not sure what this even means (its purpose)....not sure what we'd do for
-       evidence filtering when the original statements are or'ed
-
-    HasDatabases() & FromMeshIds(['D000001'])
-    -> Here you COULDN'T perform an & on the evidence, because the two sources
-       are mutually exclusive (only readings connect to mesh annotations).
-       However it could make sense you would want to do an "or" between the
-       evidence, so the evidence is either from a database or from a mesh
-       annotated document.
-
-    "filter all the evidence" and "filter none of the evidence" should
-    definitely be options. Although "Filter for all" might run into usues with
-    the "HasDatabase and FromMeshIds" scenario. I think no evidence filter should
-    be the default, and if you attempt a bogus "filter all evidence" (as with
-    that scenario) you get an error.
+    - ``HasSource(['reach']) & FromMeshIds(['D0001'])``: we might reasonably
+      want to filter evidence for the second subquery but not the first.
+    - ``HasOnlySource(['reach']) & FromMeshIds(['D00001'])``: here we would
+      likely want to filter the evidence for both subqueries.
+    - ``HasOnlySource(['reach']) | FromMeshIds(['D000001'])``: it is not clear
+      what this would mean, or how evidence should be filtered when the
+      original statements are or'ed.
+    - ``HasDatabases() & FromMeshIds(['D000001'])``: here you COULDN'T perform
+      an & on the evidence, because the two sources are mutually exclusive
+      (only readings connect to mesh annotations). However, it could make sense
+      to do an "or" between the evidence, so that the evidence is either from
+      a database or from a mesh annotated document.
+
+    Both "filter all the evidence" and "filter none of the evidence" should
+    definitely be options, although "filter all" might run into issues with
+    the "HasDatabases and FromMeshIds" scenario. I think no evidence filter
+    should be the default, and if you attempt a bogus "filter all evidence" (as
+    with that scenario) you get an error.
     """
 
     def __init__(self, filters=None, joiner='and'):

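Note on the `FromMeshIds` docstring changes in this file: the docstring says mixed "D" and "C" IDs are separated by type and stored as integers with the prefix stripped. The sketch below shows one way such a split could work. It is not the code in `query.py`; `split_mesh_ids` is a hypothetical helper, and the validation rules are assumptions.

```python
def split_mesh_ids(mesh_ids):
    """Group MeSH IDs by prefix and convert the numeric part to int.

    Hypothetical helper for illustration only. It assumes every ID is a
    single letter, "D" or "C", followed by digits.
    """
    groups = {"D": [], "C": []}
    # De-duplicate first, then sort so the output order is deterministic.
    for mesh_id in sorted(set(mesh_ids)):
        prefix, digits = mesh_id[:1], mesh_id[1:]
        if prefix not in groups or not digits.isdigit():
            raise ValueError(f"Unrecognized MeSH ID: {mesh_id!r}")
        groups[prefix].append(int(digits))
    return groups


mixed = ["D000001", "C000657245", "D015179", "D000001"]
print(split_mesh_ids(mixed))
```

Each non-empty group would then back its own single-type query, with the two combined by a union as the docstring describes.
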
diff --git a/indra_db/client/readonly/relation.py b/indra_db/client/readonly/relation.py