Merge pull request #32 from openeventdata/clean

Clean
openeventdata · Oct 31, 2016 · b849348
2 parents 7139eba + 343e424
commit b849348

Showing 7 changed files with 75 additions and 64 deletions.
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -18,7 +18,7 @@
 # If extensions (or modules to document with autodoc) are in another directory,
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
-sys.path.insert(0, os.path.abspath('../../petrarch/'))
+sys.path.insert(0, os.path.abspath('../../petrarch2/'))
 
 # -- General configuration ------------------------------------------------
 
@@ -46,17 +46,17 @@
 master_doc = 'index'
 
 # General information about the project.
-project = u'PETRARCH'
+project = u'PETRARCH2'
 copyright = u'2014, Open Event Data Alliance'
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
 # built documents.
 #
 # The short X.Y version.
-version = '.01a'
+version = '1.0.0'
 # The full version, including alpha/beta/rc tags.
-release = '.01a'
+release = '1.0.0'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

diff --git a/docs/source/dictionaries.rst b/docs/source/dictionaries.rst
@@ -29,8 +29,6 @@ Everything after this symbol and before the next newline will be ignored by the
 	
 	something I want # followed by a Python-like comment
 
- 
-
 The program is *not* set up to handle clever variations like nested comments,  multiple 
 comments on a line, or non-comment information in multi-line comments: yes, we are
 perfectly capable of writing code that could handle these contingencies, but it 
@@ -46,15 +44,17 @@ adds to the memory overhead and can be somewhat confusing if you don't know what
 with. This data structure stores each word at a node, and following a path in the tree will lead
 to a pattern. Let's take a small part of the discard list as an example:
 
-::
+.. code-block:: none
+
     WORLD BOXING ASSOCIATION
     WORLD BOXING COUNCIL
     WORLD CUP
 
 
 These three entries would be stored in the following Trie:
 
-::
+.. code-block:: none
+
                 PETRglobals.DiscardList
                             |
                             |
@@ -95,14 +95,13 @@ Synonym sets (synsets) are labelled with a string beginning with & and defined u
 the label followed by a series of lines beginning with ``+`` containing words or phrases.
 The phrases are interpreted as requiring consecutive words; the words can be separated 
 with underscores (they are converted to spaces). Synset phrases can
-only contain words, not ``$``, ``+``, ``%`` or ``^`` tokens.
- Synsets be used anywhere in a
-pattern that a word or phrase can be used. A synset must be defined before it is used:  
-a pattern containing an undefined synset will be ignored.
+only contain words, not ``$``, ``+``, ``%`` or ``^`` tokens. Synsets can be used anywhere in a pattern that a word or phrase can be used. A synset must be defined before it is used: a pattern containing an undefined synset will be ignored.
 
-Regular plurals are generated automatically  by adding 'S' to the root, adding 'IES' if the root ends in 'Y', and added 'ES' if the root ends in 'SS'.  Plurals are not created when [1]_
-
-.. [1] The method for handling irregular plurals is currently different for the verbs and agents dictionaries: these will be reconciled in the future, probably using the agents syntax. 
+Regular plurals are generated automatically by adding 'S' to the root, adding 'IES' if the root ends in 'Y', and adding 'ES' if the root ends in 'SS'.
+The method for handling irregular plurals is currently different for the verbs
+and agents dictionaries: these will be reconciled in the future, probably using
+the agents syntax.
+Plurals are not created when:
 
 * The phrase ends with ``_``. 
 
@@ -118,19 +117,19 @@ just enter these as additional synonyms.
 A verb synonym block is a set of verbs which are synonymous (or close enough) with 
 respect to the patterns. The program automatically generates the regular forms of the 
 verb if it is regular (and, implicitly, English); otherwise the irregular forms can be 
-specified in {...} following the primary verb. An optional code for the isolated verb 
-can	follow in [...].  
+specified in ``{...}`` following the primary verb. An optional code for the isolated verb 
+can	follow in ``[...]``.  
 
 The verb block begins with a comment of the form 
 
 ::
 
 --- <GENERAL DESCRIPTION> [<CODE>] ---
 
-where the "---" signals the beginning of a new block. The code in [...] is the 
+where the ``---`` signals the beginning of a new block. The code in ``[...]`` is the 
 primary code -- typically a two-digit+0 cue-category code -- for the block, and this 
 will be used for all other verbs unless these have their own code. If no code is 
-present, this defaults to the null code "---"  which indicates that the isolated verb 
+present, this defaults to the null code ``---`` which indicates that the isolated verb 
 does not generate an event. The null code also can be used as a secondary code.	
 
 
@@ -139,9 +138,9 @@ does not generate an event. The null code also can be used as a secondary code.
 Multiple-word "verbs" such as "CONDON OFF", "WIRE TAP" and "BEEF UP" are entered by
 connecting the words with an underscore and putting a '+'
 in front of the word in the phrase that is going to be identified as a verb.
-If there is no {...}, regular 
+If there is no ``{...}``, regular 
 forms are constructed for the word designated by '+'; otherwise all of the irregular 
-forms are given in {...}. If you can't figure out which part of the phrase is the 
+forms are given in ``{...}``. If you can't figure out which part of the phrase is the 
 verb, the phrase you are looking at is probably a noun, not a verb. Multi-word verbs 
 are treated in patterns just as single-word verbs are treated.
 
@@ -164,8 +163,9 @@ are more frequently parsed correctly.
 
 
 
-** Patterns **
-This is followed by a set of patterns -- these begin with '-' -- which are based roughly on
+**Patterns**
+
+This is followed by a set of patterns -- these begin with ``-`` -- which are based roughly on
 the syntax from TABARI patterns, but the patterns in Petrarch's dictionaries also contain
 some syntactic annotation. Pattern lines begin with a
 -, and are followed by a five-part pattern:
@@ -176,7 +176,7 @@ some syntactic annotation. Pattern lines begin with a
 
 Any of these can be left empty. Singular nouns are left bare, and should be the "head" of the phrase
 they are a member of, e.g. the head of "Much-needed financial aid" would be "aid." If multiple nouns or
-adjectives are needed, then that phrase is put in braces as in {FINANCIAL AID}, where the last word is the
+adjectives are needed, then that phrase is put in braces as in ``{FINANCIAL AID}``, where the last word is the
 head. Prepositional phrases are put in parentheses where the first element is the preposition, and the second
 element is a noun, or a braced noun phrase.
 
@@ -192,16 +192,20 @@ Note that these patterns do not contain other verbs. This is different from TABA
 versions of Petrarch. This is to simplify the verbs dictionary, and make the pattern matching
 faster and more effective.
 
-** Combinations **
+**Combinations**
+
 Petrarch handles many verb-verb interactions automatically through its reformatting of CAMEO's semantic
 heirarchy (See utilities.convert_code for more). For instance, if it were parsing the phrase
- " A will [help B]", it would code "to help B" first, then the phrase would become "A will [_ B 0x0040]".
-And then since help=0x0040 is a subcategory of will=0x3000, then it just adds them together,
-ending with the code [A B 0x3040]. This code is translated back into CAMEO for the final output,
-yielding [A B 033]. This process works for most instances where the idea of the phrase as a whole
+
+" A will [help B]"
+
+it would code "to help B" first, then the phrase would become "A will [_ B 0x0040]".
+And then since ``help=0x0040`` is a subcategory of ``will=0x3000``, then it just adds them together,
+ending with the code ``[A B 0x3040]``. This code is translated back into CAMEO for the final output,
+yielding ``[A B 033]``. This process works for most instances where the idea of the phrase as a whole
 is a combination of the ideas of its children.
 
-** Transformations **
+**Transformations**
 
 Sometimes these verb-vertb interactions aren't represented in the
 ontology. It is possible to specify what happens when one verb finds that it is acting on another verb.
@@ -217,19 +221,20 @@ The first element is the topmost source actor, the last element is the topmost v
 are converted to codes, so synonyms also match). The inner parenthetical has the same format, with the
 first element being the lower source, the second the lower target, and the third the lower verb. It
 is possible to replace letter variables with a period '.' to represent "non-specified actor", or with
-an underscore '_' to specify "non-present actor." Verbs can also be replaced with "Q" to mean "any verb."
+an underscore ``_`` to specify "non-present actor." Verbs can also be replaced with "Q" to mean "any verb."
 
 These transformations are sometimes necessary, but most cases can be handled by the combination process.
 
 
-** Storage in Memory **
+**Storage in Memory**
+
 The verb dictionary, when stored into memory, has three subdictionaries: words, patterns, and transformations.
 
-The words portion contains the base verbs. They are stored as VERB--STUFF BEFORE--#--STUFF AFTER--#--INFO. For
-most verbs (i.e. those that are not compounds), The entry just goes VERB -- # -- # -- INFO.
+The words portion contains the base verbs. They are stored as ``VERB--STUFF BEFORE--#--STUFF AFTER--#--INFO``. For
+most verbs (i.e. those that are not compounds), The entry just goes ``VERB -- # -- # -- INFO``.
 
 The transformation contains almost a literal transcription of the pattern, ordered
-VERB1--SOURCE1--VERB2--SOURCE2--TARGET2--INFO.
+``VERB1--SOURCE1--VERB2--SOURCE2--TARGET2--INFO``.
 
 The verb patterns in memory have extra annotative symbols after every word to indicate the type of
 word that comes next. The very first word encountered is always a noun. Then it follows a series of rules
@@ -394,11 +399,11 @@ for organizations, e.g. ``NGO~``)
 Regular plurals -- those formed by adding 'S' to the root, adding 'IES' if the
 root ends in 'Y', and added 'ES' if the root ends in 'SS' -- are generated automatically
 
-If the plural has some other form, it follows the root inside {...}  [1]_
+If the plural has some other form, it follows the root inside ``{...}``  [1]_
 
 If a plural should not be formed -- that is, the root is only singular or only
 plural, or the singular and plural have the same form (e.g. "police"), use a null
-string inside {}.
+string inside ``{}``.
 
 If there is more than one form of the plural -- "attorneys general" and "attorneys
 generals" are both in use -- just make a second entry with one of the plural forms
@@ -422,7 +427,7 @@ and used in the form
         CONGRESS!PERSON! [~LEG}
         !MINIST!_OF_INTERNAL_AFFAIRS
 
-The marker for the substitution set is of the form !...! and is followed by an =
+The marker for the substitution set is of the form ``!...!`` and is followed by an =
 and a comma-delimited list; spaces are stripped from the elements of the list so
 these can be added for clarity. Every item in the list is substituted for the marker,
 with no additional plural formation, so the first construction would generate
@@ -460,9 +465,10 @@ with no additional plural formation, so the first construction would generate
 Discard List
 ------------
 
-The discard list is used to identify sentences that should not be coded, for example sports events and historical chronologies.[2]_ If the string, prefixed with ' ', is found in the ``<Text>...</Text>`` sentence, the
-sentence is not coded. Prefixing the string with a '+' means the entire story is not
-coded with the string is found. If the string ends with '_', the matched string must also end with
+The discard list is used to identify sentences that should not be coded, for example sports events and historical chronologies. [2]_
+If the string, prefixed with ``' '``, is found in the ``<Text>...</Text>`` sentence, the
+sentence is not coded. Prefixing the string with a ``+`` means the entire story is not
+coded if the string is found. If the string ends with ``_``, the matched string must also end with
 a blank or punctuation mark; otherwise it is treated as a stem. The matching is not
 case sensitive.
 
@@ -503,7 +509,7 @@ The optional ``Issues`` dictionary is used to do simple string matching and retu
 
         ``<string> [<code>]``
 
-For purposes of matching, a ' ' is added to the beginning and end of the string: at
+For purposes of matching, a ``' '`` is added to the beginning and end of the string: at
 present there are no wild cards, though that is easily added.
 
 The following expansions can be used (these apply to the string that follows up to
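
Note on the discard-list Trie described in the dictionaries.rst hunks above: the documentation says each word is stored at a node and that following a path leads to a pattern. The snippet below is only a minimal sketch of that idea using nested dicts. The `END` marker and the function names `build_trie` and `longest_match` are hypothetical and chosen for illustration; they are not taken from the PETRARCH2 source, whose in-memory layout may differ.

```python
# Hypothetical end-of-phrase marker; the real dictionary format may use another.
END = "#"


def build_trie(phrases):
    """Store each phrase word by word in nested dicts (one node per word)."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            # Reuse the child node if the word is already on this path.
            node = node.setdefault(word, {})
        node[END] = True  # a complete phrase ends at this node
    return root


def longest_match(trie, words):
    """Return the longest stored phrase that prefixes `words`, or None."""
    node = trie
    matched = None
    for i, word in enumerate(words):
        if word not in node:
            break
        node = node[word]
        if END in node:
            matched = words[: i + 1]
    return matched


# The three discard-list entries quoted in the documentation.
discard = build_trie([
    "WORLD BOXING ASSOCIATION",
    "WORLD BOXING COUNCIL",
    "WORLD CUP",
])

print(sorted(discard["WORLD"]))
print(longest_match(discard, "WORLD CUP FINAL".split()))
```

All three entries share the single `WORLD` node, and `WORLD BOXING` on its own does not match because no stored phrase ends there.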

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -7,12 +7,15 @@
 Introduction
 ============
 
-A Python Engine for Text Resolution And Related Coding Hierarchy.
+A Python Engine for Text Resolution And Related Coding Hierarchy part 2.
+This is the documentation for PETRARCH2, though PETRARCH is used throughout
+this documentation as interchangeable with PETRARCH2. The difference between
+the programs lies in the coding engine rather than the API; more details can be
+seen in the `Comparison <status.html>`_.
 
 
     One of my students put it this way: "Francesco Petrarch was Kayne West. He jumps up on stage, says 
-    'Yo, welcome to the Renaissance, bitches!' And then drops the mike." <br\>
-    Dorsey Armstrong <br\>
+    'Yo, welcome to the Renaissance, bitches!' And then drops the mike." -- Dorsey Armstrong
     *Great Minds of the Medieval World* (Great Courses Series), lecture 20
 
 PETRARCH is a natural language processing tool for machine-coding events data.
@@ -21,10 +24,10 @@ from which 'whom-did-what-to-whom' relations are extracted.
 
 PETRARCH is the next-generation successor to the `TABARI
 <http://eventdata.parusanalytics.com/software.dir/tabari.html>`_ event-data
-coding software. More information about the key differences between PETRARCH
-and TABARI can be be found `here <current.html>`_ .
+coding software. A description of the differences between TABARI and
+PETRARCH-generation software is available `here <tabari_vs_petrarch.html>`_.
 
-This software is MIT Licensed (MIT) Copyright &copy; 2014 Open Event Data Alliance
+This software is MIT Licensed (MIT) Copyright 2014 Open Event Data Alliance
 
 
 Events Data
@@ -59,12 +62,8 @@ Installing
 ----------
 If you do decide you want to work with Petrarch as a standalone program, it is possible to install:
 
-1) Clone the repo
 
-  - For example, download the zip file into ``~/Downloads``.
-  - This will put the repo into something like ``~/Downloads/petrarch``.
-
-2) Run ``pip install -e ~/Downloads/petrarch``
+1) Run ``pip install git+https://github.com/openeventdata/petrarch2.git``
 
 
 This will install the program with a command-line hook. You can now run the program using:
@@ -90,6 +89,14 @@ If not installed:
 
 ``python petrarch.py batch -i <INPUT FILE> -o <OUTPUT FILE>``
 
+You can see a sample of the input/output by running (assuming you're in the
+PETRARCH2 directory):
+
+``petrarch2 batch -i ./petrarch2/data/text/GigaWord.sample.PETR.xml -o
+test.txt``
+
+This will return a file named ``evts.test.txt``.
+
 There's also the option to specify a configuration file using the ``-c <CONFIG
 FILE>`` flag, but the program will default to using ``PETR_config.ini``.
 
@@ -106,9 +113,7 @@ Unexpected conditions where the program encountered a potentially fatal error ar
 
 The one common error -- not included in those counts -- is the ``Dateline`` pattern, which is a particular pattern in the parse tree that occurs when the parsed material starts with a dateline such as "Beirut:'' or "Beijing (Xinhua News Agency):" rather than the actual start of the sentence. We probably aren't catching all dateline errors with this pattern but it gets a lot of them, and if you are seeing frequent occurrences of this warning you need to modify your pre-filters to remove the datelines.
 
-The remaining errors are due to very odd sentence constructions which either have confused CoreNLP so that the phrase structure is incorrect, or otherwise were not anticipated in the PETRARCH processing. Some of this
-    can be fixed if brought to our attention, but some of it is on the side of CoreNLP, which we aren't
-    even going to attempt to touch.
+The remaining errors are due to very odd sentence constructions which either have confused CoreNLP so that the phrase structure is incorrect, or otherwise were not anticipated in the PETRARCH processing. Some of this can be fixed if brought to our attention, but some of it is on the side of CoreNLP, which we aren't even going to attempt to touch.
 
 Contents:
 ---------
@@ -117,7 +122,7 @@ Contents:
    :maxdepth: 2
 
    status.rst
-   petrarch.rst
+   petrarch2.rst
    dictionaries.rst
    inputs.rst
    contributing.rst

diff --git a/docs/source/inputs.rst b/docs/source/inputs.rst
@@ -58,8 +58,6 @@ the text in the entry is from a single sentence or a block of sentences, such
 as from the lead paragraph of a news story. Finally, the ``source`` attribute
 indicates what source the material came from, such as Agence-France Presse.
 
-,
-
 **General record fields:**
 
 All of these tags should occur on their own lines.

diff --git a/docs/source/modules.rst b/docs/source/modules.rst
@@ -1,10 +1,10 @@
 PETRARCH Package
 ========================
 
-:mod:`petrarch` Module
+:mod:`petrarch2` Module
 ----------------------
 
-.. automodule:: petrarch
+.. automodule:: petrarch2
     :members:
     :undoc-members:
     :show-inheritance:

diff --git a/docs/source/petrarch.rst → docs/source/petrarch2.rst b/docs/source/petrarch.rst → docs/source/petrarch2.rst
@@ -1,5 +1,5 @@
-PETRARCH
-========
+PETRARCH2
+=========
 
 This page contains some general notes about PETRARCH such as how the data is
 stored internally, how the configuration file is organized, and an outline of
@@ -24,6 +24,7 @@ Command Line Interface
   ``-c``, configuration will be read from that file; default config file is  ``PETR_config.ini``.
 
 ``parse``
+  **NOTE:** This command is deprecated in PETRARCH2.
   Run the PETRARCH parser specifying files in the command line
 
 

diff --git a/docs/source/status.rst b/docs/source/status.rst
@@ -1,5 +1,6 @@
-Status of the program 1-September-2015
+PETRARCH2 v. PETRARCH
 ======================================
+
 PETRARCH has been totally redone. The logic now more strongly follows the tree structure
 provided to us by the TreeBank parse.