Merge pull request #32 from openeventdata/clean
Clean
johnb30 committed Oct 31, 2016
2 parents 7139eba + 343e424 commit b849348
Showing 7 changed files with 75 additions and 64 deletions.
8 changes: 4 additions & 4 deletions docs/source/conf.py
@@ -18,7 +18,7 @@
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('../../petrarch/'))
sys.path.insert(0, os.path.abspath('../../petrarch2/'))

# -- General configuration ------------------------------------------------

@@ -46,17 +46,17 @@
master_doc = 'index'

# General information about the project.
project = u'PETRARCH'
project = u'PETRARCH2'
copyright = u'2014, Open Event Data Alliance'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '.01a'
version = '1.0.0'
# The full version, including alpha/beta/rc tags.
release = '.01a'
release = '1.0.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
82 changes: 44 additions & 38 deletions docs/source/dictionaries.rst
@@ -29,8 +29,6 @@ Everything after this symbol and before the next newline will be ignored by the
something I want # followed by a Python-like comment


The program is *not* set up to handle clever variations like nested comments, multiple
comments on a line, or non-comment information in multi-line comments: yes, we are
perfectly capable of writing code that could handle these contingencies, but it
@@ -46,15 +44,17 @@ adds to the memory overhead and can be somewhat confusing if you don't know what
with. This data structure stores each word at a node, and following a path in the tree will lead
to a pattern. Let's take a small part of the discard list as an example:

::
.. code-block:: none
WORLD BOXING ASSOCIATION
WORLD BOXING COUNCIL
WORLD CUP
These three entries would be stored in the following Trie:

::
.. code-block:: none
PETRglobals.DiscardList
|
|
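
The word-per-node storage described in this hunk can be sketched in Python. This is only an illustration under stated assumptions: the nested-dict layout, the ``END`` marker, and the helper names ``build_trie`` and ``contains`` are hypothetical and are not taken from the PETRARCH2 source.

```python
# Hypothetical end-of-phrase marker; not taken from the PETRARCH2 source.
END = "#"


def build_trie(phrases):
    """Store each phrase word by word in nested dictionaries."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            # Reuse the child node if this word already follows the path.
            node = node.setdefault(word, {})
        node[END] = True  # a complete phrase ends at this node
    return root


def contains(trie, phrase):
    """Return True only if the whole phrase was stored as an entry."""
    node = trie
    for word in phrase.split():
        if word not in node:
            return False
        node = node[word]
    return END in node


discard = build_trie([
    "WORLD BOXING ASSOCIATION",
    "WORLD BOXING COUNCIL",
    "WORLD CUP",
])
```

All three entries share the single ``WORLD`` node, and the two boxing entries also share ``BOXING``, which is where the lookup speed comes from at the cost of extra memory per node.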
@@ -95,14 +95,13 @@ Synonym sets (synsets) are labelled with a string beginning with & and defined u
the label followed by a series of lines beginning with ``+`` containing words or phrases.
The phrases are interpreted as requiring consecutive words; the words can be separated
with underscores (they are converted to spaces). Synset phrases can
only contain words, not ``$``, ``+``, ``%`` or ``^`` tokens.
Synsets be used anywhere in a
pattern that a word or phrase can be used. A synset must be defined before it is used:
a pattern containing an undefined synset will be ignored.
only contain words, not ``$``, ``+``, ``%`` or ``^`` tokens. Synsets can be used anywhere in a pattern that a word or phrase can be used. A synset must be defined before it is used: a pattern containing an undefined synset will be ignored.

Regular plurals are generated automatically by adding 'S' to the root, adding 'IES' if the root ends in 'Y', and added 'ES' if the root ends in 'SS'. Plurals are not created when [1]_

.. [1] The method for handling irregular plurals is currently different for the verbs and agents dictionaries: these will be reconciled in the future, probably using the agents syntax.
Regular plurals are generated automatically by adding 'S' to the root, adding 'IES' if the root ends in 'Y', and adding 'ES' if the root ends in 'SS'.
The method for handling irregular plurals is currently different for the verbs
and agents dictionaries: these will be reconciled in the future, probably using
the agents syntax.
Plurals are not created when:

* The phrase ends with ``_``.

@@ -118,19 +117,19 @@ just enter these as additional synonyms.
A verb synonym block is a set of verbs which are synonymous (or close enough) with
respect to the patterns. The program automatically generates the regular forms of the
verb if it is regular (and, implicitly, English); otherwise the irregular forms can be
specified in {...} following the primary verb. An optional code for the isolated verb
can follow in [...].
specified in ``{...}`` following the primary verb. An optional code for the isolated verb
can follow in ``[...]``.

The verb block begins with a comment of the form

::

--- <GENERAL DESCRIPTION> [<CODE>] ---

where the "---" signals the beginning of a new block. The code in [...] is the
where the ``---`` signals the beginning of a new block. The code in ``[...]`` is the
primary code -- typically a two-digit+0 cue-category code -- for the block, and this
will be used for all other verbs unless these have their own code. If no code is
present, this defaults to the null code "---" which indicates that the isolated verb
present, this defaults to the null code ``---`` which indicates that the isolated verb
does not generate an event. The null code also can be used as a secondary code.


@@ -139,9 +138,9 @@ does not generate an event. The null code also can be used as a secondary code.
Multiple-word "verbs" such as "CONDON OFF", "WIRE TAP" and "BEEF UP" are entered by
connecting the words with an underscore and putting a '+'
in front of the word in the phrase that is going to be identified as a verb.
If there is no {...}, regular
If there is no ``{...}``, regular
forms are constructed for the word designated by '+'; otherwise all of the irregular
forms are given in {...}. If you can't figure out which part of the phrase is the
forms are given in ``{...}``. If you can't figure out which part of the phrase is the
verb, the phrase you are looking at is probably a noun, not a verb. Multi-word verbs
are treated in patterns just as single-word verbs are treated.

@@ -164,8 +163,9 @@ are more frequently parsed correctly.



** Patterns **
This is followed by a set of patterns -- these begin with '-' -- which are based roughly on
**Patterns**

This is followed by a set of patterns -- these begin with ``-`` -- which are based roughly on
the syntax from TABARI patterns, but the patterns in Petrarch's dictionaries also contain
some syntactic annotation. Pattern lines begin with a
-, and are followed by a five-part pattern:
@@ -176,7 +176,7 @@ some syntactic annotation. Pattern lines begin with a

Any of these can be left empty. Singular nouns are left bare, and should be the "head" of the phrase
they are a member of, e.g. the head of "Much-needed financial aid" would be "aid." If multiple nouns or
adjectives are needed, then that phrase is put in braces as in {FINANCIAL AID}, where the last word is the
adjectives are needed, then that phrase is put in braces as in ``{FINANCIAL AID}``, where the last word is the
head. Prepositional phrases are put in parentheses where the first element is the preposition, and the second
element is a noun, or a braced noun phrase.

@@ -192,16 +192,20 @@ Note that these patterns do not contain other verbs. This is different from TABA
versions of Petrarch. This is to simplify the verbs dictionary, and make the pattern matching
faster and more effective.

** Combinations **
**Combinations**

Petrarch handles many verb-verb interactions automatically through its reformatting of CAMEO's semantic
hierarchy (see utilities.convert_code for more). For instance, if it were parsing the phrase
" A will [help B]", it would code "to help B" first, then the phrase would become "A will [_ B 0x0040]".
And then since help=0x0040 is a subcategory of will=0x3000, then it just adds them together,
ending with the code [A B 0x3040]. This code is translated back into CAMEO for the final output,
yielding [A B 033]. This process works for most instances where the idea of the phrase as a whole

" A will [help B]"

it would code "to help B" first, then the phrase would become "A will [_ B 0x0040]".
And then since ``help=0x0040`` is a subcategory of ``will=0x3000``, then it just adds them together,
ending with the code ``[A B 0x3040]``. This code is translated back into CAMEO for the final output,
yielding ``[A B 033]``. This process works for most instances where the idea of the phrase as a whole
is a combination of the ideas of its children.
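
The arithmetic in this example can be checked with a short sketch. The ``combine`` helper and the constant names are hypothetical; only the two hex values come from the text, and the actual conversion to and from CAMEO (``utilities.convert_code``) is not reproduced here.

```python
# Hex values taken from the example in the text; the names are illustrative.
WILL = 0x3000  # code of the upper verb
HELP = 0x0040  # code of the lower verb, a subcategory of WILL


def combine(upper_code, lower_code):
    """Combine an upper verb code with a subcategory code by addition."""
    return upper_code + lower_code


combined = combine(WILL, HELP)
print(hex(combined))  # prints 0x3040
```

Addition works here because the two codes occupy different hex digits, so neither overwrites the other.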

** Transformations **
**Transformations**

Sometimes these verb-verb interactions aren't represented in the
ontology. It is possible to specify what happens when one verb finds that it is acting on another verb.
@@ -217,19 +221,20 @@ The first element is the topmost source actor, the last element is the topmost v
are converted to codes, so synonyms also match). The inner parenthetical has the same format, with the
first element being the lower source, the second the lower target, and the third the lower verb. It
is possible to replace letter variables with a period '.' to represent "non-specified actor", or with
an underscore '_' to specify "non-present actor." Verbs can also be replaced with "Q" to mean "any verb."
an underscore ``_`` to specify "non-present actor." Verbs can also be replaced with "Q" to mean "any verb."

These transformations are sometimes necessary, but most cases can be handled by the combination process.


** Storage in Memory **
**Storage in Memory**

The verb dictionary, when stored into memory, has three subdictionaries: words, patterns, and transformations.

The words portion contains the base verbs. They are stored as VERB--STUFF BEFORE--#--STUFF AFTER--#--INFO. For
most verbs (i.e. those that are not compounds), The entry just goes VERB -- # -- # -- INFO.
The words portion contains the base verbs. They are stored as ``VERB--STUFF BEFORE--#--STUFF AFTER--#--INFO``. For
most verbs (i.e. those that are not compounds), the entry just goes ``VERB -- # -- # -- INFO``.

The transformation contains almost a literal transcription of the pattern, ordered
VERB1--SOURCE1--VERB2--SOURCE2--TARGET2--INFO.
``VERB1--SOURCE1--VERB2--SOURCE2--TARGET2--INFO``.

The verb patterns in memory have extra annotative symbols after every word to indicate the type of
word that comes next. The very first word encountered is always a noun. Then it follows a series of rules
@@ -394,11 +399,11 @@ for organizations, e.g. ``NGO~``)
Regular plurals -- those formed by adding 'S' to the root, adding 'IES' if the
root ends in 'Y', and adding 'ES' if the root ends in 'SS' -- are generated automatically.
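
As a rough illustration of the regular-plural rule, the sketch below applies the three cases in order. The function name ``regular_plural`` is hypothetical, and replacing the final 'Y' with 'IES' is an assumption about what "adding 'IES'" means; this is not the code PETRARCH2 itself uses.

```python
def regular_plural(root):
    """Form a regular plural from an upper-case dictionary root."""
    if root.endswith("SS"):
        return root + "ES"
    if root.endswith("Y"):
        # Assumption: the final Y is replaced, e.g. ARMY -> ARMIES.
        return root[:-1] + "IES"
    return root + "S"


for word in ("SOLDIER", "ARMY", "BOSS"):
    print(word, "->", regular_plural(word))
```

A rule this simple mishandles roots such as "DAY", which is one reason the dictionaries allow an explicit plural to be given in braces.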

If the plural has some other form, it follows the root inside {...} [1]_
If the plural has some other form, it follows the root inside ``{...}`` [1]_

If a plural should not be formed -- that is, the root is only singular or only
plural, or the singular and plural have the same form (e.g. "police"), use a null
string inside {}.
string inside ``{}``.

If there is more than one form of the plural -- "attorneys general" and "attorneys
generals" are both in use -- just make a second entry with one of the plural forms
@@ -422,7 +427,7 @@ and used in the form
CONGRESS!PERSON! [~LEG}
!MINIST!_OF_INTERNAL_AFFAIRS

The marker for the substitution set is of the form !...! and is followed by an =
The marker for the substitution set is of the form ``!...!`` and is followed by an =
and a comma-delimited list; spaces are stripped from the elements of the list so
these can be added for clarity. Every item in the list is substituted for the marker,
with no additional plural formation, so the first construction would generate
@@ -460,9 +465,10 @@ with no additional plural formation, so the first construction would generate
Discard List
------------

The discard list is used to identify sentences that should not be coded, for example sports events and historical chronologies.[2]_ If the string, prefixed with ' ', is found in the ``<Text>...</Text>`` sentence, the
sentence is not coded. Prefixing the string with a '+' means the entire story is not
coded with the string is found. If the string ends with '_', the matched string must also end with
The discard list is used to identify sentences that should not be coded, for example sports events and historical chronologies. [2]_
If the string, prefixed with ``' '``, is found in the ``<Text>...</Text>`` sentence, the
sentence is not coded. Prefixing the string with a ``+`` means the entire story is not
coded when the string is found. If the string ends with ``_``, the matched string must also end with
a blank or punctuation mark; otherwise it is treated as a stem. The matching is not
case sensitive.
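
The matching rules in this paragraph can be sketched as a plain string search. This is a minimal illustration, assuming a hypothetical helper named ``discard_match`` that checks one entry against one sentence; it does not reflect how PETRARCH2 actually stores or searches the discard list.

```python
import string


def discard_match(entry, sentence):
    """Return 'story', 'sentence', or None for one discard-list entry."""
    scope = "sentence"
    if entry.startswith("+"):
        scope = "story"          # '+' means the entire story is skipped
        entry = entry[1:]
    whole_word = entry.endswith("_")
    if whole_word:
        entry = entry[:-1]       # '_' requires a blank or punctuation next
    needle = " " + entry.upper()         # the entry is prefixed with ' '
    text = " " + sentence.upper() + " "  # case-insensitive comparison
    start = text.find(needle)
    while start != -1:
        end = start + len(needle)
        nxt = text[end:end + 1]
        if not whole_word:
            return scope         # treated as a stem: any continuation is fine
        if nxt == "" or nxt.isspace() or nxt in string.punctuation:
            return scope
        start = text.find(needle, start + 1)
    return None
```

For example, ``discard_match("WORLD CUP_", "World Cups are fun")`` finds no match because of the trailing "S", while the same entry without the underscore is treated as a stem and does match.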

@@ -503,7 +509,7 @@ The optional ``Issues`` dictionary is used to do simple string matching and retu

``<string> [<code>]``

For purposes of matching, a ' ' is added to the beginning and end of the string: at
For purposes of matching, a ``' '`` is added to the beginning and end of the string: at
present there are no wild cards, though that is easily added.

The following expansions can be used (these apply to the string that follows up to
35 changes: 20 additions & 15 deletions docs/source/index.rst
@@ -7,12 +7,15 @@
Introduction
============

A Python Engine for Text Resolution And Related Coding Hierarchy.
A Python Engine for Text Resolution And Related Coding Hierarchy part 2.
This is the documentation for PETRARCH2, though PETRARCH is used throughout
this documentation as interchangeable with PETRARCH2. The difference between
the programs lies in the coding engine rather than the API; more details can be
seen in the `Comparison <status.html>`_.


One of my students put it this way: "Francesco Petrarch was Kayne West. He jumps up on stage, says
'Yo, welcome to the Renaissance, bitches!' And then drops the mike." <br\>
Dorsey Armstrong <br\>
'Yo, welcome to the Renaissance, bitches!' And then drops the mike." -- Dorsey Armstrong
*Great Minds of the Medieval World* (Great Courses Series), lecture 20

PETRARCH is a natural language processing tool for machine-coding events data.
@@ -21,10 +24,10 @@ from which 'whom-did-what-to-whom' relations are extracted.

PETRARCH is the next-generation successor to the `TABARI
<http://eventdata.parusanalytics.com/software.dir/tabari.html>`_ event-data
coding software. More information about the key differences between PETRARCH
and TABARI can be be found `here <current.html>`_ .
coding software. A description of the differences between TABARI and
PETRARCH-generation software is available `here <tabari_vs_petrarch.html>`_.

This software is MIT Licensed (MIT) Copyright &copy; 2014 Open Event Data Alliance
This software is MIT Licensed (MIT) Copyright 2014 Open Event Data Alliance


Events Data
@@ -59,12 +62,8 @@ Installing
----------
If you do decide you want to work with Petrarch as a standalone program, it is possible to install:

1) Clone the repo

- For example, download the zip file into ``~/Downloads``.
- This will put the repo into something like ``~/Downloads/petrarch``.

2) Run ``pip install -e ~/Downloads/petrarch``
1) Run ``pip install git+https://github.com/openeventdata/petrarch2.git``


This will install the program with a command-line hook. You can now run the program using:
@@ -90,6 +89,14 @@ If not installed:

``python petrarch.py batch -i <INPUT FILE> -o <OUTPUT FILE>``

You can see a sample of the input/output by running (assuming you're in the
PETRARCH2 directory):

``petrarch2 batch -i ./petrarch2/data/text/GigaWord.sample.PETR.xml -o
test.txt``

This will return a file named ``evts.test.txt``.

There's also the option to specify a configuration file using the ``-c <CONFIG
FILE>`` flag, but the program will default to using ``PETR_config.ini``.

@@ -106,9 +113,7 @@ Unexpected conditions where the program encountered a potentially fatal error ar

The one common error -- not included in those counts -- is the ``Dateline`` pattern, which is a particular pattern in the parse tree that occurs when the parsed material starts with a dateline such as "Beirut:'' or "Beijing (Xinhua News Agency):" rather than the actual start of the sentence. We probably aren't catching all dateline errors with this pattern but it gets a lot of them, and if you are seeing frequent occurrences of this warning you need to modify your pre-filters to remove the datelines.

The remaining errors are due to very odd sentence constructions which either have confused CoreNLP so that the phrase structure is incorrect, or otherwise were not anticipated in the PETRARCH processing. Some of this
can be fixed if brought to our attention, but some of it is on the side of CoreNLP, which we aren't
even going to attempt to touch.
The remaining errors are due to very odd sentence constructions which either have confused CoreNLP so that the phrase structure is incorrect, or otherwise were not anticipated in the PETRARCH processing. Some of this can be fixed if brought to our attention, but some of it is on the side of CoreNLP, which we aren't even going to attempt to touch.

Contents:
---------
@@ -117,7 +122,7 @@ Contents:
:maxdepth: 2

status.rst
petrarch.rst
petrarch2.rst
dictionaries.rst
inputs.rst
contributing.rst
2 changes: 0 additions & 2 deletions docs/source/inputs.rst
@@ -58,8 +58,6 @@ the text in the entry is from a single sentence or a block of sentences, such
as from the lead paragraph of a news story. Finally, the ``source`` attribute
indicates what source the material came from, such as Agence-France Presse.

,

**General record fields:**

All of these tags should occur on their own lines.
4 changes: 2 additions & 2 deletions docs/source/modules.rst
@@ -1,10 +1,10 @@
PETRARCH Package
========================

:mod:`petrarch` Module
:mod:`petrarch2` Module
----------------------

.. automodule:: petrarch
.. automodule:: petrarch2
:members:
:undoc-members:
:show-inheritance:
5 changes: 3 additions & 2 deletions docs/source/petrarch.rst → docs/source/petrarch2.rst
@@ -1,5 +1,5 @@
PETRARCH
========
PETRARCH2
=========

This page contains some general notes about PETRARCH such as how the data is
stored internally, how the configuration file is organized, and an outline of
@@ -24,6 +24,7 @@ Command Line Interface
``-c``, configuration will be read from that file; default config file is ``PETR_config.ini``.

``parse``
**NOTE:** This command is deprecated in PETRARCH2.
Run the PETRARCH parser specifying files in the command line


3 changes: 2 additions & 1 deletion docs/source/status.rst
@@ -1,5 +1,6 @@
Status of the program 1-September-2015
PETRARCH2 v. PETRARCH
======================================

PETRARCH has been totally redone. The logic now more strongly follows the tree structure
provided to us by the TreeBank parse.
