In [12]:
from IPython.display import HTML
HTML("""<style>{}</style>""".format(open("assets/css/custom.css").read()))

In [1]:
import processors
print(processors.__version__)

3.0.3


In [2]:
from processors import *

API = ProcessorsAPI(port=8881, keep_alive=True)

INFO - Using path given via $PROCESSORS_SERVER
INFO - Connection with server established!
INFO - Server version meets recommendations (v3.0.2)


# Rule-based information extraction with Odin


The Odin manual can be found here: https://arxiv.org/abs/1509.07513

In [142]:
example_doc = API.annotate("Julia-Louis Dreyfus and Brad Hall were married in June of 1987.")

from processors.visualization import JupyterVisualizer as viz

viz.display_graph(example_doc.sentences[0])

# Capturing entities

TODO

## Capturing entities with surface patterns

TODO

## Reusing mentions from an earlier rule

TODO. Give an example.  Titles?

## Challenge: chunking text (part 1)

Write a rule set that captures this simple phrase structure grammar for linguistic constituents:
```
Verb -> (by PoS tag)
Noun -> (by PoS tag)
Adjective -> (by PoS tag)
NP -> determiner (by tag) + zero or more Adjective + one or more Noun
```


## Challenge: chunking text (part 2)

Modify the PSG rules provided in the previous challenge to include one for a verb phrase (VP).  Extend your grammar to cover your `VP` additions.  

<a id="Token-contraints"></a>
## Token constraints

| Field | Description |
|:-----:|:----------|
| `word` | The actual token. |
| `lemma` | The lemma form of the token |
| `tag` | The part-of-speech (PoS) tag assigned to the token |
| `incoming` | Incoming relations from the dependency graph for the token |
| `outgoing` | Outgoing relations from the dependency graph for the token |
| `chunk` | The shallow constituent type (ex. NP, VP) immediately containing the token |
| `entity` | The NER label of the token |
| `mention` | The label of any Mention(s) (i.e., rule output) that contains the token. |

## Regex or exact

TODO

## Case-insensitive patterns

TODO

## Combining token constraints

TODO

## Challenge: nouns with a certain suffix

TODO

## Negating token constraints

TODO

`[!fieldname=pattern]`

## Challenge: no verbs!

Using a single token constraint, match all tokens in the following sentence that are not verbs:

>If you wish to make an apple pie from scratch, you must first invent the universe.

In [154]:
text = "If you wish to make an apple pie from scratch, you must first invent the universe."

d = API.annotate(text)

viz.display_graph(d.sentences[0])

challenge_rules = """
rules:
    - name: "no-verbs"
      label: NotVerb
      # req. 1: This pattern should involve a single token constraint
      # req. 2: The token constraint should use a negated pattern
      pattern: | ???
"""

# mentions = API.odin.extract_from_document(d, challenge_rules)
# for m in mentions: print(m)

## Wildcard

Sometimes any token will suffice to complete a pattern.  In such cases where token constraints are unnecessary, the `[]` wildcard can be used.

Example pattern: `[] people` 
  - Example matches
      - I see **dead people**  
      - All the **lonely people**
      - They are a **strange people**
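
To make the wildcard concrete, here is a minimal pure-Python sketch of what `[] people` asks for: any token at all, followed by the word "people".  The helper `match_wildcard_people` is hypothetical and only mimics the idea on whitespace-split tokens; it is not how Odin evaluates patterns.

```python
def match_wildcard_people(tokens):
    """Return every two-token span whose second token is 'people'.

    The first token is left unconstrained, mirroring the `[]` wildcard.
    """
    return [
        (tokens[i], tokens[i + 1])
        for i in range(len(tokens) - 1)
        if tokens[i + 1] == "people"
    ]


sentences = [
    "I see dead people",
    "All the lonely people",
    "They are a strange people",
]

# Naive whitespace tokenization is an assumption made for this sketch.
wildcard_matches = [match_wildcard_people(s.split()) for s in sentences]

for sentence, found in zip(sentences, wildcard_matches):
    print(sentence, "->", found)
```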
  
    

## Quantifiers

Token constraints, [arguments](#Quantifiers-for-dependency-patterns), and [graph edges](#Quantifiers-in-graph-traversals) can all be quantified.


| Symbol    | Description | Lazy form |
| ------------- |:-------------:| -----:|
|      `?` | The quantified pattern is optional. | `??` |
|      `*` | Repeat the quantified pattern zero or more times. | `*?` |
|      `+` | Repeat the quantified pattern one or more times. | `+?` |
|      `{n}` | Exact repetition. Repeat the quantified pattern n times. | |
|      `{n,m}` | Ranged repetition. Repeat the quantified pattern between *n* and *m* times, where *n* < *m*. | `{n,m}?` |
|      `{,m}` | Open start ranged repetition. Repeat the quantified pattern between 0 and m times, where *m* > 0. | `{,m}?` |
| `{n,}` | Open end ranged repetition. Repeat the quantified pattern at least *n* times, where *n* > 0. | `{n,}?` |
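
If the greedy/lazy distinction is new to you, the sketch below shows it with Python's built-in `re` module.  This is only an analogy: `re` quantifies characters (here, the group `very `), whereas Odin quantifies tokens, arguments, and graph edges.  It assumes Odin's quantifiers follow the usual regular-expression semantics.

```python
import re

text = "very very very tall"
unit = r"(?:very )"  # the thing being quantified

# Each entry pairs a quantifier with the text it consumes from the start.
quantifier_demo = {
    quantifier: re.match(unit + quantifier, text).group()
    for quantifier in ["+", "+?", "{2}", "{,2}", "{,2}?", "??"]
}

for quantifier, matched in quantifier_demo.items():
    print(f"{quantifier:6} -> {matched!r}")

# Greedy forms take as many repetitions as they can;
# lazy forms (with the trailing `?`) take as few as they can.
```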
      
      

## Lookarounds and other zero-width assertions

TODO

| Symbol        | Description   | Example Pattern | Match (in bold) |
| ------------- |:-------------:| -----:|:-------------:|
| `^`     | beginning of sentence | `^ My` | **My** name is Inigo Montoya . |
| `$`      | end of sentence      | `"." $` | My name is Inigo Montoya **.** |
| `(?=...)`      | positive lookahead      | `Inigo (?= Montoya)` | My name is **Inigo** Montoya . |
| `(?!...)`      | negative lookahead      | `Inigo (?! Arocena)` | My name is **Inigo** Montoya . |
| `(?<=...)`      | positive lookbehind     | `(?<= Inigo) Montoya` | My name is Inigo **Montoya** . |
| `(?<!...)`      | negative lookbehind     | `(?<! Carlos) Montoya` | My name is Inigo **Montoya** . |
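
The assertions in the table are "zero-width": they check the context but are not part of the match.  The sketch below reproduces the same examples with Python's `re` module on the raw string.  It is a character-level analogy only (Odin's versions operate on tokens), and the spacing inside the patterns is adjusted to fit `re`.

```python
import re

sentence = "My name is Inigo Montoya ."

lookaround_patterns = {
    "beginning of sentence": r"^My",
    "end of sentence": r"\.$",
    "positive lookahead": r"Inigo(?= Montoya)",
    "negative lookahead": r"Inigo(?! Arocena)",
    "positive lookbehind": r"(?<=Inigo )Montoya",
    "negative lookbehind": r"(?<!Carlos )Montoya",
}

# Only the text outside the lookaround ends up in the match.
lookaround_matches = {
    name: re.search(pattern, sentence).group()
    for name, pattern in lookaround_patterns.items()
}

for name, matched in lookaround_matches.items():
    print(f"{name:22} -> {matched!r}")

# A negative lookahead that rules out the actual context finds nothing.
blocked = re.search(r"Inigo(?! Montoya)", sentence)
```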

# Refining rules: an example

In [131]:
entity_rule_v1 = """
rules: 
    - name: "person"
      label: Person
      priority: 1
      type: token
      pattern: |
        [tag=NNP]+
"""

In [132]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=entity_rule_v1)
for m in mentions: print(m)

[1m[31mPerson[0m: [43m[1m[42mJulia-Louis[0m[0m[43m [0m[43m[1m[42mDreyfus[0m[0m and Brad Hall were married in June of 1987 .
[1m[31mPerson[0m: Julia-Louis Dreyfus and [43m[1m[42mBrad[0m[0m[43m [0m[43m[1m[42mHall[0m[0m were married in June of 1987 .
[1m[31mPerson[0m: Julia-Louis Dreyfus and Brad Hall were married in [43m[1m[42mJune[0m[0m of 1987 .


In [133]:
entity_rule_v2 = """
rules:
    - name: "person"
      label: Person
      priority: 1
      type: token
      pattern: |
        [entity=PERSON]+
"""

In [134]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=entity_rule_v2)
for m in mentions: print(m)

[1m[31mPerson[0m: Julia-Louis Dreyfus and [43m[1m[42mBrad[0m[0m[43m [0m[43m[1m[42mHall[0m[0m were married in June of 1987 .
[1m[31mPerson[0m: [43m[1m[42mJulia-Louis[0m[0m[43m [0m[43m[1m[42mDreyfus[0m[0m and Brad Hall were married in June of 1987 .


# Capturing events and relations

TODO

## Capturing events and relations with surface patterns

TODO

## Capturing events and relations with dependency patterns

TODO

In [135]:
rules_v1 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person = nsubjpass
"""

In [136]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v1)
for m in mentions: 
    if m.matches("Marriage"):
        print(m)

[1m[31mMarriage[0m: [43m[1m[42mJulia-Louis[0m[0m[43m [0m[43m[1m[42mDreyfus[0m[0m[43m [0m[43mand[0m[43m [0m[43mBrad[0m[43m [0m[43mHall[0m[43m [0m[43mwere[0m[43m [0m[43m[1m[44mmarried[0m[0m in June of 1987 .
[1m[31mMarriage[0m: Julia-Louis Dreyfus and [43m[1m[42mBrad[0m[0m[43m [0m[43m[1m[42mHall[0m[0m[43m [0m[43mwere[0m[43m [0m[43m[1m[44mmarried[0m[0m in June of 1987 .


We end up with two `Marriage` event mentions, each containing only one spouse.  Wouldn't it be great if we had a way to specify how many of each argument were required for a single mention?

## Quantifiers for dependency patterns

We know it takes two to tango, so let's try to get those arguments in the same mention.

In [137]:
rules_v2 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person+ = nsubjpass
"""

In [138]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v2)
for m in mentions:
    if m.matches("Marriage"):
        print(m)

[1m[31mMarriage[0m: [43m[1m[42mJulia-Louis[0m[0m[43m [0m[43m[1m[42mDreyfus[0m[0m[43m [0m[43mand[0m[43m [0m[43m[1m[42mBrad[0m[0m[43m [0m[43m[1m[42mHall[0m[0m[43m [0m[43mwere[0m[43m [0m[43m[1m[44mmarried[0m[0m in June of 1987 .


We can even specify an exact number for each argument.

In [139]:
rules_v3 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person{2} = nsubjpass
"""

mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v3)
for m in mentions:
    if m.matches("Marriage"):
        print(m)

[1m[31mMarriage[0m: [43m[1m[42mJulia-Louis[0m[0m[43m [0m[43m[1m[42mDreyfus[0m[0m[43m [0m[43mand[0m[43m [0m[43m[1m[42mBrad[0m[0m[43m [0m[43m[1m[42mHall[0m[0m[43m [0m[43mwere[0m[43m [0m[43m[1m[44mmarried[0m[0m in June of 1987 .


## Challenge: no more than four!

Imagine a polyandrous society where a woman can have at most four husbands.

>In a parallel universe, Marge married Homer, Ned Flanders, and Troy McClure.


Complete the grammar rule set below to satisfy the conditions specified in the challenge.

In [141]:
text = "In a parallel universe, Marge married Homer, Ned Flanders, and Troy McClure."

d = API.annotate(text)

viz.display_graph(d.sentences[0], css=viz.mention_style)

challenge_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: ???
"""

#mentions = API.odin.extract_from_document(doc=d, rules=challenge_rules)
#for m in mentions: print(m)

## Challenge: optional arguments

Modify the grammar below to include two optional arguments in the `Marriage` event: "date" of type `Date` and "location" of type `Location`.  Remember that you'll need additional rules to capture `Date` and `Location` in order for them to be available to the event rule.


In [146]:
text = "Gonzo and Camilla were married in October.  Barack and Michelle were married in Chicago."
d = API.annotate(text)


challenge_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    # TODO: add a rule for Date
    
    # TODO: add a rule for Location
    
    # TODO: add optional args to "marriage-event"
    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person{2} = nsubjpass
"""

mentions = API.odin.extract_from_document(doc=d, rules=challenge_rules)
for m in mentions:
    if m.matches("Marriage"):
        print(m)

[1m[31mMarriage[0m: [43m[1m[42mBarack[0m[0m[43m [0m[43mand[0m[43m [0m[43m[1m[42mMichelle[0m[0m[43m [0m[43mwere[0m[43m [0m[43m[1m[44mmarried[0m[0m in Chicago .


## Quantifiers in graph traversals

TODO

# Variables and rule templates

TODO

Write a couple of events that rely on the same predicate.

Use gist urls for rule sets with vars

File imports are also possible in Odin, but currently for that you'll need to go beyond `py-processors`.  For more complex cases of template use involving multiple files, see the [odin examples]() sbt project or [Reach]().

# Defining a taxonomy

TODO

# Priorities for rules

The `priority` field allows you to specify when (i.e., in which iterations) a rule should be applied.  By default, every rule is applied in every iteration, and Odin keeps iterating until no rule produces a new match.  This means that you usually don't need to worry about setting the priority, but the power is there if you need it.

Note that [quantifiers](#Quantifiers) can be applied to priorities.
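
The "keep going until nothing new turns up" behavior can be pictured as a fixpoint loop.  The sketch below is a toy model in plain Python, not Odin's implementation: rules are ordinary functions over a set of `(label, text)` mentions, and every name in it is hypothetical.

```python
def run_to_fixpoint(rules, max_iterations=100):
    """Apply every rule in each iteration until no rule adds a new mention."""
    mentions = set()
    for iteration in range(1, max_iterations + 1):
        new = set()
        for rule in rules:
            # Rules only see mentions found in earlier iterations.
            new |= rule(mentions) - mentions
        if not new:
            return mentions, iteration
        mentions |= new
    raise RuntimeError("no fixpoint reached")


def person_rule(mentions):
    # Stands in for an entity rule that needs no earlier mentions.
    return {("Person", "Julia-Louis Dreyfus"), ("Person", "Brad Hall")}


def marriage_rule(mentions):
    # Stands in for an event rule that reuses Person mentions.
    spouses = sorted(text for label, text in mentions if label == "Person")
    if len(spouses) < 2:
        return set()
    return {("Marriage", " + ".join(spouses))}


# Rule order does not matter: the event rule simply fires one iteration later.
found, iterations = run_to_fixpoint([marriage_rule, person_rule])
print(iterations, sorted(found))
```

In this toy run the people are found in the first iteration, the marriage in the second, and the third iteration confirms that nothing new can be added.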

# Debugging rules

## Making sense of errors

Here we describe some common errors you may encounter as you learn to write rules.

### *A misspelled or missing `label` field...*

Every rule must have either a `label` or `labels` field.  

This field tells Odin the type of Mention you're trying to capture.  

Remember that these types can be "reused" in subsequent rules (ex. find a `Person` and then find events involving some `Person`).

In [30]:
bad_rules = """
rules: 
    - name: "person"
      type: token
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)

OdinError: rule 'person' has no labels

rules: 
    - name: "person"
      type: token
      pattern: |
        [entity=PERSON]+



### *A misspelled or missing `name` field...*

Every rule needs a **name**.  B shur 2 spel it write two!

In [14]:
bad_rules = """
rules: 
    - nme: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)

OdinError: unnamed rule

rules: 
    - nme: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+



### *An invalid rule `type`*...

By default, rules are assumed to be of type `dependency`.  If you're writing a *dependency* pattern, you can actually leave out the `type` field.  Wow, talk about convenient!

If you're writing a `token` pattern, however, you'll need to specify `type: token`.

In [15]:
bad_rules = """
rules: 
    - name: "person"
      label: Person
      type: tken
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)

OdinError: type 'tken' not recognized for rule 'person'

rules: 
    - name: "person"
      label: Person
      type: tken
      pattern: |
        [entity=PERSON]+



### *An invalid token `field`...*

In the current version of Odin, you are restricted to a predefined set of token fields for use in your patterns.  

See the [token constraints table](#Token-contraints) for a comprehensive list of valid token fields. 
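
A quick local check can catch a mistyped field before the rules are sent to the server.  The sketch below is a hypothetical helper, not part of `py-processors`: it compares the fields used inside `[...]` against the list in the table.  It is deliberately naive and will be confused by values that themselves contain `]` or `=`.

```python
import re

# Field names taken from the token constraints table.
VALID_FIELDS = {
    "word", "lemma", "tag", "incoming",
    "outgoing", "chunk", "entity", "mention",
}


def unknown_token_fields(pattern):
    """Return field names used in token constraints that are not in the table."""
    unknown = []
    for constraint in re.findall(r"\[([^\]]*)\]", pattern):
        for field in re.findall(r"(\w+)\s*!?=", constraint):
            if field not in VALID_FIELDS and field not in unknown:
                unknown.append(field)
    return unknown


print(unknown_token_fields("[nonexistentfield=BLARG]+"))
print(unknown_token_fields("[entity=PERSON]+"))
```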

In [16]:
bad_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [nonexistentfield=BLARG]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)

OdinError: Error parsing rule 'person': unrecognized token field

rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [nonexistentfield=BLARG]+



### *Avoid single-line patterns...*

In [32]:
bad_rules = """
rules: 
    - name: "person"
      label: Person
      priority: 1+
      type: token
      pattern: [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)

OdinError: while parsing a block mapping
 in 'string', line 3, column 7:
        - name: "person"
          ^
expected <block end>, but found Scalar
 in 'string', line 7, column 31:
          pattern: [entity=PERSON]+
                                  ^


rules: 
    - name: "person"
      label: Person
      priority: 1+
      type: token
      pattern: [entity=PERSON]+



While the error message is cryptic, the solution is to simply make the pattern multiline (ex. `pattern: |`).  

#### Great, but what's really happening here?
This pattern never makes it to Odin, because it fails to parse as valid `YAML`.  `|` introduces a multiline `YAML` [scalar](https://en.wikipedia.org/wiki/Variable_%28computer_science%29) (a literal block), which `YAML` will read without complaint and pass along to Odin as a plain string.  

Without the `|`, the `YAML` parser assumes that it's dealing with a list (the `[...]` looks like an inline `YAML` sequence) until it sees the `+`, which blows its mind with a wave of Cthulhu madness, upends its conception of reality, and sends it to an ashram for a period of convalescence and deep introspection.  
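
Since the error message is not much help, a small pre-flight check can point at the offending line before the server ever sees it.  The sketch below is a hypothetical helper built on plain string matching (no `YAML` parser involved): it flags any `pattern:` whose value starts on the same line without a block indicator or quotes.

```python
import re

bad_rules = """
rules: 
    - name: "person"
      label: Person
      priority: 1+
      type: token
      pattern: [entity=PERSON]+
"""


def single_line_patterns(rules_text):
    """Return (line number, line) pairs for unquoted single-line patterns."""
    offenders = []
    for number, line in enumerate(rules_text.splitlines(), start=1):
        match = re.match(r"\s*pattern:\s*(\S.*)$", line)
        # `|` and `>` start a block scalar; quoted values are ordinary strings.
        if match and not match.group(1).startswith(("|", ">", '"', "'")):
            offenders.append((number, line.strip()))
    return offenders


flagged = single_line_patterns(bad_rules)
print(flagged)
```

The reported line number lines up with the `line 7` in the parser's complaint, which makes the cryptic message a little easier to trace.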

### *Every rule must have a unique name...*

We keep track of what rule found each Mention, so rule names need to be unique to avoid ambiguities of provenance.

In [29]:
bad_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
        
    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)

OdinError: rule name 'person' is not unique

rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
        
    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
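
Duplicate names are easy to introduce when copying and pasting rules, so it can help to check for them locally.  The sketch below is a hypothetical helper, not part of `py-processors`.  It uses a regular expression rather than a `YAML` parser and assumes that `name` is the first key of each rule (i.e., it appears as `- name: ...`), as in the examples here.

```python
import re
from collections import Counter

bad_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
"""

NAME_LINE = re.compile(
    r"^[ \t]*-[ \t]*name:[ \t]*[\"']?([^\"'\n]+?)[\"']?[ \t]*$",
    flags=re.MULTILINE,
)


def duplicate_rule_names(rules_text):
    """Return the rule names that appear more than once, sorted."""
    counts = Counter(NAME_LINE.findall(rules_text))
    return sorted(name for name, count in counts.items() if count > 1)


duplicates = duplicate_rule_names(bad_rules)
print(duplicates)
```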

