## Detecting Sequences

Natural language is encoded in patterns, which are themselves composed of sequences of tokens. We therefore want to write rules that can detect these sequences.

For example, a noun phrase is often built from a determiner*, such as "a" or "the", followed by an optional sequence of adjectives, followed by a noun. This is only one possible pattern for noun phrases, but it is useful for illustration here.

*In some texts a "determiner" is also known as an "article", and some would consider it to be another type of adjective. We're considering it as a separate category here to help illustrate different ways that you can build up sequences.
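To make the pattern concrete before bringing in the rules engine, here is a minimal sketch in plain Python. It scans a list of category labels for a determiner, zero or more adjectives, and then a noun. The function name `find_noun_phrases` and the span format are assumptions made for illustration; they are not part of the thoughts library.

```python
# Hypothetical helper for illustration only; not part of the thoughts library.
def find_noun_phrases(cats):
    """Return (start, end) spans that match: det, zero or more adj, noun."""
    spans = []
    i = 0
    while i < len(cats):
        if cats[i] == "det":
            j = i + 1
            # skip over the optional run of adjectives
            while j < len(cats) and cats[j] == "adj":
                j += 1
            # the run must be closed by a noun to count as a noun phrase
            if j < len(cats) and cats[j] == "noun":
                spans.append((i, j + 1))
                i = j + 1
                continue
        i += 1
    return spans

# categories for "the quick fox jumped over the lazy dog"
cats = ["det", "adj", "noun", "verb", "prep", "det", "adj", "noun"]
print(find_noun_phrases(cats))  # [(0, 3), (5, 8)]
```

The end index of each span is exclusive, which mirrors the way `#seq-start` and `#seq-end` are reported on the tokens later in this tutorial.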

In [1]:
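import os, sys
# add the repository root (three levels up) to the import path;
# os.path.join keeps the relative path portable across operating systems
sys.path.insert(1, os.path.abspath(os.path.join('..', '..', '..')))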
from thoughts.rules_engine import RulesEngine
import pprint

# start a new engine
engine = RulesEngine()

## Review: Tokenizing Text and Enhancing the Tokens

In the last tutorial on Tokens, we generated tokens from a sentence and then used the #lookup command to enhance those tokens with information from the knowledge base (KB). Let's run that code again here to pick up where that tutorial left off.

In [2]:
rules = [
    # KB facts: each lemma and its category
    {"lemma": "the", "cat": "det"},
    {"lemma": "quick", "cat": "adj"},
    {"lemma": "brown", "cat": "adj"},
    {"lemma": "fox", "cat": "noun"},
    {"lemma": "jumped", "cat": "verb"},
    {"lemma": "over", "cat": "prep"},
    {"lemma": "lazy", "cat": "adj"},
    {"lemma": "dog", "cat": "noun"},

    {"#when": {"text": "?sentence"},
     "#then": [{"#tokenize": "?sentence", "assert": {"#lookup": {"lemma": "#"}}}]
    }
]

engine.add_rules(rules)

results = engine.process({"text": "the quick fox jumped over the lazy dog"})
pprint.pprint(results, sort_dicts=False)

[{'lemma': 'the', 'cat': 'det', '#seq-start': 0, '#seq-end': 1},
 {'lemma': 'quick', 'cat': 'adj', '#seq-start': 1, '#seq-end': 2},
 {'lemma': 'fox', 'cat': 'noun', '#seq-start': 2, '#seq-end': 3},
 {'lemma': 'jumped', 'cat': 'verb', '#seq-start': 3, '#seq-end': 4},
 {'lemma': 'over', 'cat': 'prep', '#seq-start': 4, '#seq-end': 5},
 {'lemma': 'the', 'cat': 'det', '#seq-start': 5, '#seq-end': 6},
 {'lemma': 'lazy', 'cat': 'adj', '#seq-start': 6, '#seq-end': 7},
 {'lemma': 'dog', 'cat': 'noun', '#seq-start': 7, '#seq-end': 8}]


## Sequence-Based Rules

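Let's write a rule to detect the sequence where the "cat" (category) is det-adj-noun. Note that this rule expects exactly one adjective, which is enough for our example sentence. The #when clause supports sequences when you wrap the terms in a list: "#when": [{term1}, {term2}, {term3}].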

When the engine detects the first term (term1), it holds on to that rule in memory as an arc. An arc is a rule that is in the process of being matched and is waiting for the next term (term2) to arrive. Each time a new term arrives via assertion and matches an arc, the arc is copied and the copy is advanced by one term. In this way, the engine hangs on to all partial matches during parsing. If you are familiar with chart parsing, this is the same concept.
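The arc idea can be sketched in a few lines of plain Python. This is a toy model, not the engine's actual implementation: the names `PATTERN` and `advance` are hypothetical, and the sketch assumes that a token only extends an arc when it starts where the arc ended.

```python
# Toy model of arcs; names are hypothetical and this is not the thoughts engine.
PATTERN = ["det", "adj", "noun"]

def advance(arcs, token):
    """Return the arcs held after one token arrives, plus any completed spans.

    An arc is (index of next expected term, span start, span end).
    """
    cat, start, end = token
    kept = list(arcs)  # existing arcs stay in memory
    done = []
    # candidates: every stored arc, plus a fresh attempt starting at this token
    for next_term, arc_start, arc_end in arcs + [(0, start, start)]:
        # assumption: the token must be adjacent to the arc to extend it
        if arc_end == start and PATTERN[next_term] == cat:
            if next_term + 1 == len(PATTERN):
                done.append((arc_start, end))  # the whole pattern matched
            else:
                # copy the arc and advance the copy by one term
                kept.append((next_term + 1, arc_start, end))
    return kept, done

tokens = [("det", 0, 1), ("adj", 1, 2), ("noun", 2, 3), ("verb", 3, 4)]
arcs, matches = [], []
for tok in tokens:
    arcs, done = advance(arcs, tok)
    matches.extend(done)

print(matches)  # [(0, 3)]
print(arcs)     # [(1, 0, 1), (2, 0, 2)]
```

After the loop, the two partial matches are still held in `arcs` even though the full pattern has already completed.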

In [3]:
rules = [
    {"#when": [{"cat": "det"}, {"cat": "adj"}, {"cat": "noun", "lemma": "?noun"}],
     "#then": [{"cat": "np", "entity": "?noun"}]}
]

engine.add_rules(rules)

results = engine.process({"text": "the quick fox jumped over the lazy dog"})
pprint.pprint(results, sort_dicts=False)

[{'lemma': 'the', 'cat': 'det', '#seq-start': 0, '#seq-end': 1},
 {'lemma': 'quick', 'cat': 'adj', '#seq-start': 1, '#seq-end': 2},
 {'cat': 'np', 'entity': 'fox'},
 {'lemma': 'jumped', 'cat': 'verb', '#seq-start': 3, '#seq-end': 4},
 {'lemma': 'over', 'cat': 'prep', '#seq-start': 4, '#seq-end': 5},
 {'lemma': 'the', 'cat': 'det', '#seq-start': 5, '#seq-end': 6},
 {'lemma': 'lazy', 'cat': 'adj', '#seq-start': 6, '#seq-end': 7},
 {'cat': 'np', 'entity': 'dog'}]


## Clearing the Arcs

As described above, the engine hangs on to all partial matches for rules as arcs: partially matched rules waiting for the next term to arrive.

As the rule developer, you need to decide when the engine should clear out its arcs. The most common time to clear them is immediately before the next full parse. You can do this with the #clear-arcs command.
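One plausible reason to clear arcs between parses is that a partial match left over from one input could otherwise be completed by tokens from the next. The sketch below illustrates this with a toy class. `ToyMatcher` is a hypothetical stand-in that assumes arcs are keyed by token position and that positions restart at 0 for each input; it does not claim to reproduce the engine's real behavior.

```python
class ToyMatcher:
    """Hypothetical stand-in for an engine that stores arcs between calls."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.arcs = []  # each arc: (index of next expected term, end position)

    def clear_arcs(self):
        self.arcs = []

    def process(self, cats):
        """Feed categories in order; return how many full matches completed."""
        completed = 0
        for pos, cat in enumerate(cats):
            # stored arcs plus a fresh attempt starting at this position
            for next_term, arc_end in self.arcs + [(0, pos)]:
                if arc_end == pos and self.pattern[next_term] == cat:
                    if next_term + 1 == len(self.pattern):
                        completed += 1
                    else:
                        self.arcs.append((next_term + 1, pos + 1))
        return completed

matcher = ToyMatcher(["det", "adj", "noun"])

# first input ends after "det adj", leaving an arc that waits for a noun
matcher.process(["det", "adj"])

# second input has no det or adj, yet the stale arc completes on its noun
stale = matcher.process(["verb", "prep", "noun"])

# clearing the arcs first removes the spurious match
matcher.clear_arcs()
clean = matcher.process(["verb", "prep", "noun"])

print(stale, clean)  # 1 0
```

In this toy model, the leftover arc from the first input produces a match that spans two unrelated inputs, and clearing the arcs before the next parse avoids it.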

In [4]:
# Clear the arcs manually from code.
# Alternatively, put {"#clear-arcs": ""} as a command in the #then clause:
# {"#when": {"text": "?sentence"},
#  "#then": [{"#clear-arcs": ""},
#            {"#tokenize": "?sentence", "assert": {"#lookup": {"lemma": "#"}}}]
# }
engine.clear_arcs()

results = engine.process({"text": "the quick fox jumped over the lazy dog"})
pprint.pprint(results, sort_dicts=False)

[{'lemma': 'the', 'cat': 'det', '#seq-start': 0, '#seq-end': 1},
 {'lemma': 'quick', 'cat': 'adj', '#seq-start': 1, '#seq-end': 2},
 {'cat': 'np', 'entity': 'fox'},
 {'lemma': 'jumped', 'cat': 'verb', '#seq-start': 3, '#seq-end': 4},
 {'lemma': 'over', 'cat': 'prep', '#seq-start': 4, '#seq-end': 5},
 {'lemma': 'the', 'cat': 'det', '#seq-start': 5, '#seq-end': 6},
 {'lemma': 'lazy', 'cat': 'adj', '#seq-start': 6, '#seq-end': 7},
 {'cat': 'np', 'entity': 'dog'}]


## What's Next - Structures

Now that we've seen how to build sequences, let's look at how to build up structures, which can detect multiple sequences at different levels and build them up into assertion facts for additional rules to use.