## Parsing Sentences Using Tokens

Template-based parsing from the last tutorial is a useful and powerful technique for many natural language processing scenarios, but it also has its limitations.

What if you could analyze the sentence word by word, using the information in each word to contribute to the overall meaning of the sentence? Most modern parsers work this way, including those used in the machine learning and symbolic computing communities.

Let's start up the engine and walk through an example.

In [1]:
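import os, sys
# make the package two directories up importable (os.path.join keeps this portable across platforms)
sys.path.insert(1, os.path.abspath(os.path.join('..', '..')))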
from thoughts.rules_engine import RulesEngine
import pprint

# start a new engine
engine = RulesEngine()

## Tokenizing the Input

The first step is to "tokenize" the input. We need a way to take a raw sentence, which is a sequence of words, and to trigger rules for each individual word in the sentence.

To do this, we can use the #tokenize command, which asserts a fact for every word in an input string. In the example below, the rule binds the input text to the ?sentence variable and passes it to #tokenize, which outputs one assertion per word.

Note that only the final set of resulting assertions is displayed in the output. These are the tokenized assertions generated by the #tokenize command from the input sentence. Each one holds the word under the "#" key, along with its start and end positions in "#seq-start" and "#seq-end".

In [2]:
rules = [

    {"#when": {"text": "?sentence"},
     "#then": [{"#tokenize": "?sentence"}]
    }
]

engine.clear_rules()
engine.add_rules(rules)

results = engine.process({"text": "the quick brown fox jumped over the lazy dog"})
pprint.pprint(results, sort_dicts=False)

[{'#': 'the', '#seq-start': 0, '#seq-end': 1},
 {'#': 'quick', '#seq-start': 1, '#seq-end': 2},
 {'#': 'brown', '#seq-start': 2, '#seq-end': 3},
 {'#': 'fox', '#seq-start': 3, '#seq-end': 4},
 {'#': 'jumped', '#seq-start': 4, '#seq-end': 5},
 {'#': 'over', '#seq-start': 5, '#seq-end': 6},
 {'#': 'the', '#seq-start': 6, '#seq-end': 7},
 {'#': 'lazy', '#seq-start': 7, '#seq-end': 8},
 {'#': 'dog', '#seq-start': 8, '#seq-end': 9}]
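
To make the shape of this output concrete, here is a minimal plain-Python sketch that produces the same structure. It is not the engine's implementation: the `tokenize` helper is a hypothetical name, and splitting on whitespace is an assumption about how words are separated.

```python
def tokenize(sentence):
    """Hypothetical helper: split on whitespace and record each word's position."""
    tokens = []
    for index, word in enumerate(sentence.split()):
        # each token spans from its own index to the next one
        tokens.append({"#": word, "#seq-start": index, "#seq-end": index + 1})
    return tokens

tokens = tokenize("the quick brown fox")
for token in tokens:
    print(token)
```

Because each token's "#seq-end" equals the next token's "#seq-start", adjacent words can later be recognized as neighbors by comparing these two values.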


## Wrapping the Tokens in Assertions

Great - now we have generated an assertion for each token (word) in the sentence.

What we need now is a way to tell the engine what to do with those assertions, so that we can control the form in which they are asserted and other rules can pick them up.

The #tokenize command supports this through its "assert" attribute. Within the "assert" attribute, you specify a template that describes how each generated token should be asserted. To insert the original token into this template, use the "#" value, and the #tokenize command will put the token in that spot.

In the example below, the #tokenize command will assert each token as a #lookup command, inserting the token into a "lemma" attribute within a new fact.

In [4]:
rules = [

    {"#when": {"text": "?sentence"},
     "#then": [{"#tokenize": "?sentence", "assert": {"#lookup": {"lemma": "#"}}}]
    }
]

engine.clear_rules()
engine.add_rules(rules)

results = engine.process({"text": "the quick brown fox jumped over the lazy dog"})
pprint.pprint(results, sort_dicts=False)

[{'#lookup': {'lemma': 'the'}, '#seq-start': 0, '#seq-end': 1},
 {'#lookup': {'lemma': 'quick'}, '#seq-start': 1, '#seq-end': 2},
 {'#lookup': {'lemma': 'brown'}, '#seq-start': 2, '#seq-end': 3},
 {'#lookup': {'lemma': 'fox'}, '#seq-start': 3, '#seq-end': 4},
 {'#lookup': {'lemma': 'jumped'}, '#seq-start': 4, '#seq-end': 5},
 {'#lookup': {'lemma': 'over'}, '#seq-start': 5, '#seq-end': 6},
 {'#lookup': {'lemma': 'the'}, '#seq-start': 6, '#seq-end': 7},
 {'#lookup': {'lemma': 'lazy'}, '#seq-start': 7, '#seq-end': 8},
 {'#lookup': {'lemma': 'dog'}, '#seq-start': 8, '#seq-end': 9}]
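
The substitution step can be pictured with a short plain-Python sketch. This is only an illustration of the idea under stated assumptions, not the engine's code: `fill_template` and `tokenize_with_template` are hypothetical names, and the template is assumed to be a dictionary at the top level.

```python
def fill_template(template, token):
    """Hypothetical helper: return a copy of template with every "#" value replaced by token."""
    if template == "#":
        return token
    if isinstance(template, dict):
        return {key: fill_template(value, token) for key, value in template.items()}
    if isinstance(template, list):
        return [fill_template(item, token) for item in template]
    return template  # any other value is kept unchanged

def tokenize_with_template(sentence, template):
    """Hypothetical helper: wrap each whitespace-separated word in the template."""
    assertions = []
    for index, word in enumerate(sentence.split()):
        assertion = fill_template(template, word)  # assumes template is a dict
        assertion["#seq-start"] = index
        assertion["#seq-end"] = index + 1
        assertions.append(assertion)
    return assertions

wrapped = tokenize_with_template("the quick fox", {"#lookup": {"lemma": "#"}})
for assertion in wrapped:
    print(assertion)
```

The template itself is never modified; each word gets its own filled-in copy, which is why the same template can be reused for every token.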


## Enhancing the Tokens with Information from Other Facts

The #lookup command can now take each token, find the matching fact in the engine's knowledge base (KB), and assert that fact back into the engine. The effect is that you can enhance facts that carry only partial information, such as the tokens generated by the #tokenize command, with information available elsewhere in your KB. Note that the cell below adds the word facts without calling clear_rules(), so the #tokenize rule from the previous cell stays loaded alongside them.

In [5]:
rules = [
  
    {"lemma": "the", "cat": "det"},
    {"lemma": "quick", "cat": "adj"},
    {"lemma": "brown", "cat": "adj"},
    {"lemma": "fox", "cat": "noun"},
    {"lemma": "jumped", "cat": "verb"},
    {"lemma": "over", "cat": "prep"},
    {"lemma": "lazy", "cat": "adj"},
    {"lemma": "dog", "cat": "noun"}
]

engine.add_rules(rules)

results = engine.process({"text": "the quick brown fox jumped over the lazy dog"})
pprint.pprint(results, sort_dicts=False)

[{'lemma': 'the', 'cat': 'det', '#seq-start': 0, '#seq-end': 1},
 {'lemma': 'quick', 'cat': 'adj', '#seq-start': 1, '#seq-end': 2},
 {'lemma': 'brown', 'cat': 'adj', '#seq-start': 2, '#seq-end': 3},
 {'lemma': 'fox', 'cat': 'noun', '#seq-start': 3, '#seq-end': 4},
 {'lemma': 'jumped', 'cat': 'verb', '#seq-start': 4, '#seq-end': 5},
 {'lemma': 'over', 'cat': 'prep', '#seq-start': 5, '#seq-end': 6},
 {'lemma': 'the', 'cat': 'det', '#seq-start': 6, '#seq-end': 7},
 {'lemma': 'lazy', 'cat': 'adj', '#seq-start': 7, '#seq-end': 8},
 {'lemma': 'dog', 'cat': 'noun', '#seq-start': 8, '#seq-end': 9}]
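
As a rough mental model, the lookup can be thought of as matching the partial fact against every fact in the KB and carrying the sequence positions over to the match. The sketch below shows that idea in plain Python. The names `KB`, `lookup`, and `enhance` are hypothetical, and the matching rule (every key in the pattern must be equal in the fact) is an assumption rather than the engine's documented behavior.

```python
# a tiny stand-in for the engine's knowledge base
KB = [
    {"lemma": "the", "cat": "det"},
    {"lemma": "fox", "cat": "noun"},
    {"lemma": "jumped", "cat": "verb"},
]

def lookup(pattern, kb):
    """Hypothetical helper: return copies of all facts that contain every key/value in pattern."""
    return [dict(fact) for fact in kb
            if all(fact.get(key) == value for key, value in pattern.items())]

def enhance(assertion, kb):
    """Hypothetical helper: resolve a #lookup assertion and keep its sequence positions."""
    enhanced = []
    for fact in lookup(assertion["#lookup"], kb):
        fact["#seq-start"] = assertion["#seq-start"]
        fact["#seq-end"] = assertion["#seq-end"]
        enhanced.append(fact)
    return enhanced

result = enhance({"#lookup": {"lemma": "fox"}, "#seq-start": 3, "#seq-end": 4}, KB)
print(result)
```

In this sketch a word with no matching fact simply yields an empty list; how the engine itself treats unknown words is not covered here.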


## Next Step - Detecting Sequences

Now that we've tokenized a sentence and performed a lookup on each token to return an enhanced version of each, we can use this information to look for sequences of terms. Sequences are key to parsing natural language: for example, a noun phrase can be detected from a sequence of adjectives followed by a noun.
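
To give a feel for what sequence detection involves, here is a small plain-Python sketch that scans the enhanced tokens for an optional determiner, any number of adjectives, and then a noun. It is only an illustration and does not use the rules engine: `find_noun_phrases` and the "np" category label are hypothetical.

```python
def find_noun_phrases(tokens):
    """Hypothetical helper: find runs of an optional det, zero or more adj, then a noun."""
    phrases = []
    i = 0
    while i < len(tokens):
        j = i
        if tokens[j]["cat"] == "det":
            j += 1  # optional determiner
        while j < len(tokens) and tokens[j]["cat"] == "adj":
            j += 1  # any number of adjectives
        if j < len(tokens) and tokens[j]["cat"] == "noun":
            phrases.append({
                "cat": "np",
                "words": [t["lemma"] for t in tokens[i:j + 1]],
                "#seq-start": tokens[i]["#seq-start"],
                "#seq-end": tokens[j]["#seq-end"],
            })
            i = j + 1  # continue after the noun
        else:
            i += 1  # no phrase starts here; try the next token
    return phrases

words = "the quick brown fox jumped over the lazy dog".split()
cats = ["det", "adj", "adj", "noun", "verb", "prep", "det", "adj", "noun"]
tokens = [{"lemma": w, "cat": c, "#seq-start": i, "#seq-end": i + 1}
          for i, (w, c) in enumerate(zip(words, cats))]

phrases = find_noun_phrases(tokens)
for phrase in phrases:
    print(phrase)
```

Each detected phrase spans from the "#seq-start" of its first word to the "#seq-end" of its last, so it can be treated as a single larger unit in the same position scheme as the tokens.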