# Testing Padatious parser

The Padatious parser from Mycroft AI trains a neural network on a set of intents and entity representations, and uses it to classify natural-language inputs into pre-registered intents.

The underlying neural network library is FANN, a C library with bindings for Python and other languages. Padatious trains a simple net with one hidden layer on the inputs given. Inputs are compiled into regexes, so entities and their positions can be nested within intents.

This net is not based on any foundation model and has no world knowledge, so any input must closely match the training phrases. That works against our goal of automatic synonym matching, because it produces false negatives. Even word2vec does better in this regard, although it produces many false positives.

As an example, let's train "hello" and "goodbye" intents and try some alternative greetings.


In [49]:
from padatious import IntentContainer

hi_list = ['hey',
'hello',
'hi',
'hello there',
'good morning',
'good evening',
'moin',
'hey there',
"let's go",
'hey dude',
'goodmorning',
'goodevening',
'good afternoon',
"what's up"]

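# In the run recorded below these entries were not comma-separated, so Python
# concatenated them into a single string; the commas are restored here.
bye_list = ['cu',
'good by',
'cee you later',
'good night',
'bye',
'goodbye',
'have a nice day',
'see you around',
'bye bye',
'see you later']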

container = IntentContainer('intent_cache')

container.add_intent('hello', hi_list, reload_cache=True)
container.add_intent('goodbye', bye_list, reload_cache=True)


In [50]:
test_list = ['hello', 'helo', 'what up', 'bye', 'good day', 'later', 'error', 'help']
for i in test_list:
    intent = container.calc_intent(i)
    print(intent)

Regenerated goodbye.
Regenerated hello.

{'name': 'hello', 'sent': ['hello'], 'matches': {}, 'conf': 1.0}
{'name': 'hello', 'sent': 'helo', 'matches': {}, 'conf': 0.0}
{'name': 'hello', 'sent': 'what up', 'matches': {}, 'conf': 0.6746383452819594}
{'name': 'hello', 'sent': 'bye', 'matches': {}, 'conf': 0.0}
{'name': 'hello', 'sent': 'good day', 'matches': {}, 'conf': 0.17816539894311617}
{'name': 'hello', 'sent': 'later', 'matches': {}, 'conf': 0.0}
{'name': 'hello', 'sent': 'error', 'matches': {}, 'conf': 0.0}
{'name': 'hello', 'sent': 'help', 'matches': {}, 'conf': 0.0}


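As you can see, confidence is poor for messages that are subsets of training phrases (like "later" versus "see you later"), and even "bye", which appears verbatim in the training data, scores 0.0. At least part of this is likely a data problem rather than a Padatious problem: in the run recorded above, the entries of `bye_list` were not separated by commas, so Python joined the adjacent string literals into one long string and the goodbye intent was trained on a single garbled example. The classifier also doesn't deal well with typos like "helo", or with context-dependent phrases like "good day" (which matches the pattern of "good morning" but is more often used in English to say "goodbye").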

The messages "error" and "help" should not match either of the intents we've listed, but the classifier can only return intents that have been registered. In some ways this is desirable: we don't want bots to take commands that don't exist. We can create a "fallback" intent to catch unknown commands like so:

In [51]:
unk_list = ['error', 'what?', 'i dont get it', 'can i have a cookie']

container.add_intent('fallback', unk_list, reload_cache=True)


In [52]:
for i in test_list:
    intent = container.calc_intent(i)
    print(intent)

Regenerated fallback.
{'name': 'hello', 'sent': ['hello'], 'matches': {}, 'conf': 1.0}
{'name': 'hello', 'sent': 'helo', 'matches': {}, 'conf': 0.0}
{'name': 'hello', 'sent': 'what up', 'matches': {}, 'conf': 0.6746383452819594}
{'name': 'hello', 'sent': 'bye', 'matches': {}, 'conf': 0.0}
{'name': 'hello', 'sent': 'good day', 'matches': {}, 'conf': 0.17816539894311617}
{'name': 'hello', 'sent': 'later', 'matches': {}, 'conf': 0.0}
{'name': 'fallback', 'sent': ['error'], 'matches': {}, 'conf': 1.0}
{'name': 'hello', 'sent': 'help', 'matches': {}, 'conf': 0.0}
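
Another option, not used in the run above, is to treat any match below a confidence cutoff as unknown. The following is a minimal sketch under stated assumptions: `Match` is a stand-in for whatever `calc_intent` returns (only the `name` and `conf` fields seen in the output are used), and `resolve_intent` and the 0.5 threshold are hypothetical and would need tuning against real data.

```python
from collections import namedtuple

# Stand-in for the object returned by container.calc_intent (assumption:
# only the `name` and `conf` fields shown in the output above are needed).
Match = namedtuple("Match", ["name", "conf"])


def resolve_intent(match, threshold=0.5, fallback="fallback"):
    """Return the matched intent name, or `fallback` when confidence is low."""
    if match.conf < threshold:
        return fallback
    return match.name


print(resolve_intent(Match("hello", 1.0)))
print(resolve_intent(Match("hello", 0.0)))
```

With a cutoff like this, "helo" and "help" would be routed to the fallback instead of being reported as a zero-confidence "hello".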


This still doesn't solve the problem of manually inputting possible commands, though. In `forest/semantic_dist.py` I tested a method using WordNet, a database of synonyms and related words. The goal was to automatically pull the intent data by finding nearby words to our current commands, so the dev only has to decide what the main command is.

Let's get the related words for our current examples and a few more:

In [53]:
from forest.semantic_dist import get_synonyms

In [54]:
commands = ['goodbye', 'hello', 'imagine', 'paint', 'pay', 'ping', 'printerfact', 'uptime']

In [55]:
container = IntentContainer('intent_cache')

no_syns = []
for command in commands:
    syns = get_synonyms(command)
    if len(syns) > 0:
        container.add_intent(command, syns, reload_cache=True)
        print(command + ':\n', syns)
    else:
        no_syns.append(command)

print('fallback', no_syns)        
container.add_intent('fallback', no_syns, reload_cache=True)

goodbye:
 ['adieu', 'adios', 'arrivederci', 'auf wiedersehen', 'au revoir', 'bye', 'bye bye', 'cheerio', 'good by', 'goodby', 'good bye', 'goodbye', 'good day', 'sayonara', 'so long', 'farewell', 'word of farewell']
hello:
 ['hello', 'hullo', 'hi', 'howdy', 'how do you do', 'greeting', 'salutation']
imagine:
 ['imagine', 'conceive of', 'ideate', 'envisage', 'think', 'opine', 'suppose', 'imagine', 'reckon', 'guess', 'dream', 'daydream', 'woolgather', 'stargaze', 'envision', 'foresee', 'fantasize', 'fantasise', 'fantasy', 'fantasize', 'fantasise', 'prefigure', 'think', 'visualize', 'visualise', 'envision', 'project', 'fancy', 'see', 'figure', 'picture', 'image', 'visualize', 'visualise', 'suspect', 'create by mental act', 'create mentally', 'expect', 'anticipate']
paint:
 ['paint', 'pigment', 'key', 'paint', 'rouge', 'paint', 'blusher', 'paint', 'paint', 'paint', 'paint', 'acrylic', 'acrylic paint', 'antifouling paint', 'coat of paint', 'distemper', 'enamel', 'encaustic', 'finger paint',
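
Several of the synonym lists above repeat entries ("paint" appears many times, "imagine" twice). Whether duplicates affect Padatious training was not tested here, but removing them keeps the training data easier to read. A small sketch, where `dedupe` is a hypothetical helper and not part of `forest.semantic_dist`:

```python
def dedupe(phrases):
    """Drop repeated phrases while keeping first-seen order."""
    # dict preserves insertion order, so this keeps the first occurrence of each phrase
    return list(dict.fromkeys(phrases))


syns = ['paint', 'pigment', 'key', 'paint', 'rouge', 'paint', 'blusher']
print(dedupe(syns))
```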

In [56]:
test_commands = [
   'hello', 
   'helo', 
   'what up', 
   'bye', 
   'good day', 
   'later', 
   'error', 
   'help',
   'imagine a thing',
   'paint a thing',
   'draw a thing',
   'pigment a thing',
   'image of a thing',
   'time',
   'uptime',
   'up period',
   'how long have you been up',
   'printerfact',
   'printer fact',
   'printer',
   'ping',
   'pong',
   'pay',
   'send payment',
   'wallet',
   'pay me back',
   'cancel',
]


In [57]:
responses = []
for i in test_commands:
    intent = container.calc_intent(i)
    responses.append({'name': intent.name, 'sent': intent.sent, 'conf': intent.conf})

sorted_responses = sorted(responses, key = lambda x: x['conf'], reverse=True)

Regenerated goodbye.
Regenerated hello.
Regenerated ping.
Regenerated imagine.
Regenerated uptime.
Regenerated fallback.
Regenerated paint.
Regenerated pay.


In [58]:
(sorted_responses)

[{'name': 'hello', 'sent': ['hello'], 'conf': 1.0},
 {'name': 'goodbye', 'sent': ['bye'], 'conf': 1.0},
 {'name': 'goodbye', 'sent': ['good', 'day'], 'conf': 1.0},
 {'name': 'uptime', 'sent': ['uptime'], 'conf': 1.0},
 {'name': 'fallback', 'sent': ['printerfact'], 'conf': 1.0},
 {'name': 'ping', 'sent': ['ping'], 'conf': 1.0},
 {'name': 'pay', 'sent': ['pay'], 'conf': 1.0},
 {'name': 'uptime', 'sent': 'up period', 'conf': 0.581136967002809},
 {'name': 'hello',
  'sent': 'how long have you been up',
  'conf': 0.5648365459442635},
 {'name': 'pay', 'sent': 'pay me back', 'conf': 0.5195182216860899},
 {'name': 'imagine', 'sent': 'imagine a thing', 'conf': 0.48457554761404315},
 {'name': 'paint', 'sent': 'pigment a thing', 'conf': 0.4822043721533956},
 {'name': 'imagine', 'sent': 'image of a thing', 'conf': 0.4626699964227443},
 {'name': 'paint', 'sent': 'paint a thing', 'conf': 0.44062001646397364},
 {'name': 'pay', 'sent': 'send payment', 'conf': 0.4146607694993516},
 {'name': 'pay', 'sen

Just under half are parsed correctly. The fallback intent isn't catching most of the unknown inputs, because it is trained only on commands that have no synonyms. A hardcoded list of phrases like "what" or "error" might help by giving it more to work with, but that only solves part of the problem. Without world knowledge we're limited to whatever synonyms we can find, whether manually or programmatically.
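
To make a figure like "just under half" repeatable, the responses can be scored against hand-written labels. This is a sketch only: the `expected` labels below are my own assumptions about what each phrase should map to, and `phrase` and `accuracy` are hypothetical helpers. The `sent` field is normalized because the output above shows it as either a string or a token list.

```python
def phrase(sent):
    """Normalize `sent`, which appears as either a string or a list of tokens."""
    return " ".join(sent) if isinstance(sent, list) else sent


def accuracy(responses, expected):
    """Fraction of labelled phrases whose predicted intent equals the label."""
    scored = [r for r in responses if phrase(r["sent"]) in expected]
    if not scored:
        return 0.0
    hits = sum(1 for r in scored if expected[phrase(r["sent"])] == r["name"])
    return hits / len(scored)


# A few rows copied from the sorted output above.
responses = [
    {"name": "hello", "sent": ["hello"], "conf": 1.0},
    {"name": "goodbye", "sent": ["bye"], "conf": 1.0},
    {"name": "hello", "sent": "how long have you been up", "conf": 0.5648365459442635},
    {"name": "pay", "sent": "pay me back", "conf": 0.5195182216860899},
]

# Assumed labels, chosen by hand for illustration.
expected = {
    "hello": "hello",
    "bye": "goodbye",
    "how long have you been up": "uptime",
    "pay me back": "pay",
}

print(accuracy(responses, expected))
```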

The larger problem is that natural language is not a command line. Once people think a bot can understand fuzzy phrasing, they will begin to use casual dialog messages. These can't be extracted from a database like WordNet. Some examples:

In [59]:
intent = container.calc_intent('hi how are you doing')
print(intent)

{'name': 'hello', 'sent': 'hi how are you doing', 'matches': {}, 'conf': 0.6703060311786331}


In [60]:
intent = container.calc_intent('i want a printer fact')
print(intent)

{'name': 'goodbye', 'sent': 'i want a printer fact', 'matches': {}, 'conf': 0.19702130549922886}


In [61]:
intent = container.calc_intent('send money')
print(intent)

{'name': 'hello', 'sent': 'send money', 'matches': {}, 'conf': 0.19960761592598483}


## Entity recognition

Padatious supports slot filling through entity recognition. This requires registering entity types within the intents, as in `choose_flavor.intent`:
```
i want that {flavor} jerky
gimme the {flavor}
i'll take {flavor} if you have it
just {flavor} is fine
can i get the {flavor}
{flavor} flavor please
```

and then creating an entity model like `flavor.entity`:

```
original
spicy
truffle
not4gma
insane
punjabi
garlic
teriyaki
```
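
As an illustration of how a template line with a `{flavor}` placeholder can be turned into a pattern that captures the slot, here is a small sketch. It is not Padatious's own matching code; `template_to_regex` is a hypothetical helper built on Python's `re` module, and it captures whatever text sits in the slot without checking it against the entity file.

```python
import re


def template_to_regex(template):
    """Compile an intent template such as 'gimme the {flavor}' into a regex."""
    # re.split with a capture group alternates literal text and slot names
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)        # literal text
        else:
            pattern += f"(?P<{part}>.+?)"     # named slot
    return re.compile(f"^{pattern}$", re.IGNORECASE)


regex = template_to_regex("i want that {flavor} jerky")
match = regex.match("i want that spicy jerky")
print(match.groupdict() if match else None)
```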

This would be a great advantage for bots that need to fill several slots before taking an action. For example, jerkybot needs to make sure it knows the flavor, quantity and size of jerky you're ordering as well as the address to send it to.
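
The bookkeeping for such a bot can be sketched independently of the parser. In the example below the slot names come from the paragraph above, while `REQUIRED_SLOTS` and `update_order` are hypothetical names, and the `matches` dicts are written by hand in the shape Padatious prints.

```python
REQUIRED_SLOTS = ("flavor", "quantity", "size", "address")


def update_order(order, matches):
    """Merge newly extracted entities into the order; return slots still missing."""
    for slot, value in matches.items():
        if slot in REQUIRED_SLOTS:
            order[slot] = value
    return [slot for slot in REQUIRED_SLOTS if slot not in order]


order = {}
print(update_order(order, {"flavor": "punjabi", "quantity": "2"}))
print(update_order(order, {"address": "420 69th street"}))
```

The bot would keep prompting for whatever the function returns and place the order once the list is empty.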

I prototyped the NLU data in the `jerkybot-data` folder. Let's see how well it works:

In [62]:
from glob import glob
from os.path import basename

from padatious import IntentContainer

container = IntentContainer('intent_cache')

for file_name in glob('jerkybot-data/*.intent'):
    name = basename(file_name).replace('.intent', '')
    container.load_file(name, file_name, reload_cache=True)

for file_name in glob('jerkybot-data/*.entity'):
    name = basename(file_name).replace('.entity', '')
    container.load_entity(name, file_name, reload_cache=True)

container.train()

def jerkybot(query):
    data = container.calc_intent(query)
    print(data)
    for key, val in data.matches.items():
        print('\t' + key + ': ' + val)

Regenerated {quantity}.
Regenerated {flavor}.
Regenerated {size}.
Regenerated {address}.
Regenerated affirm.
Regenerated greet.
Regenerated goodbye.
Regenerated deny.
Regenerated choose_quantity.
Regenerated choose_flavor.
Regenerated botchallenge.
Regenerated stop.
Regenerated choose_size.
Regenerated choose_address.
Regenerated order.



In [63]:
jerkybot('i would like some jerky please')
jerkybot('i want insane flavor jerky')
jerkybot('my address is 420 69th street')
jerkybot('my address is 1600 Pennsylvania Ave, Washington DC')
jerkybot('420 69th St is where I live')
jerkybot('i need 2 bags of the punjabi jerky')
jerkybot('can i have the 8 oz size')


{'name': 'choose_address', 'sent': '{address}', 'matches': {'address': 'i would like some jerky please'}, 'conf': 0.695333292903387}
	address: i would like some jerky please
{'name': 'choose_flavor', 'sent': '{flavor} flavor jerky', 'matches': {'flavor': 'i want insane'}, 'conf': 0.7503358127540253}
	flavor: i want insane
{'name': 'choose_address', 'sent': ['my', 'address', 'is', '420', '69', 'th', 'street'], 'matches': {'address': '420 69th street'}, 'conf': 1.0}
	address: 420 69th street
{'name': 'choose_address', 'sent': 'my address is {address}', 'matches': {'address': '1600 pennsylvania ave , washington dc'}, 'conf': 0.7909294250121552}
	address: 1600 pennsylvania ave , washington dc
{'name': 'choose_address', 'sent': '{address}', 'matches': {'address': '420 69 th st is where i live'}, 'conf': 0.7734951919196387}
	address: 420 69 th st is where i live
{'name': 'choose_address', 'sent': '{address}', 'matches': {'address': 'i need 2 bags of the punjabi jerky'}, 'conf': 0.68122626611

This slot-filling behavior might be useful to us! Unfortunately it's not very reliable and, again, requires a good deal of forethought on how the user will phrase their questions.
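
One way to soften the unreliability is to check each extracted span against the known entity values before trusting it. A minimal sketch, assuming the flavor list from `flavor.entity` above; `clean_flavor` is a hypothetical helper and only handles single-word flavors.

```python
FLAVORS = ["original", "spicy", "truffle", "not4gma",
           "insane", "punjabi", "garlic", "teriyaki"]


def clean_flavor(raw, flavors=FLAVORS):
    """Return the known flavor found in the extracted span, or None."""
    words = raw.lower().split()
    for flavor in flavors:
        if flavor in words:
            return flavor
    return None


print(clean_flavor("i want insane"))
print(clean_flavor("i would like some jerky please"))
```

A `None` result is a signal to ask the user again instead of acting on a bad slot value.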


## Conclusions

Padatious is more predictable than Rasa NLU pipelines, and requires much less in the way of setup and dependencies. 

It provides fuzzy matching across the command synonyms we can think of or programmatically predict. However, it does not do well with typos: string_dist would do much better at recognizing "helo" and "hello" as the same word, although it would then struggle to keep "help" distinct. It also doesn't work well with out-of-domain data. Unlike word2vec or other embeddings, it does not contain world knowledge about which words are often used together.
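
The typo trade-off can be seen with a plain string-similarity score. This sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for string_dist, so the exact numbers will differ from whatever string_dist reports.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


for candidate in ("hello", "help", "goodbye"):
    print(candidate, round(similarity("helo", candidate), 2))
```

"hello" scores highest for the typo "helo", but "help" is not far behind, which is the ambiguity described above.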

Padatious also has slot-filling capabilities, which may be more useful to us than simple intent matching. Unfortunately this requires even more developer attention up front, and would work a lot better if conversations with users were logged to provide more NLU data, which is against our rough plans for private communication with bots.

In both cases, we would need a feedback loop between the bot and the user to check if we have received the correct intent and/or entities. If we're already implementing this sort of validation logic, perhaps we should use a more deterministic command-line style model, and put more effort into exposing that predictable behavior to the user.
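
A rough sketch of what a command-line style model could look like, using Python's standard `argparse` and `shlex`. The command set (`ping`, `uptime`, `pay`) is borrowed from the examples earlier; the `pay` arguments, `BotParser`, `CommandError`, `build_parser` and `parse_message` are all hypothetical. The parser subclass raises instead of exiting so the bot can send the usage text back to the user.

```python
import argparse
import shlex


class CommandError(Exception):
    """Raised instead of exiting when a message is not a valid command."""


class BotParser(argparse.ArgumentParser):
    def error(self, message):
        # argparse normally prints and exits; a bot should reply instead
        raise CommandError(f"{message}\n{self.format_usage().strip()}")


def build_parser():
    parser = BotParser(prog="bot", add_help=False)
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("ping", add_help=False)
    sub.add_parser("uptime", add_help=False)
    pay = sub.add_parser("pay", add_help=False)
    pay.add_argument("amount", type=float)   # assumed argument
    pay.add_argument("recipient")            # assumed argument
    return parser


def parse_message(parser, text):
    """Parse a chat message into a dict, or return an error to show the user."""
    try:
        return vars(parser.parse_args(shlex.split(text)))
    except (CommandError, ValueError) as exc:  # ValueError: unbalanced quotes
        return {"command": None, "error": str(exc)}


parser = build_parser()
print(parse_message(parser, "pay 5 alice"))
print(parse_message(parser, "hello"))
```

Anything that fails to parse comes back with an error message and usage string, which gives the user the predictable feedback described above.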

### TL;DR: 

**Padatious is better than Rasa, but it will require a lot of upfront work by devs and followups with users if we want to have useful NLU data for it to draw from. We might be better off using an argparse-style model and communicating to the user what commands and options are available, rather than trying to interpret them from raw natural language.**