mapado/NoAho
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
Non-Overlapping Aho-Corasick Trie
Features:
- 'short' and 'long' (longest matching key) searches, both one-off and
iteration over all non-overlapping keyword matches in some text.
- Works with both unicode and str in Python 2, and unicode in Python 3
(it's all UCS4 under the hood).
- Allows you to associate an arbitrary Python object payload with each
keyword, and supports dict operations len(), [], and 'in' for the
keywords (though no del or traversal).
- Does the 'compilation' (generation of Aho-Corasick failure links) of
the trie on-demand; you can mix adding keywords and searching text
freely.
- Can be used commercially, it's under the minimal, MIT license.
Anti-Features:
- Will not find overlapped keywords (eg given keywords "abcde" and
'defgh", will not find "defgh" in "abcdefgh"; would find both in
"abcdedefgh"), unless you move along the string manually, one
character at a time, which would defeat the purpose. The package
'Acora' is an alternative package for this use.
- Lacking overlap, find[all]_short is kind of useless.
- Lacks key iteration and deletion from the mapping (dict) protocol
- Memory leaking untested (should be ok but ...)
- No /testcase/ for unicode in Python 2 (did manual test however)
- Unicode chars represented as ucs4, and, each character has its own
hashtable, so it's relatively memory-heavy.
- Requires a C++ compiler.
Bug reports and patches welcome of course!
To build and install, use either
# Python 2
python setup.py install # (or build, and copy the .so to where you want it)
or
# Python 3
python3 setup.py install # (or build, and copy the .so to where you want it)
API:
from noaho.noaho import NoAho
trie = NoAho()
'text' below applies to str and unicode in Python 2, or unicode in Python 3 (all there is)
trie.add(key_text, optional payload)
(key_start, key_end, key_value) = trie.find_short(text_to_search)
(key_start, key_end, key_value) = trie.find_long(text_to_search)
(key_start, key_end, key_value) = trie.findall_short(text_to_search)
(key_start, key_end, key_value) = trie.findall_long(text_to_search)
# keyword = text_to_search[key_start:key_end]
trie['keyword] = key_value
key_value = trie.find_long(text_to_search)
assert len(trie)
assert keyword in trie
Examples:
>>> a = NoAho()
>>> a.add('ms windows')
>>> a.add('ms windows 2000', "this is canonical")
>>> a.add('windows', None)
>>> a.add('windows 2000')
>>> a['apple'] = None
>>> text = 'windows 2000 ms windows 2000 windows'
>>> for k in a.findall_short(text):
... print text[k[0]:k[1]]
...
windows
ms windows
windows
>>> for k in a.findall_long(text):
... print text[k[0]:k[1]]
...
windows 2000
ms windows 2000
windows
Mapping (dictionary) methods:
trie = NoAho()
trie['apple'] = apple_handling_function
trie['orange'] = Orange()
trie.add('banana') # payload will be None
trie['pear'] # will give key error
assert isinstance(trie['orange'], Orange)
assert 'banana' in trie
len(trie)
# No del;
# no iteration over keys
The 'find[all]_short' forms are named as long and awkwardly as they are,
to leave plain 'find[all]' free if overlapping matches are ever implemented.
Deep Rebuilding:
(Needs C++ and Cython.)
You should not need to use Cython to build and use this module, but if
you want to change it and regenerate, there's a build command line at
the top of noaho.pyx. To allow the generated noaho.cpp file to be
used in both Py2 and Py3, use a Python 3-built Cython. This was built
with Cython-0.15.1, but even as far back as 0.13 should work (for that
version though, you'd have to manually use a c++ compiler to do the
last step, of linking).
For the fullest spec of what the code will and will not do, check out
test-noaho.py (python[3] test-noaho.py)
Untested: whether the payload handling is complete, ie that there are no
memory leaks. It should be correct though.