From e9f7347a9c3518169d7379e74c3b3869985789f4 Mon Sep 17 00:00:00 2001
From: "Jeroen F.J. Laros"
Date: Sat, 16 Jun 2018 18:54:27 +0200
Subject: [PATCH] Structured documentation.

---
 README.rst            |  12 ++++
 docs/credits.rst      |  10 +++
 docs/index.rst        |  11 +++
 docs/installation.rst |  25 +++++++
 docs/usage.rst        | 164 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 222 insertions(+)
 create mode 100644 README.rst
 create mode 100644 docs/credits.rst
 create mode 100644 docs/index.rst
 create mode 100644 docs/installation.rst
 create mode 100644 docs/usage.rst

diff --git a/README.rst b/README.rst
new file mode 100644
index 0000000..93ebed9
--- /dev/null
+++ b/README.rst
@@ -0,0 +1,12 @@
+Trie implementation using nested dictionaries
+=============================================
+
+This library provides a trie_ implementation using nested dictionaries. Apart
+from the basic operations, a number of functions for *approximate matching* are
+implemented.
+
+Please see ReadTheDocs_ for the latest documentation.
+
+
+.. _trie: https://en.wikipedia.org/wiki/Trie
+.. _ReadTheDocs: http://dict-trie.readthedocs.io/en/latest/index.html
diff --git a/docs/credits.rst b/docs/credits.rst
new file mode 100644
index 0000000..48c2430
--- /dev/null
+++ b/docs/credits.rst
@@ -0,0 +1,10 @@
+Contributors
+============
+
+- Jeroen F.J. Laros (Original author, maintainer)
+
+Find out who contributed:
+
+::
+
+    git shortlog -s -e
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 0000000..c0379f9
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,11 @@
+.. doc_test documentation.
+
+.. include:: ../README.rst
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   installation
+   usage
+   credits
diff --git a/docs/installation.rst b/docs/installation.rst
new file mode 100644
index 0000000..0c2586f
--- /dev/null
+++ b/docs/installation.rst
@@ -0,0 +1,25 @@
+Installation
+============
+
+The software is distributed via PyPI_ and can be installed with ``pip``:
+
+::
+
+    pip install dict-trie
+
+
+From source
+-----------
+
+The source is hosted on GitHub_. To install the latest development version, use
+the following commands.
+
+::
+
+    git clone https://github.com/jfjlaros/dict-trie.git
+    cd dict-trie
+    pip install .
+
+
+.. _PyPI: https://pypi.org/project/dict-trie
+.. _GitHub: https://github.com/jfjlaros/dict-trie.git
diff --git a/docs/usage.rst b/docs/usage.rst
new file mode 100644
index 0000000..e1d700f
--- /dev/null
+++ b/docs/usage.rst
@@ -0,0 +1,164 @@
+Usage
+=====
+
+The library provides the ``Trie`` class.
+
+
+Basic operations
+----------------
+
+Initialisation of the trie is done via the constructor by providing a list of
+words.
+
+.. code:: python
+
+    >>> from dict_trie import Trie
+    >>>
+    >>> trie = Trie(['abc', 'te', 'test'])
+
+Alternatively, an empty trie can be made to which words can be added with the
+``add`` function.
+
+.. code:: python
+
+    >>> trie = Trie()
+    >>> trie.add('abc')
+    >>> trie.add('te')
+    >>> trie.add('test')
+
+Membership can be tested with the ``in`` operator.
+
+.. code:: python
+
+    >>> 'abc' in trie
+    True
+
+Test whether a prefix is present by using the ``has_prefix`` function.
+
+.. code:: python
+
+    >>> trie.has_prefix('ab')
+    True
+
+Remove a word from the trie with the ``remove`` function. This function returns
+``False`` if the word was not in the trie.
+
+.. code:: python
+
+    >>> trie.remove('abc')
+    True
+    >>> 'abc' in trie
+    False
+    >>> trie.remove('abc')
+    False
+
+Iterate over all words in a trie.
+
+.. code:: python
+
+    >>> list(trie)
+    ['te', 'test']
+
+
+Approximate matching
+--------------------
+
+A trie can be used to efficiently find a word that is similar to a query word.
+This is implemented via a number of functions that search for a word, allowing
+a given number of mismatches. These functions are divided into two families, one
+using the Hamming distance which only allows substitutions, the other using the
+Levenshtein distance which allows substitutions, insertions and deletions.
+
+To find a word that has at most Hamming distance 2 to the word 'abe', the
+``hamming`` function is used.
+
+.. code:: python
+
+    >>> trie = Trie(['abc', 'aaa', 'ccc'])
+    >>> trie.hamming('abe', 2)
+    'aaa'
+
+To get all words that have at most Hamming distance 2 to the word 'abe', the
+``all_hamming`` function is used. This function returns a generator.
+
+.. code:: python
+
+    >>> list(trie.all_hamming('abe', 2))
+    ['aaa', 'abc']
+
+In order to find a word that is closest to the query word, the ``best_hamming``
+function is used. In this case a word with distance 1 is returned.
+
+.. code:: python
+
+    >>> trie.best_hamming('abe', 2)
+    'abc'
+
+The functions ``levenshtein``, ``all_levenshtein`` and ``best_levenshtein`` are
+used in a similar way.
+
+
+Other functionalities
+---------------------
+
+A trie can be populated with all words of a fixed length over an alphabet by
+using the ``fill`` function.
+
+.. code:: python
+
+    >>> trie = Trie()
+    >>> trie.fill(('a', 'b'), 2)
+    >>> list(trie)
+    ['aa', 'ab', 'ba', 'bb']
+
+The trie data structure can be accessed via the ``root`` member variable.
+
+.. code:: python
+
+    >>> trie.root
+    {'a': {'a': {'': 1}, 'b': {'': 1}}, 'b': {'a': {'': 1}, 'b': {'': 1}}}
+    >>> trie.root.keys()
+    ['a', 'b']
+
+The distance functions ``all_hamming`` and ``all_levenshtein`` also have
+counterparts that give the developer more information by returning a list of
+tuples containing not only the matched word, but also its distance to the query
+string and a CIGAR_-like string.
+
+The following encoding is used in the CIGAR-like string:
+
++-------------+---------------+
+| character   | description   |
++-------------+---------------+
+| =           | match         |
++-------------+---------------+
+| X           | mismatch      |
++-------------+---------------+
+| I           | insertion     |
++-------------+---------------+
+| D           | deletion      |
++-------------+---------------+
+
+In the following example, we search for all words with Hamming distance 1 to
+the word 'acc'. In the results we see a match with the word 'abc' having
+distance 1 and a mismatch at position 2.
+
+.. code:: python
+
+    >>> trie = Trie(['abc'])
+    >>> list(trie.all_hamming_('acc', 1))
+    [('abc', 1, '=X=')]
+
+Similarly, we can search for all words having Levenshtein distance 2 to the
+word 'acb'. The word 'abc' matches three times, once by deleting the 'b' at
+position 2 and inserting a 'b' after position 3, once by inserting a 'c' after
+position 1 and deleting the last character and once by introducing two
+mismatches.
+
+.. code:: python
+
+    >>> list(trie.all_levenshtein_('acb', 2))
+    [('abc', 2, '=D=I'), ('abc', 2, '=XX'), ('abc', 2, '=I=D')]
+
+
+.. _CIGAR: https://samtools.github.io/hts-specs/SAMv1.pdf
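
The nested-dictionary layout shown by ``trie.root`` and the bounded Hamming search described in the usage notes can be illustrated with a small standalone sketch. It is not the dict-trie implementation and does not use the library: ``make_trie``, ``contains`` and ``all_hamming`` are hypothetical helper functions written for illustration only. The sketch assumes Python 3 and that the empty-string key marks the end of a word, as the ``trie.root`` output suggests.

```python
def make_trie(words):
    """Build a nested-dictionary trie (illustrative helper, not the library API).

    Assumption: the empty-string key marks the end of a stored word.
    """
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[''] = 1
    return root


def contains(root, word):
    """Return True if `word` is stored in the trie, False for mere prefixes."""
    node = root
    for char in word:
        if char not in node:
            return False
        node = node[char]
    return '' in node


def all_hamming(node, word, distance, prefix=''):
    """Yield every stored word within `distance` substitutions of `word`.

    The search walks the trie and the query in lockstep and abandons a
    branch as soon as the mismatch budget is exhausted.
    """
    if distance < 0:
        return
    if not word:
        if '' in node:
            yield prefix
        return
    for char in sorted(node):
        if char == '':
            continue  # End-of-word marker, not a real edge.
        cost = 0 if char == word[0] else 1
        yield from all_hamming(
            node[char], word[1:], distance - cost, prefix + char)


if __name__ == '__main__':
    root = make_trie(['abc', 'aaa', 'ccc'])
    print(contains(root, 'abc'))
    print(list(all_hamming(root, 'abe', 2)))
```

Because a branch is dropped once its mismatch count exceeds the allowed distance, the search visits only the part of the trie that can still produce a match, which is the reason a trie suits this kind of approximate lookup.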