Skip to content
This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

Include __version__ attribute in main module #28

Open
Erotemic opened this issue Jun 18, 2019 · 7 comments
Open

Include __version__ attribute in main module #28

Erotemic opened this issue Jun 18, 2019 · 7 comments

Comments

@Erotemic
Copy link

It would be helpful if the main module has a __version__ attribute like most modules. This might be tricky to reconcile with the current way of versioning this lib (i.e. via version.py), but I think it would be an overall improvement.

@mina86
Copy link
Contributor

mina86 commented Jul 22, 2019

Do you have particular use case that you're needing this for? In general I recommend tisting if a feature exist rather than parsing __version__ though I'm not opposed to adding that variable.

version.py could be replaced with a release script which in addition to picking up version from git could also do all the other boring stuff (run all tests, commit version-history.rst changes, tag commit, sdist and upload).

@Erotemic
Copy link
Author

A very common way to determine the version of a package you have is: python -c "import pkgname; print(pkgname.__version__)". Most packages include a __version__ variable, so this usually works.

I find testing for a feature to be insufficient for two reasons. First, it makes it difficult to help someone who is installing this software as a dependency and might not be familiar with the features of the library, whereas asking them to check a version number of relatively simple. Second, sometimes new features don't expose themselves as top-level functions, which may make it more complicated and package specific to test the version.

I agree that changing version.py to a release script (I usually call mine publish.py), would be an improvement.

@Erotemic
Copy link
Author

Erotemic commented Aug 8, 2019

Anther use case for __version__ (aside from the fact that its a defacto standard) is dynamically checking versions at runtime. The most recent update to 2.3.2 broke my code. Instead of simply being able to blindly do:

from distutils.version import LooseVersion
import pygtrie
if LooseVersion(pygtrie.__version__) >= LooseVersion('2.3.2'):
    do_new_thing()
else:
    do_old_thing()

I had to search for an attribute that the new version had and the old one didn't and then use a hasattr test, which doesn't give the code any readability when it comes to what condition corresponds to the new version and which corresponds to the old.

@mina86
Copy link
Contributor

mina86 commented Aug 8, 2019 via email

@Erotemic
Copy link
Author

Erotemic commented Aug 9, 2019

Awesome, I'm glad to hear __version__ is coming.

The breakage was due to the non-standard way I was using the library and not due to any mistakes in the update. I was using non-public variables to gain access to the raw node data structure. The change from using dictionaries to using the _OneChild / _NoChildren / _Children objects was causing me issues because the .values method didn't exist anymore.

Here is the code in question:

def _trie_iternodes(self):
    """
    Generates all nodes in the trie

    # Hack into the internal structure and insert frequencies at each node
    """
    from collections import deque
    stack = deque([[self._root]])
    while stack:
        for node in stack.pop():
            yield node
            try:
                # only works in pygtrie-2.2 broken in pygtrie-2.3.2
                stack.append(node.children.values())
            except AttributeError:
                stack.append([v for k, v in node.children.iteritems()])

And I only use this to hack the values of all nodes to zero:

    # Set the value (frequency) of all nodes to zero.
    for node in _trie_iternodes(trie):
        node.value = 0

For the full context, this is used to implement the shortest_unique_prefix algorithm.

def shortest_unique_prefixes(items, sep=None, allow_simple=True, min_length=0, allow_end=False):
    r"""
    The shortest unique prefix algorithm.

    Args:
        items (list of str): returned prefixes will be unique wrt this set
        sep (str): if specified, all characters between separators are treated
            as a single symbol. Makes the algo much faster.
        allow_simple (bool): if True tries to construct a simple feasible
            solution before resorting to the optimal trie algorithm.
        min_length (int): minimum length each prefix can be
        allow_end (bool): if True allows for string terminators to be
            considered in the prefix

    Returns:
        list of str: a prefix for each item that uniquely identifies it
           wrt to the original items.

    References:
        http://www.geeksforgeeks.org/find-all-shortest-unique-prefixes-to-represent-each-word-in-a-given-list/
        https://github.com/Briaares/InterviewBit/blob/master/Level6/Shortest%20Unique%20Prefix.cpp

    Requires:
        pip install pygtrie

    Example:
        >>> # xdoctest: +REQUIRES(--pygtrie)
        >>> items = ["zebra", "dog", "duck", "dove"]
        >>> shortest_unique_prefixes(items)
        ['z', 'dog', 'du', 'dov']

    Timeing:
        >>> # DISABLE_DOCTEST
        >>> # make numbers larger to stress test
        >>> # L = max length of a string, N = number of strings,
        >>> # C = smallest gaurenteed common length
        >>> # (the setting N=10000, L=100, C=20 is feasible we are good)
        >>> import ubelt as ub
        >>> import random
        >>> def make_data(N, L, C):
        >>>     rng = random.Random(0)
        >>>     return [''.join(['a' if i < C else chr(rng.randint(97, 122))
        >>>                      for i in range(L)]) for _ in range(N)]
        >>> items = make_data(N=1000, L=10, C=0)
        >>> ub.Timerit(3).call(shortest_unique_prefixes, items).print()
        Timed for: 3 loops, best of 3
            time per loop: best=24.54 ms, mean=24.54 ± 0.0 ms
        >>> items = make_data(N=1000, L=100, C=0)
        >>> ub.Timerit(3).call(shortest_unique_prefixes, items).print()
        Timed for: 3 loops, best of 3
            time per loop: best=155.4 ms, mean=155.4 ± 0.0 ms
        >>> items = make_data(N=1000, L=100, C=70)
        >>> ub.Timerit(3).call(shortest_unique_prefixes, items).print()
        Timed for: 3 loops, best of 3
            time per loop: best=232.8 ms, mean=232.8 ± 0.0 ms
        >>> items = make_data(N=10000, L=250, C=20)
        >>> ub.Timerit(3).call(shortest_unique_prefixes, items).print()
        Timed for: 3 loops, best of 3
            time per loop: best=4.063 s, mean=4.063 ± 0.0 s
    """
    import pygtrie
    if len(set(items)) != len(items):
        raise ValueError('inputs must be unique')

    # construct trie
    if sep is None:
        trie = pygtrie.CharTrie.fromkeys(items, value=0)
    else:
        # In some simple cases we can avoid constructing a trie
        if allow_simple:
            tokens = [item.split(sep) for item in items]
            simple_solution = [t[0] for t in tokens]
            if len(simple_solution) == len(set(simple_solution)):
                return simple_solution
            for i in range(2, 10):
                # print('return simple solution at i = {!r}'.format(i))
                simple_solution = ['-'.join(t[:i]) for t in tokens]
                if len(simple_solution) == len(set(simple_solution)):
                    return simple_solution

        trie = pygtrie.StringTrie.fromkeys(items, value=0, separator=sep)

    # Set the value (frequency) of all nodes to zero.
    for node in _trie_iternodes(trie):
        node.value = 0

    # For each item trace its path and increment frequencies
    for item in items:
        final_node, trace = trie._get_node(item)
        for key, node in trace:
            node.value += 1

    # if not isinstance(node.value, int):
    #     node.value = 0

    # Query for the first prefix with frequency 1 for each item.
    # This is the shortest unique prefix over all items.
    unique = []
    for item in items:
        freq = None
        for prefix, freq in trie.prefixes(item):
            if freq == 1 and len(prefix) >= min_length:
                break
        if not allow_end:
            assert freq == 1, 'item={} has no unique prefix. freq={}'.format(item, freq)
        # print('items = {!r}'.format(items))
        unique.append(prefix)
    return unique

Honestly, I haven't looked at this code in so long, I'm don't remember why access to the raw nodes was necessary. When I took a quick look at the new code, it looked like there might have been a new better way of doing this, but opted for the quick solution rather than spending effort to investigate if there was a new public API that gave me what I needed. If the new update has a better way of doing this, let me know.

@mina86
Copy link
Contributor

mina86 commented Aug 9, 2019 via email

@mina86
Copy link
Contributor

mina86 commented Apr 4, 2020

2.3.3 now includes __version__ and hopefully it’ll stay that way going forward.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants