Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to do floor and ceiling lookups? #87

Closed
zardus opened this issue Apr 16, 2018 · 11 comments
Closed

How to do floor and ceiling lookups? #87

zardus opened this issue Apr 16, 2018 · 11 comments

Comments

@zardus
Copy link

zardus commented Apr 16, 2018

With bintrees being discontinued, and projects using it being urged to migrate to sortedcontainers, I have a feature question. In bintrees, you can look up the "floor" and "ceiling" items given a key that is either too big or too small. For example:

import bintrees
x = bintrees.BinaryTree()
x[10] = 'ten'
assert x.floor_item(15) == (10, 'ten')
assert x.ceiling_item(5) == (10, 'ten')

As far as I can tell, this isn't possible in sortedcontainers. Am I missing anything? If not, is it a planned feature to add? Alternatively, do you have any pointers for a direction toward implementing it?

@grantjenks
Copy link
Owner

It's possible using bisect_left and bisect_right in the current API though the semantics are a bit different than what you're wanting. We can also improve the performance by avoiding the positional index. I had not planned to add the feature but I'm not opposed to it. You're just the first to ask for it :)

Here's a solution:

import unittest
import bintrees
import sortedcontainers

from bisect import bisect_left, bisect_right


class CustomSortedDict(sortedcontainers.SortedDict):
    def floor_key(self, key):
        """Return greatest key less than or equal to given key.

        Raises KeyError if there is no such key.

        """
        _list = self._list
        _maxes = _list._maxes

        if not _maxes:
            raise KeyError(key)

        maxes_index = bisect_left(_maxes, key)
        _lists = _list._lists

        if maxes_index == len(_maxes):
            sublist = _lists[-1]
            return sublist[-1]

        sublist = _lists[maxes_index]
        sublist_index = bisect_left(sublist, key)

        if sublist[sublist_index] == key:
            return key

        if sublist_index:
            return sublist[sublist_index - 1]

        if maxes_index:
            return _lists[maxes_index - 1][-1]

        raise KeyError(key)


    def ceiling_key(self, key):
        """Return smallest key greater than or equal to given key.

        Raises KeyError if there is no such key.

        """
        _list = self._list
        _maxes = _list._maxes

        if not _maxes:
            raise KeyError(key)

        maxes_index = bisect_right(_maxes, key)
        _lists = _list._lists

        if maxes_index == len(_maxes):
            if _maxes[-1] == key:
                return key
            raise KeyError(key)

        sublist = _lists[maxes_index]
        sublist_index = bisect_right(sublist, key)

        if sublist_index:
            if sublist[sublist_index - 1] == key:
                return key
            return sublist[sublist_index]

        if maxes_index and _lists[maxes_index - 1][-1] == key:
            return key

        return sublist[0]


    def floor_item(self, key):
        key = self.floor_key(key)
        return key, self[key]


    def ceiling_item(self, key):
        key = self.ceiling_key(key)
        return key, self[key]


class TestCustomSortedDict(unittest.TestCase):
    "Test CustomSortedDict floor and ceiling methods."
    keys = 50
    sublist_size = 7


    def setUp(self):
        "Initialize bintrees.BinaryTree and CustomSortedDict objects."
        self.tree = bintrees.BinaryTree()
        self.csd = CustomSortedDict()
        self.csd._reset(self.sublist_size)  # Stress sublist lookups.
        for key in range(self.keys):
            self.tree[key] = key
            self.csd[key] = key


    def test_floor_key(self):
        for key in range(self.keys):
            self.assertEqual(key, self.tree.floor_key(key))
            self.assertEqual(key, self.tree.floor_key(key + 0.5))
            self.assertEqual(key, self.csd.floor_key(key))
            self.assertEqual(key, self.csd.floor_key(key + 0.5))


    def test_floor_key_keyerror(self):
        with self.assertRaises(KeyError):
            self.tree.floor_key(-0.5)
        with self.assertRaises(KeyError):
            self.csd.floor_key(-0.5)


    def test_floor_key_empty(self):
        self.tree.clear()
        self.csd.clear()
        with self.assertRaises(KeyError):
            self.tree.floor_key(0)
        with self.assertRaises(KeyError):
            self.csd.floor_key(0)


    def test_ceiling_key(self):
        for key in range(self.keys):
            self.assertEqual(key, self.tree.ceiling_key(key))
            self.assertEqual(key, self.tree.ceiling_key(key - 0.5))
            self.assertEqual(key, self.csd.ceiling_key(key))
            self.assertEqual(key, self.csd.ceiling_key(key - 0.5))


    def test_ceiling_key_keyerror(self):
        with self.assertRaises(KeyError):
            self.tree.ceiling_key(49.5)
        with self.assertRaises(KeyError):
            self.csd.ceiling_key(49.5)


    def test_ceiling_key_empty(self):
        self.tree.clear()
        self.csd.clear()
        with self.assertRaises(KeyError):
            self.tree.ceiling_key(0)
        with self.assertRaises(KeyError):
            self.csd.ceiling_key(0)


if __name__ == '__main__':
    unittest.main()

@grantjenks
Copy link
Owner

Do you need the prev/succ functionality of bintrees as well?

@zardus
Copy link
Author

zardus commented Apr 16, 2018

That'd definitely be very useful! Let me know if I can help.

@grantjenks
Copy link
Owner

prev/succ is pretty similar to floor/ceil. I'll try to get to it this week. I'll probably spell them:

  • SortedList.floor(value)
  • SortedList.ceil(value)
  • SortedListWithKey.floor_key(key)
  • SortedListWithKey.ceil_key(key)

Actually, you can do the same with the existing "irange" method.

I've got to run now but I'll try to find some time this week to think about it.

@grantjenks
Copy link
Owner

Actually, I now remember I added irange for these use cases. The performance will be excellent if you’re calling prev/succ repeatedly. It also covers the floor/ceil cases. Can you look at that method and see if will work? It’s on sorted list and sorted dict. The performance may not be quite so fast as above but better to optimize after profiling.

@zardus
Copy link
Author

zardus commented Apr 16, 2018

Awesome, that seems to work! This is the code:

import sortedcontainers
s = sortedcontainers.SortedDict()
s[10] = 'ten'
s[3] = 'three'

# floor
assert next(s.irange(maximum=15, reverse=True)) == 10

# ceiling
assert next(s.irange(minimum=5, reverse=False)) == 10

# succ
assert next(s.irange(minimum=3, inclusive=(False, False), reverse=False)) == 10

# prev
assert next(s.irange(maximum=10, inclusive=(False, False), reverse=True)) == 3

Awesome stuff! It might be worth making some aliases for this functionality (and, like you say, specializing them if performance is an issue)?

@grantjenks
Copy link
Owner

That looks right. Glad it works!

I benchmarked the "floor_key" operation with bintrees.BinaryTree, sortedcontainers.SortedDict and CustomSortedDict. Here's the timing results:

size bintrees.BinaryTree sortedcontainers.SortedDict CustomSortedDict
10 6.771e-07 1.245e-06 4.63e-07
100 1.372e-06 1.629e-06 7.93e-07
1000 1.98e-06 3.528e-06 1.171e-06
10000 2.06e-06 1.093e-05 8.759e-07
100000 2.96e-06 9.562e-06 9.089e-07
1000000 3.01e-06 2.979e-05 1.004e-06

The benchmark constructs a sorted dictionary with "size" random float keys and then looks one up at random. SortedDict is currently between 1-10x slower than bintrees.BinaryTree. The CustomSortedDict is 1-3x faster.

Here's the code:

from __future__ import print_function

import random
import timeit
import bintrees
import sortedcontainers

from test import CustomSortedDict


def median(values):
    return sorted(values)[len(values) // 2]


def run(statement, variable):
    statement = statement.format(variable=variable)
    setup = 'from __main__ import {variable}'.format(variable=variable)
    times = timeit.repeat(statement, setup, repeat=5, number=1000)
    return median(times) / 1000


tree = None
sd = None
csd = None


def benchmark(size):
    global tree, sd, csd

    random.seed(0)
    tree = bintrees.BinaryTree()
    sd = sortedcontainers.SortedDict()
    csd = CustomSortedDict()

    keys = [random.random() for num in range(size)]

    for num, key in enumerate(keys):
        tree[key] = num
        sd[key] = num
        csd[key] = num

    key = keys[0] + 0.001
    tree_time = run('{variable}.floor_key(%r)' % key, 'tree')
    statement = 'next({variable}.irange(maximum=%r, reverse=True))' % key
    sd_time = run(statement, 'sd')
    csd_time = run('{variable}.floor_key(%r)' % key, 'csd')

    print(
        size,
        format(tree_time, '.4g'),
        format(sd_time, '.4g'),
        format(csd_time, '.4g'),
    )


if __name__ == '__main__':
    print(
        'size',
        'bintrees.BinaryTree',
        'sortedcontainers.SortedDict',
        'CustomSortedDict',
    )
    for exp in range(1, 7):
        benchmark(10 ** exp)

@zardus
Copy link
Author

zardus commented Apr 17, 2018

Interesting... It's weird that the runtime doesn't seem to get progressively slower with a bigger list size. Are you running on pypy?

Would you recommend that we adapt something like the CustomSortedDict, or do you think that functionality should make it into sortedcontainer?

@grantjenks
Copy link
Owner

grantjenks commented Apr 18, 2018

It's not that strange if you look at how "SortedList._islice" is implemented. The iterator returned is optimized for iterating. It's not optimized for returning the first element as quickly as possible. That can/should be changed.

I would adapt to irange and expect that in the next week/month, it will become faster for your use-case :)

Only after profiling would I use the specialized CustomSortedDict.

Couple notes for myself:

  1. Generator solution is more expensive to iterate but faster to get first element.
  2. Lazy list slicing in C-code: map(sublist.getitem, range(5, 10))
  3. Maybe a hybrid approach between the generator/c-implementation is possible? When we can move to Python 3, "yield from" will be useful.

@grantjenks grantjenks changed the title Capability to do floor and ceiling lookups? How to do floor and ceiling lookups? Apr 18, 2018
@grantjenks
Copy link
Owner

New changes at 6321f22. Here's the benchmark results:

size bintrees.BinaryTree sortedcontainers.SortedDict CustomSortedDict
10 5.678e-07 1.714e-06 5.401e-07
100 1.206e-06 1.85e-06 8.721e-07
1000 1.613e-06 1.998e-06 9.54e-07
10000 1.714e-06 3.684e-06 9.983e-07
100000 2.316e-06 3.919e-06 1.001e-06
1000000 2.439e-06 4.214e-06 1.164e-06

The irange method is now 1-2x slower than the floor/ceiling functions but maintains its fast iteration. It also has more consistent performance because it "lazily" slices the lists in C-code.

I think this is "fast-enough" for now. If you need more performance then I'd suggest the CustomSortedDict changes.

I will plan to deploy these improvements in the next week or so.

@grantjenks
Copy link
Owner

Released at v1.5.10 to pypi.org

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants