How to do floor and ceiling lookups? #87

zardus · 2018-04-16T15:34:03Z

With bintrees being discontinued, and projects using it being urged to migrate to sortedcontainers, I have a feature question. In bintrees, you can look up the "floor" and "ceiling" items given a key that is either too big or too small. For example:

import bintrees
x = bintrees.BinaryTree()
x[10] = 'ten'
assert x.floor_item(15) == (10, 'ten')
assert x.ceiling_item(5) == (10, 'ten')

As far as I can tell, this isn't possible in sortedcontainers. Am I missing anything? If not, is it a planned feature to add? Alternatively, do you have any pointers for a direction toward implementing it?

The text was updated successfully, but these errors were encountered:

grantjenks · 2018-04-16T18:55:31Z

It's possible using bisect_left and bisect_right in the current API though the semantics are a bit different than what you're wanting. We can also improve the performance by avoiding the positional index. I had not planned to add the feature but I'm not opposed to it. You're just the first to ask for it :)

Here's a solution:

import unittest
import bintrees
import sortedcontainers

from bisect import bisect_left, bisect_right


class CustomSortedDict(sortedcontainers.SortedDict):
    def floor_key(self, key):
        """Return greatest key less than or equal to given key.

        Raises KeyError if there is no such key.

        """
        _list = self._list
        _maxes = _list._maxes

        if not _maxes:
            raise KeyError(key)

        maxes_index = bisect_left(_maxes, key)
        _lists = _list._lists

        if maxes_index == len(_maxes):
            sublist = _lists[-1]
            return sublist[-1]

        sublist = _lists[maxes_index]
        sublist_index = bisect_left(sublist, key)

        if sublist[sublist_index] == key:
            return key

        if sublist_index:
            return sublist[sublist_index - 1]

        if maxes_index:
            return _lists[maxes_index - 1][-1]

        raise KeyError(key)


    def ceiling_key(self, key):
        """Return smallest key greater than or equal to given key.

        Raises KeyError if there is no such key.

        """
        _list = self._list
        _maxes = _list._maxes

        if not _maxes:
            raise KeyError(key)

        maxes_index = bisect_right(_maxes, key)
        _lists = _list._lists

        if maxes_index == len(_maxes):
            if _maxes[-1] == key:
                return key
            raise KeyError(key)

        sublist = _lists[maxes_index]
        sublist_index = bisect_right(sublist, key)

        if sublist_index:
            if sublist[sublist_index - 1] == key:
                return key
            return sublist[sublist_index]

        if maxes_index and _lists[maxes_index - 1][-1] == key:
            return key

        return sublist[0]


    def floor_item(self, key):
        key = self.floor_key(key)
        return key, self[key]


    def ceiling_item(self, key):
        key = self.ceiling_key(key)
        return key, self[key]


class TestCustomSortedDict(unittest.TestCase):
    "Test CustomSortedDict floor and ceiling methods."
    keys = 50
    sublist_size = 7


    def setUp(self):
        "Initialize bintrees.BinaryTree and CustomSortedDict objects."
        self.tree = bintrees.BinaryTree()
        self.csd = CustomSortedDict()
        self.csd._reset(self.sublist_size)  # Stress sublist lookups.
        for key in range(self.keys):
            self.tree[key] = key
            self.csd[key] = key


    def test_floor_key(self):
        for key in range(self.keys):
            self.assertEqual(key, self.tree.floor_key(key))
            self.assertEqual(key, self.tree.floor_key(key + 0.5))
            self.assertEqual(key, self.csd.floor_key(key))
            self.assertEqual(key, self.csd.floor_key(key + 0.5))


    def test_floor_key_keyerror(self):
        with self.assertRaises(KeyError):
            self.tree.floor_key(-0.5)
        with self.assertRaises(KeyError):
            self.csd.floor_key(-0.5)


    def test_floor_key_empty(self):
        self.tree.clear()
        self.csd.clear()
        with self.assertRaises(KeyError):
            self.tree.floor_key(0)
        with self.assertRaises(KeyError):
            self.csd.floor_key(0)


    def test_ceiling_key(self):
        for key in range(self.keys):
            self.assertEqual(key, self.tree.ceiling_key(key))
            self.assertEqual(key, self.tree.ceiling_key(key - 0.5))
            self.assertEqual(key, self.csd.ceiling_key(key))
            self.assertEqual(key, self.csd.ceiling_key(key - 0.5))


    def test_ceiling_key_keyerror(self):
        with self.assertRaises(KeyError):
            self.tree.ceiling_key(49.5)
        with self.assertRaises(KeyError):
            self.csd.ceiling_key(49.5)


    def test_ceiling_key_empty(self):
        self.tree.clear()
        self.csd.clear()
        with self.assertRaises(KeyError):
            self.tree.ceiling_key(0)
        with self.assertRaises(KeyError):
            self.csd.ceiling_key(0)


if __name__ == '__main__':
    unittest.main()

grantjenks · 2018-04-16T19:16:36Z

Do you need the prev/succ functionality of bintrees as well?

zardus · 2018-04-16T19:31:21Z

That'd definitely be very useful! Let me know if I can help.

grantjenks · 2018-04-16T19:37:31Z

prev/succ is pretty similar to floor/ceil. I'll try to get to it this week. I'll probably spell them:

SortedList.floor(value)
SortedList.ceil(value)
SortedListWithKey.floor_key(key)
SortedListWithKey.ceil_key(key)

Actually, you can do the same with the existing "irange" method.

I've got to run now but I'll try to find some time this week to think about it.

grantjenks · 2018-04-16T20:40:34Z

Actually, I now remember I added irange for these use cases. The performance will be excellent if you’re calling prev/succ repeatedly. It also covers the floor/ceil cases. Can you look at that method and see if will work? It’s on sorted list and sorted dict. The performance may not be quite so fast as above but better to optimize after profiling.

zardus · 2018-04-16T20:54:49Z

Awesome, that seems to work! This is the code:

import sortedcontainers
s = sortedcontainers.SortedDict()
s[10] = 'ten'
s[3] = 'three'

# floor
assert next(s.irange(maximum=15, reverse=True)) == 10

# ceiling
assert next(s.irange(minimum=5, reverse=False)) == 10

# succ
assert next(s.irange(minimum=3, inclusive=(False, False), reverse=False)) == 10

# prev
assert next(s.irange(maximum=10, inclusive=(False, False), reverse=True)) == 3

Awesome stuff! It might be worth making some aliases for this functionality (and, like you say, specializing them if performance is an issue)?

grantjenks · 2018-04-17T00:36:52Z

That looks right. Glad it works!

I benchmarked the "floor_key" operation with bintrees.BinaryTree, sortedcontainers.SortedDict and CustomSortedDict. Here's the timing results:

size	bintrees.BinaryTree	sortedcontainers.SortedDict	CustomSortedDict
10	6.771e-07	1.245e-06	4.63e-07
100	1.372e-06	1.629e-06	7.93e-07
1000	1.98e-06	3.528e-06	1.171e-06
10000	2.06e-06	1.093e-05	8.759e-07
100000	2.96e-06	9.562e-06	9.089e-07
1000000	3.01e-06	2.979e-05	1.004e-06

The benchmark constructs a sorted dictionary with "size" random float keys and then looks one up at random. SortedDict is currently between 1-10x slower than bintrees.BinaryTree. The CustomSortedDict is 1-3x faster.

Here's the code:

from __future__ import print_function

import random
import timeit
import bintrees
import sortedcontainers

from test import CustomSortedDict


def median(values):
    return sorted(values)[len(values) // 2]


def run(statement, variable):
    statement = statement.format(variable=variable)
    setup = 'from __main__ import {variable}'.format(variable=variable)
    times = timeit.repeat(statement, setup, repeat=5, number=1000)
    return median(times) / 1000


tree = None
sd = None
csd = None


def benchmark(size):
    global tree, sd, csd

    random.seed(0)
    tree = bintrees.BinaryTree()
    sd = sortedcontainers.SortedDict()
    csd = CustomSortedDict()

    keys = [random.random() for num in range(size)]

    for num, key in enumerate(keys):
        tree[key] = num
        sd[key] = num
        csd[key] = num

    key = keys[0] + 0.001
    tree_time = run('{variable}.floor_key(%r)' % key, 'tree')
    statement = 'next({variable}.irange(maximum=%r, reverse=True))' % key
    sd_time = run(statement, 'sd')
    csd_time = run('{variable}.floor_key(%r)' % key, 'csd')

    print(
        size,
        format(tree_time, '.4g'),
        format(sd_time, '.4g'),
        format(csd_time, '.4g'),
    )


if __name__ == '__main__':
    print(
        'size',
        'bintrees.BinaryTree',
        'sortedcontainers.SortedDict',
        'CustomSortedDict',
    )
    for exp in range(1, 7):
        benchmark(10 ** exp)

zardus · 2018-04-17T03:44:32Z

Interesting... It's weird that the runtime doesn't seem to get progressively slower with a bigger list size. Are you running on pypy?

Would you recommend that we adapt something like the CustomSortedDict, or do you think that functionality should make it into sortedcontainer?

grantjenks · 2018-04-18T22:13:26Z

It's not that strange if you look at how "SortedList._islice" is implemented. The iterator returned is optimized for iterating. It's not optimized for returning the first element as quickly as possible. That can/should be changed.

I would adapt to irange and expect that in the next week/month, it will become faster for your use-case :)

Only after profiling would I use the specialized CustomSortedDict.

Couple notes for myself:

Generator solution is more expensive to iterate but faster to get first element.
Lazy list slicing in C-code: map(sublist.getitem, range(5, 10))
Maybe a hybrid approach between the generator/c-implementation is possible? When we can move to Python 3, "yield from" will be useful.

grantjenks · 2018-04-20T23:24:59Z

New changes at 6321f22. Here's the benchmark results:

size	bintrees.BinaryTree	sortedcontainers.SortedDict	CustomSortedDict
10	5.678e-07	1.714e-06	5.401e-07
100	1.206e-06	1.85e-06	8.721e-07
1000	1.613e-06	1.998e-06	9.54e-07
10000	1.714e-06	3.684e-06	9.983e-07
100000	2.316e-06	3.919e-06	1.001e-06
1000000	2.439e-06	4.214e-06	1.164e-06

The irange method is now 1-2x slower than the floor/ceiling functions but maintains its fast iteration. It also has more consistent performance because it "lazily" slices the lists in C-code.

I think this is "fast-enough" for now. If you need more performance then I'd suggest the CustomSortedDict changes.

I will plan to deploy these improvements in the next week or so.

grantjenks · 2018-04-27T22:41:45Z

Released at v1.5.10 to pypi.org

zardus mentioned this issue Apr 16, 2018

bintrees is discontinued in favor of sortedcontainers angr/angr#947

Closed

grantjenks added the Feature Request (Needs Votes) label Apr 16, 2018

grantjenks changed the title ~~Capability to do floor and ceiling lookups?~~ How to do floor and ceiling lookups? Apr 18, 2018

grantjenks closed this as completed Apr 27, 2018

981377660LMT mentioned this issue Jun 25, 2023

python SortedList 源码简析 981377660LMT/ts#267

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to do floor and ceiling lookups? #87

How to do floor and ceiling lookups? #87

zardus commented Apr 16, 2018

grantjenks commented Apr 16, 2018

grantjenks commented Apr 16, 2018

zardus commented Apr 16, 2018

grantjenks commented Apr 16, 2018

grantjenks commented Apr 16, 2018

zardus commented Apr 16, 2018

grantjenks commented Apr 17, 2018

zardus commented Apr 17, 2018

grantjenks commented Apr 18, 2018 •

edited

grantjenks commented Apr 20, 2018

grantjenks commented Apr 27, 2018

How to do floor and ceiling lookups? #87

How to do floor and ceiling lookups? #87

Comments

zardus commented Apr 16, 2018

grantjenks commented Apr 16, 2018

grantjenks commented Apr 16, 2018

zardus commented Apr 16, 2018

grantjenks commented Apr 16, 2018

grantjenks commented Apr 16, 2018

zardus commented Apr 16, 2018

grantjenks commented Apr 17, 2018

zardus commented Apr 17, 2018

grantjenks commented Apr 18, 2018 • edited

grantjenks commented Apr 20, 2018

grantjenks commented Apr 27, 2018

grantjenks commented Apr 18, 2018 •

edited