How to do floor and ceiling lookups? #87
Comments
It's possible. Here's a solution:

```python
import unittest

import bintrees
import sortedcontainers
from bisect import bisect_left, bisect_right


class CustomSortedDict(sortedcontainers.SortedDict):

    def floor_key(self, key):
        """Return greatest key less than or equal to given key.

        Raises KeyError if there is no such key.

        """
        _list = self._list
        _maxes = _list._maxes

        if not _maxes:
            raise KeyError(key)

        # Find the first sublist whose maximum is >= key.
        maxes_index = bisect_left(_maxes, key)
        _lists = _list._lists

        if maxes_index == len(_maxes):
            # key is greater than every stored key.
            sublist = _lists[-1]
            return sublist[-1]

        sublist = _lists[maxes_index]
        sublist_index = bisect_left(sublist, key)

        if sublist[sublist_index] == key:
            return key

        if sublist_index:
            return sublist[sublist_index - 1]

        if maxes_index:
            # The floor is the last key of the previous sublist.
            return _lists[maxes_index - 1][-1]

        raise KeyError(key)

    def ceiling_key(self, key):
        """Return smallest key greater than or equal to given key.

        Raises KeyError if there is no such key.

        """
        _list = self._list
        _maxes = _list._maxes

        if not _maxes:
            raise KeyError(key)

        # Find the first sublist whose maximum is > key.
        maxes_index = bisect_right(_maxes, key)
        _lists = _list._lists

        if maxes_index == len(_maxes):
            if _maxes[-1] == key:
                return key
            raise KeyError(key)

        sublist = _lists[maxes_index]
        sublist_index = bisect_right(sublist, key)

        if sublist_index:
            if sublist[sublist_index - 1] == key:
                return key
            return sublist[sublist_index]

        # key may equal the last key of the previous sublist.
        if maxes_index and _lists[maxes_index - 1][-1] == key:
            return key

        return sublist[0]

    def floor_item(self, key):
        key = self.floor_key(key)
        return key, self[key]

    def ceiling_item(self, key):
        key = self.ceiling_key(key)
        return key, self[key]


class TestCustomSortedDict(unittest.TestCase):
    "Test CustomSortedDict floor and ceiling methods."
    keys = 50
    sublist_size = 7

    def setUp(self):
        "Initialize bintrees.BinaryTree and CustomSortedDict objects."
        self.tree = bintrees.BinaryTree()
        self.csd = CustomSortedDict()
        self.csd._reset(self.sublist_size)  # Stress sublist lookups.
        for key in range(self.keys):
            self.tree[key] = key
            self.csd[key] = key

    def test_floor_key(self):
        for key in range(self.keys):
            self.assertEqual(key, self.tree.floor_key(key))
            self.assertEqual(key, self.tree.floor_key(key + 0.5))
            self.assertEqual(key, self.csd.floor_key(key))
            self.assertEqual(key, self.csd.floor_key(key + 0.5))

    def test_floor_key_keyerror(self):
        with self.assertRaises(KeyError):
            self.tree.floor_key(-0.5)
        with self.assertRaises(KeyError):
            self.csd.floor_key(-0.5)

    def test_floor_key_empty(self):
        self.tree.clear()
        self.csd.clear()
        with self.assertRaises(KeyError):
            self.tree.floor_key(0)
        with self.assertRaises(KeyError):
            self.csd.floor_key(0)

    def test_ceiling_key(self):
        for key in range(self.keys):
            self.assertEqual(key, self.tree.ceiling_key(key))
            self.assertEqual(key, self.tree.ceiling_key(key - 0.5))
            self.assertEqual(key, self.csd.ceiling_key(key))
            self.assertEqual(key, self.csd.ceiling_key(key - 0.5))

    def test_ceiling_key_keyerror(self):
        with self.assertRaises(KeyError):
            self.tree.ceiling_key(49.5)
        with self.assertRaises(KeyError):
            self.csd.ceiling_key(49.5)

    def test_ceiling_key_empty(self):
        self.tree.clear()
        self.csd.clear()
        with self.assertRaises(KeyError):
            self.tree.ceiling_key(0)
        with self.assertRaises(KeyError):
            self.csd.ceiling_key(0)


if __name__ == '__main__':
    unittest.main()
```
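For anyone trying to follow the two-level lookup in `floor_key`, here is a simplified, standard-library-only model of the idea. It is only a sketch: the names `floor_in_chunks`, `lists`, and `maxes` are made up for illustration, and it assumes nothing about the library's internals beyond what the code above already relies on (a list of sorted sublists plus a list of each sublist's maximum).

```python
from bisect import bisect_left, bisect_right


def floor_in_chunks(lists, maxes, key):
    """Return the greatest value <= key stored in sorted chunks.

    `lists` is a list of non-empty sorted sublists whose concatenation is
    sorted; `maxes[i]` is the last element of `lists[i]`.
    Raises KeyError if no value is <= key.
    """
    if not maxes:
        raise KeyError(key)

    # Step 1: bisect the small index to find the first chunk whose max >= key.
    pos = bisect_left(maxes, key)
    if pos == len(maxes):
        # key is larger than everything stored, so the floor is the last value.
        return lists[-1][-1]

    # Step 2: bisect inside that one chunk.
    chunk = lists[pos]
    index = bisect_right(chunk, key)  # number of values in chunk that are <= key
    if index:
        return chunk[index - 1]

    # Nothing in this chunk is <= key; fall back to the previous chunk's max.
    if pos:
        return lists[pos - 1][-1]

    raise KeyError(key)


lists = [[0, 1, 2], [3, 4, 5], [6, 7]]
maxes = [chunk[-1] for chunk in lists]
print(floor_in_chunks(lists, maxes, 3.5))
print(floor_in_chunks(lists, maxes, 2.5))
```

Each lookup is two binary searches, one over the index of maxima and one inside a single sublist, which is why small sublist sizes (as in the test's `_reset` call) exercise the boundary cases between sublists.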
Do you need the prev/succ functionality of bintrees as well?
That'd definitely be very useful! Let me know if I can help.
prev/succ is pretty similar to floor/ceil. I'll try to get to it this week. I'll probably spell them:
Actually, you can do the same with the existing "irange" method. I've got to run now but I'll try to find some time this week to think about it.
Actually, I now remember I added irange for these use cases. The performance will be excellent if you're calling prev/succ repeatedly. It also covers the floor/ceil cases. Can you look at that method and see if it will work? It's available on both the sorted list and the sorted dict. The performance may not be quite as fast as the code above, but it's better to optimize after profiling.
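A rough sketch of how the floor/ceiling and strict prev/succ-style lookups might be layered on `irange`. The helper names here (`floor_key`, `ceiling_key`, `lower_key`, `higher_key`) are hypothetical and not part of sortedcontainers; the helpers accept any object exposing `irange(minimum, maximum, inclusive, reverse)`, such as a `SortedDict` or `SortedList`.

```python
def floor_key(sorted_keys, key):
    """Greatest key <= key, or KeyError if there is none."""
    # Iterate downward from `key`; the first key yielded is the floor.
    for found in sorted_keys.irange(maximum=key, reverse=True):
        return found
    raise KeyError(key)


def ceiling_key(sorted_keys, key):
    """Smallest key >= key, or KeyError if there is none."""
    for found in sorted_keys.irange(minimum=key):
        return found
    raise KeyError(key)


def lower_key(sorted_keys, key):
    """Greatest key strictly less than key (hypothetical helper)."""
    # inclusive is (minimum_inclusive, maximum_inclusive); exclude `key` itself.
    for found in sorted_keys.irange(
        maximum=key, inclusive=(True, False), reverse=True
    ):
        return found
    raise KeyError(key)


def higher_key(sorted_keys, key):
    """Smallest key strictly greater than key (hypothetical helper)."""
    for found in sorted_keys.irange(minimum=key, inclusive=(False, True)):
        return found
    raise KeyError(key)
```

For instance, with `sd = sortedcontainers.SortedDict({10: 'a', 20: 'b'})`, `floor_key(sd, 15)` should give `10` and `ceiling_key(sd, 15)` should give `20`. Because `irange` returns an iterator, repeated prev/succ steps can also keep consuming one iterator instead of starting a new lookup each time.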
Awesome, that seems to work! This is the code:
Awesome stuff! It might be worth making some aliases for this functionality (and, like you say, specializing them if performance is an issue)? |
That looks right. Glad it works! I benchmarked the "floor_key" operation with bintrees.BinaryTree, sortedcontainers.SortedDict and CustomSortedDict. Here are the timing results:
The benchmark constructs a sorted dictionary with "size" random float keys and then looks one up at random. SortedDict is currently between 1-10x slower than bintrees.BinaryTree. The CustomSortedDict is 1-3x faster. Here's the code:

```python
from __future__ import print_function

import random
import timeit

import bintrees
import sortedcontainers
from test import CustomSortedDict


def median(values):
    return sorted(values)[len(values) // 2]


def run(statement, variable):
    statement = statement.format(variable=variable)
    setup = 'from __main__ import {variable}'.format(variable=variable)
    times = timeit.repeat(statement, setup, repeat=5, number=1000)
    # Median over the repeats, divided by the number of calls per repeat.
    return median(times) / 1000


# Module-level globals so that timeit's setup can import them from __main__.
tree = None
sd = None
csd = None


def benchmark(size):
    global tree, sd, csd

    random.seed(0)

    tree = bintrees.BinaryTree()
    sd = sortedcontainers.SortedDict()
    csd = CustomSortedDict()

    keys = [random.random() for num in range(size)]

    for num, key in enumerate(keys):
        tree[key] = num
        sd[key] = num
        csd[key] = num

    # Look up a value slightly above an existing key.
    key = keys[0] + 0.001

    tree_time = run('{variable}.floor_key(%r)' % key, 'tree')
    statement = 'next({variable}.irange(maximum=%r, reverse=True))' % key
    sd_time = run(statement, 'sd')
    csd_time = run('{variable}.floor_key(%r)' % key, 'csd')

    print(
        size,
        format(tree_time, '.4g'),
        format(sd_time, '.4g'),
        format(csd_time, '.4g'),
    )


if __name__ == '__main__':
    print(
        'size',
        'bintrees.BinaryTree',
        'sortedcontainers.SortedDict',
        'CustomSortedDict',
    )

    for exp in range(1, 7):
        benchmark(10 ** exp)
```
Interesting... It's weird that the runtime doesn't seem to get progressively slower with a bigger list size. Are you running on pypy? Would you recommend that we adapt something like the CustomSortedDict, or do you think that functionality should make it into sortedcontainers?
It's not that strange if you look at how "SortedList._islice" is implemented. The iterator returned is optimized for iterating. It's not optimized for returning the first element as quickly as possible. That can/should be changed. I would adapt to irange and expect that in the next week/month, it will become faster for your use-case :) Only after profiling would I use the specialized CustomSortedDict. Couple notes for myself:
New changes at 6321f22. Here are the benchmark results:
The irange method is now 1-2x slower than the floor/ceiling functions but maintains its fast iteration. It also has more consistent performance because it "lazily" slices the lists in C-code. I think this is "fast-enough" for now. If you need more performance then I'd suggest the CustomSortedDict changes. I will plan to deploy these improvements in the next week or so.
Released as v1.5.10 on pypi.org
With bintrees being discontinued, and projects using it being urged to migrate to sortedcontainers, I have a feature question. In bintrees, you can look up the "floor" and "ceiling" items given a key that is either too big or too small. For example:
As far as I can tell, this isn't possible in sortedcontainers. Am I missing anything? If not, is it a planned feature to add? Alternatively, do you have any pointers for a direction toward implementing it?
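To pin down the semantics being asked about, here is a small standard-library sketch on a plain sorted list. It is illustrative only, not bintrees or sortedcontainers code; the function names simply mirror the "floor" and "ceiling" terms used above.

```python
from bisect import bisect_left, bisect_right


def floor_key(sorted_keys, key):
    """Return the greatest element <= key; raise KeyError if there is none."""
    index = bisect_right(sorted_keys, key)  # number of elements <= key
    if index == 0:
        raise KeyError(key)
    return sorted_keys[index - 1]


def ceiling_key(sorted_keys, key):
    """Return the smallest element >= key; raise KeyError if there is none."""
    index = bisect_left(sorted_keys, key)  # number of elements < key
    if index == len(sorted_keys):
        raise KeyError(key)
    return sorted_keys[index]


keys = [10, 20, 30]
print(floor_key(keys, 25))    # 20: the nearest key at or below 25
print(ceiling_key(keys, 25))  # 30: the nearest key at or above 25
```

A key that is present is its own floor and ceiling, and a lookup below the smallest key (for floor) or above the largest key (for ceiling) has no answer, which this sketch signals with `KeyError`.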