Lazy sorting in python
A common need when working with data, is to find the top 10 in a larger list. An easy solution is to use the built-in sort function to sort the entire list, and then just take the first 10 items. While this works, it feels wasteful. Other than knowing what is the tail, you're not interested in the tail of the list at all.
January 2017, Bora M. Alper posted his take on this problem:
lazysort. Using python generators, he sorted the list using dropsort as needed, returning just the part of the list that was requested. While Bora's idea is quite neat, sadly his benchmark showed that it couldn't actually compete with using python's built-in sort function. This is not very surprising, cpython's
sorted function uses the state-of-the-art TimSort algorithm, and it is implemented in optimized C.
But his approach gave me inspiration. While I feel dropsort is sort of a gimmick, Bora might actually be on to something here. During my time as a game developer, I built an early-out version of quicksort, which returned as soon as the head of the list was sorted. We should be able to build a something similar using python generators.
A generator-based quicksort could look like this:
def partition(unsorted): sublist_length = len(unsorted) if sublist_length <= 1: return unsorted, () pivot = unsorted lhs =  rhs =  for idx in range(1, sublist_length): finger = unsorted[idx] if finger < pivot: lhs.append(finger) else: rhs.append(finger) lhs.append(pivot) return lhs, pivot, rhs def quicksort(unsorted): if len(unsorted) < 2: yield from unsorted else: lhs, rhs = partition(unsorted) yield from quicksort(lhs) del lhs # release memory used by lhs yield pivot yield from quicksort(rhs)
For the uninitiated: quicksort works by splitting lists in two parts: one list with all items smaller than some pivot value, one with all items larger than that pivot value, and then repeating that procedure on both lists. Splitting the lists is the task of the partition function.
When comparing my implementation to lazy sort and the built-in
sorted function, you can clearly see that while Bora's
lazysort can't compete, his idea is sound. (This benchmark was based on retrieving the top 10 from a list of integers).
The chart is dominated by
lazysort, obscuring the fact that partial quicksort is indeed faster than the built-in
Now that we know we can beat
sorted, let's see why this is. A lazy sorting algorithm only has benefit when it can decide not to do some work. In quicksort's case, not having to sort the right partition several times in a row is a huge benefit. But where's the cut-off when sorting items?
If you just need the first few percent, lazy
quicksort can help out, but it has nothing on using
heapq, which breaks even at ~45%, and has a worst case (100% iteration) premium of only 200% over
sorted. Conclusion: while partial quicksort is an interesting gimmick, it has no added value over
Possible performance improvements
I'm doubtful optimization of
quicksort will make it compete with
quicksort's best case (only retrieving one item) still requires performing
log(n) partitions; O(n log(log(n)) performance, and it's worst case (retrieving all items) is still O(n log(n)). The
heapq based solution has a generalized performance of O(n + m log(n)), which is always better than
Possible feature improvements
Currently support for
reverse parameters is lacking, but those are trivial to add.
Lazy Quicksort could fake an list-like interface. The current implementation only sorts the head, but the principle applies to any part of the sorted list. This is something you can't achieve with heapq, which can only operate at the end of the lists. This could speed up computation of median values, and percentiles, although with my test data, it will probably only outperform
heapq for the 45-55 percentile, which is a really narrow niche.