Pagination in Cassandra

newacct edited this page Oct 14, 2010 · 2 revisions

Logsandra is a log management application. The idea behind Logsandra is twofold: to make the day easier for people with large amounts of log files, and to showcase how a Cassandra application can be built. For Logsandra to be useful, it needs the ability to display a list of log entries. Given the amount of data a Cassandra cluster can handle, I need to divide the list into smaller chunks and make use of next and previous links. This technique is called pagination, and I thought it would be really easy to implement with Cassandra by using a SliceRange when calling get_range_slices. Do not worry, it is easy if you do it right.

SliceRange takes four arguments: start, finish, reversed and count. My first endeavor was to get the next link working. From the initial get_range_slices call I stored the last column key, which I sent in as the start argument in my second get_range_slices call. It worked! The previous link was a bit trickier. I continued with the same concept, except I used the first column key instead of the last. My intuition told me to use the first column key as the finish argument, and to set reversed to true. Oddly, this did not work.

Between this attempt and my “epiphany” I tried various solutions, most of which involved querying for a large number of columns and storing them in my application. Solutions of this kind got too complex, too fast, mainly because of the need to store the columns in between HTTP requests.

I was back to square one, until it hit me, thanks to a comment by Brandon Williams (driftx). The solution was pretty obvious now: start is always start. The first column key should still be the start argument, and reversed should be true. It worked! Below is an illustration that helps me visualize it; I hope it helps someone else too:

    Next: Reversed: false             Start -> Finish    Start = Last column key
    Prev: Reversed: true    Finish <- Start              Start = First column key
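The rule "start is always start" can be tried out without a cluster. The following is a minimal sketch that mimics an inclusive column slice over an in-memory sorted list; `slice_columns` and its parameters are hypothetical names made up for this illustration, not Cassandra or Pycassa calls.

```python
from bisect import bisect_left, bisect_right

def slice_columns(columns, start='', reversed_=False, count=10):
    """Mimic an inclusive column slice over sorted column keys.

    Hypothetical in-memory stand-in, not a Cassandra or Pycassa call.
    An empty start means "begin at whichever end the direction implies".
    """
    keys = sorted(columns)
    if not reversed_:
        # Forward: begin at start (inclusive) and walk towards the end.
        begin = bisect_left(keys, start) if start else 0
        return keys[begin:begin + count]
    # Reversed: begin at start (inclusive) and walk towards the beginning.
    end = bisect_right(keys, start) if start else len(keys)
    return keys[max(0, end - count):end][::-1]

columns = list('abcdefghijklmnopqrstuvwxyz')

first_page = slice_columns(columns)
# Next: start = last column key, reversed = false
next_page = slice_columns(columns, start=first_page[-1])
# Prev: start = first column key, reversed = true
prev_page = slice_columns(columns, start=next_page[0], reversed_=True)

print(first_page)  # a through j
print(next_page)   # j through s
print(prev_page)   # j down to a, so it has to be reversed before display
```

Note that the previous page comes back in descending order, which is why the result is flipped again before it is shown.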

Another thing to note is that start and finish are inclusive, which means that two adjacent pages will share one element. I got around that in Logsandra because I use LongType for column keys (storing dates as Unix time with microseconds). I only needed to add one to the last column key when going to the next page, and subtract one from the first column key when going to the previous page.
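The effect of the inclusive bounds, and of the plus one / minus one adjustment, can be sketched with plain integers standing in for LongType column keys. The `page` helper and the sample timestamps are assumptions made for this illustration only; they are not taken from Logsandra.

```python
def page(keys, start=None, reversed_=False, count=3):
    """Inclusive slice over sorted integer keys (in-memory stand-in)."""
    keys = sorted(keys)
    if reversed_:
        selected = [k for k in reversed(keys) if start is None or k <= start]
    else:
        selected = [k for k in keys if start is None or k >= start]
    return selected[:count]

# Made-up integer column keys, e.g. timestamps
timestamps = [100, 105, 110, 120, 130, 145, 150]

first_page = page(timestamps)
# Reusing the last key as-is repeats it, because start is inclusive.
overlapping = page(timestamps, start=first_page[-1])
# Adding one to the last key skips the shared element.
second_page = page(timestamps, start=first_page[-1] + 1)
# Going back: subtract one from the first key and read in reverse.
previous = page(timestamps, start=second_page[0] - 1, reversed_=True)
previous.reverse()  # restore ascending order for display

print(first_page)   # [100, 105, 110]
print(overlapping)  # [110, 120, 130]
print(second_page)  # [120, 130, 145]
print(previous)     # [100, 105, 110]
```

This only works when the key type has a well-defined "next" and "previous" value, as integers do; for string keys the overlap has to be handled some other way.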

I have taken the time to write up a small example in Python using Pycassa to show how pagination can be done:

import pycassa
from string import ascii_lowercase
from ordereddict import OrderedDict

client = pycassa.connect(['%s:%s' % ('localhost', 9160)], timeout=10)
pagination_example = pycassa.ColumnFamily(client, 'logsandra', 
                                          'pagination_example_ascii', dict_class=OrderedDict)

# Set every letter in the alphabet as a column key at row key 'example'
data = {}
for i in ascii_lowercase:
    data[i] = i
pagination_example.insert('example', data)

def get_data_paginated(rowkey, action_next=None, action_prev=None):
    if action_next and action_prev:
        raise ValueError('action_next and action_prev are mutually exclusive')

    column_start = ''
    column_reversed = False
    if action_next:
        column_start = action_next

    if action_prev:
        column_start = action_prev
        column_reversed = True

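    # Fetch one page, starting at column_start and walking in the chosen direction
    result = pagination_example.get(rowkey, column_count=10,
                                    column_start=column_start,
                                    column_reversed=column_reversed)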

    # If we got columns in reversed order, reverse them
    if column_reversed:
        result = OrderedDict(reversed(result.items()))

    # Return result, first column key, last column key
    keys = result.keys()
    return result, keys[0], keys[-1]

# Try to execute an example
result, first, last = get_data_paginated('example')
print 'INIT: %s\n' % result

result, first, last = get_data_paginated('example', action_next=last)
print 'NEXT: %s\n' % result

result, first, last = get_data_paginated('example', action_next=last)
print 'NEXT: %s\n' % result

result, first, last = get_data_paginated('example', action_prev=first)
print 'PREV: %s\n' % result

result, first, last = get_data_paginated('example', action_prev=first)
print 'PREV: %s' % result

The example is very simple, has no checks for out of bounds, and is probably missing a few other features too. The Logsandra source code is available with a more complete implementation, although that code might be a bit harder to follow. Feel free to ask any questions, and if you think something is wrong in this article, help me correct it!

A note to Python and Pycassa users (and maybe to users of other languages/libraries too): if you use pagination you also need to have your data ordered. Pycassa by default returns a dict, which in Python is an unordered hash table. But Pycassa supports custom dict implementations such as ordereddict, which will be part of Python 2.7.
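As a small illustration of why the ordering matters, here is a sketch that uses `collections.OrderedDict` from the standard library (available from Python 2.7 onwards) to flip a page that was read in reverse. The sample columns are made up for the example.

```python
from collections import OrderedDict

# Columns as they might come back from a reversed slice: highest key first.
reversed_result = OrderedDict([('j', 'J'), ('i', 'I'), ('h', 'H')])

# Flip the page back into ascending order before displaying it.
display = OrderedDict(reversed(list(reversed_result.items())))

keys = list(display.keys())
first, last = keys[0], keys[-1]

print(keys)         # ['h', 'i', 'j']
print(first, last)  # h j
```

The first and last keys are only meaningful as page boundaries because the mapping remembers the order in which the columns were returned.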