Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Precalculate a map of column names -> column indices #43

Merged
merged 4 commits into from

2 participants

@outoftime

Hey Kelley,

We noticed some pretty poor performance when pulling back big wide rows (~a few thousand columns) with UUID headers, and tracked it down to the use of the Array#index method in the default proc in @value_cache on the Row class. This would end up doing O(N**2) comparisons between column headers, which ends up being a considerable performance bottleneck when there are lots of column headers.

This patch lazily generates a @column_indices Hash that maps column names to column indices, and uses that in favor of column_names.index. I did some benchmarking and it performs better at various row sizes, although the difference only becomes significant between 1K-10K columns in the row. As you can see, at that point it's quite stark:

Rehearsal ----------------------------------------------------------------------------
Array#index: 1 columns                     0.000000   0.000000   0.000000 (  0.001695)
Hash lookup: 1 columns                     0.000000   0.000000   0.000000 (  0.001645)
Array#index: 10 columns                    0.000000   0.000000   0.000000 (  0.003558)
Hash lookup: 10 columns                    0.000000   0.000000   0.000000 (  0.003279)
Array#index: 100 columns                   0.020000   0.000000   0.020000 (  0.014500)
Hash lookup: 100 columns                   0.010000   0.000000   0.010000 (  0.009414)
Array#index: 1000 columns                  0.160000   0.000000   0.160000 (  0.161603)
Hash lookup: 1000 columns                  0.030000   0.000000   0.030000 (  0.036212)
Array#index: 10000 columns                13.290000   0.000000  13.290000 ( 13.362359)
Hash lookup: 10000 columns                 0.330000   0.000000   0.330000 (  0.349780)
------------------------------------------------------------------ total: 13.840000sec

                                               user     system      total        real
Array#index: 1 columns                     0.000000   0.000000   0.000000 (  0.002420)
Hash lookup: 1 columns                     0.000000   0.000000   0.000000 (  0.002279)
Array#index: 10 columns                    0.000000   0.000000   0.000000 (  0.001562)
Hash lookup: 10 columns                    0.000000   0.000000   0.000000 (  0.003126)
Array#index: 100 columns                   0.010000   0.000000   0.010000 (  0.006003)
Hash lookup: 100 columns                   0.010000   0.000000   0.010000 (  0.004811)
Array#index: 1000 columns                  0.160000   0.000000   0.160000 (  0.162298)
Hash lookup: 1000 columns                  0.030000   0.000000   0.030000 (  0.037594)
Array#index: 10000 columns                12.870000   0.000000  12.870000 ( 12.947497)
Hash lookup: 10000 columns                 0.320000   0.000000   0.320000 (  0.345220)

I've gisted the benchmarking code and a modified row.rb for benchmarking purposes if you want to try it yourself:

https://gist.github.com/4520182

Unfortunately I'm using my old desktop right now and it can't seem to run the tests without Cassandra dropping out constantly, but hopefully Travis will confirm a clean patch : )

outoftime and others added some commits
@outoftime outoftime Use Bundler to manage spec environment e34f5ee
@kreynolds Merge pull request #41 from outoftime/bundler
Use Bundler to manage spec environment
7cc3ad8
@outoftime outoftime Precalculate a map of column names -> column indices
This allows us to avoid using Array#index to map names to indices, which
requires N**2 comparisons between column names to fully hydrate a row.
Performance improvements are considerable for large result rows.
6390951
@kreynolds
Owner

Nice research .. I like the general idea, but a few tweaks will make it better imo:

  • I don't think we need a column_index helper as a reference to column_indices, it doesn't add clarity to me, just another layer
  • Since @column_indices are now used on the first access to the row (except for index only access, though we deserialize the names at that point anyway), line 25 should also use the hash and has_key instead of include?. Should speed up integer CFs quite a bit as well (benchmark? :) )

Thanks

@outoftime

Agreed on both counts, and updated -- thanks!

@kreynolds kreynolds merged commit 8d2905f into kreynolds:master
@kreynolds
Owner

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jan 7, 2013
  1. @outoftime
  2. Merge pull request #41 from outoftime/bundler

    authored
    Use Bundler to manage spec environment
Commits on Jan 12, 2013
  1. @outoftime

    Precalculate a map of column names -> column indices

    outoftime authored
    This allows us to avoid using Array#index to map names to indices, which
    requires N**2 comparisons between column names to fully hydrate a row.
    Performance improvements are considerable for large result rows.
Commits on Jan 13, 2013
  1. @outoftime
This page is out of date. Refresh to see the latest.
Showing with 10 additions and 4 deletions.
  1. +7 −3 lib/cassandra-cql/row.rb
  2. +3 −1 spec/spec_helper.rb
View
10 lib/cassandra-cql/row.rb
@@ -22,12 +22,12 @@ def initialize(row, schema)
@row, @schema = row, schema
@value_cache = Hash.new { |h, key|
# If it's a number and not one of our columns, assume it's an index
- if key.kind_of?(Fixnum) and !column_names.include?(key)
+ if key.kind_of?(Fixnum) and !column_indices.key?(key)
column_name = column_names[key]
column_index = key
else
column_name = key
- column_index = column_names.index(key)
+ column_index = column_indices[key]
end
if column_index.nil?
@@ -48,7 +48,11 @@ def column_names
ColumnFamily.cast(column.name, @schema.names[column.name])
end
end
-
+
+ def column_indices
+ @column_indices ||= Hash[column_names.each_with_index.to_a]
+ end
+
def column_values
column_names.map do |name|
@value_cache[name]
View
4 spec/spec_helper.rb
@@ -4,12 +4,14 @@
end
require 'rubygems'
+require 'bundler'
+Bundler.setup(:default, :test)
+
require 'yaml'
require 'rspec'
CASSANDRA_VERSION = ENV['CASSANDRA_VERSION'] || '1.1' unless defined?(CASSANDRA_VERSION)
-$LOAD_PATH << "#{File.expand_path(File.dirname(__FILE__))}/../lib"
require "cassandra-cql/#{CASSANDRA_VERSION}"
def yaml_fixture(file)
Something went wrong with that request. Please try again.