Precalculate a map of column names -> column indices #43

Merged
merged 4 commits into from Jan 13, 2013

@outoftime
Contributor

Hey Kelley,

We noticed some pretty poor performance when pulling back big, wide rows (a few thousand columns) with UUID headers, and tracked it down to the use of Array#index in the default proc of @value_cache on the Row class. Each lookup scans the column names linearly, so fully hydrating a row does O(N**2) comparisons between column headers, which becomes a considerable bottleneck when there are lots of columns.

This patch lazily generates a @column_indices Hash that maps column names to column indices, and uses that instead of column_names.index. I did some benchmarking and it performs better at various row sizes, although the difference only becomes significant at around 1K-10K columns per row. As you can see, at that point it's quite stark:

Rehearsal ----------------------------------------------------------------------------
Array#index: 1 columns                     0.000000   0.000000   0.000000 (  0.001695)
Hash lookup: 1 columns                     0.000000   0.000000   0.000000 (  0.001645)
Array#index: 10 columns                    0.000000   0.000000   0.000000 (  0.003558)
Hash lookup: 10 columns                    0.000000   0.000000   0.000000 (  0.003279)
Array#index: 100 columns                   0.020000   0.000000   0.020000 (  0.014500)
Hash lookup: 100 columns                   0.010000   0.000000   0.010000 (  0.009414)
Array#index: 1000 columns                  0.160000   0.000000   0.160000 (  0.161603)
Hash lookup: 1000 columns                  0.030000   0.000000   0.030000 (  0.036212)
Array#index: 10000 columns                13.290000   0.000000  13.290000 ( 13.362359)
Hash lookup: 10000 columns                 0.330000   0.000000   0.330000 (  0.349780)
------------------------------------------------------------------ total: 13.840000sec

                                               user     system      total        real
Array#index: 1 columns                     0.000000   0.000000   0.000000 (  0.002420)
Hash lookup: 1 columns                     0.000000   0.000000   0.000000 (  0.002279)
Array#index: 10 columns                    0.000000   0.000000   0.000000 (  0.001562)
Hash lookup: 10 columns                    0.000000   0.000000   0.000000 (  0.003126)
Array#index: 100 columns                   0.010000   0.000000   0.010000 (  0.006003)
Hash lookup: 100 columns                   0.010000   0.000000   0.010000 (  0.004811)
Array#index: 1000 columns                  0.160000   0.000000   0.160000 (  0.162298)
Hash lookup: 1000 columns                  0.030000   0.000000   0.030000 (  0.037594)
Array#index: 10000 columns                12.870000   0.000000  12.870000 ( 12.947497)
Hash lookup: 10000 columns                 0.320000   0.000000   0.320000 (  0.345220)
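
For readers without the patch in front of them, here is a minimal sketch of the idea described above. It is not the gem's actual code: SketchRow, fetch_with_scan and fetch_with_hash are hypothetical names made up for illustration, and only column_indices and column_names come from the discussion. The hash is built once, on first use, and keeps the first position for a duplicate name so that it agrees with Array#index.

```ruby
# Hypothetical stand-in for the Row class; only the lookup logic is shown.
class SketchRow
  attr_reader :column_names, :column_values

  def initialize(column_names, column_values)
    @column_names = column_names
    @column_values = column_values
  end

  # Built lazily, once: { name => position }.
  # For duplicate names the first position wins, matching Array#index.
  def column_indices
    @column_indices ||= begin
      indices = {}
      column_names.each_with_index do |name, i|
        indices[name] = i unless indices.key?(name)
      end
      indices
    end
  end

  # Old approach: one linear scan of the names per lookup.
  def fetch_with_scan(name)
    i = column_names.index(name)
    i && column_values[i]
  end

  # New approach: one hash lookup per access.
  def fetch_with_hash(name)
    i = column_indices[name]
    i && column_values[i]
  end
end

row = SketchRow.new(%w[a b c], [1, 2, 3])
row.fetch_with_scan("b")  # => 2
row.fetch_with_hash("b")  # => 2
row.fetch_with_hash("zz") # => nil
```

Building the hash costs one pass over the names, so reading every column of an N-column row drops from roughly N**2 comparisons to roughly N hash operations.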

I've gisted the benchmarking code and a modified row.rb for benchmarking purposes if you want to try it yourself:

https://gist.github.com/4520182
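
If the gist is unavailable, a comparison of the same general shape can be sketched with Ruby's standard Benchmark.bmbm, which produces the rehearsal-then-real layout seen in the output above. This is an assumption-laden stand-in, not the gist's code: resolve_with_index and resolve_with_hash are hypothetical helpers that only resolve names to positions, without involving the Row class or Cassandra, and the column counts stop at 1000 to keep the run short.

```ruby
require "benchmark"
require "securerandom"

# Resolve every name to its position with a linear scan per name.
def resolve_with_index(names)
  names.map { |name| names.index(name) }
end

# Resolve every name through a hash built once up front.
def resolve_with_hash(names)
  indices = {}
  names.each_with_index { |name, i| indices[name] ||= i }
  names.map { |name| indices[name] }
end

Benchmark.bmbm(30) do |x|
  [1, 10, 100, 1000].each do |n|
    # UUID strings as column names, as in the wide rows described above.
    names = Array.new(n) { SecureRandom.uuid }
    x.report("Array#index: #{n} columns") { resolve_with_index(names) }
    x.report("Hash lookup: #{n} columns") { resolve_with_hash(names) }
  end
end
```

Timings will differ by machine and Ruby version, so treat any numbers it prints as indicative only.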

Unfortunately I'm using my old desktop right now and it can't seem to run the tests without Cassandra dropping out constantly, but hopefully Travis will confirm a clean patch : )

outoftime and others added some commits Jan 7, 2013
@outoftime Use Bundler to manage spec environment (e34f5ee)
@kreynolds Merge pull request #41 from outoftime/bundler: Use Bundler to manage spec environment (7cc3ad8)
@outoftime Precalculate a map of column names -> column indices (6390951)
This allows us to avoid using Array#index to map names to indices, which
requires N**2 comparisons between column names to fully hydrate a row.
Performance improvements are considerable for large result rows.
@kreynolds

Nice research .. I like the general idea, but a few tweaks will make it better imo:

  • I don't think we need a column_index helper as a reference to column_indices, it doesn't add clarity to me, just another layer
  • Since @column_indices is now used on the first access to the row (except for index-only access, though we deserialize the names at that point anyway), line 25 should also use the hash and has_key? instead of include?. Should speed up integer CFs quite a bit as well (benchmark? :) )
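
The second point can be illustrated with a small sketch. Line 25 of row.rb is not shown in this thread, so this assumes it is a membership test along the lines of column_names.include?(name). The variables below are local stand-ins, not the gem's instance state.

```ruby
# Integer column names, as in an integer-keyed column family.
column_names = [3, 1, 2]

# Name => position map, built in one pass (first position wins).
column_indices = {}
column_names.each_with_index { |name, i| column_indices[name] ||= i }

by_scan = column_names.include?(2)    # linear scan of the array
by_hash = column_indices.has_key?(2)  # hashed lookup, same answer

puts by_scan == by_hash
```

Since the hash is already built for name lookups, reusing it for the membership check adds no extra setup cost.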

Thanks

Contributor

Agreed on both counts, and updated -- thanks!

@kreynolds kreynolds merged commit 8d2905f into kreynolds:master Jan 13, 2013

1 check failed

default The Travis build failed
Owner

Thanks!
