Don't compress our sortkeys #43
Comments
I've finally managed to measure this change against a subset of our activity event data. The bottom line is that I detected no noticeable improvement. I created two subsets of the activity event data for the period 1st November until 17th December, one with compressed SORTKEYs and the other uncompressed. These are available in the following tables:
I then ran modified versions of the engagement ratio / multi-device query, against each pair of tables. Those queries are available here:
The timings showed a lot of variation when other queries were running but converge around the 8-minute mark in both cases:
Perhaps with a bigger sample it would be possible to detect a more reliable difference, but it took over 2 days to set up the sample data for this small test so I'm not sure it's worth the effort. Fortunately importing the flow data is faster than the activity event data, so I'm going to try a bigger experiment with that. But even if there is a positive result there, I'm not sure it would be worth changing the activity event schemata. Given how much data is in those tables, I suspect a much bigger win might come if we implement some form of data expiry (see #45).
I just finished testing this against the full set of flow data using the time-to-device-connection query. As before, two separate datasets, one with compression and one without. Like the previous test, these also showed no improvement. In fact, if anything, the compressed version seemed to be a bit quicker:
Uncompressed: 7'40", 7'37", 7'35"
Given the above, I'll remove the
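For anyone repeating the comparison, here is a minimal sketch of how the minute/second timings quoted above could be turned into seconds and summarised. The helper name `to_seconds` is hypothetical, and only the uncompressed runs are included because the compressed timings are not reproduced in this thread.

```python
from statistics import mean


def to_seconds(timing):
    """Convert a timing written as 7'40" (minutes'seconds") into seconds."""
    minutes, seconds = timing.strip().rstrip('"').split("'")
    return int(minutes) * 60 + int(seconds)


# The three uncompressed runs quoted in the comment above.
uncompressed = ["7'40\"", "7'37\"", "7'35\""]

uncompressed_seconds = [to_seconds(t) for t in uncompressed]
mean_seconds = mean(uncompressed_seconds)
spread_seconds = max(uncompressed_seconds) - min(uncompressed_seconds)

print(uncompressed_seconds, round(mean_seconds, 1), spread_seconds)
```

Running the same summary over the compressed runs would show whether the gap between the two means is larger than the run-to-run spread.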
👍 thanks for doing the necessary science here @philbooth!
There's an interesting note on [1] that says:
"""
We do not recommend applying runlength encoding on any column that is designated as a sort key. Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns are compressed much more highly than other columns in the same query, range-restricted scans might perform poorly.
"""
We're not using run-length encoding, but we are compressing our sortkey. The presentation at [2] contains this advice:
"""
If your sort keys compress significantly more than your data columns, you may want to skip compression of sortkey column(s).
Check SVV_TABLE_INFO(skew_sortkey1)
"""
And indeed, SVV_TABLE_INFO tells me that skew_sortkey1 for our main tables is on the order of several hundred, which seems high. We might find there's a performance win if we disable compression of our sortkeys. It's probably worth an experiment.

[1] http://docs.aws.amazon.com/redshift/latest/dg/c_Runlength_encoding.html
[2] http://www.slideshare.net/AmazonWebServices/bdt401-amazon-redshift-deep-dive-tuning-and-best-practices
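The check described above could be sketched roughly as follows. This is an illustration, not the query actually run for this issue: the `high_skew_tables` helper, the threshold of 100, and the example table names are all assumptions. The column names (`sortkey1`, `sortkey1_enc`, `skew_sortkey1`) are those of the SVV_TABLE_INFO system view.

```python
# Query to list each table's first sort key column, its encoding, and the
# skew ratio reported by Redshift. "schema" and "table" are quoted because
# they collide with SQL keywords.
SKEW_QUERY = """
SELECT "schema", "table", sortkey1, sortkey1_enc, skew_sortkey1
FROM svv_table_info
WHERE skew_sortkey1 IS NOT NULL
ORDER BY skew_sortkey1 DESC;
"""


def high_skew_tables(rows, threshold=100.0):
    """Return (schema, table, skew) for rows whose skew exceeds threshold.

    rows: iterable of (schema, table, sortkey1, sortkey1_enc, skew_sortkey1)
    tuples, in the column order of SKEW_QUERY. The default threshold is an
    arbitrary assumption, not a documented cut-off.
    """
    return [
        (schema, table, skew)
        for schema, table, _sortkey, _encoding, skew in rows
        if skew is not None and skew > threshold
    ]


# Hypothetical rows for illustration only; real values come from running
# SKEW_QUERY against the cluster with whatever client is in use.
example_rows = [
    ("public", "example_events", "timestamp", "lzo", 350.0),
    ("public", "example_small", "timestamp", "none", 1.2),
]
print(high_skew_tables(example_rows))
```

Tables that come back from this filter are the candidates for an experiment with an uncompressed sort key.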