
Don't compress our sortkeys #43

Closed
rfk opened this issue Dec 16, 2016 · 3 comments

Comments


rfk commented Dec 16, 2016

There's an interesting note at [1] that says:

"""
We do not recommend applying runlength encoding on any column that is designated as a sort key. Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns are compressed much more highly than other columns in the same query, range-restricted scans might perform poorly.
"""

We're not using run-length encoding, but we are compressing our sortkey. The presentation at [2] contains this advice:

"""
If your sort keys compress significantly more than your data columns, you may want to skip compression of sortkey column(s).

Check SVV_TABLE_INFO(skew_sortkey1)
"""

And indeed, SVV_TABLE_INFO tells me that skew_sortkey1 for our main tables is on the order of several hundred, which seems high. We might find there's a performance win if we disable compression of our sortkeys. It's probably worth an experiment.
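For reference, the check suggested in [2] looks roughly like this (column list trimmed; SVV_TABLE_INFO has many more fields):

```sql
-- skew_sortkey1 is the ratio of the size of the largest non-sort-key
-- column to the size of the first sort key column; large values mean
-- the sort key is compressed much more heavily than the data columns.
SELECT "table",
       sortkey1,       -- first sort key column
       sortkey1_enc,   -- its compression encoding
       skew_sortkey1
FROM svv_table_info
ORDER BY skew_sortkey1 DESC;
```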

[1] http://docs.aws.amazon.com/redshift/latest/dg/c_Runlength_encoding.html
[2] http://www.slideshare.net/AmazonWebServices/bdt401-amazon-redshift-deep-dive-tuning-and-best-practices

philbooth added a commit that referenced this issue Dec 16, 2016
This change addresses two issues:

1. `flow_metadata.flow_id` is promoted to `DISTKEY`,
   bringing it into line with `flow_events.flow_id`.
   Fixes #40.

2. Compression is disabled for every `SORTKEY`,
   to improve general query performance.
   Fixes #43.
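In Redshift DDL terms, disabling sort key compression means declaring the SORTKEY column with ENCODE RAW. A minimal sketch of the idea, where only flow_id comes from the actual schema and the other column names are hypothetical:

```sql
-- Sketch only: begin_time, duration and ua_browser are hypothetical columns.
CREATE TABLE flow_metadata (
    flow_id     VARCHAR(64) ENCODE LZO,  -- DISTKEY, matching flow_events.flow_id (#40)
    begin_time  TIMESTAMP   ENCODE RAW,  -- SORTKEY column left uncompressed (#43)
    duration    BIGINT      ENCODE LZO,  -- non-key columns keep their compression
    ua_browser  VARCHAR(40) ENCODE LZO
)
DISTKEY (flow_id)
SORTKEY (begin_time);
```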

philbooth commented Dec 20, 2016

I've finally managed to measure this change against a subset of our activity event data. The bottom line is that I detected no noticeable improvement.

I created two subsets of the activity event data for the period 1st November until 17th December, one with compressed SORTKEYs and the other uncompressed. These are available in the following tables:

  • daily_multi_device_users_test_compressed
  • daily_activity_per_device_test_compressed
  • daily_multi_device_users_test_uncompressed
  • daily_activity_per_device_test_uncompressed

I then ran modified versions of the engagement ratio / multi-device query against each pair of tables. Those queries are available here:

The timings showed a lot of variation when other queries were running, but converged around the 8-minute mark in both cases:

  • Compressed: 8'06", 8'33", 13'21", 8'34", 7'27"
  • Uncompressed: 8'03", 7'41", 23'02", 7'50", 7'58"

Perhaps with a bigger sample it would be possible to detect a more reliable difference, but it took over 2 days to set up the sample data for this small test, so I'm not sure it's worth the effort.

Fortunately, importing the flow data is faster than importing the activity event data, so I'm going to try a bigger experiment with that. But even if there is a positive result there, I'm not sure it would be worth changing the activity event schemata. Given how much data is in those tables, I suspect a much bigger win might come if we implement some form of data expiry (see #45).


philbooth commented Dec 21, 2016

I just finished testing this against the full set of flow data, using the time-to-device-connection query. As before, I used two separate datasets, one with compression and one without. Like the previous test, this showed no noticeable improvement. In fact, if anything, the compressed version seemed to be a bit quicker:

  • Uncompressed: 7'40", 7'37", 7'35"
  • Compressed: 7'35", 7'19", 6'58"

Given the above, I'll remove the SORTKEY change from #44. Closing this.


rfk commented Dec 21, 2016

👍 thanks for doing the necessary science here @philbooth!
