Performance and correctness fixes for the flow import scripts #44
Would it be worthwhile to add logic that drops such events before import, e.g. by deleting anything in
So I gotta be honest... I don't understand how the changes to the flow-event tables can have affected this query over activity-events. Unless I'm misunderstanding what you're saying above.
One question re: the unique constraint, but otherwise this LGTM. Nice work @philbooth, r+
@@ -61,8 +61,8 @@
    """
Q_CREATE_METADATA_TABLE = """
    CREATE TABLE IF NOT EXISTS flow_metadata (
        flow_id VARCHAR(64) NOT NULL UNIQUE ENCODE lzo,
Do we lose anything by dropping the `UNIQUE` here? Redshift doesn't enforce the uniqueness, but it does allegedly use it to speed up queries in some circumstances. (Indeed, I wonder if `flow_id` should be marked as `PRIMARY KEY` in this table.)
> Do we lose anything by dropping the `UNIQUE` here?

I don't know for sure. Based on reading the docs beforehand, I didn't think so, but...

> Redshift doesn't enforce the uniqueness, but it does allegedly use it to speed up queries in some circumstances.

I didn't realise this. Where did you read about the performance part? I confess I'm on shaky ground with much of the key-related stuff in Redshift; some of the docs could do with more concrete detail, imho.
See e.g. http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html although that only talks about primary keys, not general unique constraints.
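For illustration, a hedged sketch of what declaring the constraint as an informational primary key might look like, in the style of the existing query constants (the constant name and abbreviated column list are assumptions, not this PR's actual code):

```python
# Sketch only: Redshift treats PRIMARY KEY and UNIQUE as informational
# constraints. They are not enforced at load time, but the planner can
# use them as hints, so declaring them only helps if the data really
# is unique.
Q_CREATE_METADATA_TABLE_PK = """
    CREATE TABLE IF NOT EXISTS flow_metadata (
        flow_id VARCHAR(64) NOT NULL ENCODE lzo,
        -- ...other columns elided...
        PRIMARY KEY (flow_id)
    );
"""
```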
Possibly, yep. Or it may be simpler to just run a separate clean-up script once against the affected CSVs and scrub the future events. I don't know why I didn't do that already tbh.
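For what it's worth, a minimal sketch of what such a one-off clean-up could look like, assuming millisecond epoch timestamps in a known column and one CSV per export day (the column index, date format, and one-day threshold are all assumptions, to be checked against the real export format):

```python
import csv
from datetime import datetime, timedelta

def scrub_future_events(in_path, out_path, export_date, ts_column=0):
    """Drop rows whose timestamp is more than one day past the export date.

    Sketch only: assumes millisecond epoch timestamps in column
    ``ts_column``; adjust to the actual CSV layout before use.
    """
    cutoff = datetime.strptime(export_date, '%Y-%m-%d') + timedelta(days=2)
    with open(in_path, newline='') as fin, \
            open(out_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            ts = datetime.utcfromtimestamp(int(row[ts_column]) / 1000)
            if ts < cutoff:
                writer.writerow(row)
```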
Oh dear, that is an excellent point. I'm talking complete gibberish, sorry. For some reason, I was completely mixed up about what tables each query runs against; I don't think my brain is working properly post-Hawaii yet. Annoyingly, I now don't have anything to make the before-and-after comparison against, because I didn't record any numbers for the time-to-device-connection query. I'm still working on measuring the
👍
Force-pushed from eae6e2c to cb4d050.
Changes pushed:
I've also added a
Removed the
This changeset contains three significant commits:

1. `flow_metadata.flow_id` is promoted to `DISTKEY`, to match `flow_events.flow_id`. This fixes "flow_metadata table should use flow_id as distkey" (#40).
2. Compression is disabled on `flow_metadata.begin_time` and `flow_events.timestamp`. This is a partial fix for "Dont compress our sortkeys" (#43).
3. Sane deletion logic for overlapping flow data is implemented via `flow_metadata.export_date` and `flow_events.export_date` columns. This is a combination of "Suggested tweak to data re-import logic" (#39) and "fix(db): sanely delete rows to avoid duplicate event data" (#34), with a couple of fixes for stupid syntax errors (my fault). Fixes "Flow event CSVs contain overlapping data" (#33).

For expediency, or perhaps it was just laziness, I tested the effects of these changes all together. I hope that's okay, because they can still be measured independently to some extent.
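To make those three changes concrete, here's a rough sketch of the revised metadata DDL, abbreviated to the columns under discussion (column types and the surrounding columns are assumptions, so treat this as illustrative rather than the literal diff):

```python
# Sketch: flow_id promoted to DISTKEY, the sortkey left uncompressed
# (ENCODE raw), and an export_date column added for deletion logic.
# Abbreviated; the real table defines more columns than shown here.
Q_CREATE_METADATA_TABLE = """
    CREATE TABLE IF NOT EXISTS flow_metadata (
        flow_id VARCHAR(64) NOT NULL DISTKEY ENCODE lzo,
        begin_time TIMESTAMP NOT NULL SORTKEY ENCODE raw,
        export_date DATE NOT NULL ENCODE lzo
        -- ...other columns elided...
    );
"""
```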
The import time should only be affected by the `export_date` commit. Performance of the analytical queries is affected by both the `DISTKEY` and the `SORTKEY` commits, but it seemed like the latter was the only one we were unsure about there, so I opted to measure them together. And we can get independent validation of the `SORTKEY` change by further measuring its impact against a subset of the activity event data, which I plan to do tomorrow.

So the results, based on 104 days of data (2nd September until 16th December), are like so:
- Complete import duration: ~9.5 hours
- Of which, vacuuming duration: ~45 minutes
- Independently clearing the 104th day: ~20 seconds
- Independently importing the 104th day: ~7.5 minutes
- Executing the engagement ratio/multi-device query: ~1.5 hours
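For reference, the single-day clear boils down to deleting by `export_date` before re-importing that day's CSVs, along these lines (a sketch; the constant names and psycopg2-style placeholders are assumptions):

```python
# Sketch: clear one export day from both tables before re-importing
# that day's CSVs, so re-running an import can't duplicate events.
Q_DELETE_METADATA_DAY = """
    DELETE FROM flow_metadata WHERE export_date = %s;
"""
Q_DELETE_EVENTS_DAY = """
    DELETE FROM flow_events WHERE export_date = %s;
"""
```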
The import time is a huge improvement on my original ham-fisted attempt to fix duplicate events in #34. It's a little bit worse than the last timing I have for importing without the fix, which was ~6 hours for 84 days of data. The single-day clear/import figures seem eminently reasonable too.
The performance of the engagement ratio query is a significant improvement over what we currently have. My last timing was ~2 hours when run against 96 days of data.
In terms of correctness, everything seems to look okay. I can conceive of one theoretical problem, but it shouldn't harm us in practice. Before the content server implemented flow event timestamp validation (train 74), we emitted events that were more than one day removed from their `export_date`. In such cases, were we to clear and then import a single day, the current logic would duplicate those events.

However, in practice, I don't expect us to re-import any of that historical data on a per-day basis, and the described problem does not occur when re-importing wholesale. Furthermore, given that we know the data prior to train 74 contains garbage, I fully expect us to permanently delete it all once we've built up a bit more history to show longer trends in the charts.
All told, I believe these changes are a big improvement on what we currently have. @rfk, r?