
Performance and correctness fixes for the flow import scripts #44

Merged (4 commits, Dec 28, 2016)

Conversation


@philbooth philbooth commented Dec 17, 2016

This changeset contains three significant commits: the export_date commit, the DISTKEY commit and the SORTKEY commit.

For expediency, or perhaps it was just laziness, I tested the effects of these changes all together. I hope that's okay, because they can still be measured independently to some extent.

The import time should only be affected by the export_date commit. Performance of the analytical queries is affected by both the DISTKEY and the SORTKEY commits, but the SORTKEY change was the only one we were unsure about, so I opted to measure them together. We can also get independent validation of the SORTKEY change by measuring its impact against a subset of the activity event data, which I plan to do tomorrow.
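
For illustration, this is roughly what the DISTKEY and SORTKEY declarations look like in Redshift DDL, written in the same style as the scripts' query constants. Treat it as a sketch: flow_id as the distkey on flow_metadata is the goal, but the other column names and types here are placeholders rather than the real schema.

    # Sketch only: distribute rows by flow_id so joins between the flow tables
    # stay node-local, and sort by a time column so range scans over recent
    # days touch fewer blocks. Placeholder columns, not the actual schema.
    Q_CREATE_METADATA_TABLE = """
        CREATE TABLE IF NOT EXISTS flow_metadata (
            flow_id VARCHAR(64) NOT NULL DISTKEY ENCODE lzo,
            begin_time TIMESTAMP NOT NULL SORTKEY
            -- other columns omitted
        );
    """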

So the results, based on 104 days of data (2nd September until 16th December), are like so:

Complete import duration: ~9.5 hours
Of which, vacuuming duration: ~45 minutes

Independently clearing the 104th day: ~20 seconds
Independently importing the 104th day: ~7.5 minutes

Executing the engagement ratio/multi-device query: ~1.5 hours

The import time is a huge improvement on my original ham-fisted attempt to fix duplicate events in #34. It's a little bit worse than the last timing I have for importing without the fix, which was ~6 hours for 84 days of data. The single-day clear/import figures seem eminently reasonable too.

The performance of the engagement ratio query is a significant improvement over what we currently have. My last timing was ~2 hours when run against 96 days of data.

In terms of correctness, everything seems to look okay. I can conceive of one theoretical problem but it shouldn't harm us in practice. Before the content server implemented flow event timestamp validation (train 74), we emitted events that were more than one day removed from their export_date. In such cases, were we to clear and then import a single day, it would lead to duplicating those events with the current logic.

However, in practice, I don't expect us to re-import any of that historical data on a per-day basis and the described problem does not occur when re-importing wholesale. Furthermore, given that we know the data prior to train 74 contains garbage, I fully expect us to permanently delete it all once we've built up a bit more history to show longer trends in the charts.

All told, I believe these changes are a big improvement on what we currently have. @rfk, r?

@philbooth philbooth self-assigned this Dec 17, 2016
@philbooth philbooth requested a review from rfk December 17, 2016 16:54
@philbooth philbooth changed the title Phil/flow schema Performance and correctness fixes for the flow import scripts Dec 17, 2016

rfk commented Dec 19, 2016

Before the content server implemented flow event timestamp validation (train 74),
we emitted events that were more than one day removed from their export_date.

Would it be worthwhile to add logic that drops such events before import, e.g. by deleting anything in temporary_raw_flow_data that has a timestamp outside the affected range?
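
Something along these lines, purely as a sketch; the column name, timestamp units and parameter style are guesses rather than what the import script actually uses:

    # Sketch only: before copying from temporary_raw_flow_data into the real
    # tables, drop any rows whose event time is more than a day away from the
    # export date being imported. Column name and units are assumptions.
    Q_SCRUB_STRAY_EVENTS = """
        DELETE FROM temporary_raw_flow_data
        WHERE event_time < DATEADD(day, -1, %(export_date)s)
           OR event_time > DATEADD(day, 1, %(export_date)s);
    """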

The performance of the engagement ratio query is a significant improvement over
what we currently have

So I gotta be honest...I don't understand how the changes to the flow-event tables can have affected this query over activity-events. Unless I'm misunderstanding what you're saying above.

@rfk rfk left a comment

One question re: the unique constraint, but otherwise this LGTM. Nice work @philbooth, r+

@@ -61,8 +61,8 @@
"""
Q_CREATE_METADATA_TABLE = """
CREATE TABLE IF NOT EXISTS flow_metadata (
flow_id VARCHAR(64) NOT NULL UNIQUE ENCODE lzo,

@rfk rfk

Do we lose anything by dropping the UNIQUE here? Redshift doesn't enforce the uniqueness, but it does allegedly use it to speed up queries in some circumstances. (Indeed, I wonder if flow_id should be marked as PRIMARY KEY in this table).
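
For example, something like the following; just a sketch, with the table's other columns omitted:

    # Sketch: declare flow_id as the (unenforced) primary key instead of UNIQUE.
    # Redshift won't enforce it, but the planner can treat it as a hint, so the
    # import job still has to guarantee uniqueness itself.
    Q_CREATE_METADATA_TABLE = """
        CREATE TABLE IF NOT EXISTS flow_metadata (
            flow_id VARCHAR(64) NOT NULL ENCODE lzo,
            -- other columns omitted
            PRIMARY KEY (flow_id)
        );
    """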

@philbooth philbooth Dec 19, 2016

Do we lose anything by dropping the UNIQUE here?

I don't know for sure. Based on reading the docs beforehand I didn't think so but...

Redshift doesn't enforce the uniqueness, but it does allegedly use it to speed up queries in some circumstances.

I didn't realise this. Where did you read about the performance part? I confess I am on shaky ground with much of the key-related stuff in redshift; some of the docs could do with more concrete detail imho.

@rfk rfk

See e.g. http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html although that only talks about primary keys, not general unique constraints.

@philbooth

Would it be worthwhile to add logic that drops such events before import, e.g. by deleting anything in temporary_raw_flow_data that has a timestamp outside the affected range?

Possibly, yep. Or it may be simpler to just run a separate clean-up script once against the affected CSVs and scrub the future events. I don't know why I didn't do that already tbh.
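
Roughly what I have in mind for that clean-up, sketched under some assumptions about the CSV layout (epoch-millisecond timestamp in the first column) that may not match the real export format:

    # Hypothetical one-off scrub: keep only rows whose event timestamp falls
    # within a day of the file's export date. The column position and the
    # millisecond timestamps are assumptions about the CSV layout.
    import csv
    from datetime import datetime, time, timedelta

    def scrub_csv(in_path, out_path, export_date, timestamp_column=0):
        day_start = datetime.combine(export_date, time.min)
        lower = day_start - timedelta(days=1)
        upper = day_start + timedelta(days=2)
        with open(in_path) as infile, open(out_path, "w", newline="") as outfile:
            writer = csv.writer(outfile)
            for row in csv.reader(infile):
                event_time = datetime.utcfromtimestamp(int(row[timestamp_column]) / 1000)
                if lower <= event_time < upper:
                    writer.writerow(row)

The scrubbed files would then replace the originals before re-importing the affected range.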

I don't understand how the changes to the flow-event tables can have affected this query over activity-events.

Oh dear, that is an excellent point. I'm talking complete gibberish, sorry. For some reason I was completely mixed up about which tables each query runs against; I don't think my brain is working properly post-Hawaii yet.

Annoyingly, I now don't have anything to make the before-and-after comparison against because I didn't record any numbers for the time-to-device-connection query.

I'm still working on measuring the SORTKEY change against a subset of the activity event data though, so hopefully it's okay to extrapolate from there?


rfk commented Dec 19, 2016

I'm still working on measuring the SORTKEY change against a subset of the activity
event data though, so hopefully it's okay to extrapolate from there?

👍


philbooth commented Dec 21, 2016

Changes pushed:

I've also added a WIP label to the PR because I plan to try these changes out in anger tonight, just to make sure I haven't broken anything.

@philbooth philbooth removed the WIP label Dec 28, 2016
@philbooth

Removed the WIP from this and carrying forward @rfk's earlier r+. These changes are running well against our redshift instance and the import cron jobs have been reinstated.


Successfully merging this pull request may close these issues.

flow_metadata table should use flow_id as distkey
Flow event CSVs contain overlapping data