Performance and correctness fixes for the flow import scripts #44

philbooth · 2016-12-17T16:54:53Z

This changeset contains three significant commits:

flow_metadata.flow_id is promoted to DISTKEY, to match flow_events.flow_id. This fixes flow_metadata table should use flow_id as distkey #40.
Compression is disabled on flow_metadata.begin_time and flow_events.timestamp. This is a partial fix for Dont compress our sortkeys #43.
Sane deletion logic for overlapping flow data is implemented via flow_metadata.export_date and flow_events.export_date columns. This a combination of Suggested tweak to data re-import logic #39 and fix(db): sanely delete rows to avoid duplicate event data #34, with a couple of fixes for stupid syntax errors (my fault). Fixes Flow event CSVs contain overlapping data #33.

For expediency, or perhaps it was just laziness, I tested the effects of these changes all together. I hope that's okay, because they can still be measured independently to some extent.

The import time should only be affected by the export_date commit. Performance of the analytical queries is affected by both the DISTKEY and the SORTKEY commits, but it seemed like the latter was the only one we were unsure about there so I opted to measure them together. And we can get independent validation of the SORTKEY change by further measuring its impact against a subset of the activity event data, which I plan to do tomorrow.

So the results, based on 104 days of data (2nd September until 16th December), are like so:

Complete import duration: ~9.5 hours
Of which, vacuuming duration: ~45 minutes

Independently clearing the 104th day: ~20 seconds
Independently importing the 104th day: ~7.5 minutes

Executing the engagement ratio/multi-device query: ~1.5 hours

The import time is a huge improvement on my original ham-fisted attempt to fix duplicate events in #34. It's a little bit worse than the last timing I have for importing without the fix, which was ~6 hours for 84 days of data. The single-day clear/import figures seem eminently reasonable too.

The performance of the engagement ratio query is a significant improvement over what we currently have. My last timing was ~2 hours when run against 96 days of data.

In terms of correctness, everything seems to look okay. I can conceive of one theoretical problem but it shouldn't harm us in practice. Before the content server implemented flow event timestamp validation (train 74), we emitted events that were more than one day removed from their export_date. In such cases, were we to clear and then import a single day, it would lead to duplicating those events with the current logic.

However, in practice, I don't expect us to re-import any of that historical data on a per-day basis and the described problem does not occur when re-importing wholesale. Furthermore, given that we know the data prior to train 74 contains garbage, I fully expect us to permanently delete it all once we've built up a bit more history to show longer trends in the charts.

All told, I believe these changes are a big improvement on what we currently have. @rfk, r?

rfk · 2016-12-19T00:04:05Z

Before the content server implemented flow event timestamp validation (train 74),
we emitted events that were more than one day removed from their export_date.

Would it be worthwhile to add logic that drops such events before import, e.g. by deleting anything in temporary_raw_flow_data that has a timestamp outside the affected range?

The performance of the engagement ratio query is a significant improvement over
what we currently have

So I gotta be honest...I don't understand how the changes to the flow-event tables can have affected this query over activity-events. Unless I'm misunderstanding what you're saying above.

rfk

One question re: the unique constraint, but otherwise this LGTM. Nice work @philbooth, r+

rfk · 2016-12-19T00:11:26Z

import_flow_events.py

@@ -61,8 +61,8 @@
 """
 Q_CREATE_METADATA_TABLE = """
    CREATE TABLE IF NOT EXISTS flow_metadata (
-      flow_id VARCHAR(64) NOT NULL UNIQUE ENCODE lzo,


Do we lose anything by dropping the UNIQUE here? Redshift doesn't enforce the uniqueness, but it does allegedly use it to speed up queries in some circumstances. (Indeed, I wonder if flow_id should be marked as PRIMARY KEY in this table).

Do we lose anything by dropping the UNIQUE here?

I don't know for sure. Based on reading the docs beforehand I didn't think so but...

Redshift doesn't enforce the uniqueness, but it does allegedly use it to speed up queries in some circumstances.

I didn't realise this. Where did you read about the performance part? I confess I am on shaky ground with much of the key-related stuff in redshift, some of the docs could do with more concrete detail imho.

See e.g. http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html although that only talks about primary keys, not general unique constraints.

philbooth · 2016-12-19T04:19:38Z

Would it be worthwhile to add logic that drops such events before import, e.g. by deleting anything in temporary_raw_flow_data that has a timestamp outside the affected range?

Possibly, yep. Or it may be simpler to just run a separate clean-up script once against the affected CSVs and scrub the future events. I don't know why I didn't do that already tbh.

I don't understand how the changes to the flow-event tables can have affected this query over activity-events.

Oh dear, that is an excellent point. I'm talking complete gibberish, sorry. For some reason, I was completely mixed up about what tables each query runs against, I don't think my brain is working properly post-Hawaii yet.

Annoyingly, I now don't have anything to make the before-and-after comparison against because I didn't record any numbers for the time-to-device-connection query.

I'm still working on measuring the SORTKEY change against a subset of the activity event data though, so hopefully it's okay to extrapolate from there?

rfk · 2016-12-19T04:20:20Z

I'm still working on measuring the SORTKEY change against a subset of the activity
event data though, so hopefully it's okay to extrapolate from there?

👍

philbooth · 2016-12-21T17:58:45Z

Changes pushed:

UNIQUE constraint reinstated as per Performance and correctness fixes for the flow import scripts #44 (comment). I couldn't find anything to suggest that it has any benefit, but I also couldn't find anything to suggest it does any harm.
SORTKEY compression has been reinstated, see discussion in Dont compress our sortkeys #43.
Old-style flow.${viewName}.begin events are mapped to flow.begin on import, just to make querying against old data more consistent in light of fix(metrics): separate the begin and view flow events fxa-content-server#4440. Fixes Remove ${viewName} from begin events, move it to flow_metadata #19.

I've also added a WIP label to the PR because I plan to try these changes out in anger tonight, just to make sure I haven't broken anything.

philbooth · 2016-12-28T11:43:06Z

Removed the WIP from this and carrying forward @rfk's earlier r+. These changes are running well against our redshift instance and the import cron jobs have been reinstated.

philbooth added the waffle:review label Dec 17, 2016

philbooth added this to the FxA-41:signin funnel metrics milestone Dec 17, 2016

philbooth self-assigned this Dec 17, 2016

philbooth requested a review from rfk December 17, 2016 16:54

philbooth changed the title ~~Phil/flow schema~~ Performance and correctness fixes for the flow import scripts Dec 17, 2016

rfk approved these changes Dec 19, 2016

View reviewed changes

philbooth mentioned this pull request Dec 21, 2016

Dont compress our sortkeys #43

Closed

philbooth added 4 commits December 21, 2016 17:49

fix(db): promote flow_metadata.flow_id to DISTKEY

2412ef8

fix(db): add export_date column and sane deletion logic for flow data

73631a1

chore(scripts): consistently indent flow import queries

e85ecb7

fix(scripts): marshall flow.${viewName}.begin to flow.begin on import

cb4d050

philbooth force-pushed the phil/flow-schema branch from eae6e2c to cb4d050 Compare December 21, 2016 17:49

philbooth added the WIP label Dec 21, 2016

philbooth removed the WIP label Dec 28, 2016

philbooth merged commit ef854b5 into master Dec 28, 2016

philbooth removed the waffle:review label Dec 28, 2016

philbooth deleted the phil/flow-schema branch December 28, 2016 11:44

philbooth mentioned this pull request Dec 28, 2016

Remove ${viewName} from begin events, move it to flow_metadata #19

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance and correctness fixes for the flow import scripts #44

Performance and correctness fixes for the flow import scripts #44

philbooth commented Dec 17, 2016 •

edited

Loading

rfk commented Dec 19, 2016

rfk left a comment

rfk Dec 19, 2016

philbooth Dec 19, 2016 •

edited

Loading

rfk Dec 19, 2016

philbooth commented Dec 19, 2016

rfk commented Dec 19, 2016

philbooth commented Dec 21, 2016 •

edited

Loading

philbooth commented Dec 28, 2016

Performance and correctness fixes for the flow import scripts #44

Performance and correctness fixes for the flow import scripts #44

Conversation

philbooth commented Dec 17, 2016 • edited Loading

rfk commented Dec 19, 2016

rfk left a comment

Choose a reason for hiding this comment

rfk Dec 19, 2016

Choose a reason for hiding this comment

philbooth Dec 19, 2016 • edited Loading

Choose a reason for hiding this comment

rfk Dec 19, 2016

Choose a reason for hiding this comment

philbooth commented Dec 19, 2016

rfk commented Dec 19, 2016

philbooth commented Dec 21, 2016 • edited Loading

philbooth commented Dec 28, 2016

philbooth commented Dec 17, 2016 •

edited

Loading

philbooth Dec 19, 2016 •

edited

Loading

philbooth commented Dec 21, 2016 •

edited

Loading