Script for populating raw deduplicated tables from live tables #220
I'm going to start working on this.
jklukas added a commit that referenced this issue on Jul 26, 2019: "Closes #220. A PR to schedule this script in Airflow to follow."
jklukas added a commit that referenced this issue on Aug 1, 2019: "Closes #220. A PR to schedule this script in Airflow to follow."
jklukas added a commit that referenced this issue on Aug 1, 2019: "Closes #220. A PR to schedule this script in Airflow to follow."
This is now deployed in Airflow and running daily (for prod tables). The stable tables now contain 2 days of data. By comparing live to stable, we can see a fairly consistent dupe rate of ~0.01% in the fenix live tables, so our pipeline deduping is probably performing quite well.
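For reference, a live-to-stable comparison of that sort can be expressed as a query along the following lines. This is only a sketch: the table names are a hypothetical fenix example, and the actual check may have been run differently.

```sql
-- Hypothetical sketch: estimate the dupe rate for one day of one fenix table
-- by comparing row counts in the live table and its deduplicated stable copy.
WITH
  live AS (
    SELECT COUNT(*) AS n
    FROM `moz-fx-data-shared-prod.org_mozilla_fenix_live.baseline_v1`
    WHERE DATE(submission_timestamp) = '2019-08-01'
  ),
  stable AS (
    SELECT COUNT(*) AS n
    FROM `moz-fx-data-shared-prod.org_mozilla_fenix_stable.baseline_v1`
    WHERE DATE(submission_timestamp) = '2019-08-01'
  )
SELECT
  (live.n - stable.n) / live.n AS dupe_rate
FROM
  live,
  stable
```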
As discussed in the BigQuery Table Layout and Structure Proposal, we will have the GCP pipeline populate "live" tables clustered on submission_timestamp, then rely on Airflow to run a nightly job to populate "raw" tables clustered on sample_id.
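To make that layout concrete, the difference between the two table shapes can be sketched as DDL. The project, dataset, and column names below are placeholders rather than the actual schemas:

```sql
-- Placeholder sketch: a live table clustered for time-ordered reads,
-- and a raw table clustered on sample_id for sampled analysis queries.
CREATE TABLE `my_project.my_namespace_live.example_v1` (
  submission_timestamp TIMESTAMP,
  document_id STRING,
  sample_id INT64,
  payload STRING
)
PARTITION BY DATE(submission_timestamp)
CLUSTER BY submission_timestamp;

CREATE TABLE `my_project.my_namespace_raw.example_v1` (
  submission_timestamp TIMESTAMP,
  document_id STRING,
  sample_id INT64,
  payload STRING
)
PARTITION BY DATE(submission_timestamp)
CLUSTER BY sample_id;
```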
That nightly job will likely look like an additional mode in this repo's entrypoint script that invokes a deduplication query, with output going to destination table `moz-fx-data-shared-prod.${document_namespace}_raw.${document_type}_v${document_version}$ds_nodash`.
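A minimal sketch of such a deduplication query, assuming it is keyed on document_id and parameterized by a single submission date (assumptions, not details confirmed in this thread):

```sql
-- Illustrative sketch only: keep one row per document_id for one day of a
-- live table, preferring the earliest submission; the result would be written
-- to the corresponding date partition of the _raw table.
SELECT
  * EXCEPT (_row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY document_id
      ORDER BY submission_timestamp
    ) AS _row_number
  FROM
    `moz-fx-data-shared-prod.${document_namespace}_live.${document_type}_v${document_version}`
  WHERE
    DATE(submission_timestamp) = @submission_date
)
WHERE
  _row_number = 1
```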
We will need to have two different modes. In one, we run the above query for only a specific table or set of tables. We'll need to add that at the root of the main_summary DAG, for example, to get live main pings into the deduplicated raw table before running main_summary and all the downstream jobs.
In the other mode, we run the above query for all tables in `_live` datasets that are not already handled as part of other DAGs. We probably need to pass in a list of tables to exclude, and keep that in sync with the tables that are handled in other DAGs.
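One possible shape for that exclusion, sketched against BigQuery's per-dataset INFORMATION_SCHEMA (the dataset name and exclude list below are hypothetical; in practice the list would need to be maintained alongside the DAG definitions):

```sql
-- Hypothetical sketch: list the tables in one _live dataset that still need
-- the nightly deduplication run, skipping tables already handled by other DAGs.
SELECT
  table_name
FROM
  `moz-fx-data-shared-prod.telemetry_live.INFORMATION_SCHEMA.TABLES`
WHERE
  table_type = 'BASE TABLE'
  AND table_name NOT IN ('main_v4')  -- example: handled at the root of the main_summary DAG
```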