Script for populating raw deduplicated tables from live tables #220

Closed
jklukas opened this issue Jul 5, 2019 · 2 comments · Fixed by #262

jklukas (Contributor) commented Jul 5, 2019

As discussed in the BigQuery Table Layout and Structure Proposal, we will have the GCP pipeline populate "live" tables clustered on submission_timestamp, then rely on Airflow to run a nightly job to populate "raw" tables clustered on sample_id.

That likely will look like an additional mode in this repo's entrypoint script that will invoke a query like the following:

WITH
  srctable AS (
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.${document_namespace}_live.${document_type}_v${document_version}` 
  WHERE
    DATE(submission_timestamp) = @submission_date ),
  --
  numbered_duplicates AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    srctable )
  --
SELECT
  * EXCEPT (_n)
FROM
  numbered_duplicates
WHERE
  _n = 1

with output going to destination table moz-fx-data-shared-prod.${document_namespace}_raw.${document_type}_v${document_version}$ds_nodash.
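For concreteness, here's a minimal sketch of what that mode might look like using the google-cloud-bigquery Python client; the function name, query string, and wiring are illustrative assumptions, not the actual entrypoint code:

```python
# Hypothetical sketch of the "copy and deduplicate" mode; copy_deduplicate and
# dedup_query are illustrative names, not the repo's actual API.
from google.cloud import bigquery


def copy_deduplicate(client, dedup_query, dest_table, submission_date):
    """Run the dedup query for one day and overwrite that day's partition."""
    # Write to a single date partition, e.g. "..._raw.baseline_v1$20190731".
    partition = submission_date.strftime("%Y%m%d")
    job_config = bigquery.QueryJobConfig(
        destination=f"{dest_table}${partition}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "submission_date", "DATE", submission_date
            )
        ],
    )
    client.query(dedup_query, job_config=job_config).result()
```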

We will need to have two different modes. In one, we run the above query for only a specific table or set of tables. We'll need to add that at the root of the main_summary DAG, for example, to get live main pings into the deduplicated raw table before running main_summary and all the downstream jobs.
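As a rough illustration of that first mode (the operator choice, task ids, and entrypoint flags below are assumptions, not the real DAG code), the dedup task would sit at the root, upstream of main_summary:

```python
# Hypothetical Airflow 1.x sketch; flags and operators are for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    "main_summary",
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
) as dag:
    # Deduplicate live main pings into the raw table first...
    copy_dedup = BashOperator(
        task_id="copy_deduplicate_main_ping",
        bash_command=(
            "script/entrypoint copy_deduplicate "
            "--table=telemetry_live.main_v4 --date={{ ds }}"
        ),
    )
    # ...so main_summary and everything downstream reads deduplicated data.
    main_summary = BashOperator(task_id="main_summary", bash_command="...")
    copy_dedup >> main_summary
```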

In the other mode, we run the above query for all tables in _live datasets that are not already run as part of other DAGs. We probably need to pass in a list of tables to exclude, and keep that in sync with the tables that are handled in other DAGs.
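A sketch of how the second mode might enumerate candidate tables (again, the helper name and the shape of the exclude list are assumptions):

```python
# Hypothetical sketch: walk every *_live dataset in the project, skip tables
# that other DAGs already handle, and yield (live, raw) table-ID pairs that
# can be fed to the dedup query above.
from google.cloud import bigquery


def live_to_raw_tables(client, project, exclude=frozenset()):
    """Yield (live_table, raw_table) fully qualified ID pairs to deduplicate."""
    for dataset in client.list_datasets(project):
        if not dataset.dataset_id.endswith("_live"):
            continue
        for table in client.list_tables(dataset.reference):
            short_id = f"{dataset.dataset_id}.{table.table_id}"
            if short_id in exclude:
                # Already deduplicated as part of another DAG.
                continue
            raw_dataset = dataset.dataset_id[: -len("_live")] + "_raw"
            yield (
                f"{project}.{short_id}",
                f"{project}.{raw_dataset}.{table.table_id}",
            )


# Example: exclude main pings, which the main_summary DAG handles separately.
# pairs = live_to_raw_tables(
#     bigquery.Client(), "moz-fx-data-shared-prod",
#     exclude={"telemetry_live.main_v4"},
# )
```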

jklukas self-assigned this Jul 24, 2019
jklukas (Contributor, Author) commented Jul 24, 2019

I'm going to start working on this.

jklukas added a commit that referenced this issue Jul 26, 2019
Closes #220

A PR to schedule this script in Airflow to follow.
jklukas added a commit that referenced this issue Aug 1, 2019
Closes #220

A PR to schedule this script in Airflow to follow.
jklukas added a commit that referenced this issue Aug 1, 2019
Closes #220

A PR to schedule this script in Airflow to follow.
jklukas (Contributor, Author) commented Aug 2, 2019

This is now deployed in Airflow and running daily (for prod tables). The stable tables now contain 2 days of data. By comparing live to stable, we can see a fairly consistent dupe rate of ~0.01% in the fenix live tables, so our pipeline deduping is probably performing quite well:

WITH
  stable AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    _TABLE_SUFFIX AS table_name,
    COUNT(*) AS n
  FROM
    `moz-fx-data-shared-prod.org_mozilla_fenix_stable.*`
  WHERE
    DATE(submission_timestamp) >= "2019-07-31"
  GROUP BY
    1,
    2 ),
  --
  live AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    _TABLE_SUFFIX AS table_name,
    COUNT(*) AS n
  FROM
    `moz-fx-data-shared-prod.org_mozilla_fenix_live.*`
  WHERE
    DATE(submission_timestamp) >= "2019-07-31"
  GROUP BY
    1,
    2 )
  --
SELECT
  table_name,
  submission_date,
  stable.n AS stable_n,
  live.n AS live_n,
  (1 - (stable.n / live.n)) * 100 AS pct_dupes
FROM
  live
JOIN
  stable
USING
  (submission_date,
    table_name)
ORDER BY 1, 2
| Row | table_name        | submission_date | stable_n | live_n | pct_dupes             |
|-----|-------------------|-----------------|----------|--------|-----------------------|
| 1   | activation_v1     | 2019-07-31      | 2731     | 2733   | 0.07317965605562193   |
| 2   | activation_v1     | 2019-08-01      | 3911     | 3911   | 0.0                   |
| 3   | baseline_v1       | 2019-07-31      | 442118   | 442178 | 0.013569196115592597  |
| 4   | baseline_v1       | 2019-08-01      | 444724   | 444755 | 0.006970129621930443  |
| 5   | bookmarks_sync_v1 | 2019-07-31      | 141702   | 141708 | 0.004234058768737281  |
| 6   | events_v1         | 2019-07-31      | 282118   | 282170 | 0.0184286068682038    |
| 7   | events_v1         | 2019-08-01      | 286195   | 286211 | 0.005590281295964061  |
| 8   | history_sync_v1   | 2019-07-31      | 141626   | 141635 | 0.0063543615631767025 |
| 9   | metrics_v1        | 2019-07-31      | 48321    | 48323  | 0.004138815884779934  |
| 10  | metrics_v1        | 2019-08-01      | 47474    | 47476  | 0.004212654815061168  |
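For example, for baseline_v1 on 2019-07-31 the rate works out to (1 - 442118 / 442178) * 100 ≈ 0.0136%, matching the pct_dupes column above.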
