Script for populating raw deduplicated tables from live tables #220

Closed
jklukas opened this issue Jul 5, 2019 · 2 comments · Fixed by #262

jklukas (Contributor) commented Jul 5, 2019

As discussed in the BigQuery Table Layout and Structure Proposal, we will have the GCP pipeline populate "live" tables clustered on submission_timestamp, then rely on Airflow to run a nightly job to populate "raw" tables clustered on sample_id.

That likely will look like an additional mode in this repo's entrypoint script that will invoke a query like the following:

WITH
  srctable AS (
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.${document_namespace}_live.${document_type}_v${document_version}` 
  WHERE
    DATE(submission_timestamp) = @submission_date ),
  --
  numbered_duplicates AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    srctable )
  --
SELECT
  * EXCEPT (_n)
FROM
  numbered_duplicates
WHERE
  _n = 1

with output going to destination table moz-fx-data-shared-prod.${document_namespace}_raw.${document_type}_v${document_version}$ds_nodash.
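For concreteness, here's a minimal sketch of what that mode might look like using the google-cloud-bigquery Python client; the function name, query string, and wiring are illustrative assumptions, not the actual entrypoint code:

```python
# Hypothetical sketch of the "copy and deduplicate" mode; copy_deduplicate and
# dedup_query are illustrative names, not the repo's actual API.
from google.cloud import bigquery


def copy_deduplicate(client, dedup_query, dest_table, submission_date):
    """Run the dedup query for one day and overwrite that day's partition."""
    # Write to a single date partition, e.g. "..._raw.baseline_v1$20190731".
    partition = submission_date.strftime("%Y%m%d")
    job_config = bigquery.QueryJobConfig(
        destination=f"{dest_table}${partition}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "submission_date", "DATE", submission_date
            )
        ],
    )
    client.query(dedup_query, job_config=job_config).result()
```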

We will need to have two different modes. In one, we run the above query for only a specific table or set of tables. We'll need to add that at the root of the main_summary DAG, for example, to get live main pings into the deduplicated raw table before running main_summary and all the downstream jobs.
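As a rough illustration of that first mode (the operator choice, task ids, and entrypoint flags below are assumptions, not the real DAG code), the dedup task would sit at the root, upstream of main_summary:

```python
# Hypothetical Airflow 1.x sketch; flags and operators are for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    "main_summary",
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
) as dag:
    # Deduplicate live main pings into the raw table first...
    copy_dedup = BashOperator(
        task_id="copy_deduplicate_main_ping",
        bash_command=(
            "script/entrypoint copy_deduplicate "
            "--table=telemetry_live.main_v4 --date={{ ds }}"
        ),
    )
    # ...so main_summary and everything downstream reads deduplicated data.
    main_summary = BashOperator(task_id="main_summary", bash_command="...")
    copy_dedup >> main_summary
```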

In the other mode, we run the above query for all tables in _live datasets that are not already run as part of other DAGs. We probably need to pass in a list of tables to exclude, and keep that in sync with the tables that are handled in other DAGs.
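A sketch of how the second mode might enumerate candidate tables (again, the helper name and the shape of the exclude list are assumptions):

```python
# Hypothetical sketch: walk every *_live dataset in the project, skip tables
# that other DAGs already handle, and yield (live, raw) table-ID pairs that
# can be fed to the dedup query above.
from google.cloud import bigquery


def live_to_raw_tables(client, project, exclude=frozenset()):
    """Yield (live_table, raw_table) fully qualified ID pairs to deduplicate."""
    for dataset in client.list_datasets(project):
        if not dataset.dataset_id.endswith("_live"):
            continue
        for table in client.list_tables(dataset.reference):
            short_id = f"{dataset.dataset_id}.{table.table_id}"
            if short_id in exclude:
                # Already deduplicated as part of another DAG.
                continue
            raw_dataset = dataset.dataset_id[: -len("_live")] + "_raw"
            yield (
                f"{project}.{short_id}",
                f"{project}.{raw_dataset}.{table.table_id}",
            )


# Example: exclude main pings, which the main_summary DAG handles separately.
# pairs = live_to_raw_tables(
#     bigquery.Client(), "moz-fx-data-shared-prod",
#     exclude={"telemetry_live.main_v4"},
# )
```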

jklukas self-assigned this Jul 24, 2019
jklukas (Contributor, Author) commented Jul 24, 2019

I'm going to start working on this.

jklukas added a commit that referenced this issue Jul 26, 2019
Closes #220

A PR to schedule this script in Airflow to follow.
jklukas added a commit that referenced this issue Aug 1, 2019
Closes #220

A PR to schedule this script in Airflow to follow.
jklukas added a commit that referenced this issue Aug 1, 2019
Closes #220

A PR to schedule this script in Airflow to follow.
jklukas (Contributor, Author) commented Aug 2, 2019

This is now deployed in Airflow and running daily (for prod tables). The stable tables now contain 2 days of data. By comparing live to stable, we can see a fairly consistent dupe rate of ~0.01% in the fenix live tables, so our pipeline deduping is probably performing quite well:

WITH
  stable AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    _TABLE_SUFFIX AS table_name,
    COUNT(*) AS n
  FROM
    `moz-fx-data-shared-prod.org_mozilla_fenix_stable.*`
  WHERE
    DATE(submission_timestamp) >= "2019-07-31"
  GROUP BY
    1,
    2 ),
  --
  live AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    _TABLE_SUFFIX AS table_name,
    COUNT(*) AS n
  FROM
    `moz-fx-data-shared-prod.org_mozilla_fenix_live.*`
  WHERE
    DATE(submission_timestamp) >= "2019-07-31"
  GROUP BY
    1,
    2 )
  --
SELECT
  table_name,
  submission_date,
  stable.n AS stable_n,
  live.n AS live_n,
  (1 - (stable.n / live.n)) * 100 AS pct_dupes
FROM
  live
JOIN
  stable
USING
  (submission_date,
    table_name)
ORDER BY 1, 2
| Row | table_name        | submission_date | stable_n | live_n | pct_dupes             |
|-----|-------------------|-----------------|----------|--------|-----------------------|
| 1   | activation_v1     | 2019-07-31      | 2731     | 2733   | 0.07317965605562193   |
| 2   | activation_v1     | 2019-08-01      | 3911     | 3911   | 0.0                   |
| 3   | baseline_v1       | 2019-07-31      | 442118   | 442178 | 0.013569196115592597  |
| 4   | baseline_v1       | 2019-08-01      | 444724   | 444755 | 0.006970129621930443  |
| 5   | bookmarks_sync_v1 | 2019-07-31      | 141702   | 141708 | 0.004234058768737281  |
| 6   | events_v1         | 2019-07-31      | 282118   | 282170 | 0.0184286068682038    |
| 7   | events_v1         | 2019-08-01      | 286195   | 286211 | 0.005590281295964061  |
| 8   | history_sync_v1   | 2019-07-31      | 141626   | 141635 | 0.0063543615631767025 |
| 9   | metrics_v1        | 2019-07-31      | 48321    | 48323  | 0.004138815884779934  |
| 10  | metrics_v1        | 2019-08-01      | 47474    | 47476  | 0.004212654815061168  |
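For example, for baseline_v1 on 2019-07-31 the rate works out to (1 - 442118 / 442178) * 100 ≈ 0.0136%, matching the pct_dupes column above.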
