Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add first phase of FxA ETL for calculating MAU #32

Merged
merged 2 commits into from
Mar 26, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions sql/firefox_accounts_exact_mau28_by_dimensions_v1.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
CREATE OR REPLACE VIEW
`moz-fx-data-derived-datasets.analysis.firefox_accounts_exact_mau28_by_dimensions_v1`
AS
SELECT
jklukas marked this conversation as resolved.
Show resolved Hide resolved
* EXCEPT (generated_time, country),
-- We normalize country to match the two-digit country codes that appear in
-- telemetry data, so that this view is compatible with the exact_mau28 views
-- for desktop and nondesktop.
CASE country
WHEN 'United States' THEN 'US'
WHEN 'France' THEN 'FR'
WHEN 'Germany' THEN 'DE'
WHEN 'United Kingdom' THEN 'UK'
WHEN 'Canada' THEN 'CA'
ELSE 'Other'
END AS country
FROM
`moz-fx-data-derived-datasets.analysis.firefox_accounts_exact_mau28_raw_v1`
39 changes: 39 additions & 0 deletions sql/firefox_accounts_exact_mau28_raw_v1.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
WITH
inactive_days AS (
SELECT
*,
DATE_DIFF(submission_date, date_last_seen, DAY) AS _inactive_days,
DATE_DIFF(submission_date, date_last_seen_in_tier1_country, DAY) AS _inactive_days_tier1
FROM
fxa_users_last_seen_v1
)

SELECT
submission_date,
CURRENT_DATETIME() AS generated_time,
COUNTIF(_inactive_days < 28) AS mau,
COUNTIF(_inactive_days < 7) AS wau,
COUNTIF(_inactive_days < 1) AS dau,
-- We are generally using an "exclusive dimensions" methodology where only
-- the last country observed for a user is considered for determining whether
-- they contribute to tier 1 MAU, but we also include an "inclusive" tier 1
-- mau calculation that we have previously been using for KPI calculations on
-- FxA data; we assign a single country per user per day and include a user
-- in this calculation if they were assigned a tier 1 country in any of the
-- 28 days of the MAU window.
jklukas marked this conversation as resolved.
Show resolved Hide resolved
COUNTIF(_inactive_days_tier1 < 28) AS mau_tier1_inclusive,
-- We hash user_ids into 20 buckets to aid in computing
-- confidence intervals for mau/wau/dau sums; the particular hash
-- function and number of buckets is subject to change in the future.
MOD(ABS(FARM_FINGERPRINT(user_id)), 20) AS id_bucket,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by: should we shift to non-negative range rather than taking ABS(...) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be complex to do so. FARM_FINGERPRINT returns a signed INT64 and BQ does not have any larger or unsigned integer variants, so we don't have a type to shift into.

Is your concern that ABS is introducing a bias between buckets?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's my concern. It may not matter, and I agree that working around it seems overly complex here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would shift mod to positive, and guarantee we don't get a int overflow if we hit -0x8000000000000000:

Suggested change
MOD(ABS(FARM_FINGERPRINT(user_id)), 20) AS id_bucket,
MOD(MOD(FARM_FINGERPRINT(user_id), 20) + 20, 20) AS id_bucket,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to stick with the simpler form for now. I've broken out the hash function discussion to #36 so it can be evaluated as a separate project in the future.

country
FROM
inactive_days
WHERE
-- First data is on 2017-10-01, so we start 28 days later for first complete MAU value.
submission_date >= DATE '2017-10-28'
--AND submission_date = @submission_date
GROUP BY
submission_date,
id_bucket,
country
15 changes: 15 additions & 0 deletions sql/firefox_accounts_exact_mau28_v1.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
CREATE OR REPLACE VIEW
`moz-fx-data-derived-datasets.analysis.firefox_accounts_exact_mau28_v1`
AS
SELECT
submission_date,
SUM(mau) AS mau,
SUM(wau) AS wau,
SUM(dau) AS dau,
SUM(IF(country IN ('US', 'FR', 'DE', 'UK', 'CA'), mau, 0)) AS tier1_mau,
SUM(IF(country IN ('US', 'FR', 'DE', 'UK', 'CA'), wau, 0)) AS tier1_wau,
SUM(IF(country IN ('US', 'FR', 'DE', 'UK', 'CA'), dau, 0)) AS tier1_dau
FROM
`moz-fx-data-derived-datasets.analysis.firefox_accounts_exact_mau28_by_dimensions_v1`
GROUP BY
submission_date
64 changes: 64 additions & 0 deletions sql/fxa_users_daily_v1.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
CREATE TEMP FUNCTION udf_mode_last(x ANY TYPE) AS ((
SELECT
val
FROM (
SELECT
val,
COUNT(val) AS n,
MAX(offset) AS max_offset
FROM
UNNEST(x) AS val
WITH OFFSET AS offset
GROUP BY
val
ORDER BY
n DESC,
max_offset DESC
)
LIMIT 1
));

WITH
windowed AS (
SELECT
event_date AS submission_date,
CURRENT_DATETIME() AS generated_time,
user_id,
ROW_NUMBER() OVER w1_unframed AS _n,
udf_mode_last(ARRAY_AGG(country) OVER w1) AS country
FROM
-- We've done a one-time import of historical FxA data into static.fxa_amplitude_export_event_date;
-- this table contains events exported from Amplitude and ends on 2019-03-17;
-- we only consider this table for now, but we will soon be unioning this with
-- data from a separate BigQuery project that does daily imports of FxA logs.
static.fxa_amplitude_export_event_date
WHERE
user_id IS NOT NULL
AND event_type NOT IN (
'fxa_email - bounced', 'fxa_email - click', 'fxa_email - sent',
'fxa_reg - password_blocked','fxa_reg - password_common', 'fxa_reg - password_enrolled',
'fxa_reg - password_missing','fxa_sms - sent', 'mktg - email_click',
'mktg - email_open','mktg - email_sent','sync - repair_success',
'sync - repair_triggered')
AND event_date = @submission_date
WINDOW
w1 AS (
PARTITION BY
user_id,
event_date
ORDER BY
event_time DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
-- We must provide a modified window for ROW_NUMBER which cannot accept a frame clause.
w1_unframed AS (
PARTITION BY
user_id,
event_date
ORDER BY
event_time DESC) )
SELECT
* EXCEPT (_n)
FROM
windowed
WHERE
_n = 1
13 changes: 13 additions & 0 deletions sql/fxa_users_last_seen_v1.init.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
SELECT
submission_date,
CURRENT_DATETIME() AS generated_time,
submission_date AS date_last_seen,
IF(country IN ('United States', 'France', 'Germany', 'United Kingdom', 'Canada'),
submission_date,
NULL) AS date_last_seen_in_tier1_country,
* EXCEPT (submission_date, generated_time)
FROM
fxa_users_daily_v1
WHERE
-- 2017-10-01 is the first date in the Amplitude FxA project.
submission_date = DATE '2017-10-01'
relud marked this conversation as resolved.
Show resolved Hide resolved
54 changes: 54 additions & 0 deletions sql/fxa_users_last_seen_v1.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
WITH
current_sample AS (
SELECT
-- Record the last day on which we received any FxA event at all from this user.
@submission_date AS date_last_seen,
-- Record the last day on which the user was in a "Tier 1" country;
-- this allows a variant of country-segmented MAU where we can still count
-- a user that appeared in one of the target countries in the previous
-- 28 days even if the most recent "country" value is not in this set.
IF(country IN (
'United States',
'France',
'Germany',
'United Kingdom',
'Canada'),
@submission_date,
NULL) AS date_last_seen_in_tier1_country,
* EXCEPT (submission_date,
generated_time)
FROM
fxa_users_daily_v1
WHERE
submission_date = @submission_date ),
previous AS (
SELECT
* EXCEPT (submission_date,
generated_time)
-- We use REPLACE to null out any last_seen observations older than 28 days;
-- this ensures data never bleeds in from outside the target 28 day window.
REPLACE (IF(date_last_seen_in_tier1_country > DATE_SUB(@submission_date, INTERVAL 28 DAY),
date_last_seen_in_tier1_country,
NULL) AS date_last_seen_in_tier1_country)
FROM
fxa_users_last_seen_v1
WHERE
submission_date = DATE_SUB(@submission_date, INTERVAL 1 DAY)
AND date_last_seen > DATE_SUB(@submission_date, INTERVAL 28 DAY) )
SELECT
@submission_date AS submission_date,
CURRENT_DATETIME() AS generated_time,
COALESCE(current_sample.date_last_seen,
previous.date_last_seen) AS date_last_seen,
jklukas marked this conversation as resolved.
Show resolved Hide resolved
COALESCE(current_sample.date_last_seen_in_tier1_country,
previous.date_last_seen_in_tier1_country) AS date_last_seen_in_tier1_country,
IF(current_sample.user_id IS NOT NULL,
current_sample,
previous).* EXCEPT (date_last_seen,
date_last_seen_in_tier1_country)
FROM
current_sample
FULL JOIN
previous
USING
(user_id)