Add core_clients_daily_v1 #8

Merged: 8 commits, Mar 7, 2019
95 changes: 95 additions & 0 deletions sql/core_clients_daily_v1.sql
@@ -0,0 +1,95 @@
CREATE TEMP FUNCTION udf_mode_last(x ANY TYPE) AS ((
SELECT
val
FROM (
SELECT
val,
COUNT(val) AS n,
MAX(offset) AS max_offset
FROM
UNNEST(x) AS val
WITH OFFSET AS offset
GROUP BY
val
ORDER BY
n DESC,
max_offset DESC
)
LIMIT 1
));
Contributor Author:

The mode UDF here is also codified in the `udf/` directory. We don't yet have a way to inject it automatically.


WITH
numbered_duplicates AS (
SELECT
* REPLACE(LOWER(client_id) AS client_id),
ROW_NUMBER() OVER (PARTITION BY client_id, submission_date_s3, metadata.document_id ORDER BY metadata.timestamp DESC) AS n
Contributor:

I thought I saw a PR float by where an "internal-use" field was aliased to begin with an underscore. I thought that was a nice idea, so it might be worth using here for a temporary field we remove in the next CTE step, in case someone adds an `n` field to the core ping.

Contributor Author:

@relud has advocated for this, too. You two are probably right. I've just seen a lot of SQL queries using ROW_NUMBER and WHERE n = 1, so it feels right to me. I have no proof this is a wider standard in the community, though, and you make a good point about potentially running into difficulty with a field named n in the source table.

Breaking out this discussion to #10
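
For reference, a minimal sketch of the underscore convention discussed in this thread (the `_n` alias is hypothetical; the query in this PR keeps `n`):

```sql
-- Hypothetical variant of the dedupe CTEs using an underscore-prefixed
-- temporary field, so a future `n` column in the core ping cannot collide.
numbered_duplicates AS (
  SELECT
    * REPLACE (LOWER(client_id) AS client_id),
    ROW_NUMBER() OVER (
      PARTITION BY client_id, submission_date_s3, metadata.document_id
      ORDER BY metadata.timestamp DESC) AS _n
  FROM
    telemetry_core_parquet_v3
  WHERE
    client_id IS NOT NULL ),
deduplicated AS (
  SELECT
    * EXCEPT (_n)
  FROM
    numbered_duplicates
  WHERE
    _n = 1 )
```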

FROM
telemetry_core_parquet_v3
WHERE
client_id IS NOT NULL ),
-- Deduplicating on document_id is necessary to get valid SUM values.
Collaborator:

❤️

Collaborator:

this is an excellent pattern, and I think we should use it everywhere, and maybe provide it as a non-materialized view

Contributor:

In the future, when we're working with the data ingested directly to BQ, should we consider having a query upstream in the DAG that runs a DELETE of duplicates in the base table?

Contributor Author:

> In the future, when we're working with the data ingested directly to BQ, should we consider having a query upstream in the DAG that runs a DELETE of duplicates in the base table?

Yes, I think this is something we should continue to consider and at some point do a focused test to make sure we have that logic 100% correct. For the time being, though, this non-destructive dedupe within the query is a safe and effective alternative.
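
A sketch of what such an upstream DELETE might look like in BigQuery DML. This is hypothetical and untested; table and field names follow this PR, and since it is destructive it would need exactly the kind of focused validation described above:

```sql
-- Hypothetical destructive dedupe: keep only the newest row
-- per document_id for the day being processed.
DELETE FROM
  telemetry_core_parquet_v3 AS t
WHERE
  t.submission_date_s3 = @submission_date
  AND EXISTS (
    SELECT 1
    FROM telemetry_core_parquet_v3 AS newer
    WHERE newer.submission_date_s3 = t.submission_date_s3
      AND newer.metadata.document_id = t.metadata.document_id
      AND newer.metadata.timestamp > t.metadata.timestamp)
```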

deduplicated AS (
SELECT
* EXCEPT (n)
FROM
numbered_duplicates
WHERE
n = 1 ),
windowed AS (
SELECT
@submission_date AS submission_date,
CURRENT_DATETIME() AS generated_time,
client_id,
ROW_NUMBER() OVER w1 AS n,
-- For now, we're ignoring the following RECORD type fields:
-- accessibility_services
-- experiments
-- searches
--
-- Take the min observed profile creation date.
SAFE.DATE_FROM_UNIX_DATE(MIN(profile_date) OVER w1) AS profile_date,
-- These integer fields are already sums over sessions since last upload,
-- so we sum to represent all uploads in the given day;
-- we set an upper limit of 100K, which covers the 99.9th percentile of durations
-- while avoiding integer overflow on pathological input.
SUM(IF(sessions BETWEEN 0 AND 100000, sessions, 0)) OVER w1 AS sessions,
SUM(IF(durations BETWEEN 0 AND 100000, durations, 0)) OVER w1 AS durations,
SUM(IF(flash_usage BETWEEN 0 AND 100000, flash_usage, 0)) OVER w1 AS flash_usage,
-- For all other dimensions, we use the mode of observed values in the day.
udf_mode_last(ARRAY_AGG(app_name) OVER w1) AS app_name,
udf_mode_last(ARRAY_AGG(os) OVER w1) AS os,
udf_mode_last(ARRAY_AGG(metadata.geo_country) OVER w1) AS country,
udf_mode_last(ARRAY_AGG(metadata.geo_city) OVER w1) AS city,
udf_mode_last(ARRAY_AGG(metadata.app_build_id) OVER w1) AS app_build_id,
udf_mode_last(ARRAY_AGG(metadata.normalized_channel) OVER w1) AS normalized_channel,
udf_mode_last(ARRAY_AGG(locale) OVER w1) AS locale,
udf_mode_last(ARRAY_AGG(osversion) OVER w1) AS osversion,
udf_mode_last(ARRAY_AGG(device) OVER w1) AS device,
udf_mode_last(ARRAY_AGG(arch) OVER w1) AS arch,
udf_mode_last(ARRAY_AGG(default_search) OVER w1) AS default_search,
udf_mode_last(ARRAY_AGG(distribution_id) OVER w1) AS distribution_id,
udf_mode_last(ARRAY_AGG(campaign) OVER w1) AS campaign,
udf_mode_last(ARRAY_AGG(campaign_id) OVER w1) AS campaign_id,
udf_mode_last(ARRAY_AGG(default_browser) OVER w1) AS default_browser,
udf_mode_last(ARRAY_AGG(show_tracker_stats_share) OVER w1) AS show_tracker_stats_share,
udf_mode_last(ARRAY_AGG(metadata_app_version) OVER w1) AS metadata_app_version,
udf_mode_last(ARRAY_AGG(bug_1501329_affected) OVER w1) AS bug_1501329_affected
FROM
deduplicated
WHERE
submission_date_s3 = @submission_date
-- Bug 1501329: avoid the pathological "canary" client_id
AND client_id != 'c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0'
WINDOW
w1 AS (
PARTITION BY
client_id,
submission_date_s3
ORDER BY
metadata.timestamp DESC) )
SELECT
* EXCEPT (n)
FROM
windowed
WHERE
n = 1
13 changes: 13 additions & 0 deletions sql/core_clients_last_seen_v1.init.sql
@@ -0,0 +1,13 @@
SELECT
submission_date,
CURRENT_DATETIME() AS generated_time,
submission_date AS date_last_seen,
IF(country IN ('US', 'FR', 'DE', 'UK', 'CA'),
submission_date,
NULL) AS date_last_seen_in_tier1_country,
* EXCEPT (submission_date, generated_time)
FROM
core_clients_daily_v1
WHERE
-- 2017-01-01 is the first date in telemetry_core_parquet_v3.
submission_date = DATE('2017-01-01')
Collaborator:

❤️

47 changes: 47 additions & 0 deletions sql/core_clients_last_seen_v1.sql
@@ -0,0 +1,47 @@
WITH
current_sample AS (
SELECT
    -- Include dummy dates for the date_last_* fields so the schema matches `previous`.
DATE('2000-01-01') AS date_last_seen,
DATE('2000-01-01') AS date_last_seen_in_tier1_country,
* EXCEPT (submission_date,
generated_time)
FROM
core_clients_daily_v1
WHERE
submission_date = @submission_date ),
previous AS (
SELECT
* EXCEPT (submission_date,
generated_time)
FROM
core_clients_last_seen_v1
WHERE
submission_date = DATE_SUB(@submission_date, INTERVAL 1 DAY)
AND date_last_seen > DATE_SUB(@submission_date, INTERVAL 28 DAY) )
SELECT
@submission_date AS submission_date,
CURRENT_DATETIME() AS generated_time,
  -- Record the last day on which we received any core ping at all from this client.
IF(current_sample.client_id IS NOT NULL,
@submission_date,
previous.date_last_seen) AS date_last_seen,
-- Record the last day on which the client was in a "Tier 1" country;
-- this allows a variant of country-segmented MAU where we can still count
-- a client that appeared in one of the target countries in the previous
-- 28 days even if the most recent "country" value is not in this set.
IF(current_sample.client_id IS NOT NULL
AND current_sample.country IN ('US', 'FR', 'DE', 'UK', 'CA'),
@submission_date,
IF(previous.date_last_seen_in_tier1_country > DATE_SUB(@submission_date, INTERVAL 28 DAY),
previous.date_last_seen_in_tier1_country,
NULL)) AS date_last_seen_in_tier1_country,
IF(current_sample.client_id IS NOT NULL,
current_sample,
previous).* EXCEPT (date_last_seen, date_last_seen_in_tier1_country)
FROM
current_sample
FULL JOIN
previous
USING
(client_id)
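
As an illustration of the country-segmented MAU variant the comment above describes, a hypothetical downstream rollup over this table might look like the following (a sketch only, not part of this PR):

```sql
-- Hypothetical MAU rollup over core_clients_last_seen_v1; rows for a
-- given submission_date already reflect a trailing 28-day window.
SELECT
  submission_date,
  COUNT(DISTINCT client_id) AS mau,
  COUNT(DISTINCT IF(date_last_seen_in_tier1_country >
      DATE_SUB(submission_date, INTERVAL 28 DAY),
    client_id,
    NULL)) AS tier1_mau
FROM
  core_clients_last_seen_v1
WHERE
  submission_date = @submission_date
GROUP BY
  submission_date
```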
39 changes: 39 additions & 0 deletions udf/udf_mode_last.sql
@@ -0,0 +1,39 @@
CREATE TEMP FUNCTION udf_mode_last(x ANY TYPE) AS ((
SELECT
val
FROM (
SELECT
val,
COUNT(val) AS n,
MAX(offset) AS max_offset
FROM
UNNEST(x) AS val
WITH OFFSET AS offset
GROUP BY
val
ORDER BY
n DESC,
max_offset DESC
)
LIMIT 1
));

/*

Returns the most frequently occurring element in an array.

In the case of multiple values tied for the highest count, it returns the value
that appears latest in the array. Nulls are ignored.

Examples:

SELECT udf_mode_last(['foo', 'bar', 'baz', 'bar', 'fred']);
-- bar

SELECT udf_mode_last(['foo', 'bar', 'baz', 'bar', 'baz', 'fred']);
-- baz

SELECT udf_mode_last([null, 'foo', null]);
-- foo

*/