Add clients_daily query #74

relud · 2019-04-08T21:16:04Z

No description provided.

jklukas · 2019-04-09T13:37:37Z

Is this using the auto-generation machinery, or is this hand-built? Is this expected to match exactly with the imported parquet-based clients_daily?

jklukas

This is pretty elegant. 300 lines of easy-to-scan SQL.

jklukas · 2019-04-09T13:25:14Z

udf/udf_require_whole_geo.sql

+  udf_require_whole_geo(country STRING,
+    city STRING,
+    geo_subdivision1 STRING,
+    geo_subdivision2 STRING) AS ( IF(country != '??',


Do you know where the ?? comes from? That's always been surprising to me, and it would be great to have it documented in a comment here.

I'll add a comment, it comes from hindsight's geoip decoding

jklukas · 2019-04-09T13:39:46Z

sql/clients_daily_v7.sql

+        "searchbar",
+        "system",
+        "urlbar")));
+  --


These separators are a nice pattern. It's much easier to see where functions begin and end this way.

jklukas · 2019-04-09T13:41:07Z

sql/clients_daily_v7.sql

+      NULL));
+  --
+WITH
+  -- normalize client_id and rank by document_id


Suggested change

-- normalize client_id and rank by document_id

-- normalize client_id and partition by document_id

The usage of "rank" here is confusing to me.

I meant to use order by, sorry. I wrote this comment a while back and I'm not sure what I was intending, especially since rank has a pretty specific meaning in a BigQuery SQL context.

Oh, I see now. Partition by is definitely better, using that.

jklukas · 2019-04-09T13:45:37Z

tests/clients_daily_v7/test_first_value_falsey/main_summary_v4.schema.json

@@ -0,0 +1,8736 @@
+[


I initially was expecting this to be the main summary json schema from mozilla-pipeline-schemas. Can we adopt a naming convention here that specifies bq schema? foo_v1.bq.schema.json perhaps? Ideally, this should match the naming we use for bq schemas generated in m-p-s. cc @acmiyaguchi

Yes, we can. For this PR I was just continuing to use what was already in place.

Feel free to ignore for the scope of this PR.

jklukas · 2019-04-09T13:51:17Z

udf/udf_null_if_empty_list.sql

+CREATE TEMP FUNCTION
+  udf_null_if_empty_list(list ANY TYPE) AS ( IF(ARRAY_LENGTH(list.list) > 0,
+      list,
+      NULL) );


Annoying. Seems like this should be equivalent to NULLIF(list, []) but apparently nullif doesn't work with array types? Can you add a comment to explain why this needs to exist?

I will add a comment, it throws this error: NULLIF is not defined for arguments of type ARRAY<STRUCT<element STRING>>
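For readers following along, here is a rough Python model of what udf_null_if_empty_list appears to do, based only on the diff above. It assumes the argument is a struct wrapping an array in a field named list (as in the ARRAY<STRUCT<element STRING>> error message), modeled here as a dict. The Python function name and the dict representation are illustrative assumptions, not part of the PR.

```python
from typing import Optional


def null_if_empty_list(value: Optional[dict]) -> Optional[dict]:
    """Return the struct unchanged if its inner list has elements, else None.

    `value` stands in for a BigQuery STRUCT with a `list` array field.
    A missing struct or a missing/empty inner list both map to None,
    mirroring IF(ARRAY_LENGTH(list.list) > 0, list, NULL), where a NULL
    length makes the condition fall through to the NULL branch.
    """
    if value is None:
        return None
    inner = value.get("list")
    if inner is not None and len(inner) > 0:
        return value
    return None


populated = {"list": [{"element": "a"}]}
empty = {"list": []}
```

With these inputs, null_if_empty_list(populated) hands back the same struct, while null_if_empty_list(empty) and null_if_empty_list(None) both yield None.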

relud · 2019-04-09T16:17:26Z

Is this using the auto-generation machinery, or is this hand-built? Is this expected to match exactly with the imported parquet-based clients_daily?

This is hand-built to exactly match the schema of clients_daily_v6 (the imported parquet-based one). Matching the data exactly isn't really possible because v6 uses FIRST_VALUE with no ordering and it's very non-trivial to add ordering, so it's effectively ANY_VALUE. So we could probably use udf_mode_last instead, I just forgot about switching to that when I hit the PR button.

jklukas · 2019-04-09T16:25:55Z

So we could probably use udf_mode_last instead, I just forgot about switching to that when I hit the PR button

I think we might as well go ahead and do that if we aren't able to match existing clients_daily exactly anyway.

I'm also just realizing that since we're ordering by timestamp DESC, we have probably been doing the wrong thing (breaking ties by taking the earliest seen value rather than latest) by using mode_last. Should we be using udf_mode_first instead?

relud · 2019-04-09T16:35:01Z

I think mode_last is still the right method, and we should consider fixing and backfilling the other tables that coincidentally did mode_first

jklukas · 2019-04-09T16:57:42Z

I think mode_last is still the right method, and we should consider fixing and backfilling the other tables that coincidentally did mode_first

Got it. I see now that you're sorting ASC here in both frames, so mode_last is correct. I agree that other tables should be fixed to match the pattern used here, which is easier to reason about than sorting DESC. Breaking out to #75
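To make the tie-breaking point concrete, here is a small Python sketch. It assumes mode_last returns the most frequent value and breaks ties in favor of the value that occurs latest in the input order, and that mode_first breaks ties in favor of the value that occurs earliest. These Python functions are illustrative stand-ins, not the actual UDF definitions, and the sample values are made up.

```python
from collections import Counter


def mode_last(values):
    """Most frequent value; ties go to the value whose last occurrence is latest."""
    if not values:
        return None
    counts = Counter(values)
    # Later assignments overwrite earlier ones, leaving the last index per value.
    last_index = {v: i for i, v in enumerate(values)}
    return max(counts, key=lambda v: (counts[v], last_index[v]))


def mode_first(values):
    """Most frequent value; ties go to the value whose first occurrence is earliest."""
    if not values:
        return None
    counts = Counter(values)
    first_index = {}
    for i, v in enumerate(values):
        first_index.setdefault(v, i)
    return max(counts, key=lambda v: (counts[v], -first_index[v]))


# Hypothetical values for one client, ordered by timestamp ASC.
asc = ["beta", "release", "beta", "release"]
# The same values ordered by timestamp DESC.
desc = list(reversed(asc))

latest_with_asc = mode_last(asc)      # tie broken toward the latest seen value
earliest_with_desc = mode_last(desc)  # same function, DESC input: earliest seen value wins
fixed_with_desc = mode_first(desc)    # mode_first on DESC input matches mode_last on ASC
```

Under these assumptions, mode_last over ASC-sorted input and mode_first over DESC-sorted input agree, while mode_last over DESC-sorted input silently picks the earliest seen value on ties, which is the concern raised for the other tables.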

relud · 2019-04-09T19:02:13Z

I've updated this to use mode_last, but it's throwing this error: "BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex." So we'll see how tricky I have to get to avoid that.

relud · 2019-04-10T19:30:53Z

I was able to do mode_last without getting a too complex exception by combining all of the udf_mode_last UDFs into a single (nested) subquery expression, but it's not pretty, so I'm going to break that out into a separate PR for further discussion.


relud requested a review from jklukas April 8, 2019 21:16

jklukas approved these changes Apr 9, 2019


jklukas mentioned this pull request Apr 9, 2019

Nondesktop and FxA daily tables may be using wrong sorting order #75

Closed

Add clients_daily query

4daaa7e

relud force-pushed the desktop_clients_daily branch from faac723 to 4daaa7e Compare April 10, 2019 19:24

relud merged commit 2e81993 into master Apr 10, 2019

relud deleted the desktop_clients_daily branch April 10, 2019 19:31

quiiver pushed a commit that referenced this pull request Jun 25, 2024

Merge pull request #74 from mozilla/remove_outdated_KPI_forecasts

1fddd61

Remove outdated forecasts.
