Add clients_daily query #74
Is this using the auto-generation machinery, or is this hand-built? Is this expected to match exactly with the imported parquet-based clients_daily?
This is pretty elegant. 300 lines of easy-to-scan SQL.
udf_require_whole_geo(country STRING,
  city STRING,
  geo_subdivision1 STRING,
  geo_subdivision2 STRING) AS ( IF(country != '??',
Do you know where the `??` comes from? That's always been surprising to me, and it would be great to have it documented in a comment here.
I'll add a comment. It comes from hindsight's geoip decoding.
"searchbar",
"system",
"urlbar")));
--
These separators are a nice pattern. It's much easier to see where functions begin and end this way.
NULL));
--
WITH
-- normalize client_id and rank by document_id
Suggested change:
- -- normalize client_id and rank by document_id
+ -- normalize client_id and partition by document_id
The usage of "rank" here is confusing to me.
I meant to use `order by`, sorry. I wrote this comment a while back and I'm not sure what I was intending, especially since `rank` has a pretty specific meaning in a BigQuery SQL context.
Oh, I see now. `partition by` is definitely better, so I'm using that.
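To make the distinction in this thread concrete, here is a minimal sketch of partitioning by `document_id` and ordering within each partition. It is not the query from this PR: the sample rows, the `submission_timestamp` column, the `n` alias, and the use of `LOWER` as the "normalize" step are all assumptions made for illustration.

```sql
-- Hypothetical sample data; the column names are assumptions, not the PR's schema.
WITH sample AS (
  SELECT 'doc-1' AS document_id, 'ABC' AS client_id,
         TIMESTAMP '2019-01-01 00:00:00' AS submission_timestamp
  UNION ALL
  SELECT 'doc-1', 'ABC', TIMESTAMP '2019-01-01 00:05:00'
  UNION ALL
  SELECT 'doc-2', 'DEF', TIMESTAMP '2019-01-01 00:10:00'
),
-- normalize client_id and partition by document_id
numbered AS (
  SELECT
    document_id,
    LOWER(client_id) AS client_id,  -- assumed normalization, for illustration only
    submission_timestamp,
    -- PARTITION BY groups rows that share a document_id;
    -- ORDER BY decides which row in each group is numbered 1.
    ROW_NUMBER() OVER (
      PARTITION BY document_id
      ORDER BY submission_timestamp
    ) AS n
  FROM sample
)
-- Keep one row per document_id (the earliest under the ordering above).
SELECT * EXCEPT (n)
FROM numbered
WHERE n = 1;
```

In this sketch `PARTITION BY` does the grouping and `ORDER BY` does the sorting inside each group. `RANK()` is a separate window function that assigns equal numbers to ties, which is why the word "rank" in the original comment was easy to misread.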
@@ -0,0 +1,8736 @@
[
I initially was expecting this to be the main summary JSON schema from mozilla-pipeline-schemas. Can we adopt a naming convention here that specifies bq schema? `foo_v1.bq.schema.json` perhaps? Ideally, this should match the naming we use for bq schemas generated in m-p-s. cc @acmiyaguchi
Yes, we can. For this PR I was just continuing to use what was already in place.
Feel free to ignore for the scope of this PR.
CREATE TEMP FUNCTION
  udf_null_if_empty_list(list ANY TYPE) AS ( IF(ARRAY_LENGTH(list.list) > 0,
      list,
      NULL) );
Annoying. Seems like this should be equivalent to `NULLIF(list, [])`, but apparently `NULLIF` doesn't work with array types? Can you add a comment to explain why this needs to exist?
I will add a comment. It throws this error: `NULLIF is not defined for arguments of type ARRAY<STRUCT<element STRING>>`
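As a rough sketch of what the commented UDF could look like, the snippet below repeats the function from the diff with an explanatory comment and two sample calls. The comment wording and the sample inputs are assumptions. In particular, the input is assumed to be a STRUCT with a `list` field of type `ARRAY<STRUCT<element STRING>>`, inferred from `list.list` and the error message above.

```sql
-- NULLIF(list, []) would be the obvious spelling, but BigQuery rejects it:
--   NULLIF is not defined for arguments of type ARRAY<STRUCT<element STRING>>
-- so the empty-list check is written out with ARRAY_LENGTH instead.
CREATE TEMP FUNCTION
  udf_null_if_empty_list(list ANY TYPE) AS (
    IF(ARRAY_LENGTH(list.list) > 0, list, NULL)
  );
--
-- Hypothetical inputs: one non-empty list and one empty list.
SELECT
  udf_null_if_empty_list(
    STRUCT([STRUCT('a' AS element)] AS list)) AS kept,
  udf_null_if_empty_list(
    STRUCT(ARRAY<STRUCT<element STRING>>[] AS list)) AS nulled;
```

Under these assumptions, `kept` returns the struct unchanged and `nulled` comes back as NULL.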
This is hand-built to exactly match the schema of clients_daily_v6 (the imported parquet-based one). Matching the data exactly isn't really possible because v6 uses |
I think we might as well go ahead and do that if we aren't able to match existing clients_daily exactly anyway. I'm also just realizing that since we're ordering by timestamp DESC, we have probably been doing the wrong thing (breaking ties by taking the earliest seen value rather than latest) by using |
I think |
Got it. I see now that you're sorting ASC here in both frames, so |
I've updated this to use |
Force-pushed from faac723 to 4daaa7e.
I was able to do |
No description provided.