Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add clients_daily query #74

Merged
merged 1 commit into from
Apr 10, 2019
Merged

Add clients_daily query #74

merged 1 commit into from
Apr 10, 2019

Conversation

relud
Copy link
Collaborator

@relud relud commented Apr 8, 2019

No description provided.

@relud relud requested a review from jklukas April 8, 2019 21:16
@jklukas
Copy link
Contributor

jklukas commented Apr 9, 2019

Is this using the auto-generation machinery, or is this hand-built? Is this expected to match exactly with the imported parquet-based clients_daily?

Copy link
Contributor

@jklukas jklukas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty elegant. 300 lines of easy-to-scan SQL.

udf_require_whole_geo(country STRING,
city STRING,
geo_subdivision1 STRING,
geo_subdivision2 STRING) AS ( IF(country != '??',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know where the ?? comes from? That's always been surprising to me, and it would be great to have it documented in a comment here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment, it comes from hindsight's geoip decoding

"searchbar",
"system",
"urlbar")));
--
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These separators are a nice pattern. It's much easier to see where functions begin and end this way.

NULL));
--
WITH
-- normalize client_id and rank by document_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-- normalize client_id and rank by document_id
-- normalize client_id and partition by by document_id

The usage of "rank" here is confusing to me.

Copy link
Collaborator Author

@relud relud Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to use order by, sorry I wrote this comment a while back, not sure what i was intending, especially since rank has pretty specific meaning in a bigquery sql context

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, i see now. partition by is definitely better, using that.

@@ -0,0 +1,8736 @@
[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially was expecting this to be the main summary json schema from mozilla-pipeline-schemas. Can we adopt a naming convention here that specifies bq schema? foo_v1.bq.schema.json perhaps? Ideally, this should match the naming we use for bq schemas generated in m-p-s. cc @acmiyaguchi

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can, for this PR i was just continuing to use what was already in place

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to ignore for the scope of this PR.

CREATE TEMP FUNCTION
udf_null_if_empty_list(list ANY TYPE) AS ( IF(ARRAY_LENGTH(list.list) > 0,
list,
NULL) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annoying. Seems like this should be equivalent to NULLIF(list, []) but apparently nullif doesn't work with array types? Can you add a comment to explain why this needs to exist?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a comment, it throws this error: NULLIF is not defined for arguments of type ARRAY<STRUCT<element STRING>>

@relud
Copy link
Collaborator Author

relud commented Apr 9, 2019

Is this using the auto-generation machinery, or is this hand-built? Is this expected to match exactly with the imported parquet-based clients_daily?

This is hand-built to exactly match the schema of clients_daily_v6 (the imported parquet-based one). Matching the data exactly isn't really possible because v6 uses FIRST_VALUE with no ordering and it's very non-trivial to add ordering, so it's effectively ANY_VALUE. So we could probably use udf_mode_last instead, I just forgot about switching to that when I hit the PR button.

@jklukas
Copy link
Contributor

jklukas commented Apr 9, 2019

So we could probably use udf_mode_last instead, I just forgot about switching to that when I hit the PR button

I think we might as well go ahead and do that if we aren't able to match existing clients_daily exactly anyway.

I'm also just realizing that since we're ordering by timestamp DESC, we have probably been doing the wrong thing (breaking ties by taking the earliest seen value rather than latest) by using mode_last. Should we be using udf_mode_first instead?

@relud
Copy link
Collaborator Author

relud commented Apr 9, 2019

I think mode_last is still the right method, and we should consider fixing and backfilling the other tables that coincidentally did mode_first

@jklukas
Copy link
Contributor

jklukas commented Apr 9, 2019

I think mode_last is still the right method, and we should consider fixing and backfilling the other tables that coincidentally did mode_first

Got it. I see now that you're sorting ASC here in both frames, so mode_last is correct. I agree that other tables should be fixed to match the pattern used here, which is easier to reason about than sorting DESC. Breaking out to #75

@relud
Copy link
Collaborator Author

relud commented Apr 9, 2019

I've updated this to use mode_last, but it's throwing BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex. so we'll see how tricky i have to get to avoid that

@relud
Copy link
Collaborator Author

relud commented Apr 10, 2019

I was able to do mode_last without getting a too complex exception by combining all of the udf_mode_last UDFs into a single (nested) subquery expression, but it's not pretty, so I'm going to break that out into a separate PR for further discussion.

@relud relud merged commit 2e81993 into master Apr 10, 2019
@relud relud deleted the desktop_clients_daily branch April 10, 2019 19:31
quiiver pushed a commit that referenced this pull request Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants