Draft documentation for DS workflow for adding to clients_last_seen #579
I will be shopping around this workflow to some Data Scientists for review. It is intended to empower them to explore new feature usage definitions in a manner that has a clear path to being included in clients_last_seen if it proves useful. Relevant to the [Data Warehouse daily aggregations sub-project](https://docs.google.com/document/d/1Lml-hWiqhvUazjn_-sDF6TUvwAEA3jMG3LJbyIC7fEc/edit#).
cc @felixlawrence @irrationalagent @SuYoungHong @godelstheory for feedback. Does this seem usable? Is there obvious value in defining this workflow? If this does seem valuable, my goal would be to use this as part of a coherent strategy for what it looks like to add fields to clients_last_seen.
This does look like it would be useful to me. Many feature usage metrics are based on events. At first glance, it seems like this could be adapted to use the events table by aggregating with something like LOGICAL_OR per client-day. Am I missing anything else there?
Yes, that's quite feasible. Jesse prototyped some events-based usage criteria in mozilla/bigquery-etl#1193 using that pattern. The path for getting such a feature integrated into clients_last_seen is unpaved, though, so there would be discussions needed to figure out what it's like to productionize (which I'm happy to shepherd if we get there).
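For readers following along, the per-client-day aggregation discussed above can be sketched in plain Python. This is only an illustration of the logical-OR idea, not the query from the PR: the event dictionaries, the `example_feature` category, and the helper name `usage_by_client_day` are all made up for the example.

```python
from collections import defaultdict
from datetime import date


def usage_by_client_day(events, is_feature_event):
    """Collapse raw events into one boolean per (client_id, submission_date)."""
    used = defaultdict(bool)
    for event in events:
        key = (event["client_id"], event["submission_date"])
        # Logical OR: a single matching event marks the whole day as a usage day.
        used[key] = used[key] or is_feature_event(event)
    return dict(used)


# Hypothetical events; the category names are invented for illustration.
events = [
    {"client_id": "a", "submission_date": date(2020, 1, 1), "category": "other"},
    {"client_id": "a", "submission_date": date(2020, 1, 1), "category": "example_feature"},
    {"client_id": "b", "submission_date": date(2020, 1, 1), "category": "other"},
]

daily_usage = usage_by_client_day(events, lambda e: e["category"] == "example_feature")
```

Each resulting boolean is the kind of per-day value that would then feed a bit pattern, one bit per day.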
your specific new field are marked between `-- BEGIN` and `-- END` comments.

The example queries `main_v4` directly in order to be as generic as possible.
The `daily` CTE below could be removed in the case that `clients_daily` already |
FWIW, I suspect that >90% of the time, `clients_daily` will be enough. From this information, it was obvious enough to me how I could use `clients_daily`.
Looks good to me. I think that none of my current or planned work truly needs this (though I could well be overlooking something), but it's a good thing to have in the toolbox, and this is a comprehensible explanation of the tool.
This is already proving very useful for the core actives work, where we are constrained to URI >= 1 as the active-day definition. It is straightforward to create a core actives table to join upon, with long enough lookbacks for forecasting.
Can you point me to this work? I'd be very interested to see how you've chosen to adapt this approach. Are you using the BYTES fields to look at more than 28 days of history from a single row?
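To make the question about longer histories concrete, here is a rough Python sketch of a byte-string bit pattern where the lowest bit is the most recent day. It is an assumption-laden illustration, not how the BYTES fields are actually built: the 16-byte (128-day) width and the helper names `shift_in_day` and `active_days` are hypothetical.

```python
def shift_in_day(history: bytes, active_today: bool) -> bytes:
    """Age every day by one position and record today in the lowest bit."""
    width = len(history) * 8
    value = int.from_bytes(history, "big")
    # Bits shifted past the fixed width fall off the end of the history.
    value = ((value << 1) | int(active_today)) & ((1 << width) - 1)
    return value.to_bytes(len(history), "big")


def active_days(history: bytes, window: int) -> int:
    """Count active days among the most recent `window` days."""
    value = int.from_bytes(history, "big")
    return bin(value & ((1 << window) - 1)).count("1")


# Assumed width: 16 bytes gives 128 days of history in one value.
history = bytes(16)
for active in (True, False, True):
    history = shift_in_day(history, active)
recent = active_days(history, 28)
```

Under these assumptions, a single row can answer lookback questions for any window up to the width of the byte string.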
I'm coming to the conclusion that a general `clients_all_time` or `feature_usage_all_time` table is not feasible in the short term, but the technique is still useful for rapid prototyping of bit patterns that can be used to investigate new user segmentation, feature usage definitions, etc.