Draft documentation for DS workflow for adding to clients_last_seen #579

Merged
merged 9 commits into from Nov 17, 2020

@jklukas jklukas commented Nov 4, 2020

I will be shopping around this workflow to some Data Scientists for review. It is intended to empower them to explore new feature usage definitions in a manner that has a clear path to being included in clients_last_seen if it proves useful.

I'm coming to the conclusion that a general clients_all_time or feature_usage_all_time table is not feasible in the short term, but the technique is still useful for rapid prototyping of bit patterns that can be used to investigate new user segmentation, feature usage definitions, etc.

Relevant to the [Data Warehouse daily aggregations sub-project](https://docs.google.com/document/d/1Lml-hWiqhvUazjn_-sDF6TUvwAEA3jMG3LJbyIC7fEc/edit#).
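For readers new to the bit-pattern technique mentioned above, here is a minimal Python sketch of the idea, not the production SQL. It assumes a 28-day window stored in an integer whose lowest bit represents the most recent day; the function names `update_pattern`, `days_since_seen`, and `active_days_in_last` are hypothetical and chosen only for illustration.

```python
from typing import Optional

WINDOW_DAYS = 28                  # assumed window length
MASK = (1 << WINDOW_DAYS) - 1     # keeps only the lowest 28 bits


def update_pattern(previous: int, active_today: bool) -> int:
    """Age the pattern by one day and record today in the lowest bit."""
    return ((previous << 1) & MASK) | int(active_today)


def days_since_seen(pattern: int) -> Optional[int]:
    """Return days since the most recent active day, or None if never active."""
    if pattern == 0:
        return None
    lowest_set_bit = pattern & -pattern
    return lowest_set_bit.bit_length() - 1


def active_days_in_last(pattern: int, days: int) -> int:
    """Count active days among the most recent `days` days."""
    return bin(pattern & ((1 << days) - 1)).count("1")


# Usage: a client active three days ago and inactive since.
pattern = 0
for active in [True, False, False, False]:
    pattern = update_pattern(pattern, active)
print(days_since_seen(pattern), active_days_in_last(pattern, 7))
```

Because each day's update only needs the previous day's pattern plus one new boolean, a new usage definition can be prototyped by supplying a different per-day boolean.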

@jklukas jklukas changed the title from "Alltime workflow" to "Draft documentation for DS workflow for adding to clients_last_seen" Nov 4, 2020

jklukas commented Nov 5, 2020

cc @felixlawrence @irrationalagent @SuYoungHong @godelstheory for feedback. Does this seem usable? Is there obvious value in defining this workflow?

If this does seem valuable, my goal would be to use this as part of a coherent strategy for what it looks like to add fields to clients_last_seen.

@irrationalagent

This does look like it would be useful to me.

Many feature usage metrics are based on events. At first glance, it seems like this could be adapted to use the events table by aggregating LOGICAL_OR-like per client-day. Am I missing anything else there?


jklukas commented Nov 5, 2020

> It seems like this could be adapted to use the events table by aggregating LOGICAL_OR-like per client-day

Yes, that's quite feasible. Jesse prototyped some events-based usage criteria in mozilla/bigquery-etl#1193 using that pattern. The path for getting such a feature integrated into clients_last_seen is unpaved, though, so there would be discussions needed to figure out what it's like to productionize (which I'm happy to shepherd if we get there).
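As a rough illustration of the per client-day aggregation discussed here, the following Python sketch collapses event rows into one boolean per client and day, which is what a LOGICAL_OR-style aggregate would produce. The row layout, the event names, and the helper `daily_feature_usage` are all hypothetical; this is not the query from the referenced prototype.

```python
from collections import defaultdict

# Hypothetical event rows: (client_id, submission_date, event_name).
EVENTS = [
    ("client-a", "2020-11-01", "feature_x.used"),
    ("client-a", "2020-11-01", "other.event"),
    ("client-a", "2020-11-02", "other.event"),
    ("client-b", "2020-11-01", "other.event"),
    ("client-b", "2020-11-02", "feature_x.used"),
]


def daily_feature_usage(events, is_feature_event):
    """Return {(client_id, day): bool}, true if any event that day matched."""
    used = defaultdict(bool)
    for client_id, day, event_name in events:
        # OR each event into the client-day flag, like a LOGICAL_OR aggregate.
        used[(client_id, day)] |= bool(is_feature_event(event_name))
    return dict(used)


usage = daily_feature_usage(EVENTS, lambda name: name == "feature_x.used")
for key in sorted(usage):
    print(key, usage[key])
```

The resulting per-day boolean is the kind of input a bit pattern needs for each new day.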

your specific new field are marked between `-- BEGIN` and `-- END` comments.

The example queries `main_v4` directly in order to be as generic as possible.
The `daily` CTE below could be removed in the case that `clients_daily` already

FWIW, I suspect that >90% of the time, clients_daily will be enough. From this information, it was obvious enough to me how I could use clients_daily.

@felixlawrence felixlawrence left a comment

Looks good to me. I think that none of my current or planned work truly needs this (though I could well be overlooking something), but it's a good thing to have in the toolbox, and this is a comprehensible explanation of the tool.

@godelstheory godelstheory marked this pull request as ready for review November 9, 2020 23:40
@godelstheory godelstheory marked this pull request as draft November 9, 2020 23:41
@godelstheory godelstheory self-requested a review November 9, 2020 23:42

@godelstheory godelstheory left a comment

This is already proving very useful for the core actives work, where we are constrained to URI>=1 as the active-day definition. It is straightforward to create a core actives table to join upon, with long enough lookbacks for forecasting.

jklukas commented Nov 10, 2020

> It is straightforward to create a core actives table to join upon, with long enough lookbacks for forecasting.

Can you point me to this work? I'd be very interested to see how you've chosen to adapt this approach. Are you using the BYTES fields to look at more than 28 days of history from a single row?
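To make the question about BYTES fields concrete, here is a small Python sketch of how a byte string can hold a longer history than a 28-bit integer pattern. It assumes a fixed-width value with the most recent day in the lowest bit; the 8-byte width (64 days) and the helper names are hypothetical and are not taken from the workflow document.

```python
HISTORY_BYTES = 8  # hypothetical width: 8 bytes = 64 days of history


def shift_history(history: bytes, active_today: bool) -> bytes:
    """Age a fixed-width history by one day and record today in the lowest bit."""
    width = len(history)
    mask = (1 << (8 * width)) - 1
    as_int = int.from_bytes(history, "big")
    updated = ((as_int << 1) & mask) | int(active_today)
    return updated.to_bytes(width, "big")


def active_days_in_last(history: bytes, days: int) -> int:
    """Count active days among the most recent `days` days."""
    as_int = int.from_bytes(history, "big")
    return bin(as_int & ((1 << days) - 1)).count("1")


# Usage: 40 active days in a row, all visible from a single value.
history = bytes(HISTORY_BYTES)
for _ in range(40):
    history = shift_history(history, True)
print(active_days_in_last(history, 28), active_days_in_last(history, 64))
```

The same shift-and-OR update applies as for the integer pattern; only the storage width changes, so lookbacks longer than 28 days can be answered from one row.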

@jklukas jklukas marked this pull request as ready for review November 17, 2020 20:02
@jklukas jklukas merged commit baaa60f into master Nov 17, 2020
@jklukas jklukas deleted the alltime-workflow branch November 17, 2020 20:02