Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record field usage during query execution #40494

Closed
Tracked by #38229
qnkhuat opened this issue Mar 22, 2024 · 5 comments · Fixed by #39671
Closed
Tracked by #38229

Record field usage during query execution #40494

qnkhuat opened this issue Mar 22, 2024 · 5 comments · Fixed by #39671
Assignees
Labels
.Team/BackendComponents also known as BEC
Milestone

Comments

@qnkhuat
Copy link
Contributor

qnkhuat commented Mar 22, 2024

What

We want to record how fields are used during query execution. The schema for field usage.

Column Type remark Example
field_id int
query_execution_id integer FK to query_execution.id
used_in varchar(25) which part of the query this field is used in breakout, aggregation, filter
created_at timestamp with timezone
filter_op varchar(25) the operator of the filter >
aggregation_function varchar(25) name of the aggregation function :sum, :max
breakout_binning_strategy varchar(25) breakout binning strategy num-bins
breakout_binning_num_bins integer breakout num bin binning option 20
breakout_binning_bin_width integer breakout bin width binning option 10
breakout_temporal_unit text breakout temporal unit option

Gathering field usage

We’ll rely on our friends over at the QP team and ask them to help us with an API that takes a query and returns field_usage. The schema for the output is:

[:and
 [:map
  [:clause-type [:enum :expression :breakout :aggregation :filter]]]
 [:multi {:dispatch :clause-type}
  [:expression  [:map
                 [:type                 [:= :expression]]
                 [:field-id             pos-int?]]
  [:breakout    [:map
                 [:type                 [:= :breakout]]
                 [:field-id             pos-int?]
                 [:breakout-param       [:metabase.lib.schema.ref/field.options]]]]
  [:aggregation [:map
                 [:type                 [:= :aggregation]]
                 [:field-id             pos-int?]
                 [:aggregation-function [:enum :avg :count :cum-count :count-where :distinct :var
                                         :max :median :min :percentile :share :stddev :sum :cum-sum :sum-where]]]]
  [:fitler      [:map
                 [:type                 [:= :filter]]
                 [:field-id             pos-int?]
                 [:filter-args        [:sequential :any]
                 [:filter-op           [:enum := :!= :inside :between :< :> :<= :>= :is-null :not-null :is-empty :not-empty :contains :does-not-contain :starts-with :ends-with]]]]]]

Saving field usage

We already have a process that save query_execution asynchronously in

For our task, we’ll extend save-query-execution!* to save field_usage.

@qnkhuat qnkhuat added the .Task Not a part of any Epic, used by the Task Issue Template label Mar 22, 2024
@calherries
Copy link
Contributor

calherries commented Mar 22, 2024

High-level approach looks solid. I have a few ideas:

  1. If a column is used in an expression that is then used in a filter or aggregation, do we intend to store that data anywhere? I'm not sure how we'd use it, but want to make sure we're aware of that case.
  2. Since we're inserting this record in save-query-execution!, we'll have access to the query_execution record. If we add a foreign key on this table, we'll have more flexibility in the future to grab things like the user_id for personalized recommendations.
  3. If we have a foreign key, context is not necessary to store here either.

@qnkhuat
Copy link
Contributor Author

qnkhuat commented Mar 22, 2024

If a column is used in an expression that is then used in a filter or aggregation, do we intend to store that data anywhere? I'm not sure how we'd use it, but want to make sure we're aware of that case.

yep I'm not sure how useful is that either, but if the goal is to find all the fields that were used in the last 15 days then that that should be enough.

Since we're inserting this record in save-query-execution!, we'll have access to the query_execution record. If we add a foreign key on this table, we'll have more flexibility in the future to grab things like the user_id for personalized recommendations.

that's a good idea ! updated the schema.

@calherries
Copy link
Contributor

If we're not going to store any expression related data in v1 of this, maybe we should leave it out of the QP API? I imagine we can save us all some time if we don't require it.

@darksciencebase darksciencebase added .Team/BackendComponents also known as BEC and removed .Task Not a part of any Epic, used by the Task Issue Template labels Mar 25, 2024
@qnkhuat
Copy link
Contributor Author

qnkhuat commented Mar 26, 2024

I think saving the expression clause might be useful, wdyt?

on the second thought, not sure what we can do with it since expression can have a high degree of freedom.

@qnkhuat
Copy link
Contributor Author

qnkhuat commented Apr 17, 2024

Implemented in #39671

@qnkhuat qnkhuat closed this as completed Apr 17, 2024
@qnkhuat qnkhuat added this to the 0.50 milestone Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.Team/BackendComponents also known as BEC
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants