Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude sample content and Metabase Analytics from usage stats #41519

Conversation

calherries
Copy link
Contributor

@calherries calherries commented Apr 17, 2024

Now that we're adding sample content to new instances in #40753, this PR excludes the content from usage stats. The relevant section of the product doc is here:

When sending usage stats, we should not include this sample content (or the audit v2 content) in any question/dashboards aggregates

It also excludes Metabase Analytics content (behaviour approved here by Arakaki)

Both changes the same approach:

  1. We now exclude any cards and dashboards that were created by the internal user from the stats. We also exclude any dashcards that belong to a dashboard that was created by the internal user.
  2. We exclude any collections that have is_sample=true from the stats. This required creating a new is_sample column on the collection table.

@metabase-bot metabase-bot bot added the .Team/BackendComponents also known as BEC label Apr 17, 2024
@calherries calherries requested a review from a team April 17, 2024 14:04
@calherries calherries changed the title Exclude sample content from usage stats Exclude sample content and Metabase Analytics from usage stats Apr 17, 2024
Copy link

replay-io bot commented Apr 17, 2024

Status Complete ↗︎
Commit 068f605
Results
⚠️ 7 Flaky
2402 Passed

Copy link
Contributor

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I was appalled at doing table scans to filter such potentially-large tables, but then I realized we're already pulling them all into memory in any case - ouch! 😆

Comment on lines +210 to +211
:join [[(t2/table-name Dashboard) :d] [:= :d.id :dc.dashboard_id]]
:where [:not [:= :d.creator_id config/internal-mb-user-id]]})]
Copy link
Contributor

@crisptrutski crisptrutski Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably more efficient to flip this around and skip the join by doing something like [:not [:in :dc.dashboard_id the-really-small-number-of-sample-dashboard-ids]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter? This query gets every dashcard in the app DB anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition for how databases work, and the related costs, is very rotted, so I may be way off.

I suspect that a table scan with a join is a lot more expensive than a plain table scan, I'd expect it to build a temporary table or do sub-selects.

I don't think the query analyzer has access to statistics that would suggest doing this inversion internally, and it's not any more complicated. Not gonna block on it, this namespace is ripe for broader optimizations.

@calherries calherries requested a review from a team April 17, 2024 14:29
@calherries
Copy link
Contributor Author

calherries commented Apr 17, 2024

@crisptrutski Sorry, I updated the PR because I decided to also include the Metabase Analytics changes. Can you review again?

@@ -298,8 +302,8 @@
(defn- collection-metrics
"Get metrics on Collection usage."
[]
(let [collections (t2/select Collection)
cards (t2/select [Card :collection_id])]
(let [collections (t2/select Collection :creator_id [:not= config/internal-mb-user-id])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to all the blood in the last CI run, Collection doesn't have such a column 😢

I'm guessing you can't abuse :personal_owner_id without hiding it from other users, so I'm not sure what you can do here.

One dubious option would be to remove collections which only contain sample cards 🥴

Copy link
Contributor Author

@calherries calherries Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've updated this. I had some local state tricking me into thinking there was a creator_id column.

See the latest version https://github.com/metabase/metabase/pull/41519/files#diff-96de37bff643ff1187fa863f5a14541a6c03a6c2c20deaa7e043a0fcf8f54a08R305

tableName: collection
columns:
- column:
name: is_sample
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool 👍

@calherries calherries merged commit 62778f9 into all-users-curate-example-collection Apr 17, 2024
76 checks passed
@calherries calherries deleted the exclude-sample-content-from-usage-analytics branch April 17, 2024 14:37
@calherries calherries restored the exclude-sample-content-from-usage-analytics branch April 17, 2024 14:38
calherries added a commit that referenced this pull request Apr 17, 2024
Copy link
Contributor

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-ticking ✅ ✅

A slightly philosophical question - at what point does sample content become user content?

e.g.

  • a customer renames and adds new cards to a collection
  • a customer edits a card substantially and adds it to another dashboard

etc.

@calherries
Copy link
Contributor Author

calherries commented Apr 17, 2024

Made a clone because I accidentally merged this into the parent branch #41522

@calherries
Copy link
Contributor Author

calherries commented Apr 17, 2024

A slightly philosophical question - at what point does sample content become user content?

Good question, it won't ever change. Once something is created as sample content it stays sample content. A sample content card that is renamed will stay sample content. A new card/dashboard in a sample content collection will not be counted as sample content though.

I think this is ok since this is just for our own internal statistics and probably only need to be like 99% correct to be useful.

@crisptrutski
Copy link
Contributor

Yeah it's probably better to keep things simple than to navigate such a grey area.

If we were going to dip our toes into refining things, I was thinking we could compare created and updated times.

@calherries calherries deleted the exclude-sample-content-from-usage-analytics branch April 18, 2024 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.Team/BackendComponents also known as BEC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants