Bug 1874068 - Suggest: dedupe suggestion fetching suggestions with duplicate prefix keywords #6098

tiftran · 2024-02-02T08:24:10Z

Pull Request checklist

Breaking changes: This PR follows our breaking change policy
- This PR follows the breaking change policy:
  - This PR has no breaking API changes, or
  - There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
Quality: This PR builds and tests run cleanly
- Note:
  - For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
  - If this pull request includes a breaking change, consider cutting a new release after merging.
Tests: This PR includes thorough tests or an explanation of why it does not
Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
- Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
Dependencies: This PR follows our dependency management guidelines
- Any new dependencies are accompanied by a summary of the due dilligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

codecov-commenter · 2024-02-02T08:34:25Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8c8e3e0) 27.44% compared to head (491f832) 51.12%.
Report is 15 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #6098       +/-   ##
===========================================
+ Coverage   27.44%   51.12%   +23.68%     
===========================================
  Files         398      114      -284     
  Lines       50603    11911    -38692     
===========================================
- Hits        13889     6090     -7799     
+ Misses      36714     5821    -30893

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

tiftran · 2024-02-02T08:44:00Z

components/suggest/src/db.rs

+                FROM suggestions s
+                JOIN prefix_keywords k ON k.suggestion_id = s.id
+                JOIN (
+                SELECT
+                    suggestion_id,
+                    MAX(rank) as highest_rank
+                    FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id
+                ) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
+                WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
+                ORDER BY s.score DESC
+                LIMIT :suggestions_limit",


so i used rank to reduce the suggestions to one suggestion, I think it should be okay for now, since we don't use the full keyword for pocket or amo at the moment? I know @linabutler mentioned it in the bugzilla, but tagging @0c0w3 again just to make sure :)

Nit: Can you add a newline before the GROUP BY?

Now that you're joining on the inner SELECT, do you still need the outer JOIN prefix_keywords k ON k.suggestion_id = s.id and WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider?

Bigger picture: is there a way to test performance? Adding the nested query seems slightly scary to me, it would be nice if there was a performance suite that would allow us to say this stayed within X% of the current preformance.

Hi @tiftran, we don't use full keywords for AMO. We do use them for Pocket -- desktop picks the longest full keyword that starts with the query -- but since there are no plans to enable Pocket again, IMO there's no reason to worry about that right now.

To @bendk's point, this can be written more simply (GitHub won't let me suggest a change for deleted lines, so I'll just paste the whole query here):

"SELECT s.id, MAX(k.rank) AS rank, s.title, s.url, s.provider, s.score, k.confidence, k.keyword_suffix FROM suggestions s JOIN prefix_keywords k ON k.suggestion_id = s.id WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider GROUP BY s.id ORDER BY rank DESC, s.score DESC LIMIT :suggestions_limit",

okay hold on, so we want s.id and k.confidence for pocket, and just s.id for amo?

Not at all; thanks for thinking this through together!

How does the k.confidence addition look to you, @0c0w3 and @tiftran?

I'm having a hard time thinking of how that case could show up in practice—maybe in something like:

{ "lowConfidenceKeywords": ["toxic work culture", "soft life", "work-life balance"], "highConfidenceKeywords": ["burnout women", "soft living"] }

If your query is soft li, we have a low-confidence ("soft life", confidence = 0) and a high-confidence ("soft living", confidence = 1) with the same rank (1)—so we want to use the confidence and the rank to break the tie. And since we can LIMIT the number of suggestions we return, we want to sort low-confidence keywords first (ASC)—because if we sorted high-confidence first, and had LIMIT 1, we'd need an exact match for soft living, and so we wouldn't show the Pocket suggestion at all.

okay hold on, so we want s.id and k.confidence for pocket, and just s.id for amo?

@tiftran Yep, I think so! AMO (and MDN! I forgot about those 🙈 —they also use the prefix_keywords table) keywords don't have a confidence, so their rank should always be unique for suggestion_id.

Looks good to me, I forgot about confidence.

@tiftran and I talked through this a bit on Slack, and decided on:

... GROUP BY s.id, k.confidence ORDER BY s.score DESC, rank DESC

We group by (id, confidence) to catch the case where a suggestion has a high-confidence and a low-confidence keyword with the same first word and the same rank. But we order by just (score, rank), and drop the LIMIT.

That way, the query will return a row for the high-confidence keyword, and another row for the low-confidence keyword. The query_rows_and_then_cached closure will be called for each row, which will determine which of the keywords matches based on the confidence—and drop the other.

Since query_rows_and_then_cached returns an iterator, we can use .take() to do the limit in Rust:

)?.into_iter().flatten().take( query.limit.and_then(|limit| usize::try_from(limit).ok() ).unwrap_or(usize::MAX) ).collect();

feedback changes made. Lots of love and appreciation @bendk, @0c0w3 , @linabutler for helping me simplify and improve the sql 💜

bendk

The code looks correct to me. I mentioned some performance questions in my comments, but I'll let you make the call on that. I don't think want to block this on optimizations that's aren't clearly motivated.

bendk · 2024-02-02T15:41:07Z

components/suggest/src/db.rs

+                FROM suggestions s
+                JOIN prefix_keywords k ON k.suggestion_id = s.id
+                JOIN (
+                SELECT
+                    suggestion_id,
+                    MAX(rank) as highest_rank
+                    FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id
+                ) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
+                WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
+                ORDER BY s.score DESC
+                LIMIT :suggestions_limit",


Nit: Can you add a newline before the GROUP BY?

bendk · 2024-02-02T15:49:00Z

components/suggest/src/db.rs

+                FROM suggestions s
+                JOIN prefix_keywords k ON k.suggestion_id = s.id
+                JOIN (
+                SELECT
+                    suggestion_id,
+                    MAX(rank) as highest_rank
+                    FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id
+                ) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
+                WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
+                ORDER BY s.score DESC
+                LIMIT :suggestions_limit",


Now that you're joining on the inner SELECT, do you still need the outer JOIN prefix_keywords k ON k.suggestion_id = s.id and WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider?

Bigger picture: is there a way to test performance? Adding the nested query seems slightly scary to me, it would be nice if there was a performance suite that would allow us to say this stayed within X% of the current preformance.

0c0w3 · 2024-02-05T22:16:20Z

Ah sorry, I didn't see I was flagged for review on this! Looking now.

0c0w3

Thanks!

0c0w3 · 2024-02-05T22:24:36Z

components/suggest/src/db.rs

+                FROM suggestions s
+                JOIN prefix_keywords k ON k.suggestion_id = s.id
+                JOIN (
+                SELECT
+                    suggestion_id,
+                    MAX(rank) as highest_rank
+                    FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id
+                ) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
+                WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
+                ORDER BY s.score DESC
+                LIMIT :suggestions_limit",


To @bendk's point, this can be written more simply (GitHub won't let me suggest a change for deleted lines, so I'll just paste the whole query here):

"SELECT s.id, MAX(k.rank) AS rank, s.title, s.url, s.provider, s.score, k.confidence, k.keyword_suffix FROM suggestions s JOIN prefix_keywords k ON k.suggestion_id = s.id WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider GROUP BY s.id ORDER BY rank DESC, s.score DESC LIMIT :suggestions_limit",

0c0w3 · 2024-02-05T22:27:55Z

components/suggest/src/db.rs

+                FROM suggestions s
+                JOIN prefix_keywords k ON k.suggestion_id = s.id
+                JOIN (
+                SELECT
+                    suggestion_id,
+                    MAX(rank) as highest_rank
+                    FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' 
+                    GROUP BY suggestion_id
+                ) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
+                WHERE s.provider = :provider
+                ORDER BY s.score DESC
+                LIMIT :suggestions_limit",


Same here. Since this query is the same as the AMO one -- and probably any suggestion type that uses the prefix_keywords table will use it -- is it feasible to factor out either the SQL or the surrounding Rust code into a common helper? Maybe a method that queries the table given a query and provider? That's fine for a follow-up though since the SQL is already duplicated.

considering we made changes to the sql statement for pocket and mdn and they slightly differ, i think i'm going to opt to do this as a follow up 🙂

linabutler

r+wc, thanks!

0c0w3

Thanks!

0c0w3 · 2024-02-06T17:36:57Z

components/suggest/src/store.rs

+                     "lowConfidenceKeywords": ["soft life", "soft living", "soft work", "workaholism", "toxic work culture"],
+                     "highConfidenceKeywords": ["burnout women", "grind culture", "women burnout"],


Could we add a couple more Pocket tests involving a high and low keyword with the same prefix? e.g. a soft li prefix from @linabutler's example in a recent comment:

{ "lowConfidenceKeywords": ["toxic work culture", "soft life", "work-life balance"], "highConfidenceKeywords": ["burnout women", "soft living"] }

A query for soft li should return a non-top-pick suggesion

A query for soft living should return a top-pick suggestion

(Doesn't have to be those exact keywords but that's the idea)

left "soft living" in the low confidence (since we want to ensure "soft l" returns one suggestion) and added "soft lives" to high confidence to test the test case instead 🙂

linabutler

Still LGTM!

linabutler · 2024-02-06T20:11:47Z

components/suggest/src/db.rs

@@ -365,23 +374,19 @@ impl<'a> SuggestDao<'a> {
        let suggestions = self
            .conn
            .query_rows_and_then_cached(
-                r#"


I do like this roomier formatting for the SQL strings that @ncloudioj used for MDN—it would be nice to adopt it elsewhere, but not at all a blocker for your PR 😄

Ah, thanks, Lina! I stole this script from our pipeline project so that I don't need to manually format SQL any more. The only downside is that it could get too sparse sometimes.

oh this is fancy, will have a PR up for this soon

…plicate prefix keywords

tiftran requested a review from a team February 2, 2024 08:41

tiftran commented Feb 2, 2024

View reviewed changes

bendk approved these changes Feb 2, 2024

View reviewed changes

0c0w3 approved these changes Feb 5, 2024

View reviewed changes

linabutler approved these changes Feb 5, 2024

View reviewed changes

tiftran force-pushed the pocket-bug branch 3 times, most recently from acc7ea8 to cacac8a Compare February 6, 2024 05:19

tiftran requested review from 0c0w3, bendk and linabutler February 6, 2024 05:50

0c0w3 approved these changes Feb 6, 2024

View reviewed changes

linabutler approved these changes Feb 6, 2024

View reviewed changes

tiftran force-pushed the pocket-bug branch from 491f832 to 4fbf3f8 Compare February 6, 2024 22:47

Bug 1874068 - Suggest: dedupe suggestion fetching suggestions with du…

604f792

…plicate prefix keywords

tiftran force-pushed the pocket-bug branch from 4fbf3f8 to 604f792 Compare February 6, 2024 22:55

tiftran added this pull request to the merge queue Feb 6, 2024

Merged via the queue into main with commit dca3e8b Feb 6, 2024
16 checks passed

tiftran deleted the pocket-bug branch February 6, 2024 23:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1874068 - Suggest: dedupe suggestion fetching suggestions with duplicate prefix keywords #6098

Bug 1874068 - Suggest: dedupe suggestion fetching suggestions with duplicate prefix keywords #6098

tiftran commented Feb 2, 2024

codecov-commenter commented Feb 2, 2024 •

edited

Loading

tiftran Feb 2, 2024

bendk Feb 2, 2024

bendk Feb 2, 2024

0c0w3 Feb 3, 2024

0c0w3 Feb 5, 2024

tiftran Feb 6, 2024

linabutler Feb 6, 2024

0c0w3 Feb 6, 2024

linabutler Feb 6, 2024

tiftran Feb 6, 2024

bendk left a comment

bendk Feb 2, 2024

bendk Feb 2, 2024

0c0w3 commented Feb 5, 2024

0c0w3 left a comment

0c0w3 Feb 5, 2024

0c0w3 Feb 5, 2024

tiftran Feb 6, 2024

linabutler left a comment

0c0w3 left a comment

0c0w3 Feb 6, 2024

tiftran Feb 6, 2024

linabutler left a comment

linabutler Feb 6, 2024

ncloudioj Feb 6, 2024

tiftran Feb 6, 2024 •

edited

Loading

		"lowConfidenceKeywords": ["soft life", "soft living", "soft work", "workaholism", "toxic work culture"],
		"highConfidenceKeywords": ["burnout women", "grind culture", "women burnout"],

Bug 1874068 - Suggest: dedupe suggestion fetching suggestions with duplicate prefix keywords #6098

Bug 1874068 - Suggest: dedupe suggestion fetching suggestions with duplicate prefix keywords #6098

Conversation

tiftran commented Feb 2, 2024

Pull Request checklist

codecov-commenter commented Feb 2, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bendk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

0c0w3 commented Feb 5, 2024

0c0w3 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linabutler left a comment

Choose a reason for hiding this comment

0c0w3 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linabutler left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tiftran Feb 6, 2024 • edited Loading

Choose a reason for hiding this comment

codecov-commenter commented Feb 2, 2024 •

edited

Loading

tiftran Feb 6, 2024 •

edited

Loading