Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1874068 - Suggest: dedupe suggestion fetching suggestions with duplicate prefix keywords #6098

Merged
merged 1 commit into from
Feb 6, 2024

Conversation

tiftran
Copy link
Contributor

@tiftran tiftran commented Feb 2, 2024

Pull Request checklist

  • Breaking changes: This PR follows our breaking change policy
    • This PR follows the breaking change policy:
      • This PR has no breaking API changes, or
      • There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
  • Quality: This PR builds and tests run cleanly
    • Note:
      • For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
      • If this pull request includes a breaking change, consider cutting a new release after merging.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due dilligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

@codecov-commenter
Copy link

codecov-commenter commented Feb 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8c8e3e0) 27.44% compared to head (491f832) 51.12%.
Report is 15 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #6098       +/-   ##
===========================================
+ Coverage   27.44%   51.12%   +23.68%     
===========================================
  Files         398      114      -284     
  Lines       50603    11911    -38692     
===========================================
- Hits        13889     6090     -7799     
+ Misses      36714     5821    -30893     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tiftran tiftran requested a review from a team February 2, 2024 08:41
Comment on lines 273 to 283
FROM suggestions s
JOIN prefix_keywords k ON k.suggestion_id = s.id
JOIN (
SELECT
suggestion_id,
MAX(rank) as highest_rank
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
ORDER BY s.score DESC
LIMIT :suggestions_limit",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i used rank to reduce the suggestions to one suggestion, I think it should be okay for now, since we don't use the full keyword for pocket or amo at the moment? I know @linabutler mentioned it in the bugzilla, but tagging @0c0w3 again just to make sure :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can you add a newline before the GROUP BY?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you're joining on the inner SELECT, do you still need the outer JOIN prefix_keywords k ON k.suggestion_id = s.id and WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider?

Bigger picture: is there a way to test performance? Adding the nested query seems slightly scary to me, it would be nice if there was a performance suite that would allow us to say this stayed within X% of the current preformance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tiftran, we don't use full keywords for AMO. We do use them for Pocket -- desktop picks the longest full keyword that starts with the query -- but since there are no plans to enable Pocket again, IMO there's no reason to worry about that right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To @bendk's point, this can be written more simply (GitHub won't let me suggest a change for deleted lines, so I'll just paste the whole query here):

            "SELECT s.id, MAX(k.rank) AS rank, s.title, s.url, s.provider, s.score, k.confidence,
                    k.keyword_suffix
             FROM suggestions s
             JOIN prefix_keywords k ON k.suggestion_id = s.id
             WHERE k.keyword_prefix = :keyword_prefix AND
                   k.keyword_suffix LIKE :keyword_suffix || '%' AND
                   s.provider = :provider
             GROUP BY s.id
             ORDER BY rank DESC, s.score DESC
             LIMIT :suggestions_limit",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay hold on, so we want s.id and k.confidence for pocket, and just s.id for amo?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all; thanks for thinking this through together!

How does the k.confidence addition look to you, @0c0w3 and @tiftran?

I'm having a hard time thinking of how that case could show up in practice—maybe in something like:

{
    "lowConfidenceKeywords": ["toxic work culture", "soft life", "work-life balance"],
    "highConfidenceKeywords": ["burnout women", "soft living"]
}

If your query is soft li, we have a low-confidence ("soft life", confidence = 0) and a high-confidence ("soft living", confidence = 1) with the same rank (1)—so we want to use the confidence and the rank to break the tie. And since we can LIMIT the number of suggestions we return, we want to sort low-confidence keywords first (ASC)—because if we sorted high-confidence first, and had LIMIT 1, we'd need an exact match for soft living, and so we wouldn't show the Pocket suggestion at all.

okay hold on, so we want s.id and k.confidence for pocket, and just s.id for amo?

@tiftran Yep, I think so! AMO (and MDN! I forgot about those 🙈 —they also use the prefix_keywords table) keywords don't have a confidence, so their rank should always be unique for suggestion_id.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I forgot about confidence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiftran and I talked through this a bit on Slack, and decided on:

...
GROUP BY s.id, k.confidence
ORDER BY s.score DESC, rank DESC

We group by (id, confidence) to catch the case where a suggestion has a high-confidence and a low-confidence keyword with the same first word and the same rank. But we order by just (score, rank), and drop the LIMIT.

That way, the query will return a row for the high-confidence keyword, and another row for the low-confidence keyword. The query_rows_and_then_cached closure will be called for each row, which will determine which of the keywords matches based on the confidence—and drop the other.

Since query_rows_and_then_cached returns an iterator, we can use .take() to do the limit in Rust:

)?.into_iter().flatten().take(
    query.limit.and_then(|limit|
        usize::try_from(limit).ok()
    ).unwrap_or(usize::MAX)
).collect();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feedback changes made. Lots of love and appreciation @bendk, @0c0w3 , @linabutler for helping me simplify and improve the sql 💜

Copy link
Contributor

@bendk bendk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks correct to me. I mentioned some performance questions in my comments, but I'll let you make the call on that. I don't think want to block this on optimizations that's aren't clearly motivated.

Comment on lines 273 to 283
FROM suggestions s
JOIN prefix_keywords k ON k.suggestion_id = s.id
JOIN (
SELECT
suggestion_id,
MAX(rank) as highest_rank
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
ORDER BY s.score DESC
LIMIT :suggestions_limit",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can you add a newline before the GROUP BY?

Comment on lines 273 to 283
FROM suggestions s
JOIN prefix_keywords k ON k.suggestion_id = s.id
JOIN (
SELECT
suggestion_id,
MAX(rank) as highest_rank
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
ORDER BY s.score DESC
LIMIT :suggestions_limit",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you're joining on the inner SELECT, do you still need the outer JOIN prefix_keywords k ON k.suggestion_id = s.id and WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider?

Bigger picture: is there a way to test performance? Adding the nested query seems slightly scary to me, it would be nice if there was a performance suite that would allow us to say this stayed within X% of the current preformance.

@0c0w3
Copy link
Contributor

0c0w3 commented Feb 5, 2024

Ah sorry, I didn't see I was flagged for review on this! Looking now.

Copy link
Contributor

@0c0w3 0c0w3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines 273 to 283
FROM suggestions s
JOIN prefix_keywords k ON k.suggestion_id = s.id
JOIN (
SELECT
suggestion_id,
MAX(rank) as highest_rank
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
ORDER BY s.score DESC
LIMIT :suggestions_limit",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To @bendk's point, this can be written more simply (GitHub won't let me suggest a change for deleted lines, so I'll just paste the whole query here):

            "SELECT s.id, MAX(k.rank) AS rank, s.title, s.url, s.provider, s.score, k.confidence,
                    k.keyword_suffix
             FROM suggestions s
             JOIN prefix_keywords k ON k.suggestion_id = s.id
             WHERE k.keyword_prefix = :keyword_prefix AND
                   k.keyword_suffix LIKE :keyword_suffix || '%' AND
                   s.provider = :provider
             GROUP BY s.id
             ORDER BY rank DESC, s.score DESC
             LIMIT :suggestions_limit",

Comment on lines 329 to 340
FROM suggestions s
JOIN prefix_keywords k ON k.suggestion_id = s.id
JOIN (
SELECT
suggestion_id,
MAX(rank) as highest_rank
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%'
GROUP BY suggestion_id
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank
WHERE s.provider = :provider
ORDER BY s.score DESC
LIMIT :suggestions_limit",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Since this query is the same as the AMO one -- and probably any suggestion type that uses the prefix_keywords table will use it -- is it feasible to factor out either the SQL or the surrounding Rust code into a common helper? Maybe a method that queries the table given a query and provider? That's fine for a follow-up though since the SQL is already duplicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering we made changes to the sql statement for pocket and mdn and they slightly differ, i think i'm going to opt to do this as a follow up 🙂

Copy link
Member

@linabutler linabutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+wc, thanks!

@tiftran tiftran force-pushed the pocket-bug branch 3 times, most recently from acc7ea8 to cacac8a Compare February 6, 2024 05:19
Copy link
Contributor

@0c0w3 0c0w3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines 3303 to 3304
"lowConfidenceKeywords": ["soft life", "soft living", "soft work", "workaholism", "toxic work culture"],
"highConfidenceKeywords": ["burnout women", "grind culture", "women burnout"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a couple more Pocket tests involving a high and low keyword with the same prefix? e.g. a soft li prefix from @linabutler's example in a recent comment:

{
    "lowConfidenceKeywords": ["toxic work culture", "soft life", "work-life balance"],
    "highConfidenceKeywords": ["burnout women", "soft living"]
}
  • A query for soft li should return a non-top-pick suggesion
  • A query for soft living should return a top-pick suggestion

(Doesn't have to be those exact keywords but that's the idea)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left "soft living" in the low confidence (since we want to ensure "soft l" returns one suggestion) and added "soft lives" to high confidence to test the test case instead 🙂

Copy link
Member

@linabutler linabutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM!

@@ -365,23 +374,19 @@ impl<'a> SuggestDao<'a> {
let suggestions = self
.conn
.query_rows_and_then_cached(
r#"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like this roomier formatting for the SQL strings that @ncloudioj used for MDN—it would be nice to adopt it elsewhere, but not at all a blocker for your PR 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks, Lina! I stole this script from our pipeline project so that I don't need to manually format SQL any more. The only downside is that it could get too sparse sometimes.

Copy link
Contributor Author

@tiftran tiftran Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh this is fancy, will have a PR up for this soon

@tiftran tiftran added this pull request to the merge queue Feb 6, 2024
Merged via the queue into main with commit dca3e8b Feb 6, 2024
16 checks passed
@tiftran tiftran deleted the pocket-bug branch February 6, 2024 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants