-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1874068 - Suggest: dedupe suggestion fetching suggestions with duplicate prefix keywords #6098
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #6098 +/- ##
===========================================
+ Coverage 27.44% 51.12% +23.68%
===========================================
Files 398 114 -284
Lines 50603 11911 -38692
===========================================
- Hits 13889 6090 -7799
+ Misses 36714 5821 -30893 ☔ View full report in Codecov by Sentry. |
components/suggest/src/db.rs
Outdated
FROM suggestions s | ||
JOIN prefix_keywords k ON k.suggestion_id = s.id | ||
JOIN ( | ||
SELECT | ||
suggestion_id, | ||
MAX(rank) as highest_rank | ||
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id | ||
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank | ||
WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider | ||
ORDER BY s.score DESC | ||
LIMIT :suggestions_limit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so i used rank to reduce the suggestions to one suggestion, I think it should be okay for now, since we don't use the full keyword for pocket or amo at the moment? I know @linabutler mentioned it in the bugzilla, but tagging @0c0w3 again just to make sure :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can you add a newline before the GROUP BY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you're joining on the inner SELECT
, do you still need the outer JOIN prefix_keywords k ON k.suggestion_id = s.id
and WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
?
Bigger picture: is there a way to test performance? Adding the nested query seems slightly scary to me, it would be nice if there was a performance suite that would allow us to say this stayed within X% of the current preformance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tiftran, we don't use full keywords for AMO. We do use them for Pocket -- desktop picks the longest full keyword that starts with the query -- but since there are no plans to enable Pocket again, IMO there's no reason to worry about that right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To @bendk's point, this can be written more simply (GitHub won't let me suggest a change for deleted lines, so I'll just paste the whole query here):
"SELECT s.id, MAX(k.rank) AS rank, s.title, s.url, s.provider, s.score, k.confidence,
k.keyword_suffix
FROM suggestions s
JOIN prefix_keywords k ON k.suggestion_id = s.id
WHERE k.keyword_prefix = :keyword_prefix AND
k.keyword_suffix LIKE :keyword_suffix || '%' AND
s.provider = :provider
GROUP BY s.id
ORDER BY rank DESC, s.score DESC
LIMIT :suggestions_limit",
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay hold on, so we want s.id and k.confidence for pocket, and just s.id for amo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all; thanks for thinking this through together!
How does the k.confidence
addition look to you, @0c0w3 and @tiftran?
I'm having a hard time thinking of how that case could show up in practice—maybe in something like:
{
"lowConfidenceKeywords": ["toxic work culture", "soft life", "work-life balance"],
"highConfidenceKeywords": ["burnout women", "soft living"]
}
If your query is soft li
, we have a low-confidence ("soft life", confidence = 0
) and a high-confidence ("soft living", confidence = 1
) with the same rank (1)—so we want to use the confidence and the rank to break the tie. And since we can LIMIT
the number of suggestions we return, we want to sort low-confidence keywords first (ASC
)—because if we sorted high-confidence first, and had LIMIT 1
, we'd need an exact match for soft living
, and so we wouldn't show the Pocket suggestion at all.
okay hold on, so we want s.id and k.confidence for pocket, and just s.id for amo?
@tiftran Yep, I think so! AMO (and MDN! I forgot about those 🙈 —they also use the prefix_keywords
table) keywords don't have a confidence, so their rank
should always be unique for suggestion_id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I forgot about confidence
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tiftran and I talked through this a bit on Slack, and decided on:
...
GROUP BY s.id, k.confidence
ORDER BY s.score DESC, rank DESC
We group by (id, confidence)
to catch the case where a suggestion has a high-confidence and a low-confidence keyword with the same first word and the same rank. But we order by just (score, rank)
, and drop the LIMIT
.
That way, the query will return a row for the high-confidence keyword, and another row for the low-confidence keyword. The query_rows_and_then_cached
closure will be called for each row, which will determine which of the keywords matches based on the confidence—and drop the other.
Since query_rows_and_then_cached
returns an iterator, we can use .take()
to do the limit in Rust:
)?.into_iter().flatten().take(
query.limit.and_then(|limit|
usize::try_from(limit).ok()
).unwrap_or(usize::MAX)
).collect();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feedback changes made. Lots of love and appreciation @bendk, @0c0w3 , @linabutler for helping me simplify and improve the sql 💜
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks correct to me. I mentioned some performance questions in my comments, but I'll let you make the call on that. I don't think want to block this on optimizations that's aren't clearly motivated.
components/suggest/src/db.rs
Outdated
FROM suggestions s | ||
JOIN prefix_keywords k ON k.suggestion_id = s.id | ||
JOIN ( | ||
SELECT | ||
suggestion_id, | ||
MAX(rank) as highest_rank | ||
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id | ||
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank | ||
WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider | ||
ORDER BY s.score DESC | ||
LIMIT :suggestions_limit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can you add a newline before the GROUP BY
?
components/suggest/src/db.rs
Outdated
FROM suggestions s | ||
JOIN prefix_keywords k ON k.suggestion_id = s.id | ||
JOIN ( | ||
SELECT | ||
suggestion_id, | ||
MAX(rank) as highest_rank | ||
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id | ||
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank | ||
WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider | ||
ORDER BY s.score DESC | ||
LIMIT :suggestions_limit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you're joining on the inner SELECT
, do you still need the outer JOIN prefix_keywords k ON k.suggestion_id = s.id
and WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider
?
Bigger picture: is there a way to test performance? Adding the nested query seems slightly scary to me, it would be nice if there was a performance suite that would allow us to say this stayed within X% of the current preformance.
Ah sorry, I didn't see I was flagged for review on this! Looking now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
components/suggest/src/db.rs
Outdated
FROM suggestions s | ||
JOIN prefix_keywords k ON k.suggestion_id = s.id | ||
JOIN ( | ||
SELECT | ||
suggestion_id, | ||
MAX(rank) as highest_rank | ||
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' GROUP BY suggestion_id | ||
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank | ||
WHERE k.keyword_prefix = :keyword_prefix AND k.keyword_suffix LIKE :keyword_suffix || '%' AND s.provider = :provider | ||
ORDER BY s.score DESC | ||
LIMIT :suggestions_limit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To @bendk's point, this can be written more simply (GitHub won't let me suggest a change for deleted lines, so I'll just paste the whole query here):
"SELECT s.id, MAX(k.rank) AS rank, s.title, s.url, s.provider, s.score, k.confidence,
k.keyword_suffix
FROM suggestions s
JOIN prefix_keywords k ON k.suggestion_id = s.id
WHERE k.keyword_prefix = :keyword_prefix AND
k.keyword_suffix LIKE :keyword_suffix || '%' AND
s.provider = :provider
GROUP BY s.id
ORDER BY rank DESC, s.score DESC
LIMIT :suggestions_limit",
components/suggest/src/db.rs
Outdated
FROM suggestions s | ||
JOIN prefix_keywords k ON k.suggestion_id = s.id | ||
JOIN ( | ||
SELECT | ||
suggestion_id, | ||
MAX(rank) as highest_rank | ||
FROM prefix_keywords WHERE keyword_prefix = :keyword_prefix AND keyword_suffix LIKE :keyword_suffix || '%' | ||
GROUP BY suggestion_id | ||
) h ON k.suggestion_id = h.suggestion_id AND k.rank = h.highest_rank | ||
WHERE s.provider = :provider | ||
ORDER BY s.score DESC | ||
LIMIT :suggestions_limit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Since this query is the same as the AMO one -- and probably any suggestion type that uses the prefix_keywords
table will use it -- is it feasible to factor out either the SQL or the surrounding Rust code into a common helper? Maybe a method that queries the table given a query and provider? That's fine for a follow-up though since the SQL is already duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
considering we made changes to the sql statement for pocket and mdn and they slightly differ, i think i'm going to opt to do this as a follow up 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+wc, thanks!
acc7ea8
to
cacac8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
components/suggest/src/store.rs
Outdated
"lowConfidenceKeywords": ["soft life", "soft living", "soft work", "workaholism", "toxic work culture"], | ||
"highConfidenceKeywords": ["burnout women", "grind culture", "women burnout"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a couple more Pocket tests involving a high and low keyword with the same prefix? e.g. a soft li
prefix from @linabutler's example in a recent comment:
{
"lowConfidenceKeywords": ["toxic work culture", "soft life", "work-life balance"],
"highConfidenceKeywords": ["burnout women", "soft living"]
}
- A query for
soft li
should return a non-top-pick suggesion - A query for
soft living
should return a top-pick suggestion
(Doesn't have to be those exact keywords but that's the idea)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left "soft living" in the low confidence (since we want to ensure "soft l" returns one suggestion) and added "soft lives" to high confidence to test the test case instead 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still LGTM!
@@ -365,23 +374,19 @@ impl<'a> SuggestDao<'a> { | |||
let suggestions = self | |||
.conn | |||
.query_rows_and_then_cached( | |||
r#" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do like this roomier formatting for the SQL strings that @ncloudioj used for MDN—it would be nice to adopt it elsewhere, but not at all a blocker for your PR 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks, Lina! I stole this script from our pipeline project so that I don't need to manually format SQL any more. The only downside is that it could get too sparse sometimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh this is fancy, will have a PR up for this soon
…plicate prefix keywords
Pull Request checklist
[ci full]
to the PR title.Branch builds: add
[firefox-android: branch-name]
to the PR title.