Hybrid search#4738
Conversation
There was a problem hiding this comment.
Pull request overview
Replaces the AI text search's pure-KNN Elasticsearch call with a hybrid query that combines a knn clause and a BM25 multi_match clause, plumbed through with a new optional vecWeight parameter (default 0.8 on the server) controllable from the client. A preliminary ES request computes the query's max BM25 score so the lexical clause can be rescaled into roughly the same [0, 1] range as the cosine-similarity scores before being blended.
Changes:
- New
ElasticSearch.hybridSearchthat runs a max-BM25 probe query, then abool { should: [multiMatch, knn] }request, scaling the multi-match boost by(lexicalWeight/vecWeight) * (1/maxScore). - New
vecWeightquery parameter parsed inSearchParams, threaded throughMediaApi.semanticSearchByText/performAiSearchAndRespond, defaulting to 0.8 when absent. - Kahuna wiring: state param,
$stateParamsplumbing, controller, andmediaApi.searchupdated to accept and forwardvecWeight;elastic4sbumped from 8.18.2 to 8.19.1.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| build.sbt | Bumps elastic4s to 8.19.1. |
| media-api/app/lib/elasticsearch/ElasticSearch.scala | Adds hybridSearch with max-BM25 normalisation and combined bool/knn query. |
| media-api/app/lib/elasticsearch/ElasticSearchModel.scala | Adds optional vecWeight to SearchParams and a parseDoubleFromQuery helper. |
| media-api/app/controllers/MediaApi.scala | Threads vecWeight into semanticSearchByText/performAiSearchAndRespond and the search param list; defaults to 0.8. |
| kahuna/public/js/search/index.js | Registers vecWeight as a search URL/state parameter. |
| kahuna/public/js/search/query.js | Initialises ctrl.vecWeight from $stateParams and propagates it on AI-search state transitions. |
| kahuna/public/js/search/results.js | Forwards $stateParams.vecWeight into the API search call. |
| kahuna/public/js/services/api/media-api.js | Adds vecWeight to the search API parameters. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
alexduf
left a comment
There was a problem hiding this comment.
I've flagged a few things that would be nice to address, either in this PR or in a subsequent one as I think this should work as-is
| executeAndLog(withSearchQueryTimeout(searchRequest), "hybrid search").map { r => | ||
| val imageHits = r.result.hits.hits.map(resolveHit).toSeq.flatten.map(i => (i.instance.id, i)) | ||
| SearchResults(hits = imageHits, total = imageHits.length, extraCounts = None) |
There was a problem hiding this comment.
I think this whole function could do with a little bit more structure. If you define the right functions you may be able to have code that looks like this pseudo code block:
for {
maxScore <- fetchMaxBm25Score(query)
hybridQuery = makeHybridQuery(query, maxScore)
result <- executeAndLog(hybridQuery)
} yield {
// build the SearchResults object here
}This should make this function both more readable and more maintainable
There was a problem hiding this comment.
Ah that's so nice! Have tried doing this here 44eda6f - haven't tested this refactor on TEST yet but that's one for tomorrow :)
Co-authored-by: Andrew Nowak <10963046+andrew-nowak@users.noreply.github.com>
| val awsSdkVersion = "1.12.470" | ||
| val awsSdkV2Version = "2.42.25" | ||
| val elastic4sVersion = "8.18.2" | ||
| val elastic4sVersion = "8.19.1" |
There was a problem hiding this comment.
Why has this been upgraded? It looks like we're still on elasticsearch 8.18
https://amigo.gutools.co.uk/recipes/grid-elasticsearch-8
https://github.com/guardian/grid/blob/hybrid-search/docker-compose.yml#L3
| private def makeHybridSearchRequest( | ||
| query: String, | ||
| queryEmbedding: List[Double], | ||
| k: Int, | ||
| numCandidates: Int, | ||
| vecWeight: Double, | ||
| maxScore: Double | ||
| )(implicit logMarker: LogMarker): SearchRequest = { | ||
| val knn = Knn("embedding.cohereEmbedV4.image") | ||
| .queryVector(queryEmbedding) | ||
| .k(k) | ||
| .numCandidates(numCandidates) | ||
| .boost(if (vecWeight > 0.0) 1.0 else 0.0) | ||
|
|
||
| val lexicalWeight = 1.0 - vecWeight | ||
|
|
||
| // KNN results are in [0,1], but BM25 scores are unbounded and typically much | ||
| // larger than cosine similarity, so we need to apply a scaling factor to the | ||
| // BM25 score to bring it to the same range as the cosine similarity. | ||
| val scalingFactor = if (maxScore > 0.0) 1.0 / maxScore else 1.0 | ||
|
|
||
| // We want to apply only one boost if we can help it, so we scale the | ||
| // multi_match boost to be in line with the max_score and the desired | ||
| // lexical_weight/vec_weight balance | ||
| val multiMatchBoost = if (vecWeight > 0.0) (lexicalWeight / vecWeight) * scalingFactor else 1.0 | ||
|
|
||
| logger.info(logMarker, s"Scaling factor for BM25 score is $scalingFactor, multi-match boost is $multiMatchBoost") | ||
|
|
||
| val multiMatchQuery = createMultiMatchQuery(query, boost = Some(multiMatchBoost)) | ||
|
|
||
| ElasticDsl.search(imagesCurrentAlias) | ||
| .bool(BoolQuery().should(Seq(multiMatchQuery, knn))) | ||
| .size(k) | ||
| } |
There was a problem hiding this comment.
I'll defer to your taste on this but personally I find it a little easier to read with the 0/1 edge cases extracted
| private def makeHybridSearchRequest( | |
| query: String, | |
| queryEmbedding: List[Double], | |
| k: Int, | |
| numCandidates: Int, | |
| vecWeight: Double, | |
| maxScore: Double | |
| )(implicit logMarker: LogMarker): SearchRequest = { | |
| val knn = Knn("embedding.cohereEmbedV4.image") | |
| .queryVector(queryEmbedding) | |
| .k(k) | |
| .numCandidates(numCandidates) | |
| .boost(if (vecWeight > 0.0) 1.0 else 0.0) | |
| val lexicalWeight = 1.0 - vecWeight | |
| // KNN results are in [0,1], but BM25 scores are unbounded and typically much | |
| // larger than cosine similarity, so we need to apply a scaling factor to the | |
| // BM25 score to bring it to the same range as the cosine similarity. | |
| val scalingFactor = if (maxScore > 0.0) 1.0 / maxScore else 1.0 | |
| // We want to apply only one boost if we can help it, so we scale the | |
| // multi_match boost to be in line with the max_score and the desired | |
| // lexical_weight/vec_weight balance | |
| val multiMatchBoost = if (vecWeight > 0.0) (lexicalWeight / vecWeight) * scalingFactor else 1.0 | |
| logger.info(logMarker, s"Scaling factor for BM25 score is $scalingFactor, multi-match boost is $multiMatchBoost") | |
| val multiMatchQuery = createMultiMatchQuery(query, boost = Some(multiMatchBoost)) | |
| ElasticDsl.search(imagesCurrentAlias) | |
| .bool(BoolQuery().should(Seq(multiMatchQuery, knn))) | |
| .size(k) | |
| } | |
| private def combineMultiMatchAndKnn( | |
| multiMatchQuery: MultiMatchQuery, | |
| knn: Knn, | |
| vecWeight: Double, | |
| maxScore: Double | |
| )(implicit logMarker: LogMarker): BoolQuery = { | |
| val lexicalWeight = 1.0 - vecWeight | |
| // KNN results are in [0,1], but BM25 scores are unbounded and typically much | |
| // larger than cosine similarity, so we need to apply a scaling factor to the | |
| // BM25 score to bring it to the same range as the cosine similarity. | |
| val scalingFactor = 1.0 / maxScore | |
| // We want to apply only one boost if we can help it, so we scale the | |
| // multi_match boost to be in line with the max_score and the desired | |
| // lexical_weight/vec_weight balance | |
| val multiMatchBoost = (lexicalWeight / vecWeight) * scalingFactor | |
| logger.info(logMarker, s"Scaling factor for BM25 score is $scalingFactor, multi-match boost is $multiMatchBoost") | |
| BoolQuery().should(Seq(multiMatchQuery.boost(multiMatchBoost), knn)) | |
| } | |
| private def makeHybridSearchRequest( | |
| query: String, | |
| queryEmbedding: List[Double], | |
| k: Int, | |
| numCandidates: Int, | |
| vecWeight: Double, | |
| maxScore: Double | |
| )(implicit logMarker: LogMarker): SearchRequest = { | |
| val multiMatch = createMultiMatchQuery(query) | |
| val knn = Knn("embedding.cohereEmbedV4.image") | |
| .queryVector(queryEmbedding) | |
| .k(k) | |
| .numCandidates(numCandidates) | |
| val q = vecWeight match { | |
| case 0.0 => multiMatch | |
| case 1.0 => knn | |
| case _ => combineMultiMatchAndKnn(multiMatch, knn, vecWeight, maxScore) | |
| } | |
| ElasticDsl.search(imagesCurrentAlias) | |
| .size(k) | |
| .query(q) | |
| } |
There was a problem hiding this comment.
Also, a question more for @aliceptve about the data science side.
I notice that in their linear retriever elasticsearch use min/max normalisation for the BM25, and in fact in this example they do min/max normalisation to both the BM25 and knn sides.
We're just doing max normalisation for BM25. This could definitely change the results in certain cases vs min/max, though I'm not totally clear on which is better or how much it matters.
For instance if you have
doc A: knn 0.9, BM25 9
doc B: knn 0.1, BM25 10
And assuming vecWeight = 0.5, i.e. 50/50 split between BM25 and vector
max norm
doc A normed BM25 = 9/10 = 0.9
doc B normed BM25 = 10/10 = 1
doc A score = 0.9 + 0.9 = 1.8
doc B score = 0.1 + 1 = 1.1
Ordering is A > B
min/max norm
If only two docs, normed BM25 values are always 0 and 1
doc A score = 0.9 + 0 = 0.9
doc B score = 0.1 + 1 = 1.1
Ordering is B > A
|
Seen on image-loader, cropper (merged by @ellenmuller 8 minutes and 53 seconds ago) Please check your changes! |
|
Seen on usage, kahuna, auth, metadata-editor (merged by @ellenmuller 9 minutes and 4 seconds ago) Please check your changes! |
|
Seen on collections (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes! |
|
Seen on collections, leases (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes! |
|
Seen on leases (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes! |
|
Seen on collections (merged by @ellenmuller 9 minutes and 17 seconds ago) Please check your changes! |
|
Seen on collections, thrall, media-api (merged by @ellenmuller 9 minutes and 17 seconds ago) Please check your changes! |
Hybrid blending for AI search, matching Kahuna/media-api PR #4738. URL-only param (no UI), default 1.0 (pure KNN, no behaviour change). - vecWeight=1 or absent: pure KNN (existing path) - vecWeight=0: pure BM25 multi_match on AI text - 0 < vecWeight < 1: hybrid (probe + normalised blend)
What does this change?
This PR switches the AI search from a vector KNN search to a hybrid search that combines semantic similarity with lexical BM25 matching.
It also introduces a
vecWeightquery parameter so we can control the boost parameter between vector relevance and keyword relevance from the client. This will help us experiment with different values and eventually settle on a default score.Main changes
vecWeightparameter through Kahuna, the Media API, and search param parsingvecWeightto 0.8 for AI text search when no value is suppliedPlease note that the 'More Like This' search is unchanged :)
How should a reviewer test this change?
I have deployed to TEST, where you can do a semantic search and experiment with adding different vecWeights to the URL, eg
&vecWeight=0.4.Tested? Documented?