Hybrid search by ellenmuller · Pull Request #4738 · guardian/grid

ellenmuller · 2026-05-14T11:10:40Z

What does this change?

This PR switches the AI search from a vector KNN search to a hybrid search that combines semantic similarity with lexical BM25 matching.

It also introduces a vecWeight query parameter so we can control the balance between vector relevance and keyword relevance from the client. This will help us experiment with different values and eventually settle on a default value.

Main changes

- add a new hybrid Elasticsearch query for AI text search (this is the query developed by the Data Science team, see here)
- combine KNN and multi_match results in a single query
- normalise BM25 scores using the query's max BM25 score so lexical and vector signals can be blended more predictably
- thread a new optional vecWeight parameter through Kahuna, the Media API, and search param parsing
- default vecWeight to 0.8 for AI text search when no value is supplied

Please note that the 'More Like This' search is unchanged :)

How should a reviewer test this change?

I have deployed to TEST, where you can do a semantic search and experiment with adding different vecWeights to the URL, eg &vecWeight=0.4.

Tested? Documented?

locally by committer
locally by Guardian reviewer
on the Guardian's TEST environment
relevant documentation added or amended (if needed)

github-actions · 2026-05-14T11:17:48Z

Deploy build 14494 of `media-service::grid::all` to TEST

From guardian/actions-riff-raff.

Copilot

Pull request overview

Replaces the AI text search's pure-KNN Elasticsearch call with a hybrid query that combines a knn clause and a BM25 multi_match clause, plumbed through with a new optional vecWeight parameter (default 0.8 on the server) controllable from the client. A preliminary ES request computes the query's max BM25 score so the lexical clause can be rescaled into roughly the same [0, 1] range as the cosine-similarity scores before being blended.

Changes:

- New ElasticSearch.hybridSearch that runs a max-BM25 probe query, then a bool { should: [multiMatch, knn] } request, scaling the multi-match boost by (lexicalWeight/vecWeight) * (1/maxScore).
- New vecWeight query parameter parsed in SearchParams, threaded through MediaApi.semanticSearchByText/performAiSearchAndRespond, defaulting to 0.8 when absent.
- Kahuna wiring: state param, $stateParams plumbing, controller, and mediaApi.search updated to accept and forward vecWeight; elastic4s bumped from 8.18.2 to 8.19.1.
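
One way to read the boost formula in the first item, as a sketch rather than a statement of the author's derivation: assume the intended blend is a weighted sum of the KNN score and the max-normalised BM25 score, with $w$ standing for vecWeight and $m$ for the probe query's max BM25 score (both symbols are introduced here for illustration). Dividing by $w$ leaves the ranking unchanged for $0 < w \le 1$, so a single boost on the multi_match clause is enough:

```latex
\begin{aligned}
S(d) &= w \, s_{\mathrm{knn}}(d) + (1 - w) \, \frac{s_{\mathrm{bm25}}(d)}{m} \\[4pt]
\frac{S(d)}{w} &= s_{\mathrm{knn}}(d)
  + \underbrace{\frac{1 - w}{w} \cdot \frac{1}{m}}_{\text{multi\_match boost}} \, s_{\mathrm{bm25}}(d),
  \qquad 0 < w \le 1
\end{aligned}
```

The case $w = 0$ cannot be rescaled this way, which is presumably why it needs separate handling.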

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| build.sbt | Bumps elastic4s to 8.19.1. |
| media-api/app/lib/elasticsearch/ElasticSearch.scala | Adds `hybridSearch` with max-BM25 normalisation and combined bool/knn query. |
| media-api/app/lib/elasticsearch/ElasticSearchModel.scala | Adds optional `vecWeight` to `SearchParams` and a `parseDoubleFromQuery` helper. |
| media-api/app/controllers/MediaApi.scala | Threads `vecWeight` into `semanticSearchByText`/`performAiSearchAndRespond` and the search param list; defaults to 0.8. |
| kahuna/public/js/search/index.js | Registers `vecWeight` as a search URL/state parameter. |
| kahuna/public/js/search/query.js | Initialises `ctrl.vecWeight` from `$stateParams` and propagates it on AI-search state transitions. |
| kahuna/public/js/search/results.js | Forwards `$stateParams.vecWeight` into the API search call. |
| kahuna/public/js/services/api/media-api.js | Adds `vecWeight` to the search API parameters. |

alexduf

I've flagged a few things that would be nice to address, either in this PR or in a subsequent one, as I think this should work as-is.

alexduf · 2026-05-19T15:38:28Z

+      executeAndLog(withSearchQueryTimeout(searchRequest), "hybrid search").map { r =>
+        val imageHits = r.result.hits.hits.map(resolveHit).toSeq.flatten.map(i => (i.instance.id, i))
+        SearchResults(hits = imageHits, total = imageHits.length, extraCounts = None)


I think this whole function could do with a little bit more structure. If you define the right functions you may be able to have code that looks like this pseudo code block:

    for {
      maxScore <- fetchMaxBm25Score(query)
      hybridQuery = makeHybridQuery(query, maxScore)
      result <- executeAndLog(hybridQuery)
    } yield {
      // build the SearchResults object here
    }

This should make this function both more readable and more maintainable
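
To make the shape of that pseudo code concrete, here is a minimal runnable sketch with stubbed-out Elasticsearch calls. Everything in it is an assumption for illustration: `HybridQuery`, the stub return values, and the `runBlocking` helper are hypothetical, and the real functions would take a log marker and return elastic4s types.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object HybridSearchSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for the real elastic4s SearchRequest.
  final case class HybridQuery(text: String, multiMatchBoost: Double)

  // Stub for the max-BM25 probe query; a real version would call Elasticsearch.
  def fetchMaxBm25Score(query: String): Future[Double] =
    Future.successful(12.5)

  // Pure function: same boost arithmetic as the diff in this PR.
  def makeHybridQuery(query: String, vecWeight: Double, maxScore: Double): HybridQuery = {
    val scalingFactor = if (maxScore > 0.0) 1.0 / maxScore else 1.0
    val boost =
      if (vecWeight > 0.0) ((1.0 - vecWeight) / vecWeight) * scalingFactor else 1.0
    HybridQuery(query, boost)
  }

  // Stub for the search call; returns made-up image ids.
  def executeAndLog(q: HybridQuery): Future[Seq[String]] =
    Future.successful(Seq("image-1", "image-2"))

  // The for-comprehension sequences the probe, the pure query build, and the search.
  def hybridSearch(query: String, vecWeight: Double): Future[(HybridQuery, Seq[String])] =
    for {
      maxScore <- fetchMaxBm25Score(query)
      hybridQuery = makeHybridQuery(query, vecWeight, maxScore)
      hits <- executeAndLog(hybridQuery)
    } yield (hybridQuery, hits)

  // Blocking helper for trying the sketch in a script; not something to do in a controller.
  def runBlocking(query: String, vecWeight: Double): (HybridQuery, Seq[String]) =
    Await.result(hybridSearch(query, vecWeight), 5.seconds)
}
```

Keeping `makeHybridQuery` pure means the boost arithmetic can be unit tested without an Elasticsearch instance.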

Ah that's so nice! Have tried doing this here 44eda6f - haven't tested this refactor on TEST yet but that's one for tomorrow :)

joelochlann

I think this looks good - my comments are mainly points for discussion, not blockers - but is the elasticsearch library upgrade intended?

joelochlann · 2026-05-21T16:28:33Z

 val awsSdkVersion = "1.12.470"
 val awsSdkV2Version = "2.42.25"
-val elastic4sVersion = "8.18.2"
+val elastic4sVersion = "8.19.1"


Why has this been upgraded? It looks like we're still on elasticsearch 8.18

https://amigo.gutools.co.uk/recipes/grid-elasticsearch-8
https://github.com/guardian/grid/blob/hybrid-search/docker-compose.yml#L3

joelochlann · 2026-05-22T11:41:08Z

+  private def makeHybridSearchRequest(
+    query: String,
+    queryEmbedding: List[Double],
+    k: Int,
+    numCandidates: Int,
+    vecWeight: Double,
+    maxScore: Double
+  )(implicit logMarker: LogMarker): SearchRequest = {
+    val knn = Knn("embedding.cohereEmbedV4.image")
+      .queryVector(queryEmbedding)
+      .k(k)
+      .numCandidates(numCandidates)
+      .boost(if (vecWeight > 0.0) 1.0 else 0.0)
+
+    val lexicalWeight = 1.0 - vecWeight
+
+    // KNN results are in [0,1], but BM25 scores are unbounded and typically much
+    // larger than cosine similarity, so we need to apply a scaling factor to the
+    // BM25 score to bring it to the same range as the cosine similarity.
+    val scalingFactor = if (maxScore > 0.0) 1.0 / maxScore else 1.0
+
+    //    We want to apply only one boost if we can help it, so we scale the
+    //    multi_match boost to be in line with the max_score and the desired
+    //    lexical_weight/vec_weight balance
+    val multiMatchBoost = if (vecWeight > 0.0) (lexicalWeight / vecWeight) * scalingFactor else 1.0
+
+    logger.info(logMarker, s"Scaling factor for BM25 score is $scalingFactor, multi-match boost is $multiMatchBoost")
+
+    val multiMatchQuery = createMultiMatchQuery(query, boost = Some(multiMatchBoost))
+
+    ElasticDsl.search(imagesCurrentAlias)
+      .bool(BoolQuery().should(Seq(multiMatchQuery, knn)))
+      .size(k)
+  }


I'll defer to your taste on this but personally I find it a little easier to read with the 0/1 edge cases extracted

Suggested change

+  private def combineMultiMatchAndKnn(
+    multiMatchQuery: MultiMatchQuery,
+    knn: Knn,
+    vecWeight: Double,
+    maxScore: Double
+  )(implicit logMarker: LogMarker): BoolQuery = {
+    val lexicalWeight = 1.0 - vecWeight
+
+    // KNN results are in [0,1], but BM25 scores are unbounded and typically much
+    // larger than cosine similarity, so we need to apply a scaling factor to the
+    // BM25 score to bring it to the same range as the cosine similarity.
+    val scalingFactor = 1.0 / maxScore
+
+    // We want to apply only one boost if we can help it, so we scale the
+    // multi_match boost to be in line with the max_score and the desired
+    // lexical_weight/vec_weight balance
+    val multiMatchBoost = (lexicalWeight / vecWeight) * scalingFactor
+
+    logger.info(logMarker, s"Scaling factor for BM25 score is $scalingFactor, multi-match boost is $multiMatchBoost")
+
+    BoolQuery().should(Seq(multiMatchQuery.boost(multiMatchBoost), knn))
+  }
+
+  private def makeHybridSearchRequest(
+    query: String,
+    queryEmbedding: List[Double],
+    k: Int,
+    numCandidates: Int,
+    vecWeight: Double,
+    maxScore: Double
+  )(implicit logMarker: LogMarker): SearchRequest = {
+    val multiMatch = createMultiMatchQuery(query)
+    val knn = Knn("embedding.cohereEmbedV4.image")
+      .queryVector(queryEmbedding)
+      .k(k)
+      .numCandidates(numCandidates)
+
+    val q = vecWeight match {
+      case 0.0 => multiMatch
+      case 1.0 => knn
+      case _ => combineMultiMatchAndKnn(multiMatch, knn, vecWeight, maxScore)
+    }
+
+    ElasticDsl.search(imagesCurrentAlias)
+      .size(k)
+      .query(q)
+  }

Also, a question more for @aliceptve about the data science side.

I notice that in their linear retriever Elasticsearch uses min/max normalisation for BM25, and in fact in this example they apply min/max normalisation to both the BM25 and knn sides.

We're just doing max normalisation for BM25. This could definitely change the results in certain cases vs min/max, though I'm not totally clear on which is better or how much it matters.

For instance if you have
doc A: knn 0.9, BM25 9
doc B: knn 0.1, BM25 10

And assuming vecWeight = 0.5, i.e. 50/50 split between BM25 and vector

max norm

doc A normed BM25 = 9/10 = 0.9
doc B normed BM25 = 10/10 = 1
doc A score = 0.9 + 0.9 = 1.8
doc B score = 0.1 + 1 = 1.1
Ordering is A > B

min/max norm

If only two docs, normed BM25 values are always 0 and 1
doc A score = 0.9 + 0 = 0.9
doc B score = 0.1 + 1 = 1.1
Ordering is B > A
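
The two-document example above can be reproduced with a small self-contained sketch. The `Doc` type and function names are hypothetical, and the blend here applies the vecWeight explicitly (0.5 on each side), which halves the sums quoted above but gives the same orderings.

```scala
object NormalisationSketch {
  final case class Doc(id: String, knn: Double, bm25: Double)

  // Max normalisation: divide by the largest score (BM25 scores are non-negative).
  def maxNorm(scores: Seq[Double]): Seq[Double] = {
    val max = scores.foldLeft(0.0)((a, b) => math.max(a, b))
    scores.map(s => if (max > 0.0) s / max else 0.0)
  }

  // Min/max normalisation: rescale so the smallest score is 0 and the largest is 1.
  def minMaxNorm(scores: Seq[Double]): Seq[Double] =
    if (scores.isEmpty) Seq.empty
    else {
      val min = scores.reduce((a, b) => math.min(a, b))
      val max = scores.reduce((a, b) => math.max(a, b))
      scores.map(s => if (max > min) (s - min) / (max - min) else 0.0)
    }

  // Blend knn and normalised BM25, then return doc ids from best to worst.
  def rank(docs: Seq[Doc], vecWeight: Double, norm: Seq[Double] => Seq[Double]): Seq[String] = {
    val normed = norm(docs.map(_.bm25))
    docs.zip(normed)
      .map { case (d, b) => d.id -> (vecWeight * d.knn + (1.0 - vecWeight) * b) }
      .sortWith((x, y) => x._2 > y._2)
      .map(_._1)
  }

  val example: Seq[Doc] = Seq(Doc("A", knn = 0.9, bm25 = 9.0), Doc("B", knn = 0.1, bm25 = 10.0))
}
```

With `example` and `vecWeight = 0.5`, `rank(example, 0.5, maxNorm)` puts A first and `rank(example, 0.5, minMaxNorm)` puts B first, matching the orderings worked out by hand.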

gu-prout · 2026-05-22T13:36:58Z

Seen on image-loader, cropper (merged by @ellenmuller 8 minutes and 53 seconds ago) Please check your changes!

gu-prout · 2026-05-22T13:37:09Z

Seen on usage, kahuna, auth, metadata-editor (merged by @ellenmuller 9 minutes and 4 seconds ago) Please check your changes!

gu-prout · 2026-05-22T13:37:17Z

Seen on collections (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes!

gu-prout · 2026-05-22T13:37:17Z

Seen on collections, leases (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes!

gu-prout · 2026-05-22T13:37:17Z

Seen on leases (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes!

gu-prout · 2026-05-22T13:37:22Z

Seen on collections (merged by @ellenmuller 9 minutes and 17 seconds ago) Please check your changes!

gu-prout · 2026-05-22T13:37:23Z

Seen on collections, thrall, media-api (merged by @ellenmuller 9 minutes and 17 seconds ago) Please check your changes!

Hybrid blending for AI search, matching Kahuna/media-api PR #4738. URL-only param (no UI), default 1.0 (pure KNN, no behaviour change).
- vecWeight=1 or absent: pure KNN (existing path)
- vecWeight=0: pure BM25 multi_match on AI text
- 0 < vecWeight < 1: hybrid (probe + normalised blend)
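
The three cases above could be modelled roughly as follows. This is a sketch only: `SearchMode`, `parseVecWeight`, and `modeFor` are hypothetical names, and falling back to the default for non-numeric or out-of-range values is an assumption rather than something this PR states.

```scala
import scala.util.Try

object VecWeightSketch {
  sealed trait SearchMode
  case object PureKnn extends SearchMode
  case object PureLexical extends SearchMode
  final case class Hybrid(vecWeight: Double) extends SearchMode

  // Hypothetical parser: absent, non-numeric, or out-of-range values use the default.
  def parseVecWeight(raw: Option[String], default: Double = 1.0): Double =
    raw
      .flatMap(s => Try(s.trim.toDouble).toOption)
      .filter(w => w >= 0.0 && w <= 1.0)
      .getOrElse(default)

  // Exact matches on 0.0 and 1.0 pick the single-clause paths; anything else blends.
  def modeFor(vecWeight: Double): SearchMode = vecWeight match {
    case 1.0 => PureKnn
    case 0.0 => PureLexical
    case w   => Hybrid(w)
  }
}
```

For example, `modeFor(parseVecWeight(Some("0.4")))` yields `Hybrid(0.4)`, while a missing parameter yields `PureKnn`.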

alexduf and others added 3 commits May 6, 2026 15:17

WIP: hybrid search

14965b0

Merge branch 'main' of github.com:guardian/grid into hybrid-search

3fbfbac

move to hybrid search

9a89253

ellenmuller added 2 commits May 14, 2026 12:24

add comments and handle divide by 0 case

0096880

Merge branch 'main' of github.com:guardian/grid into hybrid-search

581524b

ellenmuller added the feature label May 14, 2026

ellenmuller added 6 commits May 14, 2026 12:29

oops remove scratch that had example cerebro request

279a439

not working, intermediate commit

cf10157

add vecweight parameter to control hybrid search

6587acd

remove trailing comma that broke kahuna linting

1b16745

Merge branch 'main' of github.com:guardian/grid into hybrid-search

f364568

total is 200 in semantic mode, so matches is the correct number

90416b4

ellenmuller marked this pull request as ready for review May 19, 2026 15:00

ellenmuller requested a review from a team as a code owner May 19, 2026 15:00

ellenmuller requested a review from Copilot May 19, 2026 15:01

Copilot AI reviewed May 19, 2026

ellenmuller and others added 3 commits May 19, 2026 16:19

Apply suggestions from code review

7f91639

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

c4e8b8c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

53a9cbc

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

alexduf approved these changes May 19, 2026

ellenmuller added 5 commits May 19, 2026 17:09

account for vecWeight = 0.0

8a9f214

action small comments from Alex

12fabd1

create match query helper function

a9a1d4b

refactor into for-comprehension

44eda6f

put comments back in (copilot removed them)

51c06e9

andrew-nowak reviewed May 20, 2026

Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated

ellenmuller and others added 2 commits May 20, 2026 12:12

set vecWeight to 1.0 if not in url param

8e409d4

Update media-api/app/lib/elasticsearch/ElasticSearch.scala

f2bfde7

Co-authored-by: Andrew Nowak <10963046+andrew-nowak@users.noreply.github.com>

joelochlann approved these changes May 22, 2026

gu-prout Bot added Seen-on-auth Seen-on-usage Seen-on-kahuna Seen-on-metadata-editor and removed Pending-on-metadata-editor Pending-on-usage Pending-on-kahuna Pending-on-auth labels May 22, 2026

gu-prout Bot added Seen-on-collections Seen-on-leases Pending-on-collections and removed Pending-on-collections Pending-on-leases Seen-on-collections labels May 22, 2026

gu-prout Bot added Seen-on-collections and removed Pending-on-collections labels May 22, 2026

gu-prout Bot added Seen-on-media-api Seen-on-thrall and removed Pending-on-media-api Pending-on-thrall labels May 22, 2026
