
Hybrid search #4738

Merged
ellenmuller merged 23 commits into main from hybrid-search
May 22, 2026

Conversation

@ellenmuller
Contributor

@ellenmuller ellenmuller commented May 14, 2026

What does this change?

This PR switches the AI search from a vector KNN search to a hybrid search that combines semantic similarity with lexical BM25 matching.

It also introduces a vecWeight query parameter so we can control, from the client, how the score is weighted between vector relevance and keyword relevance. This will help us experiment with different values and eventually settle on a default.

Main changes

  • add a new hybrid Elasticsearch query for AI text search (this is the query developed by the Data Science team, see here)
  • combine KNN and multi_match results in a single query
  • normalise BM25 scores using the query's max BM25 score so lexical and vector signals can be blended more predictably
  • thread a new optional vecWeight parameter through Kahuna, the Media API, and search param parsing
  • default vecWeight to 0.8 for AI text search when no value is supplied

Please note that the 'More Like This' search is unchanged :)

How should a reviewer test this change?

I have deployed to TEST, where you can do a semantic search and experiment with adding different vecWeight values to the URL, e.g. &vecWeight=0.4.

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

@ellenmuller ellenmuller added the feature Departmental tracking: work on a new feature label May 14, 2026
@ellenmuller ellenmuller marked this pull request as ready for review May 19, 2026 15:00
@ellenmuller ellenmuller requested a review from a team as a code owner May 19, 2026 15:00
@ellenmuller ellenmuller requested a review from Copilot May 19, 2026 15:01
Contributor

Copilot AI left a comment

Pull request overview

Replaces the AI text search's pure-KNN Elasticsearch call with a hybrid query that combines a knn clause and a BM25 multi_match clause, plumbed through with a new optional vecWeight parameter (default 0.8 on the server) controllable from the client. A preliminary ES request computes the query's max BM25 score so the lexical clause can be rescaled into roughly the same [0, 1] range as the cosine-similarity scores before being blended.

Changes:

  • New ElasticSearch.hybridSearch that runs a max-BM25 probe query, then a bool { should: [multiMatch, knn] } request, scaling the multi-match boost by (lexicalWeight/vecWeight) * (1/maxScore).
  • New vecWeight query parameter parsed in SearchParams, threaded through MediaApi.semanticSearchByText/performAiSearchAndRespond, defaulting to 0.8 when absent.
  • Kahuna wiring: state param, $stateParams plumbing, controller, and mediaApi.search updated to accept and forward vecWeight; elastic4s bumped from 8.18.2 to 8.19.1.
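
The single-boost formula in the first bullet can be checked against an explicit weighted blend. The sketch below assumes the two `should` clauses contribute additively to the final score and that the knn clause keeps a boost of 1; `Doc`, `weighted`, `singleBoost` and `ranking` are illustrative names, not code from this PR. Under those assumptions the single-boost score is the weighted blend divided by vecWeight, so both forms rank documents identically.

```scala
object BoostEquivalenceSketch {
  // Illustrative document with a cosine-similarity score and a raw BM25 score.
  final case class Doc(id: String, knn: Double, bm25: Double)

  // Explicit weighted blend of the vector score and the max-normalised BM25 score.
  def weighted(d: Doc, vecWeight: Double, maxScore: Double): Double =
    vecWeight * d.knn + (1.0 - vecWeight) * d.bm25 / maxScore

  // Single-boost form: knn keeps boost 1, the multi_match boost carries the rest.
  // Assumes 0 < vecWeight and maxScore > 0.
  def singleBoost(d: Doc, vecWeight: Double, maxScore: Double): Double = {
    val multiMatchBoost = ((1.0 - vecWeight) / vecWeight) * (1.0 / maxScore)
    d.knn + multiMatchBoost * d.bm25
  }

  // Ids ordered from highest to lowest score.
  def ranking(docs: Seq[Doc])(score: Doc => Double): Seq[String] =
    docs.sortWith((a, b) => score(a) > score(b)).map(_.id)
}
```

Because the two scores differ only by the constant factor vecWeight, the absolute values change but the ordering does not.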

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| build.sbt | Bumps elastic4s to 8.19.1. |
| media-api/app/lib/elasticsearch/ElasticSearch.scala | Adds hybridSearch with max-BM25 normalisation and combined bool/knn query. |
| media-api/app/lib/elasticsearch/ElasticSearchModel.scala | Adds optional vecWeight to SearchParams and a parseDoubleFromQuery helper. |
| media-api/app/controllers/MediaApi.scala | Threads vecWeight into semanticSearchByText/performAiSearchAndRespond and the search param list; defaults to 0.8. |
| kahuna/public/js/search/index.js | Registers vecWeight as a search URL/state parameter. |
| kahuna/public/js/search/query.js | Initialises ctrl.vecWeight from $stateParams and propagates it on AI-search state transitions. |
| kahuna/public/js/search/results.js | Forwards $stateParams.vecWeight into the API search call. |
| kahuna/public/js/services/api/media-api.js | Adds vecWeight to the search API parameters. |
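
As a rough illustration of the optional-parameter handling summarised for ElasticSearchModel.scala and MediaApi.scala, here is a minimal sketch. It does not reproduce the PR's parseDoubleFromQuery; `parseDoubleParam`, `vecWeightOrDefault` and the `Map[String, Seq[String]]` shape of the query string are assumptions, and silently falling back to the default on an unparseable value is a design choice made only for this example.

```scala
import scala.util.Try

object VecWeightParamSketch {
  // Default stated in the PR description for AI text search.
  val DefaultVecWeight: Double = 0.8

  // Hypothetical helper: take the first value for `key` and parse it as a Double.
  // Returns None if the key is absent, the value is not numeric, or it parses to NaN.
  def parseDoubleParam(params: Map[String, Seq[String]], key: String): Option[Double] =
    params.get(key)
      .flatMap(_.headOption)
      .flatMap(s => Try(s.trim.toDouble).toOption)
      .filterNot(_.isNaN)

  // Assumption: an absent or invalid vecWeight falls back to the default.
  def vecWeightOrDefault(params: Map[String, Seq[String]]): Double =
    parseDoubleParam(params, "vecWeight").getOrElse(DefaultVecWeight)
}
```

A stricter variant could reject invalid or out-of-range values with a client error instead of defaulting.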

Comment thread kahuna/public/js/search/query.js Outdated
Comment thread kahuna/public/js/search/query.js Outdated
Comment thread media-api/app/lib/elasticsearch/ElasticSearchModel.scala Outdated
Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated
Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated
Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated
Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated
Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated
ellenmuller and others added 3 commits May 19, 2026 16:19
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Contributor

@alexduf alexduf left a comment

I've flagged a few things that would be nice to address, either in this PR or in a subsequent one, as I think this should work as-is.

Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated
Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated
Comment on lines +274 to +276
executeAndLog(withSearchQueryTimeout(searchRequest), "hybrid search").map { r =>
  val imageHits = r.result.hits.hits.map(resolveHit).toSeq.flatten.map(i => (i.instance.id, i))
  SearchResults(hits = imageHits, total = imageHits.length, extraCounts = None)
Contributor

I think this whole function could do with a little bit more structure. If you define the right functions you may be able to have code that looks like this pseudo code block:

for {
  maxScore <- fetchMaxBm25Score(query)
  hybridQuery = makeHybridQuery(query, maxScore)
  result <- executeAndLog(hybridQuery)
} yield {
  // build the SearchResults object here
}

This should make this function both more readable and more maintainable
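
To make the shape of that pseudo code concrete, here is a self-contained sketch that compiles on its own. Everything in it is a stand-in: `HybridQuery`, the simplified `SearchResults`, and the stubbed bodies of `fetchMaxBm25Score`, `makeHybridQuery` and `executeAndLog` are assumptions for illustration and do not call Elasticsearch or mirror the real signatures.

```scala
import scala.concurrent.{ExecutionContext, Future}

object HybridSearchFlowSketch {
  // Simplified stand-ins for the real request and result types.
  final case class HybridQuery(text: String, multiMatchBoost: Double)
  final case class SearchResults(hits: Seq[String], total: Int)

  // Stub for the probe request; a real version would ask Elasticsearch for the max BM25 score.
  def fetchMaxBm25Score(query: String): Future[Double] =
    Future.successful(12.5)

  // Pure step: build the query from the probe result (assumes 0 < vecWeight and maxScore > 0).
  def makeHybridQuery(query: String, maxScore: Double, vecWeight: Double): HybridQuery =
    HybridQuery(query, ((1.0 - vecWeight) / vecWeight) / maxScore)

  // Stub for the main request; returns fixed ids instead of real hits.
  def executeAndLog(q: HybridQuery): Future[Seq[String]] =
    Future.successful(Seq("image-1", "image-2"))

  // The two asynchronous steps are sequenced, with the pure step in between.
  def hybridSearch(query: String, vecWeight: Double)(implicit ec: ExecutionContext): Future[SearchResults] =
    for {
      maxScore <- fetchMaxBm25Score(query)
      hybridQuery = makeHybridQuery(query, maxScore, vecWeight)
      hits <- executeAndLog(hybridQuery)
    } yield SearchResults(hits, hits.length)
}
```

Keeping `makeHybridQuery` pure means the boost arithmetic can be unit tested without an Elasticsearch instance.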

Contributor Author

Ah that's so nice! Have tried doing this here 44eda6f - haven't tested this refactor on TEST yet but that's one for tomorrow :)

Comment thread media-api/app/lib/elasticsearch/ElasticSearchModel.scala
Comment thread media-api/app/lib/elasticsearch/ElasticSearch.scala Outdated
ellenmuller and others added 2 commits May 20, 2026 12:12
Co-authored-by: Andrew Nowak <10963046+andrew-nowak@users.noreply.github.com>
Member

@joelochlann joelochlann left a comment

I think this looks good - my comments are mainly points for discussion, not blockers - but is the elasticsearch library upgrade intended?

Comment thread build.sbt
  val awsSdkVersion = "1.12.470"
  val awsSdkV2Version = "2.42.25"
- val elastic4sVersion = "8.18.2"
+ val elastic4sVersion = "8.19.1"

Comment on lines +224 to +257
private def makeHybridSearchRequest(
  query: String,
  queryEmbedding: List[Double],
  k: Int,
  numCandidates: Int,
  vecWeight: Double,
  maxScore: Double
)(implicit logMarker: LogMarker): SearchRequest = {
  val knn = Knn("embedding.cohereEmbedV4.image")
    .queryVector(queryEmbedding)
    .k(k)
    .numCandidates(numCandidates)
    .boost(if (vecWeight > 0.0) 1.0 else 0.0)

  val lexicalWeight = 1.0 - vecWeight

  // KNN results are in [0,1], but BM25 scores are unbounded and typically much
  // larger than cosine similarity, so we need to apply a scaling factor to the
  // BM25 score to bring it to the same range as the cosine similarity.
  val scalingFactor = if (maxScore > 0.0) 1.0 / maxScore else 1.0

  // We want to apply only one boost if we can help it, so we scale the
  // multi_match boost to be in line with the max_score and the desired
  // lexical_weight/vec_weight balance
  val multiMatchBoost = if (vecWeight > 0.0) (lexicalWeight / vecWeight) * scalingFactor else 1.0

  logger.info(logMarker, s"Scaling factor for BM25 score is $scalingFactor, multi-match boost is $multiMatchBoost")

  val multiMatchQuery = createMultiMatchQuery(query, boost = Some(multiMatchBoost))

  ElasticDsl.search(imagesCurrentAlias)
    .bool(BoolQuery().should(Seq(multiMatchQuery, knn)))
    .size(k)
}
Member

I'll defer to your taste on this but personally I find it a little easier to read with the 0/1 edge cases extracted

Suggested change
private def combineMultiMatchAndKnn(
  multiMatchQuery: MultiMatchQuery,
  knn: Knn,
  vecWeight: Double,
  maxScore: Double
)(implicit logMarker: LogMarker): BoolQuery = {
  val lexicalWeight = 1.0 - vecWeight
  // KNN results are in [0,1], but BM25 scores are unbounded and typically much
  // larger than cosine similarity, so we need to apply a scaling factor to the
  // BM25 score to bring it to the same range as the cosine similarity.
  val scalingFactor = 1.0 / maxScore
  // We want to apply only one boost if we can help it, so we scale the
  // multi_match boost to be in line with the max_score and the desired
  // lexical_weight/vec_weight balance
  val multiMatchBoost = (lexicalWeight / vecWeight) * scalingFactor
  logger.info(logMarker, s"Scaling factor for BM25 score is $scalingFactor, multi-match boost is $multiMatchBoost")
  BoolQuery().should(Seq(multiMatchQuery.boost(multiMatchBoost), knn))
}

private def makeHybridSearchRequest(
  query: String,
  queryEmbedding: List[Double],
  k: Int,
  numCandidates: Int,
  vecWeight: Double,
  maxScore: Double
)(implicit logMarker: LogMarker): SearchRequest = {
  val multiMatch = createMultiMatchQuery(query)
  val knn = Knn("embedding.cohereEmbedV4.image")
    .queryVector(queryEmbedding)
    .k(k)
    .numCandidates(numCandidates)
  val q = vecWeight match {
    case 0.0 => multiMatch
    case 1.0 => knn
    case _ => combineMultiMatchAndKnn(multiMatch, knn, vecWeight, maxScore)
  }
  ElasticDsl.search(imagesCurrentAlias)
    .size(k)
    .query(q)
}
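
One thing worth noting about extracting the edge cases: the suggested combineMultiMatchAndKnn divides by maxScore directly, whereas the original guarded with `maxScore > 0.0`. If the probe can report 0 (for example when nothing matches lexically, which is an assumption about its behaviour), `1.0 / maxScore` becomes Infinity in Double arithmetic. A possible guard, sketched as a pure function with the hypothetical name `multiMatchBoost`:

```scala
object MultiMatchBoostSketch {
  // Returns the multi_match boost for the hybrid case, or None when no finite
  // boost exists: pure BM25 (vecWeight <= 0), pure KNN (vecWeight >= 1),
  // or a probe that found no positive BM25 score.
  def multiMatchBoost(vecWeight: Double, maxScore: Double): Option[Double] =
    if (vecWeight <= 0.0 || vecWeight >= 1.0 || maxScore <= 0.0) None
    else Some(((1.0 - vecWeight) / vecWeight) / maxScore)
}
```

A caller could treat `None` as the signal to fall back to a single-clause query rather than building the bool query.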

Member

Also, a question more for @aliceptve about the data science side.

I notice that in their linear retriever Elasticsearch uses min/max normalisation for the BM25 score, and in fact in this example they apply min/max normalisation to both the BM25 and knn sides.

We're just doing max normalisation for BM25. This could definitely change the results in certain cases vs min/max, though I'm not totally clear on which is better or how much it matters.

For instance if you have
doc A: knn 0.9, BM25 9
doc B: knn 0.1, BM25 10

And assuming vecWeight = 0.5, i.e. a 50/50 split between BM25 and vector (so the common 0.5 factor is dropped below, since it doesn't affect the ordering)

max norm

doc A normed BM25 = 9/10 = 0.9
doc B normed BM25 = 10/10 = 1
doc A score = 0.9 + 0.9 = 1.8
doc B score = 0.1 + 1 = 1.1
Ordering is A > B

min/max norm

If only two docs, normed BM25 values are always 0 and 1
doc A score = 0.9 + 0 = 0.9
doc B score = 0.1 + 1 = 1.1
Ordering is B > A
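
The same two-document example can be reproduced in a few lines. This is only a sketch of the arithmetic above: `Doc`, `maxNorm`, `minMaxNorm` and `rank` are illustrative names, equal weights are assumed, and it normalises in memory rather than inside Elasticsearch.

```scala
object NormalisationSketch {
  final case class Doc(id: String, knn: Double, bm25: Double)

  // Max normalisation: divide every score by the largest score.
  def maxNorm(scores: Seq[Double]): Seq[Double] = {
    val max = scores.foldLeft(Double.NegativeInfinity)((a, b) => math.max(a, b))
    scores.map(s => if (max > 0.0) s / max else 0.0)
  }

  // Min/max normalisation: rescale so the smallest score is 0 and the largest is 1.
  def minMaxNorm(scores: Seq[Double]): Seq[Double] = {
    val min = scores.foldLeft(Double.PositiveInfinity)((a, b) => math.min(a, b))
    val max = scores.foldLeft(Double.NegativeInfinity)((a, b) => math.max(a, b))
    scores.map(s => if (max > min) (s - min) / (max - min) else 0.0)
  }

  // Rank by knn + normalised BM25, highest first (equal weights on both sides).
  def rank(docs: Seq[Doc], norm: Seq[Double] => Seq[Double]): Seq[String] = {
    val scored = docs.zip(norm(docs.map(_.bm25))).map { case (d, b) => (d.id, d.knn + b) }
    scored.sortWith(_._2 > _._2).map(_._1)
  }
}
```

With doc A (knn 0.9, BM25 9) and doc B (knn 0.1, BM25 10), `rank` with `maxNorm` puts A first and `rank` with `minMaxNorm` puts B first, matching the orderings worked out by hand.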

@gu-prout

gu-prout Bot commented May 22, 2026

Seen on image-loader, cropper (merged by @ellenmuller 8 minutes and 53 seconds ago) Please check your changes!

@gu-prout

gu-prout Bot commented May 22, 2026

Seen on usage, kahuna, auth, metadata-editor (merged by @ellenmuller 9 minutes and 4 seconds ago) Please check your changes!

@gu-prout

gu-prout Bot commented May 22, 2026

Seen on collections (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes!

@gu-prout

gu-prout Bot commented May 22, 2026

Seen on collections, leases (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes!

@gu-prout

gu-prout Bot commented May 22, 2026

Seen on leases (merged by @ellenmuller 9 minutes and 12 seconds ago) Please check your changes!

@gu-prout

gu-prout Bot commented May 22, 2026

Seen on collections (merged by @ellenmuller 9 minutes and 17 seconds ago) Please check your changes!

@gu-prout

gu-prout Bot commented May 22, 2026

Seen on collections, thrall, media-api (merged by @ellenmuller 9 minutes and 17 seconds ago) Please check your changes!

paperboyo added a commit that referenced this pull request May 25, 2026
Hybrid blending for AI search, matching Kahuna/media-api PR #4738.
URL-only param (no UI), default 1.0 (pure KNN, no behaviour change).

- vecWeight=1 or absent: pure KNN (existing path)
- vecWeight=0: pure BM25 multi_match on AI text
- 0 < vecWeight < 1: hybrid (probe + normalised blend)
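
The three cases listed in that commit message could be modelled as a small ADT, roughly as follows. This is a sketch only: `SearchMode`, `PureKnn`, `PureBm25`, `Hybrid` and `modeFor` are hypothetical names, and treating out-of-range values as the nearest pure mode is an assumption the commit message does not state.

```scala
object SearchModeSketch {
  sealed trait SearchMode
  case object PureKnn extends SearchMode
  case object PureBm25 extends SearchMode
  final case class Hybrid(vecWeight: Double) extends SearchMode

  // An absent vecWeight defaults to 1.0 (pure KNN), as the commit message describes.
  // Assumption: values outside [0, 1] are treated as the nearest pure mode.
  def modeFor(vecWeight: Option[Double]): SearchMode =
    vecWeight.getOrElse(1.0) match {
      case w if w >= 1.0 => PureKnn
      case w if w <= 0.0 => PureBm25
      case w             => Hybrid(w)
    }
}
```

Only the `Hybrid` case would need the probe request and the normalised blend.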

6 participants