
Embed incoming images with cohere and write to S3 vector store#4539

Merged

ellenmuller merged 36 commits into main from em-embed-to-s3-vector-store on Oct 30, 2025

Conversation

ellenmuller (Contributor) commented Oct 8, 2025

What does this change?

This PR writes image vector embeddings to the S3 Vector Store when new images are uploaded. The images are embedded using Cohere's Embed v3 model (`cohere.embed-english-v3`) via Bedrock.

Changes

  • Integrated a new class, Embedder, into the image upload pipeline. It has two dependencies: the Bedrock class and the S3Vectors class.
  • Implemented the config setting s3.vectors.shouldEmbed (accessed as shouldEmbed in Scala code) to control embedding generation for safe collaboration with the BBC. This setting is also added to the script that generates the config locally (dev/script/generate-config/service-config.js).
  • On failure, we log an error message but don't halt the upload, so the upload pipeline is unaffected.
  • Added a script, dev/script/get-s3-vector-store-records.sh, which returns the contents of the test, dev or prod vector store. Annoyingly, we can't just inspect the records in the AWS console because that isn't supported yet!
  • Images that are larger than 5MB or are TIFFs are skipped rather than embedded.
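For reviewers skimming the diff, the size/format guard described above boils down to something like the following. This is a standalone sketch only: `EmbedGuard`, its method name, and the MIME-type check are illustrative, not the actual `Embedder` code.

```scala
// Illustrative sketch of the pre-embedding guard described above.
// The real logic lives in Embedder; these names are hypothetical.
object EmbedGuard {
  // Bedrock/Cohere reject image payloads over 5MB
  val MaxBytes: Long = 5L * 1024 * 1024

  // TIFFs aren't accepted by the model, and oversized files would fail the
  // Bedrock call, so both are skipped rather than halting the upload.
  def eligible(mimeType: String, sizeBytes: Long): Boolean =
    mimeType != "image/tiff" && sizeBytes <= MaxBytes
}
```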

Future Work

The following items are planned for subsequent PRs:

  • Convert incompatible images: use the BrowserViewableImage for TIFF files (these are converted to JPEGs/PNGs for the browser) and downscale images >5MB before embedding
  • Semantic search: Add cosine similarity-based semantic search to the GET /images search endpoint (likely activated by a feature switch)
  • Metadata strategy: Determine which metadata fields should be stored in the S3 Vector Store to add filtering capabilities in semantic search mode (not strictly necessary if we keep the MVP super minimal, but if time allows it would be good to add this after we have our search endpoint)
  • Backfill: generate embeddings for historical images (likely backfilled using our embedding lambda in the AI AWS account; we'll discuss in detail with someone from the Media & Feeds team before we do anything on this, obviously!). To mimic the eventual production environment, we need to be able to search over a large enough dataset.
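The cosine similarity at the core of the planned semantic search is simple enough to sketch here. This is a standalone illustration, not code from this PR (and in practice the S3 Vector Store may compute the distance metric server-side):

```scala
// Cosine similarity between two embedding vectors, as would underpin
// the planned semantic search endpoint. Illustration only.
object CosineSimilarity {
  def apply(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    var dot, normA, normB = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i)
      normA += a(i) * a(i)
      normB += b(i) * b(i)
      i += 1
    }
    // 1.0 for identical directions, 0.0 for orthogonal vectors
    dot / (math.sqrt(normA) * math.sqrt(normB))
  }
}
```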

How should a reviewer test this change?

Run locally and upload a new image. You should see logging that your image has been embedded and uploaded to the S3 vector store. You can then run ./dev/script/get-s3-vector-store-records.sh dev and check if the image ID of the image you've just uploaded is in the vector store.

You can go through the same steps on TEST, and run ./dev/script/get-s3-vector-store-records.sh test to confirm instead. I've added CloudFormation permissions for Bedrock and the S3 vector store so you can safely deploy to TEST (https://github.com/guardian/editorial-tools-platform/pull/987).

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

github-actions bot commented Oct 9, 2025

@ellenmuller ellenmuller changed the title from "Em embed to s3 vector store" to "Embed incoming images with cohere and write to S3 vector store" Oct 16, 2025
@ellenmuller ellenmuller marked this pull request as ready for review October 16, 2025 13:34
@ellenmuller ellenmuller requested a review from a team as a code owner October 16, 2025 13:34
class Bedrock(config: CommonConfig)
extends AwsClientV2BuilderUtils {

// TODO: figure out what the more usual pattern for turning off localstack behaviour is
Contributor Author

Please do advise on this if there is a better way of doing this!

Comment thread common-lib/src/main/scala/com/gu/mediaservice/lib/aws/S3Vectors.scala Outdated
Comment thread common-lib/src/main/scala/com/gu/mediaservice/lib/aws/Embedder.scala Outdated
Comment thread common-lib/src/main/scala/com/gu/mediaservice/model/ImageEmbedding.scala Outdated
val body = Bedrock.BedrockRequest(
input_type = "image",
embedding_types = List("float"),
images = List(s"data:image/jpg;base64,$base64EncodedImage")
Member

Ah! Can we be certain that every image is a JPEG? I think the Grid does technically support PNGs, though I feel like they are rare...

Member

well remembered! no you can't, we'll see both .png and .tiff files

It looks like bedrock/cohere won't accept .tiff files, so we shouldn't send those. Fortunately, for other reasons, during the pipeline we'll create a jpg/png "browser-viewable" version of a tiff, so we can send that instead (that'll require slightly different wiring in Uploader/Projector). We can look at that next Weds?

(Also it looks like they only accept images up to 5MB, so we'll need to deal with that too)

Member

Just to play "ruthless MVP's advocate" for a sec, what if initially we could just filter out images that are not JPEGs and over 5MB? (Depending on how big a proportion of incoming images these constitute)

This is obviously a terrible idea for the real production feature, but until we do a backfill we're going to have swathes of unembedded images anyway so the prototype feature will be: "of what we've embedded so far, here are the best 30". Given that, maaaaybe it's OK to kick this down the road a little, especially because we may shortly be upgrading to Cohere v4 and having to re-embed everything

Contributor Author

Agreed, I think we can filter out ones that are >5MB or not JPEG. I think we should log the file size and the file type though, so we can see how many images fall into those categories :)

Member

Of a random sample of 38,864 PROD images (the 45,398 in the image bucket with hash ending 00, minus those with no corresponding elasticsearch record) we have:

38,291 JPEG (98.5%)
490 PNG (1.3%)
83 TIFF (0.2%)

Member

From the same sample, 10,326 (27%) are over 5MB, according to

jq 'select(.data.source.size > 5000000) | .data.id' metadata.jsonl | wc -l

Member

Also, I've confirmed that requests with images over 5MB will fail: https://github.com/guardian/data-science-embedding-experiments/pull/8

Member

FYI, apparently even Cohere v4 has a 5MB limit, in addition to the "max 2,458,624 pixels before downsampling": "The original image file type must be in a png, jpeg, webp, or gif format and can be up to 5 MB in size" https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed-v4.html

.accept("*/*")
.body(SdkBytes.fromUtf8String(jsonBody))
.contentType("application/json")
.modelId("cohere.embed-english-v3")
Member

guess we may want to upgrade to v4 soon but maybe best to keep to the model we evaluated with for now...? When we upgrade we'll need to throw away existing embeddings I think!

Contributor Author

Yes, that's right. We can embed to a new index for a while for v4, or generate two embeddings and write to two indexes, whilst keeping the search endpoint pointed at the v3 index.
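That dual-write migration could look roughly like the following. The index names and the `IndexWrite` shape are illustrative, not from this PR:

```scala
// Hypothetical sketch of writing v3 embeddings (and optionally v4, once
// enabled) to separate indexes, while search stays on the v3 index until
// the v4 index has been backfilled.
final case class IndexWrite(index: String, embedding: Array[Float])

object MigrationWrites {
  def apply(
      v3Embedding: Array[Float],
      v4Embedding: Option[Array[Float]] // present once v4 embedding is enabled
  ): List[IndexWrite] =
    IndexWrite("images-cohere-v3", v3Embedding) ::
      v4Embedding.map(IndexWrite("images-cohere-v4", _)).toList
}
```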

Member

Yeah that's a smart idea and a useful general strategy for embedding model upgrades

gu-prout bot commented Oct 30, 2025

Seen on auth, usage, image-loader, metadata-editor, thrall, leases, cropper, collections, media-api, kahuna (merged by @ellenmuller 11 minutes and 46 seconds ago) Please check your changes!
