
Embed incoming images with cohere and write to S3 vector store#4539

Merged

ellenmuller merged 36 commits into main from em-embed-to-s3-vector-store on Oct 30, 2025

Conversation

ellenmuller (Contributor) commented Oct 8, 2025

What does this change?

This PR writes image vector embeddings to the S3 Vector Store when new images are uploaded. The images are embedded using Cohere's Embed v3 model (`cohere.embed-english-v3`) via Bedrock.

Changes

  • Integrated a new class, Embedder, into the image upload pipeline. It has two dependencies: the Bedrock class and the S3Vectors class.
  • Implemented the config setting s3.vectors.shouldEmbed (accessed as shouldEmbed in Scala code) to control embedding generation for safe collaboration with the BBC. This setting is also added to the script that generates the config locally (dev/script/generate-config/service-config.js).
  • On failure, we log an error message but don't halt the upload, so the upload pipeline is unaffected.
  • Added a script, dev/script/get-s3-vector-store-records.sh, which returns the contents of the test, dev or prod vector store. Annoyingly, we can't just inspect the records in the AWS console because that isn't supported yet!
  • Images that are larger than 5MB or are TIFFs are skipped rather than embedded.
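For reviewers skimming the diff, the size/format guard described above boils down to something like the following. This is a standalone sketch only: `EmbedGuard`, its method name, and the MIME-type check are illustrative, not the actual `Embedder` code.

```scala
// Illustrative sketch of the pre-embedding guard described above.
// The real logic lives in Embedder; these names are hypothetical.
object EmbedGuard {
  // Bedrock/Cohere reject image payloads over 5MB
  val MaxBytes: Long = 5L * 1024 * 1024

  // TIFFs aren't accepted by the model, and oversized files would fail the
  // Bedrock call, so both are skipped rather than halting the upload.
  def eligible(mimeType: String, sizeBytes: Long): Boolean =
    mimeType != "image/tiff" && sizeBytes <= MaxBytes
}
```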

Future Work

The following items are planned for subsequent PRs:

  • Convert incompatible images: use the BrowserViewableImage for TIFF files (these are converted to JPEGs/PNGs for the browser) and downscale images >5MB before embedding
  • Semantic search: Add cosine similarity-based semantic search to the GET /images search endpoint (likely activated by a feature switch)
  • Metadata strategy: Determine which metadata fields should be stored in the S3 Vector Store to add filtering capabilities in semantic search mode (not strictly necessary if we keep the MVP super minimal, but if time allows it would be good to add this after we have our search endpoint)
  • Backfill: generate embeddings for historical images (likely backfilled using our embedding lambda in the AI AWS account; we'll discuss in detail with someone from the Media & Feeds team before we do anything on this, obviously!). To mimic the eventual production environment, we need to be able to search over a large enough dataset.
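The cosine similarity at the core of the planned semantic search is simple enough to sketch here. This is a standalone illustration, not code from this PR (and in practice the S3 Vector Store may compute the distance metric server-side):

```scala
// Cosine similarity between two embedding vectors, as would underpin
// the planned semantic search endpoint. Illustration only.
object CosineSimilarity {
  def apply(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    var dot, normA, normB = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i)
      normA += a(i) * a(i)
      normB += b(i) * b(i)
      i += 1
    }
    // 1.0 for identical directions, 0.0 for orthogonal vectors
    dot / (math.sqrt(normA) * math.sqrt(normB))
  }
}
```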

How should a reviewer test this change?

Run locally and upload a new image. You should see logging that your image has been embedded and uploaded to the S3 vector store. You can then run ./dev/script/get-s3-vector-store-records.sh dev and check if the image ID of the image you've just uploaded is in the vector store.

You can go through the same steps on TEST, and run ./dev/script/get-s3-vector-store-records.sh test to confirm instead. I've added CloudFormation permissions for Bedrock and the S3 vector store so you can safely deploy to TEST (https://github.com/guardian/editorial-tools-platform/pull/987).

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

github-actions bot commented Oct 9, 2025

@ellenmuller ellenmuller changed the title from "Em embed to s3 vector store" to "Embed incoming images with cohere and write to S3 vector store" Oct 16, 2025
@ellenmuller ellenmuller marked this pull request as ready for review October 16, 2025 13:34
@ellenmuller ellenmuller requested a review from a team as a code owner October 16, 2025 13:34
class Bedrock(config: CommonConfig)
extends AwsClientV2BuilderUtils {

// TODO: figure out what the more usual pattern for turning off localstack behaviour is
Contributor Author

Please do advise on this if there is a better way of doing this!

Comment thread common-lib/src/main/scala/com/gu/mediaservice/lib/aws/S3Vectors.scala Outdated
Comment thread common-lib/src/main/scala/com/gu/mediaservice/lib/aws/Embedder.scala Outdated
Comment thread common-lib/src/main/scala/com/gu/mediaservice/model/ImageEmbedding.scala Outdated
val body = Bedrock.BedrockRequest(
input_type = "image",
embedding_types = List("float"),
images = List(s"data:image/jpg;base64,$base64EncodedImage")
Member

Ah! Can we be certain that every image is a JPEG? I think the Grid does technically support PNGs, though I feel like they are rare...

Member

well remembered! no you can't, we'll see both .png and .tiff files

It looks like bedrock/cohere won't accept .tiff files, so we shouldn't send those. Fortunately, for other reasons, during the pipeline we'll create a jpg/png "browser-viewable" version of a tiff, so we can send that instead (that'll require slightly different wiring in Uploader/Projector). We can look at that next Weds?

(Also it looks like they only accept images up to 5MB, so we'll need to deal with that too)

Member

Just to play "ruthless MVP's advocate" for a sec, what if initially we could just filter out images that are not JPEGs and over 5MB? (Depending on how big a proportion of incoming images these constitute)

This is obviously a terrible idea for the real production feature, but until we do a backfill we're going to have swathes of unembedded images anyway so the prototype feature will be: "of what we've embedded so far, here are the best 30". Given that, maaaaybe it's OK to kick this down the road a little, especially because we may shortly be upgrading to Cohere v4 and having to re-embed everything

Contributor Author

Agreed, I think we can filter out ones that are >5MB or not JPEG. I think we should log the file size and the file type though, so we can see how many images fall into those categories :)

Member

Of a random sample of 38,864 PROD images (the 45,398 in the image bucket with hash ending 00, minus those with no corresponding elasticsearch record) we have:

38,291 JPEG (98.5%)
490 PNG (1.3%)
83 TIFF (0.2%)

Member

From the same sample, 10,326 (27%) are over 5MB, according to

jq 'select(.data.source.size > 5000000) | .data.id' metadata.jsonl | wc -l

Member

Also, I've confirmed that requests with images over 5MB will fail: https://github.com/guardian/data-science-embedding-experiments/pull/8

Member

FYI, apparently even Cohere v4 has a 5MB limit, in addition to the "max 2,458,624 pixels before downsampling": "The original image file type must be in a png, jpeg, webp, or gif format and can be up to 5 MB in size" https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed-v4.html

.accept("*/*")
.body(SdkBytes.fromUtf8String(jsonBody))
.contentType("application/json")
.modelId("cohere.embed-english-v3")
Member

guess we may want to upgrade to v4 soon but maybe best to keep to the model we evaluated with for now...? When we upgrade we'll need to throw away existing embeddings I think!

Contributor Author

Yes, that's right. We can embed to a new index for a while for v4, or generate two embeddings and write to two indexes, whilst keeping the search endpoint pointed at the v3 index.
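That dual-write migration could look roughly like the following. The index names and the `IndexWrite` shape are illustrative, not from this PR:

```scala
// Hypothetical sketch of writing v3 embeddings (and optionally v4, once
// enabled) to separate indexes, while search stays on the v3 index until
// the v4 index has been backfilled.
final case class IndexWrite(index: String, embedding: Array[Float])

object MigrationWrites {
  def apply(
      v3Embedding: Array[Float],
      v4Embedding: Option[Array[Float]] // present once v4 embedding is enabled
  ): List[IndexWrite] =
    IndexWrite("images-cohere-v3", v3Embedding) ::
      v4Embedding.map(IndexWrite("images-cohere-v4", _)).toList
}
```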

Member

Yeah that's a smart idea and a useful general strategy for embedding model upgrades

gu-prout bot commented Oct 30, 2025

Seen on auth, usage, image-loader, metadata-editor, thrall, leases, cropper, collections, media-api, kahuna (merged by @ellenmuller 11 minutes and 46 seconds ago) Please check your changes!
