Embed incoming images with cohere and write to S3 vector store #4539

ellenmuller merged 36 commits into main from
Conversation
```scala
class Bedrock(config: CommonConfig)
  extends AwsClientV2BuilderUtils {
  // TODO: figure out what the more usual pattern for turning off localstack behaviour is
```
Please do advise on this if there is a better way of doing this!
```scala
val body = Bedrock.BedrockRequest(
  input_type = "image",
  embedding_types = List("float"),
  images = List(s"data:image/jpg;base64,$base64EncodedImage")
)
```
Ah! Can we be certain that every image is a JPEG? I think the Grid does technically support PNGs, though I feel like they are rare...
Well remembered! No, you can't; we'll see both .png and .tiff files.
It looks like bedrock/cohere won't accept .tiff files, so we shouldn't send those. Fortunately, for other reasons, during the pipeline we'll create a jpg/png "browser-viewable" version of a tiff, so we can send that instead. That'll require slightly different wiring in Uploader/Projector. We can look at that next Weds?
(Also it looks like they only accept images up to 5MB, so we'll need to deal with that too)
Just to play "ruthless MVP's advocate" for a sec, what if initially we could just filter out images that are not JPEGs and over 5MB? (Depending on how big a proportion of incoming images these constitute)
This is obviously a terrible idea for the real production feature, but until we do a backfill we're going to have swathes of unembedded images anyway so the prototype feature will be: "of what we've embedded so far, here are the best 30". Given that, maaaaybe it's OK to kick this down the road a little, especially because we may shortly be upgrading to Cohere v4 and having to re-embed everything
Agreed, I think we can filter out ones that are >5MB or not JPEG. I think we should log out the file size and the file type though so we can see how many images fall in those categories :)
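The filter-and-log approach agreed above could be sketched roughly like this. This is illustrative only, under the assumption that the Grid would sniff the format from the image bytes (in practice the Grid likely already knows the MIME type, and the object/method names here are hypothetical):

```scala
object EmbedFilter {
  // Bedrock/Cohere reject images over 5MB (using the 5,000,000-byte
  // interpretation matched by the jq query later in this thread)
  val MaxEmbedBytes: Long = 5000000L

  // Sniff the image format from its magic bytes (illustrative helper)
  def sniffMime(bytes: Array[Byte]): Option[String] =
    bytes.take(4).map(_ & 0xFF).toList match {
      case 0xFF :: 0xD8 :: 0xFF :: _           => Some("image/jpeg")
      case 0x89 :: 0x50 :: 0x4E :: 0x47 :: Nil => Some("image/png")
      case 0x49 :: 0x49 :: 0x2A :: 0x00 :: Nil => Some("image/tiff") // little-endian TIFF
      case 0x4D :: 0x4D :: 0x00 :: 0x2A :: Nil => Some("image/tiff") // big-endian TIFF
      case _                                   => None
    }

  // Only embed JPEGs under the size limit; log what we skip so we can
  // measure how many images fall into each excluded category
  def shouldEmbedImage(bytes: Array[Byte]): Boolean = {
    val mime = sniffMime(bytes)
    val ok = mime.contains("image/jpeg") && bytes.length <= MaxEmbedBytes
    if (!ok) println(s"Skipping embedding: mime=$mime size=${bytes.length}")
    ok
  }
}
```

The logging line is the important part for the MVP: it gives the data needed to decide whether PNG and oversized images are worth handling before a backfill.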
Of a random sample of 38,864 PROD images (the 45,398 in the image bucket with hash ending 00, minus those with no corresponding elasticsearch record) we have:
- 38,291 JPEG (98.5%)
- 490 PNG (1.3%)
- 83 TIFF (0.2%)
From the same sample, 10,326 (27%) are over 5MB, according to:

```shell
jq 'select(.data.source.size > 5000000) | .data.id' metadata.jsonl | wc -l
```
Also, I've confirmed that requests with images over 5MB will fail: https://github.com/guardian/data-science-embedding-experiments/pull/8
FYI, apparently even Cohere v4 has a 5MB limit, in addition to the "max 2,458,624 pixels before downsampling": "The original image file type must be in a png, jpeg, webp, or gif format and can be up to 5 MB in size" https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed-v4.html
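The v4 limits quoted from the linked AWS docs could be checked up front with something like this sketch (the constants come from the docs above; the object and method names are hypothetical, and note that exceeding the pixel count triggers downsampling rather than rejection):

```scala
object CohereV4Limits {
  val MaxBytes: Long = 5L * 1024 * 1024 // "up to 5 MB in size" (exact byte interpretation unverified)
  val MaxPixels: Long = 2458624L        // images above this are downsampled by Cohere
  val AcceptedFormats: Set[String] =
    Set("image/png", "image/jpeg", "image/webp", "image/gif")

  // True if the image is an accepted format, under the size limit,
  // and small enough to avoid Cohere's downsampling
  def withinLimits(mime: String, sizeBytes: Long, width: Long, height: Long): Boolean =
    AcceptedFormats.contains(mime) &&
      sizeBytes <= MaxBytes &&
      width * height <= MaxPixels
}
```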
```scala
.accept("*/*")
.body(SdkBytes.fromUtf8String(jsonBody))
.contentType("application/json")
.modelId("cohere.embed-english-v3")
```
I guess we may want to upgrade to v4 soon, but maybe best to keep to the model we evaluated with for now? When we upgrade we'll need to throw away existing embeddings, I think!
Yes, that's right. For v4 we can embed to a new index for a while, or generate two embeddings and write to two indexes, whilst keeping the search endpoint pointed towards the v3 index.
Yeah that's a smart idea and a useful general strategy for embedding model upgrades
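The dual-write upgrade strategy discussed above might look roughly like this. Purely illustrative: `VectorIndex`, `DualWriteEmbedder` and the constructor shape are hypothetical stand-ins for the real `S3Vectors` wiring:

```scala
// Sketch of a dual-write during a model upgrade: embed with both models,
// write each embedding to its own index, and keep search reading from v3
// until the v4 index has been fully backfilled.
trait VectorIndex {
  def put(imageId: String, embedding: Array[Float]): Unit
}

class DualWriteEmbedder(
  embedV3: Array[Byte] => Array[Float],
  embedV4: Array[Byte] => Array[Float],
  v3Index: VectorIndex,
  v4Index: VectorIndex
) {
  def ingest(imageId: String, image: Array[Byte]): Unit = {
    v3Index.put(imageId, embedV3(image)) // the search endpoint stays on v3...
    v4Index.put(imageId, embedV4(image)) // ...while v4 fills up in parallel
  }
}
```

Once the v4 index catches up, cutting over is just repointing the search endpoint and retiring the v3 writes.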
Seen on auth, usage, image-loader, metadata-editor, thrall, leases, cropper, collections, media-api, kahuna (merged by @ellenmuller 11 minutes and 46 seconds ago). Please check your changes!
What does this change?
This PR writes image vector embeddings to the S3 Vector Store when new images are uploaded. The images are embedded using Cohere's embed v3 model (`cohere.embed-english-v3`) via Bedrock.
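At a high level the new upload-time step looks like the sketch below. This is a simplified illustration only; beyond the `Embedder`, `Bedrock` and `S3Vectors` names and the `shouldEmbed` flag mentioned in this PR, the method signatures are invented for the example:

```scala
// Sketch: an Embedder gated by the s3.vectors.shouldEmbed config flag.
// Its two dependencies (the Bedrock client and the S3 vector store writer)
// are modelled as plain functions here for simplicity.
class Embedder(
  bedrockEmbed: Array[Byte] => Array[Float],      // stand-in for the Bedrock class
  s3VectorsPut: (String, Array[Float]) => Unit,   // stand-in for the S3Vectors class
  shouldEmbed: Boolean                            // s3.vectors.shouldEmbed
) {
  // Returns true if the image was embedded and written to the vector store
  def onUpload(imageId: String, imageBytes: Array[Byte]): Boolean =
    if (shouldEmbed) {
      s3VectorsPut(imageId, bedrockEmbed(imageBytes))
      true
    } else false
}
```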
Changes
- An `Embedder` is wired into the image upload pipeline; this class has two dependencies: the `Bedrock` class and the `S3Vectors` class.
- A config flag `s3.vectors.shouldEmbed` (accessed as `shouldEmbed` in Scala code) to control embedding generation for safe collaboration with the BBC; this is also added to the script that generates the config locally (`dev/script/generate-config/service-config.js`).
- A script `dev/script/get-s3-vector-store-records.sh`, which returns the contents of the test, dev or prod vector store. Annoyingly we can't just inspect the records in the AWS console because that's not supported yet!

Future Work
The following items are planned for subsequent PRs:
How should a reviewer test this change?
Run locally and upload a new image. You should see logging that your image has been embedded and uploaded to the S3 vector store. You can then run `./dev/script/get-s3-vector-store-records.sh dev` and check if the image ID of the image you've just uploaded is in the vector store.

You can go through the same steps on test, and run `./dev/script/get-s3-vector-store-records.sh test` to confirm instead. I've added cloudformation permissions to bedrock and the S3 vector store so you can safely deploy to test (https://github.com/guardian/editorial-tools-platform/pull/987).

Tested? Documented?