
Image Search with Subindices #606

Closed
vnguye65 opened this issue Dec 1, 2023 · 21 comments

Comments

@vnguye65

vnguye65 commented Dec 1, 2023

Could you provide an example of how this image search can be used with other data using subindices?
I get this error when I try to upsert image data into the vector database.

embeddings = Embeddings(
    content=True,
    defaults=False,

    indexes={
        "text-data": {
            "tokenize": True,
            "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
            "columns": {
                "text": "text-data"
            }
        },

        "image": {
            "tokenize": True,
            "method": "sentence-transformers",
            "path": "sentence-transformers/clip-ViT-B-32",
            "objects": "image",
            "columns": {
                "object": "image"
            }
        }
    }
)
embeddings.upsert([(uid, {"object": image, "format": image.format, "width": image.width, "height": image.height}, image_file)])
TypeError: Object of type JpegImageFile is not JSON serializable

https://github.com/neuml/txtai/blob/master/examples/13_Similarity_search_with_images.ipynb

@vnguye65
Author

vnguye65 commented Dec 1, 2023

@davidmezzetti This code seems to work without the "columns" parameter in the Embeddings config

@davidmezzetti
Member

Hello, that is correct, since you are using the default object field name of object. If you add the columns config back in for the subindex, it would have to look like this:

"columns": {
  "object": "object"
}
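For illustration, the columns mapping can be read as: for each subindex column, take the value from the named record field. The `resolve` helper below is a hypothetical stand-in that mimics this behavior; it is not part of the txtai API.

```python
# Hypothetical helper illustrating how a subindex "columns" mapping
# renames input record fields before they reach that subindex.
def resolve(record, columns):
    """Map each configured column to the record field it reads from."""
    resolved = {}
    for field, source in columns.items():
        # "_" (or any field missing from the record) means the column is skipped
        if source in record:
            resolved[field] = record[source]
    return resolved

record = {"object": "<PIL image>", "text-data": "a caption"}

# Image subindex: the default object field maps to itself
print(resolve(record, {"object": "object"}))   # {'object': '<PIL image>'}

# Text subindex: read text from "text-data", skip objects via "_"
print(resolve(record, {"text": "text-data", "object": "_"}))  # {'text': 'a caption'}
```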

@vnguye65
Author

vnguye65 commented Dec 3, 2023

@davidmezzetti My apologies, I mistyped the code in the question.

I got it to work now. It seems the column mappings of the two subindexes were conflicting; adding "object": "_" to the other subindex fixed the issue.

However, changing "columns": { "object": "object" } in the image subindex to any name other than object causes an error, even when I change the data insertion code to match, e.g. embeddings.index([(0, {"other name": image}, None)])

    indexes={
        "text-data": {
            "tokenize": True,
            "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
            "columns": {
                "text": "text-data",
                "object": "_"
            }
        },

        "image": {
            "method": "sentence-transformers",
            "path": "sentence-transformers/clip-ViT-B-32",
            # "objects": "image",
            "columns": {
                "object": "object"
            }
        }
    }

I'm using object as the column name so I'm able to get it to work.

Could you provide an example of how you would do an image search?
results = embeddings.search("select object from txtai where similar('machine learning')") doesn't seem to work

@vnguye65
Author

vnguye65 commented Dec 3, 2023

Another question I have: is there a way to weight the subindexes during a combined search?
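One client-side approach, sketched below under the assumption that each subindex is queried separately and returns txtai's usual list of id/score dicts, is to merge the result lists with manual weights. This is illustrative, not a built-in txtai feature.

```python
# Sketch: merge per-subindex search results with manual weights.
# results_by_index maps subindex name -> [{"id": ..., "score": ...}, ...]
def weighted_merge(results_by_index, weights):
    scores = {}
    for name, results in results_by_index.items():
        weight = weights.get(name, 1.0)
        for row in results:
            scores[row["id"]] = scores.get(row["id"], 0.0) + weight * row["score"]
    # Sort combined scores in descending order
    return sorted(
        ({"id": uid, "score": score} for uid, score in scores.items()),
        key=lambda row: row["score"],
        reverse=True,
    )

text = [{"id": "0", "score": 0.8}, {"id": "1", "score": 0.4}]
image = [{"id": "1", "score": 0.9}]
merged = weighted_merge({"text-data": text, "image": image},
                        {"text-data": 0.3, "image": 0.7})
print([row["id"] for row in merged])  # ['1', '0']
```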

@davidmezzetti
Member

Would it be possible to provide a few notional example records? It would be easier for me to follow exactly what you're trying to accomplish.

@vnguye65
Author

vnguye65 commented Dec 4, 2023

We want to leverage subindexes to perform both text and image search. We have three more subindexes beyond what is shown in the config above.
The user should be able to search for images, texts, or both.

To run a search on text-data, we run embeddings.search('select text-data, score from txtai where similar("machine learning", "text-data")')

Could you provide an example of how we could run a search on only image data (the image index)?
embeddings.search('select object, score from txtai where similar("machine learning")') does not seem to work

@davidmezzetti
Member

Did you try?

embeddings.search('select object, score from txtai where similar("machine learning", "image")')

or

embeddings.search('select object, score from txtai where similar("machine learning")', index="image")

@vnguye65
Author

vnguye65 commented Dec 4, 2023

I figured out what the issue is.

When I run embeddings.search('select object, score from txtai where similar("machine learning", "image")')

The output for the first 3 items are:
[{'object': None, 'score': 0.7698213458061218}, {'object': None, 'score': 0.7577175498008728}, {'object': None, 'score': 0.7567495107650757}]

The problem is with the similarity scores. Typically, text-to-image similarity scores are lower than text-to-text scores. The search seems to return scores across all subindexes, not just image, which is why the matching images are pushed to the bottom.

Is there a workaround that you know of to only return objects from the image index?

I have tried
embeddings.search('select object, score from txtai where similar("machine learning", "image") and object is not null')
but it returns an empty list
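As a client-side workaround, sketched here under the assumption that results have the standard content-enabled shape (a list of dicts with an "object" column), the null objects can be filtered after the search instead of in SQL:

```python
# Sketch: keep only rows whose "object" column is populated.
# "results" stands in for embeddings.search(...) output with content enabled.
def image_rows(results):
    return [row for row in results if row.get("object") is not None]

results = [
    {"object": None, "score": 0.77},
    {"object": "<PIL image>", "score": 0.41},
    {"object": None, "score": 0.76},
]
print(image_rows(results))  # keeps only the row with a non-null object
```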

@davidmezzetti
Member

I will try to find time to come up with an example that does what you want to do with the appropriate configuration. I think some of the configuration you have could be throwing something off.

@vnguye65
Author

vnguye65 commented Dec 4, 2023

With the current configuration, I noticed that it only calculates similarity using the text field in the data, not object. The only difference I saw between my configuration with subindexes and one without subindexes is the text fields.

Configuration with subindexes returns the following when I run embeddings.search(f'''select * from txtai where similar("machine learning")''')

[{'indexid': 8,
  'id': '8',
  'text': None,
  'tags': '2014-0110-F_Hi_j0007_4.JPG',
  'entry': '2023-12-04 16:26:47.702813',
  'data': '{"tags": "2014-0110-F_Hi_j0007_4.JPG"}',
  'object': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1000x667>,
  'score': 0.2972120940685272},
...]

Configuration without subindexes returns:

[{'indexid': 8,
  'id': '8',
  'text': None,
  'tags': '2014-0110-F_Hi_j0007_4.JPG',
  'entry': '2023-12-04 16:26:47.702813',
  'data': '{"tags": "2014-0110-F_Hi_j0007_4.JPG", "text": "8"}',
  'object': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1000x667>,
  'score': 0.9972120940685272},
... 

This is what I have set up for subindexes:

embeddings = Embeddings(
    content=True,
    defaults=False,
    objects='image',

    indexes={
        "text-data": {
            "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
            "columns": {
                "text": "text-data",
                "object": "_"
            }
        },

        "image": {
            "method": "sentence-transformers",
            "path": "sentence-transformers/clip-ViT-B-32",
            "columns": {
                "object": "object"
            }
        }
    }
)
@davidmezzetti
Member

The following code should give you what you're looking for.

from PIL import Image

from txtai import Embeddings

embeddings = Embeddings(
    content=True,
    objects="image",
    defaults=False,
    indexes={
        "text-data": {
            "columns": {
                "text": "body",
                "object": "content"
            }
        },
        "image":{
            "method": "sentence-transformers",
            "path": "sentence-transformers/clip-ViT-B-32",
            "columns": {
                "text": "image-text"
            }
        }
    }
)

embeddings.index([{"object": Image.open("books.jpg")}, {"body": "machine learning"}])

And it supports the following queries.

embeddings.search("select id, score from txtai where similar(:x, :y)", parameters={"x": "books", "y": "text-data"})
# [{'id': '1', 'score': 0.3759748339653015}]

embeddings.search("select id, score from txtai where similar(:x, :y)", parameters={"x": "books", "y": "image"})
# [{'id': '0', 'score': 0.2725779414176941}]

embeddings.search("select id, object, score from txtai where similar(:x, :y)", parameters={"x": Image.open("books.jpg"), "y": "image"})
# [{'id': '0', 'object': <Image>, 'score': 0.9999999403953552}]

embeddings.search("select id, score from txtai where similar(:x, :y) or similar(:x, :z)", parameters={"x": "books", "y": "image", "z": "text-data"})
# [{'id': '1', 'score': 0.3759748339653015}, {'id': '0', 'score': 0.2725779414176941}]

embeddings.search("select id, score from txtai where similar(:x, :y) and similar(:x, :z)", parameters={"x": "books", "y": "image", "z": "text-data"})
# []

The tricky thing is that the CLIP model encodes both text and images. So the config needs to be set up to skip text records. Note how both indexes have non-standard names for the text column.

@vnguye65
Author

vnguye65 commented Dec 5, 2023

Thank you for your help

@davidmezzetti
Member

Closing this issue. If there are further questions, please re-open or open a new issue.

@vnguye65
Author

vnguye65 commented Dec 11, 2023

@davidmezzetti
I was able to run image search using the configuration you provided above in my local testing environment. However, when we use the same configuration in our deployment, we get the same error when using the post and get methods to ingest images:

TypeError: Object of type JpegImageFile is not JSON serializable
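The error itself comes from the JSON encoding step: requests.post(..., json=...) serializes the payload with json.dumps, and the standard encoder has no representation for a PIL image object. A minimal stand-in (a plain class instead of JpegImageFile) reproduces the failure:

```python
import json

# Stand-in for a PIL JpegImageFile: any object json.dumps cannot encode.
class FakeImage:
    pass

try:
    # Mirrors the payload shape sent to the /add endpoint
    json.dumps([{"object": FakeImage()}])
except TypeError as error:
    print(error)  # Object of type FakeImage is not JSON serializable
```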

This is the code to ingest images:

import requests
from PIL import Image

image = Image.open(stream)
inputs = [{"object": image}]
response = requests.post(f"{base_url}/add", json=inputs)
response = requests.get(f"{base_url}/upsert")

The only difference between the local version that works and the deployed version that does not is the path parameter for the vector database. The local version points to a local directory while the other points to /mnt/data:

writable: true
path: /mnt/data

Any insights you can provide on this issue would be greatly appreciated!

@davidmezzetti
Member

It looks like in dev you're using Python directly and prod is using the API?

@vnguye65
Author

vnguye65 commented Dec 12, 2023

@davidmezzetti Yes. We're using the API in prod. Do you have any insights into what might be causing this issue?

@vnguye65
Author

@davidmezzetti Do you have insights into why it's throwing an error with API but working perfectly when calling txtai.app.Application in Python? Any information you have would be greatly appreciated.

@davidmezzetti
Member

The API does not currently support multi-part form submissions for embeddings inputs. I will try to work on an example to demonstrate how to do that but it will involve adding a custom API endpoint. This article has an example of that.

https://neuml.hashnode.dev/custom-api-endpoints
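Until binary inputs are supported, one interim pattern is to base64-encode the image bytes so they survive JSON, then decode them server-side in a custom endpoint like the article describes. This is a sketch only: the /addimage endpoint name and imagedata field are hypothetical and would have to be implemented on the server.

```python
import base64
import io

# Client side: encode raw image bytes as a base64 string so the
# payload is JSON serializable. "stream" stands in for an uploaded file.
stream = io.BytesIO(b"\xff\xd8\xff\xe0fake-jpeg-bytes")
payload = {"imagedata": base64.b64encode(stream.getvalue()).decode("ascii")}
# requests.post(f"{base_url}/addimage", json=payload)  # hypothetical custom endpoint

# Server side (inside the custom endpoint): recover the original bytes,
# then open them with PIL before indexing.
recovered = base64.b64decode(payload["imagedata"])
assert recovered == stream.getvalue()
```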

@davidmezzetti davidmezzetti reopened this Dec 18, 2023
@vnguye65
Author

@davidmezzetti Do you have any updates on this issue?

@davidmezzetti
Member

Sending binary data through the API isn't currently supported. It's something I want to add and showing how to do it in the meantime with a custom endpoint is on my list.

@davidmezzetti
Member

Closing here and tracking this in #606.
