
Image Search with Subindices #606

Closed
vnguye65 opened this issue Dec 1, 2023 · 21 comments

Comments

@vnguye65

vnguye65 commented Dec 1, 2023

Could you provide an example of how this image search can be used with other data using subindices?
I get this error when I try to upsert image data into the vector database.

embeddings = Embeddings(
    content=True,
    defaults=False,

    indexes={
        "text-data": {
            "tokenize": True,
            "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
            "columns": {
                "text": "text-data"
            }
        },

        "image": {
            "tokenize": True,
            "method": "sentence-transformers",
            "path": "sentence-transformers/clip-ViT-B-32",
            "objects": "image",
            "columns": {
                "object": "image"
            }
        }
    }
)
embeddings.upsert([(uid, {"object": image, "format": image.format, "width": image.width, "height": image.height}, image_file)])
TypeError: Object of type JpegImageFile is not JSON serializable

https://github.com/neuml/txtai/blob/master/examples/13_Similarity_search_with_images.ipynb

@vnguye65
Author

vnguye65 commented Dec 1, 2023

@davidmezzetti This code seems to work without the "columns" parameter in the Embeddings config

@davidmezzetti
Member

Hello, that is correct, since you are using the default object field name of object. If you add the columns config back in for the subindex, it would have to look like this:

"columns": {
  "object": "object"
}
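For illustration, the columns mapping can be read as: for each subindex column, take the value from the named record field. The `resolve` helper below is a hypothetical stand-in that mimics this behavior; it is not part of the txtai API.

```python
# Hypothetical helper illustrating how a subindex "columns" mapping
# renames input record fields before they reach that subindex.
def resolve(record, columns):
    """Map each configured column to the record field it reads from."""
    resolved = {}
    for field, source in columns.items():
        # "_" (or any field missing from the record) means the column is skipped
        if source in record:
            resolved[field] = record[source]
    return resolved

record = {"object": "<PIL image>", "text-data": "a caption"}

# Image subindex: the default object field maps to itself
print(resolve(record, {"object": "object"}))   # {'object': '<PIL image>'}

# Text subindex: read text from "text-data", skip objects via "_"
print(resolve(record, {"text": "text-data", "object": "_"}))  # {'text': 'a caption'}
```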

@vnguye65
Author

vnguye65 commented Dec 3, 2023

@davidmezzetti My apologies, I mistyped the code in the question.

I got it to work now. It seems the column mappings of the two subindexes were conflicting; adding "object": "_" to the other subindex fixed the issue.

However, changing "columns": { "object": "object" } in the image subindex to any name other than object causes an error, even when I change the data insertion code to match, e.g. embeddings.index([(0, {"other name": image}, None)])

    indexes={
        "text-data": {
            "tokenize": True,
            "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
            "columns": {
                "text": "text-data",
                "object": "_"
            }
        },

        "image": {
            "method": "sentence-transformers",
            "path": "sentence-transformers/clip-ViT-B-32",
            # "objects": "image",
            "columns": {
                "object": "object"
            }
        }
    }

I'm using object as the column name so I'm able to get it to work.

Could you provide an example of how you would do an image search?
results = embeddings.search("select object from txtai where similar('machine learning')") doesn't seem to work

@vnguye65
Author

vnguye65 commented Dec 3, 2023

Another question I have: is there a way to weight the subindexes during a combined search?
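One client-side approach, sketched below under the assumption that each subindex is queried separately and returns txtai's usual list of id/score dicts, is to merge the result lists with manual weights. This is illustrative, not a built-in txtai feature.

```python
# Sketch: merge per-subindex search results with manual weights.
# results_by_index maps subindex name -> [{"id": ..., "score": ...}, ...]
def weighted_merge(results_by_index, weights):
    scores = {}
    for name, results in results_by_index.items():
        weight = weights.get(name, 1.0)
        for row in results:
            scores[row["id"]] = scores.get(row["id"], 0.0) + weight * row["score"]
    # Sort combined scores in descending order
    return sorted(
        ({"id": uid, "score": score} for uid, score in scores.items()),
        key=lambda row: row["score"],
        reverse=True,
    )

text = [{"id": "0", "score": 0.8}, {"id": "1", "score": 0.4}]
image = [{"id": "1", "score": 0.9}]
merged = weighted_merge({"text-data": text, "image": image},
                        {"text-data": 0.3, "image": 0.7})
print([row["id"] for row in merged])  # ['1', '0']
```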

@davidmezzetti
Member

Would it be possible to provide a few notional example records? It would be easier for me to follow exactly what you're trying to accomplish.

@vnguye65
Author

vnguye65 commented Dec 4, 2023

We want to leverage subindexes to perform both text and image search. We have three more subindexes beyond what is shown in the config above.
The user should be able to search for images, texts, or both.

To run a search on text-data, we run embeddings.search('select text-data, score from txtai where similar("machine learning", "text-data")')

Could you provide an example of how we could run a search on only image data (the image index)?
embeddings.search('select object, score from txtai where similar("machine learning")') does not seem to work

@davidmezzetti
Member

Did you try?

embeddings.search('select object, score from txtai where similar("machine learning", "image")')

or

embeddings.search('select object, score from txtai where similar("machine learning")', index="image")

@vnguye65
Author

vnguye65 commented Dec 4, 2023

I figured out what the issue is.

When I run embeddings.search('select object, score from txtai where similar("machine learning", "image")')

The output for the first 3 items are:
[{'object': None, 'score': 0.7698213458061218}, {'object': None, 'score': 0.7577175498008728}, {'object': None, 'score': 0.7567495107650757}]

The problem is with the similarity scores. Typically, text-to-image similarity scores are lower than text-to-text scores. The search seems to return scores across all subindexes, not just image, which is why the matching images are pushed to the bottom.

Is there a workaround that you know of to only return objects from the image index?

I have tried
embeddings.search('select object, score from txtai where similar("machine learning", "image") and object is not null')
but it returns an empty list
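As a client-side workaround, sketched here under the assumption that results have the standard content-enabled shape (a list of dicts with an "object" column), the null objects can be filtered after the search instead of in SQL:

```python
# Sketch: keep only rows whose "object" column is populated.
# "results" stands in for embeddings.search(...) output with content enabled.
def image_rows(results):
    return [row for row in results if row.get("object") is not None]

results = [
    {"object": None, "score": 0.77},
    {"object": "<PIL image>", "score": 0.41},
    {"object": None, "score": 0.76},
]
print(image_rows(results))  # keeps only the row with a non-null object
```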

@davidmezzetti
Member

I will try to find time to come up with an example that does what you want to do with the appropriate configuration. I think some of the configuration you have could be throwing something off.

@vnguye65
Author

vnguye65 commented Dec 4, 2023

With the current configuration, I noticed that it only calculates similarity using the text field in the data, not object. The only difference I saw between my configuration with subindexes and one without subindexes is the text fields.

Configuration with subindexes returns the following when I run embeddings.search(f'''select * from txtai where similar("machine learning")''')

[{'indexid': 8,
  'id': '8',
  'text': None,
  'tags': '2014-0110-F_Hi_j0007_4.JPG',
  'entry': '2023-12-04 16:26:47.702813',
  'data': '{"tags": "2014-0110-F_Hi_j0007_4.JPG"}',
  'object': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1000x667>,
  'score': 0.2972120940685272},
...]

Configuration without subindexes returns:

[{'indexid': 8,
  'id': '8',
  'text': None,
  'tags': '2014-0110-F_Hi_j0007_4.JPG',
  'entry': '2023-12-04 16:26:47.702813',
  'data': '{"tags": "2014-0110-F_Hi_j0007_4.JPG", "text": "8"}',
  'object': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1000x667>,
  'score': 0.9972120940685272},
... 

This is what I have set up for subindexes:

embeddings = Embeddings(
    content=True,
    defaults=False,
    objects='image',

    indexes={
        "text-data": {
            "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
            "columns": {
                "text": "text-data",
                "object": "_"
            }
        },

        "image": {
            "method": "sentence-transformers",
            "path": "sentence-transformers/clip-ViT-B-32",
            "columns": {
                "object": "object"
            }
        }
    }
)
@davidmezzetti
Member

The following code should give you what you're looking for.

from PIL import Image

from txtai import Embeddings

embeddings = Embeddings(
    content=True,
    objects="image",
    defaults=False,
    indexes={
        "text-data": {
            "columns": {
                "text": "body",
                "object": "content"
            }
        },
        "image":{
            "method": "sentence-transformers",
            "path": "sentence-transformers/clip-ViT-B-32",
            "columns": {
                "text": "image-text"
            }
        }
    }
)

embeddings.index([{"object": Image.open("books.jpg")}, {"body": "machine learning"}])

And it supports the following queries.

embeddings.search("select id, score from txtai where similar(:x, :y)", parameters={"x": "books", "y": "text-data"})
# [{'id': '1', 'score': 0.3759748339653015}]

embeddings.search("select id, score from txtai where similar(:x, :y)", parameters={"x": "books", "y": "image"})
# [{'id': '0', 'score': 0.2725779414176941}]

embeddings.search("select id, object, score from txtai where similar(:x, :y)", parameters={"x": Image.open("books.jpg"), "y": "image"})
# [{'id': '0', 'object': <Image>, 'score': 0.9999999403953552}]

embeddings.search("select id, score from txtai where similar(:x, :y) or similar(:x, :z)", parameters={"x": "books", "y": "image", "z": "text-data"})
# [{'id': '1', 'score': 0.3759748339653015}, {'id': '0', 'score': 0.2725779414176941}]

embeddings.search("select id, score from txtai where similar(:x, :y) and similar(:x, :z)", parameters={"x": "books", "y": "image", "z": "text-data"})
# []

The tricky thing is that the CLIP model encodes both text and images. So the config needs to be set up to skip text records. Note how both indexes have non-standard names for the text column.

@vnguye65
Author

vnguye65 commented Dec 5, 2023

Thank you for your help

@davidmezzetti
Member

Closing this issue. If there are further questions, please re-open or open a new issue.

@vnguye65
Author

vnguye65 commented Dec 11, 2023

@davidmezzetti
I was able to run image search using the configuration you provided above in my local testing environment. However, when we use the same configuration in our deployment, we get the same error when using the post and get methods to ingest images:

TypeError: Object of type JpegImageFile is not JSON serializable
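The error itself comes from the JSON encoding step: requests.post(..., json=...) serializes the payload with json.dumps, and the standard encoder has no representation for a PIL image object. A minimal stand-in (a plain class instead of JpegImageFile) reproduces the failure:

```python
import json

# Stand-in for a PIL JpegImageFile: any object json.dumps cannot encode.
class FakeImage:
    pass

try:
    # Mirrors the payload shape sent to the /add endpoint
    json.dumps([{"object": FakeImage()}])
except TypeError as error:
    print(error)  # Object of type FakeImage is not JSON serializable
```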

This is the code to ingest images:

import requests
from PIL import Image

image = Image.open(stream)
inputs = [{"object": image}]
response = requests.post(f"{base_url}/add", json=inputs)
response = requests.get(f"{base_url}/upsert")

The only difference between the local version that works and the deployed version that does not is the path parameter for the vector database. The local version points to a local directory while the other points to /mnt/data:

writable: true
path: /mnt/data

Any insights you can provide on this issue would be greatly appreciated!

@davidmezzetti
Member

It looks like in dev you're using Python directly and prod is using the API?

@vnguye65
Author

vnguye65 commented Dec 12, 2023

@davidmezzetti Yes. We're using the API in prod. Do you have any insights into what might be causing this issue?

@vnguye65
Author

@davidmezzetti Do you have insights into why it's throwing an error with API but working perfectly when calling txtai.app.Application in Python? Any information you have would be greatly appreciated.

@davidmezzetti
Member

The API does not currently support multi-part form submissions for embeddings inputs. I will try to work on an example to demonstrate how to do that but it will involve adding a custom API endpoint. This article has an example of that.

https://neuml.hashnode.dev/custom-api-endpoints
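Until binary inputs are supported, one interim pattern is to base64-encode the image bytes so they survive JSON, then decode them server-side in a custom endpoint like the article describes. This is a sketch only: the /addimage endpoint name and imagedata field are hypothetical and would have to be implemented on the server.

```python
import base64
import io

# Client side: encode raw image bytes as a base64 string so the
# payload is JSON serializable. "stream" stands in for an uploaded file.
stream = io.BytesIO(b"\xff\xd8\xff\xe0fake-jpeg-bytes")
payload = {"imagedata": base64.b64encode(stream.getvalue()).decode("ascii")}
# requests.post(f"{base_url}/addimage", json=payload)  # hypothetical custom endpoint

# Server side (inside the custom endpoint): recover the original bytes,
# then open them with PIL before indexing.
recovered = base64.b64decode(payload["imagedata"])
assert recovered == stream.getvalue()
```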

@davidmezzetti davidmezzetti reopened this Dec 18, 2023
@vnguye65
Author

@davidmezzetti Do you have any updates on this issue?

@davidmezzetti
Member

Sending binary data through the API isn't currently supported. It's something I want to add and showing how to do it in the meantime with a custom endpoint is on my list.

@davidmezzetti
Member

Closing here and tracking this in #606.
