External vectors #280

Closed
luquitared opened this issue May 17, 2022 · 13 comments

luquitared commented May 17, 2022

First of all, thank you for adding this feature!

I am attempting to recreate the example locally from the docs:

import numpy as np
import requests
from txtai.embeddings import Embeddings

data = ['test', 'test2']
def transform(inputs):
  response = requests.post("https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/nli-mpnet-base-v2",
                           json={"inputs": inputs})

  return np.array(response.json(), dtype=np.float32)

# Index data using vectors from Inference API
embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

print(embeddings.search('test', 1))

When I run the search, I receive an empty array. Perhaps data must be a tuple?

luquitared commented May 17, 2022

Attempted with minimal changes:

import numpy as np
import requests
from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

def transform(inputs):
  response = requests.post("https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/nli-mpnet-base-v2",
                           json={"inputs": inputs})

  return np.array(response.json(), dtype=np.float32)

# Index data using vectors from Inference API
embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract text field from result
    text = embeddings.search(f"select id, text, score from txtai where similar('{query}')", 1)[0]["text"]

    # Print text
    print("%-20s %s" % (query, text))

And I see this exception:

Query                Best Match
--------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    text = embeddings.search(f"select id, text, score from txtai where similar('{query}')", 1)[0]["text"]
IndexError: list index out of range

@davidmezzetti
Member

Hi Lucas. txtai 4.5 is scheduled to be released shortly but hasn't been released yet. Did you install from PyPI? If you install from GitHub, does it fix the issue?

pip install git+https://github.com/neuml/txtai

@luquitared
Author

Hi David, that worked! I should have checked the version number. Thanks again!

luquitared commented May 17, 2022

OK, I have a new issue, but it may not be relevant. No problem if this is out of scope.

Here I have a script that grabs embeddings from OpenAI:

import numpy as np
from txtai.embeddings import Embeddings
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")


def transform(inputs):

    response = openai.Embedding.create(
        input=inputs,
        engine="text-similarity-ada-001"
    )

    return np.array(response.data[0].embedding, dtype=np.float32)

data = ['test', 'test2']

obj = transform(data)
print(obj.shape)
embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print(embeddings.search('test', 1))

The printed dimensions show:
(1024,)

This raises the error:

File "/home/vscode/.local/lib/python3.8/site-packages/txtai/vectors/base.py", line 127, in batch
    dimensions = embeddings.shape[1]
IndexError: tuple index out of range

I edited base.py to grab the first element of the tuple as a test, and it gets a different error:

  File "/home/vscode/.local/lib/python3.8/site-packages/faiss/__init__.py", line 307, in replacement_search
    n, d = x.shape
ValueError: not enough values to unpack (expected 2, got 1)

I'll keep working on this, so I may end up solving it myself. I think I need to understand how to reshape the array before indexing with FAISS.

@davidmezzetti
Member

Hi Lucas. openai.Embedding.create returns an array of embeddings, one per input, but the code above only takes the first one, hence the (1024,) shape. txtai expects a 2D array with one row per input.

Maybe something like this:

return np.array([x.embedding for x in response.data], dtype=np.float32)
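Putting that together, the full transform would look something like this (same engine as in the earlier snippet):

def transform(inputs):
    # Request embeddings for all inputs in one call
    response = openai.Embedding.create(
        input=inputs,
        engine="text-similarity-ada-001"
    )

    # Return a 2D array: one row of floats per input text
    return np.array([x.embedding for x in response.data], dtype=np.float32)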

@luquitared
Author

Worked like a charm!!!!

bramses commented Jun 3, 2022

Hi @davidmezzetti, when I call .info() I receive this object, which shows the backend is faiss:

{
  "backend": "faiss",
  "build": {
    "create": "2022-06-03T22:50:56Z",
    "python": "3.10.4",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Darwin (arm64)",
    "txtai": "4.5.0"
  },
  "content": "sqlite",
  "dimensions": 1024,
  "method": "external",
  "offset": 11,
  "transform": "<function transform at 0x102c17d90>",
  "update": "2022-06-03T22:50:56Z"
}

The embeddings init:

embeddings = Embeddings({"method": "external", "transform": transform, "content": True})

Is this just a print error? Thanks!

@davidmezzetti
Member

Hi @bramses

I see the settings here use external vectors, but faiss is the default ANN storage engine. When external vectors are used, txtai only uses the transform function to vectorize text; the vectors themselves are still stored internally.

Did you intend to have data stored externally?

luquitared reopened this Jun 4, 2022

bramses commented Jun 4, 2022

Hi @davidmezzetti, thank you for the response! The reason I raised this issue is that vector search wasn't performing well with OpenAI embeddings.

@luquitared and I ran an offline search to compare, and our finding was that OpenAI embeddings rely heavily on using both models: text-search-ada-doc-001 to embed documents and text-search-ada-query-001 to embed queries.

def transform(inputs):
    # Embed documents with the document model
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-doc-001"
    )

    return [x.embedding for x in response.data]

def qry(inputs):
    # Embed queries with the query model
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-query-001"
    )

    return [x.embedding for x in response.data]

We then computed cosine similarity directly and got much better results:

import numpy as np
from numpy.linalg import norm

# Query vector (output of qry) and document vectors (output of transform)
A = np.array(query)
B = np.array(embeddings)

print("A:", A)
print("B:", B)

def cosine_wrapper(a, b):
    # Compute cosine similarity between two vectors
    cosine = np.dot(a, b) / (norm(a) * norm(b))
    print("Cosine Similarity:", cosine)

for b in B:
    cosine_wrapper(A, b)

The output:

Cosine Similarity: [0.17670046]
Cosine Similarity: [0.19275996]
Cosine Similarity: [0.1923126]
Cosine Similarity: [0.19251478]
**Cosine Similarity: [0.19828194]** <-- Correct answer; docs + query embedded separately
Cosine Similarity: [0.1548088]
Cosine Similarity: [0.14926458]
Cosine Similarity: [0.16341928]
Cosine Similarity: [0.37755975]
Cosine Similarity: [0.35022966]
Cosine Similarity: [0.32667258]
Cosine Similarity: [0.37577804]
Cosine Similarity: [0.29246046]
Cosine Similarity: [0.32608339]
**Cosine Similarity: [0.39383586]** <-- Wrong answer; docs and query embedded with the same model `text-search-ada-doc-001`
Cosine Similarity: [0.36037904]

Is there a way to extend Embeddings({"method": "external", "transform": transform, "content": True}) so that we can embed the query with a different external model if we so choose?

Thanks!

davidmezzetti commented Jun 6, 2022

I remember seeing this on the OpenAI website and was wondering how it would perform just with the indexing model.

A change could be made to txtai to have the transform function take a second argument with the operation (i.e. indexing or search).
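Hypothetically, the transform could then pick a model per operation; a sketch (the operation argument is not part of the current API):

def transform(inputs, operation):
    # Hypothetical signature: txtai would pass "index" or "search"
    engine = "text-search-ada-doc-001" if operation == "index" else "text-search-ada-query-001"

    response = openai.Embedding.create(input=inputs, engine=engine)
    return np.array([x.embedding for x in response.data], dtype=np.float32)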

As a workaround for now, what if after indexing you set the transform function to something new?

embeddings.config["transform"] = <new transform function>
embeddings.search(...)

bramses commented Jun 6, 2022

Thanks for the prompt reply!

I remember seeing this on the OpenAI website and was wondering how it would perform just with the indexing model.

Yeah, the results we found were pretty interesting haha. This may just be an OpenAI thing, but it'd be nice to support other models that follow a similar methodology.

A change could be made to txtai to have the transform function take a 2nd argument with the operation (i.e. indexing or search).

I think this would be great; a flag on the object could be really cool. But could both transformations be set up in the same instantiation?

As a workaround for now, what if after indexing you set the transform function to something new?

I'll give this a go and reply here.

bramses commented Jun 8, 2022

@davidmezzetti the new transform config seemed to work!

def transform_q(inputs):
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-query-001"
    )

    return np.array([x.embedding for x in response.data], dtype=np.float32)

...

# Swap in the query model before searching
embeddings.config["transform"] = transform_q
print(embeddings.search('reality show set in japan', 1))

# ...Terrace House: Tokyo 2019-2020... 'score': 0.2944885790348053 ...

An open question on my end: would you need to toggle back and forth between the two? What happens to the saved and compressed embeddings?

@davidmezzetti
Member

Glad to hear it. I think a helpful change to txtai would be in how the transform function is passed in: if it's a function, use it for both indexing and search; if it's a tuple, assume it's (index function, search function).

The workaround you have now doesn't affect the saved embeddings at all; it simply sets a dictionary element in the configuration. You would need to toggle back to the document transform if you upsert/index more data, then back to the query transform when searching.
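A minimal sketch of that toggle, using the transform and transform_q functions from the comments above (the upsert row is hypothetical):

# Toggle back to the document model before indexing more data
embeddings.config["transform"] = transform
embeddings.upsert([(100, "a new document", None)])

# Toggle to the query model before searching
embeddings.config["transform"] = transform_q
print(embeddings.search("a new document", 1))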
