External vectors #280
Attempted with minimal changes:

```python
import numpy as np
import requests

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

def transform(inputs):
    response = requests.post("https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/nli-mpnet-base-v2",
                             json={"inputs": inputs})
    return np.array(response.json(), dtype=np.float32)

# Index data using vectors from Inference API
embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract text field from result
    text = embeddings.search(f"select id, text, score from txtai where similar('{query}')", 1)[0]["text"]

    # Print text
    print("%-20s %s" % (query, text))
```

And see this exception:

```
Query                Best Match
--------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    text = embeddings.search(f"select id, text, score from txtai where similar('{query}')", 1)[0]["text"]
IndexError: list index out of range
```
|
Hi Lucas. txtai 4.5 is scheduled to be released shortly but hasn't been released yet. Did you install from PyPI? If you install from GitHub does it fix the issue?
|
Hi David, that worked! Should have checked the version number. Thanks again!! |
Ok, I have a new issue, but it may not be relevant. No problem if this is out of scope. Here I have a file to grab embeddings from OpenAI:

```python
import os

import numpy as np
import openai

from txtai.embeddings import Embeddings

openai.api_key = os.getenv("OPENAI_API_KEY")

def transform(inputs):
    response = openai.Embedding.create(
        input=inputs,
        engine="text-similarity-ada-001"
    )
    return np.array(response.data[0].embedding, dtype=np.float32)

data = ['test', 'test2']

obj = transform(data)
print(obj.shape)

embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print(embeddings.search('test', 1))
```

The printed dimensions show `(1024,)`. This gets the error:

```
File "/home/vscode/.local/lib/python3.8/site-packages/txtai/vectors/base.py", line 127, in batch
    dimensions = embeddings.shape[1]
IndexError: tuple index out of range
```

I edited base.py to grab the first element of the tuple as a test, and it gets a different error:

```
File "/home/vscode/.local/lib/python3.8/site-packages/faiss/__init__.py", line 307, in replacement_search
    n, d = x.shape
ValueError: not enough values to unpack (expected 2, got 1)
```

I'll keep working on this, so I may end up solving it. I think I need to understand how to reshape the array before indexing with FAISS. |
Hi Lucas. I think openai.Embedding.create returns an array of embeddings. The code looks to take only the first embedding, hence the (1024,) shape. Maybe something like this:
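The suggested code block did not survive extraction; a minimal sketch of the likely fix, using mocked response data (the `to_matrix` helper and `fake_data` values are illustrative, not OpenAI API output):

```python
import numpy as np
from types import SimpleNamespace

# Return ALL embeddings from response.data, not just the first one,
# so the result is the 2D (inputs, dimensions) array txtai/FAISS expect
def to_matrix(data):
    return np.array([x.embedding for x in data], dtype=np.float32)

# Mocked response.data standing in for an openai.Embedding.create result
fake_data = [SimpleNamespace(embedding=[0.1, 0.2, 0.3, 0.4]),
             SimpleNamespace(embedding=[0.5, 0.6, 0.7, 0.8])]

print(to_matrix(fake_data).shape)  # (2, 4): one row per input text
```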
|
Worked like a charm!!!! |
Hi @davidmezzetti when I hit
The embeddings init:
is this just a print error? Thanks! |
Hi @bramses I see the settings here use external vectors but faiss is the default ANN storage engine. When external vectors are used, it just uses the transform function to vectorize text but vectors are still stored internally. Did you intend to have data stored externally? |
Hi @davidmezzetti, thank you for the response! The reason I raised this issue was that the vector search wasn't performing well for OpenAI embeddings. @luquitared and I ran an offline search to compare, and our finding was that the OpenAI embeddings rely heavily on using both models, one to embed documents and one to embed queries:

```python
def transform(inputs):
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-doc-001"
    )
    return [x.embedding for x in response.data]

def qry(inputs):
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-query-001"
    )
    return [x.embedding for x in response.data]
```

We then used cosine similarity and got much better results:

```python
import numpy as np
from numpy.linalg import norm

# Define the query and document embedding arrays
A = np.array(query)
B = np.array(embeddings)

print("A:", A)
print("B:", B)

def cosine_wrapper(a, b):
    # Compute cosine similarity
    cosine = np.dot(a, b) / (norm(a) * norm(b))
    print("Cosine Similarity:", cosine)

for b in B:
    cosine_wrapper(A, b)
```

Is there a way to extend txtai to support this? Thanks! |
I remember seeing this on the OpenAI website and was wondering how it would perform with just the indexing model. A change could be made to txtai to have the transform function take a second argument with the operation (i.e. indexing or search).

As a workaround for now, what if after indexing you set the transform function to something new?

```python
embeddings.config["transform"] = <new transform function>
embeddings.search(...)
```
|
Thanks for the prompt reply!
Yeah, the results we found were pretty interesting, haha. This may just be an OpenAI thing, but it'd be nice for other models that follow a similar methodology.
I think this would be great; a flag on the object could be really cool. But could both transformations be set up in the same instantiation?
I'll give this a go and reply here |
@davidmezzetti the new transform config seemed to work!

```python
def transform_q(inputs):
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-query-001"
    )
    return np.array([x.embedding for x in response.data], dtype=np.float32)

...

embeddings.config["transform"] = transform_q
print(embeddings.search('reality show set in japan', 1))
# ...Terrace House: Tokyo 2019-2020... 'score': 0.2944885790348053 ...
```

An open question on my end: would you need to toggle back and forth between the two? What happens to the saved and compressed embeddings? |
Glad to hear it. I think a change that would be helpful to txtai would be how the transform function is passed in. If it's a function, use it for both indexing and search. If it's a tuple assume it's (index function, search function). The workaround you have now doesn't affect the saved embeddings at all. It's simply setting a dictionary element in the configuration. You would need to toggle if you upsert/index more data and back again when searching. |
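The tuple convention proposed above was not part of txtai at the time of this thread; a hypothetical sketch of how such a dispatch could work (the `resolve_transform` helper and operation names are illustrative, not txtai API):

```python
# Hypothetical dispatch for a (index function, search function) tuple:
# a plain function is used for both operations, a tuple is split by role
def resolve_transform(transform, operation):
    if isinstance(transform, tuple):
        # Tuple form: (index function, search function)
        index_fn, search_fn = transform
        return index_fn if operation == "index" else search_fn
    # Single function: shared by indexing and search
    return transform

doc_fn = lambda inputs: f"doc vectors for {inputs}"
query_fn = lambda inputs: f"query vectors for {inputs}"

print(resolve_transform((doc_fn, query_fn), "search")("test"))  # query vectors for test
```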
First of all, thank you for adding this feature!
I am attempting to recreate the example locally from the docs:
When I run the search, I receive an empty array. Perhaps `data` must be a tuple?