External vectors #280

Closed
luquitared opened this issue May 17, 2022 · 13 comments

luquitared commented May 17, 2022

First of all, thank you for adding this feature!

I am attempting to recreate the example locally from the docs:

import numpy as np
import requests
from txtai.embeddings import Embeddings

data = ['test', 'test2']
def transform(inputs):
  response = requests.post("https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/nli-mpnet-base-v2",
                           json={"inputs": inputs})

  return np.array(response.json(), dtype=np.float32)

# Index data using vectors from Inference API
embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

print(embeddings.search('test', 1))

When I run the search, I receive an empty array. Perhaps data must be a tuple?

luquitared commented May 17, 2022

Attempted with minimal changes:

import numpy as np
import requests
from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

def transform(inputs):
  response = requests.post("https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/nli-mpnet-base-v2",
                           json={"inputs": inputs})

  return np.array(response.json(), dtype=np.float32)

# Index data using vectors from Inference API
embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract text field from result
    text = embeddings.search(f"select id, text, score from txtai where similar('{query}')", 1)[0]["text"]

    # Print text
    print("%-20s %s" % (query, text))

And I see this exception:

Query                Best Match
--------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    text = embeddings.search(f"select id, text, score from txtai where similar('{query}')", 1)[0]["text"]
IndexError: list index out of range

@davidmezzetti
Member

Hi Lucas. txtai 4.5 is scheduled to be released shortly but hasn't been released yet. Did you install from PyPI? If you install from GitHub, does it fix the issue?

pip install git+https://github.com/neuml/txtai

@luquitared
Author

Hi David, that worked! I should have checked the version number. Thanks again!

luquitared commented May 17, 2022

OK, I have a new issue, but it may not be relevant. No problem if this is out of scope.

Here I have a script that grabs embeddings from OpenAI:

import numpy as np
from txtai.embeddings import Embeddings
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")


def transform(inputs):

    response = openai.Embedding.create(
        input=inputs,
        engine="text-similarity-ada-001"
    )

    return np.array(response.data[0].embedding, dtype=np.float32)

data = ['test', 'test2']

obj = transform(data)
print(obj.shape)
embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print(embeddings.search('test', 1))

The printed dimensions show:
(1024,)

This raises the error:

File "/home/vscode/.local/lib/python3.8/site-packages/txtai/vectors/base.py", line 127, in batch
    dimensions = embeddings.shape[1]
IndexError: tuple index out of range

I edited base.py to grab the first element of the tuple as a test, and it gets a different error:

  File "/home/vscode/.local/lib/python3.8/site-packages/faiss/__init__.py", line 307, in replacement_search
    n, d = x.shape
ValueError: not enough values to unpack (expected 2, got 1)

I'll keep working on this, so I may end up solving it myself. I think I need to understand how to reshape the array before indexing with FAISS.

@davidmezzetti
Member

Hi Lucas. openai.Embedding.create returns an array of embeddings, one per input, but the code above only takes the first one, hence the (1024,) shape. txtai expects a 2D array with one row per input.

Maybe something like this:

return np.array([x.embedding for x in response.data], dtype=np.float32)
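Putting that together, the full transform would look something like this (same engine as in the earlier snippet):

def transform(inputs):
    # Request embeddings for all inputs in one call
    response = openai.Embedding.create(
        input=inputs,
        engine="text-similarity-ada-001"
    )

    # Return a 2D array: one row of floats per input text
    return np.array([x.embedding for x in response.data], dtype=np.float32)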

@luquitared
Author

Worked like a charm!!!!

bramses commented Jun 3, 2022

Hi @davidmezzetti, when I call .info() I receive this object, which shows the backend is faiss:

{
  "backend": "faiss",
  "build": {
    "create": "2022-06-03T22:50:56Z",
    "python": "3.10.4",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Darwin (arm64)",
    "txtai": "4.5.0"
  },
  "content": "sqlite",
  "dimensions": 1024,
  "method": "external",
  "offset": 11,
  "transform": "<function transform at 0x102c17d90>",
  "update": "2022-06-03T22:50:56Z"
}

The embeddings init:

embeddings = Embeddings({"method": "external", "transform": transform, "content": True})

Is this just a print error? Thanks!

@davidmezzetti
Member

Hi @bramses

I see the settings here use external vectors, but faiss is the default ANN storage engine. When external vectors are used, txtai only uses the transform function to vectorize text; the vectors themselves are still stored internally.

Did you intend to have data stored externally?

luquitared reopened this Jun 4, 2022

bramses commented Jun 4, 2022

Hi @davidmezzetti, thank you for the response! The reason I raised this issue is that vector search wasn't performing well with OpenAI embeddings.

@luquitared and I ran an offline search to compare, and our finding was that OpenAI embeddings rely heavily on using both models: text-search-ada-doc-001 to embed documents and text-search-ada-query-001 to embed queries.

def transform(inputs):
    # Embed documents with the document model
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-doc-001"
    )

    return [x.embedding for x in response.data]

def qry(inputs):
    # Embed queries with the query model
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-query-001"
    )

    return [x.embedding for x in response.data]

We then computed cosine similarity directly and got much better results:

import numpy as np
from numpy.linalg import norm

# Query vector (output of qry) and document vectors (output of transform)
A = np.array(query)
B = np.array(embeddings)

print("A:", A)
print("B:", B)

def cosine_wrapper(a, b):
    # Compute cosine similarity between two vectors
    cosine = np.dot(a, b) / (norm(a) * norm(b))
    print("Cosine Similarity:", cosine)

for b in B:
    cosine_wrapper(A, b)

The output:

Cosine Similarity: [0.17670046]
Cosine Similarity: [0.19275996]
Cosine Similarity: [0.1923126]
Cosine Similarity: [0.19251478]
**Cosine Similarity: [0.19828194]** <-- Correct answer; docs + query embedded separately
Cosine Similarity: [0.1548088]
Cosine Similarity: [0.14926458]
Cosine Similarity: [0.16341928]
Cosine Similarity: [0.37755975]
Cosine Similarity: [0.35022966]
Cosine Similarity: [0.32667258]
Cosine Similarity: [0.37577804]
Cosine Similarity: [0.29246046]
Cosine Similarity: [0.32608339]
**Cosine Similarity: [0.39383586]** <-- Wrong answer; docs and query embedded with the same model `text-search-ada-doc-001`
Cosine Similarity: [0.36037904]

Is there a way to extend Embeddings({"method": "external", "transform": transform, "content": True}) so that we can embed the query with a different external model if we so choose?

Thanks!

davidmezzetti commented Jun 6, 2022

I remember seeing this on the OpenAI website and was wondering how it would perform just with the indexing model.

A change could be made to txtai to have the transform function take a second argument with the operation (i.e. indexing or search).
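Hypothetically, the transform could then pick a model per operation; a sketch (the operation argument is not part of the current API):

def transform(inputs, operation):
    # Hypothetical signature: txtai would pass "index" or "search"
    engine = "text-search-ada-doc-001" if operation == "index" else "text-search-ada-query-001"

    response = openai.Embedding.create(input=inputs, engine=engine)
    return np.array([x.embedding for x in response.data], dtype=np.float32)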

As a workaround for now, what if after indexing you set the transform function to something new?

embeddings.config["transform"] = <new transform function>
embeddings.search(...)

bramses commented Jun 6, 2022

Thanks for the prompt reply!

I remember seeing this on the OpenAI website and was wondering how it would perform just with the indexing model.

Yeah, the results we found were pretty interesting haha. This may just be an OpenAI thing, but it'd be nice to support other models that follow a similar methodology.

A change could be made to txtai to have the transform function take a 2nd argument with the operation (i.e. indexing or search).

I think this would be great; a flag on the object could be really cool. But could both transformations be set up in the same instantiation?

As a workaround for now, what if after indexing you set the transform function to something new?

I'll give this a go and reply here.

bramses commented Jun 8, 2022

@davidmezzetti the new transform config seemed to work!

def transform_q(inputs):
    response = openai.Embedding.create(
        input=inputs,
        engine="text-search-ada-query-001"
    )

    return np.array([x.embedding for x in response.data], dtype=np.float32)

...

# Swap in the query model before searching
embeddings.config["transform"] = transform_q
print(embeddings.search('reality show set in japan', 1))

# ...Terrace House: Tokyo 2019-2020... 'score': 0.2944885790348053 ...

An open question on my end: would you need to toggle back and forth between the two? What happens to the saved and compressed embeddings?

@davidmezzetti
Member

Glad to hear it. I think a helpful change to txtai would be in how the transform function is passed in: if it's a function, use it for both indexing and search; if it's a tuple, assume it's (index function, search function).

The workaround you have now doesn't affect the saved embeddings at all; it simply sets a dictionary element in the configuration. You would need to toggle back to the document transform if you upsert/index more data, then back to the query transform when searching.
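A minimal sketch of that toggle, using the transform and transform_q functions from the comments above (the upsert row is hypothetical):

# Toggle back to the document model before indexing more data
embeddings.config["transform"] = transform
embeddings.upsert([(100, "a new document", None)])

# Toggle to the query model before searching
embeddings.config["transform"] = transform_q
print(embeddings.search("a new document", 1))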
