
Is there any relation between embedding data and /similarity API ? #378

Closed
akset2X opened this issue Nov 8, 2022 · 10 comments

akset2X commented Nov 8, 2022

I have a lot of test data and embedding data /add-ed and /upsert-ed in a system. When I do a /search request in that system, with an SQL query like

select id, text, score, data from txtai where similar("what is the specialty of this ?", '8000') AND something = 'something' ...

it takes around 0 to 3 seconds to pick 10 results. I'm OK with that, since it may take time given how much data I have stored (thousands of paragraphs extracted from text documents).

I think only the /search and /batchsearch APIs have to look through those thousands of paragraphs to find a response, while the /similarity API is not like them (please correct me if I'm wrong), because when we call http://localhost:8000/similarity we supply the list of texts and the question ourselves in the request body, and txtai only returns a list of id, score pairs.

What I'm trying to say is: when I sent the following POST request, port-forwarded to a txtai instance on a Linux system that already has a lot of embeddings stored, it took 600 ms to 1200 ms to get a response (I checked the timing by sending the request a few more times from Postman).

curl -X POST "http://localhost:8000/similarity" -H "Content-Type: application/json" -d '{"query": "feel good story", "texts": ["Maine man wins $1M from $25 lottery ticket", "Dont sacrifice slower friends in a bear attack"]}'
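
The response is just a list of id, score pairs ranked by score, something like the following (scores made up):

[{"id": 0, "score": 0.37}, {"id": 1, "score": 0.02}]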

But when I do the same on a fresh system, or on a system with less embedding data, it takes only 50 ms or less to return a response.

Is there any relation between embedding data and the /similarity API?

(For a text list with 2 or 3 sentences, 600 to 1200 ms is fine, but when I try the same with hundreds of texts and 1 question, it takes 13 seconds or more.)

My use case is to run the /similarity operation and calculate id, score pairs for hundreds of texts and 1 question in 1 to 3 seconds or less. I can achieve that on a system with less indexed (embedding) data, but not in the other case.

So that's why I want to know: is there any relation between embedding data and the /similarity API?

Actually, I have a semantic search program in Go that uses txtai's APIs to perform searching: I use /search and /similarity to fetch results for the user query. After getting 10 results from /search, I run a /similarity operation on each result's metadata keywords (which I indexed along with the "text" and "id", and which live in the "data" field). With this approach I got the accuracy improvement I expected, but it takes more than 13 seconds when there are many keywords or tags in the "data" field, and I noticed the time is spent almost entirely in the /similarity API.
I'll spell out the same flow to make it clear (a Python sketch follows the list):

  1. Do /search with a limit of 10; get 10 results and scores.
  2. Recalculate each result's score one by one, by finding the /similarity between the user question (query) and each result's metadata content (texts); get back the list of ids and scores.
  3. Merge these /similarity scores with the /search scores and compute the scores again.
  4. Re-sort the results based on score.
  5. Return 10 results to the user.
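
Here is a minimal sketch of the flow with txtai's Python API instead of the HTTP calls (the index path, the use of the "data" field for keywords and the 0.5 weights are placeholders for my setup):

from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})
embeddings.load("/path/to/index")  # placeholder path to the saved index

def search(question, limit=10):
    # 1. Search, also selecting the stored metadata from the data field
    results = embeddings.search(
        f"select id, text, score, data from txtai where similar('{question}')", limit)

    # 2. Re-score each result against its metadata keywords/tags
    texts = [result["data"] for result in results]
    for index, score in embeddings.similarity(question, texts):
        # 3. Merge the similarity score with the search score (weights are placeholders)
        results[index]["score"] = 0.5 * results[index]["score"] + 0.5 * score

    # 4. Re-sort on the merged score and 5. return the results to the user
    return sorted(results, key=lambda r: r["score"], reverse=True)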

Please suggest an efficient way to improve both search speed and search accuracy.

akset2X changed the title from "/similarity API is slower in" to "Is there any relation between embedding data and /similarity API ?" on Nov 8, 2022
davidmezzetti (Member) commented

A lot to unpack here, but I believe the main question is: is there a relationship between the similarity call and the size of an embeddings index?

What models are you running for the embeddings model and similarity model? Does the system have a GPU or only CPUs? One thing I can think of is memory running low when both models are loaded.

A good way to debug is with a local Python prompt. For example, create a similarity instance and run your tests. Then create an embeddings instance and see if the response times increase. Then I'd look at the OS to see if memory is running low.
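
One way to run that comparison from a single Python prompt (a sketch; the model and index paths are just examples):

import time

from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})

query = "feel good story"
texts = ["Maine man wins $1M from $25 lottery ticket",
         "Dont sacrifice slower friends in a bear attack"]

# Time the similarity call with no index loaded
start = time.perf_counter()
embeddings.similarity(query, texts)
print(f"no index loaded: {time.perf_counter() - start:.3f}s")

# Load the existing index and time the same call again
embeddings.load("/path/to/index")  # example path
start = time.perf_counter()
embeddings.similarity(query, texts)
print(f"index loaded: {time.perf_counter() - start:.3f}s")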

akset2X (Author) commented Nov 8, 2022

Thank you for the reply.

I hope I have mentioned the required things:

embeddings:
  path: sentence-transformers/all-MiniLM-L6-v2
  content: true

No Similarity instance is created.

System information (this is the system I mentioned is slower at /similarity):

cpuinfo:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
stepping : 7
microcode : 0xffffffff
cpu MHz : 2593.907
cache size : 36608 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

meminfo:

MemTotal: 32879820 kB
MemFree: 302504 kB
MemAvailable: 4136192 kB
Buffers: 2566244 kB
Cached: 1313200 kB
SwapCached: 0 kB
Active: 1172320 kB
Inactive: 30385236 kB
Active(anon): 87964 kB
Inactive(anon): 27588144 kB
Active(file): 1084356 kB
Inactive(file): 2797092 kB
Unevictable: 18672 kB
Mlocked: 18672 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 316 kB
Writeback: 0 kB
AnonPages: 27690416 kB
Mapped: 372724 kB
Shmem: 5288 kB
KReclaimable: 426856 kB
Slab: 683836 kB
SReclaimable: 426856 kB
SUnreclaim: 256980 kB
KernelStack: 34752 kB
PageTables: 91072 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 16439908 kB
Committed_AS: 45515284 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 76488 kB
VmallocChunk: 0 kB
Percpu: 20704 kB
HardwareCorrupted: 0 kB
AnonHugePages: 1736704 kB
Hugepagesize: 2048 kB
DirectMap4k: 3202312 kB
DirectMap2M: 30351360 kB
DirectMap1G: 2097152 kB

Second system (my PC) information (this one is faster at both search and similarity):

System Info:
OS Name                       Microsoft Windows 10
Version                       10.0.19042 Build 19042
Processor               11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz, 2611 Mhz, 4 Core(s), 8 Logical Processor(s)
Installed Physical Memory (RAM)     16.0 GB
Total Physical Memory         15.7 GB
System Type             x64-based PC

Display:
Name              Intel(R) Iris(R) Xe Graphics
Adapter Type            Intel(R) Iris(R) Xe Graphics Family, Intel Corporation compatible
Adapter Description     Intel(R) Iris(R) Xe Graphics
Adapter RAM       1.00 GB (1,073,741,824 bytes)

davidmezzetti (Member) commented

Thank you for sharing the additional info.

From looking at the specs, it appears that both are CPU-only, at least in terms of a CUDA-capable GPU.

To confirm GPU usage, you can run the following from a Python prompt on each system. I would expect the output to be False on both if they are indeed CPU-only.

>>> import torch
>>> torch.cuda.is_available()

If both are indeed CPU-only and the second system runs faster, the only thing that stands out is that the second processor is two years newer.

akset2X (Author) commented Nov 15, 2022

Well, after running the above Python prompt, I can see that both are CPU-only (I got False as the output on both). So can I take it that there is no relationship between the similarity call and the size of the embeddings index, and that the issue may be due to RAM and so on?

Can I get a suggestion to speed things up in my scenario? For example, combining the search and similarity calls into one and getting the output in a short time.

davidmezzetti (Member) commented

In terms of speeding things up, I'm not sure you need a separate similarity call, given that you're using the same model as the embeddings instance. Initially, I thought you were using a separate model to re-rank.

What is the thinking behind a separate similarity call? It's re-doing what the embeddings.index call has already done.
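
For example, with a content-enabled index, the score returned by search already comes from the same model, so a follow-up similarity call over the same texts recomputes the same thing (a sketch, assuming an embeddings instance is already loaded):

# Both scores below come from the same embeddings model
results = embeddings.search("select id, text, score from txtai where similar('feel good story')", 10)
rescored = embeddings.similarity("feel good story", [result["text"] for result in results])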

akset2X (Author) commented Nov 17, 2022

That's right, but the reason for calling it separately is that in my case I'm storing id, text, keywords and some more tags used to filter the results. When I initially search with a single SQL query, I still get results, but those results are not very accurate for my question input.

The limit I give in the SQL query is 10, since I expect only the best 10 results.

(But sometimes the expected result doesn't appear in the 10 results at all, because of its score.)

So I increased the limit to 20 (but I still need only the best 10).

So I decided to use the similarity call separately, just as I mentioned in my first comment. Now, with the help of the similarity call, I can re-rank the 20 results and take only the top 10. I also saw the accuracy improve, as the ranking is much better this time.

What the similarity call does here is compare each of the 20 results' keywords and tags (which I use for filtering) with the question input.

The question input is the same for all 20 results, but I want to match it against each result's metadata, as follows:

POST localhost:8000/similarity

{
  "query": "same question as in the SQL query",
  "texts": [
    "result 1's keywords, tags, description",
    "result 2's keywords, tags, etc.",
    ...
    "result 20's keywords, tags, etc."
  ]
}

With this I get back 20 ids and scores, which I use to recalculate the scoring.
You're right that the search call itself may do all of this; search might compare the user question with the text, keywords, tags and whatever else is present in an embedding's data. But from my observation, I'd say search/batchsearch returns results based mainly on the "text" field.

Or maybe using a different embeddings model would improve things, I don't know.
For my use case, the position of every result and its rank/score is important.

By the way, thank you for this. I really appreciate your work, and especially your kindness.

davidmezzetti (Member) commented Nov 18, 2022

Have you tried setting the number of similar candidates to return? It is set to 100 in the example below.

select id, text, score, data from txtai
where similar("what is the specialty of this ?", 100) AND something ='something' 
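
For reference, the same query through the Python API (the 'something' filter is the placeholder column from your query):

results = embeddings.search(
    "select id, text, score, data from txtai "
    "where similar('what is the specialty of this ?', 100) "
    "and something = 'something'",
    10)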

akset2X (Author) commented Nov 19, 2022

Yes, I tried setting it. Initially I gave '1000', then after some days I used '3000', then I increased it to '8000'. How many candidates should I provide if I have 200k records?

FYI, I tried running txtai's /count API on a few systems; the output is below:

System 1: count 747. A similarity call (120 texts and 1 query) takes around 1 second. This is my personal laptop, which I mentioned is faster at search and similarity calls.
System 2: count 202,397. A similarity call (120 texts and 1 query) takes 8 seconds. This is the system that takes a lot of time for search and similarity calls.
System 3 (included for testing): count 46,859. A similarity call (120 texts and 1 query) takes 3 seconds.

This is why I have the question "Is there any relation between embedding data and the similarity call?" Systems 2 and 3 have almost the same configuration, and no system has a dedicated GPU, yet there is still a variation in the time it takes to fetch results.

I have another doubt regarding SQL queries: can we run 2 select queries at the same time, like in batchsearch, but get only 10 results in total?

As you know, if we use batchsearch with 2 select queries and a limit of 10, we get 20 results (10 each).

What I'm asking is simply that I want a UNION of the 20 results, because they may contain duplicates. I think UNION is not supported by txtai, is it? If I'm wrong, please show me how to do it. I feel I could get some improvement with that. What I have in mind is something like the client-side merge below.
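
(A sketch of that merge, assuming each result has "id" and "score" fields:)

def union(batches, limit=10):
    # Deduplicate by id, keeping the best score for each document
    best = {}
    for results in batches:
        for result in results:
            current = best.get(result["id"])
            if current is None or result["score"] > current["score"]:
                best[result["id"]] = result
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)[:limit]

# Merge the two batchsearch result lists into a single top 10
# top10 = union(embeddings.batchsearch([query1, query2], 10))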

davidmezzetti (Member) commented

Regarding the metrics, I would expect search to be slower when there is more data.

Have you analyzed the data to see if there is anything different between the 3 systems? One thing that could make similarity slower is that the inputs are larger. In other words, of the 120 texts, each row on average has more text.

Search and re-rank is a common use case, but re-ranking with the exact same similarity model doesn't seem like it would do much. The only thing I can think of is that the query results aren't being ordered by score descending.
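
A quick local test of the input-size theory (the synthetic texts below just vary the length):

import time

from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})

for words in (10, 100, 1000):
    texts = [" ".join(["word"] * words)] * 120  # 120 rows per run
    start = time.perf_counter()
    embeddings.similarity("feel good story", texts)
    print(f"{words} words per row: {time.perf_counter() - start:.2f}s")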

akset2X (Author) commented Nov 20, 2022

There is no difference in the data between the 3 systems, since I executed the same JSON payload as the request body on all 3 systems, with localhost:8000/similarity as the URL. Sure, I'll try changing the model and check what happens.

Thank you

akset2X closed this as completed Nov 22, 2022