Is there any relation between embedding data and /similarity API ? #378
Comments
A lot to unpack here, but I believe the main question is: is there a relationship between the similarity call and the size of an embeddings index? What models are you running for the embeddings model and similarity model? Does the system have a GPU or only CPUs? One thing I can think of is memory running low when both models are loaded. A good way to debug is with a Python prompt locally. For example, you can create a similarity instance and run your tests. Then create an embeddings instance and see if the response times increase. Then I'd look at the OS to see if memory is running low.
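To make that before/after comparison concrete, here is a minimal timing sketch. The harness itself is plain Python; the commented-out txtai calls are assumptions based on txtai's documented pipeline API, not something from this thread.

```python
import time

def time_call(fn, *args, repeat=3, **kwargs):
    """Call fn several times and return (result, best elapsed seconds).

    Taking the best of a few runs smooths out one-off OS noise, which
    matters when comparing timings before/after loading a second model."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best

# Usage sketch (assumes txtai is installed):
# from txtai.pipeline import Similarity
# similarity = Similarity()
# texts = ["Maine man wins $1M from $25 lottery ticket",
#          "Dont sacrifice slower friends in a bear attack"]
# _, t1 = time_call(similarity, "feel good story", texts)
# ...then load an Embeddings instance and call time_call again to see
# whether t1 grows, while watching free memory at the OS level.
```

Running the similarity timing once with only the similarity model loaded, and again after the embeddings instance is created, should show whether the slowdown tracks memory pressure rather than index size.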
Thank you for the reply. Hope I have mentioned the required things below.

System information (the one that is slower in finding /similarity):
cpuinfo: processor : 7
meminfo: MemTotal: 32879820 kB

Second system (my PC's) information, which is faster in both search and similarity:
System Info: Display:
Thank you for sharing the additional info. From looking at the specs, it appears that both are CPU-only, at least in terms of a CUDA-capable GPU. To confirm GPU usage, you can run the following from a Python prompt on each system. I would expect the output to be False on both if they are indeed CPU-only.

>>> import torch
>>> torch.cuda.is_available()

If they are both indeed CPU-only and system 2 runs faster, the only thing that stands out is that the 2nd processor is 2 years newer.
Well, after running the above Python prompt, I could see that both are CPU-only (since I got False as output). So, can I take it that there is no relationship between the similarity call and the size of the embeddings index, and that the issue may be because of RAM and so on? Can I get any solution to speed things up in my scenario? Like combining the usage of the search and similarity calls at once, to get the output in a short time.
In terms of speeding things up, I am not sure you need a separate similarity call given you're using the same model as used with the embeddings instance. Initially, I thought you were using a separate model to re-rank. What is the thinking behind a separate similarity call? It's re-doing what the embeddings search already does.
That's right, but the reason behind calling it separately is that in my case I'm storing id, text, keywords, and some more tags to filter out the results. When I initially search with a single SQL query, I still get results, but those results are not very accurate for my question input. The limit I'm giving in the SQL query is 10, since I expect the best 10 results only. (But sometimes the expected result does not appear in the 10 results at all because of its score.) So I increased the limit to 20 (but I still need the best 10 only).

So I decided to use the similarity call separately, just as I mentioned in my first comment. Now, with the help of the similarity call, I can re-arrange the 20 results and take only the top 10. I also saw an improvement in accuracy, as the ranking is much more proper this time. What the similarity call does here is compare each of the 20 results' keywords and tags (that I use for filtering purposes) against the question input. The question input is the same for all 20 results, but I want to match it against some dictionaries of the 20 results. With this I will be getting 20 IDs and scores, which I use to recalculate the scoring.

Or maybe using a different embeddings model might improve things, I don't know. By the way, thank you for this, I really appreciate your work and especially your kindness.
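The re-rank step described above can be sketched in plain Python. The row and score shapes here are assumptions for illustration: rows as dicts from /search and scores as (index, score) pairs from /similarity, with each index pointing back into the candidate list.

```python
def rerank(rows, scores, limit=10):
    """Re-order /search candidates by /similarity scores, keep top `limit`.

    rows:   candidate results, e.g. 20 dicts with "id", "text", "score", "data"
    scores: (index, score) pairs, where index refers back into `rows`
    """
    # Sort by similarity score descending, then substitute the new score in
    ordered = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [dict(rows[i], score=s) for i, s in ordered[:limit]]
```

With 20 candidates and limit=10, this returns the 10 best-matching rows, each carrying its re-ranked similarity score instead of the original search score.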
Have you tried setting the number of similar candidates to return? Set to 100 in the example below.

select id, text, score, data from txtai
where similar("what is the specialty of this ?", 100) AND something = 'something'
Yes, I tried setting it. Initially I was giving 1000, then after some days I was using 3000, then I increased it to 8000. How many candidates should I provide if I have 200k of data? FYI, this is why I have the question "Is there any relation between embedding data and the similarity call?"

I have another doubt regarding the SQL query: can we have 2 select queries at the same time, like in batch search, but get only 10 results? You know, if we use batchsearch and give it 2 select queries with limit 10, we will get 20 results (10 each). What I'm asking is simply that I want to do a UNION of the 20 results, because they may have duplicates. I think UNION is not supported by txtai, is it? If I'm wrong, please share how to do it. I feel like I could get some improvement with that.
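If the SQL layer doesn't expose UNION, the merge can be done client-side after the batchsearch call. A minimal sketch, assuming each batchsearch result is a list of dicts with "id" and "score" fields:

```python
def union_results(batches, limit=10):
    """Merge several result lists, dropping duplicate ids.

    When the same id appears in more than one list, keep the copy with
    the higher score, then return the top `limit` rows overall."""
    best = {}
    for rows in batches:
        for row in rows:
            rid = row["id"]
            if rid not in best or row["score"] > best[rid]["score"]:
                best[rid] = row
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)[:limit]
```

Calling this with the two 10-row result lists from batchsearch would yield the deduplicated top 10 across both queries.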
Regarding the metrics, I would expect search to be slower when there is more data. Have you analyzed the data to see if there is anything different between the 3 systems? One thing that could make similarity slower is that the inputs are larger in size. In other words, of the 120, each row on average has more text. Search and re-rank is a common use case but using the exact same similarity model to re-rank doesn't seem like it would do much. The only thing I can think of is that the query results aren't being ordered by score descending. |
There is no difference in data between the 3 systems, since I was executing the same JSON payload as a request body in all 3 systems with localhost:8000/similarity as URL. Sure, I'll try to change the model and check what happens. Thank you |
I have a lot of test data and embedding data /add-ed and /upsert-ed in a system,
In that system when I do /search request
(with an SQL query like
Select id, text, score, data from txtai where similar("what is the specialty of this ?", '8000') AND something = 'something' ... )
it takes around 0 to 3 seconds to pick 10 results. I'm OK with that (it may take time since the amount of data I have stored is large, like thousands of paragraphs extracted from text documents and stored).
I think only the /search and /batchsearch APIs are going to look into the 1000s of paragraphs to find the response. But the /similarity API is not like them (please correct me if I'm wrong), because when we use http://localhost:8000/similarity we ourselves give the list of texts and the question in the request body, and txtai returns the list of id, score pairs only.
What I'm trying to say is, when I made the following (POST) request by port forwarding txtAI from a Linux system that already has a lot of embeddings stored, it took 600ms to 1200ms to get the response (I checked this timing by sending the request a few more times using Postman):
curl -X POST "http://localhost:8000/similarity" -H "Content-Type: application/json" -d '{"query": "feel good story", "texts": ["Maine man wins $1M from $25 lottery ticket", "Dont sacrifice slower friends in a bear attack"]}'
But when I do the same in a fresh system, or a system with less embedding data, it takes only 50ms or less to find the response.
Is there any relation between embedding data and /similarity API ?
(For a text list with 2 or 3 sentences, it is fine to take 600 to 1200 ms, but when I try to do the same with 100s of texts and 1 question, it takes 13 or more seconds.)
My use case is to do the /similarity operation and calculate id, score pairs for 100s of texts and 1 question in 1 to 3 seconds or less.
I could achieve that on a system with less embedding data, but not in the other case.
So that's why I want to know, Is there any relation between embedding data and /similarity API ?
Actually, I have a semantic search program in Golang in which I use txtAI's APIs to perform searching. I make use of the /search and /similarity APIs to fetch the results for the user query. After getting the 10 results from /search, I do a /similarity operation on each of the 10 results' metadata keywords (which I indexed along with the "text" and "id", and which end up in the "data" field). Following this approach, I could reach the search result accuracy I expected, but it takes more than 13 seconds when I have more keywords or tags in the "data" field. I noticed that this time is spent only in the /similarity API.
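The flow above, one search call followed by similarity scoring over the candidates' texts in a single request rather than one /similarity call per result, can be sketched like this. The service URL and JSON shapes are assumptions based on the requests shown earlier in this issue; the network calls themselves are left commented out.

```python
import json
import urllib.request

API = "http://localhost:8000"  # assumed txtai service URL from this thread

def post_json(path, payload):
    """POST a JSON payload to the txtai service and decode the reply."""
    req = urllib.request.Request(
        API + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def pick_top(rows, scores, limit=10):
    """Keep the `limit` rows ranked best by similarity scores.

    scores: list of {"id": index-into-rows, "score": float} dicts,
    the shape the /similarity endpoint returns in this thread."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    return [rows[s["id"]] for s in ranked[:limit]]

# Sketch of the two-step flow (assumes the service is running):
# rows = post_json("/batchsearch", {"queries": [sql], "limit": 20})[0]
# scores = post_json("/similarity",
#                    {"query": question,
#                     "texts": [r["text"] for r in rows]})
# top10 = pick_top(rows, scores)
```

Batching all candidate texts into one /similarity request keeps the model invocation count constant, instead of paying the per-request overhead 10 or 20 times.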
I'm pointing out the same in the following to make it clear to understand.
Please suggest an efficient way to improve the search speed and search accuracy.