Feature request - to add support for nomic-ai/nomic-embed-text-v1 embeddings. #1418
Comments
Probably bge-m3 is better. Did you try that one? It also has long context and is much smaller than other LLM-based models.
For nomic, it uses an unreleased sentence-transformers 2.4.0dev. An attempt to pip install that dev version leads to failures in their API, so nothing is usable. Once sentence-transformers 2.4.0 is released without bugs, we can upgrade h2oGPT and pass the required trust_remote_code option; that option does not exist in prior versions of sentence-transformers.
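Once a sentence-transformers release supports it, the fix amounts to passing trust_remote_code=True when constructing the embedding model. A minimal sketch of how that option could be wired up (the helper name and logic are hypothetical, not h2oGPT's actual code):

```python
def embedding_model_kwargs(model_name: str) -> dict:
    # Models such as nomic-ai/nomic-embed-text-v1 ship custom modeling
    # code in their repo, so sentence-transformers (>= 2.4.0) must be
    # told to execute it via trust_remote_code.
    kwargs = {}
    if model_name.startswith("nomic-ai/"):
        kwargs["trust_remote_code"] = True
    return kwargs

# These kwargs would then be forwarded, e.g.:
# SentenceTransformer("nomic-ai/nomic-embed-text-v1",
#                     **embedding_model_kwargs("nomic-ai/nomic-embed-text-v1"))
```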
@pseudotensor No, I didn't try bge-m3, will try. To use it, do I just specify BAAI/bge-m3, or do I need to install anything?
@pseudotensor Wanted to try bge, but it failed (though regardless of the embeddings model). A few months ago this was working fine, when I did ingestion with another embeddings model (same dataset).
No other info is available; the process just terminated in the middle.
@pseudotensor Hi, can you please advise on the make_db issue I mentioned above?
For docTR you can't use their repo; it has to be installed from our fork, as in readme_linux.md or its linux_install.sh. The missing model suggests the original docTR repo is being used.
@pseudotensor Thanks.
Looks like a system OOM. You can check.
@pseudotensor Indeed OOM, but I don't know why this started happening; the memory size of the process reached 86GB, while my Mac has 32GB plus swap. In the past I was able to create the db from those files (tried the default embeddings now). Please advise.
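One quick way to confirm where the memory goes is to log the process's peak RSS during ingestion. A stdlib-only sketch (hypothetical helper, not part of h2oGPT; note that ru_maxrss units differ between Linux and macOS):

```python
import resource
import sys

def peak_rss_mb() -> float:
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024
```

Calling this periodically (e.g. after each parsed document) would show whether the 86GB spike happens during parsing or during embedding.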
Maybe one of the parsers went nuts; e.g., tesseract may have a bug. On gpt.h2o.ai I had one case where memory peaked at 512GB. Are you able to see from the verbose logging which document might have been the issue?
@pseudotensor No, I don't see it, and now I have only text files (removed the pdfs) and still get the same issue.
@pseudotensor Seems that this issue happens after the parsing.
Nice tool. It looks as if the Chroma team changed something. Maybe the batch size is larger and they send an arbitrarily large batch to the embedding model. Can you add some prints to this code? Lines 318 to 346 in 190310d
Specifically, the
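As an illustration of the kind of instrumentation being asked for, here is a hypothetical wrapper that splits one huge collection.add() into bounded batches and prints each batch size (not h2oGPT's actual code; the documents/ids keywords follow Chroma's collection.add API):

```python
def add_in_batches(collection, texts, ids, max_batch_size=4096, verbose=True):
    # If Chroma now forwards arbitrarily large batches to the embedding
    # model, splitting here keeps each call bounded and makes the batch
    # sizes visible in the logs.
    for i in range(0, len(texts), max_batch_size):
        batch = texts[i:i + max_batch_size]
        if verbose:
            print(f"chroma add: batch {i // max_batch_size} size={len(batch)}")
        collection.add(documents=batch, ids=ids[i:i + max_batch_size])
```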
@pseudotensor Max batch size is 83333, coming from the max_batch_size attr.
Try the latest changes. 83333 is very large; I made the max 4096. Or you can control it via the env var CHROMA_MAX_BATCH_SIZE.
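The clamp described here can be sketched as follows (function name hypothetical; CHROMA_MAX_BATCH_SIZE is the env var mentioned above):

```python
import os

def effective_max_batch_size(reported_max: int, default_cap: int = 4096) -> int:
    # Chroma may report a huge max_batch_size (e.g. 83333); clamp it to a
    # sane ceiling that the user can override via CHROMA_MAX_BATCH_SIZE.
    cap = int(os.environ.get("CHROMA_MAX_BATCH_SIZE", default_cap))
    return min(int(reported_max), cap)
```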
@pseudotensor Much better, thanks. Also, maybe it's a good idea to add a device option to make_db, as on Mac it can use Metal.
4096 is on the high end; yes, it can be made smaller as required. On CPU I expect it should work pretty well, but the issue is that bge-m3 has an 8k context, so it uses a lot more memory despite its size if chunks are large. I think the issue is that for summarization purposes we double the chunks, and there's no limit to their size, so that might be hitting the bge-m3 model hard since it'll take the full 8k. One will have to tell the model to truncate at (say) smaller token counts or (yes) limit batch size.
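The two mitigations mentioned (truncate inputs, bound the batch) could look like this for a sentence-transformers-style model, where max_seq_length is the truncation knob (helper name and defaults are hypothetical; bge-m3's full context is 8k tokens):

```python
def encode_bounded(model, chunks, batch_size=32, max_tokens=1024):
    # Truncate inputs well below the model's 8k context and embed in
    # fixed-size batches so memory stays bounded regardless of chunk count.
    model.max_seq_length = max_tokens
    out = []
    for i in range(0, len(chunks), batch_size):
        out.extend(model.encode(chunks[i:i + batch_size]))
    return out
```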
Hi,
Can you please add support for nomic-ai/nomic-embed-text-v1?
Trying to run it as is, I get the following errors:
H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
And
ValueError: Loading nomic-ai/nomic-embed-text-v1 requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option
trust_remote_code=True
to remove this error.
Thanks