ValueError: No onnx files found #225
Yeah, you need an onnx model. https://huggingface.co/Xenova/all-MiniLM-L6-v2
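For reference, here is a minimal sketch (not from the thread) of how a Sentence Transformers checkpoint could be exported to ONNX with Hugging Face optimum; the output directory name is only an example and must match whatever path is later passed to infinity_emb.

# Minimal sketch, assuming `optimum[onnxruntime]` is installed.
# The save_dir path is a placeholder, not something the thread prescribes.
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"
save_dir = "onnx_models/sentence-transformers/all-MiniLM-L6-v2"

# export=True converts the PyTorch weights to ONNX on the fly
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# persist model.onnx plus the tokenizer files so infinity_emb can find them
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)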
Does this work @netw0rkf10w?
@michaelfeil Thanks for your reply. I managed to get it working, but the latency is too high; something must be wrong:

import os
import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs
from sentence_transformers import SentenceTransformer

DEVICE = os.environ.get("DEVICE", "cpu")
MODEL_NAME = 'onnx_models/sentence-transformers/all-MiniLM-L6-v2'

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path=MODEL_NAME,
        device=DEVICE,
        batch_size=1,
        lengths_via_tokenize=False,
        model_warmup=True,
        engine="torch" if DEVICE.startswith("cuda") else "optimum",
    )
)

async def encode_infinity(sentences: list[str]):
    async with engine:  # engine starts with engine.astart()
        embeddings, usage = await engine.embed(sentences)
    return embeddings

async def test_infty(sentences):
    start = time.monotonic()
    embeddings_inf = await encode_infinity(sentences)
    print('infinity time: ', time.monotonic() - start)

def test_sbert(sentences):
    model_minilm = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    model_minilm.eval()
    start = time.time()
    embeddings = model_minilm.encode(sentences)
    print('sbert time: ', time.time() - start)

if __name__ == "__main__":
    sentences = [
        "Un avion est en train de décoller.",
        "Un homme joue d'une grande flûte.",
        "Un homme étale du fromage râpé sur une pizza.",
        "Une personne jette un chat au plafond.",
        "Une personne est en train de plier un morceau de papier.",
    ]
    asyncio.run(test_infty(sentences))
    test_sbert(sentences)

Running on a CPU-only machine, I obtained:

infinity time: 1.0698386139999911
sbert time: 0.09578514099121094

What did I do wrong, please? The full output is appended below for your information. Thanks!

$ python infty.py
INFO 2024-05-18 11:58:42,796 datasets INFO: PyTorch version 2.3.0+cpu available. config.py:58
INFO 2024-05-18 11:58:44,357 infinity_emb INFO: model=`onnx_models/sentence-transformers/all-MiniLM-L6-v2` selected, using engine=`optimum` and select_model.py:54
device=`cpu`
INFO 2024-05-18 11:58:44,360 infinity_emb INFO: Found 2 onnx files: utils_optimum.py:193
[PosixPath('onnx_models/sentence-transformers/all-MiniLM-L6-v2/model_optimized.onnx'),
PosixPath('onnx_models/sentence-transformers/all-MiniLM-L6-v2/model.onnx')]
INFO 2024-05-18 11:58:44,362 infinity_emb INFO: Using onnx_models/sentence-transformers/all-MiniLM-L6-v2/model.onnx as the model utils_optimum.py:197
INFO 2024-05-18 11:58:44,364 infinity_emb INFO: Optimized model found at onnx_models/sentence-transformers/all-MiniLM-L6-v2/model_optimized.onnx, utils_optimum.py:99
skipping optimization
INFO 2024-05-18 11:58:44,691 infinity_emb INFO: Getting timings for batch_size=1 and avg tokens per sentence=3 select_model.py:77
0.16 ms tokenization
7.08 ms inference
0.15 ms post-processing
7.39 ms total
embeddings/sec: 135.29
INFO 2024-05-18 11:58:45,270 infinity_emb INFO: Getting timings for batch_size=1 and avg tokens per sentence=512 select_model.py:83
1.99 ms tokenization
282.79 ms inference
0.17 ms post-processing
284.95 ms total
embeddings/sec: 3.51
INFO 2024-05-18 11:58:45,273 infinity_emb INFO: model warmed up, between 3.51-135.29 embeddings/sec at batch_size=1 select_model.py:84
INFO 2024-05-18 11:58:45,276 infinity_emb INFO: creating batching engine batch_handler.py:291
INFO 2024-05-18 11:58:45,278 infinity_emb INFO: ready to batch requests. batch_handler.py:354
infinity time: 1.0698386139999911
INFO 2024-05-18 11:58:46,347 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: SentenceTransformer.py:113
sentence-transformers/all-MiniLM-L6-v2
/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
INFO 2024-05-18 11:58:47,485 sentence_transformers.SentenceTransformer INFO: Use pytorch device_name: cpu SentenceTransformer.py:219
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.77it/s]
sbert time: 0.09578514099121094
In this case, you're starting / stopping the engine on every call. Instead of async with, you can also call engine.astart() and engine.astop() yourself. That startup/shutdown should account for most of the time.
@michaelfeil Thanks, but could you please tell me how to do it correctly? I couldn't find it in the docs, sorry.
Updated the docs and the readme, @netw0rkf10w! Note that it should not be significantly faster for 1 embedding with 1 short sentence. Expect significant speedups for large batches / long sequences.

import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs

sentences = ["Embed this is sentence via Infinity.", "Paris is in France."]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="michaelfeil/bge-small-en-v1.5", engine="optimum")
)

async def main():
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)

    # or handle the async start / stop yourself
    await engine.astart()
    t_start = time.time()
    embeddings, usage = await engine.embed(sentences=sentences)
    print(time.time() - t_start)
    await engine.astop()

asyncio.run(main())
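A minimal sketch of the same idea applied to the benchmark above, assuming the engine setup from the earlier snippets: start the engine once, reuse it across several embed calls, and stop it only at shutdown, so the startup cost is paid a single time rather than per call. The example batches and prints are illustrative, not from the thread.

import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="michaelfeil/bge-small-en-v1.5", engine="optimum")
)

async def main():
    await engine.astart()  # start once
    try:
        for batch in (["first request"], ["second request", "third sentence"]):
            t0 = time.time()
            embeddings, usage = await engine.embed(sentences=batch)
            # only the embed call is timed, not engine startup
            print(len(embeddings), "embeddings in", time.time() - t0, "s")
    finally:
        await engine.astop()  # stop once, when the application shuts down

asyncio.run(main())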
Hello,
First of all thank you very much for this tool!
I am trying it out (on CPU) with the following code:
and obtained the following error:
The documentation says that any Sentence Transformers model can be used, but that doesn't seem to be the case. I guess I'll have to manually convert the weights to ONNX and place them somewhere for it to work?
Thank you in advance for your help!