
ValueError: No onnx files found #225

Open
netw0rkf10w opened this issue May 17, 2024 · 6 comments


@netw0rkf10w

Hello,

First of all thank you very much for this tool!

I am trying it out (on CPU) with the following code:

import asyncio
import os

import numpy as np

from infinity_emb import AsyncEmbeddingEngine, EngineArgs

DEVICE = os.environ.get("DEVICE", "cpu")
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path=MODEL_NAME,
        device=DEVICE,
        batch_size=1,
        lengths_via_tokenize=False,
        model_warmup=True,
        engine="torch" if DEVICE.startswith("cuda") else "optimum",
    )
)

async def encode_infinity(sentences: list[str]):
    return np.array((await engine.embed(sentences))[0])

asyncio.run(encode_infinity(["Hello"]))

and obtained the following error:

 File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/engine.py", line 62, in from_args
    engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
  File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/engine.py", line 48, in __init__
    self._model, self._min_inference_t, self._max_inference_t = select_model(
  File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/inference/select_model.py", line 62, in select_model
    loaded_engine = unloaded_engine.value(engine_args=engine_args)
  File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/transformer/embedder/optimum.py", line 38, in __init__
    onnx_file = get_onnx_files(
  File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/transformer/utils_optimum.py", line 202, in get_onnx_files
    raise ValueError(
ValueError: No onnx files found for sentence-transformers/all-MiniLM-L6-v2 and revision None

The documentation says that any Sentence Transformers model can be used, but that doesn't seem to be the case. I guess I'll have to manually convert the weights to ONNX and place them somewhere for this to work?

Thank you in advance for your help!

@michaelfeil
Owner

Yeah, you need an ONNX model, e.g. https://huggingface.co/Xenova/all-MiniLM-L6-v2
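For anyone who prefers a local copy, here is a minimal sketch (editorial, not from this thread) of exporting the checkpoint to ONNX with Hugging Face Optimum. It assumes optimum is installed with the onnxruntime extra; the output directory name is arbitrary, chosen here to match the path used in a later comment:

# Sketch: convert the PyTorch checkpoint to ONNX once, save it locally,
# then point infinity's model_name_or_path at the saved directory.
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"
save_dir = "onnx_models/sentence-transformers/all-MiniLM-L6-v2"

# export=True converts the weights to ONNX while loading.
ort_model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Write model.onnx plus tokenizer/config files into save_dir.
ort_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)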

@michaelfeil
Owner

Does this work, @netw0rkf10w?

@netw0rkf10w
Author

@michaelfeil Thanks for your reply. I managed to get it working, but the latency is too high; something must be wrong:

import os
import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs
from sentence_transformers import SentenceTransformer

DEVICE = os.environ.get("DEVICE", "cpu")
MODEL_NAME = 'onnx_models/sentence-transformers/all-MiniLM-L6-v2'
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path=MODEL_NAME,
        device=DEVICE,
        batch_size=1,
        lengths_via_tokenize=False,
        model_warmup=True,
        engine="torch" if DEVICE.startswith("cuda") else "optimum",
    )
)

async def encode_infinity(sentences: list[str]):
    async with engine: # engine starts with engine.astart()
        embeddings, usage = await engine.embed(sentences)
    return embeddings

async def test_infty(sentences):
    start = time.monotonic()
    embeddings_inf = await encode_infinity(sentences)
    print('infinity time: ', time.monotonic() - start)

def test_sbert(sentences):
    model_minilm = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    model_minilm.eval()
    start = time.time()
    embeddings = model_minilm.encode(sentences)
    print('sbert time: ', time.time() - start)

if __name__ == "__main__":
    sentences = ["Un avion est en train de décoller.",
            "Un homme joue d'une grande flûte.",
            "Un homme étale du fromage râpé sur une pizza.",
            "Une personne jette un chat au plafond.",
            "Une personne est en train de plier un morceau de papier.",
            ]
    asyncio.run(test_infty(sentences))
    test_sbert(sentences)

Running on a CPU-only machine, I obtained:

infinity time:  1.0698386139999911
sbert time:  0.09578514099121094

What did I do wrong, please? The full output is appended below for reference. Thanks!

$ python infty.py 
INFO     2024-05-18 11:58:42,796 datasets INFO: PyTorch version 2.3.0+cpu available.                                                                         config.py:58
INFO     2024-05-18 11:58:44,357 infinity_emb INFO: model=`onnx_models/sentence-transformers/all-MiniLM-L6-v2` selected, using engine=`optimum` and    select_model.py:54
         device=`cpu`                                                                                                                                                    
INFO     2024-05-18 11:58:44,360 infinity_emb INFO: Found 2 onnx files:                                                                              utils_optimum.py:193
         [PosixPath('onnx_models/sentence-transformers/all-MiniLM-L6-v2/model_optimized.onnx'),                                                                          
         PosixPath('onnx_models/sentence-transformers/all-MiniLM-L6-v2/model.onnx')]                                                                                     
INFO     2024-05-18 11:58:44,362 infinity_emb INFO: Using onnx_models/sentence-transformers/all-MiniLM-L6-v2/model.onnx as the model                 utils_optimum.py:197
INFO     2024-05-18 11:58:44,364 infinity_emb INFO: Optimized model found at onnx_models/sentence-transformers/all-MiniLM-L6-v2/model_optimized.onnx, utils_optimum.py:99
         skipping optimization                                                                                                                                           
INFO     2024-05-18 11:58:44,691 infinity_emb INFO: Getting timings for batch_size=1 and avg tokens per sentence=3                                     select_model.py:77
                 0.16     ms tokenization                                                                                                                                
                 7.08     ms inference                                                                                                                                   
                 0.15     ms post-processing                                                                                                                             
                 7.39     ms total                                                                                                                                       
         embeddings/sec: 135.29                                                                                                                                          
INFO     2024-05-18 11:58:45,270 infinity_emb INFO: Getting timings for batch_size=1 and avg tokens per sentence=512                                   select_model.py:83
                 1.99     ms tokenization                                                                                                                                
                 282.79   ms inference                                                                                                                                   
                 0.17     ms post-processing                                                                                                                             
                 284.95   ms total                                                                                                                                       
         embeddings/sec: 3.51                                                                                                                                            
INFO     2024-05-18 11:58:45,273 infinity_emb INFO: model warmed up, between 3.51-135.29 embeddings/sec at batch_size=1                                select_model.py:84
INFO     2024-05-18 11:58:45,276 infinity_emb INFO: creating batching engine                                                                         batch_handler.py:291
INFO     2024-05-18 11:58:45,278 infinity_emb INFO: ready to batch requests.                                                                         batch_handler.py:354
infinity time:  1.0698386139999911
INFO     2024-05-18 11:58:46,347 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer:                          SentenceTransformer.py:113
         sentence-transformers/all-MiniLM-L6-v2                                                                                                                          
/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO     2024-05-18 11:58:47,485 sentence_transformers.SentenceTransformer INFO: Use pytorch device_name: cpu                                  SentenceTransformer.py:219
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.77it/s]
sbert time:  0.09578514099121094

@michaelfeil
Owner

In this case, you're starting and stopping the engine inside the timed section. Instead of async with, you can also call engine.astart() and engine.astop() yourself. The engine startup/shutdown should account for most of the measured time.

@netw0rkf10w
Author

@michaelfeil Thanks, but could you please tell me how to do it correctly? I couldn't find it in the docs, sorry.

@michaelfeil
Owner

michaelfeil commented May 29, 2024

Updated the docs and the readme, @netw0rkf10w! Note that it should not be significantly faster for a single embedding of one short sentence. Expect significant speedups for large batches / long sequences.

import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs

sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="michaelfeil/bge-small-en-v1.5", engine="optimum")
)

async def main():
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
    # or handle the async start / stop yourself:
    await engine.astart()
    t_start = time.time()
    embeddings, usage = await engine.embed(sentences=sentences)
    print(time.time() - t_start)
    await engine.astop()

asyncio.run(main())
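To make the batch-size point concrete, here is a minimal sketch (editorial, not from the thread) that times a few hundred synthetic sentences with the engine started outside the timed region, which is the regime where the ONNX engine should show its speedup:

import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs

# Synthetic batch: batching pays off at this scale, not on one short string.
batch = [f"This is test sentence number {i}." for i in range(512)]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="michaelfeil/bge-small-en-v1.5", engine="optimum")
)

async def benchmark():
    await engine.astart()  # startup/warmup excluded from the timing below
    start = time.monotonic()
    embeddings, usage = await engine.embed(sentences=batch)
    print(f"embedded {len(embeddings)} sentences in {time.monotonic() - start:.3f}s")
    await engine.astop()

asyncio.run(benchmark())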
