
Add Ollama as an embeddings provider #4456

Merged
merged 3 commits into meilisearch:main on Mar 13, 2024

Conversation

jakobklemm
Contributor

Pull Request

Related issue

Related Discord Thread

What does this PR do?

  • Adds Ollama as an embeddings provider alongside HuggingFace and OpenAI, under the name ollama
  • Adds the environment variable MEILI_OLLAMA_URL to set the embeddings URL of an Ollama instance, defaulting to http://localhost:11434/api/embeddings when the variable is not set
  • Changes some of the structs and functions in openai.rs to be public so that they can be shared
  • Adds more error variants for Ollama-specific errors
  • Uses nomic-embed-text as the default model; any string value is allowed, but the embedder won't automatically check whether the model actually exists or is an embedding model

Tested against Ollama version v0.1.27 and the nomic-embed-text model.
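
For illustration, here is a minimal sketch, not the PR's actual code, of the request shape such an embedder sends to Ollama's /api/embeddings endpoint; it assumes the ureq crate (with its json feature) and serde_json are available:

use std::env;

fn embed_with_ollama(text: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    // MEILI_OLLAMA_URL overrides the default endpoint.
    let url = env::var("MEILI_OLLAMA_URL")
        .unwrap_or_else(|_| "http://localhost:11434/api/embeddings".to_string());

    // nomic-embed-text is the default model; any string is accepted, but
    // Ollama answers with a 404 if the model hasn't been pulled.
    let response: serde_json::Value = ureq::post(&url)
        .send_json(serde_json::json!({
            "model": "nomic-embed-text",
            "prompt": text,
        }))?
        .into_json()?;

    // Ollama responds with {"embedding": [ ... ]}.
    Ok(serde_json::from_value(response["embedding"].clone())?)
}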

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Initial prototype of Ollama embeddings actually working, error handling / retries still missing.

Allow model to be any String and require dimensions parameter

Fixed rustfmt formatting issues

There were some formatting issues in the initial PR; this should now make the changes comply with the Rust style guidelines.

Because my commit messages accidentally didn't follow the commit style guide, I squashed them into one to comply.
@dureuill
Contributor

dureuill commented Mar 6, 2024

Hello,

Thank you for the PR! I'll spend some time reviewing it soon.

Like you said, there is some duplication with the OpenAI code, so I'm thinking we might want to factor out the common implementation at some point.

I was wondering: how would the code behave if the input text contains too many tokens?

The current behavior for the OpenAI embedder is the following:

  1. Try to embed the text without counting the tokens. If it works, we successfully avoided tokenization 🎉
  2. If it returns a bad request, assume it is due to the number of tokens. Use tiktoken_rs to tokenize the text.
  3. Send the tokens directly in an OpenAI query, as this is supported.

Meanwhile, the Hugging Face embedder in Meilisearch is implemented such that any tokens after the maximum value are removed (since we're doing the work locally for this embedder, we're always embedding the documents).

We really don't want to error and cancel an entire indexing operation because a document went above the token limits. Would there be a way to implement windowing like OpenAI, or at least cutting any tokens beyond the maximum like the Hugging Face embedder?
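
For reference, here is a rough sketch of that fallback flow; the helpers below are hypothetical stand-ins for the real OpenAI client and tiktoken_rs calls, not Meilisearch's actual API:

#[derive(Debug)]
enum EmbedError {
    BadRequest,    // assumed to mean "too many tokens"
    Other(String), // anything else
}

// Stand-in for the raw-text embedding request.
fn try_embed(_text: &str) -> Result<Vec<f32>, EmbedError> {
    unimplemented!()
}

// Stand-in for the token-based embedding request that OpenAI supports.
fn embed_tokens(_tokens: &[u32]) -> Result<Vec<f32>, EmbedError> {
    unimplemented!()
}

// Stand-in for tiktoken_rs tokenization.
fn tokenize(_text: &str) -> Vec<u32> {
    unimplemented!()
}

fn embed_with_fallback(text: &str, max_tokens: usize) -> Result<Vec<f32>, EmbedError> {
    match try_embed(text) {
        // 1. Embedding the raw text worked: tokenization was avoided.
        Ok(embedding) => Ok(embedding),
        // 2. A bad request is assumed to mean the input was too long,
        //    so tokenize it ourselves.
        Err(EmbedError::BadRequest) => {
            let mut tokens = tokenize(text);
            // 3. Send (a truncated window of) the tokens directly.
            tokens.truncate(max_tokens);
            embed_tokens(&tokens)
        }
        Err(other) => Err(other),
    }
}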

@dureuill
Copy link
Contributor

dureuill commented Mar 6, 2024

I started a Notion page listing some of the planned improvements for v1.8. I added the ollama embedder as a proposal: find it here (it is pending review from product).

@jakobklemm
Contributor Author

Regarding the code structure:
The code could certainly use some restructuring, and some of the functions could be made generic over the backend. I wanted to change as little as possible about the surrounding code for the Ollama support, in case somebody was working on this at the same time; but if a more generic REST provider is coming, I assume a lot of that restructuring will happen then anyway. Do you want me to make any changes to try and combine the Ollama and OpenAI embedders, or should I wait until the structure of the REST backend exists?

Too many tokens issue:
I must admit I completely forgot about that edge case when implementing this initially, but I just tested it against my local Ollama instance with an input of ~500k words of lorem ipsum and I got back an embedding. I don't know what Ollama does (whether it trims at the beginning or the end, or something else), but the request doesn't seem to fail. Sadly, I can't find anything about this in the Ollama documentation, but in my mind, as long as it provides consistent outputs, it should still work for vector search. Ollama doesn't seem to have an endpoint that accepts already-tokenized requests, and I'm not entirely sure how that would work anyway. Correct me if I'm wrong, but wouldn't the text have to be tokenized using the correct tokenizer and dictionary to be useful to the model? And those can't be known, because Ollama can run arbitrary models.

@dureuill
Contributor

dureuill commented Mar 7, 2024

Hello,

Thank you for that great answer, and also again for contributing to Meilisearch; this is really appreciated ❤️.

Regarding the max tokens issue

No problem that you didn't think of addressing it, that's what reviews are for :-). I'd like more details about your tests: is the behavior consistent when testing with multiple distinct models?
To be clear, if the behavior is to cut the input, then that's an acceptable outcome and it will work for us.

Maybe we should open an issue on ollama or at least look a bit at the code to ascertain that the behavior is consistent?

Correct me if I'm wrong, but wouldn't the text have to be tokenized using the correct tokenizer and dictionary to be useful to the model?

Yes exactly, this is what we're doing for models downloaded from the HuggingFace Hub. Here, it means the user would need a way to provide the correct tokenizer to us, which would be very inconvenient.

An alternative would be to have an ollama endpoint that would just tokenize, but I don't think this exists.

Regarding code structure

You don't need to change the code structure of your PR. If we can ascertain the behavior in the max-token case, then I think we'll accept your PR after I do a more detailed review, and land it without structural changes. This will be useful to people running Meilisearch on main, or perhaps even in a patch release (depending on product), and I'll take care of the refactoring by Meilisearch v1.8.

@jakobklemm
Contributor Author

After some experimentation with my local Ollama instance and asking about this on the Ollama Discord server, it looks like Ollama will never completely fail to embed an input, no matter how long it is, but I'm not entirely sure how it does that. It doesn't look like it's just truncating the inputs to fit the length, because adding a single word at the end of a very long prompt would result in a slightly different output.

But as far as I can tell, the output is still useful and correct. I first got the embedding of a really long document and then did it again with the document appended to itself. The two embeddings were not identical, but the cosine similarity between them was about 0.997, which, as far as I understand, means the two vectors point in nearly the same direction, i.e. they are almost identical except for a scaling factor.
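
For anyone who wants to reproduce that comparison, a small self-contained check (the vectors below are illustrative placeholders, not real nomic-embed-text output):

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    let single = vec![0.12_f32, -0.48, 0.91];  // embedding of the document
    let doubled = vec![0.11_f32, -0.47, 0.92]; // embedding of the document appended to itself
    // A value close to 1.0 means the vectors point in nearly the same
    // direction, i.e. they are identical up to a scaling factor.
    println!("cosine similarity: {:.3}", cosine_similarity(&single, &doubled));
}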

One additional edge case I only just discovered is what happens when the text to be embedded is completely blank. In this case, Ollama immediately returns JSON containing just "null" instead of the embeddings array. Is this a situation that should be handled, or can it be ignored because it would mean the document or the template is not configured correctly anyway?

@dureuill
Contributor

dureuill commented Mar 8, 2024

Thanks for your findings. It is possible that ollama splits the text, produces multiple embeddings, and then averages them; doing that would yield the kind of results you're observing.
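
Purely to illustrate that hypothesis (this is not Ollama's actual code), splitting the text into chunks, embedding each one, and averaging could look like this:

// Mean-pool one embedding per chunk into a single vector.
fn mean_pool(chunk_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let Some(first) = chunk_embeddings.first() else {
        return Vec::new();
    };
    let mut pooled = vec![0.0_f32; first.len()];
    for embedding in chunk_embeddings {
        for (acc, value) in pooled.iter_mut().zip(embedding) {
            *acc += value;
        }
    }
    for value in &mut pooled {
        *value /= chunk_embeddings.len() as f32;
    }
    pooled
}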

About empty input, we could have you check that, but since there are multiple embedders that might have variously erratic behaviours when given an empty input, I think it is probably best if I check for the condition prior to calling embed.

In other words, you can assume that the input is never the empty string.

@dureuill dureuill added this to the v1.8.0 milestone Mar 11, 2024
@dureuill
Contributor

Hello, I have been testing your PR with an ollama server running, and I have a few questions:

  1. Could we handle the 404 that ollama is returning when asking for a model that hasn't been pulled?
    Currently, Meili's behavior is to fail the embedding request with:
"error": {
  "message": "internal: Error while generating embeddings: coding error: received unhandled HTTP status code 404 from Ollama.",
  "code": "internal",
  "type": "internal",
  "link": "https://docs.meilisearch.com/errors#internal"
},

Could we handle the 404 return code so that this isn't a coding error but a user error, and display a message along the lines of "model xxx wasn't pulled on the ollama server"? (See the sketch after this list.)
2. The dimensions check happened, but surprisingly late in the process. I managed to give a wrong dimension for "nomic-embed-text", and it still accepted documents for indexing. Then, at search time, it would reject the query, citing a mismatch between the dimension I provided and the actual dimensions. Could we perform a test embedding like we do in hf.rs to check that the provided dimensions are correct? This way, we could even infer the dimensions if not provided, instead of making them mandatory.
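
As a sketch of point 1 above (the types here are hypothetical, not milli's real error enum), mapping the status code to a user-facing error could look roughly like this:

#[derive(Debug)]
enum OllamaError {
    // The requested model hasn't been pulled on the Ollama server: a user error.
    ModelNotFound { model: String },
    // Any other unexpected status stays an internal/coding error.
    UnexpectedStatusCode(u16),
}

fn classify_status(status: u16, model: &str) -> Result<(), OllamaError> {
    match status {
        200 => Ok(()),
        404 => Err(OllamaError::ModelNotFound { model: model.to_string() }),
        other => Err(OllamaError::UnexpectedStatusCode(other)),
    }
}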

Instead of the user manually specifying the model dimensions, they are now determined automatically
Just like with hf.rs, the word "test" gets embedded to determine the dimensions of the output
Add a dedicated error type for when the model doesn't exist (without automatically pulling it) and set the fault of that error to be the user
@jakobklemm
Contributor Author

The dimension inference now works the same way it does in hf.rs, by simply getting the embedding for "test" when the embedder is created. I completely removed the option to manually set the dimensions; I'm not sure there is a use case where specifying them manually would still be required.
Regarding the errors: if the model doesn't exist, that will now be FaultSource::User, and it gets displayed properly in the failed tasks list. I just added another error type to the giant error enum; that could definitely be done in a nicer way, but it's best if that gets refactored together with the other embedding backends.
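
A minimal sketch of that inference, assuming a generic embed callback (the function name here is made up for illustration, not the PR's actual code):

fn infer_dimensions(
    embed: impl Fn(&str) -> Result<Vec<f32>, Box<dyn std::error::Error>>,
) -> Result<usize, Box<dyn std::error::Error>> {
    // Embedding a probe string once, when the embedder is created, both
    // reveals the output dimensions and surfaces configuration errors
    // (such as a model that hasn't been pulled) before documents are indexed.
    let probe = embed("test")?;
    Ok(probe.len())
}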

Review comment on milli/src/vector/error.rs (outdated, resolved)
Fix Meilisearch capitalization
Contributor

@dureuill dureuill left a comment


This all looks good to me.

Thank you for your contribution, and your patience in listening to my feedback 👍

bors merge

@dureuill dureuill mentioned this pull request Mar 13, 2024
Contributor

meili-bors bot commented Mar 13, 2024

Build succeeded:

@meili-bors meili-bors bot merged commit 5ed7b6a into meilisearch:main Mar 13, 2024
10 checks passed
@curquiza curquiza linked an issue Mar 13, 2024 that may be closed by this pull request: Improve hybrid search for v1.8
@meili-bot meili-bot added the v1.8.0 PRs/issues solved in v1.8.0 released on 2024-05-06 label May 6, 2024