HuggingFace embeddings fails #4271

Closed
scottaglia opened this issue Dec 19, 2023 · 4 comments · Fixed by #4272
Labels
bug Something isn't working as expected v1.6.0 PRs/issues solved in v1.6.0 released on 2024-01-15

@scottaglia

Describe the bug
Uploading documents to an index where a HuggingFace embedder is enabled fails. When the documents are uploaded, the server logs:

[2023-12-19T18:27:28Z INFO  hf_hub] Token file not found "/root/.cache/huggingface/token"

and the task fails with the following error message:

      "error": {
        "message": "internal: Error while generating embeddings: error: fetching file from HG_HUB failed: request error: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/resolve/617ca489d9e86b49b8167676d8220688b99db36e/config.json: status code 404.",
        "code": "internal",
        "type": "internal",
        "link": "https://docs.meilisearch.com/errors#internal"
      },

To Reproduce

  1. docker run -it --rm -p 7700:7700 -v $(pwd):/meili_data getmeili/meilisearch:v1.6.0-rc.1
  2. Enable vectorStore
curl \
-X PATCH 'http://localhost:7700/experimental-features/' \
-H 'Content-Type: application/json'  \
--data-binary '{"vectorStore": true}'
  3. Create an index with the embedder:
curl \
-X PATCH 'http://localhost:7700/indexes/products/settings' \
-H 'Content-Type: application/json' --data-binary \
'{ "embedders": { "default": { "source": { "huggingFace": { "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" } }, "documentTemplate": { "template": "A product titled '{{doc.title}}'"} } } }'

Meilisearch version:
v1.6.0-rc.1

@dureuill
Contributor

Hello 👋

Thank you for the report ❤️

I can confirm this is indeed a bug affecting v1.6.0-rc.1 😞

For now, you can work around it by passing an explicit revision field in the huggingFace object:

 curl \
-X PATCH 'http://localhost:7700/indexes/products/settings' \
-H 'Content-Type: application/json' --data-binary \
'{ 
  "embedders": { 
    "default": {
      "source": {
        "huggingFace": {
          "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", 
          "revision": "a9c555277f9bcf24f28fa5e092e665fc6f7c49cd" 
        } 
      }, 
      "documentTemplate": { "template": "A product titled '{{doc.title}}'"}
    } 
  } 
}'

I'm working on a fix so that this is no longer necessary 👍

@dureuill dureuill added the bug Something isn't working as expected label Dec 20, 2023
@dureuill dureuill added this to the v1.6.0 milestone Dec 20, 2023
meili-bors bot added a commit that referenced this issue Dec 20, 2023
4272: Don't pass default revision when the model is explicitly set in config r=Kerollmops a=dureuill

# Pull Request

## Related issue
Fixes #4271 

## What does this PR do?

- When the `model` is explicitly set in the `embedders` setting, we reset the `revision` to `None`, such that if the user doesn't specify a revision, the head of the model repository is chosen. 
- Not changed: If the user specifies a revision, it applies, like previously. 
- Not changed: If the user doesn't specify a model, the default model with the default revision applies, like previously.
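
The three cases above can be sketched as a small shell function. This is only an illustration of the selection rule, not the actual implementation: `DEFAULT_MODEL` is a hypothetical placeholder, `DEFAULT_REVISION` is the revision that appears in the failing URL of the report (assumed here to be the one pinned for the default model), and addressing the repository head as `main` is also an assumption. The URL shape is copied from the error message in the issue.

```sh
#!/bin/sh
# Hypothetical placeholder: the default model's name is not given in this thread.
DEFAULT_MODEL="example-org/default-model"
# Revision taken from the failing URL in the report; assumed to be the
# revision pinned for the default model.
DEFAULT_REVISION="617ca489d9e86b49b8167676d8220688b99db36e"
# Assumption: the head of a model repository is addressed as "main".
HEAD_REVISION="main"

# resolve_source MODEL REVISION -> prints "<model> <revision>"
# Empty arguments mean "not specified by the user".
resolve_source() {
  if [ -z "$1" ]; then
    # No model given: default model, default revision unless overridden.
    printf '%s %s\n' "$DEFAULT_MODEL" "${2:-$DEFAULT_REVISION}"
  else
    # Model given: the user's revision if any, otherwise the repository head.
    printf '%s %s\n' "$1" "${2:-$HEAD_REVISION}"
  fi
}

# config_url MODEL REVISION -> URL shape seen in the error message of the report
config_url() {
  printf 'https://huggingface.co/%s/resolve/%s/config.json\n' "$1" "$2"
}

# Print the URL that would be fetched for the model from the report.
# Word splitting of "<model> <revision>" into two arguments is intended here.
config_url $(resolve_source "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" "")
```

Under these assumptions, a model given without a revision resolves to the repository head instead of combining the user's model with the default model's revision, which is the combination that produced the 404 in the report.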

## Manual testing on a fresh DB

1. Enable experimental feature:
```sh
curl \
  -X PATCH 'http://localhost:7700/experimental-features/' \
  -H 'Content-Type: application/json' -H 'Authorization: Bearer foo' \
  --data-binary '{ "vectorStore": true }'
```
2. Send settings with a specified model but no specified revision:
```sh
curl \
-X PATCH 'http://localhost:7700/indexes/products/settings' \
-H 'Content-Type: application/json' --data-binary \
'{ "embedders": { "default": { "source": { "huggingFace": { "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" } }, "documentTemplate": { "template": "A product titled '{{doc.title}}'"} } } }'
```
3. Check that the task was successful:
```sh
curl 'http://localhost:7700/tasks/0'

{"uid":0,"indexUid":"products","status":"succeeded","type":"settingsUpdate","canceledBy":null,"details":{"embedders":{"default":{"source":{"huggingFace":{"model":"sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"}},"documentTemplate":{"template":"A product titled {{doc.title}}"}}}},"error":null,"duration":"PT0.001892S","enqueuedAt":"2023-12-20T09:17:01.73789Z","startedAt":"2023-12-20T09:17:01.73854Z","finishedAt":"2023-12-20T09:17:01.740432Z"}
```
4. Send documents to index:
```sh
curl 'http://localhost:7700/indexes/products/documents' -H 'Content-Type: application/json' --data-binary '{"id": 0, "title": "Best product"}'
```
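
When scripting these steps, the check in step 3 can be reduced to extracting the `status` field from the task object. A minimal sketch, assuming the response is compact single-line JSON as in the output shown in step 3; `task_status` is a hypothetical helper name, and a real JSON parser is the more robust choice where one is available:

```sh
#!/bin/sh
# task_status: read a compact, single-line task object on stdin
# and print the value of its "status" field.
task_status() {
  sed -n 's/.*"status":"\([^"]*\)".*/\1/p'
}

# Example with a shortened copy of the response shown in step 3.
printf '%s\n' '{"uid":0,"indexUid":"products","status":"succeeded","type":"settingsUpdate"}' | task_status
```

Against a running instance this would be used as `curl -s 'http://localhost:7700/tasks/0' | task_status`.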

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
@curquiza
Member

Fixed by #4272. The fix will be included in v1.6.0-rc.2, which I will create in a few hours.

@meili-bot meili-bot added the v1.6.0 PRs/issues solved in v1.6.0 released on 2024-01-15 label Jan 16, 2024
@anaibol

anaibol commented Mar 8, 2024

I'm still having this issue on v1.6.2: Token file not found "/root/.cache/huggingface/token"
Are there some token declaration requirements?

@dureuill
Contributor

dureuill commented Mar 11, 2024

Hello @anaibol 👋

Thank you for your report 😊 can you be more specific as to what you are prevented from doing?

The message you're reporting is a warning, and is expected if you didn't manually set a Hugging Face token at the described location.

A Hugging Face token is only required to download specific models from Hugging Face. In particular, the model Meilisearch uses by default doesn't require a token, so embedding should work even in the presence of this warning.

So I have a few questions for clarification:

  1. Can you share your embedder settings?
  2. Are you seeing an issue with embedding, such as a failed indexing task in the task queue? If so, can you show the relevant part of the task queue?
  3. Are you requesting a model that requires a token?
  4. If so, did you try putting the token at the described location?

Thanks for any clarification ☀️
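
For the last question, here is a minimal sketch of putting a token at the described location. It assumes the path from the warning message and that the file should contain only the token string; `save_hf_token` and the token value are hypothetical, and when Meilisearch runs in Docker the path is the one inside the container.

```sh
#!/bin/sh
# save_hf_token TOKEN [FILE]
# Writes TOKEN to FILE (default: the path from the warning message),
# creating the parent directory if needed.
save_hf_token() {
  token_file="${2:-/root/.cache/huggingface/token}"
  mkdir -p "$(dirname "$token_file")" || return 1
  # Subshell so the restrictive umask does not leak into the caller.
  ( umask 077 && printf '%s' "$1" > "$token_file" )
}

# Hypothetical usage; replace the value with your own token:
# save_hf_token "hf_example_token"
```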
