Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}) #19061

@alvynabranches

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import os
from langchain.vectorstores.qdrant import Qdrant
from langchain.document_loaders.pdf import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings


def load_pdf(
    file: str,
    collection_name: str,
    chunk_size: int = 512,
    chunk_overlap: int = 32,
):
    loader = PyPDFLoader(file)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    texts = text_splitter.split_documents(documents)
    embeddings = HuggingFaceBgeEmbeddings(
        model_name=os.environ.get("MODEL_NAME", "microsoft/phi-2"),
        model_kwargs=dict(device="cpu"),
        encode_kwargs=dict(normalize_embeddings=False),
    )

    url = os.environ.get("VDB_URL", "http://localhost:6333")

    qdrant = Qdrant.from_documents(
        texts,
        embeddings,
        url=url,
        collection_name=collection_name,
        prefer_grpc=False,
    )

    return qdrant

load_pdf("/Users/alvynabranches/Downloads/bh1.pdf", "buget2425")

Error Message and Stack Trace (if applicable)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/Users/alvynabranches/rag/rag/ingest.py", line 39, in <module>
    load_pdf("/Users/alvynabranches/Downloads/bh1.pdf", "buget2425")
  File "/Users/alvynabranches/rag/rag/ingest.py", line 29, in load_pdf
    qdrant = Qdrant.from_documents(
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/langchain_core/vectorstores.py", line 528, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/langchain_community/vectorstores/qdrant.py", line 1334, in from_texts
    qdrant = cls.construct_instance(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/langchain_community/vectorstores/qdrant.py", line 1591, in construct_instance
    partial_embeddings = embedding.embed_documents(texts[:1])
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/langchain_community/embeddings/huggingface.py", line 257, in embed_documents
    embeddings = self.client.encode(texts, **self.encode_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/sentence_transformers/SentenceTransformer.py", line 345, in encode
    features = self.tokenize(sentences_batch)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/sentence_transformers/SentenceTransformer.py", line 553, in tokenize
    return self._first_module().tokenize(texts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/sentence_transformers/models/Transformer.py", line 146, in tokenize
    self.tokenizer(
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2829, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2915, in _call_one
    return self.batch_encode_plus(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 3097, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2734, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
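To illustrate the condition behind this error: the last frames of the traceback show the check failing while the padding strategy is being resolved, before any text is encoded, which is why it triggers even for the single text in `texts[:1]`. The snippet below is a simplified sketch of that idea in plain Python, not the actual `transformers` implementation; `pad_batch` and its parameters are hypothetical names.

```python
def pad_batch(token_id_lists, pad_id):
    """Right-pad lists of token ids to a common length (hypothetical helper)."""
    # Padding was requested, so a pad id is required up front,
    # even if every sequence already has the same length.
    if pad_id is None:
        raise ValueError("Asking to pad but no padding token is available")
    longest = max((len(ids) for ids in token_id_lists), default=0)
    return [ids + [pad_id] * (longest - len(ids)) for ids in token_id_lists]


print(pad_batch([[11, 12, 13], [21]], pad_id=0))  # [[11, 12, 13], [21, 0, 0]]

try:
    pad_batch([[11, 12, 13]], pad_id=None)  # one sequence, still rejected
except ValueError as exc:
    print(exc)  # Asking to pad but no padding token is available
```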

Description

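I am using LangChain to store a PDF in Qdrant, with microsoft/phi-2 as the embedding model (loaded through HuggingFaceBgeEmbeddings). Qdrant.from_documents fails with the ValueError above as soon as it tries to embed the first chunk. I tried changing the kwargs, but the error persists.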
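For reference, the workaround that the error message itself suggests (reusing `eos_token` as `pad_token`) could be sketched as below. This is a minimal sketch, not a confirmed fix: `ensure_pad_token` is a hypothetical helper, the `SimpleNamespace` object with its placeholder token string is a stand-in so the snippet runs without downloading a model, and the commented-out call assumes that `embeddings.client` (the `SentenceTransformer` seen in the traceback) exposes a `tokenizer` attribute.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Reuse eos_token as pad_token when the tokenizer has no pad token.

    Works on any object with `pad_token` and `eos_token` attributes
    (hypothetical helper, not part of LangChain or transformers).
    """
    if getattr(tokenizer, "pad_token", None) is None:
        if getattr(tokenizer, "eos_token", None) is None:
            raise ValueError("tokenizer has neither a pad_token nor an eos_token")
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in tokenizer; the eos_token value here is only a placeholder.
fake_tokenizer = SimpleNamespace(pad_token=None, eos_token="<eos>")
ensure_pad_token(fake_tokenizer)
print(fake_tokenizer.pad_token)  # <eos>

# Assumed usage with the code from this issue, placed after the
# embeddings object is created and before Qdrant.from_documents:
# ensure_pad_token(embeddings.client.tokenizer)
```

This only addresses the padding error. Whether phi-2 produces useful embeddings through this wrapper is a separate question.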

System Info

pip3 freeze | grep langchain
langchain==0.1.12
langchain-community==0.0.28
langchain-core==0.1.31
langchain-text-splitters==0.0.1

Platform

Apple M3 Max
macOS Sonoma
Version 14.1

python3 -V
Python 3.12.2

Labels

bug (Related to a bug, vulnerability, unexpected error with an existing feature)
