Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
import os
from langchain.vectorstores.qdrant import Qdrant
from langchain.document_loaders.pdf import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings
def load_pdf(
    file: str,
    collection_name: str,
    chunk_size: int = 512,
    chunk_overlap: int = 32,
):
    loader = PyPDFLoader(file)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    texts = text_splitter.split_documents(documents)
    embeddings = HuggingFaceBgeEmbeddings(
        model_name=os.environ.get("MODEL_NAME", "microsoft/phi-2"),
        model_kwargs=dict(device="cpu"),
        encode_kwargs=dict(normalize_embeddings=False),
    )
    url = os.environ.get("VDB_URL", "http://localhost:6333")
    qdrant = Qdrant.from_documents(
        texts,
        embeddings,
        url=url,
        collection_name=collection_name,
        prefer_grpc=False,
    )
    return qdrant


load_pdf("/Users/alvynabranches/Downloads/bh1.pdf", "buget2425")

Error Message and Stack Trace (if applicable)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/Users/alvynabranches/rag/rag/ingest.py", line 39, in <module>
    load_pdf("/Users/alvynabranches/Downloads/bh1.pdf", "buget2425")
  File "/Users/alvynabranches/rag/rag/ingest.py", line 29, in load_pdf
    qdrant = Qdrant.from_documents(
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/langchain_core/vectorstores.py", line 528, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/langchain_community/vectorstores/qdrant.py", line 1334, in from_texts
    qdrant = cls.construct_instance(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/langchain_community/vectorstores/qdrant.py", line 1591, in construct_instance
    partial_embeddings = embedding.embed_documents(texts[:1])
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/langchain_community/embeddings/huggingface.py", line 257, in embed_documents
    embeddings = self.client.encode(texts, **self.encode_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/sentence_transformers/SentenceTransformer.py", line 345, in encode
    features = self.tokenize(sentences_batch)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/sentence_transformers/SentenceTransformer.py", line 553, in tokenize
    return self._first_module().tokenize(texts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/sentence_transformers/models/Transformer.py", line 146, in tokenize
    self.tokenizer(
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2829, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2915, in _call_one
    return self.batch_encode_plus(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 3097, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2734, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Description
I am using the LangChain library to store a PDF in Qdrant. The model I am using is microsoft/phi-2. I tried to resolve the error by changing the kwargs, but it still does not work.
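For reference, below is a minimal sketch of the workaround the ValueError itself points at: assigning the eos token as the pad token on the underlying tokenizer before embedding. It assumes HuggingFaceBgeEmbeddings exposes the sentence-transformers model as .client (which matches the traceback above) and that SentenceTransformer exposes the Hugging Face tokenizer as .tokenizer; this is only a sketch, not a confirmed fix.

import os
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings

# Build the embeddings object exactly as in the reproduction above.
embeddings = HuggingFaceBgeEmbeddings(
    model_name=os.environ.get("MODEL_NAME", "microsoft/phi-2"),
    model_kwargs=dict(device="cpu"),
    encode_kwargs=dict(normalize_embeddings=False),
)

# Workaround suggested by the ValueError: give the tokenizer a pad token
# if it has none. Assumes `.client` is the SentenceTransformer instance
# and `.tokenizer` is its Hugging Face tokenizer (unverified assumption).
tokenizer = embeddings.client.tokenizer
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Embedding a sample text should then no longer raise the padding error;
# the quality of the resulting vector is not checked here.
print(len(embeddings.embed_documents(["sample text"])[0]))

Note that phi-2 is a causal language model rather than a sentence-embedding model, so even if the padding error goes away, whether the resulting embeddings are useful is a separate question.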
System Info
pip3 freeze | grep langchain
langchain==0.1.12
langchain-community==0.0.28
langchain-core==0.1.31
langchain-text-splitters==0.0.1
Platform
Apple M3 Max
macOS Sonoma
Version 14.1
python3 -V
Python 3.12.2