Running into this error while creating embeddings out of pdf file #5384

catchlui · 2023-05-29T01:10:32Z

System Info

File "d:\langchain\pdfqa-app.py", line 46, in _upload_data
Pinecone.from_texts(self.doc_chunk,embeddings,batch_size=16,index_name=self.index_name)
File "E:\anaconda\envs\langchain\lib\site-packages\langchain\vectorstores\pinecone.py", line 232, in from_texts
embeds = embedding.embed_documents(lines_batch)
File "E:\anaconda\envs\langchain\lib\site-packages\langchain\embeddings\openai.py", line 297, in embed_documents
return self._get_len_safe_embeddings(texts, engine=self.deployment)
File "E:\anaconda\envs\langchain\lib\site-packages\langchain\embeddings\openai.py", line 221, in _get_len_safe_embeddings
token = encoding.encode(
File "E:\anaconda\envs\langchain\lib\site-packages\tiktoken\core.py", line 117, in encode
if match := _special_token_regex(disallowed_special).search(text):
TypeError: expected string or buffer

Who can help?

No response

Information

The official example notebooks/scripts
My own modified scripts

Related Components

Reproduction

def _load_docs(self):
    # Raw string so backslashes in the Windows path are not read as escapes
    loader = PyPDFLoader(r"D:\langchain\data_source\1706.03762.pdf")
    self.doc = loader.load()

def _split_docs (self):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap,separators= ["\n\n", ""])
    self.doc_chunk = text_splitter.split_documents(self.doc)

def _upload_data(self):
    embeddings = OpenAIEmbeddings()
    list_of_index = pinecone.list_indexes()
    if self.index_name in list_of_index:
        Pinecone.from_texts(self.doc_chunk,embeddings,batch_size=16,index_name=self.index_name)
    else:
        pinecone.create_index(self.index_name, dimension=1024) # for open AI
        Pinecone.from_texts(self.doc_chunk,embeddings,batch_size=16,index_name=self.index_name)

def dataloader(self):
    self._load_docs()
    self._split_docs()
    self._upload_data()

Expected behavior

Please help with the solution

YLFxGen · 2023-05-29T03:20:55Z

Use Pinecone.from_documents(self.doc_chunk, embeddings, batch_size=16, index_name=self.index_name) instead.
text_splitter.split_documents returns a list of Document objects rather than plain strings, and Pinecone.from_texts expects strings.
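
To illustrate why the traceback ends in a TypeError, here is a minimal sketch that runs without LangChain, Pinecone, or tiktoken installed. The Document class and encode_check function below are hypothetical stand-ins, not the libraries' own code: Document mimics an object that carries page_content and metadata, and encode_check mimics a tokenizer that runs a regex search over its input before encoding. The sketch assumes that from_documents amounts to unwrapping the strings and metadata and handing them to the texts-based path.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Stand-in for the special-token regex a tokenizer checks before encoding.
_SPECIAL = re.compile(r"<\|endoftext\|>")


def encode_check(text):
    """Mimics the pre-encoding regex search; it only works on str input."""
    return _SPECIAL.search(text) is None


chunks = [
    Document("Attention is all you need.", {"page": 0}),
    Document("The Transformer uses self-attention.", {"page": 1}),
]

# Passing Document objects where strings are expected raises TypeError,
# which is the same failure mode as in the traceback above.
try:
    [encode_check(c) for c in chunks]
    failed = False
except TypeError:
    failed = True

# Unwrapping the objects gives the plain strings a texts-based API needs,
# with the metadata kept in a parallel list.
texts = [c.page_content for c in chunks]
metadatas = [c.metadata for c in chunks]
ok = all(encode_check(t) for t in texts)

print(failed, ok)
```

If from_texts has to be kept for some reason, passing texts and metadatas built this way should be equivalent, assuming the installed version accepts a metadatas argument.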

dosubot · 2023-11-11T16:01:12Z

Hi, @catchlui! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you raised an issue regarding a TypeError thrown by the tiktoken library while creating embeddings from a PDF file. User YLFxGen suggested using Pinecone.from_documents instead of Pinecone.from_texts as a potential solution, and user Rinisha160391 has approved this suggestion.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot bot added the stale label Nov 11, 2023

dosubot bot closed this as not planned Nov 18, 2023

dosubot bot removed the stale label Nov 18, 2023

dosubot bot mentioned this issue Feb 8, 2024

how to index the data into FAISS without using RecursiveCharacterTextSplitter? #17262

Closed
