Running into this error while creating embeddings out of pdf file #5384

Closed
2 of 14 tasks
catchlui opened this issue May 29, 2023 · 2 comments

Comments

@catchlui

System Info

File "d:\langchain\pdfqa-app.py", line 46, in _upload_data
Pinecone.from_texts(self.doc_chunk,embeddings,batch_size=16,index_name=self.index_name)
File "E:\anaconda\envs\langchain\lib\site-packages\langchain\vectorstores\pinecone.py", line 232, in from_texts
embeds = embedding.embed_documents(lines_batch)
File "E:\anaconda\envs\langchain\lib\site-packages\langchain\embeddings\openai.py", line 297, in embed_documents
return self._get_len_safe_embeddings(texts, engine=self.deployment)
File "E:\anaconda\envs\langchain\lib\site-packages\langchain\embeddings\openai.py", line 221, in _get_len_safe_embeddings
token = encoding.encode(
File "E:\anaconda\envs\langchain\lib\site-packages\tiktoken\core.py", line 117, in encode
if match := _special_token_regex(disallowed_special).search(text):
TypeError: expected string or buffer

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

def _load_docs(self):
    loader = PyPDFLoader(r"D:\langchain\data_source\1706.03762.pdf")
    self.doc = loader.load()

def _split_docs (self):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap,separators= ["\n\n", ""])
    self.doc_chunk = text_splitter.split_documents(self.doc)

def _upload_data(self):
    embeddings = OpenAIEmbeddings()
    list_of_index = pinecone.list_indexes()
    if self.index_name in list_of_index:
        Pinecone.from_texts(self.doc_chunk,embeddings,batch_size=16,index_name=self.index_name)
    else:
        pinecone.create_index(self.index_name, dimension=1024) # for open AI
        Pinecone.from_texts(self.doc_chunk,embeddings,batch_size=16,index_name=self.index_name)

def dataloader(self):
    self._load_docs()
    self._split_docs()
    self._upload_data()

Expected behavior

Please help with the solution

@YLFxGen

YLFxGen commented May 29, 2023

Use Pinecone.from_documents(self.doc_chunk, embeddings, batch_size=16, index_name=self.index_name) instead.
text_splitter.split_documents splits the file into a list of Document objects, not a list of strings, so from_texts ends up passing non-string objects to the tokenizer.
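
To illustrate the mismatch, here is a minimal sketch that runs without langchain or Pinecone. The Document class below is a hypothetical stand-in for langchain's Document (assumed to carry page_content and metadata), and to_texts_and_metadatas is an illustrative helper name, not a library function. A plain regex search stands in for the tokenizer's special-token check seen in the traceback.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for langchain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_texts_and_metadatas(docs):
    """Split Document-like objects into parallel lists of strings and metadata dicts."""
    texts = [d.page_content for d in docs]
    metadatas = [d.metadata for d in docs]
    return texts, metadatas


# What a splitter's split_documents step hands back: objects, not strings.
chunks = [
    Document("Attention is all you need.", {"page": 0}),
    Document("The Transformer uses self-attention.", {"page": 1}),
]

# A regex search on a non-string raises TypeError, as in the traceback.
pattern = re.compile(r"<\|endoftext\|>")
try:
    pattern.search(chunks[0])
    failed = False
except TypeError:
    failed = True

# Extracting the raw strings first makes the same search succeed.
texts, metadatas = to_texts_and_metadatas(chunks)
found = pattern.search(texts[0])
```

If from_texts has to be kept for some reason, passing texts (and optionally metadatas) extracted this way should avoid the TypeError; from_documents is the shorter route since it accepts the Document list directly.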


dosubot bot commented Nov 11, 2023

Hi, @catchlui! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you raised an issue regarding a TypeError raised inside the tiktoken library while creating embeddings from a PDF file. User YLFxGen suggested using Pinecone.from_documents instead of Pinecone.from_texts, since text_splitter.split_documents returns documents rather than strings, and user Rinisha160391 has approved this suggestion.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot bot added the stale label Nov 11, 2023
dosubot bot closed this as not planned Nov 18, 2023
dosubot bot removed the stale label Nov 18, 2023