RuntimeError: Failed to tokenize (LlamaCpp and QAWithSourcesChain) #2645

Closed
darth-veitcher opened this issue Apr 9, 2023 · 15 comments

Comments

@darth-veitcher

Hi there, I'm getting the following error when attempting to run a QAWithSourcesChain using a local GPT4All model. The code works fine with OpenAI but seems to break if I swap in a local LLM model for the response. Embeddings work fine in the VectorStore (using OpenSearch).

def query_datastore(
    query: str,
    print_output: bool = True,
    temperature: float = settings.models.DEFAULT_TEMP,
) -> list[Document]:
    """Uses the `get_relevant_documents` from langchains to query a result from vectordb and returns a matching list of Documents.

    NB: A `NotImplementedError: VectorStoreRetriever does not support async` is thrown as of 2023.04.04 so we need to run this in a synchronous fashion.

    Args:
        query: string representing the question we want to use as a prompt for the QA chain.
        print_output: whether to pretty print the returned answer to stdout. Default is True.
        temperature: float controlling how deterministic the model is. Zero is fully deterministic; 2 gives it artistic licence.

    Returns:
        A list of langchain `Document` objects. These contain primarily a `page_content` string and a `metadata` dictionary of fields.
    """
    retriever = db().as_retriever()  # use our existing persisted document repo in opensearch
    docs: list[Document] = retriever.get_relevant_documents(query)
    llm = LlamaCpp(
        model_path=os.path.join(settings.models.DIRECTORY, settings.models.LLM),
        n_batch=8192,
        temperature=temperature,
        max_tokens=20480,
    )
    chain: QAWithSourcesChain = QAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff")
    answer: list[Document] = chain({"docs": docs, "question": query}, return_only_outputs=True)
    logger.info(answer)
    if print_output:
        pprint(answer)
    return answer

Exception as below.

RuntimeError: Failed to tokenize: text="b' Given the following extracted parts of a long document and a question, create a final answer with 
references ("SOURCES"). \nIf you don\'t know the answer, just say that you don\'t know. Don\'t try to make up an answer.\nALWAYS return a "SOURCES" 
part in your answer.\n\nQUESTION: Which state/country\'s law governs the interpretation of the contract?\n=========\nContent: This Agreement is 
governed by English law and the parties submit to the exclusive jurisdiction of the English courts in  relation to any dispute (contractual or 
non-contractual) concerning this Agreement save that either party may apply to any court for an  injunction or other relief to protect its 
Intellectual Property Rights.\nSource: 28-pl\nContent: No Waiver. Failure or delay in exercising any right or remedy under this Agreement shall not 
constitute a waiver of such (or any other)  right or remedy.\n\n11.7 Severability. The invalidity, illegality or unenforceability of any term (or 
part of a term) of this Agreement shall not affect the continuation  in force of the remainder of the term (if any) and this Agreement.\n\n11.8 No 
Agency. Except as expressly stated otherwise, nothing in this Agreement shall create an agency, partnership or joint venture of any  kind between the
parties.\n\n11.9 No Third-Party Beneficiaries.\nSource: 30-pl\nContent: (b) if Google believes, in good faith, that the Distributor has violated or 
caused Google to violate any Anti-Bribery Laws (as  defined in Clause 8.5) or that such a violation is reasonably likely to occur,\nSource: 
4-pl\n=========\nFINAL ANSWER: This Agreement is governed by English law.\nSOURCES: 28-pl\n\nQUESTION: What did the president say about Michael 
Jackson?\n=========\nContent: Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices
of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as 
Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution.
\n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia\xe2\x80\x99s Vladimir Putin sought to 
shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll
into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President 
Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citizens blocking tanks with 
their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.\nSource: 0-pl\nContent: And we won\xe2\x80\x99t 
stop. \n\nWe have lost so much to COVID-19. Time with one another. And worst of all, so much loss of life. \n\nLet\xe2\x80\x99s use this moment to 
reset. Let\xe2\x80\x99s stop looking at COVID-19 as a partisan dividing line and see it for what it is: A God-awful disease.  \n\nLet\xe2\x80\x99s 
stop seeing each other as enemies, and start seeing each other for who we really are: Fellow Americans.

From what I can tell the model is struggling to interpret the prompt template that's being passed to it?

@alibakh62

I am facing the same issue. I am getting embeddings from LlamaCppEmbeddings and using Chroma for storing the embeddings. I also noticed it's very slow, and after hours of running, I got that error!

@darth-veitcher
Author

darth-veitcher commented Apr 13, 2023

After a bit more experimenting, I think it's linked to the number of tokens being returned and then stuffed into the LLM. It errors out with the prompt plus multiple documents from the VectorStore, but works if I constrain it to a single document. Changing models gives similar issues - the max_tokens and ctx_size limits seem to be at play, but the error messages are a bit opaque. You then start to see messages such as ValueError: Requested tokens exceed context window of 10240.

@darth-veitcher
Author

For others hitting this issue: using the stuff chain_type (as most of the examples push you towards) results in all of the content being sent to the LLM without any batching.

Swapping this for map_reduce, refine or map_rerank as the chain type fixes it. See the qa_with_sources docs for further details.
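
As a minimal sketch of that change against the snippet in the original post (reusing the same llm, docs and query objects built there):

from langchain.chains import QAWithSourcesChain

# Only the chain_type changes; llm, docs and query are the objects from the snippet above.
chain = QAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="map_reduce",  # or "refine" / "map_rerank" instead of "stuff"
)
answer = chain({"docs": docs, "question": query}, return_only_outputs=True)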

@catbears

Same issue, but only with a file of about 400 words. If I use a smaller file with about 200 words, the following runs without problems.

from langchain.embeddings import LlamaCppEmbeddings

llama = LlamaCppEmbeddings(model_path="ggml-model-q4_0.bin")

# file merger.txt has 400 words, file mini.txt 7
with open('pdfs/mini.txt', 'r') as file:
    text = file.read().replace('\n', ' ')

query_result = llama.embed_query(text)

doc_result = llama.embed_documents([text])

@darth-veitcher Where could I put the map_reduce? I tried after model_path, but this seems very wrong.

@jploski

jploski commented Apr 23, 2023

Try passing in n_ctx=2048 as a parameter.
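
For example, against the LlamaCppEmbeddings snippet above (a sketch only; n_ctx raises the context window from the default of 512 tokens):

from langchain.embeddings import LlamaCppEmbeddings

# bigger context window so longer chunks can be tokenized
llama = LlamaCppEmbeddings(model_path="ggml-model-q4_0.bin", n_ctx=2048)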

@darth-veitcher
Author

darth-veitcher commented Apr 23, 2023

EDIT: I've just seen your code and it looks like you're at the stage of loading a document and attempting to generate embeddings from it. My issue stemmed from returning results from a VectorStore (in my case OpenSearch) and then pushing those into the language model for summarisation.

You seem to be loading the document in its entirety and then trying to generate an embedding for the whole document. Whilst you could increase the token size, I'd recommend looking into document loaders and text splitters in the docs as a strategy, because your current approach won't scale.

At a high level you want to:

  1. Load an unstructured document
  2. Split this into manageable chunks (potentially with an overlap)
  3. Pass these chunks into the model to generate embeddings
  4. Persist these embeddings in a store of your choice (the docs use Chroma as a very simple example)

A set of implementation stubs might look like this.

utils.py

def load_unstructured_document(document: str) -> list[Document]:
    """Loads a given local unstructured document and returns the contained data synchronously.

    Args:
        document: string local path location of the document to load.

    Returns:
        A list of langchain `Document` objects. These contain primarily a `page_content` string and a `metadata` dictionary of fields.
    """
    data: Optional[list[Document]] = None
    try:
        loader: UnstructuredFileLoader = UnstructuredFileLoader(os.path.expanduser(str(document)))
        data = loader.load()
        logger.debug(f"Loaded {document}")
    except Exception as e:
        logger.exception(e)
    return data


def split_documents(documents: list[Document], chunk_size: int = 100, chunk_overlap: int = 0) -> list[Document]:
    """As documents can be large in size we need to split them down for models to interpret otherwise we run the risk of breaching token limits. Returns a list of split Documents.

    Args:
        documents: a list of langchain documents to split. This can be obtained through a loader.
        chunk_size: integer size of the window we are creating to split the document by. Default: 100
        chunk_overlap: integer size of the amount that each chunk should overlap the previous by. Default: 0

    Returns:
        A list of langchain `Document` objects. These contain primarily a `page_content` string and a `metadata` dictionary of fields.
    """
    docs: list[Document] = []
    try:
        text_splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        docs: list[Document] = text_splitter.split_documents(documents)
    except Exception as e:
        logger.exception(e)
    return docs


def store_embeddings_from_docs(
    documents: list[Document],
    opensearch_url: str = settings.opensearch.url,
    index_name: str = settings.opensearch.default_index,
    embeddings: Optional[Embeddings] = None,
) -> OpenSearchVectorSearch:
    """Sends the text string from loaded documents to a LLaMACPP embeddings.

    We then store the embedding in OpenSearch, our vector search engine, and persist it on a file system.

    Args:
        documents: A list of langchain `Document` objects. These contain primarily a `page_content` string and a `metadata` dictionary of fields.
        opensearch_url: connection string for elastic/opensearch.
        index_name: index/collection within the backend to store embeddings in. Defaults to settings.opensearch.default_index.
        embeddings: optional `langchain.embeddings.base.Embeddings` model to use. Defaults to the model specified in settings.

    Returns:
        A langchain OpenSearchVectorSearch `VectorStore` object acting as a wrapper around the OpenSearch embeddings platform.
    """
    vectordb: OpenSearchVectorSearch
    if not embeddings:
        embeddings = LlamaCppEmbeddings(
            model_path=os.path.join(settings.models.DIRECTORY, settings.models.EMBEDDING), n_batch=8192
        )
    try:
        vectordb = db(opensearch_url=opensearch_url, index_name=index_name)
        vectordb.embedding_function = embeddings
        vectordb.add_documents(documents)
    except Exception as e:
        if isinstance(e, OpenSearchException):
            logger.error(f"Error persisting {documents} to OpenSearch.")
        logger.exception(e)
        vectordb = db(opensearch_url=opensearch_url, index_name=index_name)

    return vectordb

You'd chain the above together in a CLI command.

cli.py

@app.command()
def index_unstructured_document(
    document: str,
    chunk_size: int = 100,
    chunk_overlap: int = 0,
) -> OpenSearchVectorSearch:
    """Loads an unstructured document from disk, indexes and persists embeddings into an OpenSearch VectorStore.

    Commandline wrapper functionality around multiple calls to: Load, Split, Embeddings, Persist

    Args:
        document: string local path location of the document to load.
        chunk_size: integer size of the window we are creating to split the document by. Default: 100
        chunk_overlap: integer size of the amount that each chunk should overlap the previous by. Default: 0

    Returns:
        A langchain OpenSearch `VectorStore` object acting as a wrapper around OpenSearch as a datastore for our embeddings.
    """
    logger.info(f"Indexing {document}")
    try:
        return utils.store_embeddings_from_docs(
            documents=utils.split_documents(
                documents=utils.load_unstructured_document(document), chunk_size=chunk_size, chunk_overlap=chunk_overlap
            ),
        )
    except (IndexError, ValueError) as e:  # usually means no embeddings returned or unsupported filetype
        logger.error(f"Unable to index {document}. {e}")
    except PythonDocxError as e:  # usually means it's a weird OneDrive symlink and hasn't been downloaded locally
        logger.error(f"Unable to index {document}. {e}")

Original response (specifically addressing question on chain_type)

@darth-veitcher Where could I put the map_reduce? I tried after model_path, but this seems very wrong.

That is a parameter for the LangChain chain itself, not the model.

chain: QAWithSourcesChain = QAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff")

I linked the official docs as well, but in the above you're passing in your llm and a chain_type. That type needs to be changed from stuff to one of the others (map_reduce, refine or map_rerank).

@PrincelyDread

Following... I'm experiencing the same issue with GGML models on text-generation-webui. Not entirely the same setup as you guys, but this is the only post with the same error.

@boraoku

boraoku commented Apr 27, 2023

For the past couple of days I have been sweating over a similar issue in my attempts to create a clean index over a single document using a chain with the input_documents parameter - that is to say, no prior training data to be used for answer generation, only the input document:

from langchain.embeddings import LlamaCppEmbeddings
embeddings = LlamaCppEmbeddings(model_path='./models/gpt4all-lora-quantized.bin')

# read txt file into an array of string chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
with open("./text_data.txt") as f:
    text_data = f.read()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_text(text_data)

# create index
from langchain.vectorstores import Chroma
docsearch = Chroma.from_texts(texts, embeddings)

# get language model for chain 
from langchain.llms import LlamaCpp
llm = LlamaCpp(model_path='./models/gpt4all-lora-quantized.bin')

# create chain
from langchain.chains.question_answering import load_qa_chain
chain = load_qa_chain(llm, chain_type="stuff")

# query
query = "What is the typical Elasticity Modulus values for hard clay?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

The above code works perfectly on Google Colab with OpenAI embeddings and LLM, but with offline copies of GPT4All or Facebook LLaMA it ends up with the Failed to tokenize error at the last query step.

I can get an answer if I replace the chain with something else that does not specify input_documents, such as the code below. But then the answer is not from that particular input file but from some prior learning of the LLM used. In contrast, the OpenAI LLM with the above code would produce the desired answer, or no answer (... not provided in the given context) if asked something outside the scope of the txt file. But obviously with the below code, other LLMs produce absolute garbage!

from langchain.chains import RetrievalQA
MIN_DOCS = 1  # more than 1 results in "Requested tokens exceed context window of 512"
print(query)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}))
qa.run(query)

It is really bizarre that I can't use publicly available LLMs to create a very simple use case of interrogating a text file. Am I missing something?

@darth-veitcher
Author

It is really bizarre that I can't use publicly available LLMs to create a very simple use case of interrogating a text file. Am I missing something?

You'll have the same issue - read my responses above (particularly the second section). Either change from chain_type="stuff" to chain_type="refine", increase the context tokens, etc.

As you can see, if you pass only a single document it'll work fine. In my experience, your use of stuff means you're not chunking the input to the LLM in a way it can process within the token buffer available to it.
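
An untested sketch against your snippet above (same model path, and the docs/query variables you already have) - a larger context window plus a chunking chain type:

from langchain.llms import LlamaCpp
from langchain.chains.question_answering import load_qa_chain

# widen the context window and avoid stuffing everything into one prompt
llm = LlamaCpp(model_path='./models/gpt4all-lora-quantized.bin', n_ctx=2048)
chain = load_qa_chain(llm, chain_type="refine")
chain.run(input_documents=docs, question=query)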

@boraoku

boraoku commented Apr 27, 2023

Thanks for the response mate.

I had already read everything here and just forgot to mention that I tried refine and map_reduce and got Requested tokens exceed context window of 512 after the fans ramped up quite a bit (on a MacBook Pro M1). Just to note: the only other option was map_rerank, which did not spin the fans but ended up giving the same Failed to tokenize error almost instantaneously.

Now I am guessing the next step, as you suggested, is to increase the context tokens to combat Requested tokens exceed context window of 512. Is there any documentation you can point me at so that I can hopefully sort this out?

Spent days now on this (as a hobbyist after day job hours) and this issue post was the best source of info for this error. Thanks for opening it....

@darth-veitcher
Author

Now I am guessing the next step, as you suggested, is to increase the context tokens to combat Requested tokens exceed context window of 512. Is there any documentation you can point me at so that I can hopefully sort this out?

I haven't got access currently, but it's something you should be able to change when initialising the model. With llama.cpp there might be a limit of 2048; I can't remember off the top of my head.

Something along the lines of this could be worth a try.

llm = LlamaCpp(model_path="models/my-model.bin", n_ctx=2048)

@radegran

I get this too. In short, I am trying to do Q&A with a map_reduce chain type using langchain==0.0.142.

vectordb = Chroma(persist_directory=persist_directory)

llm = LlamaCpp(model_path=GPT4ALL_MODEL_PATH, temperature=1e-10)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 2}),
    return_source_documents=True,
)

I get this error:

RuntimeError: Failed to tokenize: text="b" Given the following extracted parts of a long document and a question, 
create a final answer. \nIf you don't know the answer, just say that you don't know. Don't try to make up an 
answer.\n\nQUESTION: Which state/country's law governs the interpretation of the contract?\n=========\nContent: 
This Agreement is governed by English law and the parties submit to the exclusive jurisdiction of the English courts in 
 relation to any dispute (contractual or non-contractual) concerning this Agreement save that either party may apply to
 any court for an  injunction or other relief to protect its Intellectual Property Rights.\n\nContent: No Waiver. Failure o
r delay in exercising any right or remedy under this Agreement shall not constitute a waiver of such (or any other)  
right or remedy.\n\n11.7 Severability. The invalidity, illegality or unenforceability of any term (or part of a term) of 
this Agreement shall not affect the continuation  in force of the remainder of the term (if any) and this Agreement.
\n\n11.8 No Agency. Except as expressly stated otherwise, nothing in this Agreement shall create an agency, 
partnership or joint venture of any  kind between the parties.\n\n11.9 No Third-Party Beneficiaries.\n\nContent: (b)
 if Google believes, in good faith, that the Distributor has violated or caused Google to violate any Anti-Bribery Laws 

[etc]

... which comes from what looks like sample text from the source code of map_reduce_prompt.py.

@boraoku

boraoku commented Apr 28, 2023

I can report back that with n_ctx=2048 set on the llm, the stuff chain type, and chunk_size and chunk_overlap changed to 1000 and 200 respectively, GPT4All finally seems to generate some decent output from the given text, though still not answers as precise (and correct) as OpenAI's.
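
Roughly, the changes relative to my earlier snippet look like this (a sketch only; the full notebook is linked at the end of this comment):

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import LlamaCpp
from langchain.chains.question_answering import load_qa_chain

# larger chunks with more overlap, and a wider context window on the model
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
llm = LlamaCpp(model_path='./models/gpt4all-lora-quantized.bin', n_ctx=2048)
chain = load_qa_chain(llm, chain_type="stuff")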

GPT4All even makes up answers with some sort of logical interpolation of the available information, rather than finding the correct answer through a better understanding of the text. It is hard to explain, but I am fairly impressed and rather disappointed at the same time. I suppose and hope that future versions of GPT4All will come closer to OpenAI.

A side note: using the refine chain type returns an empty answer despite a fair bit of fan spin, similar to the stuff run.

Finally, I highly recommend watching this video by @karpathy, one of the lead AI brains of our time: https://www.youtube.com/watch?v=kCc8FmEb1nY It goes over the nitty-gritty details of the architecture behind GPTs (e.g. what a token is) and training strategies (e.g. chunks) with very clear examples and even code samples.

Here is the final full code for those of you who are interested: https://github.com/boraoku/jupyter-notebooks/blob/ec52acfba0b7dda75782d56a04b3df2b4bf62b27/GeoAI_Trials_01_GPT4AllmakesUpFormulas_fromBowles.ipynb

Cheers!

@fabmeyer

I could successfully run an open source model with LangChain for document retrieval.

# Embeddings:
# split texts
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_text(text)

# using Sentence Transformers embeddings
embeddings_ST = SentenceTransformerEmbeddings(model_name='all-MiniLM-L6-v2')

docsearch = Chroma.from_texts(texts, embeddings_ST, metadatas=[{'source': f'{i}-pl'} for i in range(len(texts))])

# Model:
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
model_LLAMA = LlamaCpp(model_path='/home/fabmeyer/Dev/Python/personal-investment-assistant/models/ggml-model-q4_0.bin', n_ctx=4096, callback_manager=callback_manager, verbose=True)

# Chain:
chain = RetrievalQAWithSourcesChain.from_chain_type(llm=model_LLAMA, chain_type='refine', retriever=docsearch.as_retriever())

However, the model's answer only gets written to the console (via the callback manager), and after a while the model returns answer: ''.

See the console output:
Answers: Leisure travel will lead the comeback in the tourism and travel sector. Business travel, a crucial source of revenue for hotels and airlines, could see a permanent shift or may come back only in phases based on proximity, reason for travel, and sector.query: {'question': 'What are trends in travel and vacation?', 'answer': '', 'sources': ''}

Anybody got an idea how to fix it?
There's an SO question about this very issue:
how-to-stream-agents-response-in-langchain

Besides that, great work!

@dosubot

dosubot bot commented Sep 22, 2023

Hi, @darth-veitcher! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported was related to a RuntimeError that occurred when running a QAWithSourcesChain with a local LLM model. There have been some suggestions in the comments to fix this issue, such as changing the chain type to map_reduce, refine, or map_rerank, and increasing the n_ctx parameter when initializing the LLM model. Additionally, there have been discussions about chunking the input documents and increasing the context tokens.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to LangChain!

@dosubot added the stale label on Sep 22, 2023
@dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 29, 2023
@dosubot removed the stale label on Sep 29, 2023