Failed to tokenize: langchain with gpt4all model #2659

Closed
Mohamedballouch opened this issue Apr 10, 2023 · 12 comments

Comments

@Mohamedballouch

    112 if int(n_tokens) < 0:
--> 113     raise RuntimeError(f'Failed to tokenize: text="{text}" n_tokens={n_tokens}')
    114 return list(tokens[:n_tokens])

RuntimeError: Failed to tokenize: text="b" Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\ndu Home Wireless.\nUnlimited internet with a free router\n\ndu home wireless is a limited mobility service and subscription.\n\nOnce the device is activated, your home location will be assigned as the first usage location and will only work there. If you decide to move or shift locations, please request us to unlock it and it will be re-assigned to the new location.\n\nAs a new du Home Wireless customer, you\xe2\x80\x99ll receive an LTE router for internet connection at the time of the service application. It will include a Data SIM with unlimited data.\n\ndu Home wireless advantages.\nUnlimited data: Enjoy wireless connectivity with unlimited data for 12 months.\n\nHigh-quality streaming: Stream your favorite entertainment, chat and game at the same time.\n\n5G-enabled router: Connect all your devices with the latest WiFi 5 technology.\nWhat is du Home Wireless?\n\nWhat type of internet activities does the Home Wireless Plan support?\n\nIt supports the following internet activities depending on the plan you get:\n\nStandard internet activities\nVideo and music streaming\nGaming and VR\nSocial media\nWeb surfing\nEmail and office work\n\nCan I connect with more than one device at the same time?\n\nYes, you can. Ideally, the average number of connected devices shouldn\xe2\x80\x99t exceed 4 large screens on our Home Wireless Plus and Home Wireless Entertainment Plans.\n\nWill I always get a consistent speed?\n\nInternet speed is not guaranteed. Individual results will vary as it might be affected by many factors such as the weather, interference from buildings and network capacity. 
The areas wherein you\xe2\x80\x99ll get the best coverage experience are the following:\n\nNear a window\nIn an open space away from walls, obstructions, heavy-duty appliances, or electronics such as microwave ovens and baby monitors\nNear a power outlet\n\nWill I be able to bring my own router?\n\nYes, you have the option to use your own router.\n\nTo connect, check the below steps:\n\nInsert your du SIM card in the back of the router\nConnect to power and turn on the device\nConnect to your router using the Wi-Fi SSID and WiFi password information at the sticker on the bottom\nFor connection steps, check the video: Watch now\n\nHow can I subscribe to the Internet Calling Pack on the Home Wireless Entertainment Plan?\n\nThe free Internet Calling Pack subscription will be added to your plan for a period of three months by default.\n\nWho is eligible to get the free Internet Calling Pack?\n\nNew Home Wireless Entertainment subscribers will enjoy this added benefit.\n\nHome Wireless plans are our new range of Home Plans, that offer instant connectivity, unlimited data and premium entertainment so you can enjoy instant, plug-and-play high-quality internet.\n\nWhere does this service work?\n\nThis service has limited mobility. Once the device is activated, your home location will be assigned as the first usage location and will only work there. If you decide to move or shift locations, you will have to ask us to unlock it so your Home Wireless can be re-assigned to your new location.\n\nWhat kind of router do I get with this plan?\n\nYou will receive a 5G-enabled router.\n\nWhat happens if I don\xe2\x80\x99t have 5G?\n\nIt will automatically connect to 4G.\n\nIs a Landline required for a home wireless connection?\n\nNo, it\xe2\x80\x99s not.\n\nHow does this service work?\n\nAs a new Home Wireless customer, you\xe2\x80\x99ll receive a router for internet connection at the time of your service application. 
It will include a Data SIM with unlimited data.\n\nQuestion: What is du Home Wireless?\nHelpful Answer:"" n_tokens=-908
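A note on the negative token count: llama.cpp's `llama_tokenize` conventionally signals an undersized output buffer by returning a negative value whose magnitude is the number of tokens actually required. Under that assumption (hedged; verify against your llama-cpp-python version), `n_tokens=-908` means the prompt needs 908 tokens but the buffer handed to the tokenizer was smaller. A minimal sketch of that interpretation:

```python
# Sketch: interpret llama_tokenize's return value, assuming the llama.cpp
# convention that a negative result means "buffer too small; the magnitude
# is the required token count". This is an illustration, not library code.

def interpret_tokenize_result(n_tokens: int) -> str:
    if n_tokens >= 0:
        return f"tokenized OK: {n_tokens} tokens"
    return (f"buffer too small: prompt needs {-n_tokens} tokens, "
            "more than the buffer passed to llama_tokenize allows")

print(interpret_tokenize_result(-908))
# buffer too small: prompt needs 908 tokens, more than the buffer passed to llama_tokenize allows
```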

@adriacabeza

Are you using LlamaCpp? I am encountering the same issue.

@Mohamedballouch
Author

@adriacabeza Yes, I'm using gpt4all:

from langchain.llms import LlamaCpp
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import FAISS

llm = LlamaCpp(model_path="./gpt4all-converted.bin")  # model
llama_embeddings = LlamaCppEmbeddings(model_path="./gpt4all-converted.bin")  # embeddings
docsearch = FAISS.from_texts(texts, llama_embeddings)

@calderonsteven
calderonsteven commented Apr 11, 2023

I'm getting the same error when running:

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever()
)

query = "De que se trata el decreto 4927?"
qa.run(query)

I got this error too: llama_tokenize: too many tokens
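Since both errors point to the prompt outrunning the model's context window, a quick pre-flight check can estimate the prompt's token count before calling the chain. The sketch below uses a ~4 characters/token heuristic (only an approximation for English text); the 2048 limit is an assumption matching the common llama.cpp `n_ctx` default, not something verified against any specific model here:

```python
# Rough pre-flight check: estimate whether a prompt fits the model's
# context window before running the chain. ~4 characters per token is a
# common rule of thumb; an exact count needs the model's own tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude token estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)

def fits_context(prompt: str, n_ctx: int = 2048, reserve: int = 256) -> bool:
    """Check the prompt leaves `reserve` tokens free for the completion."""
    return estimate_tokens(prompt) + reserve <= n_ctx

short_prompt = "De que se trata el decreto 4927?"
long_prompt = "x" * 40_000  # ~10k tokens, far beyond a 2048-token window

print(fits_context(short_prompt))  # True
print(fits_context(long_prompt))   # False
```

If your LLM wrapper exposes a token-counting method, prefer that over the heuristic.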

@iokarkan

I encountered the same error with a similar setup, using Chroma and RetrievalQAWithSourcesChain. When I use RetrievalQA.from_chain_type and the .run(query) method instead, I can successfully get results.

This works:

from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever()
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
query = """Below is an instruction that describes a task. Write a response that appropriately completes the request. 

How many items are listed?"""
print(qa.run(query))

This throws the error:

from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever()
from langchain.chains import RetrievalQAWithSourcesChain
qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
query = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

How many items are listed?"""
docs = db.similarity_search(query)
print(qa({"question": query}, return_only_outputs=True))

@calderonsteven

Try this workaround: passing search_kwargs={"k": MIN_DOCS} forces the document search to use only one document. I found that with a single document loaded for Q&A, the error no longer happens.

from langchain.chains import RetrievalQA

# Just getting a single result document from the knowledge lookup is fine...
MIN_DOCS = 1

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS})
)

query = "De que trata el DECRETO 4927 DE 2011 en colombia?"
qa.run(query)

@MakkiNeutron

Same error occurred here.

@seancummins1

Review the content that you have indexed. Ensure you don't have any special characters or escape characters in your dataset, e.g. " or /.
I found that my data contained these because it was scraped from a URL directly into a text file, so it still had markup, which was causing the issue.
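For scraped data like this, a small cleaning pass before indexing can strip the markup and control characters. The sketch below is illustrative, not a complete sanitizer; the specific regexes and the escape sequences handled are assumptions about what typical scraped text contains:

```python
import re

def clean_scraped_text(raw: str) -> str:
    """Strip HTML tags, literal escape sequences and control characters
    that can confuse downstream tokenization of scraped web content."""
    text = re.sub(r"<[^>]+>", " ", raw)                    # drop HTML/markup tags
    text = text.replace("\\n", "\n").replace('\\"', '"')   # unescape literal \n and \"
    text = re.sub(r"[^\x09\x0a\x20-\x7e\u00a0-\uffff]", " ", text)  # drop control chars
    return re.sub(r"[ \t]+", " ", text).strip()            # collapse runs of spaces

sample = '<p>du Home\\nWireless</p>\x00  offers \\"unlimited\\" data'
print(clean_scraped_text(sample))
```

Run this over each document before handing it to the text splitter and embedder.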

@sergedc

sergedc commented Apr 22, 2023

I am experiencing the same issue with the following commands:

from langchain.llms import LlamaCpp
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.chains.summarize import load_summarize_chain

llm = LlamaCpp(model_path=r"D:\AI\Model\vicuna-13B-1.1-GPTQ-4bit-128g.GGML.bin", n_ctx=2048, seed=5, n_threads=7, temperature=0, repeat_penalty=1, echo=True, f16_kv=True)

text_splitter = CharacterTextSplitter()
with open(r"D:\AI\Langchain\mydoc.txt", encoding='utf-8') as f:
    state_of_the_union = f.read()
texts = text_splitter.split_text(state_of_the_union)

docs = [Document(page_content=t) for t in texts[:3]]

chain = load_summarize_chain(llm, chain_type="map_reduce")
myoutput = chain.run(docs)

mydoc.txt is long and gets broken into parts by LangChain. The first two parts get summarized fine; the error occurs on the 3rd part, which is longer than the first two.

mydoc.txt does not have any special characters.

I suspect the problem comes from the configuration of the context size (n_ctx) versus the size of the text chunks.
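That suspicion can be made concrete with back-of-envelope arithmetic: each map step must fit the prompt template, one chunk, and the completion inside n_ctx. The numbers below are illustrative assumptions (the ~4 chars/token ratio is a heuristic, and the overhead/reserve figures are guesses, not measured values):

```python
# Back-of-envelope budget for matching the splitter's chunk size to n_ctx.
# All constants are illustrative assumptions, not measured values.

N_CTX = 2048              # llama.cpp context window, in tokens
PROMPT_OVERHEAD = 128     # rough allowance for the summarize prompt template
COMPLETION_RESERVE = 512  # tokens reserved for the generated summary
CHARS_PER_TOKEN = 4       # heuristic for English text

max_chunk_tokens = N_CTX - PROMPT_OVERHEAD - COMPLETION_RESERVE
max_chunk_chars = max_chunk_tokens * CHARS_PER_TOKEN

print(max_chunk_tokens)  # 1408
print(max_chunk_chars)   # 5632
```

Since CharacterTextSplitter measures chunks in characters by default, passing something like chunk_size=max_chunk_chars to it would keep each map step within budget; a chunk much longer than that is exactly the situation where the 3rd part fails while the shorter first two succeed.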

@sergedc

sergedc commented Apr 22, 2023


How would the search_kwargs={"k": MIN_DOCS} workaround apply in the case of load_summarize_chain? Should I try this?
chain = load_summarize_chain(llm, chain_type="map_reduce", retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}))
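For what it's worth, load_summarize_chain does not accept a retriever argument, so that exact call would fail; the analogous fix is to cap what you pass to chain.run(...). A sketch of the selection logic (select_docs_within_budget is a hypothetical helper, and the ~4 chars/token estimate is a heuristic):

```python
# load_summarize_chain has no `retriever` parameter; instead, limit the
# documents handed to chain.run(...). Hypothetical helper that keeps only
# the leading chunks whose combined estimated tokens fit a budget.

def select_docs_within_budget(texts, max_tokens=1400):
    """Keep leading chunks while their total estimated tokens fit the budget."""
    selected, used = [], 0
    for text in texts:
        cost = max(1, len(text) // 4)  # crude ~4 chars/token estimate
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return selected

chunks = ["a" * 2000, "b" * 2000, "c" * 4000]  # roughly 500, 500, 1000 tokens
kept = select_docs_within_budget(chunks, max_tokens=1400)
print(len(kept))  # 2 — the third chunk would overflow the budget
```

You would then build `docs = [Document(page_content=t) for t in select_docs_within_budget(texts)]` before calling chain.run(docs), rather than passing a retriever to the summarize chain.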

@WitzHsiao

WitzHsiao commented Apr 25, 2023

I am encountering the same issue when using Chroma.from_documents to load PDF documents with llama.cpp embeddings. BTW, I have tried multiple models, including gpt4all, vicuna, and alpaca.

I tried the same approach with a PDF that only includes a simple sentence, and it works totally fine. I think the content might be causing the issue.

@Sh1d0w

Sh1d0w commented May 13, 2023

I am getting the same error by following the official tutorial from the docs: https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/chroma_self_query_retriever.html

The only difference is that I use LlamaCpp with LlamaCppEmbeddings and ggml-gpt4all-l13b-snoozy.bin

@dosubot

dosubot bot commented Sep 22, 2023

Hi, @Mohamedballouch! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is a runtime error when tokenizing a long prompt: the tokenizer returns a negative token count (n_tokens=-908). Other users have encountered the same issue and have provided potential workarounds, such as passing search_kwargs={"k": MIN_DOCS} to force the document search to use only one document. Some users suspect that the issue is related to the configuration of the context size versus the size of the text chunks.

However, it seems that the issue has been resolved by using the workaround suggested by other users. By passing search_kwargs={"k": MIN_DOCS}, the runtime error when tokenizing the text is no longer occurring.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

@dosubot added the "stale" label on Sep 22, 2023
@dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 29, 2023
@dosubot removed the "stale" label on Sep 29, 2023
9 participants