
ValueError: Got a larger chunk overlap (0) than chunk size (-160), should be smaller. #1371

Closed
slavag opened this issue Feb 6, 2024 · 1 comment


slavag commented Feb 6, 2024

Hi,
I'm trying to summarize text. I have a collection of documents (5 GB). The text is relatively short, just a few sentences, and I'm getting this error:

Traceback (most recent call last):
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/blocks.py", line 1199, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/utils.py", line 519, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/utils.py", line 512, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/utils.py", line 649, in gen_wrapper
    yield from f(*args, **kwargs)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gradio_runner.py", line 4401, in bot
    for res in get_response(fun1, history, chatbot_role1, speaker1, tts_language1, roles_state1,
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gradio_runner.py", line 4296, in get_response
    for output_fun in fun1():
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gen.py", line 3702, in evaluate
    for r in run_qa_db(
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 5111, in _run_qa_db
    get_chain(**sim_kwargs)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 6562, in get_chain
    docs_with_score, max_doc_tokens = split_merge_docs(docs_with_score,
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 5459, in split_merge_docs
    text_splitter = H2OCharacterTextSplitter.from_huggingface_tokenizer(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 5433, in from_huggingface_tokenizer
    return cls(length_function=_huggingface_tokenizer_length, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/text_splitter.py", line 857, in __init__
    super().__init__(keep_separator=keep_separator, **kwargs)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/text_splitter.py", line 122, in __init__
    raise ValueError(
ValueError: Got a larger chunk overlap (0) than chunk size (-160), should be smaller.

Please advise.
Thanks
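For context, the message comes from a sanity check in langchain's text splitter constructor: the overlap must not exceed the chunk size, so a negative chunk size always fails even with zero overlap. Below is a minimal sketch that imitates that validation; it is not the actual langchain code, just an illustration of how the reported values trip the check.

```python
# Sketch of the validation that produces this ValueError. The real check
# lives in langchain's TextSplitter.__init__; this class only imitates it.

class TextSplitterSketch:
    def __init__(self, chunk_size: int, chunk_overlap: int):
        # Any negative chunk_size is smaller than a zero overlap,
        # so this branch fires even when chunk_overlap == 0.
        if chunk_overlap > chunk_size:
            raise ValueError(
                f"Got a larger chunk overlap ({chunk_overlap}) than chunk "
                f"size ({chunk_size}), should be smaller."
            )
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

try:
    TextSplitterSketch(chunk_size=-160, chunk_overlap=0)
except ValueError as e:
    print(e)  # same message as in the traceback above
```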

@pseudotensor (Collaborator)

Thanks for reporting! I think it should be fixed now, but I assume you had many chunks in this case, maybe around 1000, which is probably unusual.
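One plausible way a negative chunk size arises with many chunks is dividing a fixed token budget across the chunks after subtracting prompt overhead: once the overhead exceeds the budget, the per-chunk size goes negative. The helper below is hypothetical (the names and the clamp are not from the h2ogpt fix), but it illustrates a defensive pattern that keeps the value passed to the splitter positive.

```python
def safe_chunk_size(max_tokens: int, overhead_tokens: int, n_chunks: int,
                    minimum: int = 64) -> int:
    """Divide the remaining token budget across chunks, never below `minimum`.

    Hypothetical helper: illustrates guarding against a negative per-chunk
    size when prompt overhead exceeds the total token budget.
    """
    per_chunk = (max_tokens - overhead_tokens) // max(n_chunks, 1)
    return max(per_chunk, minimum)

print(safe_chunk_size(2048, 512, 8))    # budget left over: (2048-512)//8 = 192
print(safe_chunk_size(2048, 2400, 8))   # overhead exceeds budget: clamped to 64
```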
