
ValueError: Got a larger chunk overlap (0) than chunk size (-160), should be smaller. #1371

Closed
slavag opened this issue Feb 6, 2024 · 1 comment


slavag commented Feb 6, 2024

Hi,
I'm trying to summarize text. I have a collection of documents (5 GB). The text is relatively short, just a few sentences, and I'm getting this error:

Traceback (most recent call last):
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/blocks.py", line 1199, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/utils.py", line 519, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/utils.py", line 512, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/gradio/utils.py", line 649, in gen_wrapper
    yield from f(*args, **kwargs)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gradio_runner.py", line 4401, in bot
    for res in get_response(fun1, history, chatbot_role1, speaker1, tts_language1, roles_state1,
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gradio_runner.py", line 4296, in get_response
    for output_fun in fun1():
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gen.py", line 3702, in evaluate
    for r in run_qa_db(
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 5111, in _run_qa_db
    get_chain(**sim_kwargs)
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 6562, in get_chain
    docs_with_score, max_doc_tokens = split_merge_docs(docs_with_score,
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 5459, in split_merge_docs
    text_splitter = H2OCharacterTextSplitter.from_huggingface_tokenizer(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py", line 5433, in from_huggingface_tokenizer
    return cls(length_function=_huggingface_tokenizer_length, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/text_splitter.py", line 857, in __init__
    super().__init__(keep_separator=keep_separator, **kwargs)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/text_splitter.py", line 122, in __init__
    raise ValueError(
ValueError: Got a larger chunk overlap (0) than chunk size (-160), should be smaller.

Please advise.
Thanks
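For context, the message comes from a sanity check in langchain's text splitter constructor: the overlap must not exceed the chunk size, so a negative chunk size always fails even with zero overlap. Below is a minimal sketch that imitates that validation; it is not the actual langchain code, just an illustration of how the reported values trip the check.

```python
# Sketch of the validation that produces this ValueError. The real check
# lives in langchain's TextSplitter.__init__; this class only imitates it.

class TextSplitterSketch:
    def __init__(self, chunk_size: int, chunk_overlap: int):
        # Any negative chunk_size is smaller than a zero overlap,
        # so this branch fires even when chunk_overlap == 0.
        if chunk_overlap > chunk_size:
            raise ValueError(
                f"Got a larger chunk overlap ({chunk_overlap}) than chunk "
                f"size ({chunk_size}), should be smaller."
            )
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

try:
    TextSplitterSketch(chunk_size=-160, chunk_overlap=0)
except ValueError as e:
    print(e)  # same message as in the traceback above
```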

@pseudotensor (Collaborator)

Thanks for reporting! I think it should be fixed now, but I assume you had many chunks in this case, maybe around 1000, which is probably unusual.
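One plausible way a negative chunk size arises with many chunks is dividing a fixed token budget across the chunks after subtracting prompt overhead: once the overhead exceeds the budget, the per-chunk size goes negative. The helper below is hypothetical (the names and the clamp are not from the h2ogpt fix), but it illustrates a defensive pattern that keeps the value passed to the splitter positive.

```python
def safe_chunk_size(max_tokens: int, overhead_tokens: int, n_chunks: int,
                    minimum: int = 64) -> int:
    """Divide the remaining token budget across chunks, never below `minimum`.

    Hypothetical helper: illustrates guarding against a negative per-chunk
    size when prompt overhead exceeds the total token budget.
    """
    per_chunk = (max_tokens - overhead_tokens) // max(n_chunks, 1)
    return max(per_chunk, minimum)

print(safe_chunk_size(2048, 512, 8))    # budget left over: (2048-512)//8 = 192
print(safe_chunk_size(2048, 2400, 8))   # overhead exceeds budget: clamped to 64
```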
