IndexError: string index out of range #62

Closed
lakatosd opened this issue Oct 19, 2023 · 4 comments


@lakatosd

Describe the bug
Using the following input:

Megjelenítőjük 12 hüvelykes, SVGA( 800¥600) felbontású színes LCD, amelyre 6 milliméteres üveggel védett érintőképernyő kerül, és kültéri használatra tervezett billentyűzete is van.

huspacy raises the following error:

IndexError: string index out of range

Traceback:

File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/home/user/app/app.py", line 12, in <module>
    spacy_streamlit.visualize(
File "/home/user/.local/lib/python3.11/site-packages/spacy_streamlit/visualizer.py", line 102, in visualize
    doc = process_text(spacy_model, text)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 211, in wrapper
    return cached_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 240, in __call__
    return self._get_or_create_cached_value(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 266, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 320, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/spacy_streamlit/util.py", line 16, in process_text
    return nlp(text)
           ^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/spacy/language.py", line 1024, in __call__
    error_handler(name, proc, [doc], e)
File "/home/user/.local/lib/python3.11/site-packages/spacy/util.py", line 1701, in raise_error
    raise e
File "/home/user/.local/lib/python3.11/site-packages/spacy/language.py", line 1019, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/hu_core_news_trf/lookup_lemmatizer.py", line 101, in __call__
    token.lemma_ = self.__replace_numbers(lemma_by_pos[key], token.text)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/hu_core_news_trf/lookup_lemmatizer.py", line 132, in __replace_numbers
    return cls._number_pattern.sub(lambda match: token[match.start()], lemma)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/hu_core_news_trf/lookup_lemmatizer.py", line 132, in <lambda>
    return cls._number_pattern.sub(lambda match: token[match.start()], lemma)
                                                 ~~~~~^^^^^^^^^^^^^^^
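The last two frames show the failing expression: each match found in the lemma is replaced by the token character at the same index. A minimal sketch of how that can raise `IndexError`, assuming a hypothetical single-digit pattern (the real `_number_pattern` is not shown in this issue) and made-up lemma/token pairs:

```python
import re

# Assumption: a stand-in for the lemmatizer's _number_pattern, whose real
# definition is not visible in the traceback.
NUMBER_PATTERN = re.compile(r"\d")


def replace_numbers(lemma: str, token: str) -> str:
    # Same shape as the line in the traceback: every digit matched in the
    # lemma is replaced by the token character at the match position.
    return NUMBER_PATTERN.sub(lambda match: token[match.start()], lemma)


def replace_numbers_guarded(lemma: str, token: str) -> str:
    # Hypothetical guard, not the maintainers' fix: keep the lemma's own
    # digit when the match position lies outside the token.
    def pick(match: re.Match) -> str:
        index = match.start()
        return token[index] if index < len(token) else match.group()

    return NUMBER_PATTERN.sub(pick, lemma)


# Works while every digit position in the lemma exists in the token.
print(replace_numbers("0x0", "8x6"))

# Fails as soon as the lemma has a digit past the end of the token.
try:
    replace_numbers("ab0", "a")
except IndexError as error:
    print("IndexError:", error)

print(replace_numbers_guarded("ab0", "a"))
```

The sketch only illustrates the failure mode, that is, a lemma and a token whose lengths or digit positions do not line up. It does not claim to reproduce what the lemmatizer computes for the sentence above.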

To Reproduce
Steps to reproduce the behavior:

  1. Go to Huspacy Demo
  2. Paste the above sentence into the textbox.
  3. Run

@oroszgy
Member

oroszgy commented Oct 19, 2023

Thanks for the report! I am looking into it, and so far it looks like the "¥" character is the main cause of the issue.

@oroszgy
Member

oroszgy commented Oct 20, 2023

@lakatosd As a workaround, you can disable the lemmatization-related components until the fix is released.
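One way to apply this workaround is sketched below. It assumes that the lemmatization-related components can be recognised by "lemma" in their pipeline names. That naming is an assumption, so check `nlp.pipe_names` for the installed model first. The helper names are hypothetical.

```python
from typing import Iterable, List


def lemma_related(pipe_names: Iterable[str]) -> List[str]:
    # Assumption: lemmatization components carry "lemma" in their name.
    return [name for name in pipe_names if "lemma" in name.lower()]


def load_without_lemmatization(model_name: str = "hu_core_news_trf"):
    # spaCy is imported lazily so the helper above can be used without it.
    import spacy

    nlp = spacy.load(model_name)
    for name in lemma_related(nlp.pipe_names):
        # Language.disable_pipe keeps the component loaded but skips it
        # when nlp(text) is called.
        nlp.disable_pipe(name)
    return nlp


# Illustrative component names only, not the model's actual pipeline.
print(lemma_related(["tagger", "lookup_lemmatizer", "ner"]))
```

With those components disabled, lemmas are no longer assigned, so downstream code that reads `token.lemma_` needs to tolerate that until the fixed models are installed.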

@oroszgy
Member

oroszgy commented Oct 25, 2023

The fixed lemmatizer is scheduled to be released for these models:

  • hu_core_news_trf v3.5.4
  • hu_core_news_lg v3.7.0
  • hu_core_news_md v3.7.0

@oroszgy
Member

oroszgy commented Oct 26, 2023
