Encoders error when dealing with Chinese text #15

Closed
BluSteve opened this issue Sep 21, 2021 · 6 comments

Comments

@BluSteve

BluSteve commented Sep 21, 2021

Hi all,

http://127.0.0.1:8000/api/v2/external_sources?query=envisage&src=en&dst=zh
https://www.linguee.com/english-chinese/translation/envisage.html

Try translating "envisage" to Chinese. On Heroku it works perfectly fine, but when I install it with Poetry using the instructions in /docs, I get an error. The local install works fine for most other languages, as far as I can tell, but fails on Chinese due to some encoding problem. English to Swedish (sv) fails as well, with the error in a different place.

en to zh:

INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
2021-09-21 22:15:42.330 | INFO     | linguee_api.linguee_client:process_search_result:40 - Processing API request: query='envisage', src='en', dst='zh', guess_direction=False
INFO:     127.0.0.1:53533 - "GET /api/v2/external_sources?query=envisage&src=en&dst=zh HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 396, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\fastapi\applications.py", line 199, in __call__
    await super().__call__(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\applications.py", line 111, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\middleware\errors.py", line 181, in __call__
    raise exc from None
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\middleware\errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\sentry_sdk\integrations\asgi.py", line 106, in _run_asgi3
    return await self._run_app(scope, lambda: self.app(scope, receive, send))
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\sentry_sdk\integrations\asgi.py", line 152, in _run_app
    raise exc from None
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\sentry_sdk\integrations\asgi.py", line 149, in _run_app
    return await callback()
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\exceptions.py", line 82, in __call__
    raise exc from None
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\routing.py", line 566, in __call__
    await route.handle(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\routing.py", line 41, in app
    response = await func(request)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\fastapi\routing.py", line 201, in app
    raw_response = await run_endpoint_function(
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\fastapi\routing.py", line 148, in run_endpoint_function
    return await dependant.call(**values)
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\api.py", line 114, in external_sources
    result = await client.process_search_result(
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\linguee_client.py", line 52, in process_search_result
    page_html = await self.page_downloader.download(url)
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\downloaders\memory_cache.py", line 18, in download
    self.cache[url] = await self.upstream.download(url)
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\downloaders\file_cache.py", line 21, in download
    return read_text(page_file)
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\utils.py", line 11, in read_text
    return content.decode(encoding)
TypeError: decode() argument 'encoding' must be str, not None

en to sv:

2021-09-21 22:17:28.899 | INFO     | linguee_api.linguee_client:process_search_result:40 - Processing API request: query='envisage', src='en', dst='sv', guess_direction=False
INFO:     127.0.0.1:53550 - "GET /api/v2/external_sources?query=envisage&src=en&dst=sv HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 396, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\fastapi\applications.py", line 199, in __call__
    await super().__call__(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\applications.py", line 111, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\middleware\errors.py", line 181, in __call__
    raise exc from None
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\middleware\errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\sentry_sdk\integrations\asgi.py", line 106, in _run_asgi3
    return await self._run_app(scope, lambda: self.app(scope, receive, send))
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\sentry_sdk\integrations\asgi.py", line 152, in _run_app
    raise exc from None
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\sentry_sdk\integrations\asgi.py", line 149, in _run_app
    return await callback()
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\exceptions.py", line 82, in __call__
    raise exc from None
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\routing.py", line 566, in __call__
    await route.handle(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\starlette\routing.py", line 41, in app
    response = await func(request)
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\fastapi\routing.py", line 201, in app
    raw_response = await run_endpoint_function(
  File "C:\Users\billi\AppData\Local\pypoetry\Cache\virtualenvs\linguee-api-_qdfDmc_-py3.9\lib\site-packages\fastapi\routing.py", line 148, in run_endpoint_function
    return await dependant.call(**values)
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\api.py", line 114, in external_sources
    result = await client.process_search_result(
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\linguee_client.py", line 52, in process_search_result
    page_html = await self.page_downloader.download(url)
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\downloaders\memory_cache.py", line 18, in download
    self.cache[url] = await self.upstream.download(url)
  File "C:\Users\billi\Downloads\linguee-api-master\.\linguee_api\downloaders\file_cache.py", line 20, in download
    page_file.write_text(page)
  File "C:\Users\billi\AppData\Local\Programs\Python\Python39\lib\pathlib.py", line 1276, in write_text
    return f.write(data)
  File "C:\Users\billi\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2326' in position 28255: character maps to <undefined>

I've tried hardcoding 'utf-8' in line 20 of file_cache.py and line 10 of utils.py, to no avail. page_html in parsers.py is an empty string (line 55 is where the error is raised).
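For readers following along, here is a minimal sketch of what the second traceback appears to show. When no encoding is passed, `Path.write_text` falls back to the locale's preferred encoding, which is cp1252 on this Windows machine according to the traceback, and cp1252 has no mapping for U+2326. The sample string and the `can_encode` helper are made up for illustration and are not part of linguee-api.

```python
import tempfile
from pathlib import Path

# Hypothetical page fragment containing U+2326, the character from the traceback.
page = "delete \u2326 key"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cp1252_ok = can_encode(page, "cp1252")  # the Windows locale default seen in the traceback
utf8_ok = can_encode(page, "utf-8")

with tempfile.TemporaryDirectory() as tmp:
    page_file = Path(tmp) / "page.html"
    # Passing the encoding explicitly makes the write independent of the locale.
    page_file.write_text(page, encoding="utf-8")
    restored = page_file.read_bytes().decode("utf-8")

print(cp1252_ok, utf8_ok, restored == page)
```

One plausible reading, not confirmed in this thread, is that a write that fails this way leaves an empty file in the cache, which would be consistent with `page_html` being an empty string on later requests.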

Two Windows 10 computers are giving the same behavior. I have not tried it with Linux yet.

@imankulov
Owner

Thanks for the report, @BluSteve. I'll take a look by the end of the week, likely over the weekend. Cheers!

@imankulov
Owner

Hey @BluSteve, unfortunately, I cannot reproduce the issue. I tried to fix the issue blindly by replacing "chardet" with the hardcoded "utf8" encoding.
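The idea of dropping detection in favour of a fixed encoding can be sketched roughly as follows. This is not the code from the remove-chardet branch; `read_text_utf8` and `decode_fails_with_none` are hypothetical names, and the sketch assumes the cached pages are stored as UTF-8. It also reproduces the `TypeError` from the first traceback, which is what `bytes.decode` raises when the encoding it receives is `None`.

```python
import tempfile
from pathlib import Path


def read_text_utf8(path: Path) -> str:
    """Read a file as UTF-8 instead of guessing its encoding (assumed storage format)."""
    return path.read_bytes().decode("utf-8")


def decode_fails_with_none(content: bytes) -> bool:
    """Show that decoding with encoding=None raises the TypeError from the traceback."""
    try:
        content.decode(None)  # what happens when a detector yields no encoding
    except TypeError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    page_file = Path(tmp) / "page.html"
    page_file.write_bytes("设想 envisage".encode("utf-8"))
    text = read_text_utf8(page_file)

print(text, decode_fails_with_none(b""))
```

A fixed encoding trades flexibility for predictability: it cannot return `None`, so an empty or unusual file no longer crashes the decode step.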

Would you mind trying the branch remove-chardet and telling me if it works for you?

Before testing, please don't forget to remove all files from the .cache directory of the project. We need to make sure there is no broken data in there before testing.
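Clearing the cache can be done by hand, or with a small script along these lines. This is only a sketch: `clear_cache` is a hypothetical helper, not something linguee-api ships, and it assumes the cache is a flat directory of files. The demo runs against a temporary directory rather than the real .cache.

```python
import tempfile
from pathlib import Path


def clear_cache(cache_dir: Path) -> int:
    """Delete every regular file directly inside cache_dir; return how many were removed."""
    removed = 0
    if not cache_dir.is_dir():
        return removed
    for entry in cache_dir.iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed


# Demo on a throwaway directory standing in for the project's .cache folder.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / ".cache"
    cache_dir.mkdir()
    (cache_dir / "page1.html").write_text("", encoding="utf-8")  # simulated broken entry
    (cache_dir / "page2.html").write_text("<html></html>", encoding="utf-8")
    removed_count = clear_cache(cache_dir)
    remaining = list(cache_dir.iterdir())

print(removed_count, remaining)
```

For the real project, the call would presumably be `clear_cache(Path(".cache"))` run from the repository root.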

@BluSteve
Author

I tried doing the same poetry install script in the remove-chardet branch but it's giving me this error for some reason. Probably something on my end, I'm not too familiar with poetry.

Installing dependencies from lock file

Package operations: 63 installs, 0 updates, 0 removals

  • Installing pyparsing (2.4.7)

  ValueError

  File \C:\Users\billi\AppData\Local\pypoetry\Cache\artifacts\92\0f\cf\effdcd5d76a6186df0969f85b3b030284ff8058936d5016540b5258ea3\pyparsing-2.4.7-py2.py3-none-any.whl does not exist

  at ~\.poetry\lib\poetry\_vendor\py3.9\poetry\core\packages\file_dependency.py:40 in __init__
       36│             except FileNotFoundError:
       37│                 raise ValueError("Directory {} does not exist".format(self._path))
       38│
       39│         if not self._full_path.exists():
    →  40│             raise ValueError("File {} does not exist".format(self._path))
       41│
       42│         if self._full_path.is_dir():
       43│             raise ValueError("{} is a directory, expected a file".format(self._path))
       44│

@imankulov
Owner

It looks like it's a Poetry issue this time. It's been reported here, and there is a workaround, provided both in the comment and in this StackOverflow answer.

As suggested in the workaround, please try exporting dependencies and installing them with pip.

$ poetry export -f requirements.txt --output requirements.txt --without-hashes
$ pip install -r requirements.txt

@BluSteve
Author

That fix for poetry didn't work for me for whatever reason, but a simple cache clear did the trick. 🤷

Good news! The remove-chardet branch works fine with special characters. I've also checked master again, both with the exact same environment, and the master branch returned the same error as before. The issue seems to be fixed with remove-chardet.

Thank you!

@imankulov
Owner

Yay 🎉 Thanks for bringing good news. The PR is merged and a new version, v2.2.0, has been released.
