Use urllib.parse.quote to escape special characters #3531

Closed
winstonww opened this issue Sep 30, 2021 · 6 comments

Comments

@winstonww
Contributor

winstonww commented Sep 30, 2021

Currently, converting a Document whose uri contains spaces to buffer/blob fails. It would be great if escaping of such characters (e.g. with urllib.parse.quote) could be added to jina.

@JoanFM JoanFM added focus/ease-of-use This issue/PR affects the usability of the core priority/backlog Agreement that this is a nice-to-have, but no one's can work on it now. Community support welcome labels Sep 30, 2021
@tacoelho

tacoelho commented Oct 8, 2021

Can I work on this?

@JoanFM
Member

JoanFM commented Oct 8, 2021

Hello @tacoelho,

It would be very helpful if you could work on this issue. Ask @winstonww for guidance if needed. I will assign this issue to both of you.

Thanks!

@JoanFM
Member

JoanFM commented Oct 8, 2021

Feel free to open a PR and link it to this issue. @winstonww, could you provide an example of the current behavior and the desired one?

@winstonww
Contributor Author

winstonww commented Oct 12, 2021

Hi @tacoelho, thank you for taking up this issue! An example of the current behavior is as follows:

doc = Document(uri='https://storage.cloud.google.com/showcase-3d-models/ShapeNetV2/ashcan_trash can_garbage can_wastebin_ash bin_ash-bin_ashbin_dustbin_trash barrel_trash bin_7.glb')
doc.convert_uri_to_buffer()

This will raise the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/winstonwong/jina/jina/jina/types/document/__init__.py", line 1077, in convert_uri_to_buffer
    with urllib.request.urlopen(req) as fp:
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1389, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 1279, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 1290, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 1124, in putrequest
    self._validate_path(url)
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 1224, in _validate_path
    raise InvalidURL(f"URL can't contain control characters. {url!r} "
http.client.InvalidURL: URL can't contain control characters. '/showcase-3d-models/ShapeNetV2/ashcan_trash can_garbage can_wastebin_ash bin_ash-bin_ashbin_dustbin_trash barrel_trash bin_7.glb' (found at least ' ')
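
The check that raises this error can be exercised without any network access. The following is a minimal sketch, assuming a recent CPython (3.8 or later) where `http.client` validates the request path in `putrequest`; the helper name `path_is_rejected` and the host `example.com` are hypothetical and only serve the illustration. It uses a plain `HTTPConnection`, on the assumption that the HTTPS connection in the traceback inherits the same check.

```python
import http.client


def path_is_rejected(path: str) -> bool:
    """Return True if http.client refuses to send a request for this path.

    Hypothetical helper for illustration. Creating the connection object
    does not open a socket, and putrequest() only buffers the request line,
    so nothing is sent over the network.
    """
    conn = http.client.HTTPConnection("example.com")
    try:
        conn.putrequest("GET", path)
    except http.client.InvalidURL:
        return True
    return False


# A raw space in the path trips the same validation as in the traceback.
print(path_is_rejected("/models/trash can_7.glb"))
# The percent-encoded form passes validation.
print(path_is_rejected("/models/trash%20can_7.glb"))
```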

The desired behavior is that spaces in the uri are replaced by %20, so that doc.convert_uri_to_buffer() succeeds and doc.buffer is not null. You may also consider alternatives (e.g. requote_uri from requests.utils). Hope this helps.
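
One possible way to get that behavior with the standard library is sketched below. This is only an illustration of the idea, not the change that was made in jina: the helper name `escape_uri` and the example URL are hypothetical, and the sketch escapes only the path component, leaving the query and fragment untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_uri(uri: str) -> str:
    """Percent-encode unsafe characters in the path of a URI.

    Hypothetical helper for illustration. "/" is kept so path segments
    survive, and "%" is kept so an already-escaped URI is not escaped twice.
    """
    parts = urlsplit(uri)
    escaped_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=escaped_path))


# Hypothetical URL with a space in the path.
print(escape_uri("https://example.com/models/trash can_7.glb"))
```

Keeping "%" in the safe set is a trade-off: it makes the helper safe to call repeatedly, but a literal percent sign in a file name would be left unescaped.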

@winstonww
Contributor Author

Hi @tacoelho, was wondering if you're still working on this issue by any chance? Feel free to let us know if you don't have the capacity. We can take over.

@JoanFM
Member

JoanFM commented Jan 27, 2022

Move this to docarray package if needed

@JoanFM JoanFM closed this as completed Jan 27, 2022