Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to open an S3 object with # in the URL #25945

Closed
swt2c opened this issue Apr 1, 2019 · 2 comments

Comments

@swt2c
Copy link
Contributor

commented Apr 1, 2019

import pandas as pd
df = pd.read_csv('s3://bucket/key#1.csv')
df = pd.read_csv('s3://bucket/key%231.csv')

Problem description

Pandas can't open an object from S3 if it has a # sign in the URL, both in the case where the URL path is percent encoded and not. The reason is that urllib.parse.urlparse(), which is used in io/s3.py to parse the URL, treats the # sign as the beginning of the URL fragment, and thus it is lost (in the case of not percent encoded).

I see two possible solutions to the problem, but I'm not sure which one is best, since there does not seem to be a 'specification' for the S3 URL scheme (at least that I can find):

  1. Use allow_fragments=False when calling urllib.parse.urlparse(). This would allow the non-percent-encoded case to work, but seems slightly wrong.
  2. Call urllib.parse.unquote() on S3 paths before passing to s3fs. s3fs seems to want just a bucket/key as input, so pandas would have to remove the percent encoding. This would allow the percent-encoded case to work. It seems a bit more correct, but it might change some existing behavior where users could be loading URLs with % characters in them.

@WillAyd WillAyd added the Data IO label Apr 1, 2019

@WillAyd

This comment has been minimized.

Copy link
Member

commented Apr 1, 2019

Not certain myself - @TomAugspurger may have thoughts but otherwise if you have a proposal in mind feel free to submit a PR

@yackoa

This comment has been minimized.

Copy link

commented Apr 3, 2019

Based on the analysis by @swt2c , I think the option#1 is the way to go (feel free to correct me if I am wrong).
I can start working on this issue & submit a PR with the changed code soon if that's ok.

@gfyoung gfyoung added the Bug label Apr 4, 2019

@jreback jreback added this to the 0.25.0 milestone Apr 9, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.