Skip to content

Conversation

swt2c
Copy link
Contributor

@swt2c swt2c commented Apr 4, 2019

This fixes loading files with URLs such as s3://bucket/key#1.csv. The part
from the # on was being lost because it was considered to be a URL fragment.
The fix disables URL fragment parsing as it doesn't make sense for S3 URLs.

@gfyoung gfyoung added IO Data IO issues that don't fit into a more specific label Bug labels Apr 4, 2019
@swt2c
Copy link
Contributor Author

swt2c commented Apr 4, 2019

I believe the CI failures are due to my added test, which is causing s3fs to be imported due to importing the _strip_schema function from pandas.io.s3. This was really the only way I could see to add a test for this fix, other than try to load something from an actual S3 bucket. Is it okay to add s3fs as a test dependency, and if so, how do I do that?

- Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
- Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
- Improved :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` to read incorrectly formatted 118 format files saved by Stata (:issue:`25960`)
- Fixed bug in loading objects from S3 that contain # characters in the URL (:issue:`25945`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double-back ticks on the #. you can say this is in read_* routines such as :func:read_csv

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

read_csv(body)


def test_parse_s3_url_with_pound_sign():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a useful test, you need to mock the routines as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I have to disagree with you about this not being a useful test. It tests the buggy function, fails without the fix and passes with the fix. Can you provide any more hints about how I would mock this as you suggested? I can't see how to do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at any tests in pandas/tests/io//parser/test_network.py, you need to use the s3_resource fixture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I believe I figured it out now. :) moto seems very nice, I hadn't heard of it before.

@codecov
Copy link

codecov bot commented Apr 8, 2019

Codecov Report

Merging #25992 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25992      +/-   ##
==========================================
- Coverage   91.82%   91.82%   -0.01%     
==========================================
  Files         175      175              
  Lines       52539    52539              
==========================================
- Hits        48246    48242       -4     
- Misses       4293     4297       +4
Flag Coverage Δ
#multiple 90.38% <100%> (ø) ⬆️
#single 40.73% <0%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/io/s3.py 89.47% <100%> (ø) ⬆️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6de8133...05bf507. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 8, 2019

Codecov Report

Merging #25992 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25992      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         175      175              
  Lines       52517    52517              
==========================================
- Hits        48232    48228       -4     
- Misses       4285     4289       +4
Flag Coverage Δ
#multiple 90.39% <100%> (ø) ⬆️
#single 40.72% <0%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/io/s3.py 89.47% <100%> (ø) ⬆️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af0ecbe...0007503. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. small comments, ping on green.

assert ((0, 5505024) in {x.args[-2:] for x in caplog.records})

def test_read_s3_with_hash_in_key(self, tips_df):
df = read_csv('s3://pandas-test/tips#1.csv')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you call this result =


def test_read_s3_with_hash_in_key(self, tips_df):
df = read_csv('s3://pandas-test/tips#1.csv')
assert isinstance(df, DataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need either of these asserts, the assert_frame_equal covers this

assert ((0, 5505024) in {x.args[-2:] for x in caplog.records})

def test_read_s3_with_hash_in_key(self, tips_df):
df = read_csv('s3://pandas-test/tips#1.csv')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment here

@jreback jreback added this to the 0.25.0 milestone Apr 9, 2019
This fixes loading files with URLs such as s3://bucket/key#1.csv.  The part
from the # on was being lost because it was considered to be a URL fragment.
The fix disables URL fragment parsing as it doesn't make sense for S3 URLs.
@swt2c
Copy link
Contributor Author

swt2c commented Apr 9, 2019

Comments addressed.

@jreback
Copy link
Contributor

jreback commented Apr 9, 2019

great. ping on green.

@WillAyd WillAyd merged commit 2f6b90a into pandas-dev:master Apr 9, 2019
@WillAyd
Copy link
Member

WillAyd commented Apr 9, 2019

Thanks @swt2c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug IO Data IO issues that don't fit into a more specific label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to open an S3 object with # in the URL

4 participants