ENH Enable streaming from S3 #11073
Conversation
jreback
commented on an outdated diff
Sep 12, 2015
@@ -4246,6 +4246,30 @@ def test_parse_public_s3_bucket(self):
        tm.assert_frame_equal(pd.read_csv(tm.get_data_path('tips.csv')), df)

    @tm.network
+   def test_parse_public_s3_bucket_nrows(self):
+       import nose.tools as nt
+       df = pd.read_csv('s3://nyqpug/tips.csv', nrows=10)
+       self.assertTrue(isinstance(df, pd.DataFrame))
+       self.assertFalse(df.empty)
jreback
commented on an outdated diff
Sep 12, 2015
@@ -165,11 +165,80 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
        except boto.exception.NoAuthHandlerFound:
            conn = boto.connect_s3(anon=True)
jreback
and 1 other
commented on an outdated diff
Sep 12, 2015
        b = conn.get_bucket(parsed_url.netloc, validate=False)
-       k = boto.s3.key.Key(b)
-       k.key = parsed_url.path
-       filepath_or_buffer = BytesIO(k.get_contents_as_string(
-           encoding=encoding))
+       if compat.PY2 and compression == "gzip":
+           k = boto.s3.key.Key(b, parsed_url.path)
+           filepath_or_buffer = BytesIO(k.get_contents_as_string(
+               encoding=encoding))
+       else:
+           k = OnceThroughKey(b, parsed_url.path, encoding=encoding)
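For context on the Python 2 gzip branch in the diff above: on Python 2, `gzip.GzipFile` called `seek()`/`tell()` on its underlying file object, so a non-seekable network stream could not be handed to it directly, while a fully buffered `BytesIO` works fine. A minimal, self-contained sketch of that constraint (stand-in CSV bytes, not the PR's code):

```python
import gzip
import io

# Stand-in CSV payload, gzipped into an in-memory BytesIO.
raw = b"total_bill,tip\n16.99,1.01\n"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(raw)
buf.seek(0)

# BytesIO is seekable, so GzipFile can read from it; on Python 2 a
# streaming S3 key (no seek/tell) could not be used here, hence the
# buffer-everything fallback in the PY2 gzip branch above.
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    recovered = gz.read()
```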
stephen-hoover
commented
Sep 12, 2015
New commit addresses your comments. I'll squash once the review is complete.
jreback
and 1 other
commented on an outdated diff
Sep 12, 2015
+       class BotoFileLikeReader(key.Key):
+           """boto Key modified to be more file-like
+
+           This modification of the boto Key will read through a supplied
+           S3 key once, then stop. The unmodified boto Key object will repeatedly
+           cycle through a file in S3: after reaching the end of the file,
+           boto will close the file. Then the next call to `read` or `next` will
+           re-open the file and start reading from the beginning.
+
+           Also adds a `readline` function which will split the returned
+           values by the `\n` character.
+           """
+           def __init__(self, *args, **kwargs):
+               encoding = kwargs.pop("encoding", None)  # Python 2 compat
+               super(OnceThroughKey, self).__init__(*args, **kwargs)
+               self.finished_read = False  # Add a flag to mark the end of the read.
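The docstring above describes the core idea: wrap a raw reader so it is consumed exactly once and gains a `readline`. A self-contained sketch of that pattern (hypothetical names; the PR's actual class subclasses boto's `Key` rather than wrapping a reader):

```python
import io


class OnceThroughReader(object):
    """Read-once, file-like wrapper: buffers leftover bytes and adds
    a readline() that splits on b'\n'. Illustrative only."""

    def __init__(self, raw):
        self.raw = raw              # any object exposing .read(size)
        self.buffer = b""           # bytes fetched but not yet handed out
        self.finished_read = False  # True once the source is exhausted

    def _fill(self, size):
        # Pull bytes from the source until we have `size` or it ends.
        while not self.finished_read and len(self.buffer) < size:
            chunk = self.raw.read(4096)
            if not chunk:
                self.finished_read = True
            self.buffer += chunk

    def read(self, size=-1):
        if size < 0:
            self._fill(float("inf"))  # drain everything remaining
            out, self.buffer = self.buffer, b""
            return out
        self._fill(size)
        out, self.buffer = self.buffer[:size], self.buffer[size:]
        return out

    def readline(self):
        # Fetch until a newline appears or the source is exhausted.
        while b"\n" not in self.buffer and not self.finished_read:
            self._fill(len(self.buffer) + 4096)
        idx = self.buffer.find(b"\n")
        if idx == -1:  # no newline left: return the tail
            out, self.buffer = self.buffer, b""
            return out
        out = self.buffer[:idx + 1]
        self.buffer = self.buffer[idx + 1:]
        return out


reader = OnceThroughReader(io.BytesIO(b"a,b\n1,2\n3,4\n"))
first = reader.readline()   # header line, newline included
rest = reader.read()        # everything after the header, exactly once
```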
Need to change the class name references; also add a test using chunksize.
stephen-hoover
commented
Sep 12, 2015
Will do. BTW -- reading with

In that case, why don't you make an asv benchmark and read, say, 100 lines or something?
stephen-hoover
commented
Sep 12, 2015
I think it would need about 1 MM rows, maybe more, to be able to reliably separate a performance regression from variance in network performance. I'd also need a public S3 bucket which could host such a thing, which I don't have. Do you know what the "s3://nyqpug" bucket is? For the unit tests, it would be helpful to be able to host gzip and bzip2 files there as well.

I have a pandas bucket where I can put stuff.
stephen-hoover
commented
Sep 12, 2015
Thanks! I'll ping you when that's done.
stephen-hoover
commented
Sep 12, 2015
Test using chunksize added.
stephen-hoover
commented
Sep 12, 2015
@jreback , I created compressed versions of the "tips.csv" table for unit tests and large tables of random data for performance regression tests. All are in the commit at stephen-hoover/pandas@36b5d3a .

All uploaded here (it's a public bucket), though let's make the big ones only in the perf tests and/or slow.
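Producing gzip and bzip2 copies of a CSV for tests like these is a few lines of stdlib code. A sketch (stand-in data, not the actual tips.csv contents or the PR's tooling):

```python
import bz2
import gzip
import os
import tempfile

# Stand-in CSV bytes; the real test files hold the tips.csv table.
csv_bytes = b"total_bill,tip,sex\n16.99,1.01,Female\n10.34,1.66,Male\n"

tmpdir = tempfile.mkdtemp()

# Write a gzip copy and a bzip2 copy next to each other.
with gzip.open(os.path.join(tmpdir, "tips.csv.gz"), "wb") as f:
    f.write(csv_bytes)
with bz2.BZ2File(os.path.join(tmpdir, "tips.csv.bz2"), "wb") as f:
    f.write(csv_bytes)

# Round-trip check: both copies decompress to the original bytes.
with gzip.open(os.path.join(tmpdir, "tips.csv.gz"), "rb") as f:
    gz_roundtrip = f.read()
with bz2.BZ2File(os.path.join(tmpdir, "tips.csv.bz2"), "rb") as f:
    bz2_roundtrip = f.read()
```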
stephen-hoover
commented
Sep 12, 2015
Thanks! Could you put the file from "s3://nyqpug/tips.csv" in that bucket as well? The tests will be simpler if everything's in the same place.
Done.
jreback added the Data IO and CSV labels Sep 12, 2015
stephen-hoover
commented
Sep 12, 2015
Are the files set publicly readable? I can see the contents of the bucket, but when I try to read a file, I get a "Forbidden" error. That happens with the aws CLI as well as with

Should be fixed now. Annoying that I had to do that individually for each file :<
stephen-hoover
commented
Sep 12, 2015
Thanks. The S3 tests are more extensive now. I'll start working on performance tests, but I won't be able to have those until tomorrow afternoon. I'll want to either rebase this PR on a merged #11072 or vice-versa. After that PR merges, the (Python 3) C parser will be able to read bz2 files from S3, and I can change the new tests to reflect that.
stephen-hoover
referenced
this pull request
Sep 12, 2015
Closed
Improvements for read_csv from AWS S3 #11070
stephen-hoover
commented
Sep 13, 2015
With #11072 merged in, I've updated the tests to reflect the fact that the Python 3 C parser can now read bz2 files from S3. I've also added a set of benchmarks for reading from S3 in the

and on this PR, the benchmarks are

Note that in the > 1 s benchmarks, all of the time is taken in downloading the entire file from S3. Times will vary with network speed. It should be obvious if a future change forces

ASV is a really keen tool. I'm happy I found it.
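For readers unfamiliar with asv: it times every `time_*` method of a benchmark class. A hypothetical sketch of what an S3-read benchmark could look like (the class name, bucket, and file below are illustrative, not the PR's actual benchmarks):

```python
class S3ReadCSV(object):
    """Illustrative asv-style benchmark class: asv would run and time
    each time_* method.  Bucket/key names are made up."""

    def time_read_csv_nrows(self):
        # With streaming reads, only enough of the S3 object to yield
        # 100 rows has to be downloaded.
        import pandas as pd  # imported lazily; this call needs network access
        pd.read_csv('s3://pandas-test/large_random.csv', nrows=100)

    def time_read_csv_full(self):
        # Baseline: download and parse the entire file.
        import pandas as pd
        pd.read_csv('s3://pandas-test/large_random.csv')
```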
OK, this seems reasonable. Please add a note in whatsnew. Ping me when all green.
jreback
added this to the
0.17.0
milestone
Sep 13, 2015
stephen-hoover
commented
Sep 14, 2015
@jreback , green!
Can you rebase?
jreback
added a commit
that referenced
this pull request
Sep 14, 2015
bf0a15d
jreback
merged commit bf0a15d
into pandas-dev:master
Sep 14, 2015
1 check passed
Thank you, sir!
stephen-hoover commented Sep 12, 2015

File reading from AWS S3: Modify the `get_filepath_or_buffer` function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the `nrows` argument) or chunked reading (e.g. through the `chunksize` argument) without needing to download the entire file first.

I wasn't sure what the best place was to put the `OnceThroughKey`. (Suggestions for better names welcome.) I don't like putting an entire class inside a function like that, but this keeps the `boto` dependency contained.

The `readline` function, and modifying `next` such that it returns lines, was necessary to allow the Python engine to read uncompressed CSVs.

The Python 2 standard library's `gzip` module needs `seek` and `tell` functions on its inputs, so I reverted to the old behavior there.

Partially addresses #11070 .
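The `nrows`/`chunksize` mechanics the description refers to can be demonstrated on an in-memory buffer (stand-in data below); with this PR, an `s3://bucket/key` path behaves the same way, except that only the bytes actually needed are pulled over the network:

```python
import io
import pandas as pd

# In-memory stand-in for the S3 object (three data rows).
csv = io.StringIO("total_bill,tip\n16.99,1.01\n10.34,1.66\n23.68,3.31\n")

# Partial read: stop after the first 2 rows.
df_head = pd.read_csv(csv, nrows=2)

# Chunked read: iterate in pieces of 2 rows (so 2 rows, then 1).
csv.seek(0)
chunks = list(pd.read_csv(csv, chunksize=2))
```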