ENH Enable streaming from S3 #11073

Merged
merged 1 commit into from Sep 14, 2015

File reading from AWS S3: Modify the `get_filepath_or_buffer` function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the `nrows` argument) or chunked reading (e.g. through the `chunksize` argument) without needing to download the entire file first.
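To illustrate the intended usage, here is a minimal sketch of a partial read with `nrows`. It uses an in-memory buffer so it runs without network access; with this PR, the same call against an `s3://` URL downloads only as much of the remote file as the parser needs.

```python
import io
import pandas as pd

csv_text = "total_bill,tip\n16.99,1.01\n10.34,1.66\n21.01,3.50\n"

# `nrows` stops parsing after the requested number of rows; with the
# streaming S3 reader, only that much of the remote file is fetched.
df = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(df.shape)  # (2, 2)
```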

I wasn't sure what the best place was to put the OnceThroughKey. (Suggestions for better names welcome.) I don't like putting an entire class inside a function like that, but this keeps the boto dependency contained.

The `readline` function, and modifying `next` so that it returns lines, were necessary to allow the Python engine to read uncompressed CSVs.
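The line-splitting logic can be sketched standalone. The `LineBuffer` class below is a hypothetical, simplified stand-in (the real change lives on the boto `Key` subclass): it accumulates byte chunks from an underlying iterator and yields one `\n`-terminated line at a time, which is what the Python parsing engine expects.

```python
class LineBuffer:
    """Sketch: wrap an iterator of byte chunks and expose readline()."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buf = b""
        self._done = False

    def readline(self):
        # Pull chunks until the buffer holds a newline or the source ends.
        while b"\n" not in self._buf and not self._done:
            try:
                self._buf += next(self._chunks)
            except StopIteration:
                self._done = True
        line, sep, rest = self._buf.partition(b"\n")
        self._buf = rest
        return line + sep

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    def __iter__(self):
        return self


lb = LineBuffer([b"a,b\n1,", b"2\n3,4"])
print(list(lb))  # [b'a,b\n', b'1,2\n', b'3,4']
```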

The Python 2 standard library's gzip module needs `seek` and `tell` methods on its inputs, so I reverted to the old behavior there.
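That fallback can be sketched with the stdlib (shown here on Python 3 for runnability; the constraint being worked around is Python 2's gzip module): download the whole object into an in-memory `BytesIO`, which supports `seek()`/`tell()`, and hand that to gzip.

```python
import gzip
import io

# gzip.compress stands in for the bytes downloaded from S3.
payload = gzip.compress(b"col\n1\n2\n")

# An in-memory buffer is seekable, satisfying gzip's requirements.
buf = io.BytesIO(payload)
assert buf.seekable()

with gzip.GzipFile(fileobj=buf) as f:
    data = f.read()
print(data)  # b'col\n1\n2\n'
```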

Partially addresses #11070 .

@jreback jreback commented on an outdated diff Sep 12, 2015

pandas/io/tests/test_parsers.py
@@ -4246,6 +4246,30 @@ def test_parse_public_s3_bucket(self):
tm.assert_frame_equal(pd.read_csv(tm.get_data_path('tips.csv')), df)
@tm.network
+ def test_parse_public_s3_bucket_nrows(self):
+ import nose.tools as nt
+ df = pd.read_csv('s3://nyqpug/tips.csv', nrows=10)
+ self.assertTrue(isinstance(df, pd.DataFrame))
+ self.assertFalse(df.empty)
@jreback

jreback Sep 12, 2015

Contributor

don't need the import of nose.tools

@jreback jreback commented on an outdated diff Sep 12, 2015

pandas/io/common.py
@@ -165,11 +165,80 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
except boto.exception.NoAuthHandlerFound:
conn = boto.connect_s3(anon=True)
@jreback

jreback Sep 12, 2015

Contributor

pull this class outside the function
it's private anyhow as long as you don't export it through the API

@jreback

jreback Sep 12, 2015

Contributor

just conditionally define it in a try/except (so the `import boto` is caught)

@jreback

jreback Sep 12, 2015

Contributor

call this something like

BotoFileLikeReader or something like that

@jreback jreback and 1 other commented on an outdated diff Sep 12, 2015

pandas/io/common.py
b = conn.get_bucket(parsed_url.netloc, validate=False)
- k = boto.s3.key.Key(b)
- k.key = parsed_url.path
- filepath_or_buffer = BytesIO(k.get_contents_as_string(
- encoding=encoding))
+ if compat.PY2 and compression == "gzip":
+ k = boto.s3.key.Key(b, parsed_url.path)
+ filepath_or_buffer = BytesIO(k.get_contents_as_string(
+ encoding=encoding))
+ else:
+ k = OnceThroughKey(b, parsed_url.path, encoding=encoding)
@jreback

jreback Sep 12, 2015

Contributor

I think `f.open().read()` is more natural.
You may have to redefine some super methods to make it work.

@stephen-hoover

stephen-hoover Sep 12, 2015

The function name means something like "open for reading" and comes from the `Key` class. An `f.open().read()` would actually read some of the file content, which I don't want to do here. The main purpose of putting the `open_read` call here is to make sure that the file exists and is readable; otherwise that wouldn't get caught until later. I can add a comment about the purpose.

@jreback

jreback Sep 12, 2015

Contributor

then just use .open()

this function is very odd
why does it need to be caught at all
if it's not readable it will fail when you actually read

@stephen-hoover

stephen-hoover Sep 12, 2015

I changed it to `open('r')`, which should be equivalent. The main purpose is to have nice errors when using the C parser. With the C parser, trying to read a non-existent or non-permissioned file gives you the error `CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.` Trying to open right away turns that into the more informative `S3ResponseError` with either a 404 or 403 code.
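The fail-fast idea can be illustrated generically with a local file (the S3 version calls `open('r')` on the boto Key instead): opening eagerly raises an informative error at the call site, rather than a cryptic parser error later.

```python
# Opening a missing file up front surfaces a clear, specific exception
# immediately, instead of a generic "read failed" error deep in the parser.
try:
    open("/no/such/file.csv").close()
    error_name = None
except OSError as exc:
    error_name = type(exc).__name__

print(error_name)  # FileNotFoundError
```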

@jreback

jreback Sep 12, 2015

Contributor

ahh ok that's a nicer error then
pls test for this as well

@stephen-hoover

stephen-hoover Sep 12, 2015

Already in the tests. They broke before I included the open function here. That was what clued me in.

New commit addresses your comments. I'll squash once the review is complete.

@jreback jreback and 1 other commented on an outdated diff Sep 12, 2015

pandas/io/common.py
+ class BotoFileLikeReader(key.Key):
+ """boto Key modified to be more file-like
+
+ This modification of the boto Key will read through a supplied
+ S3 key once, then stop. The unmodified boto Key object will repeatedly
+ cycle through a file in S3: after reaching the end of the file,
+ boto will close the file. Then the next call to `read` or `next` will
+ re-open the file and start reading from the beginning.
+
+ Also adds a `readline` function which will split the returned
+ values by the `\n` character.
+ """
+ def __init__(self, *args, **kwargs):
+ encoding = kwargs.pop("encoding", None) # Python 2 compat
+ super(OnceThroughKey, self).__init__(*args, **kwargs)
+ self.finished_read = False # Add a flag to mark the end of the read.
@jreback

jreback Sep 12, 2015

Contributor

wrong class name in `__init__`

Contributor

jreback commented Sep 12, 2015

need to change class name references

add a test using chunk size

Will do. BTW -- reading with nrows or chunksize worked before, too. The difference now is that we can do it without having to download the entire file. I don't know how to write a unit test for that -- I tested it myself by looking at whether it's 100 ms or 20 s to read part of a large file. In this case, I think it's okay just to check for functionality. The way the code is written, the only way we'd lose partial reads without breaking tests is for downstream code to suck in the entire file before feeding it to the parser. (This happens right now for bz2 files without the changes in #11072 .)
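For reference, chunked reading looks like the sketch below. It runs on an in-memory buffer; with the streaming S3 reader, each chunk pulls only as much remote data as it needs instead of downloading the whole file up front.

```python
import io
import pandas as pd

# 10 data rows of toy data; stands in for a large file on S3.
csv_text = "a,b\n" + "\n".join("%d,%d" % (i, i * 2) for i in range(10)) + "\n"

# With `chunksize`, read_csv returns an iterator of DataFrames.
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=4))
print([c.shape for c in chunks])  # [(4, 2), (4, 2), (2, 2)]
```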

Contributor

jreback commented Sep 12, 2015

in that case why don't you make an asv benchmark and read, say, 100 lines or something?
then we would catch a perf regression if this was changed to download the entire thing (don't use a really long file, but it should take maybe 1s)

I think it would need about a million rows, maybe more, to reliably separate a performance regression from variance in network performance. I'd also need a public S3 bucket which could host such a thing, which I don't have. Do you know what the "s3://nyqpug" bucket is? For the unit tests, it would be helpful to be able to host gzip and bzip2 files there as well.

Contributor

jreback commented Sep 12, 2015

I have a pandas bucket where I can put stuff
push the files up to your branch and give me a pointer, and I will put them up

Thanks! I'll ping you when that's done.

Test using chunksize added.

@jreback , I created compressed versions of the "tips.csv" table for unit tests and large tables of random data for performance regression tests. All are in the commit at stephen-hoover/pandas@36b5d3a .

Contributor

jreback commented Sep 12, 2015

all uploaded here (its a public bucket) https://s3.amazonaws.com/pandas-test/

though let's make the big ones only in the perf tests and/or slow

Thanks! Could you put the file from "s3://nyqpug/tips.csv" in that bucket as well? The tests will be simpler if everything's in the same place.

Contributor

jreback commented Sep 12, 2015

done

Are the files set publicly readable? I can see the contents of the bucket, but when I try to read a file, I get a "Forbidden" error. That happens with the aws CLI as well as with read_csv.

Contributor

jreback commented Sep 12, 2015

should be fixed now. annoying that I had to do that for each file individually :<

Thanks. The S3 tests are more extensive now. I'll start working on performance tests, but I won't be able to have those until tomorrow afternoon.

I'll want to either rebase this PR on a merged #11072 or vice-versa. After that PR merges, the (Python 3) C parser will be able to read bz2 files from S3, and I can change the new tests to reflect that.

stephen-hoover referenced this pull request Sep 12, 2015

Closed

Improvements for read_csv from AWS S3 #11070

4 of 4 tasks complete

With #11072 merged in, I've updated the tests to reflect the fact that the Python 3 C parser can now read bz2 files from S3. I've also added a set of benchmarks for reading from S3 in the asv_bench/benchmarks/io_bench.py file. The benchmarks read 10 rows of the 100,000 row x 50 column table of random noise in the pandas-test bucket. On the current (9ef8534) master branch, benchmarks are

Python 2:
               ============= ======== ========
               --                  engine     
               ------------- -----------------
                compression   python     c    
               ============= ======== ========
                    None      27.90s   27.46s 
                    gzip      13.20s   13.13s 
                    bz2       20.43s    n/a   
               ============= ======== ========

Python 3:
               ============= ======== ========
               --                  engine     
               ------------- -----------------
                compression   python     c    
               ============= ======== ========
                    None      27.16s   27.34s 
                    gzip      14.42s   13.82s 
                    bz2       11.69s   11.77s 
               ============= ======== ========

and on this PR, the benchmarks are

Python 2:
               ============= ========== ==========
               --                    engine       
               ------------- ---------------------
                compression    python       c     
               ============= ========== ==========
                    None      208.45ms   486.83ms 
                    gzip       12.79s     13.10s  
                    bz2        20.39s      n/a    
               ============= ========== ==========

Python 3:
               ============= ========== ==========
               --                    engine       
               ------------- ---------------------
                compression    python       c     
               ============= ========== ==========
                    None      207.24ms   484.90ms 
                    gzip      249.83ms   363.71ms 
                    bz2       608.92ms   592.51ms 
               ============= ========== ==========

Note that in the > 1 s benchmarks, nearly all of the time is spent downloading the entire file from S3. Times will vary with network speed. It should be obvious if a future change forces read_csv to ingest the entire file from S3 again.
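An asv benchmark of this kind might be sketched as below. The class name, method names, and file name are illustrative, not the ones committed in `asv_bench/benchmarks/io_bench.py`; asv times any method whose name starts with `time_`.

```python
import pandas as pd


class S3ReadCSV(object):
    """Hypothetical asv benchmark class for partial reads from S3.

    Each ``time_*`` method is timed by asv; reading only 10 rows should
    stay fast unless a regression re-downloads the whole file.
    """

    def time_read_nrows_c_engine(self):
        pd.read_csv('s3://pandas-test/large_random.csv', nrows=10,
                    engine='c')

    def time_read_nrows_python_engine(self):
        pd.read_csv('s3://pandas-test/large_random.csv', nrows=10,
                    engine='python')
```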

ASV is a really keen tool. I'm happy I found it.

Contributor

jreback commented Sep 13, 2015

ok, this seems reasonable. pls add a note in whatsnew. ping when all green.

jreback added this to the 0.17.0 milestone Sep 13, 2015

Contributor

jreback commented Sep 14, 2015

can you rebase?

@stephen-hoover stephen-hoover ENH Enable streaming from S3
File reading from AWS S3: Modify the `get_filepath_or_buffer` function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the `nrows` argument) or chunked reading (e.g. through the `chunksize` argument) without needing to download the entire file first.

Include 6 asv benchmarks for reading CSVs from S3: one for each combination of compression type and parser type.
67879f5

@jreback jreback added a commit that referenced this pull request Sep 14, 2015

@jreback jreback Merge pull request #11073 from stephen-hoover/stream-csv-from-s3
ENH Enable streaming from S3
bf0a15d

@jreback jreback merged commit bf0a15d into pandas-dev:master Sep 14, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Contributor

jreback commented Sep 14, 2015

thank you sir!

stephen-hoover deleted the stephen-hoover:stream-csv-from-s3 branch Sep 14, 2015
