Improvements for read_csv from AWS S3 #11070
these all sound great! thanks
This was referenced Sep 12, 2015
jreback added the Data IO label Sep 12, 2015
jreback modified the milestone: Next Major Release, 0.17.0 (Sep 12, 2015)
thanks!
jreback closed this Sep 15, 2015
stephen-hoover commented Sep 15, 2015
Thank you for your prompt and thorough code review!
no thank you!
stephen-hoover commented Sep 12, 2015
I frequently find myself interacting with CSV files stored in Amazon's S3 service, and have run into a few areas where I think small improvements in `read_csv` could be a big help.

**Stream files from S3 instead of downloading them.** This is the most important improvement for me. The current pandas code downloads the entire file from S3 before passing it into the parser. If I have a 6 GB file in S3, it's much better to not need to download the entire thing just to check the first few rows with the `nrows` keyword to `read_csv`. Or perhaps I want to process the file one chunk at a time using `chunksize`. We can iterate through a file on disk in these ways, but not currently with a file in S3.
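To make the desired behavior concrete, here is a sketch of the usage this would enable once streaming support lands (the bucket and file names are placeholders, and `process` is a stand-in for whatever per-chunk work is needed):

```python
import pandas as pd

def process(chunk):
    # Stand-in for whatever per-chunk work is needed.
    print(len(chunk))

# Inspect only the first 100 rows of a large S3 object
# without pulling down all 6 GB first.
head = pd.read_csv('s3://my-bucket/big_file.csv', nrows=100)

# Or iterate through the file one chunk at a time.
for chunk in pd.read_csv('s3://my-bucket/big_file.csv', chunksize=100000):
    process(chunk)
```

Both `nrows` and `chunksize` already work this way for files on disk; the change is only in how the S3 object is fed to the parser.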
**Infer compression from the filename.** If an S3 filename ends with ".gz" or ".bz2", the parser should be able to infer the compression type, just as with a file on disk. Currently, the C parser refuses to open bz2-compressed file objects entirely, and the Python parser decompresses the entire file before continuing, which runs into the same problem of needing to read a potentially large file before doing any work.
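A minimal sketch of the extension check, using only the standard library (`infer_compression` here is a hypothetical helper for illustration, not an existing pandas function):

```python
import os

def infer_compression(path):
    # Map a filename extension to the compression argument that
    # read_csv already accepts for local files.
    ext = os.path.splitext(path)[1].lower()
    return {'.gz': 'gzip', '.bz2': 'bz2'}.get(ext)  # None means uncompressed

assert infer_compression('s3://bucket/data.csv.gz') == 'gzip'
assert infer_compression('s3://bucket/data.csv.bz2') == 'bz2'
assert infer_compression('s3://bucket/data.csv') is None
```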
**Recognize "s3n://" URLs.** It seems that S3 files can be accessed via "s3://" or "s3n://" ("S3 native"), and it would be useful for pandas to recognize both. I've only run into this when using Spark, and I admit I don't fully understand the difference. Some notes I found: https://wiki.apache.org/hadoop/AmazonS3 and http://notes.mindprince.in/2014/08/01/difference-between-s3-block-and-s3-native-filesystem-on-hadoop.html
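A sketch of the scheme check, assuming both schemes name the same bucket and key (the helper names are illustrative, not pandas API):

```python
def is_s3_url(url):
    # Accept both the "s3" and "s3n" (S3 native) schemes.
    return url.startswith(('s3://', 's3n://'))

def strip_s3_scheme(url):
    # Both schemes address the same bucket/key, so downstream
    # parsing can be shared once the scheme is stripped.
    return url.split('://', 1)[1]

assert is_s3_url('s3n://my-bucket/data.csv')
assert strip_s3_scheme('s3n://my-bucket/data.csv') == 'my-bucket/data.csv'
```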
I will open PRs to address each of these.