Improvements for read_csv from AWS S3 #11070

Closed
stephen-hoover opened this Issue Sep 12, 2015 · 4 comments

I frequently find myself interacting with CSV files stored in Amazon's S3 service, and have run into a few areas where I think small improvements in read_csv could be a big help.

  • #11073 Allow streaming reads

    This is the most important improvement for me. The current pandas code downloads the entire file from S3 before passing it to the parser. If I have a 6 GB file in S3, I shouldn't need to download the whole thing just to check the first few rows with the "nrows" keyword to read_csv, or to process the file one chunk at a time using "chunksize". We can iterate through a file on disk in these ways, but not currently through a file in S3.
  • #11074 Infer compression type from S3 filenames

    If an S3 filename ends with ".gz" or ".bz2", the parser should be able to infer the compression type, just as with a file on disk.
  • #11072 Streaming bz2 reads, C parser bz2 reads

    Currently, the C parser refuses to read open bz2-compressed file objects at all, and the Python parser decompresses the entire file before continuing, so it hits the same problem: a potentially large file must be read in full before any work can start.
  • #11071 Recognize "s3n" EDIT: and "s3a"

    I've only run into this when using Spark, and I admit I don't fully understand the difference. It seems that S3 files can be accessed via "s3://" or "s3n://" ("S3 native"). It would be useful for pandas to recognize both. Some notes I found: https://wiki.apache.org/hadoop/AmazonS3 http://notes.mindprince.in/2014/08/01/difference-between-s3-block-and-s3-native-filesystem-on-hadoop.html
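
To make the four items above concrete, here is a rough sketch of the behavior I have in mind, written against the standard library only. It is not the pandas implementation: the helper names (`parse_s3_url`, `iter_decompressed_lines`, `read_first_rows`) are made up for illustration, and `raw` stands in for whatever file-like object an S3 client would return. It also assumes a single compressed stream and UTF-8 text.

```python
import bz2
import csv
import io
import itertools
import zlib
from urllib.parse import urlparse

# Hypothetical names for illustration; not part of pandas.
S3_SCHEMES = {"s3", "s3n", "s3a"}
SUFFIX_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2"}


def parse_s3_url(url):
    """Split an S3-style URL into (bucket, key, inferred compression)."""
    parsed = urlparse(url)
    if parsed.scheme not in S3_SCHEMES:
        raise ValueError("not an S3 URL: %r" % url)
    key = parsed.path.lstrip("/")
    compression = None
    for suffix, name in SUFFIX_TO_COMPRESSION.items():
        if key.endswith(suffix):
            compression = name
    return parsed.netloc, key, compression


def iter_decompressed_lines(raw, compression=None, chunk_size=64 * 1024):
    """Yield text lines from a binary stream, decompressing chunk by chunk.

    Assumes a single compressed stream and UTF-8 encoded text.
    """
    if compression == "bz2":
        decompress = bz2.BZ2Decompressor().decompress
    elif compression == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decompress = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress
    else:
        decompress = lambda data: data
    pending = b""
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        pending += decompress(chunk)
        # Everything before the last newline is complete; keep the tail.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode("utf-8") + "\n"
    if pending:
        yield pending.decode("utf-8")


def read_first_rows(raw, nrows, compression=None, chunk_size=64 * 1024):
    """Parse only the first `nrows` CSV records, reading input lazily."""
    reader = csv.reader(iter_decompressed_lines(raw, compression, chunk_size))
    return list(itertools.islice(reader, nrows))
```

Because every layer is a generator, asking for a few rows only pulls as many chunks from `raw` as the decompressor needs to produce them. Note that bz2 compresses in blocks, so at least one full block has to arrive before any rows come out.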

I will open PRs to address each of these.

Contributor

jreback commented Sep 12, 2015

these all sound great! thanks
separate PRs are good

jreback added the Data IO label Sep 12, 2015

jreback modified the milestone: Next Major Release, 0.17.0 Sep 12, 2015

Contributor

jreback commented Sep 15, 2015

thanks!

jreback closed this Sep 15, 2015

Thank you for your prompt and thorough code review!

Contributor

jreback commented Sep 15, 2015

no, thank you!
