
Unable to read gz files on s3 #12

Closed
coreyhuinker opened this issue Feb 14, 2015 · 15 comments
@coreyhuinker

It reads them, but the data remains compressed, which defeats line iteration.
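To illustrate the symptom with the stdlib only (a minimal sketch, not smart_open code): iterating a gzip byte stream without decompressing yields raw compressed bytes, starting with the gzip magic number, rather than the original lines.

```python
import gzip
import io

data = b"line1\nline2\n"
compressed = io.BytesIO(gzip.compress(data))

# Iterating the raw stream yields compressed bytes, not the original lines;
# the first chunk starts with the gzip magic number 1f 8b.
first = next(iter(compressed))
assert first.startswith(b"\x1f\x8b")
assert first != b"line1\n"

# Wrapping the stream in GzipFile restores transparent line iteration.
compressed.seek(0)
with gzip.GzipFile(fileobj=compressed) as f:
    assert next(iter(f)) == b"line1\n"
```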

@piskvorky
Owner

Transparent (de)compression is currently only supported for local files.

It should be possible to do it transparently for S3 files too, using Python's zlib for compressed-stream processing. Let me know if you want to tackle this -- a low-hanging, extremely useful feature!
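A minimal sketch of the suggestion above, assuming the S3 object arrives as compressed chunks (the function name is illustrative, not smart_open API): `zlib.decompressobj` with `16 + MAX_WBITS` handles the gzip header incrementally, so lines can be yielded as soon as they are available rather than after downloading the whole object.

```python
import gzip
import zlib


def iter_gzip_lines(chunks):
    """Decompress a gzip byte stream arriving in chunks, yielding lines."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    buf = b""
    for chunk in chunks:
        buf += decomp.decompress(chunk)
        # Emit every complete line currently in the buffer.
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line + b"\n"
    buf += decomp.flush()
    if buf:  # trailing data without a final newline
        yield buf


# Demo: simulate network chunks from a compressed payload.
payload = gzip.compress(b"first\nsecond\n")
chunks = [payload[i:i + 7] for i in range(0, len(payload), 7)]
lines = list(iter_gzip_lines(chunks))
assert lines == [b"first\n", b"second\n"]
```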

@ghost

ghost commented Jun 18, 2015

+1 for this issue, I have some example code here: https://gist.github.com/brianmingus/a47f26760d244ba7e9d1

@asieira
Contributor

asieira commented Nov 5, 2015

It's interesting how smart_open is evolving into something for the Python world similar to what Apache Commons VFS is for the JVM world. Maybe there's inspiration to be drawn from the abstractions they used, toward building something more general.

@asieira
Contributor

asieira commented Nov 5, 2015

This is useful for me as well, as soon as #38 gets merged I could write a PR for this one as well.

@piskvorky
Owner

Thanks for the link @asieira -- I didn't know about VFS.

I'm all for learning from other people's mistakes -- what abstractions and designs in particular do you think would be useful?

@asieira
Contributor

asieira commented Nov 6, 2015

The first thing they did was create an abstraction for a file system, rather than for opening a single file. The single-file route smart_open has taken so far is great for formats like xz, gzip and bzip2... but if you want to handle archives that themselves contain internal structure (like zip, tar, etc.), it won't work as well. So in VFS, opening a ZIP file is akin to traversing a virtual folder to access the content inside.

Plus, they support an arbitrary number of layers. You can build a URI like gz://zip://ftp://ftp.example.com/file/blah.zip!/zipfolder1/file.gz and access a GZIP file, inside a ZIP file, read from an FTP server.

Plus, the architecture can be extended: they defined a set of abstract classes that you can implement to add a new type of filesystem beyond the built-in ones.

All great ideas, but maybe too complicated for smart_open and worthy of a separate independent project that emulates those ideas in Python.

At the very least, drawing on those ideas, I would implement gzip and bzip2 compression/decompression support in a way that works uniformly across all file types smart_open supports.

@AndreaCrotti

+1 on this. I might be able to implement it soon, since we'll probably need it... thanks!

@piskvorky
Owner

Sounds great @AndreaCrotti ... that would be really useful!

@mpenkov
Collaborator

mpenkov commented Jun 3, 2016

This seems to work pretty well as a work-around: https://github.com/commoncrawl/gzipstream

@piskvorky
Owner

piskvorky commented Jun 3, 2016

Sounds good, thanks for the link @mpenkov ! Can you implement this in a PR?

Depending on how tricky gzipstream is to install and how well supported it is, we could either add it to requirements (if easy), or make it optional (if difficult), or even bundle it inside smart_open directly (license permitting).

@tmylk great intro task?

@mpenkov
Collaborator

mpenkov commented Jun 3, 2016

gzipstream doesn't really have any external requirements (just io and zlib), so it shouldn't be hard to install.

@piskvorky I could in theory, but I'm working on something else right now. If this doesn't get assigned to anyone by the time I'm free, I'll have a look at it.


@mpenkov
Collaborator

mpenkov commented Jun 9, 2016

@piskvorky OK, I'm looking into this now.

tmylk pushed a commit that referenced this issue Jun 27, 2016
- Bundle gzipstream to enable streaming of gzipped content from S3
- Update gzipstream to avoid deep recursion
- Implement readline for S3
- Add pip requirements.txt
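The "implement readline for S3" part of the commit above can be sketched in stdlib terms (a hypothetical illustration, not smart_open's actual implementation, assuming the S3 body behaves like a plain read()-able byte stream): implement a `RawIOBase` that decompresses on the fly, then wrap it in `io.BufferedReader` to get `readline()` and line iteration for free.

```python
import gzip
import io
import zlib


class GzipDecompressingReader(io.RawIOBase):
    """Expose a non-seekable gzip-compressed byte stream (e.g. an S3
    body) as a readable stream of decompressed bytes."""

    def __init__(self, raw, chunk_size=16 * 1024):
        self._raw = raw
        self._chunk_size = chunk_size
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self._decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._leftover = b""
        self._eof = False

    def readable(self):
        return True

    def readinto(self, b):
        # Pull compressed chunks until we have decompressed bytes or EOF.
        while not self._leftover and not self._eof:
            chunk = self._raw.read(self._chunk_size)
            if chunk:
                self._leftover = self._decompressor.decompress(chunk)
            else:
                self._leftover = self._decompressor.flush()
                self._eof = True
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


# Demo: BufferedReader supplies readline() and iteration on top.
raw = io.BytesIO(gzip.compress(b"alpha\nbeta\ngamma\n"))
reader = io.BufferedReader(GzipDecompressingReader(raw))
assert reader.readline() == b"alpha\n"
assert reader.read() == b"beta\ngamma\n"
```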
@mpenkov
Collaborator

mpenkov commented Jul 25, 2016

@piskvorky @tmylk I think we can close this now. 78c461e resolved this.

@tmylk
Contributor

tmylk commented Aug 4, 2016

Thanks @mpenkov ! Closing now

@tmylk tmylk closed this as completed Aug 4, 2016
@yg37

yg37 commented Mar 10, 2017

It seems that smart_open can read a gzip file from S3 using a URL, but not using a key. Is that the case?
