Inspired by Ruby's Enumerator, an attempt to provide a generic way of creating an IO-like object from any source.

    require 'csv'
    require 'zlib'

    readable = Readable.new do |yielder|
      yielder << "\u001F\x8B\b\u0000[\u0017\x88Z\u0000"
      yielder << "\u0003\xF3H\xCD\xC9\xC9\xD7)\xCF/\xCAI"
      yielder << "\u0001\u0000)^\u0014\xFC\v\u0000\u0000\u0000"
    end

    gz_reader = Zlib::GzipReader.new(readable)
    CSV(gz_reader).gets #=> ['Hello', 'world']

This came from a problem at work where we needed to do the following:
- Download a series of large, compressed JSON files from S3
- Uncompress each file
- Parse JSON out of them, saving each parsed object to a database
I wished we could do each of these steps lazily, so that we only download as much as we need to uncompress, only uncompress as much as we need to parse, and only parse as much as we need to save to the database.
Tantalisingly, each step of this process offered some form of streaming interface, e.g. the S3 client allows you to read objects in chunks:

    s3_client.get_object do |chunk|
      # do something with chunk
    end

Similarly, Zlib::GzipReader can yield decompressed lines one at a time:

    reader = Zlib::GzipReader.new(io)
    reader.each_line do |line|
      # do something with line
    end

Many JSON libraries support passing an IO object as an input source and some support yielding objects as they are parsed.
As a lot of libraries report supporting an "IO-like" object, the missing piece is being able to turn something like the S3 client interface into an IO. I was hoping there'd be an interface like Ruby's Enumerable (where you need only implement each) but for creating your own IO-compatible class. Sadly, this doesn't seem to exist and the IO interface is pretty large.
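
To make the comparison with Enumerable concrete, here is a minimal sketch of how little a class has to implement to get the whole Enumerable interface. The Countdown class is purely hypothetical and exists only for this illustration.

```ruby
# Hypothetical example class: it defines only #each, and Enumerable
# supplies map, select, first, include? and the rest on top of it.
class Countdown
  include Enumerable

  def initialize(from)
    @from = from
  end

  # The single method Enumerable needs: yield each element in turn.
  def each
    return to_enum(:each) unless block_given?

    @from.downto(1) { |n| yield n }
  end
end

countdown = Countdown.new(3)
countdown.map { |n| n * 2 } #=> [6, 4, 2]
countdown.first(2)          #=> [3, 2]
countdown.include?(2)       #=> true
```

Nothing comparable exists for IO: there is no single method from which read, gets, each_line and the others are derived.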
Inspired by Ruby's Enumerator, I wanted to provide the easiest possible way to convert any streaming input source into an IO, and tried to reverse engineer exactly which IO methods classes like Zlib::GzipReader actually use.
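
As a rough illustration of the idea (not this library's actual implementation), the sketch below wraps an enumerator of string chunks in a hypothetical ChunkReader class that implements only read, the method Zlib::GzipReader documents as required of its source. The class name, the buffering strategy and the chunk size are all assumptions made for the example.

```ruby
require 'zlib'

# Hypothetical sketch: exposes an IO#read-style method over any object
# that responds to #each and yields strings.
class ChunkReader
  def initialize(chunks)
    @chunks = chunks.to_enum # external enumerator so chunks are pulled on demand
    @buffer = String.new     # String.new with no arguments is a binary buffer
    @eof = false
  end

  # Approximates IO#read: returns up to `length` bytes, nil once the source
  # is exhausted (for a positive length), or everything left if length is nil.
  def read(length = nil, outbuf = nil)
    fill(length)
    data = length ? @buffer.slice!(0, length) : @buffer.slice!(0..-1)
    data = nil if length && length > 0 && data.empty?
    return data unless outbuf

    outbuf.replace(data.to_s)
    data && outbuf
  end

  private

  # Pull chunks until the buffer holds `length` bytes or the source runs out.
  def fill(length)
    until @eof || (length && @buffer.bytesize >= length)
      begin
        @buffer << @chunks.next.b
      rescue StopIteration
        @eof = true
      end
    end
  end
end

# Simulate a chunked download by yielding gzipped data eight bytes at a time.
compressed = Zlib.gzip("Hello, world\n")
chunks = Enumerator.new do |yielder|
  compressed.bytes.each_slice(8) { |slice| yielder << slice.pack('C*') }
end

reader = Zlib::GzipReader.new(ChunkReader.new(chunks))
puts reader.read # prints the decompressed text
```

Because chunks are only requested when the buffer runs short, the source is consumed no faster than the reader on top of it asks for data.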
While I had some success, usage of IO methods is pretty inconsistent. Yajl has its own wrapper for GzipReader because GzipReader's read implementation does not match IO's. More damningly, you can't plug together a Zlib::GzipReader and the default JSON parser, as Zlib::GzipReader#to_io returns the inner, compressed source rather than an IO-compatible object that yields the uncompressed data.
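
The to_io behaviour can be checked with the standard library alone. This is a small sketch, with the StringIO source and sample JSON chosen purely for illustration; it also shows one non-lazy workaround, which is to read the uncompressed data into a string before parsing.

```ruby
require 'json'
require 'stringio'
require 'zlib'

# Stand-in for a compressed download (sample data is an assumption).
source = StringIO.new(Zlib.gzip('{"greeting":"Hello"}'))
gz_reader = Zlib::GzipReader.new(source)

# to_io hands back the underlying compressed source, not the reader itself,
# so anything that calls to_io first ends up reading gzipped bytes.
gz_reader.to_io.equal?(source) #=> true

# Workaround that gives up laziness: uncompress fully, then parse.
JSON.parse(gz_reader.read) #=> {"greeting"=>"Hello"}
```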
If there were a smaller, well-defined interface for IO (like Enumerable) then it might be more ergonomic to model everything as a stream that you can glue together, but for now this is a bit of a failed experiment.
Copyright © 2018 Paul Mucur
Distributed under the MIT License.